
Lesson 6: Practical Deep Learning for Coders 2022


Chapters

0:00 Review
2:09 TwoR model
4:43 How to create a decision tree
7:02 Gini
10:54 Making a submission
15:52 Bagging
19:06 Random forest introduction
20:09 Creating a random forest
22:38 Feature importance
26:37 Adding trees
29:32 What is OOB
32:08 Model interpretation
35:47 Removing the redundant features
35:59 What does Partial dependence do
39:22 Can you explain why a particular prediction is made
46:07 Can you overfit a random forest
49:03 What is gradient boosting
51:56 Introducing walkthrus
54:28 What does fastkaggle do
62:52 fastcore.parallel
64:12 item_tfms=Resize(480, method='squish')
66:20 Fine-tuning project
67:22 Criteria for evaluating models
70:22 Should we submit as soon as we can
75:15 How to automate the process of sharing Kaggle notebooks
80:17 AutoML
84:16 Why the first model runs so slow on Kaggle GPUs
87:53 How much better can a new novel architecture improve the accuracy
88:33 Convnext
91:10 How to iterate the model with padding
92:01 What does our data augmentation do to images
94:12 How to iterate the model with larger images
96:08 pandas indexing
98:16 What data augmentation does TTA use?

Whisper Transcript

00:00:00.000 | Okay, so welcome back to
00:00:02.000 | Not welcome back to, welcome to lesson six; it's the first time we've been at lesson six. Welcome back to Practical Deep Learning for Coders
00:00:10.220 | We just started looking at tabular data
00:00:17.520 | Last time and
00:00:21.560 | for those of you who've forgotten what we did was we
00:00:28.640 | We were looking at the titanic data set
00:00:31.720 | And we were looking at creating binary splits
00:00:36.400 | by looking at
00:00:39.240 | Categorical variables or binary variables like sex
00:00:46.600 | Continuous variables like the log of the fare that they paid and
00:00:55.200 | Using those, you know, we also kind of came up with a score
00:00:59.120 | Which was basically how good a job that split did of grouping the
00:01:06.880 | survival characteristics into two groups, you know, nearly all of one group survived and nearly all of the other group didn't survive
00:01:14.640 | So they had like small standard deviation in each group
00:01:20.080 | So then we created the world's simplest little UI to allow us to fiddle around and try to find a good
00:01:25.680 | binary split and we did
00:01:28.040 | We did come up with a very good binary split
00:01:33.640 | Which was on sex, and actually we created this little
00:01:39.200 | automated version, and so this is, I think, the first time... no, we're not quite the first time, this is
00:01:45.800 | This is yet another time. I should say that we have successfully created a
00:01:50.280 | actual
00:01:52.760 | machine learning algorithm from scratch. This one is about the world's simplest one. It's 1R,
00:01:56.920 | creating the single rule which does a good job of splitting your data set into two parts
00:02:04.240 | which differ as much as possible on the dependent variable
00:02:08.080 | 1R is probably not going to cut it for a lot of things though
00:02:13.520 | It's surprisingly effective, but it's so maybe we could go a step further
00:02:18.760 | And the step further we could go is we could create like a 2R. What if we took each of those
00:02:24.000 | groups males and females in the Titanic data set and
00:02:29.040 | Split each of those into two other groups. So split the males into two groups and split the females into two groups
00:02:39.040 | To do that we can repeat the exact same piece of code we just did but let's remove
00:02:48.240 | sex from it and
00:02:50.240 | Then split the data set into males and females
00:02:53.760 | And run the same piece of code that we just did before but just for the males
00:02:58.480 | And so this is going to be like a 1R rule for how do we predict which males survived the Titanic?
00:03:06.760 | And let's have a look three eight three seven three eight three eight three eight. Okay, so it's
00:03:12.760 | Age were they greater than or less than six?
00:03:17.760 | Turns out to be for the males the biggest predictor of whether they were going to survive
00:03:21.560 | That shipwreck and we can do the same thing females. So for females
00:03:27.200 | There we go, no great surprise: Pclass, so whether they were in
00:03:37.440 | first class or not
00:03:40.000 | Was the biggest predictor for females of whether they would survive the shipwreck?
00:03:44.440 | So that has now given us a
00:03:52.200 | Decision tree it is a series of binary splits
00:03:58.240 | which will gradually
00:04:00.960 | split up our data more and more, such that in the end,
00:04:04.720 | in the leaf nodes as we call them, we will hopefully get as
00:04:12.680 | strong a prediction as possible about survival
00:04:15.160 | So we could just repeat this step for each of the four groups we've now created males
00:04:21.760 | kids and older than six
00:04:24.840 | females first class and
00:04:27.760 | Everybody else and we could do it again and then we'd have eight groups
00:04:32.280 | We could do that manually with another couple of lines of code or we can just use
00:04:39.120 | Decision tree classifier, which is a class which does exactly that for us
00:04:42.840 | So there's no magic in here. It's just doing what we've just described
00:04:47.240 | And a decision tree classifier comes from a library called scikit-learn
00:04:54.200 | Scikit-learn is a fantastic library that focuses on kind of classical
00:05:01.720 | non deep learning ish machine learning methods
00:05:06.800 | like decision trees
00:05:08.800 | So we can so to create the exact same decision tree
00:05:12.640 | We can say please create a DecisionTreeClassifier with at most four leaf nodes
00:05:18.040 | And one very nice thing it has is
00:05:22.560 | It can draw the tree for us
00:05:25.840 | So here's a tiny little draw tree function
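As a rough sketch of what this step looks like with scikit-learn (the `trn_xs`/`trn_y` names are assumptions standing in for the training features and labels prepared earlier in the notebook; the lesson uses a small graphviz-based `draw_tree` helper, but `plot_tree` gives a similar picture):

```python
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# trn_xs / trn_y are assumed: the training features and survival labels from earlier cells.
m = DecisionTreeClassifier(max_leaf_nodes=4)
m.fit(trn_xs, trn_y)

# Draw the fitted tree so we can read off the splits and leaf nodes.
plt.figure(figsize=(10, 6))
plot_tree(m, feature_names=list(trn_xs.columns), filled=True)
plt.show()
```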
00:05:32.640 | You can see here it's gonna first of all split on sex now
00:05:36.360 | It looks a bit weird to say sex is less than or equal to point five
00:05:39.100 | But remember what our binary characteristics are coded as zero one
00:05:44.640 | So that's just how we you know easy way to say males versus females
00:05:49.600 | And then here we've got for the females
00:05:55.040 | What class are they in and for the males what age are they and here's our four leaf nodes
00:06:02.320 | so for the females in first class a
00:06:07.000 | 116 of them survived and four of them didn't so very good idea to be a well-to-do
00:06:15.720 | woman on the Titanic
00:06:18.640 | On the other hand
00:06:21.600 | Males adults
00:06:28.000 | 68 survived 350 died so a very bad idea to be a male adult on the Titanic
00:06:35.580 | So you can see you can kind of get a quick summary
00:06:39.200 | Of what's going on and one of the reasons people tend to like decision trees particularly for exploratory data analysis
00:06:46.200 | Is it allows us to get a quick picture of what are the key
00:06:51.160 | driving variables in this data set and how much do they kind of
00:06:56.080 | Predict what was happening in the data?
00:06:58.400 | Okay, so it's found the same splits as us
00:07:03.920 | It's got one additional piece of information. We haven't seen before. It's this thing called Gini
00:07:07.840 | Gini is just another way of measuring how good a split is and
00:07:16.120 | Put the code to calculate Gini here
00:07:18.200 | Here's how you can think of Gini
00:07:21.760 | How likely is it that if you go into that sample?
00:07:25.560 | and grab
00:07:27.280 | one item and
00:07:29.000 | Then go in again and grab another item. How likely is it that you're going to grab the same item each time?
00:07:35.040 | and so
00:07:37.640 | If the entire leaf node is
00:07:42.620 | just people who survived or just people who didn't survive, the probability would be one; you'd get the same thing every time
00:07:48.540 | If it was an exactly equal mix the probability would be point five
00:07:51.880 | so that's why we just
00:07:55.200 | Yeah, that's where this this formula comes from in the binary case
00:07:58.880 | And in fact, you can see it here, right? This group here is pretty much 50/50. So Gini's point five
00:08:05.240 | Whereas this group here is nearly a hundred percent in one class. So Gini is nearly
00:08:10.040 | Zero, so I had it backwards. It's one minus
00:08:12.760 | And I think I've written it backwards here as well, so I better fix that
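Here is a minimal sketch of the binary Gini calculation as just described (not necessarily the exact function from the notebook): for a group where a proportion p is in one class, p squared plus (1-p) squared is the chance of drawing the same class twice, and the impurity is one minus that.

```python
def gini(p):
    """Binary Gini impurity for a group where a fraction `p` is in one class."""
    # p**2 + (1-p)**2 is the chance of drawing the same class twice (with replacement);
    # impurity is one minus that: 0 for a pure group, 0.5 for a 50/50 mix.
    return 1 - (p**2 + (1 - p)**2)

print(gini(0.5))   # 0.5   -- evenly mixed leaf
print(gini(0.97))  # ~0.06 -- nearly pure leaf
```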
00:08:23.480 | This decision tree, you know, we would expect it to be more accurate, so we can calculate
00:08:33.000 | its mean absolute error, and for the 1R, so just doing males versus females,
00:08:37.820 | What was our score
00:08:43.240 | Here we go
00:08:45.280 | point 407
00:08:47.680 | Actually, do we have an accuracy score somewhere? Here we are: point three three six
00:08:51.720 | That was for log fare, and for sex it was
00:08:59.640 | point two one five. Okay, so point two one five. So that was for the 1R version; for the decision tree with four leaf nodes,
00:09:08.120 | Point two two four. So it's actually a little worse, right?
00:09:12.300 | And I think this just reflects the fact that this is such a small data set
00:09:18.480 | the 1r
00:09:20.480 | Version was so good. We haven't really improved it that much
00:09:24.560 | But not enough to really see it
00:09:27.320 | Amongst the randomness of such a small validation set
00:09:30.680 | We could go further
00:09:34.640 | to a minimum of 50 samples per leaf node. So that means that in each of these,
00:09:41.520 | so here it says samples, which in this case is passengers on the Titanic, there's at least, there's 67 people that are
00:09:50.680 | female, first class,
00:09:52.680 | less than 28;
00:09:55.680 | that's how you define that. So this decision tree keeps splitting until it gets to a point where there's going to be less
00:10:02.240 | than 50, at which point it stops splitting that leaf, so you can see they've all got at least 50 samples
00:10:09.280 | And so here's the decision tree it builds. As you can see, it doesn't have to be like constant depth, right?
00:10:15.000 | So this group here
00:10:16.800 | Which is males?
00:10:18.800 | Who had cheaper fares?
00:10:21.720 | And who were older than 20?
00:10:25.680 | but younger than 32
00:10:29.120 | Actually younger than 24 and
00:10:32.840 | actually
00:10:35.000 | Super cheap fares and so forth, right? So it keeps going down until we get to that group. So
00:10:40.600 | Let's try that decision tree. That decision tree has a mean absolute error of point one eight three
00:10:46.560 | So not surprisingly, you know, once we get there, it's starting to look like it's a little bit better
00:10:51.240 | So there's a model and
00:10:56.680 | This is a kaggle competition. So therefore we should submit it to the leaderboard and
00:11:04.800 | You know one of the you know
00:11:06.800 | Biggest mistakes I see
00:11:09.760 | Not just beginners but every level of practitioner make on Kaggle is not to submit to the leaderboard
00:11:15.480 | They spend months making some perfect thing, right?
00:11:19.560 | But you actually want to see how you're going, and you should try and submit something to the leaderboard every day
00:11:24.200 | So, you know, regardless of how rubbish it is, because
00:11:29.720 | you want to improve every day
00:11:33.040 | And so you want to keep iterating so to submit something to the leaderboard you generally have to provide a
00:11:38.640 | CSV file
00:11:41.760 | And so we're going to create a CSV file
00:11:48.920 | We're going to apply the category codes to get the category for each one in our test set
00:11:55.440 | We're going to set the survived column to our predictions
00:11:58.800 | And then we're going to send that off to a CSV
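A sketch of what that submission step might look like; `tst_df`, `cols`, `m`, and `path` are assumptions standing in for the processed test set, the feature columns, the fitted tree, and the competition path from earlier cells:

```python
import pandas as pd

# Build a Kaggle submission file from the fitted model's predictions.
sub = pd.read_csv(path/'sample_submission.csv')
sub['Survived'] = m.predict(tst_df[cols]).astype(int)
sub.to_csv('sub.csv', index=False)
```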
00:12:05.080 | So yeah, so I submitted that and I got a score a little bit worse than most of our linear models and neural nets
00:12:13.120 | But not terrible, you know, it was it's it's just doing an okay job
00:12:17.800 | Now one interesting thing for the decision tree is there was a lot less pre-processing to do
00:12:27.320 | Did you notice that we didn't have to create any dummy variables for our categories?
00:12:35.080 | Like you certainly can create dummy variables, but you often don't have to so for example
00:12:40.720 | You know for for class, you know, it's one two or three you can just split on one two or three, you know
00:12:48.720 | even for like
00:12:52.040 | What was that thing like the the embarkation?
00:12:54.560 | City code like we just convert them kind of arbitrarily to numbers one two and three and you can split on those numbers
00:13:02.720 | So with random forests, or no, not random forests, the decision trees,
00:13:06.320 | Yeah, you can generally get away with not doing stuff like
00:13:11.280 | dummy variables
00:13:14.440 | In fact even taking the log of fair
00:13:16.760 | We only did that to make our graph look better. But if you think about it
00:13:22.320 | splitting on log fare less than 2.7
00:13:26.840 | is exactly the same as splitting on fare less than e to the 2.7, you know, whatever log base we used, I can't remember
00:13:36.840 | All that a decision tree cares about is the ordering of the data and this is another reason that decision tree-based approaches are fantastic
00:13:44.520 | Because they don't care at all about outliers, you know long tail distributions
00:13:52.320 | Categorical variables, whatever you can throw it all in and it'll do a perfectly fine job
00:14:01.280 | For tabular data, I would always start by using a decision tree-based approach
00:14:08.040 | And kind of create some baselines and so forth because it's it's really hard to mess it up
00:14:14.960 | And that's important
00:14:21.120 | So, yeah, so here for example is embarked right it it was coded originally as
00:14:28.080 | the first letter of the city they embarked in
00:14:31.760 | But we turned it into a categorical variable
00:14:35.320 | And so pandas for us creates this this vocab this list of all of the possible values
00:14:40.520 | And if you look at the codes
00:14:43.520 | attribute, you can see that S has become 2, C
00:14:50.800 | has become 0, and so forth. All right, so that's how we convert the categories, the strings,
00:14:57.660 | Into numbers that we can sort and group by
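For example, here is the pandas mechanism being described, on a tiny made-up column:

```python
import pandas as pd

# pandas turns a string column into a "vocab" of categories plus integer codes.
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', 'C']})
df['Embarked'] = df['Embarked'].astype('category')

print(df['Embarked'].cat.categories)      # Index(['C', 'Q', 'S'], ...) -- the vocab
print(df['Embarked'].cat.codes.tolist())  # [2, 0, 1, 2, 0] -- C->0, Q->1, S->2
```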
00:15:01.480 | So, yeah, so if we wanted to split C into one group and Q and S in the other, we can just do,
00:15:09.920 | okay, less than 0.5
00:15:12.780 | Now, of course if we wanted to split C and S into one group and Q into the other
00:15:18.480 | We would need two binary splits first C
00:15:21.000 | C on one side and Q and S on the other, and then Q and S into Q versus S
00:15:26.680 | And then the Q and S leaf nodes could get similar
00:15:29.960 | Predictions so like you do have that sometimes it can take a little bit more
00:15:33.680 | messing around but
00:15:36.920 | Most of the time I find categorical variables work fine as numeric in decision tree-based approaches
00:15:44.600 | And as I say here, I tend to use dummy variables only if there's like less than four levels
00:15:48.840 | Now what if we wanted to make this more accurate could we grow the tree further I
00:15:59.040 | Mean we could but
00:16:03.800 | You know, there's only 50 samples in these leaves right it's it's not really
00:16:14.680 | you know, if I keep splitting it, the leaf nodes are going to have so little data that they're not really going to make very useful predictions
00:16:24.720 | Now there are limitations to how accurate a decision tree can be
00:16:30.240 | So what can we do
00:16:35.720 | We can do something that's actually very I mean, I find it amazing and fascinating
00:16:42.680 | It comes from a guy called Leo Breiman
00:16:45.560 | And Leo Breiman came up with this idea
00:16:50.480 | Called bagging and here's the basic idea of bagging
00:16:54.840 | Let's say we've got a model
00:16:57.600 | That's not very good
00:17:00.560 | Because let's say it's a decision tree, it's really small we've hardly used any data for it, right?
00:17:07.280 | It's not very good. So it's got error. It's got errors on predictions
00:17:11.560 | It's not a systematically biased error. It's not always predicting too high or is predicting too low
00:17:16.880 | I mean decision trees, you know on average will predict the average, right?
00:17:21.080 | But it has errors
00:17:23.920 | So what I could do is I could build another decision tree in
00:17:28.300 | Some slightly different way that would have different splits and it would also be not a great model but
00:17:38.160 | Predicts the correct thing on average. It's not completely hopeless
00:17:41.080 | And again, you know, some of the errors are a bit too high and some are a bit too low
00:17:45.320 | And I could keep doing this. So if I could create building lots and lots of slightly different decision trees
00:17:50.960 | I'm gonna end up with say a hundred different models all of which are unbiased
00:17:57.680 | All of which are better than nothing and all of which have some errors bit high some bit low whatever
00:18:04.040 | So what would happen if I average their predictions?
00:18:08.440 | Assuming that the models are not correlated with each other
00:18:13.000 | Then you're going to end up with errors on either side of the correct prediction
00:18:20.560 | Some are a bit high some are a bit low and there'll be this kind of distribution of errors, right? And
00:18:26.000 | the average of those errors will be
00:18:29.160 | zero and
00:18:32.800 | So that means the average of the predictions of these multiple
00:18:36.280 | uncorrelated models each of which is unbiased will be
00:18:40.600 | The correct prediction because they have an error of zero and this is a mind-blowing insight
00:18:47.100 | It says that if we can generate a whole bunch of
00:18:51.840 | uncorrelated
00:18:54.280 | unbiased
00:18:55.840 | models
00:18:57.440 | We can average them and get something better than any of the individual models because the average of the error
00:19:04.600 | Will be zero
00:19:06.880 | So all we need is a way to generate
00:19:09.960 | lots of models
00:19:12.600 | Well, we already have a great way to build models, which is to create a decision tree
00:19:16.180 | How do we create lots of them?
00:19:19.160 | How do we create lots of unbiased but different models?
00:19:25.880 | Let's just grab a different subset of the data each time. Let's just grab at random half the rows and
00:19:32.720 | Build a decision tree and then grab another half the rows and build a decision tree
00:19:37.920 | And grab another half the rows and build a decision tree each of those decision trees is going to be not great
00:19:42.840 | It's only using half the data
00:19:44.960 | But it will be unbiased. It will be predicting the average on average
00:19:48.780 | It will certainly be better than nothing because it's using, you know, some real data to try and create a real decision tree
00:19:55.680 | They won't be correlated with each other because they're each random subsets. So that meets all of our criteria
00:20:02.720 | for bagging
00:20:05.160 | When you do this, you create something called a random forest
00:20:08.840 | So let's create one in four lines of code
00:20:17.960 | Here is a function to create a decision tree. So this prop is just the proportion of data
00:20:23.760 | So let's say we put 75% of the data in each time or we could change it to 50% whatever
00:20:29.680 | So this is the number of samples in this subset n and so let's at random choose
00:20:38.600 | n times the proportion we requested from the sample and build a decision tree from that and
00:20:47.240 | So now let's
00:20:50.520 | 100 times
00:20:52.520 | Get a tree and stick them all in a list using a list comprehension
00:20:56.800 | And now let's grab the predictions for each one of those trees and
00:21:03.760 | Then let's stack all those predictions up together and take their mean
00:21:07.680 | That is a random forest
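A sketch of those few lines, assuming `trn_xs`, `trn_y`, and `val_xs` are the training features/labels and validation features from earlier (details such as `min_samples_leaf=5` are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def get_tree(prop=0.75):
    """Fit one tree on a random subset of the training rows."""
    n = len(trn_y)
    idxs = np.random.choice(n, int(n * prop))
    return DecisionTreeClassifier(min_samples_leaf=5).fit(trn_xs.iloc[idxs], trn_y.iloc[idxs])

trees = [get_tree() for _ in range(100)]                  # 100 slightly different trees
all_preds = np.stack([t.predict(val_xs) for t in trees])  # each tree's predictions
avg_preds = all_preds.mean(0)                             # averaging them: a random forest
```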
00:21:12.480 | And what do we get one two three four five six seven eight that's seven lines of code. So
00:21:21.560 | random forests are very
00:21:23.560 | simple
00:21:25.400 | This is a slight simplification. There's one other difference that random forests do
00:21:30.000 | Which is when they build the decision tree. They also randomly select a subset of columns and
00:21:36.480 | they select a different random subset of columns each time they do a split and
00:21:41.800 | So the idea is you kind of want it to be as random as possible, but also somewhat useful
00:21:49.440 | We can do that by creating a random forest classifier
00:22:01.920 | Say how many trees do we want?
00:22:04.920 | how many
00:22:06.720 | samples per leaf, and then fit does what we just did, and here's our mean absolute error.
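In scikit-learn that looks roughly like this (the hyperparameter values here are assumptions, not necessarily the exact ones from the notebook):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5)
rf.fit(trn_xs, trn_y)
print(mean_absolute_error(val_y, rf.predict(val_xs)))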
00:22:16.680 | Again, it's like not as good as our decision tree, but it's still pretty good. And again, it's such a small data set
00:22:21.640 | It's hard to tell if that means anything and
00:22:23.640 | So we can submit that to Kaggle so earlier on I created a little function to submit to Kaggle
00:22:28.160 | So now I just create some predictions and I submit to Kaggle and yeah looks like it gave nearly identical results to a single tree
00:22:35.240 | Now to one of my favorite things
00:22:42.160 | About random forests and I should say in most real-world data sets of reasonable size random forests
00:22:48.480 | Basically always give you much better results than decision trees. This is just a small data set to show you what to do
00:22:55.540 | one of my favorite things about
00:22:58.640 | random forests is we can do something quite cool with them. What we can do is we can look at the
00:23:04.080 | Underlying decision trees they create so we've now got a hundred decision trees
00:23:09.720 | And we can see what columns
00:23:12.440 | did it find a split on, and so it says here, okay, well, the first thing it split on was sex, and
00:23:18.280 | it improved the Gini from
00:23:21.880 | Point four seven
00:23:25.080 | to, now just take the weighted average of point three eight and point three one, weighted by the samples,
00:23:30.400 | so that's probably going to be about point three three. So it's, okay, it's like a point one four improvement in Gini thanks to sex
00:23:41.480 | We can do that again. Okay. Well then Pclass, you know, how much did that improve Gini?
00:23:46.000 | Again, we keep weighting it by the number of samples. As well, log fare: how much does that improve Gini? And we can keep track
00:23:52.880 | for each column of
00:23:55.560 | How much in total did they improve the Gini in this decision tree and then do that for every decision tree and
00:24:05.320 | then add them up per column and that gives you something called a feature importance plot and
00:24:11.880 | Here it is
00:24:14.960 | And a feature importance plot tells you how important is each feature
00:24:20.280 | how often did the trees pick it and how much did it improve the Gini when it did and
00:24:26.240 | so we can see from the feature importance plot that sex was the most important and
00:24:34.120 | Class was the second most important and everything else was a long way back
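scikit-learn accumulates those per-column Gini improvements for you, so a feature importance plot is only a couple of lines (assuming the same `trn_xs` and fitted forest `rf` as before):

```python
import pandas as pd

fi = pd.DataFrame(dict(cols=trn_xs.columns, imp=rf.feature_importances_))
fi.sort_values('imp', ascending=False).plot('cols', 'imp', 'barh')
```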
00:24:38.400 | Now this is another reason by the way why our random forest isn't really particularly helpful
00:24:43.400 | Because it's just such an easy split to do, right? Basically all that matters is
00:24:49.080 | You know what class you're in and whether you're male and female
00:24:54.280 | and these
00:24:57.720 | Feature importance plots remember because they're built on random forests
00:25:05.040 | Random forests don't care about
00:25:07.200 | really
00:25:09.280 | The distribution of your data and they can handle categorical variables and stuff like that
00:25:13.600 | That means that basically any tabular data set you have, you can just plot this
00:25:18.720 | right away and
00:25:21.000 | Random forests, you know for most data sets only take a few seconds to train, you know, really at most of a minute or two
00:25:29.440 | And so if you've got a big data set and you know hundreds of columns
00:25:34.480 | do this first and
00:25:37.480 | find the
00:25:39.400 | 30 columns that might matter
00:25:41.760 | It's such a helpful thing to do. So I've done that for example. I did some work in credit scoring
00:25:49.080 | so we're trying to find out which
00:25:51.080 | Things would predict who's going to default on a loan and I was given
00:25:55.840 | something like
00:25:58.000 | 1,000
00:25:59.200 | columns from the database
00:26:01.200 | And I put it straight into a random forest and found I think there was about 30 columns that seemed
00:26:06.640 | Kind of interesting. I did that
00:26:09.360 | like two hours after I started the job and I went to the
00:26:14.040 | Head of marketing and the head of risk and I told them here's the columns. I think that we should focus on and
00:26:21.160 | They were like, oh my god. We just finished a two-year
00:26:25.800 | consulting project with one of the big consultants
00:26:27.960 | Paid them millions of dollars, and they came up with a subset of these
00:26:32.640 | There are other things that you can do with
00:26:41.400 | With random forests along this path. I'll touch on them briefly
00:26:49.560 | Specifically
00:26:51.800 | I'm going to look at
00:26:53.760 | chapter 8 of the book
00:26:56.040 | Which goes into this in a lot more detail and particularly interestingly chapter 8 of the book uses a
00:27:03.160 | Much bigger and more interesting data set which is auction prices of heavy industrial equipment
00:27:10.600 | I mean, it's less interesting historically, but more interestingly numerically
00:27:14.600 | And so some of the things I did there on this data set
00:27:25.800 | So this isn't from the data set; this is from the scikit-learn documentation
00:27:28.960 | They looked at how as you increase the number of estimators. So the number of trees
00:27:33.920 | how much does the
00:27:37.400 | Accuracy improve so I then did the same thing on our data set. So I actually just
00:27:42.000 | Added up to 40 more and more and more trees and
00:27:47.960 | you can see that basically as as predicted by that kind of an initial bit of
00:27:53.840 | Hand-wavy theory I gave you that you would expect the more trees
00:27:58.360 | The lower the error because the more things you're averaging and that's exactly what we find the accuracy improves as we have more trees
00:28:06.880 | John what's up?
00:28:09.600 | Victor is
00:28:11.600 | You might have just answered his question actually as he talked it but he's he's asking on the same theme the number of trees in a
00:28:18.520 | Random forest does increasing the number of trees always?
00:28:21.480 | Translate to a better error. Yes. It does always I mean tiny bumps, right? But yeah, once you smooth it out
00:28:34.880 | Decreasing returns and
00:28:40.320 | If you end up productionising a random forest, then of course every one of these trees you have to,
00:28:45.680 | you know, go through at inference time
00:28:49.520 | So it's not that there's no cost. I mean, having said that,
00:28:53.600 | Zipping through a binary tree is the kind of thing you can
00:28:58.080 | really
00:29:01.120 | do fast; in fact, it's quite easy to literally
00:29:05.080 | spit out C++ code
00:29:09.320 | With a bunch of if statements and compile it and get extremely fast performance
00:29:15.400 | I don't often use more than a hundred trees. This is a rule of thumb
00:29:22.200 | That the only one John
00:29:31.000 | So then there's another interesting feature of random forests
00:29:35.920 | Which is remember how in our example we trained with?
00:29:40.000 | 75% of the data
00:29:42.760 | on each tree
00:29:44.560 | So that means for each tree there was 25% of the data we didn't train on
00:29:47.840 | Now this actually means if you don't have much data in some situations you can get away with not having a validation set and
00:29:56.120 | the reason why is
00:29:59.040 | because for each tree we can pick the 25% of
00:30:03.960 | rows that weren't in that tree and
00:30:06.360 | See how accurate that tree was on those rows and we can average for each row
00:30:13.520 | their accuracy on all of the trees in which they were not part of the training and
00:30:18.920 | That is called the out-of-bag error
00:30:22.120 | or OOB error, and this is built in also to scikit-learn: you can ask for an OOB
00:30:29.200 | prediction
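A sketch of asking scikit-learn for the out-of-bag score (same assumed `trn_xs`/`trn_y` as before):

```python
from sklearn.ensemble import RandomForestClassifier

# Each row is scored only by the trees that never saw it during training,
# giving a validation-like estimate without holding out any data.
rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(trn_xs, trn_y)
print(rf.oob_score_)           # accuracy estimated from out-of-bag rows
# rf.oob_decision_function_    # per-row out-of-bag predicted probabilities
```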
00:30:35.800 | Just before we move on
00:30:40.080 | Zakiya has a question about bagging
00:30:43.640 | So we know that bagging is powerful as an ensemble approach to machine learning
00:30:48.280 | Would it be advisable to try out bagging then first when approaching a particular?
00:30:53.440 | say tabular task
00:30:56.080 | Before deep learning, so that's the first part of the question
00:31:01.280 | And the second part is could we create a bagging model which includes fast AI deep learning models?
00:31:11.160 | Absolutely. So to be clear, you know bagging is kind of like a
00:31:14.360 | meta method it's not a
00:31:17.000 | prediction it's not a method of modeling itself. It's just a method of
00:31:21.400 | Combining other models
00:31:24.680 | So random forests in particular are a particular approach to bagging,
00:31:31.040 | and, you know, I would personally probably always start a tabular
00:31:35.800 | Project with a random forest because they're nearly impossible to mess up and they give good insight and they give a good base case
00:31:42.400 | But yeah your question then about can you bag?
00:31:47.440 | other models is a very interesting one and the answer is you absolutely can and
00:31:53.320 | People very rarely do
00:31:57.280 | But we will
00:31:59.840 | We will quite soon
00:32:01.840 | Maybe even today
00:32:04.040 | So I you know you might be getting the impression I'm a bit of a fan of random forests and
00:32:12.960 | Before I was before you know, people thought of me as the deep learning guy people thought of me as the random forests guy
00:32:20.480 | I used to go on about random forests all the time and one of the reasons I'm so enthused about them isn't just that they're
00:32:27.400 | very accurate or that they're very hard to mess up and require very little pre-processing,
00:32:32.280 | But they give you a lot of quick and easy insight
00:32:36.720 | And specifically these are the five things
00:32:40.480 | Which I think that we're interested in and all of which are things that random forests good at they will tell us how confident
00:32:47.400 | Are we in our predictions on some particular row? So when somebody you know, when we're giving a loan to somebody
00:32:54.440 | We don't necessarily just want to know
00:32:57.240 | How likely are they to repay?
00:32:59.240 | But I'd also like to know how confident are we that we know because if we're if we like well
00:33:06.360 | We think they'll repay but we're not confident of that. We would probably want to give them less of a loan and
00:33:12.960 | Another thing that's very important is when we're then making a prediction. So again, for example for for credit
00:33:21.320 | Let's say you rejected that person's loan
00:33:27.080 | And a random forest will tell us
00:33:29.080 | what is the reason that we made that prediction? And you'll see why all these things matter.
00:33:34.480 | Which columns are the strongest predictors? You've already seen that one, right? That's the feature importance plot
00:33:39.960 | Which columns are effectively redundant with each other ie they're basically highly correlated with each other
00:33:49.240 | And then one of the most important ones is, as you vary a column, how does it vary the predictions? So for example in your
00:33:56.760 | credit model, how does your prediction of
00:34:00.760 | Risk vary as you vary
00:34:06.960 | Well something that probably the regulator would want to know might be some, you know, some protected
00:34:12.120 | variable like, you know
00:34:14.720 | Race or some socio demographic characteristics that you're not allowed to use in your model. So they might check things like that
00:34:20.000 | For the first thing how confident are we in our predictions using a particular row of data?
00:34:27.960 | There's a really simple thing we can do which is remember how when we
00:34:32.960 | Calculated our predictions manually we stacked up the predictions together and took their mean
00:34:37.720 | Well, what if you took their standard deviation instead?
00:34:42.680 | so if you stack up your predictions and take their standard deviation and
00:34:46.720 | If that standard deviation is high
00:34:49.720 | That means all of the trees are predicting something different, and that suggests that we don't really know what we're doing
00:34:57.040 | And so that would happen if different subsets of the data end up giving completely different trees
00:35:02.240 | for this
00:35:04.760 | particular row
00:35:07.160 | So there's like a really simple thing you can do to get a sense of your prediction confidence
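As a sketch, using a fitted scikit-learn forest `rf` and validation features `val_xs` (both assumed from earlier):

```python
import numpy as np

# Stack each individual tree's predictions and look at the spread across trees:
# a high standard deviation means the trees disagree, i.e. low confidence.
all_preds = np.stack([t.predict(val_xs) for t in rf.estimators_])
pred_std = all_preds.std(0)
```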
00:35:13.080 | Okay feature importance. We've already discussed
00:35:16.160 | After I do feature importance, you know, like I said, when I had the, what, 7,000 or so columns, I got rid of all but 30
00:35:26.760 | That doesn't tend to improve the predictions of your random forest very much
00:35:31.680 | If at all, but it certainly helps like
00:35:36.680 | You know kind of logistically thinking about cleaning up the data
00:35:40.160 | You can focus on cleaning those 30 columns stuff like that. So I tend to remove the low importance variables
00:35:45.080 | I'm going to skip over this bit about removing redundant features because it's a little bit outside what we're talking about
00:35:53.880 | But definitely check it out in the book
00:35:55.880 | something called a dendrogram
00:35:57.880 | But what I do want to mention is is the partial dependence this is the thing which says
00:36:04.720 | What is the relationship?
00:36:06.720 | between a
00:36:09.480 | Column and the dependent variable and so this is something called a partial dependence plot now
00:36:15.640 | This one's actually not specific to random forests
00:36:17.800 | A partial dependence plot is something you can do for basically any machine learning model
00:36:22.760 | Let's first of all look at one and then talk about how we make it
00:36:27.040 | So in this data set we're looking at the relationship. We're looking at
00:36:32.560 | the sale price at auction of heavy industrial equipment like bulldozers, this is specifically the
00:36:38.920 | Blue Book for Bulldozers Kaggle competition, and
00:36:42.240 | a partial dependence plot between the year that the bulldozer or whatever was made and
00:36:48.880 | the price it was sold for. This is actually the log price.
00:36:52.640 | It goes up: more recently made bulldozers are more expensive,
00:37:00.920 | and as you go back to older and older bulldozers,
00:37:04.000 | They're less and less expensive to a point and maybe these ones are some old
00:37:09.400 | classic bulldozers you pay a bit extra for
00:37:14.760 | You might think that you could easily create this plot by simply looking at your data at each year and taking the average sale price
00:37:22.760 | But that doesn't really work very well
00:37:25.680 | I mean it kind of does but it kind of doesn't let me give an example
00:37:29.600 | It turns out that one of the biggest predictors of sale price for industrial equipment is whether it has air conditioning
00:37:37.080 | and so air conditioning is you know, it's an expensive thing to add and it makes the equipment more expensive to buy and
00:37:45.200 | Most things didn't have air conditioning back in the 60s and 70s and most of them do now
00:37:50.360 | So if you plot the relationship between year made and price
00:37:55.480 | You're actually going to be seeing a whole bunch of
00:37:57.880 | When you know how popular was air conditioning?
00:38:01.640 | Right, so you get this cross-correlation going on, when we just want to know,
00:38:06.480 | what's just the impact of the year it was made, all else being equal?
00:38:10.960 | So there's actually a really easy way to do that which is we take our data set
00:38:17.320 | We leave it exactly as it is; we just use the training data set,
00:38:22.200 | but we take every single row and for the year made column we set it to 1950 and
00:38:27.680 | so then we predict for every row what would the sale price of that have been if it was made in 1950 and
00:38:35.120 | then we repeat it for 1951, and we repeat it for 1952, and so forth, and then we plot the averages and
00:38:42.160 | That does exactly what I just said. Remember I said the special words all else being equal
00:38:47.720 | This is setting everything else equal: everything else is the data as it actually occurred, and we're only varying year made
00:38:55.240 | And that's what a partial dependence plot is
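Written out by hand, the trick just described looks something like this (`m`, `xs`, and the `YearMade` column name are assumptions matching the bulldozers example):

```python
import numpy as np

years = np.arange(1950, 2012)
avg_preds = []
for yr in years:
    xs_yr = xs.copy()
    xs_yr['YearMade'] = yr                     # force every row to this year
    avg_preds.append(m.predict(xs_yr).mean())  # average prediction, all else as it actually was
# Plotting years against avg_preds gives the partial dependence of price on YearMade.
```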
00:38:58.320 | That works just as well for deep learning or gradient boosting trees or logistic regressions or whatever. It's a really
00:39:07.480 | Cool thing you can do
00:39:10.280 | And you can do more than one column at a time, you know, you can do two-way
00:39:17.600 | partial dependence plots
00:39:19.320 | for example
00:39:20.840 | Another one. Okay, so then another one I mentioned was
00:39:24.400 | Can you describe why a particular?
00:39:27.960 | Prediction was made. So how did you decide for this particular row?
00:39:33.560 | to predict this particular value and
00:39:36.960 | This is actually pretty easy to do there's a thing called tree interpreter
00:39:41.840 | But we could you could easily create this in about half a dozen lines of code all we do
00:39:49.320 | We're saying okay
00:39:51.920 | This customer's come in they've asked for a loan
00:39:54.480 | We've put in all of their data through the random forest, and it spat out a prediction.
00:39:59.200 | We can actually have a look and say, okay, well, in tree number one,
00:40:03.680 | What's the path that went down through the tree to get to the leaf node?
00:40:07.520 | And we can say oh, well first of all it looked at sex and then it looked at postcode and then it looked at income
00:40:13.600 | and so we can see
00:40:16.960 | exactly in tree number one which variables were used and what was the
00:40:21.720 | change in Gini for each one and
00:40:24.480 | Then we can do the same in tree two, tree three, tree four... does this sound familiar?
00:40:29.160 | It's basically the same as our feature importance plot, right?
00:40:32.680 | But it's just for this one row of data and so that will tell you basically the feature
00:40:37.320 | Importances for that one particular prediction and so then we can plot them
00:40:43.280 | Like this. So for example, this is an example of an
00:40:46.720 | auction price prediction and
00:40:49.680 | According to this plot, you know, it predicted that the net would be...
00:40:55.280 | this is just the change, so I don't actually know what the price is,
00:41:03.760 | but this is how much each one impacted the price. So
00:41:06.640 | year made, I guess this must have been an old tractor, it caused the prediction of the price to go down
00:41:13.280 | But then it must have been a larger machine; the product size caused it to go up
00:41:17.440 | Coupler system made it go up, model ID made it go up, and
00:41:21.280 | so forth, right? So you can see the red says this made our prediction go down, green made our prediction go up, and
00:41:28.280 | so overall you can see
00:41:31.080 | Which things had the biggest impact on the prediction and what was the direction?
00:41:35.200 | for each one
00:41:37.560 | So it's basically a feature importance plot
00:41:40.160 | But just for a single row for a single row
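The `treeinterpreter` package (used in chapter 9 of the book) produces exactly this kind of per-row breakdown; a rough sketch, assuming a fitted random forest regressor `rf` and validation features `val_xs`:

```python
from treeinterpreter import treeinterpreter as ti

row = val_xs.iloc[:1]
prediction, bias, contributions = ti.predict(rf, row.values)
# `bias` is the average prediction over the training set; each entry of
# contributions[0] is how much one column pushed this particular prediction up or down.
for col, c in sorted(zip(val_xs.columns, contributions[0]), key=lambda t: t[1]):
    print(f'{col:20} {c:+.3f}')
```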
00:41:42.400 | Any questions John
00:41:47.440 | Yeah, there are a couple that have that are sort of queued up this is a good spot to jump to them
00:41:59.360 | first of all Andrew's asking jumping back to the
00:42:02.680 | the OOB error: would you ever exclude a tree from a forest if it had a bad out-of-bag
00:42:11.360 | error? Like, I guess if you had a particularly bad
00:42:13.960 | Tree in your ensemble. Yeah, like might you just
00:42:17.720 | Would you delete a tree that was not doing its thing? It's not playing its part. No you wouldn't
00:42:24.180 | If you start deleting trees then you are no longer
00:42:29.960 | having an unbiased prediction of the dependent variable
00:42:34.440 | You are biasing it by making a choice. So even the bad ones
00:42:40.160 | will be
00:42:42.120 | Improving the quality of the overall
00:42:44.280 | average
00:42:46.320 | All right. Thank you. Um
00:42:48.320 | Zakiya followed up with the question about
00:42:50.400 | Bagging and we're just going you know layers and layers here
00:42:55.440 | You know, we could go on and create ensembles of bagged models
00:42:59.520 | And you know, is it reasonable to assume that they would continue that's not gonna make much difference, right?
00:43:05.480 | If they're all... like, you could take a hundred trees, split them into groups of ten, create ten bagged ensembles,
00:43:12.640 | And then average those but the average of an average is the same as the average
00:43:16.180 | You could like have a wider range of other kinds of models
00:43:20.760 | You could have like neural nets trained on different subsets as well
00:43:23.800 | But again, it's just the average of an average will still give you the average
00:43:26.640 | Right. So there's not a lot of value in kind of structuring the ensemble
00:43:31.840 | You just I mean some some ensembles you can structure but but not bagging bagging's the simplest one
00:43:37.920 | It's the one I mainly use
00:43:39.920 | There are more sophisticated approaches, but this one
00:43:42.480 | Is nice and easy
00:43:45.080 | All right, and there's there's one that
00:43:47.080 | Is a bit specific and it's referencing content you haven't covered but we're here now. So
00:43:52.040 | And it's on explainability
00:43:54.960 | so feature importance of
00:43:57.840 | Random forest model sometimes has different results when you compare to other explainability techniques
00:44:07.120 | like SHAP or LIME
00:44:07.120 | And we haven't covered these in the course, but Amir is just curious if you've got any thoughts on which is more accurate or reliable
00:44:14.240 | Random forest feature importance or other techniques? I
00:44:18.080 | Would lean towards
00:44:26.400 | More immediately trusting random forest feature importances over other techniques on the whole
00:44:32.560 | On the basis that it's very hard to mess up a random forest
00:44:42.680 | Yeah, I feel like pretty confident that a random forest feature importance is going to
00:44:47.720 | Be pretty reasonable
00:44:50.680 | As long as this is the kind of data which a random forest is likely to be pretty good at you know
00:44:56.400 | Doing you know, if it's like a computer vision model random forests aren't
00:45:00.280 | Particularly good at that
00:45:01.920 | And so one of the things that Breiman talked about a lot was explainability, and he's got a great essay called the Two Cultures
00:45:08.120 | of statistics, in which he talks about, I guess, what are nowadays called data scientists and machine learning folks versus classic statisticians, and
00:45:16.120 | he was, you know, definitely a data scientist well before the
00:45:22.560 | The label existed and he pointed out. Yeah, you know first and foremost
00:45:26.720 | You need a model that's accurate, that is, it makes good predictions. A model that makes bad predictions
00:45:33.800 | Will also be bad for making explanations because it doesn't actually know what's going on
00:45:38.200 | So if you know if you if you've got a deep learning model that's far more accurate than your random forest then it's you know
00:45:45.640 | Explainability methods from the deep learning model will probably be more useful because it's explaining a model
00:45:51.760 | that's actually correct
00:45:53.760 | Alright, let's take a 10-minute break and we'll come back at 5 past 7
00:46:03.840 | Welcome back one person pointed out I noticed I got the chapter wrong. It's chapter 9 not chapter 8 in the book
00:46:16.440 | I guess I can't read
00:46:20.960 | Somebody asked during the break about overfitting
00:46:24.680 | Can you overfit a random forest?
00:46:28.840 | Basically, no, not really adding more trees will make it more accurate
00:46:35.840 | It kind of asymptotes so you can't make it infinitely accurate by using infinite trees, but certainly, you know adding more trees won't make it worse
00:46:46.760 | If you don't have enough trees
00:46:51.520 | and you
00:46:53.520 | Let the trees grow very deep that could overfit
00:46:57.720 | So you just have to make sure you have enough trees
00:47:00.800 | Radek told me during the break about an experiment he did,
00:47:15.480 | which is something similar to what I've done: adding lots and lots of randomly generated columns
00:47:21.800 | to a data set and
00:47:24.160 | Try to break the random forest and
00:47:26.160 | If you try it, it basically doesn't work. It's like it's really hard
00:47:30.680 | to confuse a random forest by giving it lots of
00:47:34.440 | meaningless data it does an amazingly good job of picking out
00:47:38.720 | The the useful stuff as I said, you know, I had
00:47:43.000 | 30 useful columns out of 7,000 and it found them
00:47:45.800 | perfectly well
00:47:48.720 | And often, you know when you find those 30 columns
00:47:52.340 | You know, you could go to you know
00:47:54.400 | I was doing consulting at the time go back to the client and say like tell me more about these columns
00:47:58.680 | That's and they'd say like oh well that one there. We've actually got a better version of that now
00:48:02.000 | There's a new system, you know, we should grab that and oh this column actually that was because of this thing that happened last year
00:48:07.960 | But we don't do it anymore or you know, like you can really have this kind of discussion about the stuff you've zoomed into
00:48:13.120 | You know
00:48:26.440 | There are other things that you have to think about with lots of kinds of models like particularly regression models things like interactions
00:48:32.520 | You don't have to worry about that with random forests like because you split on one column and then split on another column
00:48:38.800 | You get interactions for free
00:48:41.200 | as well
00:48:43.760 | Normalization you don't have to worry about you know, you don't have to have normally distributed columns
00:48:49.960 | So, yeah, definitely worth a try now something I haven't gone into
00:48:57.800 | Is gradient boosting
00:49:05.400 | But if you go to explain.ai
00:49:10.720 | you'll see that my friend Terence and I have a three-part series about gradient boosting,
00:49:16.840 | including pictures of golf made by Terence
00:49:20.240 | But to explain gradient boosting is a lot like random forests
00:49:27.160 | but rather than
00:49:29.160 | training a
00:49:31.560 | model, or rather fitting a tree, again and again and again, on different random subsets of the data,
00:49:37.280 | instead what we do is we fit very, very small trees with hardly any splits, and
00:49:44.840 | We then say okay. What's the error? So, you know
00:49:49.120 | so imagine the simplest tree would be a one-hour rule tree of
00:49:55.320 | Male versus female say and then use you take what's called the residual
00:50:00.600 | That's the difference between the prediction and the actual the error and then you create another tree which attempts to predict that
00:50:08.120 | very small tree, and then you create another very small tree which tries to predict the error from that, and
00:50:17.000 | So forth each one is predicting the residual from all of the previous ones. And so then to calculate a prediction
00:50:25.120 | Rather than taking the average of all the trees
00:50:27.640 | you take the sum of all the trees, because each one has predicted the difference between the actual and
00:50:33.640 | All of the previous trees and that's called boosting
00:50:37.920 | versus bagging so boosting and bagging are two kind of meta-ensembling techniques and
00:50:44.160 | When bagging is applied to trees, it's called a random forest and when boosting is applied to trees
00:50:50.880 | It's called a gradient boosting machine or gradient boosted decision tree
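A hand-rolled sketch of that boosting loop, with tiny regression stumps each fit to the residuals left by everything before them (`trn_xs`, `trn_y` assumed to be numeric training data; the number of trees and depth are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

preds = np.zeros(len(trn_y))
trees = []
for _ in range(50):
    resid = trn_y - preds                   # what's left to explain so far
    t = DecisionTreeRegressor(max_depth=1)  # a tiny "stump" with a single split
    t.fit(trn_xs, resid)
    trees.append(t)
    preds += t.predict(trn_xs)              # the ensemble's prediction is the *sum* of the trees
```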
00:50:55.800 | Gradient boosting is generally speaking more accurate than random forests
00:51:04.840 | But you can absolutely over fit
00:51:08.280 | and so therefore
00:51:11.040 | It's not necessarily my first go-to thing having said that there are ways to avoid over fitting
00:51:16.460 | But yeah, it's just it's it's not
00:51:19.280 | It's it you know because it's breakable it's not my first choice
00:51:26.040 | But yeah, check out our stuff here if you're interested and you know, you there is stuff which largely automates the process
00:51:34.920 | There's lots of hyperparameters you have to select; people generally just, you know, try every combination of hyperparameters
00:51:41.040 | And in the end you're generally should be able to get a more accurate gradient boosting model than random forest
00:51:48.760 | But not necessarily by much
00:51:50.760 | Okay, so that was the
00:51:58.560 | Kaggle notebook on random forests how random forests really work
00:52:16.720 | What we've been doing is having this daily
00:52:20.240 | Walk through where me and I don't know how many 20 or 30 folks get together on a zoom call and chat about
00:52:29.040 | you know getting through the course and
00:52:32.400 | setting up machines and stuff like that and
00:52:36.480 | You know, we've been trying to kind of practice what you know things along the way
00:52:44.120 | and so a couple of weeks ago, I
00:52:46.960 | wanted to show like
00:52:49.640 | What does it look like to pick a Kaggle competition and just like?
00:52:53.200 | Do the normal sensible
00:52:57.080 | Kind of mechanical steps that you would do for any computer vision model
00:53:02.720 | And so the
00:53:06.880 | Competition I picked was paddy disease classification
00:53:13.080 | which is about
00:53:15.080 | recognizing rice diseases in rice paddies
00:53:18.600 | And yeah, I spent I don't know a couple of hours or three. I can't remember a few hours
00:53:23.720 | Throwing together something and
00:53:29.920 | Found that I was number one on the leaderboard and I thought oh, that's that's interesting like
00:53:35.840 | because you never quite have a sense of
00:53:38.360 | How well these things work?
00:53:41.880 | And then I thought well, there's all these other things. We should be doing as well and I tried
00:53:45.800 | three more things and each time I tried another thing I got further ahead at the top of the leaderboard so
00:53:53.760 | I thought it'd be cool to take you through
00:53:57.500 | the process I'm gonna do it reasonably quickly because
00:54:04.040 | The walkthroughs are all available
00:54:08.960 | For you to see the entire thing in you know, seven hours of detail or however long we probably were six to seven hours of conversations
00:54:16.560 | But I want to kind of take you through the basic process that I went through
00:54:22.780 | So since I've been starting to do more stuff on Kaggle, you know, I realized there's some
00:54:35.600 | Kind of menial steps. I have to do each time particularly because I like to run stuff on my own machine
00:54:41.420 | And then kind of upload it to Kaggle
00:54:44.200 | So to make my life easier I created a little module called fastkaggle,
00:54:51.120 | which you'll see in my notebooks now, and which you can download from pip or conda
00:54:56.900 | And as you'll see it makes some things a bit easier for example
00:55:02.920 | downloading the data for the paddy disease classification: if you just run setup_comp and
00:55:08.400 | Pass in the name of the competition if you are on Kaggle it will return a path to
00:55:18.440 | Competition data that's already on Kaggle if you are not on Kaggle and you haven't downloaded it
00:55:23.560 | It will download and unzip the data for you
00:55:25.480 | If you're not on Kaggle and you have already downloaded and unzipped the data, it will return a path to the one that you've already downloaded
00:55:31.520 | also, if you are on Kaggle you can ask it to make sure that
00:55:34.620 | Pip things are installed that might not be up to date. Otherwise
00:55:39.040 | So this basically one line of code now gets us all set up and ready to go
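That one line looks roughly like this (the `install` string is an assumption based on the walkthrough notebooks):

```python
from fastkaggle import setup_comp

comp = 'paddy-disease-classification'
# On Kaggle: returns the mounted competition path (and pip-installs `install`).
# Locally: downloads and unzips the data if needed, then returns its path.
path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')
```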
00:55:43.600 | so this path
00:55:46.680 | So I ran this particular one on my own machine so it's downloaded and unzipped the data
00:55:52.760 | I've also got links to the
00:55:55.680 | Six walkthroughs so far. These are the videos
00:56:00.600 | Oh, yes, and here's my result after these
00:56:06.240 | For attempts that's a few fiddling around at the start
00:56:10.880 | So the overall approach at is well and this is not just to a Kaggle competition right at the reason
00:56:21.260 | I like looking at Kaggle competitions is
00:56:23.260 | You can't hide from the truth
00:56:27.840 | In a Kaggle competition, you know when you're working on some work project or something
00:56:32.300 | You might be able to convince yourself and everybody around you that you've done a fantastic job of
00:56:38.100 | not overfitting and your models better than what anybody else could have made and whatever else but
00:56:44.240 | The brutal assessment of the private leaderboard
00:56:48.560 | Will tell you the truth
00:56:51.560 | Is your model actually predicting things correctly and is it overfit?
00:57:00.560 | Until you've been through that process
00:57:04.120 | You know, you're never gonna know and a lot of people don't go through that process because at some level they don't want to know
00:57:09.640 | But it's okay, you know, nobody needed it you don't have to put your own name there
00:57:19.040 | Always did right from the very first one. I wanted, you know, if I was gonna screw up royally
00:57:23.760 | I wanted to have the pressure on myself of people seeing me in last place
00:57:27.240 | but you know, it's it's fine you could do it all and honestly and
00:57:31.120 | You'll actually find
00:57:34.320 | As you improve you also have so much self-confidence, you know
00:57:41.880 | The stuff we do in a Kaggle competition is indeed a subset of the things we need to do in real life
00:57:49.120 | It's an important subset, you know building a model that actually predicts things correctly and doesn't overfit is important and furthermore
00:57:57.080 | structuring your code and analysis in such a way that you can keep improving over a three-month period without gradually getting into more and
00:58:04.900 | more of a tangled mess of impossible to understand code and
00:58:07.840 | Having no idea what untitled copy 13 was and why it was better than
00:58:16.080 | right, this is all
00:58:18.360 | stuff you want to be practicing
00:58:20.360 | ideally
00:58:22.480 | Well away from customers or whatever, you know before you've kind of figured things out
00:58:27.280 | So the things I talk about here about doing things well in this Kaggle competition
00:58:34.040 | Should work, you know in other settings as well
00:58:39.060 | And so these are the two focuses that I recommend
00:58:45.200 | Get a really good validation set together. We've talked about that before right and in a Kaggle competition
00:58:50.460 | That's like it's very rare to see people do well in a Kaggle competition who don't have a good validation set
00:58:55.900 | sometimes that's easy and this competition actually it is easy because the
00:59:03.800 | test set seems to be a random sample
00:59:05.800 | But most of the time it's not actually I would say
00:59:08.480 | And then how quickly can you iterate?
00:59:12.720 | How quickly can you try things and find out what worked? So obviously you need a good validation set. Otherwise, it's impossible to iterate and
00:59:19.880 | So iterating quickly means not asking, what is the biggest,
00:59:25.640 | you know, OpenAI-scale, takes-four-months-on-a-hundred-TPUs model that I can train?
00:59:34.100 | It's, what can I do that's going to train in a minute or so and
00:59:39.200 | Will quickly give me a sense of like well, I could try this I could try that what things gonna work and then try
00:59:45.160 | You know 80 things
00:59:48.200 | It also doesn't mean saying, like, oh, I heard about this amazing
00:59:52.540 | Bayesian hyperparameter tuning approach; I'm gonna spend three months implementing that, because that's only gonna give you one thing.
01:00:01.460 | But to actually do well
01:00:04.640 | in these competitions, or in machine learning in general, you actually have to do everything
01:00:09.840 | reasonably well,
01:00:12.560 | and doing just one thing really well will still put you somewhere around last place.
01:00:17.160 | So I actually saw that a couple of years ago: an Aussie guy who's a
01:00:21.200 | very very distinguished machine learning
01:00:24.720 | practitioner
01:00:27.720 | actually put together a team, entered a Kaggle competition, and literally came in last place,
01:00:34.160 | Because they spent the entire three months trying to build this amazing new
01:00:38.560 | fancy
01:00:41.200 | thing and
01:00:43.040 | never actually iterated. If you iterate, I guarantee you won't be in last place.
01:00:48.820 | Okay, so here's how we can grab our data with fastkaggle, and it tells us what path it's in.
01:01:03.120 | And then I set my random seed
01:01:05.120 | And I only do this because I'm creating a notebook to share, you know when I share a notebook
01:01:11.960 | I like to be able to say as you can see, this is point eight three blah blah blah, right and
01:01:15.760 | Know that when you see it, it'll be point eight three as well
01:01:18.720 | But when I'm doing stuff, otherwise, I would never set a random seed
01:01:22.400 | I want to be able to run things multiple times and see how much it changes each time
01:01:26.720 | because that'll give me a sense of like
01:01:29.960 | are the modifications I'm making changing it because they're improving it or making it worse, or is it just random variation?
01:01:35.280 | So if you always set a random seed,
01:01:37.960 | that's a bad idea, because you won't be able to see the random variation. So this is just here for presenting a notebook.
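As a rough sketch, this is what that setup looks like in code; the competition slug is an assumption (the rice competition discussed here), and set_seed is only used because this is a shared notebook:

    from fastkaggle import setup_comp
    from fastai.vision.all import *

    comp = 'paddy-disease-classification'   # hypothetical slug for this rice-disease competition
    path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')  # downloads and unzips the data if needed
    set_seed(42)  # only so a shared notebook reproduces its numbers; skip this while experimenting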
01:01:44.520 | Okay, so the data they've given us as usual they've got a sample submission they've got some test set images
01:01:53.280 | They've got some training set images a CSV file about the training set
01:02:00.000 | And then these other two you can ignore because I created them
01:02:02.840 | So let's grab a path
01:02:05.880 | to the train images. And so, do you remember get_image_files?
01:02:09.440 | That gets us a list of the file names of all the images here, recursively.
01:02:16.760 | So we could just grab the first one and
01:02:19.960 | Take a look. So it's 480
01:02:22.720 | by 640
01:02:25.040 | Now we've got to be careful
01:02:26.880 | This is a Pillow image, a Python Imaging Library image.
01:02:30.120 | In the imaging world, they generally say columns by rows; in
01:02:35.560 | the array-slash-tensor world, we always say rows by columns.
01:02:40.800 | So if you ask PyTorch what the size of this is, it'll say 640 by 480, and I guarantee at some point
01:02:47.440 | This is going to bite you. So try to recognize it now
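A small sketch of that check, assuming the helpers shown in the lesson; note the (width, height) versus (rows, columns) mismatch:

    trn_path = path/'train_images'
    files = get_image_files(trn_path)
    im = PILImage.create(files[0])
    print(im.size)             # PIL reports (width, height), e.g. (480, 640)
    print(np.array(im).shape)  # arrays/tensors report (rows, cols, channels), e.g. (640, 480, 3)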
01:02:50.540 | Okay, so they're kind of taller than they are wide; at least this one is.
01:02:58.320 | I'd actually like to know are they all this size because it's really helpful if they all are all the same size or at least similar
01:03:03.880 | Believe it or not the amount of time it takes to decode a JPEG is actually quite significant
01:03:12.640 | And so figuring out what size these things are is actually going to be pretty slow
01:03:18.200 | But my fast core library has a parallel sub module which can basically do anything
01:03:25.140 | That you can do in Python. It can do it in parallel. So in this case, we wanted to create a pillow image and get its size
01:03:31.060 | So if we create a function that does that and pass it to parallel passing in the function and the list of files
01:03:37.720 | It does it in parallel and that actually runs pretty fast
01:03:41.020 | And so here is the answer
01:03:44.060 | As it happens, ten thousand four hundred and three images are indeed 480 by 640, and four of them aren't.
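Something like this, assuming fastcore's parallel and pandas for counting (the worker count is an arbitrary choice):

    from fastcore.parallel import parallel
    import pandas as pd

    def f(o): return PILImage.create(o).size   # decode one JPEG and return its (width, height)
    sizes = parallel(f, files, n_workers=8)    # run across processes; much faster than a plain loop
    print(pd.Series(sizes).value_counts())     # almost all images share one size, a handful don't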
01:03:52.540 | So basically what this says to me is that we should pre-process them or you know
01:03:56.580 | at some point process them so that they're probably all 480 by 640, or all basically the same kind of size.
01:04:02.100 | We'll pretend they're all this size
01:04:04.100 | but we can't skip doing some initial resizing; otherwise, this is going to screw things up.
01:04:17.540 | So like that probably the easiest way to do things the most common way to do things is to
01:04:22.460 | Either squish or crop every image to be a square
01:04:26.860 | So squishing is when you just in this case squish the aspect ratio down
01:04:33.260 | As opposed to cropping randomly a section out, so if we call resize squish it will squish it down
01:04:42.900 | And so this is 480 by 480, a square. So this is what it's going to do to all of the images first, on the CPU.
01:04:50.500 | That allows them to be all batched together into a single mini batch
01:04:56.780 | Everything in a mini batch has to be the same shape
01:04:59.020 | otherwise the GPU won't like it and
01:05:02.220 | then that mini batch is put through data augmentation and
01:05:06.820 | It will
01:05:09.620 | Grab a random subset of the image and make it at 128 by 128 pixel
01:05:15.980 | And here's what that looks like. Here's our data
01:05:19.780 | So show_batch works for pretty much everything, not just in the fastai library
01:05:26.620 | But even for things like fast audio, which are kind of community based things
01:05:30.980 | You should be able to use show_batch on anything and see, or hear, or whatever, what your data looks like.
01:05:39.340 | I don't know anything about rice disease
01:05:41.780 | But apparently these are various rice diseases and this is what they look like
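A sketch of the DataLoaders being described, assuming labels come from the training folder structure; the sizes match the ones mentioned in the talk:

    dls = ImageDataLoaders.from_folder(
        trn_path, valid_pct=0.2, seed=42,
        item_tfms=Resize(480, method='squish'),               # CPU: squish every image to a 480px square
        batch_tfms=aug_transforms(size=128, min_scale=0.75))  # GPU: random 128px crops plus augmentation
    dls.show_batch(max_n=6)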
01:05:46.060 | So, um, I I jump into creating models much more quickly than most people
01:05:58.260 | Because I find model, you know models are a great way to understand my data as we've seen before
01:06:03.540 | So I basically build a model as soon as I can
01:06:09.220 | I want to
01:06:10.900 | Create a model that's going to let me iterate quickly. So that means that I'm going to need a model that can train quickly
01:06:20.540 | Thomas Capelle and I recently
01:06:23.580 | did this big project, "The best vision models for fine-tuning",
01:06:29.340 | Where we looked at nearly a hundred different
01:06:35.420 | architectures
01:06:37.060 | from Ross Wightman's timm library,
01:06:40.140 | the PyTorch Image Models library, and
01:06:43.300 | looked at
01:06:46.740 | Which ones could we fine-tune which ones had the best transfer learning results
01:06:52.320 | And we tried two different data sets very different data sets
01:06:55.900 | One is the pets data set that we've seen before
01:06:58.940 | So trying to predict what breed of pet is from 37 different breeds
01:07:05.780 | and the other was a
01:07:07.780 | Satellite imagery data set called planet. They're very very different data sets in terms of what they contain and also very different sizes
01:07:15.500 | The planet ones a lot smaller the pets ones a lot bigger
01:07:19.180 | And so the main things we measured were how much memory did it use?
01:07:23.980 | How accurate was it and how long did it take to fit?
01:07:27.580 | And then I created this score, which combines the fit time and error rate together.
01:07:35.460 | So this is a really useful table
01:07:37.780 | For picking a model and now in this case. I want to pick something
01:07:45.260 | that's really fast and
01:07:48.380 | there's one clear winner on speed, which is resnet26d, and
01:07:53.380 | its error rate was 6%, versus the best, which was like 4.1%.
01:07:59.540 | So okay, it's not amazingly accurate, but it's still pretty good, and it's gonna be really fast
01:08:04.260 | So that's why I picked
01:08:07.660 | resnet26d. A lot of people think that
01:08:11.460 | when they do deep learning they're going to spend all of their time learning about exactly how a resnet 26 D is made and
01:08:19.500 | convolutions and resnet blocks and transformers and blah blah blah we will cover all that stuff
01:08:25.640 | In part two and a little bit of it next week
01:08:29.500 | But it almost never matters
01:08:31.740 | Right, it's just it's just a function right and what matters is the inputs to it and the outputs to it
01:08:38.420 | And how fast it is how accurate it is
01:08:41.500 | So let's create a learner with a resnet26d from our data loaders.
01:08:51.260 | Let's run LR find so LR find
01:08:54.020 | Will put through one mini batch at a time
01:08:58.380 | starting at a very very very low learning rate and gradually increase the learning rate and track the loss and
01:09:03.860 | Initially the loss won't improve, because the learning rate is so small
01:09:10.060 | It doesn't really do anything and at some point the learning rates high enough that the loss will start coming down
01:09:15.020 | Then at some other point the learning rate is so high that it's gonna start jumping past the answer, and it gets a bit worse.
01:09:22.700 | And so somewhere around here is a learning rate. We'd want to pick
01:09:29.980 | We've got a couple of different ways of making suggestions I
01:09:34.180 | Generally ignore them because these suggestions are specifically designed to be conservative
01:09:41.620 | They're a bit lower than perhaps an optimal in order to make sure we don't recommend something that totally screws up
01:09:47.220 | But I kind of like to say like well, how far right can I go and still see it like clearly really improving quickly?
01:09:53.740 | And so I pick somewhere around
01:09:57.180 | 0.01 for this
01:09:59.180 | So I can now
01:10:01.740 | Fine-tune our model with a learning rate of 0.01
01:10:04.140 | Three epochs and look the whole thing took a minute. That's what we want, right? We want to be able to iterate
01:10:10.220 | Rapidly just a minute or so. So that's enough time for me to go and you know, grab a glass of water or
01:10:16.340 | do some reading; like, I'm not gonna get too distracted.
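A minimal sketch of that loop, assuming a timm-style architecture name and the learning rate read off the plot:

    learn = vision_learner(dls, 'resnet26d', metrics=error_rate, path='.').to_fp16()
    learn.lr_find()            # plot loss vs learning rate; pick a value where the loss is still dropping fast
    learn.fine_tune(3, 0.01)   # three epochs at lr=0.01: roughly a minute of training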
01:10:22.820 | What do we do before we submit?
01:10:25.500 | Nothing, we submit as soon as we can. Okay, let's get our submission in so we've got a model. Let's get it in
01:10:32.020 | So we read in our CSV file of the sample submission and
01:10:37.580 | So the CSV file basically looks like we're gonna have to have a list of the image
01:10:42.180 | file names in order and then a column of labels
01:10:46.980 | So we can get all the image files in the test images folder,
01:10:54.580 | like so, and we can sort them.
01:10:56.580 | So now what we want is a data loader
01:11:01.300 | Which is exactly like the data loader we use to train the model
01:11:07.380 | Except pointing at the test set we want to use exactly the same transformations
01:11:11.700 | So there's actually a DL dot test DL method which does that you just pass in
01:11:18.000 | The new set of items so the test set files
01:11:22.340 | So this is a data loader which we can use
01:11:24.820 | for our
01:11:27.380 | Test set a
01:11:29.580 | Test data loader has a key difference to a normal data loader, which is that it does not have any labels
01:11:36.300 | So that's a key distinction
01:11:39.620 | So we can get the predictions for our learner passing in that data loader and
01:11:48.300 | In the case of a classification problem, you can also ask for them to be decoded. Decoded means rather than just getting back the
01:11:56.420 | probability of every
01:11:58.660 | rice disease for every row, it'll tell you the index of the most probable
01:12:05.420 | rice disease. That's what decoded means. So that returns the probabilities, the
01:12:10.420 | targets (which obviously will be empty because it's a test set, so throw them away), and those decoded indexes,
01:12:17.260 | Which look like this numbers from 0 to 9 because there's 10 possible rice diseases
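Roughly, assuming the test images live under a test_images folder as in this competition:

    tst_files = get_image_files(path/'test_images').sorted()
    tst_dl = dls.test_dl(tst_files)          # same transforms as training, but unlabelled
    probs, _, idxs = learn.get_preds(dl=tst_dl, with_decoded=True)
    idxs                                     # tensor of predicted class indices, 0..9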
01:12:21.700 | The Kaggle submission does not expect numbers from 0 to 9 it expects to see
01:12:27.580 | strings like these
01:12:30.780 | So what do those numbers from 0 to 9 represent?
01:12:34.660 | We can look up our vocab
01:12:37.620 | to get a list
01:12:39.700 | So that's 0 that's 1 etc. That's 9
01:12:46.380 | I realized later this is a slightly inefficient way to do it, but it does the job.
01:12:50.140 | I need to be able to map these two strings
01:12:55.580 | If I enumerate the vocab, that gives me pairs of numbers: 0 bacterial leaf blight, 1 bacterial leaf streak, etc.
01:13:02.740 | I can then create a dictionary out of that, and then I can use pandas
01:13:07.700 | to look up each thing in the dictionary; they call that map.
01:13:13.260 | If you're a pandas user, you've probably seen map used before being passed a function
01:13:18.140 | which is really, really slow. But if you pass map a dict, it's actually really, really fast; do it this way if you can.
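A sketch of that mapping; the disease names in the comment are just indicative:

    mapping = dict(enumerate(dls.vocab))     # e.g. {0: 'bacterial_leaf_blight', 1: 'bacterial_leaf_streak', ...}
    results = pd.Series(idxs.numpy(), name='label').map(mapping)   # dict lookup is fast, unlike mapping a function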
01:13:25.620 | so here's our
01:13:28.700 | Predictions
01:13:31.420 | So we've got our
01:13:34.540 | Submission sample submission file SS. So if we replace this column label with our predictions
01:13:43.220 | like so
01:13:45.220 | Then we can turn that into a CSV and
01:13:47.540 | remember, the exclamation mark means
01:13:50.860 | run a bash command, a shell command; head shows the first few rows. Let's just take a look... that looks reasonable.
01:13:58.780 | So we can now submit that to Kaggle now
01:14:03.180 | Iterating rapidly means everything needs to be
01:14:09.300 | Fast and easy things that are slow and hard don't just take up your time
01:14:14.420 | but they take up your mental energy. So even submitting to Kaggle needs to be fast, so I put it into a cell,
01:14:20.580 | So I can just run this cell
01:14:23.340 | api competition submit, this CSV file,
01:14:30.340 | Give it a description. So just run the cell and it submits to Kaggle and as you can see it says here
01:14:36.940 | here we go, successfully submitted.
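Roughly, building on the sketches above and assuming the Kaggle API credentials are already configured; the submission message is arbitrary:

    ss = pd.read_csv(path/'sample_submission.csv')
    ss['label'] = results
    ss.to_csv('subm.csv', index=False)

    from kaggle import api
    api.competition_submit('subm.csv', 'initial resnet26d 128px', comp)   # description text is arbitrary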
01:14:39.740 | So that submission
01:14:41.740 | was terrible
01:14:44.780 | Top 80% also known as bottom 20% which is not too surprising right? I mean, it's it's one minute of training time
01:14:53.820 | But it's something that we can start with, and
01:15:00.740 | however long it takes you to get to this point and put in your submission,
01:15:04.260 | Now you've really started right because then tomorrow
01:15:08.340 | You can try to make a slightly better one
01:15:11.260 | So I like to share my notebooks, and even sharing the notebook I've automated.
01:15:20.340 | So part of fastkaggle is you can use this thing called push_notebook, and that sends it off to Kaggle
01:15:26.740 | to create a
01:15:29.460 | Notebook on Kaggle
01:15:34.660 | There it is. And there's my score
01:15:37.540 | As you can see, it's exactly the same thing
01:15:43.060 | Why would you create public notebooks on Kaggle? Well
01:15:57.660 | It's the same
01:16:01.740 | brutality of feedback
01:16:04.060 | That you get for entering a competition
01:16:06.940 | But this time rather than finding out in no uncertain terms whether you can predict things accurately
01:16:12.660 | this time you can find out, in no uncertain terms, whether you can communicate things in a way that people find interesting and useful.
01:16:19.620 | And if you get zero votes
01:16:22.380 | You know, so be it right that's something to know and then you know ideally go and ask some friends like
01:16:29.980 | what do you think I could do to improve? And if they say, oh nothing, it's fantastic, you can tell them, no, that's not true,
01:16:36.540 | I didn't get any votes; I'll try again. This isn't good. How do I make it better? You know.
01:16:42.140 | And you can try and improve
01:16:45.260 | because
01:16:47.580 | If you can create models that predict things well, and you can communicate your results in a way that is clear and compelling
01:16:54.220 | You're a pretty good data scientist, you know, like they're two pretty important things and so here's a great way to
01:17:01.900 | Test yourself out on those things and improve. Yes, John
01:17:07.220 | Yes, Jeremy. We have a sort of a I think a timely question here from Zakiya about your iterative approach
01:17:13.700 | And they're asking do you create different Kaggle notebooks for each model that you try?
01:17:19.940 | So one Kaggle notebook for the first one, then separate notebooks subsequently, or do you append to the bottom of it?
01:17:27.600 | What's your strategy? That's a great question
01:17:29.940 | And I know Zaki is going through the
01:17:33.740 | the daily walkthroughs but isn't quite caught up yet. So I will say, keep it up, because
01:17:38.460 | In the six hours of going through this you'll see me create all the notebooks
01:17:43.880 | But if I go to the actual directory I used
01:17:51.720 | You can see them so basically yeah, I started with
01:17:57.740 | You know what you just saw
01:18:01.900 | A bit messier, without the prose, but that same basic thing. I then duplicated it
01:18:06.380 | to create the next one
01:18:09.500 | which is here, and because I duplicated it, you know, this stuff which I still need is still there, right?
01:18:15.780 | and so I run it and
01:18:17.780 | I don't always know what I'm doing, you know
01:18:21.380 | And so at first, if I don't rename it when I duplicate it, it will be called,
01:18:26.540 | you know, "first steps on the road to the top part one - Copy 1",
01:18:31.560 | You know, and that's okay
01:18:33.920 | And as soon as I can I'll try to rename that
01:18:39.020 | Once I know what I'm doing, you know
01:18:41.680 | Or if it doesn't seem to go anywhere, I rename it to something like, you know,
01:18:47.840 | Experiment blah blah blah and I'll put some notes at the bottom and I might put it into a failed folder or something
01:18:54.040 | But yeah, it's like
01:18:56.040 | It's a very low-tech
01:18:59.800 | Approach
01:19:01.080 | That I find works really well, which is just duplicating notebooks and editing them and naming them carefully and putting them in order
01:19:08.960 | And you know put the file name in when you submit as well
01:19:17.440 | Then of course also if you've got things in git
01:19:19.440 | You know, you can have a link to the git commit so you'll know exactly what it is
01:19:23.560 | Generally speaking for me, you know
01:19:25.380 | My notebooks will only have one submission in and then I'll move on and create a new notebook
01:19:30.020 | So I don't really worry about versioning so much,
01:19:32.600 | But you can do that as well if that helps you
01:19:36.480 | Yeah, so that's basically what I do and and
01:19:41.360 | I've worked with a lot of people who use much more sophisticated and complex processes and tools and stuff, but
01:19:48.080 | None of them seem to be able to stay as well organized as I am
01:19:53.200 | I think they kind of get a bit lost in their tools sometimes and
01:19:56.880 | File systems and file names I think are good
01:20:01.080 | Great thanks. Um, so away from that kind of
01:20:06.300 | dev process more towards the
01:20:09.480 | The specifics of you know finding the best model and all that sort of stuff
01:20:14.120 | we've got a couple of questions that are in the same space, which is
01:20:17.280 | You know, we've got some people here talking about AutoML frameworks
01:20:21.000 | Which you might want to you know touch on for people who haven't heard of those
01:20:24.000 | If you've got any particular AutoML frameworks, you think are
01:20:27.780 | worth
01:20:30.080 | Recommending or just more generally, how do you go trying different models random forest gradient boosting neural network?
01:20:36.840 | Just, in that space, if you could comment? Sure.
01:20:40.080 | I use AutoML less than anybody. I know I would guess
01:20:48.000 | Which is to say never
01:20:52.960 | Hyperparameter optimization never
01:21:02.200 | The reason why is I like being highly intentional, you know
01:21:07.560 | I like to think more like a scientist and have hypotheses and test them carefully
01:21:14.760 | And come out with conclusions, which then I implement, you know, so for example
01:21:19.600 | in this best vision models of fine-tuning I
01:21:23.720 | Didn't try a huge grid search of every possible
01:21:29.280 | Model every possible learning rate every possible pre-processing approach blah blah blah, right instead step one was to find out
01:21:37.240 | Well, which things matter right? So
01:21:43.480 | For example, does whether we squish or crop
01:21:46.080 | make a difference? You know, are some models better with squish and some models better with crop?
01:21:52.880 | So we just tested that
01:21:57.200 | Again, not for every possible architecture
01:21:59.280 | But for one or two versions of each of the main families that took 20 minutes and the answer was no in every single case
01:22:05.680 | The same thing was better. So we don't need to do a grid search over that anymore, you know
01:22:12.160 | Or another classic one is like learning rates. Most people
01:22:15.680 | Do a kind of grid search over learning rates or they'll train a thousand models, you know with different learning rates
01:22:23.620 | This fantastic researcher named Leslie Smith invented the learning rate finder a few years ago
01:22:27.660 | We implemented it. I think within days of it first coming out as a technical report. That's what I've used ever since
01:22:35.680 | Because it works
01:22:38.320 | Well and runs in a minute or so
01:22:42.160 | Yeah, I mean, then, like, neural nets versus GBMs versus random forests, I mean, that's
01:22:52.740 | That shouldn't be too much of a question on the whole like they have pretty clear
01:22:59.160 | Places that they go
01:23:05.560 | If I'm doing computer vision, I'm obviously going to use a computer vision deep learning model
01:23:10.760 | And which one I would use. Well if I'm transfer learning, which hopefully is always I would look up the two tables here
01:23:17.440 | This is my table for pets
01:23:18.920 | Which is which are the best at fine-tuning to very similar things to what they were pre trained on and then the same thing for planet
01:23:26.440 | Is which ones are best for fine-tuning for two data sets that are very different to what they're trained on
01:23:32.680 | And as it happens, in both cases they're very similar; in particular, ConvNeXt is right up towards the top in both cases.
01:23:39.120 | so I just like to have these rules of thumb and
01:23:42.240 | Yeah, my rule of thumb for tabular is
01:23:46.080 | Random forests going to be the fastest easiest way to get a pretty good result GBM's
01:23:50.760 | Probably gonna give me a slightly better result if I need it and can be bothered fussing around
01:23:59.920 | For GBMs I would probably, yeah, actually I probably would run a hyperparameter
01:24:05.520 | sweep
01:24:07.280 | because it is fiddly and it's fast, so you may as well.
01:24:11.000 | So yeah, so now you know, we were able to make a slightly better submission slightly better model
01:24:29.560 | I had a couple of thoughts about this. The first thing was
01:24:32.200 | that thing trained in
01:24:35.520 | A minute on my home computer and then when I uploaded it to Kaggle it took about four minutes per epoch
01:24:43.720 | which was horrifying and
01:24:46.120 | Kaggle's GPUs are not amazing, but they're not that bad
01:24:51.280 | So I knew something was up.
01:24:54.160 | And what was up is I realized that they only have two
01:24:58.640 | virtual CPUs, which nowadays is tiny; like, you know, you generally want, as a rule of thumb, about eight
01:25:05.440 | physical CPUs per GPU
01:25:12.120 | So it was spending all of its time just reading the damn data.
01:25:14.720 | Now, the data was 640 by 480, and we were ending up with these 128-pixel-size bits for speed,
01:25:20.760 | So there's no point doing that every epoch
01:25:23.880 | so step one was to make my
01:25:27.760 | Kaggle iteration faster as well. And so very simple thing to do
01:25:32.480 | resize the images
01:25:34.800 | So fastai has a function called resize_images, and you say, okay, take all the train images and stick them in
01:25:42.640 | the destination
01:25:45.080 | making them this size
01:25:47.080 | recursively
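A sketch of that preprocessing step, assuming fastai's resize_images; the destination folder name and the 256-pixel maximum size are arbitrary choices here:

    trn_path = Path('train_sml')                     # hypothetical folder for the smaller copies
    resize_images(path/'train_images', dest=trn_path,
                  max_size=256, recurse=True)        # recreates the per-class folder structure under dest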
01:25:49.000 | And it will recreate the same folder structure over here. And so that's why I called this the training path
01:25:56.440 | because this is now my training data and
01:25:58.440 | So when I then
01:26:02.240 | trained on that on
01:26:04.600 | Kaggle it went down to
01:26:06.880 | four times faster
01:26:10.040 | With no loss of accuracy. So that was kind of step one was to actually get my fast
01:26:15.920 | iteration working
01:26:20.720 | Now, a minute is still a long time, and on Kaggle you can actually see this little graph showing how much the CPU is being
01:26:28.200 | used and how much the GPU is being used; on your own home machine
01:26:30.920 | there are, you know, free tools to do the same thing.
01:26:34.680 | I saw that the GPU was still hardly being used, and the CPU was still being driven pretty hard.
01:26:41.000 | I wanted to use a better model anyway to move up the leaderboard
01:26:45.040 | so I moved from a
01:26:50.640 | Oh, by the way, this graph is very useful. So this is
01:26:55.040 | This is speed versus error rate by family and so we're about to be looking at these
01:27:06.560 | ConvNeXt models
01:27:10.640 | So we're going to be looking at this one, convnext_tiny.
01:27:18.920 | Here it is, convnext_tiny. So we were looking at resnet26d, which took this long on this data set,
01:27:25.200 | But this one here is nearly the best. It's third best, but it's still very fast
01:27:31.900 | And so it's the best overall score. So let's use this
01:27:36.200 | Particularly because you know, we're still spending all of our time waiting for the CPU anyway
01:27:40.760 | So it turned out that when I switched my architecture to ConvNeXt,
01:27:46.240 | It basically ran just as fast on Kaggle
01:27:48.840 | So we can then
01:27:51.880 | train that
01:27:53.920 | Let me switch to the Kaggle version because my outputs are missing for some reason
01:28:03.680 | Yeah, so I started out by running the resnet26d on the resized images and got
01:28:08.160 | Similar error rate, but I ran a few more epochs
01:28:11.000 | got 12% error rate and
01:28:14.280 | so then I do exactly the same thing but with convnext_small, and get a
01:28:18.560 | 4.5% error rate. So don't think that different architectures only give you
01:28:24.400 | tiny little differences. This is over twice as good.
01:28:34.880 | A lot of folks you talk to will never have heard of this ConvNeXt, because it's very new, and
01:28:42.040 | I've noticed a lot of people tend not to
01:28:44.040 | keep up to date with new things. They kind of learn something at university and then they stop learning.
01:28:51.040 | So if somebody's still just using resnets all the time,
01:28:54.320 | you know, you can tell them we've actually moved on, you know.
01:28:59.600 | Resnets are still probably the fastest,
01:29:02.880 | but for the mix of speed and performance, you know, not so much.
01:29:10.480 | ConvNeXt, you know, again, you want these rules of thumb, right? If you're not sure what to do,
01:29:15.920 | use this ConvNeXt. Okay, and then like most things there's different sizes: there's a tiny, there's a small, there's a base,
01:29:24.840 | There's a large there's an extra large and you know, it's just well, let's look at the picture
01:29:30.320 | This is it here
01:29:37.000 | Right
01:29:39.760 | Large takes longer but lower error
01:29:43.080 | Tiny takes less time but higher error, right? So you you pick
01:29:48.280 | About your speed versus accuracy trade-off for you. So for us small is great
01:29:54.860 | And so, yeah, now we've got a 4.5 percent error rate; that's terrific.
01:30:03.680 | Now, to iterate: on Kaggle this is taking about a minute per epoch; on my computer it's probably taking about 20 seconds per epoch.
01:30:12.120 | So not too bad
01:30:14.120 | So, you know one thing we could try is instead of using squish as
01:30:19.880 | Our pre-processing let's try using crop. So that will randomly crop out an area
01:30:25.840 | And that's the default. So if I remove the method equals squish that will crop
01:30:31.480 | So you see how I've tried to get everything into a single
01:30:34.080 | function, right? This single function: I can tell it
01:30:39.920 | what architecture do I want to train, how do I want to transform the items,
01:30:45.320 | how do I want to transform the batches, and how many epochs do I want to do. That's basically it, right?
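Something along these lines, assuming a timm model name; dropping method='squish' from Resize falls back to random cropping:

    def train(arch, item, batch, epochs=5):
        dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42,
                                           item_tfms=item, batch_tfms=batch)
        learn = vision_learner(dls, arch, metrics=error_rate).to_fp16()
        learn.fine_tune(epochs, 0.01)
        return learn

    # crop instead of squish: just leave out method='squish'
    learn = train('convnext_small_in22k',
                  item=Resize(192),
                  batch=aug_transforms(size=128, min_scale=0.75))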
01:30:50.480 | So this time I want to use the same architecture, convnext; I want to resize by cropping rather than squishing, and then use the same data augmentation.
01:31:00.880 | And, okay, the error rate's about the same;
01:31:03.000 | it's a tiny bit worse, but not enough to be interesting.
01:31:08.280 | Instead of cropping, we can pad. Now, padding is interesting; do you see how these are all square,
01:31:15.560 | Right, but they've got black borders
01:31:20.720 | Padding is interesting because it's the only way of pre-processing images
01:31:24.200 | Which doesn't distort them and doesn't lose anything if you crop you lose things
01:31:29.680 | If you squish you distort things
01:31:31.840 | This does neither. Now, of course, the downside is that there are pixels that are literally pointless; they contain zeros.
01:31:39.720 | So every way of getting this working has its compromises
01:31:44.560 | but this approach of resizing where we pad with zeros is
01:31:48.480 | Not used enough and it can actually often work quite well
01:31:52.600 | And in this case it was about as good as our best so far,
01:31:58.920 | But no not huge differences yet
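The padded variant is just a different item transform; as a sketch, with the same assumed architecture and sizes as before:

    learn = train('convnext_small_in22k',
                  item=Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros),
                  batch=aug_transforms(size=128, min_scale=0.75))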
01:32:00.920 | What else could we do well
01:32:05.280 | What we could do is
01:32:09.600 | See these pictures this is all the same picture
01:32:16.840 | But it's gone through our data augmentation. So sometimes it's a bit darker. Sometimes it's flipped horizontally
01:32:24.400 | Sometimes it's slightly rotated; sometimes it's slightly warped; sometimes it's zoomed into a slightly different section. But this is all the same picture.
01:32:31.240 | Maybe our model would like some of these versions better than others
01:32:37.200 | So what we can do is we can pass all of these to our model get predictions for all of them and
01:32:44.640 | Take the average
01:32:48.280 | Right. So it's our own kind of like little mini bagging approach and this is called test time augmentation
01:32:54.440 | Fast AI is very unusual in making that available in a single method
01:32:59.800 | you just pass TTA and it will pass multiple augmented versions of the image and
01:33:08.800 | Average them for you
01:33:13.120 | So this is the same model as before which had a four point five percent
01:33:18.640 | So instead if we get TTA predictions
01:33:21.960 | And then get the error rate
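A sketch of that, evaluating TTA on the validation set:

    preds, targs = learn.tta(dl=learn.dls.valid)   # predicts several augmented versions and averages them
    error_rate(preds, targs)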
01:33:26.120 | Wait, why does this say four point eight last time I did this it was way better. Well, that's messing things up, isn't it?
01:33:39.120 | So when I did this originally on my home computer, it went from like four point five to three point nine, so possibly I
01:33:45.160 | got very bad luck this time. So this is the first time I've actually ever seen TTA give a worse result.
01:33:52.680 | So that's very weird I
01:33:58.120 | wonder if it's
01:34:00.840 | If I should do something other than the crop padding, all right, I'll have to check that out and I'll try and come back to
01:34:06.480 | You and find out
01:34:07.840 | Why in this case?
01:34:09.760 | This one was worse
01:34:11.760 | Anyway take my word for it every other time I've tried it TTA has been better
01:34:17.480 | So then, you know now that we've got a pretty good way of
01:34:22.760 | resizing
01:34:25.320 | We've got TTA. We've got a good training process
01:34:28.480 | Let's just make bigger images and something that's really interesting and a lot of people don't realize is your images don't have to be square
01:34:37.240 | they just all have to be the same size and
01:34:39.240 | Given that nearly all of our images are 640 by 480 we can just pick, you know that aspect ratio
01:34:46.080 | So for example 256 by 192 and we'll resize everything
01:34:50.520 | To the same aspect ratio rectangular
01:34:53.900 | That should work even better still. So if we do that, we'll do 12 epochs
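As a sketch, with the rectangular sizes mentioned here taken as assumptions (fastai expects (rows, cols), i.e. height by width):

    learn = train('convnext_small_in22k', epochs=12,
                  item=Resize((256, 192)),                              # rectangular, same 4:3 portrait ratio as the originals
                  batch=aug_transforms(size=(256, 192), min_scale=0.75))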
01:34:58.360 | Okay, now our error rates down to 2.2 percent and
01:35:06.440 | Then we'll do TTA
01:35:08.440 | Okay, this time you can see it actually improving down to under 2 percent
01:35:13.280 | So that's pretty cool, right? We've got our error rate at the start of this notebook. We were at
01:35:18.880 | Twelve percent and by the time we've got through our little experiments
01:35:29.600 | We're down to under 2 percent
01:35:34.600 | And nothing about this is in any way specific to
01:35:38.320 | Rice or this competition, you know, it's like this is a very
01:35:43.680 | Mechanistic, you know standardized
01:35:48.640 | Approach
01:35:51.440 | which you can use for
01:35:53.440 | certainly any kind of this type of computer vision competition, and almost any computer vision data set.
01:36:00.120 | But you know, it looked very similar for a collaborative filtering model or tabular model NLP model whatever
01:36:06.000 | So, of course, again, I want to submit as soon as I can, so I just copy and paste the exact same steps
01:36:13.640 | I took last time basically for creating a submission
01:36:16.320 | So as I said last time we did it using pandas, but there's actually an easier way
01:36:22.640 | So the step where here I've got the numbers from 0 to 9
01:36:26.840 | Which is like which which rice disease is it?
01:36:30.240 | So here's a cute idea
01:36:33.000 | we can take our vocab and
01:36:35.000 | Make it an array. So that's going to be a list of ten things and
01:36:39.040 | Then we can index into that vocab with our indices, which is kind of weird. This is a list of ten things
01:36:47.400 | This is a list of I don't know four or five thousand things. So this will give me four or five thousand results, which is
01:36:55.160 | Each vocab item for that thing. So this is another way of doing the same mapping and I would
01:37:01.380 | spend time
01:37:03.840 | playing with this code to understand what it does because it's the kind of like
01:37:07.480 | very fast, you know, not just in terms of writing, but this would
01:37:14.320 | optimize, you know, on the CPU
01:37:17.920 | very, very well. This is the kind of coding you want to get used to,
01:37:23.120 | this kind of indexing
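A sketch of the array-indexing version; it relies on numpy's fancy indexing to do one vocab lookup per prediction in a single step:

    vocab = np.array(dls.vocab)                      # the 10 class-name strings
    results = pd.Series(vocab[idxs], name='label')   # index 10 names with thousands of indices at once
    ss['label'] = results
    ss.to_csv('subm.csv', index=False)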
01:37:25.120 | Anyway, so then we can submit it just like last time and
01:37:29.440 | when I did that I
01:37:32.200 | got in the top 25% and
01:37:34.200 | That's that's where you want to be right? Like generally speaking I find in Kaggle competitions the top 25% is
01:37:43.920 | You're kind of like solid competent
01:37:46.680 | level, you know. Look, that's not to say it's easy;
01:37:51.960 | You've got to know what you're doing
01:37:53.960 | but if you get in the top 25%, I think you can really feel like, yeah, this is a
01:37:58.680 | you know very
01:38:01.600 | Reasonable attempt and so that's I think this is a very reasonable attempt
01:38:06.360 | Okay, before we wrap up John any last questions
01:38:10.560 | Yeah, there's this there's two I think that would be good if we could touch on quickly before you wrap up
01:38:20.080 | one from Victor asking about TTA
01:38:22.560 | When I use TTA during my training process do I need to do something special during inference or is this something you use only
01:38:31.440 | Okay, so just explain
01:38:34.160 | TTA means test time augmentation. So specifically it means inference. I think you mean augmentation during training. So yeah, so during training
01:38:42.360 | You basically always do augmentation, which means you're varying each image slightly
01:38:49.080 | so that the
01:38:50.720 | model never sees the same image exactly the same twice, and so it can't memorize it.
01:38:55.400 | In fastai, and as I say, I don't think anybody else does this as far as I know, if you call TTA,
01:39:02.800 | it will use the exact same augmentation approach on
01:39:07.120 | whatever data set you pass it and
01:39:10.120 | average out the predictions; so, like, multiple times on the same image, and we'll average them out.
01:39:15.960 | So you don't have to do anything different. But if you didn't have any data augmentation in training, you can't use TTA
01:39:21.760 | It uses the same by default the same data augmentation you use for training
01:39:26.120 | Great. Thank you. And the other one is about how
01:39:30.520 | You know when you first started this example you squared the models and the images rather and you talked about
01:39:36.760 | squashing verse cropping verse, you know clipping and
01:39:39.920 | Scaling and so on but then you went on to say that
01:39:45.120 | These models can actually take rectangular input, right?
01:39:48.600 | so there's a question that's kind of probing it at that, you know, if the if the models can take rectangular inputs
01:39:56.520 | Why would you ever even care as long as they're all the same size? So I
01:40:02.660 | Find most of the time
01:40:06.480 | Datasets tend to have a wide variety of input sizes and aspect ratios
01:40:14.240 | You know, if there's just as many tall skinny ones as wide
01:40:18.040 | short ones
01:40:21.360 | you know, it doesn't make sense to create a rectangle, because some of them you're gonna really destroy.
01:40:26.840 | So that's where a square is kind of the
01:40:28.840 | best compromise in some ways
01:40:31.600 | There are
01:40:34.440 | better things we can do
01:40:36.440 | Which we don't have any
01:40:40.480 | off-the-shelf library support for yet, and I don't know that anybody else has even published about this,
01:40:45.560 | but we experimented with kind of trying to
01:40:47.800 | Batch things that are similar aspect ratios together and use the kind of median
01:40:54.640 | Rectangle for those and have had some good results with that. But honestly
01:40:59.240 | 99.99% of people given a wide variety of aspect ratios chuck everything into a square a
01:41:07.280 | Follow-up just this is my own interest. Have you ever looked at?
01:41:10.600 | You know, so the issue with padding, as you say, is that you're putting black pixels there.
01:41:17.120 | Those are not NaNs, those are black pixels. (That's right.) And so there's something problematic to me, you know, conceptually, about that.
01:41:26.480 | You know when you when you see
01:41:30.440 | for example
01:41:31.880 | four to three aspect ratio footage
01:41:35.040 | Presented for broadcast on 16 to 9 you got the kind of the blurred stretch that kind of stuff
01:41:39.760 | No, we played with that a lot. Yeah, I used to be really into it actually and fast a I still by default
01:41:45.920 | uses reflection padding, which means if this is, I don't know, say this is a 20-pixel-wide thing,
01:41:51.400 | it takes the 20 pixels next to it and flips it over and sticks it here and
01:41:55.080 | It looks pretty good. You know, another one is copy, which simply takes the outside pixel and repeats it, a bit like on TV.
01:42:06.760 | It turns out none of them really help; if anything, they make it worse,
01:42:13.160 | Because in the end
01:42:16.980 | The computer wants to know no, this is the end of the image. There's nothing else here. And if you reflect it, for example
01:42:22.880 | Then you're kind of creating weird spikes that didn't exist and the computer's got to be like, oh, I wonder what that spike is
01:42:29.640 | So yeah, it's a great question and I obviously spent like a couple of years
01:42:34.120 | Assuming that we should be doing things that look more image-like
01:42:38.180 | But actually the computer likes things to be presented to it in as straightforward a way as possible
01:42:43.820 | Alright, thanks everybody and I hope to see some of you in the walkthroughs and otherwise see you next time