
Lesson 6: Practical Deep Learning for Coders 2022


Chapters

0:00 Review
2:09 TwoR model
4:43 How to create a decision tree
7:02 Gini
10:54 Making a submission
15:52 Bagging
19:06 Random forest introduction
20:09 Creating a random forest
22:38 Feature importance
26:37 Adding trees
29:32 What is OOB
32:08 Model interpretation
35:47 Removing the redundant features
35:59 What does Partial dependence do
39:22 Can you explain why a particular prediction is made
46:07 Can you overfit a random forest
49:03 What is gradient boosting
51:56 Introducing walkthrus
54:28 What does fastkaggle do
62:52 fastcore.parallel
64:12 item_tfms=Resize(480, method='squish')
66:20 Fine-tuning project
67:22 Criteria for evaluating models
70:22 Should we submit as soon as we can
75:15 How to automate the process of sharing Kaggle notebooks
80:17 AutoML
84:16 Why the first model runs so slow on Kaggle GPUs
87:53 How much better can a new novel architecture improve the accuracy
88:33 Convnext
91:10 How to iterate the model with padding
92:01 What does our data augmentation do to images
94:12 How to iterate the model with larger images
96:08 pandas indexing
98:16 What data augmentation does TTA use?

Whisper Transcript

00:00:00.000 | Okay, so welcome back to
00:00:02.000 | Not welcome back to, welcome to lesson six; it's the first time we've been at lesson six. Welcome back to Practical Deep Learning for Coders
00:00:10.220 | We just started looking at tabular data
00:00:17.520 | Last time and
00:00:21.560 | for those of you who've forgotten what we did was we
00:00:28.640 | We were looking at the titanic data set
00:00:31.720 | And we were looking at creating binary splits
00:00:36.400 | by looking at
00:00:39.240 | Categorical variables or binary variables like sex
00:00:46.600 | Continuous variables like the log of the fare that they paid and
00:00:55.200 | Using those, you know, we also kind of came up with a score
00:00:59.120 | Which was basically how good a job that split did of grouping the
00:01:06.880 | survival characteristics into two groups, you know, nearly all of one group survived and nearly all of the other group didn't survive
00:01:14.640 | So they had like small standard deviation in each group
00:01:20.080 | So then we created the world's simplest little UI to allow us to fiddle around and try to find a good
00:01:25.680 | binary split and we did
00:01:28.040 | We did come up with a very good binary split
00:01:33.640 | Which was on sex, and actually we created this little
00:01:39.200 | automated version, and so this is, I think, the first time... no, we're not quite the first time, this is
00:01:45.800 | This is yet another time. I should say that we have successfully created a
00:01:50.280 | actual
00:01:52.760 | machine learning algorithm from scratch. This one is about the world's simplest one. It's 1R,
00:01:56.920 | creating the single rule which does a good job of splitting your data set into two parts
00:02:04.240 | which differ as much as possible on the dependent variable
00:02:08.080 | 1R is probably not going to cut it for a lot of things though
00:02:13.520 | It's surprisingly effective, but it's so maybe we could go a step further
00:02:18.760 | And the step further we could go is we could create like a 2R. What if we took each of those
00:02:24.000 | groups males and females in the Titanic data set and
00:02:29.040 | Split each of those into two other groups. So split the males into two groups and split the females into two groups
00:02:39.040 | To do that we can repeat the exact same piece of code we just did but let's remove
00:02:48.240 | sex from it and
00:02:50.240 | Then split the data set into males and females
00:02:53.760 | And run the same piece of code that we just did before but just for the males
00:02:58.480 | And so this is going to be like a 1R rule for how do we predict which males survived the Titanic?
00:03:06.760 | And let's have a look three eight three seven three eight three eight three eight. Okay, so it's
00:03:12.760 | Age were they greater than or less than six?
00:03:17.760 | Turns out to be for the males the biggest predictor of whether they were going to survive
00:03:21.560 | That shipwreck and we can do the same thing females. So for females
00:03:27.200 | There we go, no great surprise: Pclass, so whether they were in
00:03:37.440 | first class or not
00:03:40.000 | Was the biggest predictor for females of whether they would survive the shipwreck?
00:03:44.440 | So that has now given us a
00:03:52.200 | Decision tree it is a series of binary splits
00:03:58.240 | which will gradually
00:04:00.960 | split up our data more and more, such that in the end,
00:04:04.720 | in the leaf nodes as we call them, we will hopefully get as
00:04:12.680 | strong a prediction as possible about survival
00:04:15.160 | So we could just repeat this step for each of the four groups we've now created males
00:04:21.760 | kids and older than six
00:04:24.840 | females first class and
00:04:27.760 | Everybody else and we could do it again and then we'd have eight groups
00:04:32.280 | We could do that manually with another couple of lines of code or we can just use
00:04:39.120 | Decision tree classifier, which is a class which does exactly that for us
00:04:42.840 | So there's no magic in here. It's just doing what we've just described
00:04:47.240 | And a decision tree classifier comes from a library called scikit-learn
00:04:54.200 | Scikit-learn is a fantastic library that focuses on kind of classical
00:05:01.720 | non deep learning ish machine learning methods
00:05:06.800 | like decision trees
00:05:08.800 | So we can so to create the exact same decision tree
00:05:12.640 | We can say please create a DecisionTreeClassifier with at most four leaf nodes
00:05:18.040 | And one very nice thing it has is
00:05:22.560 | It can draw the tree for us
00:05:25.840 | So here's a tiny little draw tree function
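As a rough sketch of what this step looks like with scikit-learn (the `trn_xs`/`trn_y` names are assumptions standing in for the training features and labels prepared earlier in the notebook; the lesson uses a small graphviz-based `draw_tree` helper, but `plot_tree` gives a similar picture):

```python
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# trn_xs / trn_y are assumed: the training features and survival labels from earlier cells.
m = DecisionTreeClassifier(max_leaf_nodes=4)
m.fit(trn_xs, trn_y)

# Draw the fitted tree so we can read off the splits and leaf nodes.
plt.figure(figsize=(10, 6))
plot_tree(m, feature_names=list(trn_xs.columns), filled=True)
plt.show()
```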
00:05:32.640 | You can see here it's gonna first of all split on sex now
00:05:36.360 | It looks a bit weird to say sex is less than or equal to point five
00:05:39.100 | But remember what our binary characteristics are coded as zero one
00:05:44.640 | So that's just how we you know easy way to say males versus females
00:05:49.600 | And then here we've got for the females
00:05:55.040 | What class are they in and for the males what age are they and here's our four leaf nodes
00:06:02.320 | so for the females in first class a
00:06:07.000 | 116 of them survived and four of them didn't so very good idea to be a well-to-do
00:06:15.720 | woman on the Titanic
00:06:18.640 | On the other hand
00:06:21.600 | Males adults
00:06:28.000 | 68 survived 350 died so a very bad idea to be a male adult on the Titanic
00:06:35.580 | So you can see you can kind of get a quick summary
00:06:39.200 | Of what's going on and one of the reasons people tend to like decision trees particularly for exploratory data analysis
00:06:46.200 | Is it allows us to get a quick picture of what are the key
00:06:51.160 | driving variables in this data set and how much do they kind of
00:06:56.080 | Predict what was happening in the data?
00:06:58.400 | Okay, so it's found the same splits as us
00:07:03.920 | It's got one additional piece of information. We haven't seen before. It's this thing called Gini
00:07:07.840 | Gini is just another way of measuring how good a split is and
00:07:16.120 | Put the code to calculate Gini here
00:07:18.200 | Here's how you can think of Gini
00:07:21.760 | How likely is it that if you go into that sample?
00:07:25.560 | and grab
00:07:27.280 | one item and
00:07:29.000 | Then go in again and grab another item. How likely is it that you're going to grab the same item each time?
00:07:35.040 | and so
00:07:37.640 | If the entire leaf node is
00:07:42.620 | just people who survived or just people who didn't survive, the probability would be one; you'd get the same thing every time
00:07:48.540 | If it was an exactly equal mix the probability would be point five
00:07:51.880 | so that's why we just
00:07:55.200 | Yeah, that's where this this formula comes from in the binary case
00:07:58.880 | And in fact, you can see it here, right? This group here is pretty much 50/50. So Gini's point five
00:08:05.240 | Whereas this group here is nearly a hundred percent in one class. So Gini is nearly
00:08:10.040 | Zero, so I had it backwards. It's one minus
00:08:12.760 | And I think I've written it backwards here as well, so I better fix that
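Here is a minimal sketch of the binary Gini calculation as just described (not necessarily the exact function from the notebook): for a group where a proportion p is in one class, p squared plus (1-p) squared is the chance of drawing the same class twice, and the impurity is one minus that.

```python
def gini(p):
    """Binary Gini impurity for a group where a fraction `p` is in one class."""
    # p**2 + (1-p)**2 is the chance of drawing the same class twice (with replacement);
    # impurity is one minus that: 0 for a pure group, 0.5 for a 50/50 mix.
    return 1 - (p**2 + (1 - p)**2)

print(gini(0.5))   # 0.5   -- evenly mixed leaf
print(gini(0.97))  # ~0.06 -- nearly pure leaf
```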
00:08:23.480 | This decision tree, you know, we would expect it to be more accurate, so we can calculate
00:08:33.000 | its mean absolute error, and for the 1R, so just doing males versus females,
00:08:37.820 | What was our score
00:08:43.240 | Here we go
00:08:45.280 | point 407
00:08:47.680 | Actually, do we have an accuracy score somewhere? Here we are: point three three six
00:08:51.720 | That was for log fare, and for sex it was
00:08:59.640 | point two one five. Okay, so point two one five. So that was for the 1R version; for the decision tree with four leaf nodes,
00:09:08.120 | Point two two four. So it's actually a little worse, right?
00:09:12.300 | And I think this just reflects the fact that this is such a small data set
00:09:18.480 | the 1r
00:09:20.480 | Version was so good. We haven't really improved it that much
00:09:24.560 | But not enough to really see it
00:09:27.320 | Amongst the randomness of such a small validation set
00:09:30.680 | We could go further
00:09:34.640 | to a minimum of 50 samples per leaf node. So that means that in each of these,
00:09:41.520 | so here it says samples, which in this case is passengers on the Titanic, there's at least, there's 67 people that are
00:09:50.680 | female, first class,
00:09:52.680 | less than 28;
00:09:55.680 | that's how you define that. So this decision tree keeps splitting until it gets to a point where there's going to be less
00:10:02.240 | than 50, at which point it stops splitting that leaf, so you can see they've all got at least 50 samples
00:10:09.280 | And so here's the decision tree it builds. As you can see, it doesn't have to be like constant depth, right?
00:10:15.000 | So this group here
00:10:16.800 | Which is males?
00:10:18.800 | Who had cheaper fares?
00:10:21.720 | And who were older than 20?
00:10:25.680 | but younger than 32
00:10:29.120 | Actually younger than 24 and
00:10:32.840 | actually
00:10:35.000 | Super cheap fares and so forth, right? So it keeps going down until we get to that group. So
00:10:40.600 | Let's try that decision tree. That decision tree has a mean absolute error of point one eight three
00:10:46.560 | So not surprisingly, you know, once we get there, it's starting to look like it's a little bit better
00:10:51.240 | So there's a model and
00:10:56.680 | This is a kaggle competition. So therefore we should submit it to the leaderboard and
00:11:04.800 | You know one of the you know
00:11:06.800 | Biggest mistakes I see
00:11:09.760 | Not just beginners but every level of practitioner make on Kaggle is not to submit to the leaderboard
00:11:15.480 | They spend months making some perfect thing, right?
00:11:19.560 | But you actually want to see how you're going, and you should try and submit something to the leaderboard every day
00:11:24.200 | So, you know, regardless of how rubbish it is, because
00:11:29.720 | you want to improve every day
00:11:33.040 | And so you want to keep iterating so to submit something to the leaderboard you generally have to provide a
00:11:38.640 | CSV file
00:11:41.760 | And so we're going to create a CSV file
00:11:48.920 | We're going to apply the category codes to get the category for each one in our test set
00:11:55.440 | We're going to set the survived column to our predictions
00:11:58.800 | And then we're going to send that off to a CSV
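A sketch of what that submission step might look like; `tst_df`, `cols`, `m`, and `path` are assumptions standing in for the processed test set, the feature columns, the fitted tree, and the competition path from earlier cells:

```python
import pandas as pd

# Build a Kaggle submission file from the fitted model's predictions.
sub = pd.read_csv(path/'sample_submission.csv')
sub['Survived'] = m.predict(tst_df[cols]).astype(int)
sub.to_csv('sub.csv', index=False)
```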
00:12:05.080 | So yeah, so I submitted that and I got a score a little bit worse than most of our linear models and neural nets
00:12:13.120 | But not terrible, you know, it was it's it's just doing an okay job
00:12:17.800 | Now one interesting thing for the decision tree is there was a lot less pre-processing to do
00:12:27.320 | Did you notice that we didn't have to create any dummy variables for our categories?
00:12:35.080 | Like you certainly can create dummy variables, but you often don't have to so for example
00:12:40.720 | You know for for class, you know, it's one two or three you can just split on one two or three, you know
00:12:48.720 | even for like
00:12:52.040 | What was that thing like the the embarkation?
00:12:54.560 | City code like we just convert them kind of arbitrarily to numbers one two and three and you can split on those numbers
00:13:02.720 | So with random forests, or no, not random forests, the decision trees,
00:13:06.320 | Yeah, you can generally get away with not doing stuff like
00:13:11.280 | dummy variables
00:13:14.440 | In fact even taking the log of fair
00:13:16.760 | We only did that to make our graph look better. But if you think about it
00:13:22.320 | splitting on log fare less than 2.7
00:13:26.840 | is exactly the same as splitting on fare less than e to the 2.7, you know, whatever log base we used, I can't remember
00:13:36.840 | All that a decision tree cares about is the ordering of the data and this is another reason that decision tree-based approaches are fantastic
00:13:44.520 | Because they don't care at all about outliers, you know long tail distributions
00:13:52.320 | Categorical variables, whatever you can throw it all in and it'll do a perfectly fine job
00:14:01.280 | For tabular data, I would always start by using a decision tree-based approach
00:14:08.040 | And kind of create some baselines and so forth because it's it's really hard to mess it up
00:14:14.960 | And that's important
00:14:21.120 | So, yeah, so here for example is embarked right it it was coded originally as
00:14:28.080 | the first letter of the city they embarked in
00:14:31.760 | But we turned it into a categorical variable
00:14:35.320 | And so pandas for us creates this this vocab this list of all of the possible values
00:14:40.520 | And if you look at the codes
00:14:43.520 | attribute, you can see that S has become 2, C
00:14:50.800 | has become 0, and so forth. All right, so that's how we convert the categories, the strings,
00:14:57.660 | Into numbers that we can sort and group by
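For example, here is the pandas mechanism being described, on a tiny made-up column:

```python
import pandas as pd

# pandas turns a string column into a "vocab" of categories plus integer codes.
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', 'C']})
df['Embarked'] = df['Embarked'].astype('category')

print(df['Embarked'].cat.categories)      # Index(['C', 'Q', 'S'], ...) -- the vocab
print(df['Embarked'].cat.codes.tolist())  # [2, 0, 1, 2, 0] -- C->0, Q->1, S->2
```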
00:15:01.480 | So, yeah, so if we wanted to split C into one group and Q and S in the other, we can just do,
00:15:09.920 | okay, less than 0.5
00:15:12.780 | Now, of course if we wanted to split C and S into one group and Q into the other
00:15:18.480 | We would need two binary splits first C
00:15:21.000 | C on one side and Q and S on the other, and then Q and S into Q versus S
00:15:26.680 | And then the Q and S leaf nodes could get similar
00:15:29.960 | Predictions so like you do have that sometimes it can take a little bit more
00:15:33.680 | messing around but
00:15:36.920 | Most of the time I find categorical variables work fine as numeric in decision tree-based approaches
00:15:44.600 | And as I say here, I tend to use dummy variables only if there's like less than four levels
00:15:48.840 | Now what if we wanted to make this more accurate could we grow the tree further I
00:15:59.040 | Mean we could but
00:16:03.800 | You know, there's only 50 samples in these leaves right it's it's not really
00:16:14.680 | you know, if I keep splitting it, the leaf nodes are going to have so little data that they're not really going to make very useful predictions
00:16:24.720 | Now there are limitations to how accurate a decision tree can be
00:16:30.240 | So what can we do
00:16:35.720 | We can do something that's actually very I mean, I find it amazing and fascinating
00:16:42.680 | It comes from a guy called Leo Breiman
00:16:45.560 | And Leo Breiman came up with this idea
00:16:50.480 | Called bagging and here's the basic idea of bagging
00:16:54.840 | Let's say we've got a model
00:16:57.600 | That's not very good
00:17:00.560 | Because let's say it's a decision tree, it's really small we've hardly used any data for it, right?
00:17:07.280 | It's not very good. So it's got error. It's got errors on predictions
00:17:11.560 | It's not a systematically biased error. It's not always predicting too high or is predicting too low
00:17:16.880 | I mean decision trees, you know on average will predict the average, right?
00:17:21.080 | But it has errors
00:17:23.920 | So what I could do is I could build another decision tree in
00:17:28.300 | Some slightly different way that would have different splits and it would also be not a great model but
00:17:38.160 | Predicts the correct thing on average. It's not completely hopeless
00:17:41.080 | And again, you know, some of the errors are a bit too high and some are a bit too low
00:17:45.320 | And I could keep doing this. So if I could create building lots and lots of slightly different decision trees
00:17:50.960 | I'm gonna end up with say a hundred different models all of which are unbiased
00:17:57.680 | All of which are better than nothing and all of which have some errors bit high some bit low whatever
00:18:04.040 | So what would happen if I average their predictions?
00:18:08.440 | Assuming that the models are not correlated with each other
00:18:13.000 | Then you're going to end up with errors on either side of the correct prediction
00:18:20.560 | Some are a bit high some are a bit low and there'll be this kind of distribution of errors, right? And
00:18:26.000 | the average of those errors will be
00:18:29.160 | zero and
00:18:32.800 | So that means the average of the predictions of these multiple
00:18:36.280 | uncorrelated models each of which is unbiased will be
00:18:40.600 | The correct prediction because they have an error of zero and this is a mind-blowing insight
00:18:47.100 | It says that if we can generate a whole bunch of
00:18:51.840 | uncorrelated
00:18:54.280 | unbiased
00:18:55.840 | models
00:18:57.440 | We can average them and get something better than any of the individual models because the average of the error
00:19:04.600 | Will be zero
00:19:06.880 | So all we need is a way to generate
00:19:09.960 | lots of models
00:19:12.600 | Well, we already have a great way to build models, which is to create a decision tree
00:19:16.180 | How do we create lots of them?
00:19:19.160 | How do we create lots of unbiased but different models?
00:19:25.880 | Let's just grab a different subset of the data each time. Let's just grab at random half the rows and
00:19:32.720 | Build a decision tree and then grab another half the rows and build a decision tree
00:19:37.920 | And grab another half the rows and build a decision tree each of those decision trees is going to be not great
00:19:42.840 | It's only using half the data
00:19:44.960 | But it will be unbiased. It will be predicting the average on average
00:19:48.780 | It will certainly be better than nothing because it's using, you know, some real data to try and create a real decision tree
00:19:55.680 | They won't be correlated with each other because they're each random subsets. So that meets all of our criteria
00:20:02.720 | for bagging
00:20:05.160 | When you do this, you create something called a random forest
00:20:08.840 | So let's create one in four lines of code
00:20:17.960 | Here is a function to create a decision tree. So this prop is just the proportion of data
00:20:23.760 | So let's say we put 75% of the data in each time or we could change it to 50% whatever
00:20:29.680 | So this is the number of samples in this subset n and so let's at random choose
00:20:38.600 | n times the proportion we requested from the sample and build a decision tree from that and
00:20:47.240 | So now let's
00:20:50.520 | 100 times
00:20:52.520 | Get a tree and stick them all in a list using a list comprehension
00:20:56.800 | And now let's grab the predictions for each one of those trees and
00:21:03.760 | Then let's stack all those predictions up together and take their mean
00:21:07.680 | That is a random forest
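A sketch of those few lines, assuming `trn_xs`, `trn_y`, and `val_xs` are the training features/labels and validation features from earlier (details such as `min_samples_leaf=5` are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def get_tree(prop=0.75):
    """Fit one tree on a random subset of the training rows."""
    n = len(trn_y)
    idxs = np.random.choice(n, int(n * prop))
    return DecisionTreeClassifier(min_samples_leaf=5).fit(trn_xs.iloc[idxs], trn_y.iloc[idxs])

trees = [get_tree() for _ in range(100)]                  # 100 slightly different trees
all_preds = np.stack([t.predict(val_xs) for t in trees])  # each tree's predictions
avg_preds = all_preds.mean(0)                             # averaging them: a random forest
```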
00:21:12.480 | And what do we get one two three four five six seven eight that's seven lines of code. So
00:21:21.560 | random forests are very
00:21:23.560 | simple
00:21:25.400 | This is a slight simplification. There's one other difference that random forests do
00:21:30.000 | Which is when they build the decision tree. They also randomly select a subset of columns and
00:21:36.480 | they select a different random subset of columns each time they do a split and
00:21:41.800 | So the idea is you kind of want it to be as random as possible, but also somewhat useful
00:21:49.440 | We can do that by creating a random forest classifier
00:22:01.920 | Say how many trees do we want?
00:22:04.920 | how many
00:22:06.720 | samples per leaf, and then fit does what we just did, and here's our mean absolute error.
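In scikit-learn that looks roughly like this (the hyperparameter values here are assumptions, not necessarily the exact ones from the notebook):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5)
rf.fit(trn_xs, trn_y)
print(mean_absolute_error(val_y, rf.predict(val_xs)))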
00:22:16.680 | Again, it's like not as good as our decision tree, but it's still pretty good. And again, it's such a small data set
00:22:21.640 | It's hard to tell if that means anything and
00:22:23.640 | So we can submit that to Kaggle so earlier on I created a little function to submit to Kaggle
00:22:28.160 | So now I just create some predictions and I submit to Kaggle and yeah looks like it gave nearly identical results to a single tree
00:22:35.240 | Now to one of my favorite things
00:22:42.160 | About random forests and I should say in most real-world data sets of reasonable size random forests
00:22:48.480 | Basically always give you much better results than decision trees. This is just a small data set to show you what to do
00:22:55.540 | one of my favorite things about
00:22:58.640 | random forests is we can do something quite cool with them. What we can do is we can look at the
00:23:04.080 | Underlying decision trees they create so we've now got a hundred decision trees
00:23:09.720 | And we can see what columns
00:23:12.440 | did it find a split on, and so it says here, okay, well, the first thing it split on was sex, and
00:23:18.280 | it improved the Gini from
00:23:21.880 | Point four seven
00:23:25.080 | to, now just take the weighted average of point three eight and point three one, weighted by the samples,
00:23:30.400 | so that's probably going to be about point three three. So it's, okay, it's like a point one four improvement in Gini thanks to sex
00:23:41.480 | We can do that again. Okay. Well then Pclass, you know, how much did that improve Gini?
00:23:46.000 | Again, we keep weighting it by the number of samples. As well, log fare: how much does that improve Gini? And we can keep track
00:23:52.880 | for each column of
00:23:55.560 | How much in total did they improve the Gini in this decision tree and then do that for every decision tree and
00:24:05.320 | then add them up per column and that gives you something called a feature importance plot and
00:24:11.880 | Here it is
00:24:14.960 | And a feature importance plot tells you how important is each feature
00:24:20.280 | how often did the trees pick it and how much did it improve the Gini when it did and
00:24:26.240 | so we can see from the feature importance plot that sex was the most important and
00:24:34.120 | Class was the second most important and everything else was a long way back
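scikit-learn accumulates those per-column Gini improvements for you, so a feature importance plot is only a couple of lines (assuming the same `trn_xs` and fitted forest `rf` as before):

```python
import pandas as pd

fi = pd.DataFrame(dict(cols=trn_xs.columns, imp=rf.feature_importances_))
fi.sort_values('imp', ascending=False).plot('cols', 'imp', 'barh')
```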
00:24:38.400 | Now this is another reason by the way why our random forest isn't really particularly helpful
00:24:43.400 | Because it's just such an easy split to do, right? Basically all that matters is
00:24:49.080 | You know what class you're in and whether you're male and female
00:24:54.280 | and these
00:24:57.720 | Feature importance plots remember because they're built on random forests
00:25:05.040 | Random forests don't care about
00:25:07.200 | really
00:25:09.280 | The distribution of your data and they can handle categorical variables and stuff like that
00:25:13.600 | That means that basically any tabular data set you have, you can just plot this
00:25:18.720 | right away and
00:25:21.000 | Random forests, you know for most data sets only take a few seconds to train, you know, really at most of a minute or two
00:25:29.440 | And so if you've got a big data set and you know hundreds of columns
00:25:34.480 | do this first and
00:25:37.480 | find the
00:25:39.400 | 30 columns that might matter
00:25:41.760 | It's such a helpful thing to do. So I've done that for example. I did some work in credit scoring
00:25:49.080 | so we're trying to find out which
00:25:51.080 | Things would predict who's going to default on a loan and I was given
00:25:55.840 | something like
00:25:58.000 | 1,000
00:25:59.200 | columns from the database
00:26:01.200 | And I put it straight into a random forest and found I think there was about 30 columns that seemed
00:26:06.640 | Kind of interesting. I did that
00:26:09.360 | like two hours after I started the job and I went to the
00:26:14.040 | Head of marketing and the head of risk and I told them here's the columns. I think that we should focus on and
00:26:21.160 | They were like, oh my god. We just finished a two-year
00:26:25.800 | consulting project with one of the big consultants
00:26:27.960 | Paid them millions of dollars, and they came up with a subset of these
00:26:32.640 | There are other things that you can do with
00:26:41.400 | With random forests along this path. I'll touch on them briefly
00:26:49.560 | Specifically
00:26:51.800 | I'm going to look at
00:26:53.760 | chapter 8 of the book
00:26:56.040 | Which goes into this in a lot more detail and particularly interestingly chapter 8 of the book uses a
00:27:03.160 | Much bigger and more interesting data set which is auction prices of heavy industrial equipment
00:27:10.600 | I mean, it's less interesting historically, but more interestingly numerically
00:27:14.600 | And so some of the things I did there on this data set
00:27:25.800 | So this isn't from the data set; this is from the scikit-learn documentation
00:27:28.960 | They looked at how as you increase the number of estimators. So the number of trees
00:27:33.920 | how much does the
00:27:37.400 | Accuracy improve so I then did the same thing on our data set. So I actually just
00:27:42.000 | Added up to 40 more and more and more trees and
00:27:47.960 | you can see that basically as as predicted by that kind of an initial bit of
00:27:53.840 | Hand-wavy theory I gave you that you would expect the more trees
00:27:58.360 | The lower the error because the more things you're averaging and that's exactly what we find the accuracy improves as we have more trees
00:28:06.880 | John what's up?
00:28:09.600 | Victor is
00:28:11.600 | You might have just answered his question actually as he talked it but he's he's asking on the same theme the number of trees in a
00:28:18.520 | Random forest does increasing the number of trees always?
00:28:21.480 | Translate to a better error. Yes. It does always I mean tiny bumps, right? But yeah, once you smooth it out
00:28:34.880 | Decreasing returns and
00:28:40.320 | If you end up productionising a random forest, then of course every one of these trees you have to,
00:28:45.680 | you know, go through at inference time
00:28:49.520 | So it's not that there's no cost. I mean, having said that,
00:28:53.600 | Zipping through a binary tree is the kind of thing you can
00:28:58.080 | really
00:29:01.120 | do fast; in fact, it's quite easy to literally
00:29:05.080 | spit out C++ code
00:29:09.320 | With a bunch of if statements and compile it and get extremely fast performance
00:29:15.400 | I don't often use more than a hundred trees. This is a rule of thumb
00:29:22.200 | That the only one John
00:29:31.000 | So then there's another interesting feature of random forests
00:29:35.920 | Which is remember how in our example we trained with?
00:29:40.000 | 75% of the data
00:29:42.760 | on each tree
00:29:44.560 | So that means for each tree there was 25% of the data we didn't train on
00:29:47.840 | Now this actually means if you don't have much data in some situations you can get away with not having a validation set and
00:29:56.120 | the reason why is
00:29:59.040 | because for each tree we can pick the 25% of
00:30:03.960 | rows that weren't in that tree and
00:30:06.360 | See how accurate that tree was on those rows and we can average for each row
00:30:13.520 | their accuracy on all of the trees in which they were not part of the training and
00:30:18.920 | That is called the out-of-bag error
00:30:22.120 | or OOB error, and this is built in also to scikit-learn: you can ask for an OOB
00:30:29.200 | prediction
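A sketch of asking scikit-learn for the out-of-bag score (same assumed `trn_xs`/`trn_y` as before):

```python
from sklearn.ensemble import RandomForestClassifier

# Each row is scored only by the trees that never saw it during training,
# giving a validation-like estimate without holding out any data.
rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(trn_xs, trn_y)
print(rf.oob_score_)           # accuracy estimated from out-of-bag rows
# rf.oob_decision_function_    # per-row out-of-bag predicted probabilities
```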
00:30:35.800 | Just before we move on
00:30:40.080 | Zakiya has a question about bagging
00:30:43.640 | So we know that bagging is powerful as an ensemble approach to machine learning
00:30:48.280 | Would it be advisable to try out bagging then first when approaching a particular?
00:30:53.440 | say tabular task
00:30:56.080 | Before deep learning, so that's the first part of the question
00:31:01.280 | And the second part is could we create a bagging model which includes fast AI deep learning models?
00:31:11.160 | Absolutely. So to be clear, you know bagging is kind of like a
00:31:14.360 | meta method it's not a
00:31:17.000 | prediction it's not a method of modeling itself. It's just a method of
00:31:21.400 | Combining other models
00:31:24.680 | So random forests in particular are a particular approach to bagging,
00:31:31.040 | and, you know, I would personally probably always start a tabular
00:31:35.800 | Project with a random forest because they're nearly impossible to mess up and they give good insight and they give a good base case
00:31:42.400 | But yeah your question then about can you bag?
00:31:47.440 | other models is a very interesting one and the answer is you absolutely can and
00:31:53.320 | People very rarely do
00:31:57.280 | But we will
00:31:59.840 | We will quite soon
00:32:01.840 | Maybe even today
00:32:04.040 | So I you know you might be getting the impression I'm a bit of a fan of random forests and
00:32:12.960 | Before I was before you know, people thought of me as the deep learning guy people thought of me as the random forests guy
00:32:20.480 | I used to go on about random forests all the time and one of the reasons I'm so enthused about them isn't just that they're
00:32:27.400 | very accurate or that they're very hard to mess up and require very little pre-processing,
00:32:32.280 | But they give you a lot of quick and easy insight
00:32:36.720 | And specifically these are the five things
00:32:40.480 | Which I think that we're interested in and all of which are things that random forests good at they will tell us how confident
00:32:47.400 | Are we in our predictions on some particular row? So when somebody you know, when we're giving a loan to somebody
00:32:54.440 | We don't necessarily just want to know
00:32:57.240 | How likely are they to repay?
00:32:59.240 | But I'd also like to know how confident are we that we know because if we're if we like well
00:33:06.360 | We think they'll repay but we're not confident of that. We would probably want to give them less of a loan and
00:33:12.960 | Another thing that's very important is when we're then making a prediction. So again, for example for for credit
00:33:21.320 | Let's say you rejected that person's loan
00:33:27.080 | And a random forest will tell us
00:33:29.080 | what is the reason that we made that prediction? And you'll see why all these things matter.
00:33:34.480 | Which columns are the strongest predictors? You've already seen that one, right? That's the feature importance plot
00:33:39.960 | Which columns are effectively redundant with each other ie they're basically highly correlated with each other
00:33:49.240 | And then one of the most important ones is, as you vary a column, how does it vary the predictions? So for example in your
00:33:56.760 | credit model, how does your prediction of
00:34:00.760 | Risk vary as you vary
00:34:06.960 | Well something that probably the regulator would want to know might be some, you know, some protected
00:34:12.120 | variable like, you know
00:34:14.720 | Race or some socio demographic characteristics that you're not allowed to use in your model. So they might check things like that
00:34:20.000 | For the first thing how confident are we in our predictions using a particular row of data?
00:34:27.960 | There's a really simple thing we can do which is remember how when we
00:34:32.960 | Calculated our predictions manually we stacked up the predictions together and took their mean
00:34:37.720 | Well, what if you took their standard deviation instead?
00:34:42.680 | so if you stack up your predictions and take their standard deviation and
00:34:46.720 | If that standard deviation is high
00:34:49.720 | That means all of the trees are predicting something different, and that suggests that we don't really know what we're doing
00:34:57.040 | And so that would happen if different subsets of the data end up giving completely different trees
00:35:02.240 | for this
00:35:04.760 | particular row
00:35:07.160 | So there's like a really simple thing you can do to get a sense of your prediction confidence
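As a sketch, using a fitted scikit-learn forest `rf` and validation features `val_xs` (both assumed from earlier):

```python
import numpy as np

# Stack each individual tree's predictions and look at the spread across trees:
# a high standard deviation means the trees disagree, i.e. low confidence.
all_preds = np.stack([t.predict(val_xs) for t in rf.estimators_])
pred_std = all_preds.std(0)
```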
00:35:13.080 | Okay feature importance. We've already discussed
00:35:16.160 | After I do feature importance, you know, like I said, when I had the, what, 7,000 or so columns, I got rid of all but 30
00:35:26.760 | That doesn't tend to improve the predictions of your random forest very much
00:35:31.680 | If at all, but it certainly helps like
00:35:36.680 | You know kind of logistically thinking about cleaning up the data
00:35:40.160 | You can focus on cleaning those 30 columns stuff like that. So I tend to remove the low importance variables
00:35:45.080 | I'm going to skip over this bit about removing redundant features because it's a little bit outside what we're talking about
00:35:53.880 | But definitely check it out in the book
00:35:55.880 | something called a dendrogram
00:35:57.880 | But what I do want to mention is is the partial dependence this is the thing which says
00:36:04.720 | What is the relationship?
00:36:06.720 | between a
00:36:09.480 | Column and the dependent variable and so this is something called a partial dependence plot now
00:36:15.640 | This one's actually not specific to random forests
00:36:17.800 | A partial dependence plot is something you can do for basically any machine learning model
00:36:22.760 | Let's first of all look at one and then talk about how we make it
00:36:27.040 | So in this data set we're looking at the relationship. We're looking at
00:36:32.560 | the sale price at auction of heavy industrial equipment like bulldozers, this is specifically the
00:36:38.920 | Blue Book for Bulldozers Kaggle competition, and
00:36:42.240 | a partial dependence plot between the year that the bulldozer or whatever was made and
00:36:48.880 | the price it was sold for. This is actually the log price.
00:36:52.640 | It goes up: more recently made bulldozers are more expensive,
00:37:00.920 | and as you go back to older and older bulldozers,
00:37:04.000 | They're less and less expensive to a point and maybe these ones are some old
00:37:09.400 | classic bulldozers you pay a bit extra for
00:37:14.760 | You might think that you could easily create this plot by simply looking at your data at each year and taking the average sale price
00:37:22.760 | But that doesn't really work very well
00:37:25.680 | I mean it kind of does but it kind of doesn't let me give an example
00:37:29.600 | It turns out that one of the biggest predictors of sale price for industrial equipment is whether it has air conditioning
00:37:37.080 | and so air conditioning is you know, it's an expensive thing to add and it makes the equipment more expensive to buy and
00:37:45.200 | Most things didn't have air conditioning back in the 60s and 70s and most of them do now
00:37:50.360 | So if you plot the relationship between year made and price
00:37:55.480 | You're actually going to be seeing a whole bunch of
00:37:57.880 | When you know how popular was air conditioning?
00:38:01.640 | Right, so you get this cross-correlation going on, when we just want to know,
00:38:06.480 | what's just the impact of the year it was made, all else being equal?
00:38:10.960 | So there's actually a really easy way to do that which is we take our data set
00:38:17.320 | We leave it exactly as it is; we just use the training data set,
00:38:22.200 | but we take every single row and for the year made column we set it to 1950 and
00:38:27.680 | so then we predict for every row what would the sale price of that have been if it was made in 1950 and
00:38:35.120 | then we repeat it for 1951, and we repeat it for 1952, and so forth, and then we plot the averages and
00:38:42.160 | That does exactly what I just said. Remember I said the special words all else being equal
00:38:47.720 | This is setting everything else equal: everything else is the data as it actually occurred, and we're only varying year made
00:38:55.240 | And that's what a partial dependence plot is
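Written out by hand, the trick just described looks something like this (`m`, `xs`, and the `YearMade` column name are assumptions matching the bulldozers example):

```python
import numpy as np

years = np.arange(1950, 2012)
avg_preds = []
for yr in years:
    xs_yr = xs.copy()
    xs_yr['YearMade'] = yr                     # force every row to this year
    avg_preds.append(m.predict(xs_yr).mean())  # average prediction, all else as it actually was
# Plotting years against avg_preds gives the partial dependence of price on YearMade.
```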
00:38:58.320 | That works just as well for deep learning or gradient boosting trees or logistic regressions or whatever. It's a really
00:39:07.480 | Cool thing you can do
00:39:10.280 | And you can do more than one column at a time, you know, you can do two-way
00:39:17.600 | partial dependence plots
00:39:19.320 | for example
00:39:20.840 | Another one. Okay, so then another one I mentioned was
00:39:24.400 | Can you describe why a particular?
00:39:27.960 | Prediction was made. So how did you decide for this particular row?
00:39:33.560 | to predict this particular value and
00:39:36.960 | This is actually pretty easy to do there's a thing called tree interpreter
00:39:41.840 | But we could you could easily create this in about half a dozen lines of code all we do
00:39:49.320 | We're saying okay
00:39:51.920 | This customer's come in they've asked for a loan
00:39:54.480 | We've put in all of their data through the random forest, and it spat out a prediction.
00:39:59.200 | We can actually have a look and say, okay, well, in tree number one,
00:40:03.680 | What's the path that went down through the tree to get to the leaf node?
00:40:07.520 | And we can say oh, well first of all it looked at sex and then it looked at postcode and then it looked at income
00:40:13.600 | and so we can see
00:40:16.960 | exactly in tree number one which variables were used and what was the
00:40:21.720 | change in Gini for each one and
00:40:24.480 | Then we can do the same in tree two, tree three, tree four... does this sound familiar?
00:40:29.160 | It's basically the same as our feature importance plot, right?
00:40:32.680 | But it's just for this one row of data and so that will tell you basically the feature
00:40:37.320 | Importances for that one particular prediction and so then we can plot them
00:40:43.280 | Like this. So for example, this is an example of an
00:40:46.720 | auction price prediction and
00:40:49.680 | According to this plot, you know, it predicted that the net would be...
00:40:55.280 | this is just the change, so I don't actually know what the price is,
00:41:03.760 | but this is how much each one impacted the price. So
00:41:06.640 | year made, I guess this must have been an old tractor, it caused the prediction of the price to go down
00:41:13.280 | But then it must have been a larger machine; the product size caused it to go up
00:41:17.440 | Coupler system made it go up, model ID made it go up, and
00:41:21.280 | so forth, right? So you can see the red says this made our prediction go down, green made our prediction go up, and
00:41:28.280 | so overall you can see
00:41:31.080 | Which things had the biggest impact on the prediction and what was the direction?
00:41:35.200 | for each one
00:41:37.560 | So it's basically a feature importance plot
00:41:40.160 | But just for a single row for a single row
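The `treeinterpreter` package (used in chapter 9 of the book) produces exactly this kind of per-row breakdown; a rough sketch, assuming a fitted random forest regressor `rf` and validation features `val_xs`:

```python
from treeinterpreter import treeinterpreter as ti

row = val_xs.iloc[:1]
prediction, bias, contributions = ti.predict(rf, row.values)
# `bias` is the average prediction over the training set; each entry of
# contributions[0] is how much one column pushed this particular prediction up or down.
for col, c in sorted(zip(val_xs.columns, contributions[0]), key=lambda t: t[1]):
    print(f'{col:20} {c:+.3f}')
```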
00:41:42.400 | Any questions John
00:41:47.440 | Yeah, there are a couple that have that are sort of queued up this is a good spot to jump to them
00:41:59.360 | first of all Andrew's asking jumping back to the
00:42:02.680 | the OOB error: would you ever exclude a tree from a forest if it had a bad out-of-bag
00:42:11.360 | error? Like, I guess if you had a particularly bad
00:42:13.960 | Tree in your ensemble. Yeah, like might you just
00:42:17.720 | Would you delete a tree that was not doing its thing? It's not playing its part. No you wouldn't
00:42:24.180 | If you start deleting trees then you are no longer
00:42:29.960 | having an unbiased prediction of the dependent variable
00:42:34.440 | You are biasing it by making a choice. So even the bad ones
00:42:40.160 | will be
00:42:42.120 | Improving the quality of the overall
00:42:44.280 | average
00:42:46.320 | All right. Thank you. Um
00:42:48.320 | Zakiya followed up with the question about
00:42:50.400 | Bagging and we're just going you know layers and layers here
00:42:55.440 | You know, we could go on and create ensembles of bagged models
00:42:59.520 | And you know, is it reasonable to assume that they would continue that's not gonna make much difference, right?
00:43:05.480 | If they're all... like, you could take a hundred trees, split them into groups of ten, create ten bagged ensembles,
00:43:12.640 | And then average those but the average of an average is the same as the average
00:43:16.180 | You could like have a wider range of other kinds of models
00:43:20.760 | You could have like neural nets trained on different subsets as well
00:43:23.800 | But again, it's just the average of an average will still give you the average
00:43:26.640 | Right. So there's not a lot of value in kind of structuring the ensemble
00:43:31.840 | You just I mean some some ensembles you can structure but but not bagging bagging's the simplest one
00:43:37.920 | It's the one I mainly use
00:43:39.920 | There are more sophisticated approaches, but this one
00:43:42.480 | Is nice and easy
00:43:45.080 | All right, and there's there's one that
00:43:47.080 | Is a bit specific and it's referencing content you haven't covered but we're here now. So
00:43:52.040 | And it's on explainability
00:43:54.960 | so feature importance of
00:43:57.840 | Random forest model sometimes has different results when you compare to other explainability techniques
00:44:07.120 | like SHAP or LIME
00:44:07.120 | And we haven't covered these in the course, but Amir is just curious if you've got any thoughts on which is more accurate or reliable
00:44:14.240 | Random forest feature importance or other techniques? I
00:44:18.080 | Would lean towards
00:44:26.400 | More immediately trusting random forest feature importances over other techniques on the whole
00:44:32.560 | On the basis that it's very hard to mess up a random forest
00:44:42.680 | Yeah, I feel like pretty confident that a random forest feature importance is going to
00:44:47.720 | Be pretty reasonable
00:44:50.680 | As long as this is the kind of data which a random forest is likely to be pretty good at you know
00:44:56.400 | Doing you know, if it's like a computer vision model random forests aren't
00:45:00.280 | Particularly good at that
00:45:01.920 | And so one of the things that Breiman talked about a lot was explainability, and he's got a great essay called the Two Cultures
00:45:08.120 | of statistics, in which he talks about, I guess, what are nowadays called data scientists and machine learning folks versus classic statisticians, and
00:45:16.120 | he was, you know, definitely a data scientist well before the
00:45:22.560 | The label existed and he pointed out. Yeah, you know first and foremost
00:45:26.720 | You need a model that's accurate, that is, it makes good predictions. A model that makes bad predictions
00:45:33.800 | Will also be bad for making explanations because it doesn't actually know what's going on
00:45:38.200 | So if you know if you if you've got a deep learning model that's far more accurate than your random forest then it's you know
00:45:45.640 | Explainability methods from the deep learning model will probably be more useful because it's explaining a model
00:45:51.760 | that's actually correct
00:45:53.760 | Alright, let's take a 10-minute break and we'll come back at 5 past 7
00:46:03.840 | Welcome back one person pointed out I noticed I got the chapter wrong. It's chapter 9 not chapter 8 in the book
00:46:16.440 | I guess I can't read
00:46:20.960 | Somebody asked during the break about overfitting
00:46:24.680 | Can you overfit a random forest?
00:46:28.840 | Basically, no, not really adding more trees will make it more accurate
00:46:35.840 | It kind of asymptotes so you can't make it infinitely accurate by using infinite trees, but certainly, you know adding more trees won't make it worse
00:46:46.760 | If you don't have enough trees
00:46:51.520 | and you
00:46:53.520 | Let the trees grow very deep that could overfit
00:46:57.720 | So you just have to make sure you have enough trees
00:47:00.800 | Radek told me during the break about an experiment he did,
00:47:15.480 | which is something similar to what I've done: adding lots and lots of randomly generated columns
00:47:21.800 | to a data set and
00:47:24.160 | Try to break the random forest and
00:47:26.160 | If you try it, it basically doesn't work. It's like it's really hard
00:47:30.680 | to confuse a random forest by giving it lots of
00:47:34.440 | meaningless data it does an amazingly good job of picking out
00:47:38.720 | The the useful stuff as I said, you know, I had
00:47:43.000 | 30 useful columns out of 7,000 and it found them
00:47:45.800 | perfectly well
00:47:48.720 | And often, you know when you find those 30 columns
00:47:52.340 | You know, you could go to you know
00:47:54.400 | I was doing consulting at the time go back to the client and say like tell me more about these columns
00:47:58.680 | That's and they'd say like oh well that one there. We've actually got a better version of that now
00:48:02.000 | There's a new system, you know, we should grab that and oh this column actually that was because of this thing that happened last year
00:48:07.960 | But we don't do it anymore or you know, like you can really have this kind of discussion about the stuff you've zoomed into
00:48:13.120 | You know
00:48:26.440 | There are other things that you have to think about with lots of kinds of models like particularly regression models things like interactions
00:48:32.520 | You don't have to worry about that with random forests like because you split on one column and then split on another column
00:48:38.800 | You get interactions for free
00:48:41.200 | as well
00:48:43.760 | Normalization you don't have to worry about you know, you don't have to have normally distributed columns
00:48:49.960 | So, yeah, definitely worth a try now something I haven't gone into
00:48:57.800 | Is gradient boosting
00:49:05.400 | But if you go to explain.ai
00:49:10.720 | you'll see that my friend Terence and I have a three-part series about gradient boosting,
00:49:16.840 | including pictures of golf made by Terence
00:49:20.240 | But to explain gradient boosting is a lot like random forests
00:49:27.160 | but rather than
00:49:29.160 | training a
00:49:31.560 | model, or rather fitting a tree, again and again and again, on different random subsets of the data,
00:49:37.280 | instead what we do is we fit very, very small trees with hardly any splits, and
00:49:44.840 | We then say okay. What's the error? So, you know
00:49:49.120 | so imagine the simplest tree would be a one-hour rule tree of
00:49:55.320 | Male versus female say and then use you take what's called the residual
00:50:00.600 | That's the difference between the prediction and the actual the error and then you create another tree which attempts to predict that
00:50:08.120 | very small tree, and then you create another very small tree which tries to predict the error from that, and
00:50:17.000 | So forth each one is predicting the residual from all of the previous ones. And so then to calculate a prediction
00:50:25.120 | Rather than taking the average of all the trees
00:50:27.640 | you take the sum of all the trees, because each one has predicted the difference between the actual and
00:50:33.640 | All of the previous trees and that's called boosting
00:50:37.920 | versus bagging so boosting and bagging are two kind of meta-ensembling techniques and
00:50:44.160 | When bagging is applied to trees, it's called a random forest and when boosting is applied to trees
00:50:50.880 | It's called a gradient boosting machine or gradient boosted decision tree
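A hand-rolled sketch of that boosting loop, with tiny regression stumps each fit to the residuals left by everything before them (`trn_xs`, `trn_y` assumed to be numeric training data; the number of trees and depth are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

preds = np.zeros(len(trn_y))
trees = []
for _ in range(50):
    resid = trn_y - preds                   # what's left to explain so far
    t = DecisionTreeRegressor(max_depth=1)  # a tiny "stump" with a single split
    t.fit(trn_xs, resid)
    trees.append(t)
    preds += t.predict(trn_xs)              # the ensemble's prediction is the *sum* of the trees
```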
00:50:55.800 | Gradient boosting is generally speaking more accurate than random forests
00:51:04.840 | But you can absolutely over fit
00:51:08.280 | and so therefore
00:51:11.040 | It's not necessarily my first go-to thing having said that there are ways to avoid over fitting
00:51:16.460 | But yeah, it's just it's it's not
00:51:19.280 | It's it you know because it's breakable it's not my first choice
00:51:26.040 | But yeah, check out our stuff here if you're interested and you know, you there is stuff which largely automates the process
00:51:34.920 | There's lots of hyperparameters you have to select; people generally just, you know, try every combination of hyperparameters
00:51:41.040 | And in the end you're generally should be able to get a more accurate gradient boosting model than random forest
00:51:48.760 | But not necessarily by much
00:51:50.760 | Okay, so that was the
00:51:58.560 | Kaggle notebook on random forests how random forests really work
00:52:16.720 | What we've been doing is having this daily
00:52:20.240 | Walk through where me and I don't know how many 20 or 30 folks get together on a zoom call and chat about
00:52:29.040 | you know getting through the course and
00:52:32.400 | setting up machines and stuff like that and
00:52:36.480 | You know, we've been trying to kind of practice what you know things along the way
00:52:44.120 | and so a couple of weeks ago, I
00:52:46.960 | wanted to show like
00:52:49.640 | What does it look like to pick a Kaggle competition and just like?
00:52:53.200 | Do the normal sensible
00:52:57.080 | Kind of mechanical steps that you would do for any computer vision model
00:53:02.720 | And so the
00:53:06.880 | Competition I picked was paddy disease classification
00:53:13.080 | which is about
00:53:15.080 | recognizing rice diseases in rice paddies
00:53:18.600 | And yeah, I spent I don't know a couple of hours or three. I can't remember a few hours
00:53:23.720 | Throwing together something and
00:53:29.920 | Found that I was number one on the leaderboard and I thought oh, that's that's interesting like
00:53:35.840 | because you never quite have a sense of
00:53:38.360 | How well these things work?
00:53:41.880 | And then I thought well, there's all these other things. We should be doing as well and I tried
00:53:45.800 | three more things and each time I tried another thing I got further ahead at the top of the leaderboard so
00:53:53.760 | I thought it'd be cool to take you through
00:53:57.500 | the process I'm gonna do it reasonably quickly because
00:54:04.040 | The walkthroughs are all available
00:54:08.960 | For you to see the entire thing in you know, seven hours of detail or however long we probably were six to seven hours of conversations
00:54:16.560 | But I want to kind of take you through the basic process that I went through
00:54:22.780 | So since I've been starting to do more stuff on Kaggle, you know, I realized there's some
00:54:35.600 | Kind of menial steps. I have to do each time particularly because I like to run stuff on my own machine
00:54:41.420 | And then kind of upload it to Kaggle
00:54:44.200 | So to make my life easier I created a little module called fastkaggle,
00:54:51.120 | which you'll see in my notebooks now, and which you can download from pip or conda
00:54:56.900 | And as you'll see it makes some things a bit easier for example
00:55:02.920 | downloading the data for the paddy disease classification: if you just run setup_comp and
00:55:08.400 | Pass in the name of the competition if you are on Kaggle it will return a path to
00:55:18.440 | Competition data that's already on Kaggle if you are not on Kaggle and you haven't downloaded it
00:55:23.560 | It will download and unzip the data for you
00:55:25.480 | If you're not on Kaggle and you have already downloaded and unzipped the data, it will return a path to the one that you've already downloaded
00:55:31.520 | also, if you are on Kaggle you can ask it to make sure that
00:55:34.620 | Pip things are installed that might not be up to date. Otherwise
00:55:39.040 | So this basically one line of code now gets us all set up and ready to go
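That one line looks roughly like this (the `install` string is an assumption based on the walkthrough notebooks):

```python
from fastkaggle import setup_comp

comp = 'paddy-disease-classification'
# On Kaggle: returns the mounted competition path (and pip-installs `install`).
# Locally: downloads and unzips the data if needed, then returns its path.
path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')
```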
00:55:43.600 | so this path
00:55:46.680 | So I ran this particular one on my own machine so it's downloaded and unzipped the data
00:55:52.760 | I've also got links to the
00:55:55.680 | Six walkthroughs so far. These are the videos
00:56:00.600 | Oh, yes, and here's my result after these
00:56:06.240 | For attempts that's a few fiddling around at the start
00:56:10.880 | So the overall approach at is well and this is not just to a Kaggle competition right at the reason
00:56:21.260 | I like looking at Kaggle competitions is
00:56:23.260 | You can't hide from the truth
00:56:27.840 | In a Kaggle competition, you know when you're working on some work project or something
00:56:32.300 | You might be able to convince yourself and everybody around you that you've done a fantastic job of
00:56:38.100 | not overfitting and your models better than what anybody else could have made and whatever else but
00:56:44.240 | The brutal assessment of the private leaderboard
00:56:48.560 | Will tell you the truth
00:56:51.560 | Is your model actually predicting things correctly and is it overfit?
00:57:00.560 | Until you've been through that process
00:57:04.120 | You know, you're never gonna know and a lot of people don't go through that process because at some level they don't want to know
00:57:09.640 | But it's okay, you know, nobody needed it you don't have to put your own name there
00:57:19.040 | Always did right from the very first one. I wanted, you know, if I was gonna screw up royally
00:57:23.760 | I wanted to have the pressure on myself of people seeing me in last place
00:57:27.240 | but you know, it's it's fine you could do it all and honestly and
00:57:31.120 | You'll actually find
00:57:34.320 | As you improve you also have so much self-confidence, you know
00:57:41.880 | The stuff we do in a Kaggle competition is indeed a subset of the things we need to do in real life
00:57:49.120 | It's an important subset, you know building a model that actually predicts things correctly and doesn't overfit is important and furthermore
00:57:57.080 | structuring your code and analysis in such a way that you can keep improving over a three-month period without gradually getting into more and
00:58:04.900 | more of a tangled mess of impossible to understand code and
00:58:07.840 | Having no idea what untitled copy 13 was and why it was better than
00:58:16.080 | right, this is all
00:58:18.360 | stuff you want to be practicing
00:58:20.360 | ideally
00:58:22.480 | Well away from customers or whatever, you know before you've kind of figured things out
00:58:27.280 | So the things I talk about here about doing things well in this Kaggle competition
00:58:34.040 | Should work, you know in other settings as well
00:58:39.060 | And so these are the two focuses that I recommend
00:58:45.200 | Get a really good validation set together. We've talked about that before right and in a Kaggle competition
00:58:50.460 | That's like it's very rare to see people do well in a Kaggle competition who don't have a good validation set
00:58:55.900 | sometimes that's easy and this competition actually it is easy because the
00:59:03.800 | test set seems to be a random sample
00:59:05.800 | But most of the time it's not actually I would say
00:59:08.480 | And then how quickly can you iterate?
00:59:12.720 | How quickly can you try things and find out what worked? So obviously you need a good validation set. Otherwise, it's impossible to iterate and
00:59:19.880 | So iterating quickly means not asking, what is the biggest,
00:59:25.640 | you know, OpenAI-scale, takes-four-months-on-a-hundred-TPUs model that I can train?
00:59:34.100 | It's, what can I do that's going to train in a minute or so and
00:59:39.200 | Will quickly give me a sense of like well, I could try this I could try that what things gonna work and then try
00:59:45.160 | You know 80 things
00:59:48.200 | It also doesn't mean saying, like, oh, I heard about this amazing
00:59:52.540 | Bayesian hyperparameter tuning approach; I'm gonna spend three months implementing that, because that's only gonna give you one thing.
01:00:01.460 | But to actually do well
01:00:04.640 | in these competitions, or in machine learning in general, you actually have to do everything
01:00:09.840 | reasonably well,
01:00:12.560 | and doing just one thing really well will still put you somewhere around last place.
01:00:17.160 | So I actually saw that a couple of years ago: an Aussie guy who's a
01:00:21.200 | very very distinguished machine learning
01:00:24.720 | practitioner
01:00:27.720 | actually put together a team, entered a Kaggle competition, and literally came in last place,
01:00:34.160 | Because they spent the entire three months trying to build this amazing new
01:00:38.560 | fancy
01:00:41.200 | thing and
01:00:43.040 | never actually iterated. If you iterate, I guarantee you won't be in last place.
01:00:48.820 | Okay, so here's how we can grab our data with fastkaggle, and it tells us what path it's in.
01:01:03.120 | And then I set my random seed
01:01:05.120 | And I only do this because I'm creating a notebook to share, you know when I share a notebook
01:01:11.960 | I like to be able to say as you can see, this is point eight three blah blah blah, right and
01:01:15.760 | Know that when you see it, it'll be point eight three as well
01:01:18.720 | But when I'm doing stuff, otherwise, I would never set a random seed
01:01:22.400 | I want to be able to run things multiple times and see how much it changes each time
01:01:26.720 | because that'll give me a sense of like
01:01:29.960 | are the modifications I'm making changing it because they're improving it or making it worse, or is it just random variation?
01:01:35.280 | So if you always set a random seed,
01:01:37.960 | that's a bad idea, because you won't be able to see the random variation. So this is just here for presenting a notebook.
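As a rough sketch, this is what that setup looks like in code; the competition slug is an assumption (the rice competition discussed here), and set_seed is only used because this is a shared notebook:

    from fastkaggle import setup_comp
    from fastai.vision.all import *

    comp = 'paddy-disease-classification'   # hypothetical slug for this rice-disease competition
    path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')  # downloads and unzips the data if needed
    set_seed(42)  # only so a shared notebook reproduces its numbers; skip this while experimenting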
01:01:44.520 | Okay, so the data they've given us as usual they've got a sample submission they've got some test set images
01:01:53.280 | They've got some training set images a CSV file about the training set
01:02:00.000 | And then these other two you can ignore because I created them
01:02:02.840 | So let's grab a path
01:02:05.880 | to the train images. And so, do you remember get_image_files?
01:02:09.440 | That gets us a list of the file names of all the images here, recursively.
01:02:16.760 | So we could just grab the first one and
01:02:19.960 | Take a look. So it's 480
01:02:22.720 | by 640
01:02:25.040 | Now we've got to be careful
01:02:26.880 | This is a Pillow image, a Python Imaging Library image.
01:02:30.120 | In the imaging world, they generally say columns by rows; in
01:02:35.560 | the array-slash-tensor world, we always say rows by columns.
01:02:40.800 | So if you ask PyTorch what the size of this is, it'll say 640 by 480, and I guarantee at some point
01:02:47.440 | This is going to bite you. So try to recognize it now
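A small sketch of that check, assuming the helpers shown in the lesson; note the (width, height) versus (rows, columns) mismatch:

    trn_path = path/'train_images'
    files = get_image_files(trn_path)
    im = PILImage.create(files[0])
    print(im.size)             # PIL reports (width, height), e.g. (480, 640)
    print(np.array(im).shape)  # arrays/tensors report (rows, cols, channels), e.g. (640, 480, 3)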
01:02:50.540 | Okay, so they're kind of taller than they are wide; at least this one is.
01:02:58.320 | I'd actually like to know are they all this size because it's really helpful if they all are all the same size or at least similar
01:03:03.880 | Believe it or not the amount of time it takes to decode a JPEG is actually quite significant
01:03:12.640 | And so figuring out what size these things are is actually going to be pretty slow
01:03:18.200 | But my fast core library has a parallel sub module which can basically do anything
01:03:25.140 | That you can do in Python. It can do it in parallel. So in this case, we wanted to create a pillow image and get its size
01:03:31.060 | So if we create a function that does that and pass it to parallel passing in the function and the list of files
01:03:37.720 | It does it in parallel and that actually runs pretty fast
01:03:41.020 | And so here is the answer
01:03:44.060 | As it happens, ten thousand four hundred and three images are indeed 480 by 640, and four of them aren't.
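Something like this, assuming fastcore's parallel and pandas for counting (the worker count is an arbitrary choice):

    from fastcore.parallel import parallel
    import pandas as pd

    def f(o): return PILImage.create(o).size   # decode one JPEG and return its (width, height)
    sizes = parallel(f, files, n_workers=8)    # run across processes; much faster than a plain loop
    print(pd.Series(sizes).value_counts())     # almost all images share one size, a handful don't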
01:03:52.540 | So basically what this says to me is that we should pre-process them or you know
01:03:56.580 | at some point process them so that they're probably all 480 by 640, or all basically the same kind of size.
01:04:02.100 | We'll pretend they're all this size
01:04:04.100 | but we can't skip doing some initial resizing; otherwise, this is going to screw things up.
01:04:17.540 | So like that probably the easiest way to do things the most common way to do things is to
01:04:22.460 | Either squish or crop every image to be a square
01:04:26.860 | So squishing is when you just in this case squish the aspect ratio down
01:04:33.260 | As opposed to cropping randomly a section out, so if we call resize squish it will squish it down
01:04:42.900 | And so this is 480 by 480, a square. So this is what it's going to do to all of the images first, on the CPU.
01:04:50.500 | That allows them to be all batched together into a single mini batch
01:04:56.780 | Everything in a mini batch has to be the same shape
01:04:59.020 | otherwise the GPU won't like it and
01:05:02.220 | then that mini batch is put through data augmentation and
01:05:06.820 | It will
01:05:09.620 | Grab a random subset of the image and make it at 128 by 128 pixel
01:05:15.980 | And here's what that looks like. Here's our data
01:05:19.780 | So show_batch works for pretty much everything, not just in the fastai library
01:05:26.620 | But even for things like fast audio, which are kind of community based things
01:05:30.980 | You should be able to use show_batch on anything and see, or hear, or whatever, what your data looks like.
01:05:39.340 | I don't know anything about rice disease
01:05:41.780 | But apparently these are various rice diseases and this is what they look like
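A sketch of the DataLoaders being described, assuming labels come from the training folder structure; the sizes match the ones mentioned in the talk:

    dls = ImageDataLoaders.from_folder(
        trn_path, valid_pct=0.2, seed=42,
        item_tfms=Resize(480, method='squish'),               # CPU: squish every image to a 480px square
        batch_tfms=aug_transforms(size=128, min_scale=0.75))  # GPU: random 128px crops plus augmentation
    dls.show_batch(max_n=6)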
01:05:46.060 | So, um, I I jump into creating models much more quickly than most people
01:05:58.260 | Because I find model, you know models are a great way to understand my data as we've seen before
01:06:03.540 | So I basically build a model as soon as I can
01:06:09.220 | I want to
01:06:10.900 | Create a model that's going to let me iterate quickly. So that means that I'm going to need a model that can train quickly
01:06:20.540 | Thomas Capelle and I recently
01:06:23.580 | did this big project, "The best vision models for fine-tuning",
01:06:29.340 | Where we looked at nearly a hundred different
01:06:35.420 | architectures
01:06:37.060 | from Ross Wightman's timm library,
01:06:40.140 | the PyTorch Image Models library, and
01:06:43.300 | looked at
01:06:46.740 | Which ones could we fine-tune which ones had the best transfer learning results
01:06:52.320 | And we tried two different data sets very different data sets
01:06:55.900 | One is the pets data set that we've seen before
01:06:58.940 | So trying to predict what breed of pet is from 37 different breeds
01:07:05.780 | and the other was a
01:07:07.780 | Satellite imagery data set called planet. They're very very different data sets in terms of what they contain and also very different sizes
01:07:15.500 | The planet ones a lot smaller the pets ones a lot bigger
01:07:19.180 | And so the main things we measured were how much memory did it use?
01:07:23.980 | How accurate was it and how long did it take to fit?
01:07:27.580 | And then I created this score, which combines the fit time and error rate together.
01:07:35.460 | So this is a really useful table
01:07:37.780 | For picking a model and now in this case. I want to pick something
01:07:45.260 | that's really fast and
01:07:48.380 | there's one clear winner on speed, which is resnet26d, and
01:07:53.380 | its error rate was 6%, versus the best, which was like 4.1%.
01:07:59.540 | So okay, it's not amazingly accurate, but it's still pretty good, and it's gonna be really fast
01:08:04.260 | So that's why I picked
01:08:07.660 | resnet26d. A lot of people think that
01:08:11.460 | when they do deep learning they're going to spend all of their time learning about exactly how a resnet 26 D is made and
01:08:19.500 | convolutions and resnet blocks and transformers and blah blah blah we will cover all that stuff
01:08:25.640 | In part two and a little bit of it next week
01:08:29.500 | But it almost never matters
01:08:31.740 | Right, it's just it's just a function right and what matters is the inputs to it and the outputs to it
01:08:38.420 | And how fast it is how accurate it is
01:08:41.500 | So let's create a learner with a resnet26d from our data loaders.
01:08:51.260 | Let's run LR find so LR find
01:08:54.020 | Will put through one mini batch at a time
01:08:58.380 | starting at a very very very low learning rate and gradually increase the learning rate and track the loss and
01:09:03.860 | Initially the loss won't improve, because the learning rate is so small
01:09:10.060 | It doesn't really do anything and at some point the learning rates high enough that the loss will start coming down
01:09:15.020 | Then at some other point the learning rate is so high that it's gonna start jumping past the answer, and it gets a bit worse.
01:09:22.700 | And so somewhere around here is a learning rate. We'd want to pick
01:09:29.980 | We've got a couple of different ways of making suggestions I
01:09:34.180 | Generally ignore them because these suggestions are specifically designed to be conservative
01:09:41.620 | They're a bit lower than perhaps an optimal in order to make sure we don't recommend something that totally screws up
01:09:47.220 | But I kind of like to say like well, how far right can I go and still see it like clearly really improving quickly?
01:09:53.740 | And so I pick somewhere around
01:09:57.180 | 0.01 for this
01:09:59.180 | So I can now
01:10:01.740 | Fine-tune our model with a learning rate of 0.01
01:10:04.140 | Three epochs and look the whole thing took a minute. That's what we want, right? We want to be able to iterate
01:10:10.220 | Rapidly just a minute or so. So that's enough time for me to go and you know, grab a glass of water or
01:10:16.340 | do some reading; like, I'm not gonna get too distracted.
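A minimal sketch of that loop, assuming a timm-style architecture name and the learning rate read off the plot:

    learn = vision_learner(dls, 'resnet26d', metrics=error_rate, path='.').to_fp16()
    learn.lr_find()            # plot loss vs learning rate; pick a value where the loss is still dropping fast
    learn.fine_tune(3, 0.01)   # three epochs at lr=0.01: roughly a minute of training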
01:10:22.820 | What do we do before we submit?
01:10:25.500 | Nothing, we submit as soon as we can. Okay, let's get our submission in so we've got a model. Let's get it in
01:10:32.020 | So we read in our CSV file of the sample submission and
01:10:37.580 | So the CSV file basically looks like we're gonna have to have a list of the image
01:10:42.180 | file names in order and then a column of labels
01:10:46.980 | So we can get all the image files in the test images folder,
01:10:54.580 | like so, and we can sort them.
01:10:56.580 | So now what we want is a data loader
01:11:01.300 | Which is exactly like the data loader we use to train the model
01:11:07.380 | Except pointing at the test set we want to use exactly the same transformations
01:11:11.700 | So there's actually a DL dot test DL method which does that you just pass in
01:11:18.000 | The new set of items so the test set files
01:11:22.340 | So this is a data loader which we can use
01:11:24.820 | for our
01:11:27.380 | Test set a
01:11:29.580 | Test data loader has a key difference to a normal data loader, which is that it does not have any labels
01:11:36.300 | So that's a key distinction
01:11:39.620 | So we can get the predictions for our learner passing in that data loader and
01:11:48.300 | In the case of a classification problem, you can also ask for them to be decoded. Decoded means rather than just getting back the
01:11:56.420 | probability of every
01:11:58.660 | rice disease for every row, it'll tell you the index of the most probable
01:12:05.420 | rice disease. That's what decoded means. So that returns the probabilities, the
01:12:10.420 | targets (which obviously will be empty because it's a test set, so throw them away), and those decoded indexes,
01:12:17.260 | Which look like this numbers from 0 to 9 because there's 10 possible rice diseases
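Roughly, assuming the test images live under a test_images folder as in this competition:

    tst_files = get_image_files(path/'test_images').sorted()
    tst_dl = dls.test_dl(tst_files)          # same transforms as training, but unlabelled
    probs, _, idxs = learn.get_preds(dl=tst_dl, with_decoded=True)
    idxs                                     # tensor of predicted class indices, 0..9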
01:12:21.700 | The Kaggle submission does not expect numbers from 0 to 9 it expects to see
01:12:27.580 | strings like these
01:12:30.780 | So what do those numbers from 0 to 9 represent?
01:12:34.660 | We can look up our vocab
01:12:37.620 | to get a list
01:12:39.700 | So that's 0 that's 1 etc. That's 9
01:12:46.380 | I realized later this is a slightly inefficient way to do it, but it does the job.
01:12:50.140 | I need to be able to map these two strings
01:12:55.580 | If I enumerate the vocab, that gives me pairs of numbers: 0 bacterial leaf blight, 1 bacterial leaf streak, etc.
01:13:02.740 | I can then create a dictionary out of that, and then I can use pandas
01:13:07.700 | to look up each thing in the dictionary; they call that map.
01:13:13.260 | If you're a pandas user, you've probably seen map used before being passed a function
01:13:18.140 | which is really, really slow. But if you pass map a dict, it's actually really, really fast; do it this way if you can.
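A sketch of that mapping; the disease names in the comment are just indicative:

    mapping = dict(enumerate(dls.vocab))     # e.g. {0: 'bacterial_leaf_blight', 1: 'bacterial_leaf_streak', ...}
    results = pd.Series(idxs.numpy(), name='label').map(mapping)   # dict lookup is fast, unlike mapping a function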
01:13:25.620 | so here's our
01:13:28.700 | Predictions
01:13:31.420 | So we've got our
01:13:34.540 | Submission sample submission file SS. So if we replace this column label with our predictions
01:13:43.220 | like so
01:13:45.220 | Then we can turn that into a CSV and
01:13:47.540 | remember, the exclamation mark means
01:13:50.860 | run a bash command, a shell command; head shows the first few rows. Let's just take a look... that looks reasonable.
01:13:58.780 | So we can now submit that to Kaggle now
01:14:03.180 | Iterating rapidly means everything needs to be
01:14:09.300 | Fast and easy things that are slow and hard don't just take up your time
01:14:14.420 | but they take up your mental energy. So even submitting to Kaggle needs to be fast, so I put it into a cell,
01:14:20.580 | So I can just run this cell
01:14:23.340 | api competition submit, this CSV file,
01:14:30.340 | Give it a description. So just run the cell and it submits to Kaggle and as you can see it says here
01:14:36.940 | here we go, successfully submitted.
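Roughly, building on the sketches above and assuming the Kaggle API credentials are already configured; the submission message is arbitrary:

    ss = pd.read_csv(path/'sample_submission.csv')
    ss['label'] = results
    ss.to_csv('subm.csv', index=False)

    from kaggle import api
    api.competition_submit('subm.csv', 'initial resnet26d 128px', comp)   # description text is arbitrary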
01:14:39.740 | So that submission
01:14:41.740 | was terrible
01:14:44.780 | Top 80% also known as bottom 20% which is not too surprising right? I mean, it's it's one minute of training time
01:14:53.820 | But it's something that we can start with, and
01:15:00.740 | however long it takes you to get to this point and put in your submission,
01:15:04.260 | Now you've really started right because then tomorrow
01:15:08.340 | You can try to make a slightly better one
01:15:11.260 | So I like to share my notebooks, and even sharing the notebook I've automated.
01:15:20.340 | So part of fastkaggle is you can use this thing called push_notebook, and that sends it off to Kaggle
01:15:26.740 | to create a
01:15:29.460 | Notebook on Kaggle
01:15:34.660 | There it is. And there's my score
01:15:37.540 | As you can see, it's exactly the same thing
01:15:43.060 | Why would you create public notebooks on Kaggle? Well
01:15:57.660 | It's the same
01:16:01.740 | brutality of feedback
01:16:04.060 | That you get for entering a competition
01:16:06.940 | But this time rather than finding out in no uncertain terms whether you can predict things accurately
01:16:12.660 | this time you can find out, in no uncertain terms, whether you can communicate things in a way that people find interesting and useful.
01:16:19.620 | And if you get zero votes
01:16:22.380 | You know, so be it right that's something to know and then you know ideally go and ask some friends like
01:16:29.980 | what do you think I could do to improve? And if they say, oh nothing, it's fantastic, you can tell them, no, that's not true,
01:16:36.540 | I didn't get any votes; I'll try again. This isn't good. How do I make it better? You know.
01:16:42.140 | And you can try and improve
01:16:45.260 | because
01:16:47.580 | If you can create models that predict things well, and you can communicate your results in a way that is clear and compelling
01:16:54.220 | You're a pretty good data scientist, you know, like they're two pretty important things and so here's a great way to
01:17:01.900 | Test yourself out on those things and improve. Yes, John
01:17:07.220 | Yes, Jeremy. We have a sort of a I think a timely question here from Zakiya about your iterative approach
01:17:13.700 | And they're asking do you create different Kaggle notebooks for each model that you try?
01:17:19.940 | So one Kaggle notebook for the first one, then separate notebooks subsequently, or do you append to the bottom of it?
01:17:27.600 | What's your strategy? That's a great question
01:17:29.940 | And I know Zaki is going through the
01:17:33.740 | the daily walkthroughs but isn't quite caught up yet. So I will say, keep it up, because
01:17:38.460 | In the six hours of going through this you'll see me create all the notebooks
01:17:43.880 | But if I go to the actual directory I used
01:17:51.720 | You can see them so basically yeah, I started with
01:17:57.740 | You know what you just saw
01:18:01.900 | A bit messier, without the prose, but that same basic thing. I then duplicated it
01:18:06.380 | to create the next one
01:18:09.500 | which is here, and because I duplicated it, you know, this stuff which I still need is still there, right?
01:18:15.780 | and so I run it and
01:18:17.780 | I don't always know what I'm doing, you know
01:18:21.380 | And so at first, if I don't rename it when I duplicate it, it will be called,
01:18:26.540 | you know, "first steps on the road to the top part one - Copy 1",
01:18:31.560 | You know, and that's okay
01:18:33.920 | And as soon as I can I'll try to rename that
01:18:39.020 | Once I know what I'm doing, you know
01:18:41.680 | Or if it doesn't seem to go anywhere, I rename it to something like, you know,
01:18:47.840 | Experiment blah blah blah and I'll put some notes at the bottom and I might put it into a failed folder or something
01:18:54.040 | But yeah, it's like
01:18:56.040 | It's a very low-tech
01:18:59.800 | Approach
01:19:01.080 | That I find works really well, which is just duplicating notebooks and editing them and naming them carefully and putting them in order
01:19:08.960 | And you know put the file name in when you submit as well
01:19:17.440 | Then of course also if you've got things in git
01:19:19.440 | You know, you can have a link to the git commit so you'll know exactly what it is
01:19:23.560 | Generally speaking for me, you know
01:19:25.380 | My notebooks will only have one submission in and then I'll move on and create a new notebook
01:19:30.020 | So I don't really worry about versioning so much,
01:19:32.600 | But you can do that as well if that helps you
01:19:36.480 | Yeah, so that's basically what I do and and
01:19:41.360 | I've worked with a lot of people who use much more sophisticated and complex processes and tools and stuff, but
01:19:48.080 | None of them seem to be able to stay as well organized as I am
01:19:53.200 | I think they kind of get a bit lost in their tools sometimes and
01:19:56.880 | File systems and file names I think are good
01:20:01.080 | Great thanks. Um, so away from that kind of
01:20:06.300 | dev process more towards the
01:20:09.480 | The specifics of you know finding the best model and all that sort of stuff
01:20:14.120 | we've got a couple of questions that are in the same space, which is
01:20:17.280 | You know, we've got some people here talking about AutoML frameworks
01:20:21.000 | Which you might want to you know touch on for people who haven't heard of those
01:20:24.000 | If you've got any particular AutoML frameworks, you think are
01:20:27.780 | worth
01:20:30.080 | Recommending or just more generally, how do you go trying different models random forest gradient boosting neural network?
01:20:36.840 | Just, in that space, if you could comment? Sure.
01:20:40.080 | I use AutoML less than anybody. I know I would guess
01:20:48.000 | Which is to say never
01:20:52.960 | Hyperparameter optimization never
01:21:02.200 | The reason why is I like being highly intentional, you know
01:21:07.560 | I like to think more like a scientist and have hypotheses and test them carefully
01:21:14.760 | And come out with conclusions, which then I implement, you know, so for example
01:21:19.600 | in this best vision models of fine-tuning I
01:21:23.720 | Didn't try a huge grid search of every possible
01:21:29.280 | Model every possible learning rate every possible pre-processing approach blah blah blah, right instead step one was to find out
01:21:37.240 | Well, which things matter right? So
01:21:43.480 | For example, does whether we squish or crop
01:21:46.080 | make a difference? You know, are some models better with squish and some models better with crop?
01:21:52.880 | So we just tested that
01:21:57.200 | Again, not for every possible architecture
01:21:59.280 | But for one or two versions of each of the main families that took 20 minutes and the answer was no in every single case
01:22:05.680 | The same thing was better. So we don't need to do a grid search over that anymore, you know
01:22:12.160 | Or another classic one is like learning rates. Most people
01:22:15.680 | Do a kind of grid search over learning rates or they'll train a thousand models, you know with different learning rates
01:22:23.620 | This fantastic researcher named Leslie Smith invented the learning rate finder a few years ago
01:22:27.660 | We implemented it. I think within days of it first coming out as a technical report. That's what I've used ever since
01:22:35.680 | Because it works
01:22:38.320 | Well and runs in a minute or so
01:22:42.160 | Yeah, I mean, then, like, neural nets versus GBMs versus random forests, I mean, that's
01:22:52.740 | That shouldn't be too much of a question on the whole like they have pretty clear
01:22:59.160 | Places that they go
01:23:05.560 | If I'm doing computer vision, I'm obviously going to use a computer vision deep learning model
01:23:10.760 | And which one I would use. Well if I'm transfer learning, which hopefully is always I would look up the two tables here
01:23:17.440 | This is my table for pets
01:23:18.920 | Which is which are the best at fine-tuning to very similar things to what they were pre trained on and then the same thing for planet
01:23:26.440 | Is which ones are best for fine-tuning for two data sets that are very different to what they're trained on
01:23:32.680 | And as it happens, in both cases they're very similar; in particular, ConvNeXt is right up towards the top in both cases.
01:23:39.120 | so I just like to have these rules of thumb and
01:23:42.240 | Yeah, my rule of thumb for tabular is
01:23:46.080 | Random forests going to be the fastest easiest way to get a pretty good result GBM's
01:23:50.760 | Probably gonna give me a slightly better result if I need it and can be bothered fussing around
01:23:59.920 | For GBMs I would probably, yeah, actually I probably would run a hyperparameter
01:24:05.520 | sweep
01:24:07.280 | because it is fiddly and it's fast, so you may as well.
01:24:11.000 | So yeah, so now you know, we were able to make a slightly better submission slightly better model
01:24:29.560 | I had a couple of thoughts about this. The first thing was
01:24:32.200 | that thing trained in
01:24:35.520 | A minute on my home computer and then when I uploaded it to Kaggle it took about four minutes per epoch
01:24:43.720 | which was horrifying and
01:24:46.120 | Kaggle's GPUs are not amazing, but they're not that bad
01:24:51.280 | So I knew something was up.
01:24:54.160 | And what was up is I realized that they only have two
01:24:58.640 | virtual CPUs, which nowadays is tiny; like, you know, you generally want, as a rule of thumb, about eight
01:25:05.440 | physical CPUs per GPU
01:25:12.120 | So it was spending all of its time just reading the damn data.
01:25:14.720 | Now, the data was 640 by 480, and we were ending up with these 128-pixel-size bits for speed,
01:25:20.760 | So there's no point doing that every epoch
01:25:23.880 | so step one was to make my
01:25:27.760 | Kaggle iteration faster as well. And so very simple thing to do
01:25:32.480 | resize the images
01:25:34.800 | So fastai has a function called resize_images, and you say, okay, take all the train images and stick them in
01:25:42.640 | the destination
01:25:45.080 | making them this size
01:25:47.080 | recursively
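A sketch of that preprocessing step, assuming fastai's resize_images; the destination folder name and the 256-pixel maximum size are arbitrary choices here:

    trn_path = Path('train_sml')                     # hypothetical folder for the smaller copies
    resize_images(path/'train_images', dest=trn_path,
                  max_size=256, recurse=True)        # recreates the per-class folder structure under dest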
01:25:49.000 | And it will recreate the same folder structure over here. And so that's why I called this the training path
01:25:56.440 | because this is now my training data and
01:25:58.440 | So when I then
01:26:02.240 | trained on that on
01:26:04.600 | Kaggle it went down to
01:26:06.880 | four times faster
01:26:10.040 | With no loss of accuracy. So that was kind of step one was to actually get my fast
01:26:15.920 | iteration working
01:26:20.720 | Now, a minute is still a long time, and on Kaggle you can actually see this little graph showing how much the CPU is being
01:26:28.200 | used and how much the GPU is being used; on your own home machine
01:26:30.920 | there are, you know, free tools to do the same thing.
01:26:34.680 | I saw that the GPU was still hardly being used, and the CPU was still being driven pretty hard.
01:26:41.000 | I wanted to use a better model anyway to move up the leaderboard
01:26:45.040 | so I moved from a
01:26:50.640 | Oh, by the way, this graph is very useful. So this is
01:26:55.040 | This is speed versus error rate by family and so we're about to be looking at these
01:27:06.560 | ConvNeXt models
01:27:10.640 | So we're going to be looking at this one, convnext_tiny.
01:27:18.920 | Here it is, convnext_tiny. So we were looking at resnet26d, which took this long on this data set,
01:27:25.200 | But this one here is nearly the best. It's third best, but it's still very fast
01:27:31.900 | And so it's the best overall score. So let's use this
01:27:36.200 | Particularly because you know, we're still spending all of our time waiting for the CPU anyway
01:27:40.760 | So it turned out that when I switched my architecture to ConvNeXt,
01:27:46.240 | It basically ran just as fast on Kaggle
01:27:48.840 | So we can then
01:27:51.880 | train that
01:27:53.920 | Let me switch to the Kaggle version because my outputs are missing for some reason
01:28:03.680 | Yeah, so I started out by running the resnet26d on the resized images and got
01:28:08.160 | Similar error rate, but I ran a few more epochs
01:28:11.000 | got 12% error rate and
01:28:14.280 | so then I do exactly the same thing but with convnext_small, and get a
01:28:18.560 | 4.5% error rate. So don't think that different architectures only give you
01:28:24.400 | tiny little differences. This is over twice as good.
01:28:34.880 | A lot of folks you talk to will never have heard of this ConvNeXt, because it's very new, and
01:28:42.040 | I've noticed a lot of people tend not to
01:28:44.040 | keep up to date with new things. They kind of learn something at university and then they stop learning.
01:28:51.040 | So if somebody's still just using resnets all the time,
01:28:54.320 | you know, you can tell them we've actually moved on, you know.
01:28:59.600 | Resnets are still probably the fastest,
01:29:02.880 | but for the mix of speed and performance, you know, not so much.
01:29:10.480 | ConvNeXt, you know, again, you want these rules of thumb, right? If you're not sure what to do,
01:29:15.920 | use this ConvNeXt. Okay, and then like most things there's different sizes: there's a tiny, there's a small, there's a base,
01:29:24.840 | There's a large there's an extra large and you know, it's just well, let's look at the picture
01:29:30.320 | This is it here
01:29:37.000 | Right
01:29:39.760 | Large takes longer but lower error
01:29:43.080 | Tiny takes less time but higher error, right? So you you pick
01:29:48.280 | About your speed versus accuracy trade-off for you. So for us small is great
01:29:54.860 | And so, yeah, now we've got a 4.5 percent error rate; that's terrific.
01:30:03.680 | Now, to iterate: on Kaggle this is taking about a minute per epoch; on my computer it's probably taking about 20 seconds per epoch.
01:30:12.120 | So not too bad
01:30:14.120 | So, you know one thing we could try is instead of using squish as
01:30:19.880 | Our pre-processing let's try using crop. So that will randomly crop out an area
01:30:25.840 | And that's the default. So if I remove the method equals squish that will crop
01:30:31.480 | So you see how I've tried to get everything into a single
01:30:34.080 | function, right? This single function: I can tell it
01:30:39.920 | what architecture do I want to train, how do I want to transform the items,
01:30:45.320 | how do I want to transform the batches, and how many epochs do I want to do. That's basically it, right?
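Something along these lines, assuming a timm model name; dropping method='squish' from Resize falls back to random cropping:

    def train(arch, item, batch, epochs=5):
        dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42,
                                           item_tfms=item, batch_tfms=batch)
        learn = vision_learner(dls, arch, metrics=error_rate).to_fp16()
        learn.fine_tune(epochs, 0.01)
        return learn

    # crop instead of squish: just leave out method='squish'
    learn = train('convnext_small_in22k',
                  item=Resize(192),
                  batch=aug_transforms(size=128, min_scale=0.75))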
01:30:50.480 | So this time I want to use the same architecture, convnext; I want to resize by cropping rather than squishing, and then use the same data augmentation.
01:31:00.880 | And, okay, the error rate's about the same;
01:31:03.000 | it's a tiny bit worse, but not enough to be interesting.
01:31:08.280 | Instead of cropping, we can pad. Now, padding is interesting; do you see how these are all square,
01:31:15.560 | Right, but they've got black borders
01:31:20.720 | Padding is interesting because it's the only way of pre-processing images
01:31:24.200 | Which doesn't distort them and doesn't lose anything if you crop you lose things
01:31:29.680 | If you squish you distort things
01:31:31.840 | This does neither. Now, of course, the downside is that there are pixels that are literally pointless; they contain zeros.
01:31:39.720 | So every way of getting this working has its compromises
01:31:44.560 | but this approach of resizing where we pad with zeros is
01:31:48.480 | Not used enough and it can actually often work quite well
01:31:52.600 | And in this case it was about as good as our best so far,
01:31:58.920 | But no not huge differences yet
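The padded variant is just a different item transform; as a sketch, with the same assumed architecture and sizes as before:

    learn = train('convnext_small_in22k',
                  item=Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros),
                  batch=aug_transforms(size=128, min_scale=0.75))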
01:32:00.920 | What else could we do well
01:32:05.280 | What we could do is
01:32:09.600 | See these pictures this is all the same picture
01:32:16.840 | But it's gone through our data augmentation. So sometimes it's a bit darker. Sometimes it's flipped horizontally
01:32:24.400 | Sometimes it's slightly rotated; sometimes it's slightly warped; sometimes it's zoomed into a slightly different section. But this is all the same picture.
01:32:31.240 | Maybe our model would like some of these versions better than others
01:32:37.200 | So what we can do is we can pass all of these to our model get predictions for all of them and
01:32:44.640 | Take the average
01:32:48.280 | Right. So it's our own kind of like little mini bagging approach and this is called test time augmentation
01:32:54.440 | Fast AI is very unusual in making that available in a single method
01:32:59.800 | you just pass TTA and it will pass multiple augmented versions of the image and
01:33:08.800 | Average them for you
01:33:13.120 | So this is the same model as before which had a four point five percent
01:33:18.640 | So instead if we get TTA predictions
01:33:21.960 | And then get the error rate
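A sketch of that, evaluating TTA on the validation set:

    preds, targs = learn.tta(dl=learn.dls.valid)   # predicts several augmented versions and averages them
    error_rate(preds, targs)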
01:33:26.120 | Wait, why does this say four point eight last time I did this it was way better. Well, that's messing things up, isn't it?
01:33:39.120 | So when I did this originally on my home computer, it went from like four point five to three point nine, so possibly I
01:33:45.160 | got very bad luck this time. So this is the first time I've actually ever seen TTA give a worse result.
01:33:52.680 | So that's very weird I
01:33:58.120 | wonder if it's
01:34:00.840 | If I should do something other than the crop padding, all right, I'll have to check that out and I'll try and come back to
01:34:06.480 | You and find out
01:34:07.840 | Why in this case?
01:34:09.760 | This one was worse
01:34:11.760 | Anyway take my word for it every other time I've tried it TTA has been better
01:34:17.480 | So then, you know now that we've got a pretty good way of
01:34:22.760 | resizing
01:34:25.320 | We've got TTA. We've got a good training process
01:34:28.480 | Let's just make bigger images and something that's really interesting and a lot of people don't realize is your images don't have to be square
01:34:37.240 | they just all have to be the same size and
01:34:39.240 | Given that nearly all of our images are 640 by 480 we can just pick, you know that aspect ratio
01:34:46.080 | So for example 256 by 192 and we'll resize everything
01:34:50.520 | To the same aspect ratio rectangular
01:34:53.900 | That should work even better still. So if we do that, we'll do 12 epochs
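As a sketch, with the rectangular sizes mentioned here taken as assumptions (fastai expects (rows, cols), i.e. height by width):

    learn = train('convnext_small_in22k', epochs=12,
                  item=Resize((256, 192)),                              # rectangular, same 4:3 portrait ratio as the originals
                  batch=aug_transforms(size=(256, 192), min_scale=0.75))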
01:34:58.360 | Okay, now our error rates down to 2.2 percent and
01:35:06.440 | Then we'll do TTA
01:35:08.440 | Okay, this time you can see it actually improving down to under 2 percent
01:35:13.280 | So that's pretty cool, right? We've got our error rate at the start of this notebook. We were at
01:35:18.880 | Twelve percent and by the time we've got through our little experiments
01:35:29.600 | We're down to under 2 percent
01:35:34.600 | And nothing about this is in any way specific to
01:35:38.320 | Rice or this competition, you know, it's like this is a very
01:35:43.680 | Mechanistic, you know standardized
01:35:48.640 | Approach
01:35:51.440 | which you can use for
01:35:53.440 | certainly any kind of this type of computer vision competition, and almost any computer vision data set.
01:36:00.120 | But you know, it looked very similar for a collaborative filtering model or tabular model NLP model whatever
01:36:06.000 | So, of course, again, I want to submit as soon as I can, so I just copy and paste the exact same steps
01:36:13.640 | I took last time basically for creating a submission
01:36:16.320 | So as I said last time we did it using pandas, but there's actually an easier way
01:36:22.640 | So the step where here I've got the numbers from 0 to 9
01:36:26.840 | Which is like which which rice disease is it?
01:36:30.240 | So here's a cute idea
01:36:33.000 | we can take our vocab and
01:36:35.000 | Make it an array. So that's going to be a list of ten things and
01:36:39.040 | Then we can index into that vocab with our indices, which is kind of weird. This is a list of ten things
01:36:47.400 | This is a list of I don't know four or five thousand things. So this will give me four or five thousand results, which is
01:36:55.160 | Each vocab item for that thing. So this is another way of doing the same mapping and I would
01:37:01.380 | spend time
01:37:03.840 | playing with this code to understand what it does because it's the kind of like
01:37:07.480 | very fast, you know, not just in terms of writing, but this would
01:37:14.320 | optimize, you know, on the CPU
01:37:17.920 | very, very well. This is the kind of coding you want to get used to,
01:37:23.120 | this kind of indexing
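A sketch of the array-indexing version; it relies on numpy's fancy indexing to do one vocab lookup per prediction in a single step:

    vocab = np.array(dls.vocab)                      # the 10 class-name strings
    results = pd.Series(vocab[idxs], name='label')   # index 10 names with thousands of indices at once
    ss['label'] = results
    ss.to_csv('subm.csv', index=False)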
01:37:25.120 | Anyway, so then we can submit it just like last time and
01:37:29.440 | when I did that I
01:37:32.200 | got in the top 25% and
01:37:34.200 | That's that's where you want to be right? Like generally speaking I find in Kaggle competitions the top 25% is
01:37:43.920 | You're kind of like solid competent
01:37:46.680 | level, you know. Look, that's not to say it's easy;
01:37:51.960 | You've got to know what you're doing
01:37:53.960 | but if you get in the top 25%, I think you can really feel like, yeah, this is a
01:37:58.680 | you know very
01:38:01.600 | Reasonable attempt and so that's I think this is a very reasonable attempt
01:38:06.360 | Okay, before we wrap up John any last questions
01:38:10.560 | Yeah, there's this there's two I think that would be good if we could touch on quickly before you wrap up
01:38:20.080 | one from Victor asking about TTA
01:38:22.560 | When I use TTA during my training process do I need to do something special during inference or is this something you use only
01:38:31.440 | Okay, so just explain
01:38:34.160 | TTA means test time augmentation. So specifically it means inference. I think you mean augmentation during training. So yeah, so during training
01:38:42.360 | You basically always do augmentation, which means you're varying each image slightly
01:38:49.080 | so that the
01:38:50.720 | model never sees the same image exactly the same twice, and so it can't memorize it.
01:38:55.400 | In fastai, and as I say, I don't think anybody else does this as far as I know, if you call TTA,
01:39:02.800 | it will use the exact same augmentation approach on
01:39:07.120 | whatever data set you pass it and
01:39:10.120 | average out the predictions; so, like, multiple times on the same image, and we'll average them out.
01:39:15.960 | So you don't have to do anything different. But if you didn't have any data augmentation in training, you can't use TTA
01:39:21.760 | It uses the same by default the same data augmentation you use for training
01:39:26.120 | Great. Thank you. And the other one is about how
01:39:30.520 | You know when you first started this example you squared the models and the images rather and you talked about
01:39:36.760 | squashing verse cropping verse, you know clipping and
01:39:39.920 | Scaling and so on but then you went on to say that
01:39:45.120 | These models can actually take rectangular input, right?
01:39:48.600 | so there's a question that's kind of probing it at that, you know, if the if the models can take rectangular inputs
01:39:56.520 | Why would you ever even care as long as they're all the same size? So I
01:40:02.660 | Find most of the time
01:40:06.480 | Datasets tend to have a wide variety of input sizes and aspect ratios
01:40:14.240 | You know, if there's just as many tall skinny ones as wide
01:40:18.040 | short ones
01:40:21.360 | you know, it doesn't make sense to create a rectangle, because some of them you're gonna really destroy.
01:40:26.840 | So that's where a square is kind of the
01:40:28.840 | best compromise in some ways
01:40:31.600 | There are
01:40:34.440 | better things we can do
01:40:36.440 | Which we don't have any
01:40:40.480 | off-the-shelf library support for yet, and I don't know that anybody else has even published about this,
01:40:45.560 | but we experimented with kind of trying to
01:40:47.800 | Batch things that are similar aspect ratios together and use the kind of median
01:40:54.640 | Rectangle for those and have had some good results with that. But honestly
01:40:59.240 | 99.99% of people given a wide variety of aspect ratios chuck everything into a square a
01:41:07.280 | Follow-up just this is my own interest. Have you ever looked at?
01:41:10.600 | You know, so the issue with padding, as you say, is that you're putting black pixels there.
01:41:17.120 | Those are not NaNs, those are black pixels. (That's right.) And so there's something problematic to me, you know, conceptually, about that.
01:41:26.480 | You know when you when you see
01:41:30.440 | for example
01:41:31.880 | four to three aspect ratio footage
01:41:35.040 | Presented for broadcast on 16 to 9 you got the kind of the blurred stretch that kind of stuff
01:41:39.760 | No, we played with that a lot. Yeah, I used to be really into it actually and fast a I still by default
01:41:45.920 | uses reflection padding, which means if this is, I don't know, say this is a 20-pixel-wide thing,
01:41:51.400 | it takes the 20 pixels next to it and flips it over and sticks it here and
01:41:55.080 | It looks pretty good. You know, another one is copy, which simply takes the outside pixel and repeats it, a bit like on TV.
01:42:06.760 | It turns out none of them really help; if anything, they make it worse,
01:42:13.160 | Because in the end
01:42:16.980 | The computer wants to know no, this is the end of the image. There's nothing else here. And if you reflect it, for example
01:42:22.880 | Then you're kind of creating weird spikes that didn't exist and the computer's got to be like, oh, I wonder what that spike is
01:42:29.640 | So yeah, it's a great question and I obviously spent like a couple of years
01:42:34.120 | Assuming that we should be doing things that look more image-like
01:42:38.180 | But actually the computer likes things to be presented to it in as straightforward a way as possible
01:42:43.820 | Alright, thanks everybody and I hope to see some of you in the walkthroughs and otherwise see you next time