
Lesson 6: Practical Deep Learning for Coders 2022


Chapters

0:00 Review
2:09 TwoR model
4:43 How to create a decision tree
7:02 Gini
10:54 Making a submission
15:52 Bagging
19:06 Random forest introduction
20:09 Creating a random forest
22:38 Feature importance
26:37 Adding trees
29:32 What is OOB
32:08 Model interpretation
35:47 Removing the redundant features
35:59 What does Partial dependence do
39:22 Can you explain why a particular prediction is made
46:07 Can you overfit a random forest
49:03 What is gradient boosting
51:56 Introducing walkthrus
54:28 What does fastkaggle do
62:52 fastcore.parallel
64:12 item_tfms=Resize(480, method='squish')
66:20 Fine-tuning project
67:22 Criteria for evaluating models
70:22 Should we submit as soon as we can
75:15 How to automate the process of sharing kaggle notebooks
80:17 AutoML
84:16 Why the first model run so slow on Kaggle GPUs
87:53 How much better can a new novel architecture improve the accuracy
88:33 Convnext
91:10 How to iterate the model with padding
92:01 What does our data augmentation do to images
94:12 How to iterate the model with larger images
96:08 pandas indexing
98:16 What data-augmentation does tta use?

Transcript

Okay, so welcome to lesson six of Practical Deep Learning for Coders. We just started looking at tabular data last time. For those of you who've forgotten, we were looking at the Titanic dataset, and we were creating binary splits using categorical or binary variables like sex, and continuous variables like the log of the fare they paid. We also came up with a score, which was basically: how good a job did that split do of grouping the survival characteristics into two groups, one where nearly all survived and one where nearly all didn't, so that each group had a small standard deviation? So then we created the world's simplest little UI to let us fiddle around and try to find a good binary split, and we did come up with a very good binary split, which was on sex. And we actually created a little automated version of that too.

So this is yet another time, I should say, that we have successfully created an actual machine learning algorithm from scratch. This one is about the world's simplest one: it's OneR, creating the single rule which does a good job of splitting your dataset into two parts which differ as much as possible on the dependent variable.

OneR is probably not going to cut it for a lot of things, though it's surprisingly effective. So maybe we could go a step further, and the step further we could go is to create something like a TwoR model. What if we took each of those groups, males and females in the Titanic dataset, and split each of those into two further groups?

So: split the males into two groups, and split the females into two groups. To do that, we can repeat the exact same piece of code we just did, but remove sex from it, then split the dataset into males and females, and run the same code as before just for the males. So this is going to be like a OneR rule for how we predict which males survive the Titanic.

And let's have a look: three eight, three seven, three eight, three eight, three eight. Okay, so it's age, whether they were greater than or less than six. That turns out to be, for the males, the biggest predictor of whether they were going to survive that shipwreck. And we can do the same thing for the females.

So for the females, there we go, no great surprise: Pclass. Whether or not they were in first class was the biggest predictor for females of whether they would survive the shipwreck. So that has now given us a decision tree. It is a series of binary splits which will gradually

split up our data more and more, such that in the end, in the leaf nodes as we call them, we will hopefully get as strong a prediction as possible about survival. So we could just repeat this step for each of the four groups we've now created (male kids, males older than six, females in first class, and everybody else), and we could do it again, and then we'd have eight groups. We could do that manually with another couple of lines of code, or we can just use DecisionTreeClassifier, which is a class that does exactly that for us. So there's no magic in here; it's just doing what we've just described.

DecisionTreeClassifier comes from a library called scikit-learn. Scikit-learn is a fantastic library that focuses on classical, non-deep-learning machine learning methods like decision trees. So to create the exact same decision tree, we can say: please create a decision tree classifier with at most four leaf nodes. One very nice thing it has is that it can draw the tree for us; here's a tiny little draw_tree function. You can see it's going to first of all split on sex. Now, it looks a bit weird to say sex <= 0.5, but remember our binary characteristics are coded as zero and one, so that's just an easy way to say males versus females. Then, for the females, we've got what class they're in, and for the males, what age they are, and here are our four leaf nodes. For the females in first class, 116 of them survived and 4 of them didn't, so a very good idea to be a well-to-do woman on the Titanic. On the other hand, for male adults, 68 survived and 350 died, so a very bad idea to be a male adult on the Titanic. So you can get a quick summary of what's going on, and one of the reasons people tend to like decision trees, particularly for exploratory data analysis, is that they let us get a quick picture of which variables are driving this dataset and how much they predict what was happening in the data.

Okay, so it's found the same splits as us, and it's got one additional piece of information we haven't seen before: this thing called Gini. Gini is just another way of measuring how good a split is, and I've put the code to calculate Gini here. Here's how you can think of Gini: how likely is it that, if you go into that sample

and grab one item, and then go in again and grab another item, you're going to grab the same item each time? If the entire leaf node is just people who survived, or just people who didn't survive, the probability would be one: you'd get the same answer every time. If it was an exactly equal mix, the probability would be 0.5. That's where this formula comes from in the binary case. And in fact, you can see it here, right?

This group here is pretty much 50/50, so Gini is 0.5, whereas this group here is nearly a hundred percent one class, so Gini is nearly zero. So I had it backwards: it's one minus. And I think I've written it backwards here as well, so I'd better fix that.
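For reference, here is a minimal sketch of that Gini measure for the binary case; it isn't necessarily the exact code shown in the notebook.

```python
# p is the proportion of the group that survived.
def gini(p):
    same_twice = p**2 + (1 - p)**2   # probability that two random draws match
    return 1 - same_twice            # "it's one minus", as corrected above

gini(0.5)   # 0.5 -> an even 50/50 group
gini(1.0)   # 0.0 -> a pure group
```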

So this decision tree, we would expect it to be a bit more accurate, so we can calculate its mean absolute error. For OneR, just doing males versus females, what was our score? Here we go: 0.407. Actually, do we have an accuracy score somewhere? Here we are: 0.336. That was for log fare, and for sex it was 0.215. Okay, so 0.215.

So that was for the OneR version. For the decision tree with four leaf nodes it's 0.224, so it's actually a little worse, right? And I think this just reflects the fact that this is such a small dataset and the OneR version was so good; we haven't really improved it that much, certainly not enough to see it amongst the randomness of such a small validation set. We could go further, to a minimum of 50 samples per leaf node. That means that in each of these leaves, the 'samples', which in this case are passengers on the Titanic, number at least 50. Here there are 67 people who were female, first class, less than 28; that's how you read that. So this decision tree keeps splitting until it gets to a point where a leaf would have fewer than 50, at which point it stops splitting that leaf, so you can see they've all got at least 50 samples. And here's the decision tree it builds. As you can see, it doesn't have to be constant depth, right?

So this group here, which is males who had cheaper fares, and who were older than 20 but younger than 32 (actually younger than 24), and actually super-cheap fares, and so forth: it keeps going down until we get to that group. So let's try that decision tree.

That decision tree has a mean absolute error of 0.183, so not surprisingly, once we get there, it's starting to look a little bit better. So there's a model, and this is a Kaggle competition, so we should submit it to the leaderboard. You know, one of the biggest mistakes I see, not just beginners but practitioners at every level, make on Kaggle is not submitting to the leaderboard; they spend months making some perfect thing instead.

But you actually want to see how you're going, and you should try to submit something to the leaderboard every day, regardless of how rubbish it is, because you want to improve every day, and so you want to keep iterating. To submit something to the leaderboard you generally have to provide a CSV file, so we're going to create one: we apply the category codes to get the category for each row in our test set, we set the survived column to our predictions, and then we write that out to a CSV. So yeah, I submitted that, and I got a score a little bit worse than most of our linear models and neural nets, but not terrible; it's doing an okay job. Now, one interesting thing with the decision tree is that there was a lot less pre-processing to do. Did you notice that we didn't have to create any dummy variables for our categories?

You certainly can create dummy variables, but you often don't have to. So for example, for class, it's one, two, or three; you can just split on one, two, or three. Even for something like the embarkation city code, we just convert them, kind of arbitrarily, to the numbers one, two, and three, and you can split on those numbers.

So with random forests, sorry, not random forests, decision trees, you can generally get away with not doing stuff like dummy variables. In fact, even taking the log of fare: we only did that to make our graph look better.

But if you think about it, splitting on log(fare) < 2.7 is exactly the same as splitting on fare < e^2.7 (or whatever log base we used, I can't remember). All a decision tree cares about is the ordering of the data, and this is another reason decision-tree-based approaches are fantastic: they don't care at all about outliers, long-tailed distributions, categorical variables, whatever. You can throw it all in and it'll do a perfectly fine job. So for tabular data, I would always start by using a decision-tree-based approach and creating some baselines and so forth, because it's really hard to mess up, and that's important. So here, for example, is embarked: it was coded originally as the first letter of the city they embarked in, but we turned it into a categorical variable, and pandas creates this vocab for us, the list of all of the possible values. If you look at the codes attribute, you can see that C, Q, S is 0, 1, 2, so S has become 2, C has become 0, and so forth.

All right, so that's how we're converting the categories, the strings, into numbers that we can sort and group by. So if we wanted to split C into one group and Q and S into the other, we can just split on, okay, less than 0.5. Now of course, if we wanted to split C and S into one group and Q into the other, we would need two binary splits: first C on one side and Q and S on the other, and then Q and S into Q versus S, and then the C and S leaf nodes could get similar predictions. So you do have that; sometimes it can take a little bit more messing around, but most of the time I find categorical variables work fine as numeric in decision-tree-based approaches. And as I say here, I tend to use dummy variables only if there are fewer than, say, four levels. Now, what if we wanted to make this more accurate? Could we grow the tree further? I mean, we could, but there are only 50 samples in these leaves; if I keep splitting, the leaf nodes are going to have so little data that they're not really going to make very useful predictions. So there are limitations to how accurate a decision tree can be. So what can we do? We can do something that I find amazing and fascinating. It comes from a guy called Leo Breiman, and Leo Breiman came up with this idea called bagging. Here's the basic idea of bagging. Let's say we've got a model that's not very good, because, let's say, it's a decision tree that's really small and we've hardly used any data for it, right?

It's not very good, so it's got errors in its predictions. But it's not a systematically biased error; it's not always predicting too high or always predicting too low. Decision trees will, on average, predict the average, right? But it has errors. So what I could do is build another decision tree in some slightly different way that would have different splits. It would also not be a great model, but it predicts the correct thing on average; it's not completely hopeless.

And again, some of its errors are a bit too high and some are a bit too low. And I could keep doing this. So if I build lots and lots of slightly different decision trees, I'm going to end up with, say, a hundred different models, all of which are unbiased, all of which are better than nothing, and all of which have some errors a bit high, some a bit low, whatever. So what would happen if I averaged their predictions?

Assuming that the models are not correlated with each other, then you're going to end up with errors on either side of the correct prediction, some a bit high, some a bit low, with some distribution of errors, and the average of those errors will be zero. So that means the average of the predictions of these multiple uncorrelated models, each of which is unbiased, will be the correct prediction, because their errors average to zero. This is a mind-blowing insight. It says that if we can generate a whole bunch of uncorrelated, unbiased models, we can average them and get something better than any of the individual models, because the average of the error will be zero. So all we need is a way to generate lots of models. Well, we already have a great way to build a model, which is to create a decision tree. How do we create lots of them?

How do we create lots of unbiased but different models? Well, let's just grab a different subset of the data each time: grab at random half the rows and build a decision tree, then grab another half of the rows and build a decision tree, and another half and build another decision tree. Each of those decision trees is going to be not great, since it's only using half the data, but it will be unbiased.

It will be predicting the average, on average, and it will certainly be better than nothing, because it's using some real data to try to create a real decision tree. And the trees won't be correlated with each other, because they're each built on different random subsets. So that meets all of our criteria for bagging. When you do this, you create something called a random forest. So let's create one in a few lines of code. Here is a function to create a decision tree.

This argument is just the proportion of data, so let's say we put 75% of the data in each time, or we could change it to 50%, whatever. This is the number of samples in the subset, n, so let's randomly choose n times the proportion we requested from the sample and build a decision tree from that. Now let's do that 100 times, getting a tree each time and sticking them all in a list using a list comprehension. Now let's grab the predictions for each one of those trees, then stack all those predictions up together and take their mean. That is a random forest. And what do we get? One, two, three, four, five, six, seven... that's seven lines of code.
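Here's a self-contained sketch of that bagging loop. The random DataFrame stands in for the Titanic data so the snippet runs on its own; it isn't the notebook's exact code.

```python
import numpy as np, pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list('abcd'))
y = ((df.a + rng.normal(scale=0.5, size=200)) > 0).astype(int)

def get_tree(prop=0.75):
    "Fit one tree on a random `prop`-sized subset of the rows."
    idxs = rng.choice(len(df), int(len(df) * prop), replace=False)
    return DecisionTreeClassifier(min_samples_leaf=5).fit(df.iloc[idxs], y.iloc[idxs])

trees = [get_tree() for _ in range(100)]           # 100 slightly different trees
all_preds = np.stack([t.predict(df) for t in trees])
avg_pred = all_preds.mean(0)                       # the bagged prediction: their average
```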

So random forests are very simple. Well, this is a slight simplification: there's one other thing random forests do, which is that when they build each decision tree, they also randomly select a subset of columns, and they select a different random subset of columns each time they do a split. The idea is that you want each tree to be as random as possible, but also somewhat useful. We can do all of that by creating a RandomForestClassifier: say how many trees we want and how many samples per leaf, and then fit does what we just did by hand. Here's our mean absolute error. Again, it's not quite as good as our decision tree, but it's still pretty good, and with such a small dataset it's hard to tell whether that means anything. So we can submit that to Kaggle: earlier on I created a little function to submit to Kaggle, so now I just create some predictions and submit, and it looks like it gave nearly identical results to a single tree.
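Roughly the scikit-learn equivalent of what was just described, again with made-up data rather than the notebook's exact code:

```python
import numpy as np, pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=list('abcde'))
y = ((X.a + 0.5 * X.b) > 0).astype(int)

# "Say how many trees we want and how many samples per leaf, then fit."
rf = RandomForestClassifier(100, min_samples_leaf=5)
rf.fit(X, y)
preds = rf.predict(X)
```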

Now to one of my favourite things about random forests. And I should say, on most real-world datasets of reasonable size, random forests basically always give you much better results than decision trees; this is just a small dataset to show you what to do. One of my favourite things about random forests is that we can do something quite cool with them: we can look at the underlying decision trees they create. We've now got a hundred decision trees, and we can see which columns each one found a split on. So here, okay, the first thing it split on was sex, and it improved the Gini from 0.47 to, well, just take the weighted average of 0.38 and 0.31, weighted by the samples.

So that's probably going to be about 0.33. So, okay, that's roughly a 0.14 improvement in Gini, thanks to sex.

And we can do that again. Okay, then Pclass: how much did that improve the Gini? Again, we keep weighting it by the number of samples. Then log fare: how much does that improve the Gini? So we can keep track, for each column, of how much in total it improved the Gini in this decision tree, then do that for every decision tree, and then add them up per column, and that gives you something called a feature importance plot. Here it is. A feature importance plot tells you how important each feature is: how often did the trees pick it, and how much did it improve the Gini when they did? And we can see from the feature importance plot that sex was the most important, class was the second most important, and everything else was a long way back.

Now this is another reason, by the way, why our random forest isn't particularly helpful here: it's just such an easy split to find. Basically all that matters is what class you're in and whether you're male or female. And these feature importance plots, remember, because they're built on random forests, and random forests don't really care about the distribution of your data and can handle categorical variables and so on, that means for basically any tabular dataset you have, you can just plot this right away. Random forests for most datasets only take a few seconds to train, really at most a minute or two. So if you've got a big dataset with hundreds of columns, do this first and find the 30 columns that might matter. It's such a helpful thing to do.

I've done that, for example, when I did some work in credit scoring. We were trying to find out which things would predict who's going to default on a loan, and I was given something like a thousand columns from the database. I put it straight into a random forest and found, I think, about 30 columns that seemed interesting.

I did that like two hours after I started the job, and I went to the head of marketing and the head of risk and told them, here are the columns I think we should focus on. And they were like, oh my god, we just finished a two-year consulting project with one of the big consultancies, paid them millions of dollars, and they came up with a subset of these. There are other things you can do with random forests along this path. I'll touch on them briefly, and specifically I'm going to look at chapter 8 of the book, which goes into this in a lot more detail. Particularly interestingly, chapter 8 of the book uses a much bigger and more interesting dataset: auction prices of heavy industrial equipment. I mean, it's less interesting historically, but more interesting numerically.

I'll touch on them briefly and Specifically I'm going to look at chapter 8 of the book Which goes into this in a lot more detail and particularly interestingly chapter 8 of the book uses a Much bigger and more interesting data set which is auction prices of heavy industrial equipment I mean, it's less interesting historically, but more interestingly numerically And so some of the things I did there on this data set I Say this isn't from the data set.

This is from the psychic learn documentation They looked at how as you increase the number of estimators. So the number of trees how much does the Accuracy improve so I then did the same thing on our data set. So I actually just Added up to 40 more and more and more trees and you can see that basically as as predicted by that kind of an initial bit of Hand-wavy theory I gave you that you would expect the more trees The lower the error because the more things you're averaging and that's exactly what we find the accuracy improves as we have more trees John what's up?

Victor, you might have just answered his question as you talked, but he's asking on the same theme: with the number of trees in a random forest, does increasing the number of trees always translate to a better error?

Yes, it does, always. I mean, there are tiny bumps, but yeah, once you smooth it out, with decreasing returns. And if you end up productionizing a random forest, then of course every one of these trees has to be gone through at inference time, so it's not that there's no cost. Having said that, zipping through a binary tree is the kind of thing you can do really fast; in fact, it's quite easy to literally spit out C++ code with a bunch of if statements, compile it, and get extremely fast performance. I don't often use more than a hundred trees. That's a rule of thumb. Is that the only one, John? Okay.

So then there's another interesting feature of random forests. Remember how in our example we trained with 75% of the data on each tree? That means for each tree there was 25% of the data we didn't train on. This actually means that if you don't have much data, in some situations you can get away with not having a validation set. The reason why is that for each tree we can pick the 25% of rows that weren't in that tree and see how accurate that tree was on those rows, and for each row we can average its accuracy over all of the trees in which it was not part of the training set. That is called the out-of-bag error, or OOB error, and this is also built into sklearn: you can ask for an OOB prediction.

Just before we move on, Zakiya has a question about bagging. We know that bagging is powerful as an ensemble approach to machine learning. Would it be advisable to try out bagging first when approaching a particular, say, tabular task, before deep learning? That's the first part of the question, and the second part is: could we create a bagging model which includes fastai deep learning models?

Yes, absolutely. So to be clear, bagging is kind of a meta method; it's not a method of modelling itself, it's just a method of combining other models. Random forests in particular are one particular approach to bagging. I would personally pretty much always start a tabular project with a random forest, because they're nearly impossible to mess up, they give good insight, and they give a good base case. But your question about whether you can bag other models is a very interesting one, and the answer is you absolutely can, and people very rarely do. But we will, quite soon, maybe even today.

So you might be getting the impression I'm a bit of a fan of random forests. Before people thought of me as the deep learning guy, people thought of me as the random forests guy; I used to go on about random forests all the time. One of the reasons I'm so enthused about them isn't just that they're very accurate, or that they're very hard to mess up and require very little pre-processing, but that they give you a lot of quick and easy insight. Specifically, these are the five things I think we're interested in, all of which are things that random forests are good at. They will tell us: how confident are we in our predictions on some particular row?

So when we're giving a loan to somebody, we don't necessarily just want to know how likely they are to repay; I'd also like to know how confident we are that we know. Because if we think they'll repay but we're not confident of that, we would probably want to give them less of a loan.

Another thing that's very important is, when we're then making a prediction, so again, for example, for credit, let's say you rejected that person's loan: why? A random forest will tell us what the reason is that we made a particular prediction, and you'll see how all these things work.

Which columns are the strongest predictors? You've already seen that one, right? That's the feature importance plot. Which columns are effectively redundant with each other, i.e. basically highly correlated with each other? And then one of the most important ones: as you vary a column, how does that vary the predictions?

So for example, in your credit model, how does your prediction of risk vary as you vary, well, something that the regulator would probably want to know about, maybe some protected variable like race or some socio-demographic characteristic that you're not allowed to use in your model. They might check things like that.

So they might check things like that For the first thing how confident are we in our predictions using a particular row of data? There's a really simple thing we can do which is remember how when we Calculated our predictions manually we stacked up the predictions together and took their mean Well, what if you took their standard deviation instead?

so if you stack up your predictions and take their standard deviation and If that standard deviation is high That means all of them all of the trees are predicting something different and that suggests that we don't really know what we're doing And so that would happen if different subsets of the data end up giving completely different trees for this particular row So there's like a really simple thing you can do to get a sense of your prediction confidence Okay feature importance.
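Again continuing the earlier sketch (rf and X are assumed from there), that standard-deviation trick looks roughly like this, here using each tree's predicted probability of the positive class:

```python
import numpy as np

# One row per tree, one column per sample: how much do the trees disagree?
tree_probs = np.stack([t.predict_proba(X)[:, 1] for t in rf.estimators_])
confidence = tree_probs.std(0)   # high std = trees disagree = low confidence
```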

We've already discussed After I do feature importance, you know, like I said when I had the what 7,000 or so columns that got rid of like all but 30 That doesn't tend to improve the predictions of your random forest very much If at all, but it certainly helps like You know kind of logistically thinking about cleaning up the data You can focus on cleaning those 30 columns stuff like that.

So I tend to remove the low importance variables I'm going to skip over this bit about removing redundant features because it's a little bit outside what we're talking about But definitely check it out in the book something called a dendrogram But what I do want to mention is is the partial dependence this is the thing which says What is the relationship?

This is something called a partial dependence plot, and it's actually not specific to random forests; a partial dependence plot is something you can do for basically any machine learning model. Let's first of all look at one and then talk about how we make it. In this dataset we're looking at the sale price at auction of heavy industrial equipment like bulldozers; this is specifically the Blue Book for Bulldozers Kaggle competition. Here's a partial dependence plot between the year that the bulldozer (or whatever) was made and the price it sold for (this is actually the log price). It goes up: more recently made bulldozers are more expensive, and as you go back to older and older bulldozers they're less and less expensive, to a point, and maybe these ones here are some old classic bulldozers you pay a bit extra for. Now, you might think that you could easily create this plot by simply looking at your data at each year and taking the average sale price, but that doesn't really work very well. I mean, it kind of does, but it kind of doesn't. Let me give an example. It turns out that one of the biggest predictors of sale price for industrial equipment is whether it has air conditioning.

Air conditioning is an expensive thing to add, and it makes the equipment more expensive to buy. Most things didn't have air conditioning back in the 60s and 70s, and most of them do now. So if you plot the relationship between year made and price, you're actually going to be seeing, in large part, how popular air conditioning was.

So you get this cross-correlation going on, when what we just want to know is: what is the impact of the year it was made alone, all else being equal? There's actually a really easy way to do that. We take our dataset, just the training dataset, and leave it exactly as it is, but we take every single row and, for the year-made column, set it to 1950. Then we predict, for every row, what the sale price would have been if it was made in 1950. Then we repeat that for 1951, and for 1952, and so forth, and then we plot the averages.

That does exactly what I just said. Remember I said the special words 'all else being equal'? This is setting everything else equal: everything else is the data as it actually occurred, and we're only varying year made. And that's what a partial dependence plot is. It works just as well for deep learning or gradient boosting trees or logistic regressions or whatever. It's a really cool thing you can do.
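A direct sketch of the recipe just described; the model, frame, and the 'YearMade' column name in the usage comment are illustrative stand-ins, not the book's exact code.

```python
import numpy as np

def partial_dependence(model, X, col, values):
    "Average prediction when `col` is forced to each value, everything else left as it occurred."
    means = []
    for v in values:
        Xv = X.copy()
        Xv[col] = v                        # e.g. set the column to 1950, then 1951, ...
        means.append(model.predict(Xv).mean())
    return np.array(means)

# e.g. partial_dependence(rf, valid_xs, 'YearMade', range(1950, 2012))  # illustrative names
```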

It's a really Cool thing you can do And you can do more than one column at a time, you know, you can do two-way partial dependence plots for example Another one. Okay, so then another one I mentioned was Can you describe why a particular? Prediction was made. So how did you decide for this particular row?

to predict this particular value and This is actually pretty easy to do there's a thing called tree interpreter But we could you could easily create this in about half a dozen lines of code all we do is We're saying okay This customer's come in they've asked for a loan We've put in all of their data through the random forest.

It's bad out of prediction We can actually have a look and say okay. Well that in tree number one What's the path that went down through the tree to get to the leaf node? And we can say oh, well first of all it looked at sex and then it looked at postcode and then it looked at income and so we can see exactly in tree number one which variables were used and what was the change in Gini for each one and Then we can do the same entry to 7 3 3 2 3 4 does this sound familiar?

It's basically the same as our feature importance plot, right? But it's just for this one row of data and so that will tell you basically the feature Importances for that one particular prediction and so then we can plot them Like this. So for example, this is an example of an auction price prediction and According to this plot, you know, so he predicted that the net would be This is just a change from from so I don't actually know what the price is But this is this is how much each one impacted the price.

So Year made I guess this must have been an old attractor. It caused a prediction of the price to go down But then it must have been a larger machine the product size caused it to go up Couple of system made it go up model ID made it go up and So forth, right so you can see the reds says this made this made our prediction go down green made our prediction go up and so overall you can see Which things had the biggest impact on the prediction and what was the direction?

for each one So it's basically a feature importance plot But just for a single row for a single row Any questions John Yeah, there are a couple that have that are sort of queued up this is a good spot to jump to them So first of all Andrew's asking jumping back to the The OOB era, would you ever exclude a tree from a forest if had a if it had a bad out of bag?

era Like if you if you had a I guess if you had a particularly bad Tree in your ensemble. Yeah, like might you just Would you delete a tree that was not doing its thing? It's not playing its part. No you wouldn't If you start deleting trees then you are no longer Having a unbiased prediction of the dependent variable You are biasing it by making a choice.

So even the bad ones will be Improving the quality of the overall average All right. Thank you. Um Zaki a followed up with the question about Bagging and we're just going you know layers and layers here You know, we could go on and create ensembles of bagged models And you know, is it reasonable to assume that they would continue that's not gonna make much difference, right?

If they're all like you could take you a hundred trees split them into groups of ten create ten bagged ensembles And then average those but the average of an average is the same as the average You could like have a wider range of other kinds of models You could have like neural nets trained on different subsets as well But again, it's just the average of an average will still give you the average Right.

So there's not a lot of value in kind of structuring the ensemble You just I mean some some ensembles you can structure but but not bagging bagging's the simplest one It's the one I mainly use There are more sophisticated approaches, but this one Is nice and easy All right, and there's there's one that Is a bit specific and it's referencing content you haven't covered but we're here now.

So And it's on explainability so feature importance of Random forest model sometimes has different results when you compare to other explainability techniques Like SHAP shap or lime And we haven't covered these in the course, but Amir is just curious if you've got any thoughts on which is more accurate or reliable Random forest feature importance or other techniques?

I Would lean towards More immediately trusting random forest feature importances over other techniques on the whole On the basis that it's very hard to mess up a random forest So Yeah, I feel like pretty confident that a random forest feature importance is going to Be pretty reasonable As long as this is the kind of data which a random forest is likely to be pretty good at you know Doing you know, if it's like a computer vision model random forests aren't Particularly good at that And so one of the things that Brian and talked about a lot was explainability and he's got a great essay called the two cultures of statistics in which he talks about I guess what we're nowadays called kind of like data scientists and machine learning folks versus classic statisticians and He he was you know, definitely a data scientist well before the The label existed and he pointed out.

Yeah, you know first and foremost You need a model that's accurate. It is to make good predictions a model that makes bad predictions Will also be bad for making explanations because it doesn't actually know what's going on So if you know if you if you've got a deep learning model that's far more accurate than your random forest then it's you know Explainability methods from the deep learning model will probably be more useful because it's explaining a model It's actually correct Alright, let's take a 10-minute break and we'll come back at 5 past 7 Welcome back one person pointed out I noticed I got the chapter wrong.

It's chapter 9 not chapter 8 in the book I guess I can't read Somebody asked during the break about overfitting Can you overfit a random forest? Basically, no, not really adding more trees will make it more accurate It kind of asymptotes so you can't make it infinitely accurate by using infinite trees, but certainly, you know adding more trees won't make it worse If you don't have enough trees and you Let the trees grow very deep that could overfit So you just have to make sure you have enough trees Radak told me about experiment he did during that Radak told me during the break about an experiment he did Which is something I've done something similar which is adding lots and lots of randomly generated columns to a data set and Try to break the random forest and If you try it, it basically doesn't work.

It's like it's really hard to confuse a random forest by giving it lots of meaningless data it does an amazingly good job of picking out The the useful stuff as I said, you know, I had 30 useful columns out of 7,000 and it found them perfectly well And often, you know when you find those 30 columns You know, you could go to you know I was doing consulting at the time go back to the client and say like tell me more about these columns That's and they'd say like oh well that one there.

We've actually got a better version of that now There's a new system, you know, we should grab that and oh this column actually that was because of this thing that happened last year But we don't do it anymore or you know, like you can really have this kind of discussion about the stuff you've zoomed into You know There are other things that you have to think about with lots of kinds of models like particularly regression models things like interactions You don't have to worry about that with random forests like because you split on one column and then split on another column You get interactions for free as well Normalization you don't have to worry about you know, you don't have to have normally distributed columns So, yeah, definitely worth a try now something I haven't gone into Is gradient boosting But if you go to explain.ai You'll see that my friend Terrence and I have a three-part series about gradient boosting including pictures of golf made by Terrence But to explain gradient boosting is a lot like random forests but rather than training a model training now fitting a tree again and again and again on different random subsets of the data Instead what we do is we fit very very very small trees to hardly ever any splits and We then say okay.

What's the error? So, you know so imagine the simplest tree would be a one-hour rule tree of Male versus female say and then use you take what's called the residual That's the difference between the prediction and the actual the error and then you create another tree which attempts to predict that very small tree and then you create another very small tree which track tries to predict the error from that and So forth each one is predicting the residual from all of the previous ones.

And so then to calculate a prediction Rather than taking the average of all the trees you take the sum of all the trees because each one is predicted the difference between the actual and All of the previous trees and that's called boosting versus bagging so boosting and bagging are two kind of meta-ensembling techniques and When bagging is applied to trees, it's called a random forest and when boosting is applied to trees It's called a gradient boosting machine or gradient boosted decision tree Gradient boosting is generally speaking more accurate than random forests But you can absolutely over fit and so therefore It's not necessarily my first go-to thing having said that there are ways to avoid over fitting But yeah, it's just it's it's not It's it you know because it's breakable it's not my first choice But yeah, check out our stuff here if you're interested and you know, you there is stuff which largely automates the process There's lots of hyper parameters.
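A toy, self-contained sketch of that boosting loop (random data, no learning rate or regularisation, so not a production gradient boosting machine):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

trees, resid = [], y.copy()
for _ in range(20):
    stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)  # a very small tree
    resid -= stump.predict(X)                                 # what is still unexplained
    trees.append(stump)

boosted_pred = sum(t.predict(X) for t in trees)               # sum of the trees, not their average
```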

There are lots of hyperparameters you have to select; people generally just try every combination of them, and in the end you should generally be able to get a more accurate gradient boosting model than a random forest, but not necessarily by much. Okay, so that was the Kaggle notebook on how random forests really work.

So, what we've been doing is having this daily walkthrough, where me and, I don't know, 20 or 30 folks get together on a Zoom call and chat about getting through the course, setting up machines, and stuff like that. We've been trying to practise things along the way, and so a couple of weeks ago I wanted to show: what does it look like to pick a Kaggle competition and just do the normal, sensible, kind of mechanical steps that you would do for any computer vision model?

The competition I picked was Paddy Disease Classification, which is about recognising rice diseases in rice paddies. And yeah, I spent, I don't know, a couple of hours or three, a few hours, throwing something together, and I found that I was number one on the leaderboard. And I thought, oh, that's interesting, because you never quite have a sense of how well these things work.

And then I thought, well, there are all these other things we should be doing as well, and I tried three more things, and each time I tried another thing I got further ahead at the top of the leaderboard. So I thought it would be cool to take you through the process. I'm going to do it reasonably quickly, because the walkthroughs are all available for you to see the entire thing in six or seven hours of detail, or however long the conversations were, but I want to take you through the basic process I went through. Since I've been starting to do more stuff on Kaggle, I realised there are some kind of menial steps I have to do each time, particularly because I like to run stuff on my own machine and then upload it to Kaggle.

So to make my life easier I created a little module called fastkaggle, which you'll see in my notebooks now, and which you can download from pip or conda. As you'll see, it makes some things a bit easier. For example, to download the data for Paddy Disease Classification, you just run setup_comp and pass in the name of the competition. If you are on Kaggle, it will return a path to that competition data that's already on Kaggle. If you are not on Kaggle and you haven't downloaded the data, it will download and unzip it for you. If you're not on Kaggle and you have already downloaded and unzipped the data, it will return a path to the copy you've already got. Also, if you are on Kaggle, you can ask it to make sure that pip packages are installed or up to date.

Otherwise So this basically one line of code now gets us all set up and ready to go so this path So I ran this particular one on my own machine so it's downloaded and unzipped the data I've also got links to the Six walkthroughs so far. These are the videos Oh, yes, and here's my result after these For attempts that's a few fiddling around at the start So the overall approach at is well and this is not just to a Kaggle competition right at the reason I like looking at Kaggle competitions is You can't hide from the truth In a Kaggle competition, you know when you're working on some work project or something You might be able to convince yourself and everybody around you that you've done a fantastic job of not overfitting and your models better than what anybody else could have made and whatever else but The brutal assessment of the private leaderboard Will tell you the truth Is your model actually predicting things correctly and is it overfit?

um Until you've been through that process You know, you're never gonna know and a lot of people don't go through that process because at some level they don't want to know But it's okay, you know, nobody needed it you don't have to put your own name there I Always did right from the very first one.

I wanted, you know, if I was gonna screw up royally I wanted to have the pressure on myself of people seeing me in last place but you know, it's it's fine you could do it all and honestly and You'll actually find As you improve you also have so much self-confidence, you know and The stuff we do in a Kaggle competition is indeed a subset of the things we need to do in real life but It's an important subset, you know building a model that actually predicts things correctly and doesn't overfit is important and furthermore structuring your code and analysis in such a way that you can keep improving over a three-month period without gradually getting into more and more of a tangled mess of impossible to understand code and Having no idea what untitled copy 13 was and why it was better than 25 right, this is all stuff you want to be practicing ideally Well away from customers or whatever, you know before you've kind of figured things out So the things I talk about here about doing things well in this Kaggle competition Should work, you know in other settings as well And so these are the two focuses that I recommend Get a really good validation set together.

We've talked about that before right and in a Kaggle competition That's like it's very rare to see people do well in a Kaggle competition who don't have a good validation set sometimes that's easy and this competition actually it is easy because the the Test set seems to be a random example But most of the time it's not actually I would say And then how quickly can you iterate?

How quickly can you try things and find out what worked? So obviously you need a good validation set. Otherwise, it's impossible to iterate and So quickly iterating means not saying what is the biggest? You know open AI takes four months on a hundred TPUs model that I can train it's what can I do that's going to train in a minute or so and Will quickly give me a sense of like well, I could try this I could try that what things gonna work and then try You know 80 things It also doesn't mean that saying like, oh I heard this is amazing you Bayesian hyper parameter tuning approach.

I'm gonna spend three months implementing that because that's gonna like give you one thing but actually do well and In these competitions or in machine learning in general, you actually have to do everything reasonably well And doing just one thing really well will still put you somewhere about last place So I actually saw that a couple of years ago Aussie guy who's very very distinguished machine learning practitioner Actually put together a team entered the Kaggle competition and literally came in last place Because they spent the entire three months trying to build this amazing new fancy thing and Never actually never actually iterated if you iterate a guarantee you won't be in last place Okay, so here's how we can grab our data with fast Kaggle and it gives us tells us what path it's in And then I set my random seed And I only do this because I'm creating a notebook to share, you know when I share a notebook I like to be able to say as you can see, this is point eight three blah blah blah, right and Know that when you see it, it'll be point eight three as well But when I'm doing stuff, otherwise, I would never set a random seed I want to be able to run things multiple times and see how much it changes each time because that'll give me a sense of like The modifications I'm making changing it because they're improving it making it worse or is it just random variation So if you or if you always set a random seed That's a bad idea because you won't be able to see the random variation.

So this is just here for presenting a notebook Okay, so the data they've given us as usual they've got a sample submission they've got some test set images They've got some training set images a CSV file about the training set And then these other two you can ignore because I created them So let's grab a path To train images and so do you remember?

Get image files. So that gets us a list of the file names of all the images here recursively So we could just grab the first one and Take a look. So it's 480 by 640 Now we've got to be careful This is a pillow image Python imaging library image In the imaging world.

They generally say columns by rows in The array slash tensor world. We always say rows by columns So if you ask pie torch what the size of this is, it'll say 640 by 480 and I guarantee at some point This is going to bite you. So try to recognize it now Okay, so they're kind of taller than they are.

There's at least this one is taller than it is wide so I'd actually like to know are they all this size because it's really helpful if they all are all the same size or at least similar Believe it or not the amount of time it takes to decode a JPEG is actually quite significant And so figuring out what size these things are is actually going to be pretty slow But my fast core library has a parallel sub module which can basically do anything That you can do in Python.

It can do it in parallel. So in this case, we wanted to create a pillow image and get its size So if we create a function that does that and pass it to parallel passing in the function and the list of files It does it in parallel and that actually runs pretty fast And so here is the answer How this happened ten thousand four hundred and three images are indeed 480 by 640 and four of them aren't So basically what this says to me is that we should pre-process them or you know At some point process them so that they're probably all for 80 by 640 or all basically the kind of same size We'll pretend they're all this size But we can't not do some initial resizing.
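A sketch of that parallel size check, assuming the data is already at `path` from the setup step above (not necessarily the notebook's exact code):

```python
from fastai.vision.all import *
from fastcore.parallel import parallel
from PIL import Image
import pandas as pd

trn_path = path/'train_images'
files = get_image_files(trn_path)

def get_size(p): return Image.open(p).size      # PIL reports (columns, rows), i.e. width x height
sizes = parallel(get_size, files, n_workers=8)  # check every file, in parallel
pd.Series(sizes).value_counts()
```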

Otherwise, this is going to screw things up So like that probably the easiest way to do things the most common way to do things is to Either squish or crop every image to be a square So squishing is when you just in this case squish the aspect ratio down As opposed to cropping randomly a section out, so if we call resize squish it will squish it down And so this is 480 by 480 squared.

So this is what it's going to do to all of the images first on the CPU That allows them to be all batched together into a single mini batch Everything in a mini batch has to be the same shape otherwise the GPU won't like it and then that mini batch is put through data augmentation and It will Grab a random subset of the image and make it at 128 by 128 pixel And here's what that looks like.

Here's our data So show batch works for pretty much everything not just in the fast AI library But even for things like fast audio, which are kind of community based things You should be to use show batch on anything and and see or hear or whatever what your data looks like I don't know anything about rice disease But apparently these are various rice diseases and this is what they look like So, um, I I jump into creating models much more quickly than most people Because I find model, you know models are a great way to understand my data as we've seen before So I basically build a model as soon as I can and I want to Create a model that's going to let me iterate quickly.

So that means that I'm going to need a model that can train quickly so Thomas Kapel and I recently Did this big project the best vision models of fine-tuning Where we looked at nearly a hundred different architectures from from Ross Whiteman's Tim library Pytorch image model library and looked at Which ones could we fine-tune which ones had the best transfer learning results And we tried two different data sets very different data sets One is the pets data set that we've seen before So trying to predict what breed of pet is from 37 different breeds and the other was a Satellite imagery data set called planet.

They're very very different data sets in terms of what they contain and also very different sizes The planet ones a lot smaller the pets ones a lot bigger And so the main things we measured were how much memory did it use? How accurate was it and how long did it take to fit?

And then I created this score which can which combines the fit time and error rate together And So this is a really useful table For picking a model and now in this case. I want to pick something that's really fast and there's one clear winner on speed which is resnet 26 D and So its accuracy was 6% versus the best was like 4.1% So okay, it's not amazingly accurate, but it's still pretty good, and it's gonna be really fast So that's why I picked resnet 2016 a lot of people think that when they do deep learning they're going to spend all of their time learning about exactly how a resnet 26 D is made and convolutions and resnet blocks and transformers and blah blah blah we will cover all that stuff In part two and a little bit of it next week But it almost never matters Right, it's just it's just a function right and what matters is the inputs to it and the outputs to it And how fast it is how accurate it is So let's create a learner which with a resnet 26 D from our data loaders and Let's run LR find so LR find Will put through one mini batch at a time starting at a very very very low learning rate and gradually increase the learning rate and track the loss and Initially the learn the loss won't improve because the learning rate is so small It doesn't really do anything and at some point the learning rates high enough that the loss will start coming down Then at some other point the load the learning rate so high that it's gonna start jumping past the answer and it's got a bit worse And so somewhere around here is a learning rate.

We've got a couple of different ways of making suggestions for it. I generally ignore them, because these suggestions are specifically designed to be conservative: they're a bit lower than perhaps the optimum, in order to make sure we don't recommend something that totally screws up. But I like to ask, well, how far right can I go and still see the loss clearly improving quickly? And so I picked somewhere around 0.01 for this. So I can now

fine-tune our model with a learning rate of 0.01 for three epochs, and look, the whole thing took a minute. That's what we want, right? We want to be able to iterate rapidly, just a minute or so. That's just enough time for me to go and grab a glass of water or do a bit of reading; it's not long enough that I'm going to get too distracted.

And what do we do before we submit? Nothing: we submit as soon as we can. Okay, let's get our submission in; we've got a model, let's get it in. We read in our CSV file of the sample submission, and the CSV file basically needs a list of the image file names, in order, and then a column of labels. We can get all the image files in the test images folder like so, and we can sort them. Now what we want is a DataLoader which is exactly like the DataLoader we used to train the model, except pointing at the test set; we want to use exactly the same transformations. There's actually a dls.test_dl method which does that: you just pass in the new set of items, the test set files. This gives a DataLoader we can use for our test set. A test DataLoader has a key difference from a normal DataLoader, which is that it does not have any labels; that's a key distinction. So we can get the predictions for our learner, passing in that DataLoader, and in the case of a classification problem, you can also ask for them to be decoded. Decoded means that, rather than just being returned the probability of every rice disease for every row, it'll tell you the index of the most probable rice disease. That's what decoded means. So that returns probabilities,

targets (which obviously will be empty because it's a test set, so we throw them away), and those decoded indexes, which look like this: numbers from 0 to 9, because there are 10 possible rice diseases. The Kaggle submission does not expect numbers from 0 to 9; it expects to see strings like these. So what do those numbers from 0 to 9 represent? We can look up our vocab to get a list: that's 0, that's 1, and so on up to 9.

We can look up our vocab to get a list: so that's 0, that's 1, etc., and that's 9. I realized later this is a slightly inefficient way to do it, but it does the job. I need to be able to map these to strings, so if I enumerate the vocab, that gives me pairs: 0, bacterial leaf blight; 1, bacterial leaf streak; etc. I can then create a dictionary out of that, and then I can use pandas to look up each thing in the dictionary.

They call that map. If you're a pandas user, you've probably seen map used before by being passed a function, which is really, really slow. But if you pass map a dict, it's actually really, really fast; do it this way if you can. So here are our predictions, and we've got our sample submission file, ss.
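
Something like this, as a sketch of the dict-plus-map approach:

    # enumerate(vocab) pairs each index with its disease name; mapping through a dict is fast
    mapping = dict(enumerate(dls.vocab))
    results = pd.Series(idxs.numpy(), name='idxs').map(mapping)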

So if we replace this label column with our predictions like so, then we can turn that into a CSV. And remember, the exclamation mark means run a shell command; head shows the first few rows, so let's just take a look. That looks reasonable, so we can now submit that to Kaggle. Now, iterating rapidly means everything needs to be fast and easy. Things that are slow and hard don't just take up your time, they take up your mental energy.
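
Filling in the submission file is then just:

    ss['label'] = results
    ss.to_csv('subm.csv', index=False)
    # in the notebook, `!head subm.csv` then shows the first few rows as a sanity check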

So even submitting to Kaggle needs to be fast. So I put it into a cell, so I can just run that cell: it calls the Kaggle API to submit this CSV file and gives the submission a description. Just run the cell and it submits to Kaggle, and as you can see it says here, successfully submitted. So, that submission was terrible: top 80%, also known as bottom 20%, which is not too surprising, right?
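
The submission cell is roughly along these lines; the competition slug and message below are placeholders, and this assumes Kaggle API credentials are already set up (the lesson uses a similar one-liner through the Kaggle API):

    from kaggle import api
    api.competition_submit('subm.csv', 'initial resnet26d 128px', 'paddy-disease-classification')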

I mean, it's one minute of training time, but it's something that we can start with. However long it takes to get to this point where you've put in your submission, now you've really started, right, because then tomorrow you can try to make a slightly better one. I also like to share my notebooks, and even sharing the notebook I've automated: part of fastkaggle is you can use this thing called push_notebook, and that sends it off to Kaggle to create a notebook on Kaggle. There it is.
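
In code that's a single call; the username, notebook name, and keyword arguments below are assumptions for illustration, so check the fastkaggle docs for the exact signature:

    from fastkaggle import push_notebook

    # hypothetical example: push a local notebook file to Kaggle as a public notebook
    push_notebook('your-username', 'first-steps-road-to-the-top-part-1',
                  title='First Steps: Road to the Top, Part 1',
                  file='first-steps-road-to-the-top-part-1.ipynb',
                  competition='paddy-disease-classification',
                  private=False, gpu=False)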

And there's my score; as you can see, it's exactly the same thing. Why would you create public notebooks on Kaggle? Well, it's the same brutality of feedback that you get for entering a competition, but this time, rather than finding out in no uncertain terms whether you can predict things accurately, you find out in no uncertain terms whether you can communicate things in a way that people find interesting and useful. And if you get zero votes, you know, so be it, right? That's something to know. Then ideally go and ask some friends, "What do you think I could do to improve?", and if they say "oh nothing, it's fantastic", you can tell them no, that's not true, I didn't get any votes.

I'll try again; this isn't good, how do I make it better? And you can try and improve, because if you can create models that predict things well, and you can communicate your results in a way that is clear and compelling, you're a pretty good data scientist. Those are two pretty important things, and so here's a great way to test yourself out on those things and improve.

Yes, John? Yes, Jeremy. We have, I think, a timely question here from Zakiya about your iterative approach. They're asking: do you create different Kaggle notebooks for each model that you try? So one Kaggle notebook for the first one, then separate notebooks subsequently, or do you append to the bottom of it?

What's your strategy? That's a great question, and I know Zakiya is going through the daily walkthroughs but isn't quite caught up yet, so I will say keep it up, because in the six hours of going through those you'll see me create all the notebooks. But if I go to the actual directory I used, you can see them. So basically, yeah, I started with what you just saw, a bit messier and without the prose, but that same basic thing.

I then duplicated it to create the next one, which is here, and because I duplicated it, this stuff which I still need is still there, right? And so I run it, and I don't always know what I'm doing, you know. At first, when I duplicate it, it will be called something like "first steps on the road to the top part one - Copy 1", and that's okay. As soon as I can, I'll try to rename it, once I know what I'm doing. Or if it doesn't seem to go anywhere, I'll rename it to something like "experiment blah blah blah", put some notes at the bottom, and maybe put it into a failed folder or something. But yeah, it's a very low-tech approach that I find works really well: just duplicating notebooks, editing them, naming them carefully, and putting them in order. And, you know, put the file name in when you submit as well. Then of course, if you've got things in git, you can have a link to the git commit, so you'll know exactly what it is. Generally speaking, my notebooks will only have one submission in them, and then I'll move on and create a new notebook, so I don't really worry about versioning so much. But you can do that as well if that helps you. So that's basically what I do, and I've worked with a lot of people who use much more sophisticated and complex processes and tools and stuff, but none of them seem to be able to stay as well organized as I am. I think they kind of get a bit lost in their tools sometimes; file systems and file names, I think, are good. Great, thanks.

Um, so moving away from that kind of dev process, more towards the specifics of finding the best model and all that sort of stuff, we've got a couple of questions in the same space. We've got some people here talking about AutoML frameworks, which you might want to touch on for people who haven't heard of those. Do you have any particular AutoML frameworks you think are worth recommending? Or just more generally, how do you go about trying different models: random forest, gradient boosting, neural network?

So in that space, if you could comment. Sure. I use AutoML less than anybody I know, I would guess, which is to say never. Hyperparameter optimization: never. The reason why is I like being highly intentional. I like to think more like a scientist, have hypotheses and test them carefully, and come out with conclusions which I then implement. So for example, in this "best vision models for fine-tuning" analysis, I didn't try a huge grid search of every possible model, every possible learning rate, every possible pre-processing approach, blah blah blah. Instead, step one was to find out: which things matter?

So for example, does whether we squish or crop make a difference? Are some models better with squish and some models better with crop? We just tested that; again, not for every possible architecture, but for one or two versions of each of the main families. That took 20 minutes, and the answer was no: in every single case, the same thing was better.

So we don't need to do a grid search over that anymore. Another classic one is learning rates. Most people do a kind of grid search over learning rates, or they'll train a thousand models with different learning rates. But this fantastic researcher named Leslie Smith invented the learning rate finder a few years ago, and we implemented it.

I think within days of it first coming out as a technical report, and that's what I've used ever since, because it works well and runs in a minute or so. Then, as for neural nets versus GBMs versus random forests, that shouldn't be too much of a question on the whole; they have pretty clear places where they go. If I'm doing computer vision, I'm obviously going to use a computer vision deep learning model. And which one would I use?

Well, if I'm transfer learning, which hopefully is always, I would look up the two tables here. This is my table for Pets, which shows which models are best at fine-tuning to things very similar to what they were pre-trained on, and then the same thing for Planet, which shows which ones are best at fine-tuning to data sets that are very different from what they were trained on. As it happens, in both cases they're very similar; in particular, ConvNeXt is right up towards the top in both cases. So I just like to have these rules of thumb. My rule of thumb for tabular is: a random forest is going to be the fastest, easiest way to get a pretty good result; a GBM is probably going to give me a slightly better result if I need it and can be bothered fussing around. For a GBM, actually, I probably would run a hyperparameter sweep, because it is fiddly and it's fast, so you may as well.

So you may as well So yeah, so now you know, we were able to make a slightly better submission slightly better model and so I had a couple of thoughts about this. The first thing was that thing trained in A minute on my home computer and then when I uploaded it to Kaggle it took about four minutes per epoch which was horrifying and Kaggle's GPUs are not amazing, but they're not that bad So I do something was up And what was up is I realized that they only have two Virtual CPUs, which nowadays is tiny like, you know, you generally want is a rule of thumb about eight physical CPUs per GPU And So spending all of its time just reading the damn data Now the data was 640 by 480 and we were ending up with any 128 pixel size bits for speed So there's no point doing that every epoch so step one was to make my Kaggle iteration faster as well.

And so a very simple thing to do: resize the images. fastai has a function called resize_images, and you say, okay, take all the train images and stick them in the destination, making them this maximum size, recursively, and it will recreate the same folder structure over there. That's why I called this the training path, because this is now my training data. When I then trained on that on Kaggle, it went about four times faster, with no loss of accuracy.
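
A sketch of that resizing step (the source and destination paths are assumptions):

    trn_path = Path('sml')
    # shrink every training image to at most 256 pixels on the long side,
    # recursing into subfolders and recreating the same folder structure under trn_path
    resize_images(path/'train_images', dest=trn_path, max_size=256, recurse=True)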

So step one was to actually get my fast iteration working. Now, a minute is still a long time, and on Kaggle you can actually see this little graph showing how much the CPU is being used and how much the GPU is being used; on your own home machine there are free tools to do the same thing. I saw that the GPU was still hardly being used.

The CPU was still being driven pretty hard. I wanted to use a better model anyway to move up the leaderboard, so I moved from... oh, by the way, this graph is very useful. This is speed versus error rate by family, and we're about to be looking at these ConvNeXt models. So we're going to be looking at this one, convnext_tiny; here it is, convnext_tiny. We were looking at resnet26d, which took this long on this data set, but this one here is nearly the best.

It's third best, but it's still very fast, and so it's the best overall score. So let's use this, particularly because we're still spending all of our time waiting for the CPU anyway. It turned out that when I switched my architecture to ConvNeXt, it basically ran just as fast on Kaggle. So we can then train that. Let me switch to the Kaggle version, because my outputs are missing for some reason. Yeah, so I started out by running the resnet26d on the resized images and got a similar error rate; I ran a few more epochs and got a 12% error rate. Then I did exactly the same thing but with convnext_small, and got a 4.5% error rate.

So don't think that different architectures only give tiny little differences; this is over twice as good. A lot of folks you talk to will never have heard of ConvNeXt, because it's very new, and I've noticed a lot of people tend not to keep up to date with new things.

They kind of learn something at university and then they stop learning. So if somebody's still just using resnets all the time, you can tell them we've actually moved on. Resnets are still probably the fastest, but for the mix of speed and performance, not so much. ConvNeXt, you know, again, you want these rules of thumb, right?

If you're not sure what to do, use ConvNeXt. Okay, and then, like most things, there are different sizes: there's a tiny, there's a small, there's a base, there's a large, there's an extra large. Well, let's look at the picture, this is it here. Large takes longer but has lower error; tiny takes less time but has higher error, right?

So you pick the speed versus accuracy trade-off that's right for you. For us, small is great. So yeah, now we've got a 4.5 percent error, that's terrific. Now let's iterate on Kaggle; this is taking about a minute per epoch there, and on my computer it's probably taking about 20 seconds per epoch, so not too bad. One thing we could try is, instead of using squish as our pre-processing, let's try using crop.

So that will randomly crop out an area, and that's the default: if I remove method='squish', it will crop. You see how I've tried to get everything into a single function? Here's the definition: what architecture do I want to train?

How do I want to transform the items? How do I want to transform the batches, and how many epochs do I want to do? That's basically it, right? So this time I want to use the same ConvNeXt architecture, I want to resize by cropping rather than squishing, and then use the same data augmentation. And okay, the error rate's about the same; it's a tiny bit worse, but not enough to be interesting. Instead of cropping, we can pad. Now, padding is interesting.
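
The single function described here might look something like this, as a minimal sketch under assumed names and defaults rather than the exact notebook code ('convnext_small' stands in for the pre-trained ConvNeXt variant used in the lesson, and assumes a timm-backed fastai install):

    def train(arch, item, batch, epochs=5, lr=0.01):
        "Build dataloaders with the given transforms, fine-tune `arch`, and return the learner."
        dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42,
                                           item_tfms=item, batch_tfms=batch)
        learn = vision_learner(dls, arch, metrics=error_rate)
        learn.fine_tune(epochs, lr)
        return learn

    # cropping instead of squishing is just a change to the item transform
    learn = train('convnext_small', item=Resize(192),
                  batch=aug_transforms(size=128, min_scale=0.75), epochs=12)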

Do you see how these are all square, but they've got black borders? Padding is interesting because it's the only way of pre-processing images which doesn't distort them and doesn't lose anything: if you crop, you lose things; if you squish, you distort things; this does neither. Now, of course, the downside is that there are pixels that are literally pointless; they contain zeros.
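
With the train function sketched above, the padding variant is just another item transform:

    # pad to a square with zeros (black) instead of cropping or squishing
    learn = train('convnext_small',
                  item=Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros),
                  batch=aug_transforms(size=128, min_scale=0.75), epochs=12)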

They contain zeros So every way of getting this working has its compromises but this approach of resizing where we pad with zeros is Not used enough and it can actually often work quite well And this case it was about as good as our best so far But no not huge differences yet What else could we do well What we could do is See these pictures this is all the same picture But it's gone through our data augmentation.

So sometimes it's a bit darker, sometimes it's flipped horizontally, sometimes it's slightly rotated, sometimes it's slightly warped, sometimes it's zooming into a slightly different section, but this is all the same picture. Maybe our model would like some of these versions better than others. So what we can do is pass all of these to our model, get predictions for all of them, and take the average.

So it's our own kind of little mini-bagging approach, and this is called test time augmentation. fastai is very unusual in making that available in a single method: you just call tta, and it will pass multiple augmented versions of each image through and average them for you. So this is the same model as before, which had a 4.5 percent error rate. Instead, we get the TTA predictions and then get the error rate. Wait, why does this say 4.8? Last time I did this it was way better.
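
Roughly, TTA on the validation set looks like this:

    # averaged predictions over several augmented versions of each validation image
    preds, targs = learn.tta(dl=learn.dls.valid)
    error_rate(preds, targs)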

Well, that's messing things up, isn't it? When I did this originally on my home computer, it went from about 4.5 to 3.9, so possibly I just got very bad luck this time. This is the first time I've actually ever seen TTA give a worse result, so that's very weird. I wonder if I should do something other than the crop padding; all right, I'll have to check that out, and I'll try to come back to you and find out why, in this case,

this one was worse. Anyway, take my word for it: every other time I've tried it, TTA has been better. So then, now that we've got a pretty good way of resizing, we've got TTA, and we've got a good training process, let's just make bigger images. Something that's really interesting, and a lot of people don't realize, is that your images don't have to be square; they just all have to be the same size. Given that nearly all of our images are 640 by 480, we can just pick that aspect ratio, so for example 256 by 192, and we'll resize everything to the same rectangular aspect ratio. That should work even better still.
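
With the same train function as before, that's just rectangular sizes in the transforms; the exact numbers below are illustrative, not the lesson's:

    learn = train('convnext_small',
                  item=Resize((320, 240)),                          # keep roughly the 4:3 aspect ratio
                  batch=aug_transforms(size=(256, 192), min_scale=0.75),
                  epochs=12)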

So if we do that, we'll do 12 epochs. Okay, now our error rate's down to 2.2 percent, and then we'll do TTA. Okay, this time you can see it actually improving, down to under 2 percent. So that's pretty cool, right?

We were at Twelve percent and by the time we've got through our little experiments We're down to under 2 percent And nothing about this is in any way specific to Rice or this competition, you know, it's like this is a very Mechanistic, you know standardized Approach which you can use for Certainly any kind of this type of computer vision competition and you have computer vision data set almost But you know, it looked very similar for a collaborative filtering model or tabular model NLP model whatever So, of course again, I want us a bit as soon as I can so just copy and paste the exact same steps I took last time basically for creating a submission So as I said last time we did it using pandas, but there's actually an easier way So the step where here I've got the numbers from 0 to 9 Which is like which which rice disease is it?

Here's a cute idea: we can take our vocab and make it an array, so that's going to be a list of ten things, and then we can index into that vocab with our indices, which is kind of weird. This is a list of ten things; this is a list of, I don't know, four or five thousand things.

So this will give me four or five thousand results, which is the vocab item for each thing. So this is another way of doing the same mapping, and I would spend time playing with this code to understand what it does, because it's the kind of thing that is very fast, not just in terms of writing it, but it also optimizes very well on the CPU.
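
That indexing trick, roughly:

    import numpy as np

    vocab = np.array(learn.dls.vocab)                        # the ten disease names
    results = pd.Series(vocab[idxs.numpy()], name='label')   # one name per test image, all at once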

This is the kind of coding, this kind of indexing, that you want to get used to. Anyway, so then we can submit it just like last time, and when I did that, I got in the top 25%, and that's where you want to be, right? Generally speaking, I find in Kaggle competitions the top 25% is kind of the solid, competent level. That's not to say it's easy; you've got to know what you're doing. But if you get in the top 25%, I think you can really feel like, yeah, this is a very reasonable attempt, and I think this is a very reasonable attempt. Okay, before we wrap up, John, any last questions? Yeah, there are two I think would be good to touch on quickly before you wrap up. One from Victor, asking about TTA: when I use TTA during my training process, do I need to do something special during inference, or is this something you use only... Okay, so just to explain, TTA means test time augmentation.

So specifically it means inference; I think you mean augmentation during training. So yeah, during training you basically always do augmentation, which means you're varying each image slightly, so that the model never sees exactly the same image twice and so it can't memorize it. In fastai, and as I say, I don't think anybody else does this as far as I know, if you call tta it will use the exact same augmentation approach on whatever data set you pass it, but multiple times on the same image, and it will average out the predictions. So you don't have to do anything different.

But if you didn't have any data augmentation in training, you can't use TTA; by default, it uses the same data augmentation you used for training. Great, thank you. And the other one is about how, when you first started this example, you squared the images, and you talked about squishing versus cropping versus clipping and scaling and so on, but then you went on to say that these models can actually take rectangular input, right?

So there's a question that's kind of probing at that: if the models can take rectangular inputs, why would you ever even care, as long as they're all the same size? I find most of the time datasets tend to have a wide variety of input sizes and aspect ratios, so if there are just as many tall skinny ones as wide short ones, it doesn't make sense to create a rectangle, because some of them you're going to really destroy. So that's where the square is the kind of

best compromise, in some ways. There are better things we can do which we don't have any off-the-shelf library support for yet, and I don't know that anybody else has even published about this, but we experimented with trying to batch things with similar aspect ratios together and use the kind of median rectangle for those, and we've had some good results with that.

But honestly, 99.99% of people, given a wide variety of aspect ratios, chuck everything into a square. A follow-up, just out of my own interest: have you ever looked at... you know, the issue with padding, as you say, is that you're putting black pixels there. Those are not NaNs, those are black pixels.

That's right, they're zeros. And so there's something problematic to me, conceptually, about that. You know, when you see, for example, four-by-three aspect ratio footage presented for broadcast in sixteen-by-nine, you get the kind of blurred stretch, that kind of stuff. Oh, we played with that a lot.

Yeah, I used to be really into it, actually, and fastai still by default uses reflection padding, which means if this is, I don't know, say a 20-pixel-wide strip, it takes the 20 pixels next to it, flips it over, and sticks it here. And it looks pretty good.

Another one is copying the border, which simply repeats the outside pixel, and it's a bit more like the TV approach. It turns out none of them really help; if anything, they make it worse, because in the end the computer wants to know: no, this is the end of the image.
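
For reference, these padding styles correspond to the pad_mode options on fastai's Resize, shown here as a sketch:

    pad_zeros   = Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)       # black borders
    pad_reflect = Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Reflection)  # mirror the edge pixels
    pad_border  = Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Border)      # repeat (copy) the edge pixel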

There's nothing else here. And if you reflect it, for example, then you're kind of creating weird spikes that didn't exist, and the computer's got to be like, oh, I wonder what that spike is. So yeah, it's a great question, and I obviously spent a couple of years assuming that we should be doing things that look more image-like, but actually the computer likes things to be presented to it in as straightforward a way as possible. All right, thanks everybody, and I hope to see some of you in the walkthroughs, and otherwise see you next time.