Live coding 14
Chapters
0:00 Questions
0:05 About the concept/capability of early stopping
4:00 Different models, which one to use
5:25 Gradient Boosting Machine with different model predictions
7:25 AutoML tools
7:50 Kaggle winners' approaches, ensembling
9:00 Test Time Augmentation (TTA)
11:00 Training loss vs validation loss
12:30 Averaging a few augmented versions
13:50 Unbalanced dataset and augmentation
15:00 On balancing datasets
15:40 WeightedDL, Weighted DataLoader
17:55 Weighted sampling on Diabetic Retinopathy competition
19:40 Let's try something…
21:40 Setting an environment variable when having multiple GPUs
21:55 Multi-target model
23:00 Debugging
27:04 Revise transforms to 128x128 and 5 epochs
28:00 Progressive resizing
29:16 Fine-tuning again but on larger 160x160 images
34:30 Oops, small bug, restart (without creating a new learner)
37:30 Re-run second fine-tuning
40:00 How did you come up with the idea of progressive resizing?
41:00 Changing things during training
42:30 On the paper Fixing the train-test resolution discrepancy
44:15 Fine-tuning again but on larger 192x192 images
46:11 A detour about paper reference management
48:27 Final fine-tuning 256x192
49:30 Looking at WeightedDL, WeightedDataLoader
57:08 Back to the results of fine-tuning 256x192
58:20 Question leading to look at callbacks
59:18 About SaveModelCallback
60:56 Contributing, Documentation, and looking at "Docments"
63:50 Final questions: lr_find()
64:50 Final questions: Training for longer, decreasing validation loss, epochs, error rate
66:15 Final questions: Progressive resizing and reinitialization
68:00 Final questions: Resolution-independent models
00:00:05.960 |
So, Jeremy, in the training process in fastai, is there a concept or capability 00:00:14.520 |
to do, like, early stopping or best-model kind of thing, or if there isn't, is there a reason? 00:00:25.320 |
I never remember, because I don't use it myself. 00:00:28.400 |
So what I would check, I'm just checking now, is the callbacks, which is under training. 00:00:33.680 |
So let's go to the docs, training, callbacks. 00:01:03.440 |
So perhaps the more interesting part, then, is why do I not use it, such that I don't even remember it. 00:01:16.340 |
One is that it doesn't play nicely with one cycle training or fine tuning. 00:01:29.300 |
If you stop early, then the learning rate hasn't got a chance to go down. 00:01:36.720 |
And for that reason, it's almost never the case that earlier epochs have better accuracy 00:01:47.320 |
because the learning rate hasn't settled down yet. 00:01:52.000 |
If I was doing one cycle training and I saw that an earlier epoch had a much better accuracy, 00:01:59.920 |
then I would know that I'm overfitting in which case I would be adding more data augmentation 00:02:06.800 |
rather than doing early stopping, because it's good to train for the full amount of time. 00:02:14.320 |
So yeah, I can't think offhand of a situation where I would, I mean, I haven't come across 00:02:19.520 |
a situation where I've personally wanted to use early stopping. 00:02:25.520 |
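For reference, since the question was how you would do it at all: a minimal sketch of attaching fastai's EarlyStoppingCallback, assuming a `dls` like the ones built in these walkthroughs; the numbers are illustrative, not from the session.

```python
from fastai.vision.all import *

# EarlyStoppingCallback is one of fastai's tracker callbacks: it halts
# training once the monitored metric stops improving for `patience` epochs.
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(20, cbs=EarlyStoppingCallback(monitor='error_rate', patience=3))
```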
Like in some of the training examples, like where you had the error rate, like some of 00:02:31.080 |
the prior runs may have had a better, lower error rate. 00:02:36.720 |
Oh, I mean, in the ones I've shown, like a tiny bit better, yeah, but, like, not enough to matter. 00:02:48.000 |
And yeah, so that there's no reason to believe that those would, those are actually better 00:02:54.080 |
models and there's plenty of a prior reason to believe that they're actually not, which 00:02:57.680 |
is that the learning rate still hasn't settled down at that point. 00:03:00.280 |
So we haven't let it fine tune into the best spot yet. 00:03:06.240 |
So yeah, if it's kind of going down, down and down and it's kind of bottoming out and 00:03:10.960 |
just bumps a little bit at the bottom, that's not a reason to use early stopping. 00:03:18.600 |
And it's also, I think, important to realize that the validation set is relatively small 00:03:25.720 |
So it's only a representation of the distribution that the data is coming from. 00:03:31.320 |
So reading too much into those small fluctuations can be very counterproductive. 00:03:38.440 |
I know that I've wasted a lot of time in the past doing that, but yeah, a lot of time. 00:03:46.840 |
We're looking for changes that dramatically improve things, you know, like changing from 00:03:51.820 |
ResNet26d to ConvNeXt, and we improved by what, 400 or 500%, and it's like, okay, that's 00:03:59.640 |
Over the weekend, I went on my own server that I have here behind me, that I hit via 00:04:09.160 |
an API, and I ran all, like, 35 models for the paddy thing. 00:04:21.160 |
I didn't do the example, but I was thinking about this, that when I was taking algebra 00:04:28.440 |
back in high school or college, you have some of these piecewise expressions, like the function 00:04:33.840 |
of x is equal to x squared for x greater than something, and the absolute value of x otherwise. 00:04:46.040 |
So it just got me, my idea, the idea is that maybe some of the dataset is going to miss 00:04:56.040 |
the target value for every single one of the models that we tried, but if we try different models... 00:05:07.760 |
I mean, of course we can, but I mean, what will be the easiest approach to say for this 00:05:13.400 |
validation when X is equal to this or greater than that, this is the model to use, but then 00:05:20.600 |
if this is the other model, this is what you have to use. 00:05:24.280 |
Yeah. I mean, you could do that, right? And like a really simple way to do that, which 00:05:36.080 |
I've seen used for some success on Kaggle is to train lots of models and then to train 00:05:44.440 |
a gradient boosting machine whose inputs are those model predictions that whose output 00:05:51.960 |
is the targets. And so that'll do exactly what you just described. 00:05:57.120 |
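As a hedged sketch of that stacking idea, with random arrays standing in for the per-class probability outputs of models you have already trained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_samples, n_classes = 500, 10
# Stand-ins for two trained models' predicted probabilities on held-out data
preds_a = rng.random((n_samples, n_classes))
preds_b = rng.random((n_samples, n_classes))
y = rng.integers(0, n_classes, n_samples)      # the true targets

X = np.hstack([preds_a, preds_b])              # model predictions become features
gbm = GradientBoostingClassifier().fit(X, y)   # learns which model to trust where
```

As noted next, this is very easy to overfit, so a real version would fit and evaluate the GBM on out-of-fold predictions.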
It's very easy to overfit when you do that. And if you've 00:06:08.960 |
trained them well, you're only going to get a tiny increase, right? Because the neural 00:06:15.400 |
nets are flexible, it shouldn't have that situation where in this part of the space it 00:06:22.120 |
has bad predictions and in that part of the space it has good predictions. 00:06:30.880 |
If you had a variety of, like, totally different types of model, like a random 00:06:35.960 |
forest, a GBM, and a neural net, I could see that maybe, but most of the time, one 00:06:45.680 |
of those will be dramatically better than the other ones. And so like, I don't that 00:06:49.240 |
often find myself wanting to ensemble across totally different types of model. So I'd say 00:06:56.720 |
it's another one of these things like early stopping, which like a lot of people waste 00:07:03.840 |
huge amounts of time on, you know, and it's not really where the big benefits are going 00:07:09.880 |
to be seen. But yeah, if you're like in gold medal zone on a Kaggle competition and you 00:07:15.560 |
need another 0.002% or something, then these are all things you can certainly try at that point. 00:07:23.480 |
It kind of reminded me of AutoML, like the regime of tools. I don't know how you feel 00:07:37.160 |
Yeah, we talked about that in last night's lesson actually. So you'll have to catch up 00:07:41.680 |
to see what I, what I said, if you haven't seen the lesson yet. Yeah. I'll mention also 00:07:49.040 |
reading Kaggle winners descriptions of their approaches is, is great. But you've got to 00:07:55.480 |
be very careful because remember, like Kaggle winners are the people who did get that last 00:08:01.720 |
0.002%. You know, because like everybody found all the low hanging fruit and the people who 00:08:08.320 |
won grabbed the really high-hanging fruit. And so every time you read a Kaggle winner's 00:08:13.200 |
description, they almost always have complex ensembling methods. And that's why, you know, 00:08:20.940 |
in like something like a big image recognition competition, it's very hard to win or probably 00:08:25.640 |
impossible to win with a single model, unless you invent some amazing new architecture or 00:08:31.080 |
something. And so you're kind of, you might get the impression then that ensembling is 00:08:38.520 |
the big thing that gets you all the low-hanging fruit, but it's not; that's why those solutions 00:08:42.760 |
are particularly complex. Ensembling is a thing that gets you that last fraction of a percent. 00:08:49.600 |
One more question. Yeah, of course, the TTA concept, right? So I mean, TTA, TTA, sorry, 00:09:03.440 |
TTA. Yeah, test time augmentation. So if I understand, like, I'm trying to understand 00:09:11.760 |
conceptually why TTA improves the score, because technically, when you're training, it is using 00:09:20.640 |
those augmented sort of pictures and providing a percentage number. But when 00:09:26.480 |
you're kind of, when you run that TTA function, why is it able to predict better? 00:09:34.080 |
So like, you know how sometimes you're like, looking at some like, I don't know, a screwhead 00:09:44.960 |
or a plug or a socket or something, it's really small, and you can't quite see like, what, 00:09:51.280 |
how many pins are in it or what type is it or whatever. And you're kind of like, look 00:09:54.760 |
at it from different angles, and you're kind of like, put it up to the light, and you try 00:09:58.280 |
to like, at some point, you're like, okay, I see it, right? And there's like some angle 00:10:05.640 |
and some lighting that you can see it. That's what you're doing for the computer, you're 00:10:11.600 |
giving it different angles, and you're giving it different lighting in the hope that in 00:10:16.160 |
one of those, it's going to be really clear. And for the ones where it's easy, it's not 00:10:22.560 |
going to make any difference, right? But for the ones where it's like, oh, I don't know if 00:10:25.480 |
it's this disease or that disease, but oh, you know, when it's a bit brighter, and you 00:10:30.760 |
kind of zoom into that section, like, oh, now I can see. And so when you then average 00:10:36.160 |
them out, you know, all the other ones are all like, oh, I don't know which kind it is, 00:10:40.240 |
I don't know which kind, so it's like 0.5, 0.5, 0.5. And then this one is like 0.6. 00:10:44.560 |
And so that's the one that in the average, it's going to end up picking. That's basically 00:10:49.200 |
what happens. It also has another benefit, which is when we train our models, I don't 00:11:03.320 |
know if you've noticed, but our training loss generally gets much, much lower than our validation 00:11:09.480 |
loss. And so, basically, like, what's happening 00:11:24.320 |
there is that on the training set, the model is getting very confident, right? So even 00:11:29.800 |
though we're using data augmentation, it's seeing slightly different versions of the 00:11:33.400 |
same image dozens of times. And it's like, oh, I know how to recognize these. And so 00:11:40.640 |
what it does is that the probabilities it associates with them are like 0.9, 0.99, 00:11:46.200 |
you know, like, it's like, I'm very confident of these. And it actually gets 00:11:51.200 |
overconfident, which actually doesn't necessarily impact our accuracy, you know, to be overconfident. 00:12:05.160 |
But at some point, it, it can. And so we are systematically going to have like overconfident 00:12:13.760 |
predictions of probability. When, even when it doesn't really know, just because it's 00:12:20.280 |
really seen that kind of image before. So then on the validation set, it's going to be, you 00:12:26.520 |
know, giving overconfident probabilities as well. And so one nice benefit is that when you average 00:12:32.400 |
out a few augmented versions, you know, it's like, oh, point nine, point nine probability 00:12:39.880 |
is this one. And then on the next one, it's like, augmented version with the same image 00:12:43.160 |
like, oh, no, point one probability is that one. And they'll kind of average out to much 00:12:48.560 |
more reasonable probabilities, which can, you know, allow it sometimes to yeah, combine 00:13:01.600 |
these ideas into an average that that makes more sense. And so that can improve accuracy, 00:13:08.240 |
but in particular, it improves the actual probabilities to get rid of that overconfidence. 00:13:15.200 |
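In fastai this is essentially a one-liner; a minimal sketch, assuming `learn` is a trained Learner like the ones in these sessions and the usual fastai imports:

```python
# tta() averages predictions over n augmented versions of each validation
# image (blended with the unaugmented prediction), then returns them.
preds, targs = learn.tta(n=4)
print(error_rate(preds, targs))
```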
Is it fair to say that when you train, it's not able to separate 00:13:23.400 |
the replicated sort of images or the distorted slightly the variant of the original image, 00:13:30.240 |
but when you use the TTA, it is able to group all the four images. 00:13:35.080 |
Right, that's what TTA is: we present them all together and average out that group. Yes. 00:13:42.400 |
But in training, we don't indicate in any way that they're the same image or that they're 00:13:47.480 |
the same underlying object. One of the questions, Jeremy: we keep coming back to ensembling 00:13:59.600 |
and how to pick the best ensemble; I was going to ask about that. But another question 00:14:06.280 |
is, we have a fairly unbalanced data set, I guess, with the normal versus the disease 00:14:13.560 |
states. You're doing augmentation. Is there any benefit to sort of over representing the 00:14:19.040 |
minority classes? So let's let's pull away augmentation. So it's actually got nothing 00:14:24.720 |
to do with augmentation. So more generally, when you're training, does it make sense to 00:14:29.160 |
over represent the minority class? And the answer is, maybe. Yeah, it can. Right. And 00:14:40.480 |
so, okay, so just for those who aren't following, the issue Matt's talking about is that there 00:14:47.480 |
was, you know, a couple of diseases which appear lots and lots in the data, and a couple 00:14:52.440 |
which hardly appear at all. And so, you know, do we want to try to balance this out more? 00:15:02.560 |
And one thing that people often do to balance it out more is that they'll throw away some 00:15:08.560 |
of the images in the highly represented classes. And I can certainly 00:15:13.700 |
tell you straight away, you should never ever do that. You never want to throw away data. 00:15:20.720 |
But Matt's question was, well, could we, you know, over sample the less common diseases? 00:15:29.920 |
And the answer is, yeah, absolutely, you could. And in fastai, you go into the docs. Now, 00:15:42.560 |
where is it? There is a weighted data loader somewhere. Weighted. Search for that. Here 00:15:52.280 |
we go. Of course, it's a callback. So if you go to the callbacks data section, you'll find 00:15:58.040 |
a WeightedDL callback or a weighted_dataloaders method. I'm not sure. No, I'm just telling 00:16:08.420 |
you where to look. Thanks for checking. So, yeah, I mean, let's look at that today, right? 00:16:21.480 |
Because I kind of want to look at, like, things we can do to improve things today. It doesn't 00:16:28.720 |
necessarily help. Because it does mean, you know, given that you're, you know, let's say 00:16:35.060 |
you do 10 epochs of 1,000 images, it's going to get to look at 10,000 images, right? And 00:16:41.240 |
if you over sample a class, then that also means that it's going to get, it's going to 00:16:46.120 |
see less of some images and going to get more repetition of other images, which could be 00:16:53.640 |
a problem, you know? And really, it just depends on, depends on a few things. If it's like 00:17:00.480 |
really unbalanced, like 99% all of one type, then you're going to have a whole lot of batches 00:17:05.980 |
that it never sees anything of the underrepresented class. And so basically, there's nothing for 00:17:11.280 |
it to learn from. So at some point, you almost certainly need weighted sampling. It also 00:17:18.880 |
depends on the evaluation. You know, if people like say in the evaluation, okay, we're going 00:17:22.700 |
to kind of average out for each disease, how accurate you were. So every disease will then 00:17:28.720 |
be like equally weighted, then you would definitely need to use weighted sampling. But in this 00:17:34.640 |
case, you know, presuming, presuming that the test set has a similar distribution as 00:17:40.760 |
a training set, weighted sampling might not help. Because they're going to care the most 00:17:48.120 |
about how well we do on the highly represented diseases. 00:17:52.680 |
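A rough sketch of what weighted sampling could look like for a dataset like this one; `dblock` and `train_path` are assumed from earlier in the session, and the inverse-frequency weighting is my illustration, not code from the walkthrough:

```python
from collections import Counter
from fastai.vision.all import *

files = get_image_files(train_path)
dsets = dblock.datasets(files)
# Weight each *training* item by the inverse frequency of its class,
# so rarer diseases get sampled more often.
labels = [parent_label(files[i]) for i in dsets.splits[0]]
counts = Counter(labels)
wgts = [1 / counts[l] for l in labels]
dls = dsets.weighted_dataloaders(wgts, bs=64)
```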
I'll note my experience with, like, oversampling and things like that. I think one time I had 00:18:02.840 |
done this with, I think, diabetic retinopathy; there was a competition for that. And I had used 00:18:09.640 |
weighted sampling or oversampling, and it did seem to help. And then also a while back, 00:18:14.360 |
I did an experiment where I think this was back with fast AI version one, where I took 00:18:20.320 |
like the MNIST dataset, and then I, like, artificially added some sort of imbalance. 00:18:29.600 |
And then I trained with and without weighted sampling. And I saw like there was an improvement 00:18:35.520 |
with the weighted sampling on accuracy on like just a regular validation set. So from 00:18:42.800 |
that, from those couple experiments, I'd say like, I've at least seen some help and improvement 00:18:48.440 |
with weighted sampling. Cool. And was that cases where that data set was like, highly 00:18:53.600 |
unbalanced? Or was it more like the data set that we're looking at, at the moment? 00:18:59.600 |
It wasn't highly unbalanced. It was maybe like, I don't know, like, maybe like, yeah, 00:19:05.240 |
just 75% versus 25% or something like that. It's not like 99.99 versus 1%, nothing like 00:19:11.960 |
that. It was more... Oh, well, it wasn't that bad. So let's try it today. Yeah. I see we've 00:19:17.440 |
got a new face today as well. Hello, Zach. Thanks for joining. 00:19:21.880 |
Hey, hey, glad I could finally make these. Yeah. Are you joining from Florida? 00:19:28.760 |
No, I'm in Maryland now. Maryland now. Okay, I have a change. Yes, much more up north. 00:19:38.880 |
Okay, great. So let's let's try something. Okay, so let's connect to my little computer. 00:20:08.840 |
It says, is there a way to shrink my zoom bar out of the way? It takes up so much space. 00:20:25.840 |
Hide floating meeting controls. I guess that's what I want. Control Alt Shift H. Wow. Press 00:20:32.960 |
escape to show floating meeting controls. That doesn't work very well with Vim. Oh, well, 00:20:40.440 |
Control Alt Shift 8. Okay. We're not doing tabular today, so let's get rid of that. So 00:20:58.680 |
I think what I might do is, you know, because we're iterating. Well, I guess we could start 00:21:07.840 |
with the multi-task notebook, because this is our kind of, like, things-to-try-to-improve 00:21:17.200 |
version. I'll close that. I'll leave that open just in case we want to. Okay. By the 00:21:40.360 |
way, if you've got multiple GPUs, this is how you just use one of them. You can just 00:21:44.200 |
set an environment variable. 00:22:07.880 |
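What that looks like in practice, as a sketch (the device index is illustrative, and it must be set before anything touches CUDA):

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'   # make only GPU 1 visible to this process
```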
Okay, so this is where we did the multi-target model. Okay. Just moved everything slightly. 00:22:34.000 |
Comp. Not comp path. Right. Back to where we were. Okay. So now what? What's broken? 00:23:03.320 |
Data block. Get image files. Well, this is working the other day. So I guess we better 00:23:28.920 |
try to do some debugging. So the obvious thing to do would be to call this thing here, get 00:23:38.960 |
image files on the thing that we passed in here, which is train path. Okay, so that's 00:23:47.480 |
working. Then the other thing to do would be to check data by doing show batch. Okay, 00:23:58.200 |
that's working. And I guess, all right, and it's showing you two different things. That's 00:24:07.880 |
good. Oh, is it? Right, we've got the two category blocks. So we can't use this one. 00:24:26.280 |
We have to use this one. So, fit_one_cycle. Yeah, okay. So, to remind you, we have this 00:24:45.640 |
is the one where we had two categories and one input. And to get the two categories, 00:24:57.320 |
we use the parent label and this function, which looked up the variety from this dictionary. 00:25:07.840 |
Okay, and then when we fine tuned it, and let's just check, yes, it equals 42. So that's our 00:25:19.640 |
standard set, we should be able to then compare that to small models trained for 12 epochs. 00:25:36.360 |
And then that was this one. Part two. And let's see. They're not quite 00:26:06.240 |
the same because this was 480 squish. Or else this was rectangular pad. Let's do five epochs. 00:26:31.920 |
Let's do it the same as this one. Yeah, let's do this one. Because we want to be able to 00:26:44.240 |
do quick iterations. Let's see resize 192 squish. There we go. 00:27:12.200 |
And then we trained it for 0.01 with FP16 with five epochs. All right. So this will be our 00:27:41.880 |
base case. Well, you know, I mean, I guess this is our base case 0.045. This will be 00:27:51.600 |
our next case. Okay, so while that's running, the next thing I wanted to talk about is progressive 00:28:06.040 |
resizing. So this is training at a size of 128. Which is not very big. And we wouldn't 00:28:34.400 |
expect it to do very well. But it's certainly better than nothing. And as you can see, it's 00:28:41.400 |
-- that's not error. Disease error. It's down to 7.5% error already and it's not even done. 00:28:50.880 |
So that's not bad. And, you know, in the past, what we've then done is we've said, okay, 00:28:57.960 |
well, that's working pretty well. Let's throw that away and try bigger. But there's actually 00:29:11.320 |
something more interesting we can do. Which is we don't have to throw it away. What we 00:29:18.040 |
could do is to continue training it on larger images. So we're basically saying, okay, this 00:29:29.160 |
is a model which is fine-tuned to recognize 128 by 128 pixel images of rice. Now let's 00:29:41.120 |
fine-tune it to recognize 192 by 192 pixel images of rice. And we could even -- there's a few 00:29:48.960 |
benefits to that. One is it's very fast, you know, to do the smaller images. And it can 00:29:56.160 |
recognize the key features of it. So, you know, this lets us do a lot of epochs quickly. And 00:30:06.920 |
then, like, the difference between small images of rice disease and large images of rice disease 00:30:11.080 |
isn't very big difference. So you would expect it would probably fine tune to bigger images 00:30:16.000 |
of rice disease quite easily. So we might get most of the benefit of training on big 00:30:21.320 |
images, but without most of the time. The second benefit is it's a kind of data augmentation, 00:30:30.120 |
which is we're actually giving it different sized images. So that should help. So here's 00:30:37.560 |
how we would do that. Let's grab this data block. Let's make it into a function. Get 00:30:48.520 |
dl. Okay. And the key thing I guess we're going to do -- well, let's just do the item 00:30:54.080 |
transforms and the batch transforms as usual. Oops. So the things we're going to change 00:31:08.600 |
are the item transforms. And the batch transforms. And then we're going to return the data loader 00:31:23.520 |
for that, which is here. Okay. So let's try -- 00:31:52.720 |
Coming up a bit. dls = get_dl. I guess it should be get_dls, really, because it returns 00:32:06.760 |
data loaders. get_dls. Okay. So let's see what we did last time as we scale it up a bit. 00:32:27.200 |
So this is going to be data augmentation as well. We're going to change how we scale. 00:32:36.600 |
So we'll scale with zero padding. And let's go up to 160. Okay. 00:33:00.780 |
So then we need a learner. Okay. So we're going to 00:33:29.840 |
change the size of the item. So our -- where's our squish one here? Squish. So the squish 00:33:41.280 |
here got 0.045. Our multi-task got 0.048. So it's actually a little bit worse. This might not 00:33:53.600 |
be a great test, actually, because I feel like one of the reasons that doing a multitask 00:33:58.480 |
model might be useful is it might be able to train for more epochs. Because we're kind 00:34:06.640 |
of giving it more signal. So we should probably revisit this with, like, 20 epochs. Any questions 00:34:19.400 |
or comments about progressive resizing while we wait for this to train? 00:34:23.240 |
>> Sorry, I can't see how you progressively changed the size because -- 00:34:33.040 |
>> I actually didn't. I messed it up. Whoops. Thank you. I have to do that again. I actually 00:34:44.720 |
didn't. Oh, and we need to get our DLS back as well. Okay. Let's start again. Okay. And 00:34:53.640 |
let's -- in case I mess this up again, let's export this. We'll call this, like, stage 00:34:58.960 |
one. See? Yeah. The problem was we created a new learner. So what we should have done 00:35:10.160 |
is gone learn.dls = dls. That's actually -- so that would actually change the data 00:35:24.820 |
loaders inside the learner without recreating it. Was that where you were heading with your 00:35:29.920 |
comment? There was an unfreeze method. Like, the same thing 00:35:41.760 |
in the book, I think, actually mentioned using the unfreeze method. 00:35:47.200 |
>> There is an unfreeze method. Yes. What were you saying about the unfreeze method? 00:35:51.600 |
>> Is an unfreeze required for progressive resizing? Am I wrong? 00:35:55.200 |
>> No, because fine-tune is already unfrozen. Although I actually want to fine-tune again. 00:36:04.360 |
So if anything, I kind of actually want to -- I actually want to refreeze it. Because 00:36:12.400 |
we've changed the resolution, I think fine-tuning the head might be a good idea to do again. 00:36:24.160 |
>> Which line of code is doing the progressive resizing part, just to be clear? 00:36:31.560 |
>> It's not our line of code. It's basically this. It's basically saying our current learner 00:36:36.780 |
is getting new data loaders. And the new data loaders have a size of 160, whereas the old 00:36:43.960 |
data loaders had a size of 128. And our old data loaders did a presizing of 192 squish, 00:36:52.200 |
but our new data loaders are doing a presizing of rectangular padding. Does that make sense? 00:36:59.000 |
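Putting the whole progressive-resizing flow together as a hedged sketch; the path, architecture, sizes, and `get_dls` helper are stand-ins for what's in the walkthrough notebook, not a copy of it:

```python
from fastai.vision.all import *

path = Path('paddy')   # assumption: the competition images live here

def get_dls(size, item_tfms):
    return DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        splitter=RandomSplitter(seed=42),
        get_y=parent_label,
        item_tfms=item_tfms,
        batch_tfms=aug_transforms(size=size, min_scale=0.75),
    ).dataloaders(path/'train_images')

# Stage 1: small images, fast epochs
dls = get_dls(128, Resize(192, method='squish'))
learn = vision_learner(dls, resnet34, metrics=error_rate).to_fp16()
learn.fine_tune(5, 0.01)

# Stage 2: same learner, new data loaders at a larger size
learn.dls = get_dls(160, Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros))
learn.fine_tune(5, 0.01)   # fine_tune re-freezes and trains the head first
```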
>> Why are you calling it progressive in this case? Are you going to keep changing the size 00:37:04.200 |
or something like that? >> Yeah, it's changing the size of the images 00:37:09.280 |
without resetting the learner. >> Just looked it up because I was curious. 00:37:15.560 |
>> Fine-tune calls a freeze first. >> I had a feeling it did. Thanks for checking, Zach. 00:37:24.440 |
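Roughly what fine_tune does under the hood, paraphrased as a sketch (not the actual fastai source):

```python
def fine_tune_sketch(learn, epochs, base_lr=2e-3, freeze_epochs=1):
    learn.freeze()                                   # train only the head first
    learn.fit_one_cycle(freeze_epochs, slice(base_lr))
    learn.unfreeze()                                 # then the whole network
    learn.fit_one_cycle(epochs, slice(base_lr/100, base_lr/2))
```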
So this time, you know, let's see. It'll be interesting, right, to see how it does. 00:37:32.280 |
So after the initial epoch, it's got 0.09, right? Whereas previously it had 0.27. So obviously 00:37:40.160 |
it's better than last time. But it's actually worse than the final point, right? Last time 00:37:45.480 |
it got all the way to 0.0418. In other words, this time it has got worse. So it's got some 00:37:52.000 |
work to do to learn to recognize what 160 pixel images look like. 00:37:58.000 |
>> Can I just clarify, Jeremy? So you're like doing one more step in the progressive resizing 00:38:08.280 |
here. It's not kind of an automated resizing. >> Correct. Correct. Yeah. Yeah. There isn't 00:38:14.720 |
anything in fast.ai to do this for you. And in fact, this technique is something that 00:38:21.520 |
we invented. So it doesn't exist in other libraries at all. So, yeah, it's the name 00:38:29.360 |
of a technique. It's not the name of, like, a method in fast.ai. And, yeah, the technique 00:38:35.360 |
is basically to replace the data loaders with ones at a larger size. And we invented it 00:38:43.800 |
as part of a competition called Dawnbench, which is where we work very well on a competition 00:38:50.920 |
for ImageNet training. And Google then took the idea and studied it a lot further as part 00:39:00.880 |
of a paper called EfficientNet V2 and found ways to make it work even better. Oh, my gosh, 00:39:09.640 |
look at this. So we've gone from 0.0418 to 0.0336. Have we done training at 160 before? I don't 00:39:09.640 |
think we have. I should be checking this one. 128, 128. 171 by 128. No, we haven't. This 00:39:49.900 |
is a 256 by 192. So eventually, I guess we're going to get to that point. So let's keep 00:39:58.240 |
going. So, okay. So we're down to 2.9% error. >> How did you come up with the idea for this? 00:40:05.880 |
Is it something that you just wanted to try? Or did you, like, stumble upon it while looking 00:40:11.480 |
at something else? >> Oh, I mean, it just seemed very obviously 00:40:15.000 |
to me like something which obviously we should do because, like, we were spending -- okay, 00:40:20.200 |
so on Dawnbench, we were training on ImageNet. It was taking 12 hours, I guess, to train 00:40:25.760 |
a single model. And the vast majority of that time, it's just recognizing very, very basic 00:40:34.960 |
things about images, you know? It's not learning the finer details of different cat breeds 00:40:40.800 |
or whatever, but it's just trying to understand about the concepts of, like, fur or sky or 00:40:45.280 |
metal. And I thought, well, there's no -- there's absolutely no reason to need 224 by 224 pixel 00:40:49.960 |
images to be able to do that, you know? Like, it just seemed obviously stupid that we would 00:40:59.120 |
do it. And partly, it was, like, also, like, I was just generally interested in changing 00:41:06.160 |
things during training. So, one of, you know, in particular, learning rates, right? So, 00:41:12.500 |
the idea of changing learning rates during training goes back a lot longer than Dawnbench, 00:41:17.720 |
that people had been generally training them by having a learning rate that kind of dropped 00:41:22.080 |
by a lot and then stayed flat and dropped by a lot and stayed flat. And Leslie Smith, 00:41:26.840 |
in particular, came up with this idea of kind of, like, gradually increasing it over a curve 00:41:31.520 |
and then gradually decreasing it following another curve. And so, I was definitely in 00:41:35.600 |
the mindset of, like, oh, there's kind of interesting things we can change during training. 00:41:39.320 |
So, I was looking at, like, oh, what if we change data augmentation during training, 00:41:44.280 |
for example? Like, maybe towards the end of training, we should, like, turn off data augmentation 00:41:49.260 |
so it could learn what unaugmented images look like, because that's what we really care 00:41:53.600 |
about, for example. So, yeah, that was the kind of stuff that I was kind of interested 00:42:02.160 |
in at the time. And so, yeah, definitely this thing of, like, you know, why are we looking 00:42:12.120 |
at 224 by 224 pixel images the entire time? Like, that just seemed obviously stupid. And 00:42:17.960 |
so, it wasn't something where I was like, wow, here's a crazy idea. I bet it won't work. 00:42:21.160 |
As soon as I thought of it, I just thought, okay, this is definitely going to work, you 00:42:24.720 |
know? And it did. >> Interesting. Thanks. Yeah. >> No worries. >> One question I have 00:42:33.400 |
for you, Jeremy. >> Yeah. >> There was a paper that came out, like, 00:42:37.400 |
in 2019 called Fixing the Train-Test Resolution Discrepancy, where, yeah, they, like, 00:42:44.940 |
trained on 224 and then did inference finally on, like, 320 by 320? >> Yeah. >> Have you 00:42:52.120 |
seen that still sort of work? Have you done that at all in your workflow? >> I mean, honestly, 00:42:57.920 |
I don't remember. I need to revisit that paper because you're right, it's an important one. 00:43:04.760 |
I, you know, I would generally try to fine-tune on the final size I was going to be predicting 00:43:16.920 |
on anyway. So, yeah, I guess we'll kind of see how we go with this, right? I mean, you 00:43:25.240 |
can definitely take a model that was trained on 224 by 224 images and use it to predict 00:43:32.720 |
360 by 360 images, and it will generally go pretty well. But I think it will go better 00:43:38.940 |
if you first fine-tune it on 360 by 360 images. >> Yeah, I don't think they tried pre-training 00:43:46.160 |
and then also training on, like, 320 versus just 320 in the 224. >> Yeah. >> That would 00:43:52.100 |
definitely be an interesting experiment. >> Yeah, it would be an interesting experiment. 00:43:54.960 |
It's definitely something that any of us here could do, you know? I think it would be cool. 00:44:00.680 |
Right? So, let's try scaling this up. So, we can change these two lines to one. So, 00:44:07.520 |
this is something I often do, is I do things like, yep. >> I think we don't have your screen. 00:44:14.720 |
>> So, I was just saying previously, I had, like, two cells to do this, and so now I'm 00:44:22.880 |
just going to combine it into one cell. So, this is what I tend to do as I fiddle around, 00:44:27.040 |
because I try to, like, gradually make things a little bit more concise, you know? Okay. 00:44:46.640 |
>> Does it make sense to go smaller than the original pre-training, like, with ConvNeXt? 00:44:58.120 |
>> Yeah, I mean, you can fine-tune to any size 00:45:02.320 |
you like. Absolutely. I'm just going to get rid of the zero padding, because, again, I 00:45:08.560 |
want to, like, try to change it a little bit each time, just to kind of, you know, it's 00:45:13.880 |
a kind of augmentation, right? So, okay. So, let's go up to 192. You know, one thing I 00:45:29.080 |
find encouraging is that, you know, my training loss isn't getting way underneath the validation 00:45:34.120 |
loss. It's not like we're -- feels like we could do this for ages before our error rates 00:45:43.240 |
start going up. Interestingly, when I reran this, my error rate was better, 0.0418. You've 00:46:10.200 |
got a good memory to remember these old papers. It's very helpful to be able to do that. 00:46:18.960 |
>> Usually what I wind up doing is my dad and I will email back and forth papers to 00:46:22.820 |
each other. So, I can just go through my sent mail, look at arXiv, and usually, if I don't remember 00:46:28.240 |
the name of it, I remember the subject of it in some degree. So, I can just go through 00:46:32.760 |
it all. >> I mean, it's a very, very good idea to 00:46:34.760 |
use a paper manager of some sort, to save papers, you know, whether it be Mendeley or 00:46:41.560 |
Zotero or Arxiv Sanity or whatever, or bookmarks or something. Yeah, because otherwise these 00:46:54.000 |
things disappear. Personally, I just tend to, like, tweet or favorite tweets about papers 00:47:01.400 |
I'm interested in. And then I've set up pinboard.in. I don't know if you guys have seen that, but 00:47:07.120 |
it's a really nice little thing, which basically any time you're on a website, you can click 00:47:17.000 |
a button and the extension and it adds it to pinboard, but it also automatically adds 00:47:23.680 |
all of your tweets and favorites, and it's got a full text search of the thing that the 00:47:30.720 |
URLs link to, which is really helpful. >> So, you've favorited something that just 00:47:35.960 |
says, oh, shit? >> No, I actually wrote something that just 00:47:38.600 |
said, oh, shit. That was me writing, oh, shit. It was this, I mean, totally off topic, but 00:47:47.320 |
it's an absolute disaster. I hope it's wrong, but it's an absolutely disastrous-sounding paper 00:47:55.120 |
that came out yesterday that basically, where was this key thing? People who've had one 00:48:03.520 |
COVID infection have a risk of at least one sequela of 8.4%, two infections 23%, three infections 00:48:09.680 |
36%. It's like my worst nightmare is the more people get infected with COVID, the more likely 00:48:15.960 |
it is that they'll get long-term symptoms, which is horrifying. That was my "oh, shit." 00:48:25.720 |
>> It's really awful. Okay. So, keeps going down, right? Which is cool. Let's keep bringing 00:48:31.840 |
along, I suppose. I guess, you know, what we could do is just grab this whole damn thing 00:48:40.240 |
here. Kind of have a bit of a comparison. So, we're basically going to run exactly the 00:48:49.980 |
same thing we did earlier, but this time with some pre-sizing first. 00:49:15.240 |
All right. So, that'll be an interesting experiment. So, while that's running, you know, this is 00:49:34.880 |
where I hit the old duplicate button. And this is why it's nice if you can to have a 00:49:46.320 |
second card. Because while something's running, you can try something else. CUDA_VISIBLE_DEVICES. 00:50:19.000 |
Okay. So, WeightedDataLoader. So, this is something I added to fastai a while ago and haven't 00:50:42.080 |
used much myself since. But if I just search for "weighted", here it is. Here it is. So, you 00:50:55.040 |
can see in the docs, it shows you exactly how to use weighted data loaders. And so, we 00:51:05.600 |
pass in a batch size. We pass in some weights. This is the weights. It's going to be 1, 2, 00:51:12.800 |
3, 4, 5, 6, 7, 8. And then some item transforms. These are really interesting in the docs. 00:51:23.560 |
In some ways, it's extremely advanced. And in other ways, it's extremely simple. Which 00:51:28.440 |
is to say, if you look at this example in the docs, everything is totally manual, right? 00:51:33.840 |
So, our labels are some random integers. And I've even added a comment here, right? Eight are 00:51:44.180 |
going to be in the training set. Two are going to be in the validation set. So, our data 00:51:51.960 |
block is going to contain one category block. Because we just got the one thing, right? 00:52:01.080 |
And rather than doing get X and get Y, you can also just say getters. Because get X and 00:52:07.960 |
get Y basically become getters, which is a list of transformations to do. And so, this 00:52:15.240 |
is going to be a single getter or a single get X, if you like, which is going to return 00:52:20.000 |
the Ith label. And a splitter, which is going to decide whether something's valid or not 00:52:25.400 |
based on this function. So, you can see this whole thing is totally manual. So, we can 00:52:32.080 |
create our data set by passing in a list of the numbers from 0 to 9. And a single item 00:52:38.520 |
transform that's going to convert that to a tensor. And then our weights will be the 00:52:43.440 |
numbers from 0 to 7. And so, then we can take our data sets or data sets and turn them into 00:52:51.040 |
data loaders using those weights. So, for the batch size of 1, if we say show batch, 00:53:02.240 |
we get back a single number, okay? And it's not doing random shuffling. So, we get the 00:53:06.640 |
number 0, because that was the first thing in our data set. Let's see, what do we do 00:53:17.720 |
next? Now, we've got to do n equals 160. So, now, we've got all of the numbers from 0 to 00:53:24.520 |
159 with those weights. Yes, for getters, yep. >> You mentioned, is this for X or Y? 00:53:38.120 |
>> This is a list. It's whatever, right? There is just one thing. I don't know if you call 00:53:44.040 |
that X or you call it Y. It's just one thing. So, if you have a get X and a get Y, that's 00:53:49.320 |
the same as having a getters with a list of two things. So, yeah. I think I could just 00:53:58.400 |
write get_x -- it's been ages since I wrote this, but I think I could just write get_x 00:54:01.400 |
here and put this not in a list. It would probably be the same thing. 00:54:05.160 |
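The equivalence being described, sketched with hypothetical getters:

```python
from fastai.data.block import DataBlock

get_x = lambda o: o['image']   # illustrative field names
get_y = lambda o: o['label']
a = DataBlock(get_x=get_x, get_y=get_y)
b = DataBlock(getters=[get_x, get_y])   # effectively the same data block
```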
>> Okay. >> That probably handles a little bit of the mystery 00:54:09.400 |
that might be happening as well. >> Yeah. >> The data block has an n_inp parameter. 00:54:14.480 |
>> Correct. >> Which is how it determines which of the 00:54:17.680 |
getters is X versus Y. >> Correct. Which we actually looked at last 00:54:22.920 |
time. Here. When we created our multi-image block. That was before you joined, Zach. Yes, 00:54:33.400 |
useful reminder. Okay. So, here we see a histogram of how often -- so, we created 00:54:49.560 |
like a little synthetic learner that doesn't really do anything, but we can pass callbacks 00:54:54.280 |
to it, and there's a callback called collect data callback, which just collects the data 00:54:59.280 |
that's part -- that is called in the learner, and so this is how we can then find out what 00:55:05.360 |
data was passed to the learner, get a histogram of it, and we can see that, yeah, the number 00:55:11.000 |
160 was received a lot more often when we trained this learner, which is what you would expect. 00:55:18.560 |
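The docs example being walked through looks roughly like this (reconstructed from memory of the fastai docs, so treat the details as approximate):

```python
import numpy as np
from fastai.vision.all import *

lbls = np.random.randint(0, 2, size=10)          # 10 items: 8 train, 2 valid
dblock = DataBlock(
    blocks=[CategoryBlock],                       # just the one block
    getters=[lambda i: lbls[i]],                  # a single getter returning the i-th label
    splitter=FuncSplitter(lambda i: i >= 8))      # last two items are the validation set
dsets = dblock.datasets(list(range(10)))
dls = dsets.weighted_dataloaders(wgts=range(8), bs=1, after_item=[ToTensor()])
dls.show_batch()                                  # bs=1, so a single weighted-sampled item
```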
This is the source of the weighted data loader class here, and as you can see, 00:55:30.200 |
other than the boilerplate, it's one, two, three, four, five lines of code. And then 00:55:37.240 |
the weighted data loader's method is two lines of code. So, there's actually a lot more lines 00:55:43.680 |
of example than there is of actual code. So, often it's easier just to read the source 00:55:49.560 |
code, because, you know, thanks to the very layered approach of fast AI, we can do so 00:55:55.800 |
much stuff with so little code. And so, in this case, if we look through the code, we're 00:56:02.920 |
passing in some weights, and basically the key thing here is that we set -- if the -- if 00:56:10.120 |
you pass in no weights at all, then we're just going to set it equal to the number one 00:56:15.440 |
repeated n times, so everything's going to get one, a weight of one. And then we divide 00:56:22.080 |
the weights by the sum of the weights so that the sum of the weights ends up summing up 00:56:26.240 |
to one, which is what we want. And then if you're not shuffling, then there's no weighted 00:56:42.520 |
anything to do, so we just pass back the indexes. And if we are shuffling, we will grab a random 00:56:49.000 |
choice of indexes based on the weights. Cool. All right. 00:56:49.000 |
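The gist of that sampling logic, paraphrased as a sketch (not the actual fastai source):

```python
import numpy as np

class WeightedSamplerSketch:
    def __init__(self, n, wgts=None, shuffle=True):
        wgts = np.array([1.0] * n if wgts is None else list(wgts), dtype=float)
        self.n, self.shuffle = n, shuffle
        self.wgts = wgts / wgts.sum()          # normalize so the weights sum to 1
    def get_idxs(self):
        if not self.shuffle:
            return list(range(self.n))         # not shuffling: just pass back the indexes
        return list(np.random.choice(self.n, self.n, p=self.wgts))
```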
So, there's going to be one weight per row. Let's come back to that, because I want to see how our thing's gone. It looks 00:57:15.520 |
like it's finished. Notice that the fav icon in Jupyter will change depending on whether 00:57:21.040 |
something's running or not, so that's how you can quickly tell if something's finished. 00:57:27.920 |
0.0216, 0.0221. Okay, I mean, it's not a huge difference, but maybe it's a tiny bit better. 00:57:50.400 |
I don't know. The key thing, though, is this lets us use our resources better, right? So 00:57:56.760 |
we often will end up with a better answer, but you can train for a lot less time. In 00:58:02.000 |
fact, you can see that the error was at 0.0216 back here, so we could probably have trained 00:58:08.360 |
for a lot less epochs. So that's progressive resizing. 00:58:13.240 |
Is there a way to look at that and go, "Oh, actually, I'd like to take the outputs from an earlier epoch"? 00:58:31.000 |
That was the question we got earlier. That's called early stopping, and the answer 00:58:36.120 |
is no. You probably wouldn't want to do early stopping. 00:58:39.080 |
But you can't go back to a previous epoch. There's no history. 00:58:46.680 |
You can. You have to use the early stopping callback to do that. 00:58:53.320 |
Or there's other things you can use. As I say, I don't think you should, but you can. 00:59:04.680 |
Okay, so the other part of that is, is it counterproductive or not? 00:59:10.400 |
It's not a cheat if it works, but not if it doesn't? 00:59:13.920 |
It's probably not a good idea. It probably will make it worse, yeah. 00:59:17.600 |
So the other thing you can do is save model callback, which saves -- which is kind of 00:59:22.520 |
like early stopping, but it doesn't stop. It saves the parameters of the best model 00:59:29.920 |
during training, which is probably what you want instead of early stopping. 00:59:34.400 |
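For reference, a minimal sketch of attaching it, assuming `learn` as before; the monitored metric and filename are illustrative:

```python
# SaveModelCallback tracks the monitored metric each epoch, saves the best
# weights, and reloads them at the end of training.
learn.fit_one_cycle(12, cbs=SaveModelCallback(monitor='error_rate', fname='best'))
```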
I don't think you should do that either for the same reason we discussed earlier. 00:59:40.400 |
Why shouldn't you do this? It seems like you could just ignore it if you didn't want it. 00:59:49.080 |
Well, so this actually automatically loads the best set of parameters at the end. 01:00:00.200 |
And you're just going to end up with this kind of model that just so happened to look 01:00:10.480 |
a tiny bit better on the validation set at an earlier epoch. 01:00:14.280 |
But at that earlier epoch, the learning rate hadn't yet stabilized, and it's very unlikely that it was actually better. 01:00:21.200 |
So you've probably actually just picked something that's slightly worse and made your process 01:00:27.880 |
slightly more complicated for no good reason. 01:00:31.040 |
Being better on an epoch there doesn't necessarily say anything about the final hidden test set. 01:00:39.320 |
Yeah, we have a strong prior belief that it will improve each epoch unless you're overfitting. 01:00:50.920 |
And if you're overfitting, then you shouldn't be doing early stopping, you should be doing something about the overfitting, like more data augmentation. 01:00:56.520 |
It seems like a good opportunity for somebody to document the arguments, because I'm curious what they do. 01:01:05.040 |
Yes, that would be a great opportunity for somebody to document the arguments. 01:01:10.320 |
And if somebody is interested in doing that, we have a really cool thing called docments, 01:01:19.600 |
which I only invented after we created fast.ai. 01:01:32.140 |
I should delete this because this is the old version; it's part of fastcore. 01:01:38.160 |
And you document each parameter by putting a comment after it. 01:01:44.800 |
And you document the return by putting a comment after it. 01:01:47.880 |
And Zach actually started a project, after I created docments, to add docment comments 01:01:55.640 |
to everything in fast.ai, which of course is not finished because fast.ai is pretty big. 01:02:01.240 |
And so here's an example of something that doesn't yet have docment comments. 01:02:04.440 |
So if somebody wants to go and add a comment to each of these things and put that into 01:02:10.800 |
a PR, then that will end up in the documentation. 01:02:20.320 |
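A tiny illustration of the docments style (an invented function, just to show the shape): each parameter gets an inline comment, the return comment goes after the signature, and nbdev renders these as a parameter table.

```python
def add_nums(
    a:int,    # the first number to add
    b:int=0,  # the second number to add
)->int:       # the sum of `a` and `b`
    return a + b
```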
Something we should do, Zach, is to actually include an example in the documentation of 01:02:33.320 |
what it ends up looking like in nbdev, because I can see that's missing. 01:02:44.600 |
No, I just wanted to encourage everybody that writing the documentation is an excellent way to learn. 01:02:55.520 |
And what ends up happening is you write this documentation and somebody like Jeremy will 01:03:02.360 |
review it carefully and let you know what you don't understand. 01:03:07.000 |
And that's how I learned about so much of the fast.ai library. 01:03:12.520 |
So I highly recommend it, going and doing that. 01:03:20.040 |
And you can see it's got a little table underneath. 01:03:22.120 |
And if we look at the source of optimizer, you'll see that each parameter has a comment 01:03:30.320 |
But it's automatically turned into this table. 01:03:43.360 |
Anybody got any questions or comments or anything before we wrap up? 01:03:50.720 |
I have a question regarding progressive resizing. 01:03:57.320 |
We didn't actually do lr_find after each step; don't you think it would be helpful? 01:04:14.320 |
I, to be honest, I don't use lr_find much anymore nowadays, because, you know, at least 01:04:24.520 |
for object recognition in computer vision, the optimal learning rate is pretty much always the same. 01:04:33.680 |
There's no reason to believe that we have any need to change it just because we changed 01:04:40.760 |
So, yeah, I wouldn't bother just leave it where it was. 01:04:50.640 |
Jeremy, if your training and validation loss is still decreasing after 12 epochs, can you 01:04:55.640 |
pick up and train for a little longer without restarting? 01:05:00.680 |
The first thing I'll say is you shouldn't be looking at the validation loss to see if you're overfitting. 01:05:04.880 |
So, the validation loss can get worse whilst the error rate gets better, and that doesn't 01:05:09.560 |
count as overfitting, because the thing you want to improve is the error rate. 01:05:13.560 |
That can happen if it gets overconfident, but it's still improving. 01:05:17.360 |
Yeah, you can keep training for longer because we're using, if you're using fit one cycle 01:05:25.520 |
or fine-tune, and fine-tune uses fit one cycle behind the scenes, continuing to train further, 01:05:33.400 |
your learning rate is going to go up and then down and then up and then down each time, 01:05:37.240 |
which is not necessarily a bad thing, but, you know, if you, yeah, if you basically want 01:05:44.520 |
to keep training at that, you know, at that point, you would probably want to decrease 01:05:53.280 |
the learning rate by maybe 4x or so, and in fact, you know, I think after this, I'm going 01:05:59.240 |
to rerun this whole notebook, but halve the learning rate each time, so I think that would help. 01:06:17.360 |
I don't know if it's too late, but I think it might be useful to discuss, when you do 01:06:22.840 |
the progressive resizing, what part of the model gets dropped, like, what, you know, 01:06:32.160 |
is there some part of the model that needs to be reinitialized for the new size? 01:06:42.120 |
I thought you were talking to me, but you're talking to Siri? 01:06:54.080 |
Yeah, ConvNeXt is what we call a resolution-independent architecture, which means it works 01:07:03.040 |
for any input resolution, and time-permitting in the next lesson, we will see how convolutional 01:07:13.240 |
neural networks actually work, but I guess a lot of you probably already know, so for 01:07:18.040 |
those of you that do, if you think about it, it's basically going patch by patch and doing 01:07:23.520 |
this kind of mini matrix multiply for each patch, so if you change the input resolution, 01:07:32.760 |
it just has more patches to cover, but it doesn't change the parameters at all, so there's nothing that needs to be reinitialized. 01:07:52.080 |
I was just going to, a quick note, ask: is ResNet resolution-independent? 01:08:01.080 |
Typically, everything we use normally is, but, like, have a look at that 01:08:09.160 |
best fine-tuning models notebook, and you'll see that two of the best ones are called ViT. 01:08:19.160 |
None of those are resolution-independent, although there is a trick you can use to kind of make 01:08:26.800 |
them resolution-independent, which we should try out in a future walkthrough. 01:08:42.120 |
I don't know if we can use it to support progressive resizing or not. 01:08:48.680 |
It's basically changing the positional encodings. 01:08:57.640 |
After you've done your experiments, progressive resizing, and fine-tuning, how do you then train on the full dataset? 01:09:16.640 |
Like, instead, I do what we saw in the last walkthrough, which is I just train on a few 01:09:24.200 |
different randomly selected training sets, because that way, you know, you get the benefit: 01:09:37.760 |
you're going to end up seeing all the images at least once anyway. 01:09:41.080 |
And you can also kind of see if something's messed up, because you've still got a validation 01:09:46.240 |
So yeah, I used to do this thing where I would create a validation set with a single item 01:09:51.440 |
in it to get that last bit of juice, but I don't even do that anymore.