Live coding 14
Chapters
0:00 Questions
0:05 About the concept/capability of early stopping
4:00 Different models, which one to use
5:25 Gradient Boosting Machine with different model predictions
7:25 AutoML tools
7:50 Kaggle winners' approaches, ensembling
9:00 Test Time Augmentation (TTA)
11:00 Training loss vs validation loss
12:30 Averaging a few augmented versions
13:50 Unbalanced dataset and augmentation
15:00 On balancing datasets
15:40 WeightedDL, Weighted DataLoader
17:55 Weighted sampling on Diabetic Retinopathy competition
19:40 Let's try something…
21:40 Setting an environment variable when having multiple GPUs
21:55 Multi-target model
23:00 Debugging
27:04 Revise transforms to 128x128 and 5 epochs
28:00 Progressive resizing
29:16 Fine-tuning again but on larger 160x160 images
34:30 Oops, small bug, restart (without creating a new learner)
37:30 Re-run second fine-tuning
40:00 How did you come up with the idea of progressive resizing?
41:00 Changing things during training
42:30 On the paper Fixing the train-test resolution discrepancy
44:15 Fine-tuning again but on larger 192x192 images
46:11 A detour about paper reference management
48:27 Final fine-tuning 256x192
49:30 Looking at WeightedDL, WeightedDataLoader
57:08 Back to the results of fine-tuning 256x192
58:20 Question leading to look at callbacks
59:18 About SaveModelCallback
60:56 Contributing, Documentation, and looking at "Docments"
63:50 Final questions: lr_find()
64:50 Final questions: Training for longer, decreasing validation loss, epochs, error rate
66:15 Final questions: Progressive resizing and reinitialization
68:00 Final questions: Resolution-independent models
00:00:05.960 |
So, Jeremy, in the training process in fastai, is there a concept or capability 00:00:14.520 |
to do, like, early stopping or best-model kind of thing, or if there isn't, is there a reason? 00:00:25.320 |
I never remember, because I don't use it myself. 00:00:28.400 |
So what I would check, I'm just checking now, is the callbacks, which is under training. 00:00:33.680 |
So let's go to the docs, training, callbacks. 00:01:03.440 |
So perhaps the more interesting part, then, is why do I not use it, such that I don't even remember it. 00:01:16.340 |
One is that it doesn't play nicely with one cycle training or fine tuning. 00:01:29.300 |
If you stop early, then the learning rate hasn't got a chance to go down. 00:01:36.720 |
And for that reason, it's almost never the case that earlier epochs have better accuracy 00:01:47.320 |
because the learning rate hasn't settled down yet. 00:01:52.000 |
If I was doing one cycle training and I saw that an earlier epoch had a much better accuracy, 00:01:59.920 |
then I would know that I'm overfitting in which case I would be adding more data augmentation 00:02:06.800 |
rather than doing early stopping, because it's good to train for the full amount of time. 00:02:14.320 |
So yeah, I can't think offhand of a situation where I would, I mean, I haven't come across 00:02:19.520 |
a situation where I've personally wanted to use early stopping. 00:02:25.520 |
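For reference, since the question was how you would do it at all: a minimal sketch of attaching fastai's EarlyStoppingCallback, assuming a `dls` like the ones built in these walkthroughs; the numbers are illustrative, not from the session.

```python
from fastai.vision.all import *

# EarlyStoppingCallback is one of fastai's tracker callbacks: it halts
# training once the monitored metric stops improving for `patience` epochs.
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(20, cbs=EarlyStoppingCallback(monitor='error_rate', patience=3))
```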
Like in some of the training examples, like where you had the error rate, like some of 00:02:31.080 |
the prior runs may have had a better, lower error rate. 00:02:36.720 |
Oh, I mean, in the ones I've shown, like a tiny bit better, yeah, but, like, not enough to matter. 00:02:48.000 |
And yeah, so that there's no reason to believe that those would, those are actually better 00:02:54.080 |
models and there's plenty of a prior reason to believe that they're actually not, which 00:02:57.680 |
is that the learning rate still hasn't settled down at that point. 00:03:00.280 |
So we haven't let it fine tune into the best spot yet. 00:03:06.240 |
So yeah, if it's kind of going down, down and down and it's kind of bottoming out and 00:03:10.960 |
just bumps a little bit at the bottom, that's not a reason to use early stopping. 00:03:18.600 |
And it's also, I think, important to realize that the validation set is relatively small 00:03:25.720 |
So it's only a representation of the distribution that the data is coming from. 00:03:31.320 |
So reading too much into those small fluctuations can be very counterproductive. 00:03:38.440 |
I know that I've wasted a lot of time in the past doing that, but yeah, a lot of time. 00:03:46.840 |
We're looking for changes that dramatically improve things, you know, like changing from 00:03:51.820 |
ResNet26d to ConvNeXt, and we improved by what, 400 or 500%, and it's like, okay, that's 00:03:59.640 |
Over the weekend, I went on my own server that I have here behind me, that I hit via 00:04:09.160 |
an API, and I ran all, like, 35 models for the paddy thing. 00:04:21.160 |
I didn't do the example, but I was thinking about this, that when I was taking algebra 00:04:28.440 |
back in high school or college, you have some of these piecewise expressions, like the function 00:04:33.840 |
of x is equal to x squared for x greater than something, and the absolute value of x otherwise. 00:04:46.040 |
So it just got me, my idea, the idea is that maybe some of the dataset is going to miss 00:04:56.040 |
the target value for every single one of the models that we tried, but if we try different models... 00:05:07.760 |
I mean, of course we can, but I mean, what will be the easiest approach to say for this 00:05:13.400 |
validation when X is equal to this or greater than that, this is the model to use, but then 00:05:20.600 |
if this is the other model, this is what you have to use. 00:05:24.280 |
Yeah. I mean, you could do that, right? And like a really simple way to do that, which 00:05:36.080 |
I've seen used for some success on Kaggle is to train lots of models and then to train 00:05:44.440 |
a gradient boosting machine whose inputs are those model predictions that whose output 00:05:51.960 |
is the targets. And so that'll do exactly what you just described. 00:05:57.120 |
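As a hedged sketch of that stacking idea, with random arrays standing in for the per-class probability outputs of models you have already trained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_samples, n_classes = 500, 10
# Stand-ins for two trained models' predicted probabilities on held-out data
preds_a = rng.random((n_samples, n_classes))
preds_b = rng.random((n_samples, n_classes))
y = rng.integers(0, n_classes, n_samples)      # the true targets

X = np.hstack([preds_a, preds_b])              # model predictions become features
gbm = GradientBoostingClassifier().fit(X, y)   # learns which model to trust where
```

As noted next, this is very easy to overfit, so a real version would fit and evaluate the GBM on out-of-fold predictions.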
It's very easy to overfit when you do that. And if you've 00:06:08.960 |
trained them well, you're only going to get a tiny increase, right? Because the neural 00:06:15.400 |
nets are flexible, it shouldn't have that situation where in this part of the space it 00:06:22.120 |
has bad predictions and in that part of the space it has good predictions. 00:06:30.880 |
If you had a variety of, like, totally different types of model, like a random 00:06:35.960 |
forest, a GBM, and a neural net, I could see that maybe, but most of the time, one 00:06:45.680 |
of those will be dramatically better than the other ones. And so like, I don't that 00:06:49.240 |
often find myself wanting to ensemble across totally different types of model. So I'd say 00:06:56.720 |
it's another one of these things like early stopping, which like a lot of people waste 00:07:03.840 |
huge amounts of time on, you know, and it's not really where the big benefits are going 00:07:09.880 |
to be seen. But yeah, if you're like in gold medal zone on a Kaggle competition and you 00:07:15.560 |
need another 0.002% or something, then these are all things you can certainly try at that point. 00:07:23.480 |
It kind of reminded me of AutoML, like the regime of tools. I don't know how you feel 00:07:37.160 |
Yeah, we talked about that in last night's lesson actually. So you'll have to catch up 00:07:41.680 |
to see what I, what I said, if you haven't seen the lesson yet. Yeah. I'll mention also 00:07:49.040 |
reading Kaggle winners descriptions of their approaches is, is great. But you've got to 00:07:55.480 |
be very careful because remember, like Kaggle winners are the people who did get that last 00:08:01.720 |
0.002%. You know, because like everybody found all the low hanging fruit and the people who 00:08:08.320 |
won grabbed the really high-hanging fruit. And so every time you read a Kaggle winner's 00:08:13.200 |
description, they almost always have complex ensembling methods. And that's why, you know, 00:08:20.940 |
in like something like a big image recognition competition, it's very hard to win or probably 00:08:25.640 |
impossible to win with a single model, unless you invent some amazing new architecture or 00:08:31.080 |
something. And so you're kind of, you might get the impression then that ensembling is 00:08:38.520 |
the big thing that gets you all the low-hanging fruit, but it's not; that's why those solutions 00:08:42.760 |
are particularly complex. Ensembling is a thing that gets you that last fraction of a percent. 00:08:49.600 |
One more question. Yeah, of course, the TTA concept, right? So I mean, TTA, TTA, sorry, 00:09:03.440 |
TTA. Yeah, test time augmentation. So if I understand, like, I'm trying to understand 00:09:11.760 |
conceptually why TTA improves the score, because technically, when you're training, it is using 00:09:20.640 |
those augmented sort of pictures and providing a percentage number. But when 00:09:26.480 |
you're kind of, when you run that TTA function, why is it able to predict better? 00:09:34.080 |
So like, you know how sometimes you're like, looking at some like, I don't know, a screwhead 00:09:44.960 |
or a plug or a socket or something, it's really small, and you can't quite see like, what, 00:09:51.280 |
how many pins are in it or what type is it or whatever. And you're kind of like, look 00:09:54.760 |
at it from different angles, and you're kind of like, put it up to the light, and you try 00:09:58.280 |
to like, at some point, you're like, okay, I see it, right? And there's like some angle 00:10:05.640 |
and some lighting that you can see it. That's what you're doing for the computer, you're 00:10:11.600 |
giving it different angles, and you're giving it different lighting in the hope that in 00:10:16.160 |
one of those, it's going to be really clear. And for the ones where it's easy, it's not 00:10:22.560 |
going to make any difference, right? But for the ones where it's like, oh, I don't know if 00:10:25.480 |
it's this disease or that disease, but oh, you know, when it's a bit brighter, and you 00:10:30.760 |
kind of zoom into that section, like, oh, now I can see. And so when you then average 00:10:36.160 |
them out, you know, all the other ones are all like, oh, I don't know which kind it is, 00:10:40.240 |
I don't know which kind, so it's like 0.5, 0.5, 0.5. And then this one is like 0.6. 00:10:44.560 |
And so that's the one that in the average, it's going to end up picking. That's basically 00:10:49.200 |
what happens. It also has another benefit, which is when we train our models, I don't 00:11:03.320 |
know if you've noticed, but our training loss generally gets much, much lower than our validation 00:11:09.480 |
loss. And so, basically, like, what's happening 00:11:24.320 |
there is that on the training set, the model is getting very confident, right? So even 00:11:29.800 |
though we're using data augmentation, it's seeing slightly different versions of the 00:11:33.400 |
same image dozens of times. And it's like, oh, I know how to recognize these. And so 00:11:40.640 |
what it does is that the probabilities it associates with them are like 0.9, 0.99, 00:11:46.200 |
you know, like, it's like, I'm very confident of these. And it actually gets 00:11:51.200 |
overconfident, which actually doesn't necessarily impact our accuracy, you know, to be overconfident. 00:12:05.160 |
But at some point, it, it can. And so we are systematically going to have like overconfident 00:12:13.760 |
predictions of probability. When, even when it doesn't really know, just because it's 00:12:20.280 |
really seen that kind of image before. So then on the validation set, it's going to be, you 00:12:26.520 |
know, giving overconfident probabilities as well. And so one nice benefit is that when you average 00:12:32.400 |
out a few augmented versions, you know, it's like, oh, point nine, point nine probability 00:12:39.880 |
is this one. And then on the next one, it's like, augmented version with the same image 00:12:43.160 |
like, oh, no, point one probability is that one. And they'll kind of average out to much 00:12:48.560 |
more reasonable probabilities, which can, you know, allow it sometimes to yeah, combine 00:13:01.600 |
these ideas into an average that that makes more sense. And so that can improve accuracy, 00:13:08.240 |
but in particular, it improves the actual probabilities to get rid of that overconfidence. 00:13:15.200 |
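In fastai this is essentially a one-liner; a minimal sketch, assuming `learn` is a trained Learner like the ones in these sessions and the usual fastai imports:

```python
# tta() averages predictions over n augmented versions of each validation
# image (blended with the unaugmented prediction), then returns them.
preds, targs = learn.tta(n=4)
print(error_rate(preds, targs))
```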
Is it fair to say that when you train, it's not able to separate 00:13:23.400 |
the replicated sort of images or the distorted slightly the variant of the original image, 00:13:30.240 |
but when you use the TTA, it is able to group all the four images. 00:13:35.080 |
Right, that's what TTA is: we present them all together and average out that group. Yes. 00:13:42.400 |
But in training, we don't indicate in any way that they're the same image or that they're 00:13:47.480 |
the same underlying object. One of the questions, Jeremy: we keep coming back to ensembling 00:13:59.600 |
and how to pick the best ensemble; I was going to ask about that. But another question 00:14:06.280 |
is, we have a fairly unbalanced data set, I guess, with the normal versus the disease 00:14:13.560 |
states. You're doing augmentation. Is there any benefit to sort of over representing the 00:14:19.040 |
minority classes? So let's let's pull away augmentation. So it's actually got nothing 00:14:24.720 |
to do with augmentation. So more generally, when you're training, does it make sense to 00:14:29.160 |
over represent the minority class? And the answer is, maybe. Yeah, it can. Right. And 00:14:40.480 |
so, okay, so just for those who aren't following, the issue Matt's talking about is that there 00:14:47.480 |
was, you know, a couple of diseases which appear lots and lots in the data, and a couple 00:14:52.440 |
which hardly appear at all. And so, you know, do we want to try to balance this out more? 00:15:02.560 |
And one thing that people often do to balance it out more is that they'll throw away some 00:15:08.560 |
of the images in the highly represented classes. And I can certainly 00:15:13.700 |
tell you straight away, you should never ever do that. You never want to throw away data. 00:15:20.720 |
But Matt's question was, well, could we, you know, over sample the less common diseases? 00:15:29.920 |
And the answer is, yeah, absolutely, you could. And in fastai, you go into the docs. Now, 00:15:42.560 |
where is it? There is a weighted data loader somewhere. Weighted. Search for that. Here 00:15:52.280 |
we go. Of course, it's a callback. So if you go to the callbacks data section, you'll find 00:15:58.040 |
a WeightedDL callback or a weighted_dataloaders method. I'm not sure. No, I'm just telling 00:16:08.420 |
you where to look. Thanks for checking. So, yeah, I mean, let's look at that today, right? 00:16:21.480 |
Because I kind of want to look at, like, things we can do to improve things today. It doesn't 00:16:28.720 |
necessarily help. Because it does mean, you know, given that you're, you know, let's say 00:16:35.060 |
you do 10 epochs of 1,000 images, it's going to get to look at 10,000 images, right? And 00:16:41.240 |
if you over sample a class, then that also means that it's going to get, it's going to 00:16:46.120 |
see less of some images and going to get more repetition of other images, which could be 00:16:53.640 |
a problem, you know? And really, it just depends on, depends on a few things. If it's like 00:17:00.480 |
really unbalanced, like 99% all of one type, then you're going to have a whole lot of batches 00:17:05.980 |
that it never sees anything of the underrepresented class. And so basically, there's nothing for 00:17:11.280 |
it to learn from. So at some point, you almost certainly need weighted sampling. It also 00:17:18.880 |
depends on the evaluation. You know, if people like say in the evaluation, okay, we're going 00:17:22.700 |
to kind of average out for each disease, how accurate you were. So every disease will then 00:17:28.720 |
be like equally weighted, then you would definitely need to use weighted sampling. But in this 00:17:34.640 |
case, you know, presuming, presuming that the test set has a similar distribution as 00:17:40.760 |
a training set, weighted sampling might not help. Because they're going to care the most 00:17:48.120 |
about how well we do on the highly represented diseases. 00:17:52.680 |
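A rough sketch of what weighted sampling could look like for a dataset like this one; `dblock` and `train_path` are assumed from earlier in the session, and the inverse-frequency weighting is my illustration, not code from the walkthrough:

```python
from collections import Counter
from fastai.vision.all import *

files = get_image_files(train_path)
dsets = dblock.datasets(files)
# Weight each *training* item by the inverse frequency of its class,
# so rarer diseases get sampled more often.
labels = [parent_label(files[i]) for i in dsets.splits[0]]
counts = Counter(labels)
wgts = [1 / counts[l] for l in labels]
dls = dsets.weighted_dataloaders(wgts, bs=64)
```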
I'll note my experience with, like, oversampling and things like that. I think one time I had 00:18:02.840 |
done this with, I think, diabetic retinopathy; there was a competition for that. And I had used 00:18:09.640 |
weighted sampling or oversampling, and it did seem to help. And then also a while back, 00:18:14.360 |
I did an experiment where I think this was back with fast AI version one, where I took 00:18:20.320 |
like the MNIST dataset, and then I, like, artificially added some sort of imbalance. 00:18:29.600 |
And then I trained with and without weighted sampling. And I saw like there was an improvement 00:18:35.520 |
with the weighted sampling on accuracy on like just a regular validation set. So from 00:18:42.800 |
that, from those couple experiments, I'd say like, I've at least seen some help and improvement 00:18:48.440 |
with weighted sampling. Cool. And was that cases where that data set was like, highly 00:18:53.600 |
unbalanced? Or was it more like the data set that we're looking at, at the moment? 00:18:59.600 |
It wasn't highly unbalanced. It was maybe like, I don't know, like, maybe like, yeah, 00:19:05.240 |
just 75% versus 25% or something like that. It's not like 99.99 versus 1%, nothing like 00:19:11.960 |
that. It was more... Oh, well, it wasn't that bad. So let's try it today. Yeah. I see we've 00:19:17.440 |
got a new face today as well. Hello, Zach. Thanks for joining. 00:19:21.880 |
Hey, hey, glad I could finally make these. Yeah. Are you joining from Florida? 00:19:28.760 |
No, I'm in Maryland now. Maryland now. Okay, I have a change. Yes, much more up north. 00:19:38.880 |
Okay, great. So let's let's try something. Okay, so let's connect to my little computer. 00:20:08.840 |
It says, is there a way to shrink my zoom bar out of the way? It takes up so much space. 00:20:25.840 |
Hide floating meeting controls. I guess that's what I want. Control Alt Shift H. Wow. Press 00:20:32.960 |
escape to show floating meeting controls. That doesn't work very well with Vim. Oh, well, 00:20:40.440 |
Control Alt Shift 8. Okay. We're not doing tabular today, so let's get rid of that. So 00:20:58.680 |
I think what I might do is, you know, because we're iterating. Well, I guess we could start 00:21:07.840 |
with the multi-task notebook, because this is our kind of, like, things-to-try-to-improve 00:21:17.200 |
version. I'll close that. I'll leave that open just in case we want to. Okay. By the 00:21:40.360 |
way, if you've got multiple GPUs, this is how you just use one of them. You can just 00:21:44.200 |
set an environment variable. 00:22:07.880 |
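What that looks like in practice, as a sketch (the device index is illustrative, and it must be set before anything touches CUDA):

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'   # make only GPU 1 visible to this process
```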
Okay, so this is where we did the multi-target model. Okay. Just moved everything slightly. 00:22:34.000 |
Comp. Not comp path. Right. Back to where we were. Okay. So now what? What's broken? 00:23:03.320 |
Data block. Get image files. Well, this is working the other day. So I guess we better 00:23:28.920 |
try to do some debugging. So the obvious thing to do would be to call this thing here, get 00:23:38.960 |
image files on the thing that we passed in here, which is train path. Okay, so that's 00:23:47.480 |
working. Then the other thing to do would be to check data by doing show batch. Okay, 00:23:58.200 |
that's working. And I guess, all right, and it's showing you two different things. That's 00:24:07.880 |
good. Oh, is it? Right, we've got the two category blocks. So we can't use this one. 00:24:26.280 |
We have to use this one. So, fit_one_cycle. Yeah, okay. So, to remind you, we have this 00:24:45.640 |
is the one where we had two categories and one input. And to get the two categories, 00:24:57.320 |
we use the parent label and this function, which looked up the variety from this dictionary. 00:25:07.840 |
Okay, and then when we fine tuned it, and let's just check, yes, it equals 42. So that's our 00:25:19.640 |
standard set, we should be able to then compare that to small models trained for 12 epochs. 00:25:36.360 |
And then that was this one. Part two. And let's see. They're not quite 00:26:06.240 |
the same because this was 480 squish. Or else this was rectangular pad. Let's do five epochs. 00:26:31.920 |
Let's do it the same as this one. Yeah, let's do this one. Because we want to be able to 00:26:44.240 |
do quick iterations. Let's see resize 192 squish. There we go. 00:27:12.200 |
And then we trained it for 0.01 with FP16 with five epochs. All right. So this will be our 00:27:41.880 |
base case. Well, you know, I mean, I guess this is our base case 0.045. This will be 00:27:51.600 |
our next case. Okay, so while that's running, the next thing I wanted to talk about is progressive 00:28:06.040 |
resizing. So this is training at a size of 128. Which is not very big. And we wouldn't 00:28:34.400 |
expect it to do very well. But it's certainly better than nothing. And as you can see, it's 00:28:41.400 |
-- that's not error. Disease error. It's down to 7.5% error already and it's not even done. 00:28:50.880 |
So that's not bad. And, you know, in the past, what we've then done is we've said, okay, 00:28:57.960 |
well, that's working pretty well. Let's throw that away and try bigger. But there's actually 00:29:11.320 |
something more interesting we can do. Which is we don't have to throw it away. What we 00:29:18.040 |
could do is to continue training it on larger images. So we're basically saying, okay, this 00:29:29.160 |
is a model which is fine-tuned to recognize 128 by 128 pixel images of rice. Now let's 00:29:41.120 |
fine-tune it to recognize 192 by 192 pixel images of rice. And we could even -- there's a few 00:29:48.960 |
benefits to that. One is it's very fast, you know, to do the smaller images. And it can 00:29:56.160 |
recognize the key features of it. So, you know, this lets us do a lot of epochs quickly. And 00:30:06.920 |
then, like, the difference between small images of rice disease and large images of rice disease 00:30:11.080 |
isn't very big difference. So you would expect it would probably fine tune to bigger images 00:30:16.000 |
of rice disease quite easily. So we might get most of the benefit of training on big 00:30:21.320 |
images, but without most of the time. The second benefit is it's a kind of data augmentation, 00:30:30.120 |
which is we're actually giving it different sized images. So that should help. So here's 00:30:37.560 |
how we would do that. Let's grab this data block. Let's make it into a function. Get 00:30:48.520 |
dl. Okay. And the key thing I guess we're going to do -- well, let's just do the item 00:30:54.080 |
transforms and the batch transforms as usual. Oops. So the things we're going to change 00:31:08.600 |
are the item transforms. And the batch transforms. And then we're going to return the data loader 00:31:23.520 |
for that, which is here. Okay. So let's try -- 00:31:52.720 |
Coming up a bit. dls = get_dl. I guess it should be get_dls, really, because it returns 00:32:06.760 |
data loaders. get_dls. Okay. So let's see what we did last time as we scale it up a bit. 00:32:27.200 |
So this is going to be data augmentation as well. We're going to change how we scale. 00:32:36.600 |
So we'll scale with zero padding. And let's go up to 160. Okay. 00:33:00.780 |
So then we need a learner. Okay. So we're going to 00:33:29.840 |
change the size of the item. So our -- where's our squish one here? Squish. So the squish 00:33:41.280 |
here got 0.045. Our multi-task got 0.048. So it's actually a little bit worse. This might not 00:33:53.600 |
be a great test, actually, because I feel like one of the reasons that doing a multitask 00:33:58.480 |
model might be useful is it might be able to train for more epochs. Because we're kind 00:34:06.640 |
of giving it more signal. So we should probably revisit this with, like, 20 epochs. Any questions 00:34:19.400 |
or comments about progressive resizing while we wait for this to train? 00:34:23.240 |
>> Sorry, I can't see how you progressively changed the size because -- 00:34:33.040 |
>> I actually didn't. I messed it up. Whoops. Thank you. I have to do that again. I actually 00:34:44.720 |
didn't. Oh, and we need to get our DLS back as well. Okay. Let's start again. Okay. And 00:34:53.640 |
let's -- in case I mess this up again, let's export this. We'll call this, like, stage 00:34:58.960 |
one. See? Yeah. The problem was we created a new learner. So what we should have done 00:35:10.160 |
is gone learn.dls = dls. That's actually -- so that would actually change the data 00:35:24.820 |
loaders inside the learner without recreating it. Was that where you were heading with your 00:35:29.920 |
comment? There was an unfreeze method. Like, the same thing 00:35:41.760 |
in the book, I think, actually mentioned using the unfreeze method. 00:35:47.200 |
>> There is an unfreeze method. Yes. What were you saying about the unfreeze method? 00:35:51.600 |
>> Is an unfreeze required for progressive resizing? Am I wrong? 00:35:55.200 |
>> No, because fine-tune is already unfrozen. Although I actually want to fine-tune again. 00:36:04.360 |
So if anything, I kind of actually want to -- I actually want to refreeze it. Because 00:36:12.400 |
we've changed the resolution, I think fine-tuning the head might be a good idea to do again. 00:36:24.160 |
>> Which line of code is doing the progressive resizing part, just to be clear? 00:36:31.560 |
>> It's not our line of code. It's basically this. It's basically saying our current learner 00:36:36.780 |
is getting new data loaders. And the new data loaders have a size of 160, whereas the old 00:36:43.960 |
data loaders had a size of 128. And our old data loaders did a presizing of 192 squish, 00:36:52.200 |
but our new data loaders are doing a presizing of rectangular padding. Does that make sense? 00:36:59.000 |
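Putting the whole progressive-resizing flow together as a hedged sketch; the path, architecture, sizes, and `get_dls` helper are stand-ins for what's in the walkthrough notebook, not a copy of it:

```python
from fastai.vision.all import *

path = Path('paddy')   # assumption: the competition images live here

def get_dls(size, item_tfms):
    return DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        splitter=RandomSplitter(seed=42),
        get_y=parent_label,
        item_tfms=item_tfms,
        batch_tfms=aug_transforms(size=size, min_scale=0.75),
    ).dataloaders(path/'train_images')

# Stage 1: small images, fast epochs
dls = get_dls(128, Resize(192, method='squish'))
learn = vision_learner(dls, resnet34, metrics=error_rate).to_fp16()
learn.fine_tune(5, 0.01)

# Stage 2: same learner, new data loaders at a larger size
learn.dls = get_dls(160, Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros))
learn.fine_tune(5, 0.01)   # fine_tune re-freezes and trains the head first
```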
>> Why are you calling it progressive in this case? Are you going to keep changing the size 00:37:04.200 |
or something like that? >> Yeah, it's changing the size of the images 00:37:09.280 |
without resetting the learner. >> Just looked it up because I was curious. 00:37:15.560 |
>> Fine-tune calls a freeze first. >> I had a feeling it did. Thanks for checking, Zach. 00:37:24.440 |
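Roughly what fine_tune does under the hood, paraphrased as a sketch (not the actual fastai source):

```python
def fine_tune_sketch(learn, epochs, base_lr=2e-3, freeze_epochs=1):
    learn.freeze()                                   # train only the head first
    learn.fit_one_cycle(freeze_epochs, slice(base_lr))
    learn.unfreeze()                                 # then the whole network
    learn.fit_one_cycle(epochs, slice(base_lr/100, base_lr/2))
```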
So this time, you know, let's see. It'll be interesting, right, to see how it does. 00:37:32.280 |
So after the initial epoch, it's got 0.09, right? Whereas previously it had 0.27. So obviously 00:37:40.160 |
it's better than last time. But it's actually worse than the final point, right? Last time 00:37:45.480 |
it got all the way to 0.0418. In other words, this time it has got worse. So it's got some 00:37:52.000 |
work to do to learn to recognize what 160 pixel images look like. 00:37:58.000 |
>> Can I just clarify, Jeremy? So you're like doing one more step in the progressive resizing 00:38:08.280 |
here. It's not kind of an automated resizing. >> Correct. Correct. Yeah. Yeah. There isn't 00:38:14.720 |
anything in fast.ai to do this for you. And in fact, this technique is something that 00:38:21.520 |
we invented. So it doesn't exist in other libraries at all. So, yeah, it's the name 00:38:29.360 |
of a technique. It's not the name of, like, a method in fast.ai. And, yeah, the technique 00:38:35.360 |
is basically to replace the data loaders with ones at a larger size. And we invented it 00:38:43.800 |
as part of a competition called Dawnbench, which is where we work very well on a competition 00:38:50.920 |
for ImageNet training. And Google then took the idea and studied it a lot further as part 00:39:00.880 |
of a paper called EfficientNet V2 and found ways to make it work even better. Oh, my gosh, 00:39:09.640 |
look at this. So we've gone from 0.0418 to 0.0336. Have we done training at 160 before? I don't 00:39:09.640 |
think we have. I should be checking this one. 128, 128. 171 by 128. No, we haven't. This 00:39:49.900 |
is a 256 by 192. So eventually, I guess we're going to get to that point. So let's keep 00:39:58.240 |
going. So, okay. So we're down to 2.9% error. >> How did you come up with the idea for this? 00:40:05.880 |
Is it something that you just wanted to try? Or did you, like, stumble upon it while looking 00:40:11.480 |
at something else? >> Oh, I mean, it just seemed very obviously 00:40:15.000 |
to me like something which obviously we should do because, like, we were spending -- okay, 00:40:20.200 |
so on Dawnbench, we were training on ImageNet. It was taking 12 hours, I guess, to train 00:40:25.760 |
a single model. And the vast majority of that time, it's just recognizing very, very basic 00:40:34.960 |
things about images, you know? It's not learning the finer details of different cat breeds 00:40:40.800 |
or whatever, but it's just trying to understand about the concepts of, like, fur or sky or 00:40:45.280 |
metal. And I thought, well, there's no -- there's absolutely no reason to need 224 by 224 pixel 00:40:49.960 |
images to be able to do that, you know? Like, it just seemed obviously stupid that we would 00:40:59.120 |
do it. And partly, it was, like, also, like, I was just generally interested in changing 00:41:06.160 |
things during training. So, one of, you know, in particular, learning rates, right? So, 00:41:12.500 |
the idea of changing learning rates during training goes back a lot longer than Dawnbench, 00:41:17.720 |
that people had been generally training them by having a learning rate that kind of dropped 00:41:22.080 |
by a lot and then stayed flat and dropped by a lot and stayed flat. And Leslie Smith, 00:41:26.840 |
in particular, came up with this idea of kind of, like, gradually increasing it over a curve 00:41:31.520 |
and then gradually decreasing it following another curve. And so, I was definitely in 00:41:35.600 |
the mindset of, like, oh, there's kind of interesting things we can change during training. 00:41:39.320 |
So, I was looking at, like, oh, what if we change data augmentation during training, 00:41:44.280 |
for example? Like, maybe towards the end of training, we should, like, turn off data augmentation 00:41:49.260 |
so it could learn what unaugmented images look like, because that's what we really care 00:41:53.600 |
about, for example. So, yeah, that was the kind of stuff that I was kind of interested 00:42:02.160 |
in at the time. And so, yeah, definitely this thing of, like, you know, why are we looking 00:42:12.120 |
at 224 by 224 pixel images the entire time? Like, that just seemed obviously stupid. And 00:42:17.960 |
so, it wasn't something where I was like, wow, here's a crazy idea. I bet it won't work. 00:42:21.160 |
As soon as I thought of it, I just thought, okay, this is definitely going to work, you 00:42:24.720 |
know? And it did. >> Interesting. Thanks. Yeah. >> No worries. >> One question I have 00:42:33.400 |
for you, Jeremy. >> Yeah. >> There was a paper that came out, like, 00:42:37.400 |
in 2019 called Fixing the Train-Test Resolution Discrepancy, where, yeah, they, like, 00:42:44.940 |
trained on 224 and then did inference finally on, like, 320 by 320? >> Yeah. >> Have you 00:42:52.120 |
seen that still sort of work? Have you done that at all in your workflow? >> I mean, honestly, 00:42:57.920 |
I don't remember. I need to revisit that paper because you're right, it's an important one. 00:43:04.760 |
I, you know, I would generally try to fine-tune on the final size I was going to be predicting 00:43:16.920 |
on anyway. So, yeah, I guess we'll kind of see how we go with this, right? I mean, you 00:43:25.240 |
can definitely take a model that was trained on 224 by 224 images and use it to predict 00:43:32.720 |
360 by 360 images, and it will generally go pretty well. But I think it will go better 00:43:38.940 |
if you first fine-tune it on 360 by 360 images. >> Yeah, I don't think they tried pre-training 00:43:46.160 |
and then also training on, like, 320 versus just 320 in the 224. >> Yeah. >> That would 00:43:52.100 |
definitely be an interesting experiment. >> Yeah, it would be an interesting experiment. 00:43:54.960 |
It's definitely something that any of us here could do, you know? I think it would be cool. 00:44:00.680 |
Right? So, let's try scaling this up. So, we can change these two lines to one. So, 00:44:07.520 |
this is something I often do, is I do things like, yep. >> I think we don't have your screen. 00:44:14.720 |
>> So, I was just saying previously, I had, like, two cells to do this, and so now I'm 00:44:22.880 |
just going to combine it into one cell. So, this is what I tend to do as I fiddle around, 00:44:27.040 |
because I try to, like, gradually make things a little bit more concise, you know? Okay. 00:44:46.640 |
>> Does it make sense to go smaller than the original pre-training, like, with ConvNeXt? 00:44:58.120 |
>> Yeah, I mean, you can fine-tune to any size 00:45:02.320 |
you like. Absolutely. I'm just going to get rid of the zero padding, because, again, I 00:45:08.560 |
want to, like, try to change it a little bit each time, just to kind of, you know, it's 00:45:13.880 |
a kind of augmentation, right? So, okay. So, let's go up to 192. You know, one thing I 00:45:29.080 |
find encouraging is that, you know, my training loss isn't getting way underneath the validation 00:45:34.120 |
loss. It's not like we're -- feels like we could do this for ages before our error rates 00:45:43.240 |
start going up. Interestingly, when I reran this, my error rate was better, 0.0418. You've 00:46:10.200 |
got a good memory to remember these old papers. It's very helpful to be able to do that. 00:46:18.960 |
>> Usually what I wind up doing is my dad and I will email back and forth papers to 00:46:22.820 |
each other. So, I can just go through my sent mail, look at arXiv, and usually, if I don't remember 00:46:28.240 |
the name of it, I remember the subject of it in some degree. So, I can just go through 00:46:32.760 |
it all. >> I mean, it's a very, very good idea to 00:46:34.760 |
use a paper manager of some sort, to save papers, you know, whether it be Mendeley or 00:46:41.560 |
Zotero or Arxiv Sanity or whatever, or bookmarks or something. Yeah, because otherwise these 00:46:54.000 |
things disappear. Personally, I just tend to, like, tweet or favorite tweets about papers 00:47:01.400 |
I'm interested in. And then I've set up pinboard.in. I don't know if you guys have seen that, but 00:47:07.120 |
it's a really nice little thing, which basically any time you're on a website, you can click 00:47:17.000 |
a button and the extension and it adds it to pinboard, but it also automatically adds 00:47:23.680 |
all of your tweets and favorites, and it's got a full text search of the thing that the 00:47:30.720 |
URLs link to, which is really helpful. >> So, you've favorited something that just 00:47:35.960 |
says, oh, shit? >> No, I actually wrote something that just 00:47:38.600 |
said, oh, shit. That was me writing, oh, shit. It was this, I mean, totally off topic, but 00:47:47.320 |
it's an absolute disaster. I hope it's wrong, but it's an absolutely disastrous-sounding paper 00:47:55.120 |
that came out yesterday that basically, where was this key thing? People who've had one 00:48:03.520 |
COVID infection have a risk of at least one sequela of 8.4%, two infections 23%, three infections 00:48:09.680 |
36%. It's like my worst nightmare is the more people get infected with COVID, the more likely 00:48:15.960 |
it is that they'll get long-term symptoms, which is horrifying. That was my "oh, shit." 00:48:25.720 |
>> It's really awful. Okay. So, keeps going down, right? Which is cool. Let's keep bringing 00:48:31.840 |
along, I suppose. I guess, you know, what we could do is just grab this whole damn thing 00:48:40.240 |
here. Kind of have a bit of a comparison. So, we're basically going to run exactly the 00:48:49.980 |
same thing we did earlier, but this time with some pre-sizing first. 00:49:15.240 |
All right. So, that'll be an interesting experiment. So, while that's running, you know, this is 00:49:34.880 |
where I hit the old duplicate button. And this is why it's nice if you can to have a 00:49:46.320 |
second card. Because while something's running, you can try something else. CUDA_VISIBLE_DEVICES. 00:50:19.000 |
Okay. So, WeightedDataLoader. So, this is something I added to fastai a while ago and haven't 00:50:42.080 |
used much myself since. But if I just search for "weighted", here it is. Here it is. So, you 00:50:55.040 |
can see in the docs, it shows you exactly how to use weighted data loaders. And so, we 00:51:05.600 |
pass in a batch size. We pass in some weights. This is the weights. It's going to be 1, 2, 00:51:12.800 |
3, 4, 5, 6, 7, 8. And then some item transforms. These are really interesting in the docs. 00:51:23.560 |
In some ways, it's extremely advanced. And in other ways, it's extremely simple. Which 00:51:28.440 |
is to say, if you look at this example in the docs, everything is totally manual, right? 00:51:33.840 |
So, our labels are some random integers. And I've even added a comment here, right? Eight are 00:51:44.180 |
going to be in the training set. Two are going to be in the validation set. So, our data 00:51:51.960 |
block is going to contain one category block. Because we just got the one thing, right? 00:52:01.080 |
And rather than doing get X and get Y, you can also just say getters. Because get X and 00:52:07.960 |
get Y basically become getters, which is a list of transformations to do. And so, this 00:52:15.240 |
is going to be a single getter or a single get X, if you like, which is going to return 00:52:20.000 |
the Ith label. And a splitter, which is going to decide whether something's valid or not 00:52:25.400 |
based on this function. So, you can see this whole thing is totally manual. So, we can 00:52:32.080 |
create our data set by passing in a list of the numbers from 0 to 9. And a single item 00:52:38.520 |
transform that's going to convert that to a tensor. And then our weights will be the 00:52:43.440 |
numbers from 0 to 7. And so, then we can take our data sets or data sets and turn them into 00:52:51.040 |
data loaders using those weights. So, for the batch size of 1, if we say show batch, 00:53:02.240 |
we get back a single number, okay? And it's not doing random shuffling. So, we get the 00:53:06.640 |
number 0, because that was the first thing in our data set. Let's see, what do we do 00:53:17.720 |
next? Now, we've got to do n equals 160. So, now, we've got all of the numbers from 0 to 00:53:24.520 |
159 with those weights. Yes, for getters, yep. >> You mentioned, is this for X or Y? 00:53:38.120 |
>> This is a list. It's whatever, right? There is just one thing. I don't know if you call 00:53:44.040 |
that X or you call it Y. It's just one thing. So, if you have a get X and a get Y, that's 00:53:49.320 |
the same as having a getters with a list of two things. So, yeah. I think I could just 00:53:58.400 |
write get_x -- it's been ages since I wrote this, but I think I could just write get_x 00:54:01.400 |
here and put this not in a list. It would probably be the same thing. 00:54:05.160 |
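The equivalence being described, sketched with hypothetical getters:

```python
from fastai.data.block import DataBlock

get_x = lambda o: o['image']   # illustrative field names
get_y = lambda o: o['label']
a = DataBlock(get_x=get_x, get_y=get_y)
b = DataBlock(getters=[get_x, get_y])   # effectively the same data block
```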
>> Okay. >> That probably handles a little bit of the mystery 00:54:09.400 |
that might be happening as well. >> Yeah. >> The data block has an n_inp parameter. 00:54:14.480 |
>> Correct. >> Which is how it determines which of the 00:54:17.680 |
getters is X versus Y. >> Correct. Which we actually looked at last 00:54:22.920 |
time. Here. When we created our multi-image block. That was before you joined, Zach. Yes, 00:54:33.400 |
useful reminder. Okay. So, here we see a histogram of how often -- so, we created 00:54:49.560 |
like a little synthetic learner that doesn't really do anything, but we can pass callbacks 00:54:54.280 |
to it, and there's a callback called collect data callback, which just collects the data 00:54:59.280 |
that's part -- that is called in the learner, and so this is how we can then find out what 00:55:05.360 |
data was passed to the learner, get a histogram of it, and we can see that, yeah, the number 00:55:11.000 |
160 was received a lot more often when we trained this learner, which is what you would expect. 00:55:18.560 |
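The docs example being walked through looks roughly like this (reconstructed from memory of the fastai docs, so treat the details as approximate):

```python
import numpy as np
from fastai.vision.all import *

lbls = np.random.randint(0, 2, size=10)          # 10 items: 8 train, 2 valid
dblock = DataBlock(
    blocks=[CategoryBlock],                       # just the one block
    getters=[lambda i: lbls[i]],                  # a single getter returning the i-th label
    splitter=FuncSplitter(lambda i: i >= 8))      # last two items are the validation set
dsets = dblock.datasets(list(range(10)))
dls = dsets.weighted_dataloaders(wgts=range(8), bs=1, after_item=[ToTensor()])
dls.show_batch()                                  # bs=1, so a single weighted-sampled item
```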
This is the source of the weighted data loader class here, and as you can see, 00:55:30.200 |
other than the boilerplate, it's one, two, three, four, five lines of code. And then 00:55:37.240 |
the weighted data loader's method is two lines of code. So, there's actually a lot more lines 00:55:43.680 |
of example than there is of actual code. So, often it's easier just to read the source 00:55:49.560 |
code, because, you know, thanks to the very layered approach of fast AI, we can do so 00:55:55.800 |
much stuff with so little code. And so, in this case, if we look through the code, we're 00:56:02.920 |
passing in some weights, and basically the key thing here is that we set -- if the -- if 00:56:10.120 |
you pass in no weights at all, then we're just going to set it equal to the number one 00:56:15.440 |
repeated n times, so everything's going to get one, a weight of one. And then we divide 00:56:22.080 |
the weights by the sum of the weights so that the sum of the weights ends up summing up 00:56:26.240 |
to one, which is what we want. And then if you're not shuffling, then there's no weighted 00:56:42.520 |
anything to do, so we just pass back the indexes. And if we are shuffling, we will grab a random 00:56:49.000 |
choice of indexes based on the weights. Cool. All right. 00:56:49.000 |
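The gist of that sampling logic, paraphrased as a sketch (not the actual fastai source):

```python
import numpy as np

class WeightedSamplerSketch:
    def __init__(self, n, wgts=None, shuffle=True):
        wgts = np.array([1.0] * n if wgts is None else list(wgts), dtype=float)
        self.n, self.shuffle = n, shuffle
        self.wgts = wgts / wgts.sum()          # normalize so the weights sum to 1
    def get_idxs(self):
        if not self.shuffle:
            return list(range(self.n))         # not shuffling: just pass back the indexes
        return list(np.random.choice(self.n, self.n, p=self.wgts))
```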
So, there's going to be one weight per row. Let's come back to that, because I want to see how our thing's gone. It looks 00:57:15.520 |
like it's finished. Notice that the fav icon in Jupyter will change depending on whether 00:57:21.040 |
something's running or not, so that's how you can quickly tell if something's finished. 00:57:27.920 |
0.0216, 0.0221. Okay, I mean, it's not a huge difference, but maybe it's a tiny bit better. 00:57:50.400 |
I don't know. The key thing, though, is this lets us use our resources better, right? So 00:57:56.760 |
we often will end up with a better answer, but you can train for a lot less time. In 00:58:02.000 |
fact, you can see that the error was at 0.0216 back here, so we could probably have trained 00:58:08.360 |
for a lot less epochs. So that's progressive resizing. 00:58:13.240 |
Is there a way to look at that and go, "Oh, actually, I'd like to take the outputs from an earlier epoch"? 00:58:31.000 |
That was the question we got earlier. That's called early stopping, and the answer 00:58:36.120 |
is no. You probably wouldn't want to do early stopping. 00:58:39.080 |
But you can't go back to a previous epoch. There's no history. 00:58:46.680 |
You can. You have to use the early stopping callback to do that. 00:58:53.320 |
Or there's other things you can use. As I say, I don't think you should, but you can. 00:59:04.680 |
Okay, so the other part of that is, is it counterproductive or not? 00:59:10.400 |
It's not a cheat if it works, but not if it doesn't? 00:59:13.920 |
It's probably not a good idea. It probably will make it worse, yeah. 00:59:17.600 |
So the other thing you can do is save model callback, which saves -- which is kind of 00:59:22.520 |
like early stopping, but it doesn't stop. It saves the parameters of the best model 00:59:29.920 |
during training, which is probably what you want instead of early stopping. 00:59:34.400 |
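For reference, a minimal sketch of attaching it, assuming `learn` as before; the monitored metric and filename are illustrative:

```python
# SaveModelCallback tracks the monitored metric each epoch, saves the best
# weights, and reloads them at the end of training.
learn.fit_one_cycle(12, cbs=SaveModelCallback(monitor='error_rate', fname='best'))
```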
I don't think you should do that either for the same reason we discussed earlier. 00:59:40.400 |
Why shouldn't you do this? It seems like you could just ignore it if you didn't want it. 00:59:49.080 |
Well, so this actually automatically loads the best set of parameters at the end. 01:00:00.200 |
And you're just going to end up with this kind of model that just so happened to look 01:00:10.480 |
a tiny bit better on the validation set at an earlier epoch. 01:00:14.280 |
But at that earlier epoch, the learning rate hadn't yet stabilized, and it's very unlikely that it was actually better. 01:00:21.200 |
So you've probably actually just picked something that's slightly worse and made your process 01:00:27.880 |
slightly more complicated for no good reason. 01:00:31.040 |
Being better on an epoch there doesn't necessarily say anything about the final hidden test set. 01:00:39.320 |
Yeah, we have a strong prior belief that it will improve each epoch unless you're overfitting. 01:00:50.920 |
And if you're overfitting, then you shouldn't be doing early stopping, you should be doing something about the overfitting, like more data augmentation. 01:00:56.520 |
It seems like a good opportunity for somebody to document the arguments, because I'm curious what they do. 01:01:05.040 |
Yes, that would be a great opportunity for somebody to document the arguments. 01:01:10.320 |
And if somebody is interested in doing that, we have a really cool thing called docments, 01:01:19.600 |
which I only invented after we created fast.ai. 01:01:32.140 |
I should delete this because this is the old version; it's part of fastcore. 01:01:38.160 |
And you document each parameter by putting a comment after it. 01:01:44.800 |
And you document the return by putting a comment after it. 01:01:47.880 |
And Zach actually started a project, after I created docments, to add docment comments 01:01:55.640 |
to everything in fast.ai, which of course is not finished because fast.ai is pretty big. 01:02:01.240 |
And so here's an example of something that doesn't yet have docment comments. 01:02:04.440 |
So if somebody wants to go and add a comment to each of these things and put that into 01:02:10.800 |
a PR, then that will end up in the documentation. 01:02:20.320 |
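A tiny illustration of the docments style (an invented function, just to show the shape): each parameter gets an inline comment, the return comment goes after the signature, and nbdev renders these as a parameter table.

```python
def add_nums(
    a:int,    # the first number to add
    b:int=0,  # the second number to add
)->int:       # the sum of `a` and `b`
    return a + b
```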
Something we should do, Zach, is to actually include an example in the documentation of 01:02:33.320 |
what it ends up looking like in nbdev, because I can see that's missing. 01:02:44.600 |
No, I just wanted to encourage everybody that writing the documentation is an excellent way to learn. 01:02:55.520 |
And what ends up happening is you write this documentation and somebody like Jeremy will 01:03:02.360 |
review it carefully and let you know what you don't understand. 01:03:07.000 |
And that's how I learned about so much of the fast.ai library. 01:03:12.520 |
So I highly recommend it, going and doing that. 01:03:20.040 |
And you can see it's got a little table underneath. 01:03:22.120 |
And if we look at the source of optimizer, you'll see that each parameter has a comment 01:03:30.320 |
But it's automatically turned into this table. 01:03:43.360 |
Anybody got any questions or comments or anything before we wrap up? 01:03:50.720 |
I have a question regarding progressive resizing. 01:03:57.320 |
We didn't actually do lr_find after each step; don't you think it would be helpful? 01:04:14.320 |
I, to be honest, I don't use lr_find much anymore nowadays, because, you know, at least 01:04:24.520 |
for object recognition in computer vision, the optimal learning rate is pretty much always the same. 01:04:33.680 |
There's no reason to believe that we have any need to change it just because we changed 01:04:40.760 |
So, yeah, I wouldn't bother just leave it where it was. 01:04:50.640 |
Jeremy, if your training and validation loss is still decreasing after 12 epochs, can you 01:04:55.640 |
pick up and train for a little longer without restarting? 01:05:00.680 |
The first thing I'll say is you shouldn't be looking at the validation loss to see if you're overfitting. 01:05:04.880 |
So, the validation loss can get worse whilst the error rate gets better, and that doesn't 01:05:09.560 |
count as overfitting, because the thing you want to improve is the error rate. 01:05:13.560 |
That can happen if it gets overconfident, but it's still improving. 01:05:17.360 |
Yeah, you can keep training for longer because we're using, if you're using fit one cycle 01:05:25.520 |
or fine-tune, and fine-tune uses fit one cycle behind the scenes, continuing to train further, 01:05:33.400 |
your learning rate is going to go up and then down and then up and then down each time, 01:05:37.240 |
which is not necessarily a bad thing, but, you know, if you, yeah, if you basically want 01:05:44.520 |
to keep training at that, you know, at that point, you would probably want to decrease 01:05:53.280 |
the learning rate by maybe 4x or so, and in fact, you know, I think after this, I'm going 01:05:59.240 |
to rerun this whole notebook, but halve the learning rate each time, so I think that would help. 01:06:17.360 |
I don't know if it's too late, but I think it might be useful to discuss, when you do 01:06:22.840 |
the progressive resizing, what part of the model gets dropped, like, what, you know, 01:06:32.160 |
is there some part of the model that needs to be reinitialized for the new size? 01:06:42.120 |
I thought you were talking to me, but you're talking to Siri? 01:06:54.080 |
Yeah, ConvNeXt is what we call a resolution-independent architecture, which means it works 01:07:03.040 |
for any input resolution, and time-permitting in the next lesson, we will see how convolutional 01:07:13.240 |
neural networks actually work, but I guess a lot of you probably already know, so for 01:07:18.040 |
those of you that do, if you think about it, it's basically going patch by patch and doing 01:07:23.520 |
this kind of mini matrix multiply for each patch, so if you change the input resolution, 01:07:32.760 |
it just has more patches to cover, but it doesn't change the parameters at all, so there's nothing that needs to be reinitialized. 01:07:52.080 |
I was just going to, a quick note, ask: is ResNet resolution-independent? 01:08:01.080 |
Typically, everything we use normally is, but, like, have a look at that 01:08:09.160 |
best fine-tuning models notebook, and you'll see that two of the best ones are called ViT. 01:08:19.160 |
None of those are resolution-independent, although there is a trick you can use to kind of make 01:08:26.800 |
them resolution-independent, which we should try out in a future walkthrough. 01:08:42.120 |
I don't know if we can use it to support progressive resizing or not. 01:08:48.680 |
It's basically changing the positional encodings. 01:08:57.640 |
After you've done your experiments, progressive resizing, and fine-tuning, how do you then train on the full dataset? 01:09:16.640 |
Like, instead, I do what we saw in the last walkthrough, which is I just train on a few 01:09:24.200 |
different randomly selected training sets, because that way, you know, you get the benefit: 01:09:37.760 |
you're going to end up seeing all the images at least once anyway. 01:09:41.080 |
And you can also kind of see if something's messed up, because you've still got a validation 01:09:46.240 |
So yeah, I used to do this thing where I would create a validation set with a single item 01:09:51.440 |
in it to get that last bit of juice, but I don't even do that anymore.