
Live coding 14


Chapters

0:00 Questions
0:05 About the concept/capability of early stopping
4:00 Different models, which one to use
5:25 Gradient Boosting Machine with different model predictions
7:25 AutoML tools
7:50 Kaggle winners approaches, ensemble
9:00 Test Time Augmentation (TTA)
11:00 Training loss vs validation loss
12:30 Averaging a few augmented versions
13:50 Unbalanced dataset and augmentation
15:00 On balancing datasets
15:40 WeightedDL, Weighted DataLoader
17:55 Weighted sampling on Diabetic Retinopathy competition
19:40 Let's try something…
21:40 Setting an environment variable when having multiple GPUs
21:55 Multi-target model
23:00 Debugging
27:04 Revise transforms to 128x128 and 5 epochs
28:00 Progressive resizing
29:16 Fine tuning again but on larger 160x160 images
34:30 Oops, small bug, restart (without creating a new learner)
37:30 Re-run second fine-tuning
40:00 How did you come up with the idea of progressive resizing?
41:00 Changing things during training
42:30 On the paper Fixing the train-test resolution discrepancy
44:15 Fine tuning again but on larger 192x192 images
46:11 A detour about paper reference management
48:27 Final fine-tuning 256x192
49:30 Looking at WeightedDL, WeightedDataLoader
57:08 Back to the results of fine-tuning 256x192
58:20 Question leading to look at callbacks
59:18 About SaveModelCallback
60:56 Contributing, Documentation, and looking at “Docments”
63:50 Final questions: lr_find()
64:50 Final questions: Training for longer, decreasing validation loss, epochs, error rate
66:15 Final questions: Progressive resizing and reinitialization
68:00 Final questions: Resolution independent models

Transcript

There we go. Yeah. Got a question. Yeah. So, the training sort of process in fastai: is there a concept or capability to do, like, early stopping or a save-the-best kind of thing, or if there isn't, is there a reason why you chose not to do that? I never remember, because I don't use it myself.

So what I would check, I'm just checking now, is the callbacks, which is under training. So let's go to the docs, training, callbacks. And if anybody else knows, please shout out. There is a callback thingy. Early stopping callback. Yeah. I found it. Okay. It's under tracking callbacks. There's training, callbacks, tracker.

There's an early stopping callback. So perhaps the more interesting part then is like, why do I not use it so I don't even know whether it exists. There's a few reasons. One is that it doesn't play nicely with one cycle training or fine tuning. If you stop early, then the learning rate hasn't got a chance to go down.
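
For reference, if you did want it, usage looks roughly like this; a minimal sketch, assuming the standard pets example, with EarlyStoppingCallback from fastai.callback.tracker (the monitor and patience values are just illustrative):

```python
from fastai.vision.all import *
from fastai.callback.tracker import EarlyStoppingCallback

# Illustrative setup: the usual pets example, nothing specific to this walkthrough.
path = untar_data(URLs.PETS)/'images'
def is_cat(f): return f.name[0].isupper()   # cats have capitalised filenames in this dataset

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
learn = vision_learner(dls, resnet18, metrics=error_rate)

# Stop if error_rate hasn't improved for 3 epochs in a row.
learn.fit_one_cycle(20, cbs=EarlyStoppingCallback(monitor='error_rate', patience=3))
```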

And for that reason, it's almost never the case that earlier epochs have better accuracy, because the learning rate hasn't settled down yet. If I was doing one cycle training and I saw that an earlier epoch had a much better accuracy, then I would know that I'm overfitting, in which case I would be adding more data augmentation rather than doing early stopping, because it's good to train for the amount of time that you have.

So yeah, I can't think offhand of a situation where I would, I mean, I haven't come across a situation where I've personally wanted to use early stopping. Like in some of the training examples, like where you had the error rate, like some of the prior runs may have had a better lower error rate.

Oh, I mean, in the ones I've shown, like a tiny bit better, yeah, but, like, not enough to be meaningful, you know. And yeah, so there's no reason to believe that those are actually better models, and there's plenty of a priori reason to believe that they're actually not, which is that the learning rate still hasn't settled down at that point.

So we haven't let it fine tune into the best spot yet. So yeah, if it's kind of going down, down and down and it's kind of bottoming out and just bumps a little bit at the bottom, that's not a reason to use early stopping. And it's also, I think, important to realize that the validation set is relatively small as well.

So it's only a representation of the distribution that the data is coming from. So reading too much into those small fluctuations can be very counterproductive. I know that I've wasted a lot of time in the past doing that, but yeah, a lot of time. We're looking for changes that dramatically improve things, you know, like changing from resnet26d to ConvNeXt, and we improved by what, 400 or 500%, and it's like, okay, that's an improvement.

Over the weekend, I went on my own server that I have here behind me, and I ran all, like, 35 models for the Paddy thing. I didn't do the example, but I was thinking about this: when I was taking algebra back in high school or college, you have some of these piecewise expressions, where the function of x is equal to x squared for x greater than something, and the absolute value of x when x is equal to something.

So it just gave me the idea that maybe some of the data set is going to fail to hit the target value for every single one of the models that we tried, but if we try different models, it's going to be successful. So can we do that? I mean, of course we can, but what would be the easiest approach to say, for this validation example, when x is equal to this or greater than that, this is the model to use, but if it's the other case, this is the other model you have to use?

Yeah. I mean, you could do that, right? And like a really simple way to do that, which I've seen used with some success on Kaggle, is to train lots of models and then to train a gradient boosting machine whose inputs are those model predictions and whose output is the targets.

And so that'll do exactly what you just described. It's very easy to overfit when you do that, and if you've trained them well, you're only going to get a tiny increase, right? Because the neural nets are flexible, it shouldn't have that situation where in this part of the space it has bad predictions and in this part of the space it has good predictions.

Like, that's not really how neural nets work. If you had a variety of totally different types of model, like a random forest, a GBM and a neural net, I could see that maybe, but most of the time, one of those will be dramatically better than the other ones.

And so like, I don't that often find myself wanting to ensemble across totally different types of model. So I'd say it's another one of these things like early stopping, which like a lot of people waste huge amounts of time on, you know, and it's not really where the big benefits are going to be seen.
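
For what it's worth, that kind of stacking looks roughly like this; a minimal sketch with scikit-learn, where the base-model predictions and labels are just random placeholders standing in for held-out predictions from models you have already trained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical out-of-fold predictions from three base models, shape (n_samples, n_classes),
# plus the true labels. In practice these would come from your trained neural nets.
n, n_classes = 1000, 10
rng = np.random.default_rng(0)
preds_a, preds_b, preds_c = (rng.random((n, n_classes)) for _ in range(3))
y = rng.integers(0, n_classes, n)

# Stack the base-model predictions side by side to form the meta-model's features.
X = np.hstack([preds_a, preds_b, preds_c])
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# The GBM learns which model to trust where; it's easy to overfit, so keep it small
# and always evaluate on data the base models never saw.
meta = GradientBoostingClassifier(n_estimators=100, max_depth=3)
meta.fit(X_trn, y_trn)
print('stacked accuracy:', meta.score(X_val, y_val))
```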

But yeah, if you're like in gold medal zone on a Kaggle competition and you need another 0.002% or something, then these are all things you can certainly try at that point. It kind of reminded me of AutoML, like that regime of tools. I don't know how you feel about those things.

Yeah, we talked about that in last night's lesson actually. So you'll have to catch up to see what I, what I said, if you haven't seen the lesson yet. Yeah. I'll mention also reading Kaggle winners descriptions of their approaches is, is great. But you've got to be very careful because remember, like Kaggle winners are the people who did get that last 0.002%.

You know, because like everybody found all the low hanging fruit and the people who won grabbed the really high hanging fruit. And so every time you read a Kaggle winner's description, they almost always have complex ensembling methods. And that's why, you know, in like something like a big image recognition competition, it's very hard to win, or probably impossible to win, with a single model, unless you invent some amazing new architecture or something.

And so you might get the impression then that ensembling is the big thing that gets you all the low hanging fruit, but it's not. Ensembling is the thing which is particularly complex, and it's a thing that gets you that last fraction of a fraction of a percent.

One more question. Yeah, of course. The TTA concept, right? So, if I understand, like, I'm trying to understand conceptually why TTA improves the score, because technically, when you're training, it is already using those augmented sort of pictures and providing a percentage number for them.

But when you run that TTA function, why is it able to predict better? So, like, you know how sometimes you're looking at some, I don't know, a screw head or a plug or a socket or something, it's really small, and you can't quite see, like, how many pins are in it or what type it is or whatever.

And you're kind of like, look at it from different angles, and you're kind of like, put it up to the light, and you try to like, at some point, you're like, okay, I see it, right? And there's like some angle and some lighting that you can see it. That's what you're doing for the computer, you're giving it different angles, and you're giving it different lighting in the hope that in one of those, it's going to be really clear.

And for the ones where it's easy, it's not going to make any difference, right? But for the ones who it's like, oh, I don't know if it's this disease or that disease, but oh, you know, when it's a bit brighter, and you kind of zoom into that section, like, oh, now I can see.

And so when you then average them out, you know, all the other ones are all like, oh, I don't know which kind it is, so it's like 0.5, 0.5, 0.5. And then this one is like 0.6. And so that's the one that, in the average, it's going to end up picking.

That's basically what happens. It also has another benefit, which is when we train our models, I don't know if you've noticed, but our training loss generally gets much, much lower than our validation loss. So basically, what's happening there is that on the training set, the model is getting very confident, right?

So even though we're using data augmentation, it's seeing slightly different versions of the same image dozens of times. And it's like, oh, I know how to recognize these. And so the probabilities it associates with them are like 0.9, 0.99, you know; it's like, I'm very confident of these.

And it actually gets overconfident, which actually doesn't necessarily impact our accuracy, you know, to be overconfident. But at some point, it, it can. And so we are systematically going to have like overconfident predictions of probability. When, even when it doesn't really know, just because it's really seen that kind of image before.

So then on the validation set, it's going to be, you know, producing overconfident probabilities as well. And so one nice benefit is that when you average out a few augmented versions, you know, it's like, oh, 0.99 probability it's this one. And then on the next one, an augmented version of the same image, it's like, oh no, 0.1 probability it's that one.

And they'll kind of average out to much more reasonable probabilities, which can, you know, allow it sometimes to yeah, combine these ideas into an average that that makes more sense. And so that can improve accuracy, but in particular, it improves the actual probabilities to get rid of that overconfidence.
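
In fastai, that averaging over augmented versions is what Learner.tta does; a minimal sketch, assuming `learn` is an already-trained vision learner with error_rate as a metric:

```python
# Assumes `learn` is an already-trained Learner, e.g. from the rice notebooks.
# tta() predicts each validation image several times under the training-time
# augmentations (4 augmented passes by default, blended with one unaugmented pass)
# and averages the resulting probabilities.
preds, targs = learn.tta()
print('TTA error rate:  ', error_rate(preds, targs).item())

# Plain validation predictions, for comparison.
plain_preds, _ = learn.get_preds()
print('plain error rate:', error_rate(plain_preds, targs).item())
```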

Is it fair to say that when you train, it's not able to relate the replicated sort of images, the slightly distorted variants of the original image, but when you use TTA, it is able to group all the four images? And that's what TTA is: we present them all together and average out that group.

Yes. But in training, we don't indicate in any way that they're the same image or that they're the same underlying object. One of the questions, Jeremy, was going to be about ensembling and how to pick the best ensemble; I was going to ask about that. But another question is, we have a fairly unbalanced data set, I guess, with the normal versus the disease states.

You're doing augmentation. Is there any benefit to sort of over representing the minority classes? So let's let's pull away augmentation. So it's actually got nothing to do with augmentation. So more generally, when you're training, does it make sense to over represent the minority class? And the answer is, maybe.

Yeah, it can. Right. And so, okay, so just for those who aren't following, the issue Matt's talking about is that there was, you know, a couple of diseases which appear lots and lots in the data, and a couple which hardly appear at all. And so, you know, do we want to try to balance this out more?

And one thing that people often do to balance it out more is that they'll throw away some of the images in the highly represented classes. And I can certainly tell you straight away, you should never ever do that. You never want to throw away data. But Matt's question was, well, could we, you know, over sample the less common diseases?

And the answer is, yeah, absolutely, you could. And in fastai, you go into the docs. Now, where is it? There is a weighted data loader somewhere. Weighted. Search for that. Here we go. Of course, it's a callback. So if you go to the callbacks data section, you'll find a WeightedDL callback or a weighted_dataloaders method.

I'm not. No, I'm just telling you where to look. Thanks for checking. So, yeah, I mean, let's look at that today, right? Because I kind of want to look at, like, things we can do to improve things today. It doesn't necessarily help. Because it does mean, you know, given that you're, you know, let's say you do 10 epochs of 1,000 images, it's going to get to look at 10,000 images, right?

And if you over sample a class, then that also means that it's going to get, it's going to see less of some images and going to get more repetition of other images, which could be a problem, you know? And really, it just depends on, depends on a few things.

If it's like really unbalanced, like 99% all of one type, then you're going to have a whole lot of batches where it never sees anything of the underrepresented class. And so basically, there's nothing for it to learn from. So at some point, you probably certainly need weighted sampling. It also depends on the evaluation.

You know, if people like say in the evaluation, okay, we're going to kind of average out for each disease, how accurate you were. So every disease will then be like equally weighted, then you would definitely need to use weighted sampling. But in this case, you know, presuming, presuming that the test set has a similar distribution as a training set, weighted sampling might not help.

Because they're going to care the most about how well we do on the highly represented diseases. I'll note my experience with like oversampling and things like that. I think one time I had done with I think diabetic retinopathy, there was a competition for that. And I had used weighted sampling or oversampling, and it did seem to help.

And then also a while back, I did an experiment where, I think this was back with fastai version one, where I took like the MNIST data set, and then I artificially added some sort of imbalance. And then I trained with and without weighted sampling. And I saw like there was an improvement with the weighted sampling on accuracy on like just a regular validation set.

So from that, from those couple experiments, I'd say like, I've at least seen some help and improvement with weighted sampling. Cool. And was that cases where that data set was like, highly unbalanced? Or was it more like the data set that we're looking at, at the moment? It wasn't highly unbalanced.

It was maybe like, I don't know, like, maybe like, yeah, just 75% versus 25% or something like that. It's not like 99.99 versus 1%, nothing like that. It was more. Oh, well, it wasn't that bad. So let's try it today. Yeah. I see we've got a new face today as well.

Hello, Zach. Thanks for joining. Hey, hey, glad I could finally make these. Yeah. Are you joining from Florida? No, I'm in Maryland now. Maryland now. Okay, I have a change. Yes, much more up north. Okay, great. So let's let's try something. Okay, so let's connect to my little computer.

It says, is there a way to shrink my zoom bar out of the way? It takes up so much space. Hide floating meeting controls. I guess that's what I want. Control Alt Shift H. Wow. Press escape to show floating meeting controls. That doesn't work very well with Vim. Oh, well, Control Alt Shift 8.

Okay. We're not doing tabular today, so let's get rid of that. So I think what I might do is, you know, because we're iterating. Well, I guess we could start with the multitask notebook, because this is our kind of like things-to-try-to-improve version. I'll close that.

I'll leave that open just in case we want to. Okay. By the way, if you've got multiple GPUs, this is how you just use one of them. You can just set an environment variable. Okay, so this is where we did the multi-target model. Okay.
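
The environment-variable trick just mentioned looks like this; set it before anything initialises CUDA (the device index is just an example):

```python
import os

# Make only GPU 1 visible to this process; PyTorch will then see it as cuda:0.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
```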

Just moved everything slightly. Comp. Not comp path. Right. Back to where we were. Okay. So now what? What's broken? Data block. Get image files. Well, this is working the other day. So I guess we better try to do some debugging. So the obvious thing to do would be to call this thing here, get image files on the thing that we passed in here, which is train path.

Okay, so that's working. Then the other thing to do would be to check data by doing show batch. Okay, that's working. And I guess, all right, and it's showing you two different things. That's good. Oh, is it? Right, we've got the two category blocks. So we can't use this one.

We have to use this one. So fit one cycle. Yeah, okay. So the to remind you we have this is the one where we had two categories and one input. And to get the two categories, we use the parent label and this function, which looked up the variety from this dictionary.

Okay, and then when we fine tuned it, and let's just check: yes, seed equals 42. So that's our standard seed; we should be able to then compare that to the small models trained for 12 epochs. And then that was this one. Part two. And let's see. They're not quite the same, because this was 480 squish.

Or else this was rectangular pad. Let's do five epochs. Let's do it the same as this one. Yeah, let's do this one. Because we want to be able to do quick iterations. Let's see resize 192 squish. There we go. And then we trained it for 0.01 with FP16 with five epochs.

All right. So this will be our base case. Well, you know, I mean, I guess this is our base case 0.045. This will be our next case. Okay, so while that's running, the next thing I wanted to talk about is progressive resizing. So this is training at a size of 128.

Which is not very big. And we wouldn't expect it to do very well. But it's certainly better than nothing. And as you can see, it's -- that's not error. Disease error. It's down to 7.5% error already and it's not even done. So that's not bad. And, you know, in the past, what we've then done is we've said, okay, well, that's working pretty well.

Let's throw that away and try bigger. But there's actually something more interesting we can do. Which is we don't have to throw it away. What we could do is to continue training it on larger images. So we're basically saying, okay, this is a model which is fine tuned to recognize 128 by 128 pixel images of rice.

Now let's continue training it so that it's fine tuned to recognize 192 by 192 pixel images of rice. And we could even -- there's a few benefits to that. One is it's very fast, you know, to do the smaller images. And it can recognize the key features of it. So, you know, this lets us do a lot of epochs quickly.

And then, like, the difference between small images of rice disease and large images of rice disease isn't very big difference. So you would expect it would probably fine tune to bigger images of rice disease quite easily. So we might get most of the benefit of training on big images, but without most of the time.

The second benefit is it's a kind of data augmentation, which is we're actually giving it different sized images. So that should help. So here's how we would do that. Let's grab this data block. Let's make it into a function. Get dl. Okay. And the key thing I guess we're going to do -- well, let's just do the item transforms and the batch transforms as usual.

Oops. So the things we're going to change are the item transforms and the batch transforms. And then we're going to return the data loaders for that, which is here. Okay. So let's try -- coming up a bit -- dls equals get_dls. I guess it should be get_dls really, because it returns DataLoaders.
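
The get_dls function being written here ends up as something like this; treat the exact transform arguments as illustrative:

```python
from fastai.vision.all import *

trn_path = Path('train_images')   # assumed path to the competition images

def get_dls(item_tfms, batch_tfms):
    "Rebuild the DataLoaders so we can swap presizing/augmentation between stages."
    return DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        get_y=parent_label,
        splitter=RandomSplitter(seed=42),
        item_tfms=item_tfms,
        batch_tfms=batch_tfms).dataloaders(trn_path)

# First stage: presize with a 192 squish, train at 128.
dls = get_dls(Resize(192, method='squish'),
              aug_transforms(size=128, min_scale=0.75))
```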

Okay. So let's see what we did last time as we scale it up a bit. So this is going to be data augmentation as well. We're going to change how we scale. So we'll scale with zero padding. And let's go up to 160. Okay. So then we need a learner.

Okay. So we're going to change the size of the item. So our -- where's our squish one here? Squish. So the squish here got 0.045. Our multitask got 0.048. So it's actually a little bit worse. This might not be a great test, actually, because I feel like one of the reasons that doing a multitask model might be useful is it might be able to train for more epochs.

Because we're kind of giving it more signal. So we should probably revisit this with, like, 20 epochs. Any questions or comments about progressive resizing while we wait for this to train? >> Sorry, I can't see how you progressively changed the size because -- >> I actually didn't. I messed it up.

Whoops. Thank you. I have to do that again. I actually didn't. Oh, and we need to get our DLS back as well. Okay. Let's start again. Okay. And let's -- in case I mess this up again, let's export this. We'll call this, like, stage one. See? Yeah. The problem was we created a new learner.

So what we should have done is gone learn dot dls equals dls. That would actually change the data loaders inside the learner without recreating it. Was that where you were heading with your comment? There was an unfreeze method. Like, the same thing in the book; I think the book actually mentions using the unfreeze method.

>> There is an unfreeze method. Yes. What were you saying about the unfreeze method? >> Is an unfreeze required for progressive resizing? Am I wrong? >> No, because fine-tune is already unfrozen. Although I actually want to fine-tune again. So if anything, I kind of actually want to -- I actually want to refreeze it.

Because we've changed the resolution, I think fine-tuning the head might be a good idea to do again. >> Which line of code is doing the progressive resizing part, just to be clear? >> It's not our line of code. It's basically this. It's basically saying our current learner is getting new data loaders.

And the new data loaders have a size of 160, whereas the old data loaders had a size of 128. And our old data loaders did a presizing of 192 squish, but our new data loaders are doing a presizing of rectangular padding. Does that make sense? >> Why are you calling it progressive in this case?

Are you going to keep changing the size or something like that? >> Yeah, it's changing the size of the images without resetting the learner. >> Just looked it up because I was curious. >> Fine-tune calls a freeze first. >> I had a feeling it did. Thanks for checking, Zach.

So this time, you know, let's see. It'll be interesting, right, to see how it does. So after the initial epoch, it's got .09, right? Whereas previously it had .27. So obviously it's better than last time. But it's actually worse than the final point, right? This time it got all the way to .418.

Yeah, or else this time it has got worse. So it's got some work to do to learn to recognize what 160 pixel images look like. >> Can I just clarify, Jeremy? So you're like doing one more step in the progressive resizing here. It's not kind of an automated resizing.

>> Correct. Correct. Yeah. Yeah. There isn't anything in fast.ai to do this for you. And in fact, this technique is something that we invented. So it doesn't exist in other libraries at all. So, yeah, it's the name of a technique. It's not the name of, like, a method in fast.ai.

And, yeah, the technique is basically to replace the data loaders with ones at a larger size. And we invented it as part of a competition called DAWNBench, where we did very well on ImageNet training. And Google then took the idea and studied it a lot further as part of a paper called EfficientNetV2 and found ways to make it work even better.

Oh, my gosh, look at this. So we've gone from 0.418 to 0.0336. Have we done training at 160 before? I don't think we have. I should be checking this one. 128, 128. 171 by 128. No, we haven't. This is a 256 by 192. So eventually, I guess we're going to get to that point.

So let's keep going. So, okay. So we're down to 2.9% error. >> How did you come up with the idea for this? Is it something that you just wanted to try? Or did you, like, stumble upon it while looking at something else? >> Oh, I mean, it just seemed very obviously to me like something which obviously we should do because, like, we were spending -- okay, so on Dawnbench, we were training on ImageNet.

It was taking 12 hours, I guess, to train a single model. And the vast majority of that time, it's just recognizing very, very basic things about images, you know? It's not learning the finer details of different cat breeds or whatever, but it's just trying to understand about the concepts of, like, fur or sky or metal.

And I thought, well, there's no -- there's absolutely no reason to need 224 by 224 pixel images to be able to do that, you know? Like, it just seemed obviously stupid that we would do it. And partly, it was, like, also, like, I was just generally interested in changing things during training.

So, one of, you know, in particular, learning rates, right? So, the idea of changing learning rates during training goes back a lot longer than Dawnbench, that people had been generally training them by having a learning rate that kind of dropped by a lot and then stayed flat and dropped by a lot and stayed flat.

And Leslie Smith, in particular, came up with this idea of kind of, like, gradually increasing it over a curve and then gradually decreasing it following another curve. And so, I was definitely in the mindset of, like, oh, there's kind of interesting things we can change during training. So, I was looking at, like, oh, what if we change data augmentation during training, for example?

Like, maybe towards the end of training, we should, like, turn off data augmentation so it could learn what unaugmented images look like, because that's what we really care about, for example. So, yeah, that was the kind of stuff that I was kind of interested in at the time. And so, yeah, definitely this thing of, like, you know, why are we looking over 224 by 224 pixel images the entire time?

Like, that just seemed obviously stupid. And so, it wasn't something where I was like, wow, here's a crazy idea. I bet it won't work. As soon as I thought of it, I just thought, okay, this is definitely going to work, you know? And it did. >> Interesting. Thanks. Yeah.

>> No worries. >> One question I have for you, Jeremy. >> Yeah. >> There was a paper that came out, like, in 2019 called Fixing the Train-Test Resolution Discrepancy, where, yeah, they, like, trained on 224 and then did inference finally on, like, 320 by 320? >> Yeah.

>> Have you seen that still sort of work? Have you done that at all in your workflow? >> I mean, honestly, I don't remember. I need to revisit that paper because you're right, it's important tonight. I, you know, I would generally try to fine-tune on the final size I was going to be predicting on anyway.

So, yeah, I guess we'll kind of see how we go with this, right? I mean, you can definitely take a model that was trained on 224 by 224 images and use it to predict 360 by 360 images, and it will generally go pretty well. But I think it will go better if you first fine-tune it on 360 by 360 images.

>> Yeah, I don't think they tried pre-training and then also training on, like, 320 versus just 320 in the 224. >> Yeah. >> That would definitely be an interesting experiment. >> Yeah, it would be an interesting experiment. It's definitely something that any of us here could do, you know?

I think it would be cool. Right? So, let's try scaling this up. So, we can change these two lines to one. So, this is something I often do, is I do things like, yep. >> I think we don't have your screen. >> So, I was just saying previously, I had, like, two cells to do this, and so now I'm just going to combine it into one cell.

So, this is what I tend to do as I fiddle around, because I try to, like, gradually make things a little bit more concise, you know? Okay. >> Does it make sense to go smaller than the original pre-training, like, with ConvNeXt? >> Yeah, I mean, you can fine-tune to any size you like.

Absolutely. I'm just going to get rid of the zero padding, because, again, I want to, like, try to change it a little bit each time, just to kind of, you know, it's a kind of augmentation, right? So, okay. So, let's go up to 192. You know, one thing I find encouraging is that, you know, my training loss isn't getting way underneath the validation loss.

It's not like we're -- it feels like we could do this for ages before our error rates start going up. Interestingly, when I reran this, my error rate was better, .418. You've got a good memory to remember these old papers. It's very helpful to be able to do that. >> Usually what I wind up doing is my dad and I will email back and forth papers to each other.

So, I can just go through my sent mail, look at the archive, and usually, if I don't remember the name of it, I remember the subject of it to some degree. So, I can just go through it all. >> I mean, it's a very, very good idea to use a paper manager of some sort, to save papers, you know, whether it be Mendeley or Zotero or Arxiv Sanity or whatever, or bookmarks or something.

Yeah, because otherwise these things disappear. Personally, I just tend to, like, tweet or favorite tweets about papers I'm interested in. And then I've set up pinboard.in. I don't know if you guys have seen that, but it's a really nice little thing, which basically any time you're on a website, you can click a button and the extension and it adds it to pinboard, but it also automatically adds all of your tweets and favorites, and it's got a full text search of the thing that the URLs link to, which is really helpful.

>> So, you've favorited something that just says, oh, shit? >> No, I actually wrote something that just said, oh, shit. That was me writing, oh, shit. It was this, I mean, totally off topic, but it's an absolute disaster. I hope it's wrong, but it's an absolutely disastrous-sounding paper that came out yesterday. Basically, where was this key thing?

People who've had one COVID infection have a risk of at least one sequela of 8.4%, two infections 23%, three infections 36%. It's like my worst nightmare: the more people get infected with COVID, the more likely it is that they'll get long-term symptoms, which is horrifying. That was my oh-shit moment.

>> That is very horrifying. >> It's really awful. Okay. So, keeps going down, right? Which is cool. Let's keep bringing along, I suppose. I guess, you know, what we could do is just grab this whole damn thing here. Kind of have a bit of a comparison. So, we're basically going to run exactly the same thing we did earlier.

At this time, with some pre-sizing first. All right. So, that'll be an interesting experiment. So, while that's running, you know, this is where I hit the old duplicate button. And this is why it's nice if you can to have a second card. Because while something's running, you can try something else.

CUDA_VISIBLE_DEVICES. There we go. So, we can keep working. Okay. So, Weighted Data Loader. So, this is something I added to fastai a while ago and haven't used much myself since. But if I just search for weighted, here it is. Here it is. So, you can see in the docs, it shows you exactly how to use weighted data loaders.

And so, we pass in a batch size. We pass in some weights. This is the weights. It's going to be 1, 2, 3, 4, 5, 6, 7, 8. And then some item transforms. This example is really interesting in the docs. In some ways, it's extremely advanced. And in other ways, it's extremely simple.

Which is to say, if you look at this example in the docs, everything is totally manual, right? So, our labels are some random integers. And I've even added a comment here, right? It's going to be in the training set. Two are going to be in the validation set. So, our data block is going to contain one category block.

Because we just got the one thing, right? And rather than doing get X and get Y, you can also just say getters. Because get X and get Y basically become getters, which is a list of transformations to do. And so, this is going to be a single getter or a single get X, if you like, which is going to return the Ith label.

And a splitter, which is going to decide whether something's valid or not based on this function. So, you can see this whole thing is totally manual. So, we can create our data set by passing in a list of the numbers from 0 to 9. And a single item transform that's going to convert that to a tensor.

And then our weights will be the numbers from 0 to 7. And so, then we can take our data sets or data sets and turn them into data loaders using those weights. So, for the batch size of 1, if we say show batch, we get back a single number, okay?

And it's not doing random shuffling. So, we get the number 0, because that was the first thing in our data set. Let's see, what do we do next? Now, we've got to do n equals 160. So, now, we've got all of the numbers from 0 to 159 with those weights.

Yes, for getters, yep. >> You mentioned, is this for x or y? >> This is a list. It's whatever, right? There is just one thing. I don't know if you call that x or you call it y. It's just one thing. So, if you have a get_x and a get_y, that's the same as having a getters with a list of two things.

So, yeah. I think I could just write get -- it's been ages since I wrote this, but I think I could just write get_x here and put this not in a list. It would probably be the same thing. >> Okay. >> That would probably handle a little bit of the mystery that might be happening as well.

>> Yeah. >> The data block has an n_inp parameter. >> Correct. >> Which is how it determines which of the getters is x versus y. >> Correct. Which we actually looked at last time. Here. When we created our multi-image block. That was before you joined, Zach. Yes, useful reminder.

Okay. So, here we see a histogram of how often each number got used. We created like a little synthetic learner that doesn't really do anything, but we can pass callbacks to it, and there's a callback called collect data callback, which just collects the data that is passed to the learner, and so this is how we can then find out what data was passed to the learner, get a histogram of it, and we can see that, yeah, the numbers up towards 160 were received a lot more often when we trained this learner, which is what you would expect.
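
A minimal, self-contained sketch in the spirit of that docs example: synthetic integer items, with weight i given to item i, so the higher numbers should dominate the batches (the exact docs code differs a bit; this is just the idea):

```python
from fastai.data.all import *
from fastai.callback.data import *   # patches weighted_dataloaders onto Datasets

n = 160
dsets = Datasets(list(range(n)), tfms=[[tensor]])            # one manual pipeline: int -> tensor
dls = dsets.weighted_dataloaders(wgts=list(range(n)), bs=16) # weight i for item i

# Draw some training batches and check that large numbers dominate.
xs = torch.cat([dls.train.one_batch()[0] for _ in range(20)])
print(xs.float().mean())   # should be well above the unweighted mean of ~80
```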

This is the source of the weighted data loader class here, and as you can see, other than the boilerplate, it's one, two, three, four, five lines of code. And then the weighted data loader's method is two lines of code. So, there's actually a lot more lines of example than there is of actual code.

So, often it's easier just to read the source code, because, you know, thanks to the very layered approach of fastai, we can do so much stuff with so little code. And so, in this case, if we look through the code, we're passing in some weights, and basically the key thing here is that if you pass in no weights at all, then we just set it equal to the number one repeated n times, so everything's going to get a weight of one.

And then we divide the weights by the sum of the weights, so that the weights end up summing to one, which is what we want. And then if you're not shuffling, then there's no weighting to do, so we just pass back the indexes. And if we are shuffling, we grab a random choice of indexes based on the weights.
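
Stripped of the fastai plumbing, the sampling logic just described boils down to something like this (a paraphrase, not the actual source):

```python
import numpy as np

def weighted_idxs(n, wgts=None, shuffle=True):
    "Roughly what WeightedDL.get_idxs does: sample indexes in proportion to the weights."
    wgts = np.ones(n) if wgts is None else np.asarray(wgts, dtype=float)
    wgts = wgts / wgts.sum()              # normalise so the weights sum to one
    if not shuffle: return np.arange(n)   # no shuffling: plain sequential indexes
    return np.random.choice(n, n, p=wgts) # higher weight -> drawn more often (with replacement)

print(np.bincount(weighted_idxs(8, wgts=range(8)), minlength=8))
```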

Cool. All right. So, there's going to be one weight per row. Let's come back to that, because I want to see how our thing's gone. It looks like it's finished. Notice that the fav icon in Jupyter will change depending on whether something's running or not, so that's how you can quickly tell if something's finished.

0.216, 0.221. Okay, I mean, it's not a huge difference, but maybe it's a tiny bit better. I don't know. The key thing, though, is this lets us use our resources better, right? So we often will end up with a better answer, but you can train for a lot less time.

In fact, you can see that the error was at 0.216 back here, so we could probably have trained for a lot fewer epochs. So that's progressive resizing. Is there a way to look at that and go, "Oh, actually, I'd like to take the outputs from epoch 9," because it had a better error rate? That was the question we got earlier.

That's called early stopping, and the answer is no, you probably wouldn't want to do early stopping. But you can't go back to a previous epoch? There's no history? You can. You have to use the early stopping callback to do that. All right, cool. Okay, I'll look at that. Or there's other things you can use.

As I say, I don't think you should, but you can. If I go training, callbacks, tracker -- Okay, so the other part of that is, is it counterproductive or not? Yeah, it's counterproductive. It's not a cheat if it works, but not if it doesn't? It's probably not a good idea.

It probably will make it worse, yeah. Okay, great. So the other thing you can do is save model callback, which saves -- which is kind of like early stopping, but it doesn't stop. It saves the parameters of the best model during training, which is probably what you want instead of early stopping.
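
For reference, usage looks roughly like this; SaveModelCallback is also in fastai.callback.tracker, `learn` is assumed to be an existing Learner with error_rate as a metric, and the filename is just an example:

```python
from fastai.callback.tracker import SaveModelCallback

# Saves the weights whenever error_rate improves, and (by default) loads the
# best saved weights back into the learner at the end of training.
learn.fit_one_cycle(12, 0.01,
    cbs=SaveModelCallback(monitor='error_rate', fname='best-rice'))
```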

I don't think you should do that either for the same reason we discussed earlier. Why shouldn't you do this? It seems like you could just ignore it if you didn't want it. Or later on, like it might not hurt you? Well, so this actually automatically loads the best set of parameters at the end.

And you're just going to end up with this kind of model that just so happened to look a tiny bit better on the validation set at an earlier epoch. But at that earlier epoch, the learning rate hadn't yet stabilized, and it's very unlikely it really is better. So you've probably actually just picked something that's slightly worse and made your process slightly more complicated for no good reason.

Being better on an epoch there doesn't necessarily say anything about the final hidden test set. Yeah, we have a strong prior belief that it will improve each epoch unless you're overfitting. And if you're overfitting, then you shouldn't be doing early stopping, you should be doing more augmentation. It seems like a good opportunity for somebody to document the arguments, because I'm curious what at_end does.

Yes, that would be a great opportunity for somebody to document the arguments. And if somebody is interested in doing that, we have a really cool thing called docments, which I only invented after we created fastai. I should delete this because this is the old version; it's part of fastcore.

And you document each parameter by putting a comment after it. And you document the return value by putting a comment after it. And Zach actually started a project, after I created docments, to add docments comments to everything in fastai, which of course is not finished because fastai is pretty big.
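
A quick sketch of what docments-style comments look like, and how fastcore reads them back; the function itself is just a made-up example:

```python
from fastcore.docments import docments

def scale_lr(
    lr:float,         # base learning rate
    bs:int,           # batch size you are actually using
    base_bs:int=64,   # batch size the base learning rate was tuned for
)->float:             # learning rate scaled linearly with batch size
    "Made-up example: linear learning-rate scaling."
    return lr * bs / base_bs

docments(scale_lr)   # maps each parameter (and 'return') to its comment
```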

And so here's an example of something that doesn't yet have docments comments. So if somebody wants to go and add a comment to each of these things and put that into a PR, then that will end up in the documentation. Something we should do, Zach, is to actually include an example in the documentation of what it ends up looking like in nbdev, because I can see that's missing.

That might be a good idea. I can see if I can get on that tomorrow. Yeah. Sorry, Hamel. What were you saying? No, I just wanted to encourage everybody that writing the documentation is an excellent way to learn deeply how everything works. And what ends up happening is you write this documentation and somebody like Jeremy will review it carefully and let you know what you don't understand.

And that's how I learned a lot about the fastai library. So I highly recommend it, going and doing that. And here's what it ends up looking like. Right. So here's optimizer. And you can see it's got a little table underneath. And if we look at the source of optimizer, you'll see that each parameter has a comment next to it.

But it's automatically turned into this table. All right. Yeah. Docments are super cool. They are super cool. This sounds like a good place to wrap up. Anybody got any questions or comments or anything before we wrap up?

I have a question regarding progressive resizing. Yes. We didn't actually do an lr_find after each step; don't you think it would be helpful? lr_find, did you say? Yeah. Yeah. To be honest, I don't use lr_find much anymore nowadays, because, you know, at least for object recognition in computer vision, the optimal learning rate is pretty much always the same.

It's always around 0.008, 0.01. Yeah. There's no reason to believe that we have any need to change it just because we changed the resolution. So, yeah, I wouldn't bother; just leave it where it was. Jeremy, if your training and validation loss is still decreasing after 12 epochs, can you pick up and train for a little longer without restarting?

You can. The first thing I'll say is you shouldn't be looking at the validation loss to see if you're overfitting. You should be looking at the error rate. So, the validation loss can get worse whilst the error rate gets better, and that doesn't count as overfitting, because the thing you want to improve is the error rate.

That can happen if it gets overconfident, but it's still improving. Yeah, you can keep training for longer. If you're using fit one cycle or fine-tune, and fine-tune uses fit one cycle behind the scenes, then continuing to train further means your learning rate is going to go up and then down, and then up and then down again each time, which is not necessarily a bad thing. But if you basically want to keep training at that point, you would probably want to decrease the learning rate by maybe 4x or so. And in fact, I think after this, I'm going to rerun this whole notebook, but halve the learning rate each time, so I think that would be potentially a good idea.
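
So that "just keep training" option might look something like this, with each extra one-cycle run using a smaller peak learning rate; the halving is just the rule of thumb mentioned above, nothing precise:

```python
# Assumes `learn` has already been fine-tuned as in the notebook.
base_lr = 0.01
learn.fit_one_cycle(5, base_lr/2)   # another one-cycle run at half the learning rate
learn.fit_one_cycle(5, base_lr/4)   # and again, halving once more if it's still improving
```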

I have a question. I don't know if it's too late, but I think it might be useful to discuss: when you do the progressive resizing, what part of the model gets dropped? Like, you know, is there some part of the model that needs to be reinitialized for the new size?

No. Nothing needs to be reinitialized, no. I found this on the web. Sorry, I didn't watch. Who found what on the web? I thought you were talking to me, but you're talking to Siri? I'm offended. Siri, teach me deep learning. Yeah, ConvNeXt is what we call a resolution-independent architecture, which means it works for any input resolution. And, time permitting, in the next lesson we will see how convolutional neural networks actually work, but I guess a lot of you probably already know. So for those of you that do: if you think about it, it's basically going patch by patch and doing this kind of mini matrix multiply for each patch. So if you change the input resolution, it just has more patches to cover, but it doesn't change the parameters at all, so there's nothing to reinitialize.
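
A tiny PyTorch illustration of that resolution independence: the same convolutional weights process any input size, and adaptive pooling squeezes whatever spatial grid comes out down to a fixed size before the classifier:

```python
import torch
import torch.nn as nn

# Toy convnet: the conv layers don't care about input resolution, and
# AdaptiveAvgPool2d always produces a 1x1 grid for the linear head.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10))

for size in (128, 160, 192):
    x = torch.randn(2, 3, size, size)
    print(size, net(x).shape)   # same parameters, same output shape, at every resolution
```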

Does that make sense, Hamel? Yeah, that makes sense. I was just asking, like, in general. For the record, fair enough. Yeah. Just a question. Go ahead. I was just going to say, a quick note: is ResNet resolution-independent? Yep. Is it good? Yep. Yep. Typically, everything we use is, normally, but, like, have a look at that best fine-tuning models notebook, and you'll see that two of the best ones are called ViT and Swin, and also SwinV2.

None of those are resolution-independent, although there is a trick you can use to kind of make them resolution-independent, which we should try out in a future walkthrough. Is that fiddling with the head or something? Oh, there's a thing in timm. There's a thing you can pass to timm. I don't know if we can use it to support progressive resizing or not.

It'll be interesting to experiment with. It's basically changing the positional encodings. I have a question. Interesting. Yeah. After you've done your experiments, progressive resizing, and fine-tuning, how do you, in fastai, train with the whole training set? I never got around to doing that. Do you create a dummy version?

I almost never do. Like, instead, I do what we saw in the last walkthrough, which is I just train on a few different randomly selected training sets, because that way, you know, you get the benefit of ensembling. You're going to end up seeing all the images at least once anyway.

And you can also kind of see if something's messed up, because you've still got a validation set each time. So yeah, I used to do this thing where I would create a validation set with a single item in to get that last bit of juice, but I don't even do that anymore.

Okay. Thanks. No worries. All right, gang. Enjoy the rest of your day/evening. Nice to see you all. Bye. Bye. Bye. Thanks. Thank you.