Live coding 17
Chapters
0:01 Setting up Paperspace - Clone fastai/paperspace-setup
2:30 pipi fastai & pipi -U fastai
3:43 Installing universal-ctags: mambai universal-ctags
5:00 Next step: Adding normalization to TIMM models
6:06 Oh! First let's fix pre-run.sh
7:35 Normalization in vision_learner (with the same pretrained model statistics)
9:40 Adding TIMM models
13:10 model.default_cfg to get TIMM model statistics
16:00 Let's go to _add_norm()… adding _timm_norm()
20:30 Test and debugging
28:40 Doing some redesign
32:23 Applying redesign for TIMM
36:20 create_timm_model and TimmBody
38:12 Check default config from a TIMM model
39:05 Making create_unet_model work with TIMM
40:20 Basic idea of U-nets
41:25 Dynamic U-net
All right, so this is the repo, fastai/paperspace-setup. 00:00:48.000 |
And that's going to install a pre-run.sh script, which is going to set up all these things and all these things. 00:01:00.000 |
And it's going to install a .bash.local script, which will set up our path. 00:01:07.000 |
It's also going to install and set up things for installing software: pipi for pip install and mambai for mamba install. 00:02:20.000 |
All right, so in theory, if we look at our home directory... Oh, look at that. Well, this stuff is now symlinked to /storage. 00:02:30.000 |
So I should be able to get the latest version. 00:02:47.000 |
I wonder if I can add a -U to say upgrade. 00:02:57.000 |
Yes, I can. So that's how I get the latest version. 00:03:02.000 |
And so that should have installed it locally. 00:03:40.000 |
Look, that's a good start. Okay, next question. 00:03:50.000 |
For example, universal-ctags: mambai universal-ctags to install universal-ctags. 00:04:18.000 |
Okay, so you see, the nice thing about this is even all this persistent stuff we're installing, you know, all works on the free Paperspace as well. 00:04:42.000 |
And that is actually in our storage. Oh, so I think we've done it. 00:05:02.000 |
Okay. So, the next step is, I thought we might try to fix a... 00:05:12.000 |
I don't know if you'd call it fixing a bug; maybe we could generously call it adding an enhancement to fastai, which is to add normalization to timm models. 00:05:50.000 |
So slash notebooks is persistent on a particular machine. 00:05:55.000 |
And I think this will not work, because I'm using SSH. Oh, it's already there. 00:06:10.000 |
Oh, you know, so there's a bug in our script, 00:06:21.000 |
so let's fix that pre-run.sh. I did a pushd at the start. 00:06:58.000 |
That means. Okay, yes, we're actually in here. No worries. 00:07:13.000 |
And then I'll tell you about the bug we're fixing while we wait for it. 00:07:36.000 |
So normalization is where we subtract the means and divide by the standard deviation of each channel for vision. 00:07:50.000 |
And that's done by a transform called Normalize. 00:07:56.000 |
And we need to use the same standard deviation and mean that were used when the model was pretrained. 00:08:10.000 |
Because, you know, some people will normalize so that everything's between zero and one; some normalize so it's got a mean of zero and a standard deviation of one. 00:08:26.000 |
If you don't subtract and divide by the same things, your data doesn't mean the same thing as it did during pretraining. 00:08:40.000 |
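What this looks like in practice, as a minimal sketch: add the pretraining stats as a batch transform. Normalize and imagenet_stats are fastai's; the DataBlock details here are illustrative, not the session's code.

```python
from fastai.vision.all import *

dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=Resize(224),
    batch_tfms=Normalize.from_stats(*imagenet_stats),  # subtract mean, divide by std per channel
)
```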
And if it [the normalize argument] is true, then it will attempt to add the correct normalization. 00:08:49.000 |
So if it's not a pretrained model, it doesn't do anything, because it doesn't know what to normalize by. Otherwise, it's going to try and get the correct statistics from the model's metadata. 00:09:01.000 |
So the model's metadata is here: model_meta. 00:09:08.000 |
And it's just a list of models with metadata. 00:09:23.000 |
imagenet_stats. So imagenet_stats is the mean and standard deviation of ImageNet, which I can't quite remember where that comes from, but that's something we import from somewhere. 00:09:37.000 |
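For reference, roughly what this looks like; the values below are the well-known ImageNet stats, and the check on model_meta is illustrative.

```python
from fastai.vision.all import model_meta, imagenet_stats, resnet34

print(imagenet_stats)
# ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
print(model_meta[resnet34]['stats'] == imagenet_stats)  # True: torchvision models reuse ImageNet stats
```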
So none of these are timm models. And so that means currently timm models aren't normalized. 00:10:36.000 |
One of the things in timm I still haven't looked into; I actually haven't used this transforms factory. 00:10:50.000 |
Maybe in fastai 3, we should consider using more of this functionality from timm. 00:12:03.000 |
It's letting me start the machine. Here we go. 00:12:17.000 |
All right, so this happens in vision_learner. 00:12:28.000 |
And timm is optional. You don't have to use it. 00:12:36.000 |
But if you do, then we have a create_timm_model, which you don't normally call yourself. Normally you just call vision_learner and you pass in an architecture as a string, and if it's a string it will create a timm model for you. 00:13:07.000 |
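Usage sketch of that dispatch: a string routes through the timm path, a callable through the torchvision path (dls is assumed to exist).

```python
from fastai.vision.all import *

learn = vision_learner(dls, 'convnext_tiny', metrics=error_rate)  # string -> timm model
# learn = vision_learner(dls, resnet34, metrics=error_rate)       # callable -> torchvision model
```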
I don't know what kind it is; never tried that one. 00:13:16.000 |
So we can create a model using timm's create_model; we pass in a string. 00:13:23.000 |
And I have a feeling that's... yeah, that's got a config. 00:13:31.000 |
Yeah, see, and it's got a mean and a standard deviation. 00:13:40.000 |
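Checking a timm model's pretraining statistics looks roughly like this (output illustrative; this assumes default_cfg behaves as a dict, as in the timm version used in the session).

```python
import timm

m = timm.create_model('convnext_tiny', pretrained=True)
print(m.default_cfg['mean'], m.default_cfg['std'])
# (0.485, 0.456, 0.406) (0.229, 0.224, 0.225)
```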
So models = timm.list_models(), maybe just the pretrained ones. 00:14:42.000 |
Yeah, so you can see a lot of them use point five. 00:15:00.000 |
And I'm guessing they're the only two options. 00:15:19.000 |
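A quick sketch of surveying those configs, as done here; sampled to keep it fast, with the same dict assumption about default_cfg as above.

```python
import timm
from collections import Counter

means = Counter()
for name in timm.list_models(pretrained=True)[:50]:   # sample; the full list is large
    cfg = timm.create_model(name, pretrained=False).default_cfg
    means[tuple(cfg.get('mean', ()))] += 1
print(means.most_common())   # mostly ImageNet stats, plus (0.5, 0.5, 0.5)
```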
[Question] Jeremy, I just thought that usually, putting in the image, the mean should be zero and the standard deviation should be one? 00:15:29.000 |
I mean, not necessarily; sometimes people make the minimum zero and the maximum one. 00:15:40.000 |
But what we need to do is use the same stats that it was pretrained with. 00:15:45.000 |
Because we want our range to be the same as the range it was pretrained with; otherwise our, you know, data has a different meaning. 00:16:07.000 |
So here's _add_norm, and it's being passed a meta 00:16:30.000 |
this only works for non-timm. So how about we put this here: we'll create an else, or I guess really an elif. 00:16:46.000 |
Here, we'll have, for timm: if normalize, we could have a _timm_norm 00:17:41.000 |
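A sketch of what _timm_norm could look like, mirroring _add_norm but reading the stats from the timm default_cfg instead of model_meta; the exact fastai internals may differ.

```python
from fastai.vision.all import Normalize

def _timm_norm(dls, cfg, pretrained):
    if not pretrained: return                # no pretrained stats to match
    if 'mean' in cfg and 'std' in cfg:       # stats come from the timm default_cfg
        dls.add_tfms([Normalize.from_stats(cfg['mean'], cfg['std'])], 'after_batch')
```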
We don't need to pass in the architecture we can just pass in the model. 00:17:45.000 |
[Question] And to protect against, like, the future ability to pass in other types that are strings that aren't timm: do you think there's any benefit in having, like, a default normalization function that you could pass through, so you can actually do your own normalization? 00:18:04.000 |
No, because my answer to all of those questions is always, you ain't going to need it. 00:18:21.000 |
you know, dealing with things that may or may not happen in the future. 00:18:26.000 |
[Question] It'd be simpler just to create your own vision_learner, because it looks like there's not much going on there; you could duplicate it if you wanted to have support for a different model. 00:18:36.000 |
Yeah, yeah, exactly. I mean, you know, this is just a small little wrapper really. You can call create_timm_model or create_vision_model, you can call create_head. 00:19:17.000 |
So, it should be just those two things I guess 00:20:51.000 |
as a transform to each data loader in it. Okay. 00:22:02.000 |
Okay, so let's find sometimes it's just easiest to look at the code. 00:22:37.000 |
We're adding it... I see, we're adding it to the after_batch event. 00:22:45.000 |
after_batch event, here we are. I see, and there's our transforms. 00:22:55.000 |
That should change our data loader. Yep. And it's now got Normalize 00:23:01.000 |
using the ImageNet stats. And if we now try the string version... 00:23:30.000 |
Now what happened differently. Oh, I see. We need to recreate the data loaders for this test. 00:23:47.000 |
And that gives us okay that gives us an error. And that's because it says we're passing a sequential object. 00:23:53.000 |
Okay, that makes sense. Because create_timm_model... 00:24:11.000 |
it creates a sequential model, because it's got the head and the body in it. 00:24:38.000 |
Oh, look, here we use default_cfg to get stuff. 00:25:21.000 |
I guess, like, it would be nice to know how timm does this exactly. 00:26:18.000 |
Looking at the default config, we're going to be able to do a lot. 00:27:43.000 |
It's not surprising; it was originally built not expecting to be doing stuff with timm. 00:28:40.000 |
So let's, you know, think about doing some redesign, maybe. 00:28:53.000 |
And so the idea of the redesign I guess would be that this doesn't instantiate the model. 00:29:01.000 |
So we would remove that case that's now not going to work of course, so then we're creating 00:29:44.000 |
So we may as well just do that directly right. 00:31:20.000 |
We're now passing around models, not architectures. 00:32:01.000 |
So now we say model equals... pretrained..., passing in the model. 00:33:56.000 |
So maybe we should keep moving this back further and further. 00:34:47.000 |
Maybe we'll just call that the timm model. 00:35:37.000 |
So there's a lot of this; this gets a bit crazy. 00:35:40.000 |
There are a lot of keyword arguments when you create a model, and the ones we don't know about we pass on to timm. 00:36:34.000 |
And what we might do is, we'll say this is the result. 00:36:42.000 |
And we'll return the things, or even return those two things. 00:37:57.000 |
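A rough sketch of where this redesign heads: build the timm model once, wrap it as a headless body, add a fastai head, and return the model together with its default_cfg so the caller can set up normalization. Signatures here are assumptions, not fastai's exact API.

```python
import timm
from torch import nn
from fastai.vision.all import create_head

def create_timm_model_sketch(arch, n_out, pretrained=True, n_in=3, **kwargs):
    # headless, unpooled body; extra kwargs fall through to timm.create_model
    body = timm.create_model(arch, pretrained=pretrained, num_classes=0,
                             global_pool='', in_chans=n_in, **kwargs)
    head = create_head(body.num_features, n_out)        # fastai head (pooling included)
    return nn.Sequential(body, head), body.default_cfg  # model plus its config
```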
Yes, we do pass in an architecture after all. 00:38:40.000 |
Now, convnext_tiny, on the other hand, uses ImageNet stats. 00:39:07.000 |
So if somebody feels like an interesting and valuable problem to solve: 00:39:13.000 |
making create_unet_model work with timm would be super helpful. 00:39:33.000 |
Same as create_vision_model, which is to actually instantiate the model. 00:39:41.000 |
Is anybody potentially interested in having a go at doing U-Net models with timm? If so, did you want to talk about it? [Answer] I'd be interested. Okay. 00:39:56.000 |
All right, let's just get this working first. 00:40:11.000 |
A little bit. I'm training one at the moment. That's my maximum experience, and then I've been through some notebooks to walk through. 00:40:18.000 |
I wanted everything. Great. So, okay, so you know the basic idea of a U-Net is 00:40:38.000 |
a downsampling path, where the image is getting kind of effectively smaller and smaller as it goes through convolutions with strides. 00:40:45.000 |
And we end up with, you know, kind of a very small set of patches, and then rather than averaging those to get a vector and using those as our features for our head, 00:40:57.000 |
Instead we go through reverse convolutions, which are things which make it bigger and bigger. 00:41:02.000 |
And when we do that, we also don't just take the input from the previous layer of the upsampling, but also the input from the equivalently sized layer on the downsampling side. Before fastai, all U-Nets could only handle a fixed size. 00:41:22.000 |
So what Karim did was he created this thing called the dynamic U-Net, which would look to see how big each size was on the downward path and automatically create an appropriately sized thing on the upward path. 00:41:43.000 |
So fastai has been very aggressive in, like, using pretrained models everywhere, so something we added to this idea is that the downsampling path can have a pretrained model, which is not rocket science. 00:42:02.000 |
Obviously, it's like this one line of code. 00:42:21.000 |
[Question] So, to understand: at the moment I'm using, say, a ResNet 34. Does that mean the down part is a ResNet 34 backbone, and then there's a reverse ResNet 34 being automatically generated? 00:42:31.000 |
It's not a reverse. It's not a reverse ResNet 34. 00:42:38.000 |
So here's our dynamic U-Net. The upsampling path has a fixed architecture, 00:43:02.000 |
but it's not like, if you use as a downsampling path, you know, a ViT, the upsampling is going to be a reverse ViT. 00:43:11.000 |
[Question] Would there be an advantage in doing that, or is it just not really helpful? [Answer] I don't see why there would be. 00:43:17.000 |
I also don't see why there wouldn't be. Nobody's tried it, as far as I know. 00:43:21.000 |
I don't even know if there's such a thing as an upsampling transformer block. 00:43:34.000 |
The key thing is that in the downsampling path, what we do is we have the downsampling bit, which we call the encoder. 00:43:51.000 |
Now a dummy eval is basically to take, I can't remember, like either a zero-length batch or a one-length batch, like a very small batch, and pass it through at some image size. 00:44:06.000 |
we use I believe we use hooks, if I remember correctly. 00:44:19.000 |
What's happened to my screen? My screen's gone crazy. 00:44:34.000 |
Yes. OK. So we use fastai's hook_outputs function, which says: I want to use PyTorch hooks to grab the outputs of these layers. 00:45:00.000 |
So this is the indices of this is the key thing. 00:45:03.000 |
This is the indices of the layers where the size changes. 00:45:10.000 |
And so that's where you want the that's where you want the cross connection. 00:45:14.000 |
Right. Either just before that or just after that, you know. 00:45:18.000 |
So: get the indices where the size changes. 00:45:34.000 |
So we hook outputs. We do a dummy eval and we find the shape of each thing. 00:45:43.000 |
And so here you can see dummy eval is using just a single image. 00:45:49.000 |
And so, yeah, this just returns the shape of the output of every layer. 00:45:57.000 |
That's going to be in sizes. And so then this is just a very simple function which just goes through and finds where the size changes. 00:46:14.000 |
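A minimal sketch of the dummy-eval approach just described: fastai's model_sizes hooks every layer, passes a dummy image through, and records each output shape; then we find the indices where the spatial size changes. Here, encoder is assumed to be the downsampling model, and the index-finding line paraphrases the logic rather than quoting fastai's exact code.

```python
from fastai.vision.all import model_sizes

sizes = model_sizes(encoder, size=(224, 224))   # output shape of every hooked layer
szs = [s[-1] for s in sizes]                    # spatial size of each activation
sz_chg_idxs = [i for i in range(len(szs) - 1) if szs[i] != szs[i + 1]]  # where size changes
```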
So now that we know where the size changes, we know where we want our cross connections to be. 00:46:20.000 |
Now, for each of the cross connections, we need to store the output of the model at that point, because that's going to be an input to the upsampling block, 00:46:36.000 |
for each UnetBlock we create. So for each change-in-size index, for each upsampling block, you have to pass in that 00:46:53.000 |
This is the index where it happened. And so this will be the actual... So if we go to the UnetBlock, 00:47:00.000 |
it looks like it's... so it's the size of that list minus one. [Question] Is that how the UnetBlocks get created on the other side? 00:47:09.000 |
Which is... and so that's just the hook that was used. 00:47:17.000 |
That's the hook that was used on the downsampling side. 00:47:22.000 |
And from that, we can get the stored activations. 00:47:32.000 |
So this is the shape of those stored activations. 00:47:39.000 |
And this is a minor tweak, so let's just ignore this if-block for a moment. Basically, all we then do is take those activations through a batch norm, 00:47:48.000 |
concatenate them with the previous layer's upsampling, and chuck that through a ReLU. And then we do some convs. 00:48:02.000 |
And the convs aren't just convs. They're fastai convs (ConvLayer), which can include all kinds of things: batch norm, activation, whatever. 00:48:15.000 |
So it's some combination of batch norm, activation, convolution. 00:48:27.000 |
You can also do upsampling, so it can be a transposed conv; batch norm can go first or last, whatever. 00:48:33.000 |
So that's quite a, you know, very rich convolutional layer. 00:48:42.000 |
Okay, so then this if-part here is that it's possible that things didn't quite round off nicely, so the cross connection doesn't quite have the right size. 00:48:53.000 |
And if that happens, then we'll interpolate the cross connection to be the same shape as the upsampling connection. 00:49:03.000 |
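A condensed sketch of a UnetBlock forward pass, paraphrasing the steps just described; attribute names are approximations of fastai's, and this would live as a method on the block.

```python
import torch
import torch.nn.functional as F

def forward(self, up_in):
    s = self.hook.stored                      # activations stored on the way down
    up_out = self.shuf(up_in)                 # upsample the previous layer's output
    if s.shape[-2:] != up_out.shape[-2:]:     # sizes didn't round off nicely...
        up_out = F.interpolate(up_out, s.shape[-2:], mode='nearest')  # ...so match them
    cat_x = self.relu(torch.cat([up_out, self.bn(s)], dim=1))  # batchnorm, concat, relu
    return self.conv2(self.conv1(cat_x))      # then the rich fastai ConvLayers
```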
And again, I don't know if anybody else does this, but this is to try to make it so that the dynamic U-Net always just works. 00:49:15.000 |
So to make this work for timm, you know, does this encoder need to know about the spots? Oh, no, it detects the spots. 00:49:30.000 |
So honestly, this might almost just work. Like, I don't think it does; I think somebody tried it and it didn't, right? 00:49:39.000 |
But, yeah, to figure out what doesn't work, you know, you would need to change this line to say: oh, if it's a string, create_timm_model; otherwise, do this, you know. 00:49:57.000 |
And then, like, create_body would need to be create_timm_body if it's a string. So, like, at minimum, do the same stuff that create_vision_model does. 00:50:05.000 |
And then, yeah, see if this works, right? Now, I will say, if you do get it working, timm does have an API to actually tell you where the feature sizes change. 00:50:20.000 |
So, like, you could actually optimize out that dummy eval stuff, but I don't even know if I'd bother, because it makes the code more complex for no particular benefit. 00:50:29.000 |
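For reference, timm's feature-extraction API can report directly where sizes change, which is the alternative to the dummy eval being mentioned (output illustrative).

```python
import timm

fe = timm.create_model('resnet34', features_only=True, pretrained=True)
print(fe.feature_info.reduction())   # e.g. [2, 4, 8, 16, 32] -- downsampling factor per stage
print(fe.feature_info.channels())    # channels of each returned feature map
```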
Yeah, sure. So, look, I think if you, you know, commit this as a PR, I'll definitely be looking at it. [Question] I was actually going to try ConvNeXt in my U-Net, so I had no idea it wouldn't work, actually. 00:50:40.000 |
So I would have noticed that already, but I just haven't had time. So I'd love to, because, you know, with ResNet 34 I've got particular results and I'd like to see if we can push it with a different model. 00:50:49.000 |
Yeah, no, I mean I think there'd be a lot of benefit to that. So, all right. So now we should run the tests. 00:50:58.000 |
[Question] Just to know: would that all likely be in the same notebook that you're editing, the vision learner one? Is that where most of the source code is for unet_learner, or is it a different one? 00:51:10.000 |
I don't know; I was just, you know, jumping to whatever automatically in vim. I was using vim ctags to jump around, so I have no idea where I was. 00:51:37.000 |
So yeah, models.unet is where the dynamic U-Net lives. 00:51:47.000 |
[Question] Okay, is there anything unique about the fact that the timm model... there's sort of an option there to cut the tail and head off. Does that need to be done with the U-Net architecture? 00:52:05.000 |
Yeah, so yeah, you absolutely have to cut the head off, because it comes with a default classifier head. So you will need to, you know. 00:52:15.000 |
So, you know, once you get it working, you'll probably find you can factor out some duplicate code between the unet and the vision learner. 00:52:24.000 |
But yeah, you basically have to cut off the classifier head in the same way that create_timm_body does. 00:52:32.000 |
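Two ways to get a headless timm model, which is essentially what that body creation does (illustrative, not fastai's exact code):

```python
import timm

# 1. Ask timm not to build a classifier (or pooling) at all:
body = timm.create_model('convnext_tiny', pretrained=True,
                         num_classes=0, global_pool='')

# 2. Or remove the head from an existing model:
m = timm.create_model('convnext_tiny', pretrained=True)
m.reset_classifier(0, '')
```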
And I don't think you'll need to change any input processing as far as I know. 00:52:39.000 |
The... create_vision_model, you know, handles, like, you know, if you've only got one or two or four channel inputs and the model has a three-channel input, it handles that automatically. But timm actually... I think Ross and I independently 00:52:57.000 |
invented this, as far as I know. We both kind of automatically handle, like, copying weights if necessary, or deleting weights if necessary, or whatever. But yeah, so the same stuff in vision_learner should work there as well. 00:53:21.000 |
doesn't work because it's actually creating a model, which is curious. 00:55:04.000 |
I don't know if it's going to make much difference or not, you know, because we're pretty careful about fine-tuning the batch norm layers. It's actually interesting to see whether normalization matters as much as it used to. 00:55:29.000 |
[Question] Is it possible to create, like, a layer that learns the normalization sort of thing? [Answer] Yeah, I mean, that's basically what batch norm does, you know. 00:55:46.000 |
[Question] So I understand those weights in the batch norm layer are basically learning the aggregate of that batch that optimally gives the best activations for the next layer? [Answer] Yeah, exactly. Yeah, yeah, it's just, you know, multiply by something and add something. 00:56:03.000 |
So it's finding what's the best thing to multiply by and add by. 00:56:07.000 |
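Concretely, "multiply by something and add something" means a learned affine after normalizing; a minimal single-channel sketch, with gamma and beta standing in for the two learnable parameters:

```python
import torch

def batchnorm_sketch(x, gamma, beta, eps=1e-5):
    x_hat = (x - x.mean()) / (x.var(unbiased=False) + eps).sqrt()  # normalize
    return gamma * x_hat + beta   # learned multiply (gamma) and add (beta)
```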
So, let's take a look. So I mean, all right, so this got 47% error. 00:56:18.000 |
Yeah, so I mean, it's a bit disappointing after all that work that it doesn't actually... I mean, this is fascinating, like, yeah, when you fine-tune in the way we do. 00:56:49.000 |
[Question] Would it be fair to say that the one advantage would be, if you wanted to use pretrained models without fine-tuning, you definitely want the statistics in there, right? [Answer] Yes, absolutely. 00:57:00.000 |
I mean, I don't know if that's an actual thing that people do. But yes, if you did. 00:57:18.000 |
Yeah, it's funny, these things that, you know, we've been doing for years and I guess never questioned. 00:57:26.000 |
[Question] I have a question relating to that, because one of the things I wanted to do is get this U-Net into a mobile app, so using the latest TorchScript, and it works with the demo app, though some of it is broken from PyTorch. 00:57:38.000 |
But of course, in there you need to provide the averaging statistics for the app, since it's, like, inference mode. 00:57:44.000 |
So I wonder, I know that at the moment fastai's kind of idea is that you dump everything as, like, a pickle. It conceivably would be helpful if you could maybe extract those new fine-tuned statistics or something for your deployment in particular 00:58:01.000 |
environments. So how would I go about doing that? 00:58:05.000 |
I mean, they're just parameters in batch norm layers, you know; they're just parameters. So they'll be in the parameters attribute of the model. 00:58:18.000 |
But like they're not, they're not really parameters that make sense independently of all the other parameters at all. So I don't think you would treat them any differently. 00:58:27.000 |
[Question] If you use, say, ImageNet statistics when you're fine-tuning, and that's the result of your model, right, you're going to use that down the track as well? 00:58:36.000 |
Well, yes and no. Like, that's what you normalize with, but you've got batch norm layers which are then, obviously, dividing and subtracting themselves. 00:58:53.000 |
So yeah, I mean, those normalization stats aren't going to change, but there isn't really any reason to, you know... it would only be if you 00:59:11.000 |
So I'm going to have a look at this next one. So this is 27, then 18, 24. Yeah, this is actually kind of what I thought might happen: on a slightly better model, you know, we may be getting slightly better errors initially. 00:59:42.000 |
Yeah, I'd love people to try out fastai from master, because 00:59:51.000 |
tell me if any of your models look substantially better or, even more importantly, substantially worse. 01:00:39.000 |
All right, anybody have any questions before we wrap it up. 01:00:50.000 |
[Question] Is it just the initial... will it be a bit more or less than earlier? 01:00:56.000 |
Yeah, so like that, that, you know, well, 01:01:01.000 |
we have a random head. So at first it doesn't actually matter, right; it's random whether you normalize or not. 01:01:18.000 |
It's better or something. But, yeah, I don't know, like, 01:01:23.000 |
it would be interesting to see if anybody notices a difference. 01:01:28.000 |
I mean, it's just, this used to matter a lot, right, for a couple of reasons. One is that most people didn't fine-tune models; most people trained most models from scratch until, 01:01:47.000 |
right, so it was totally critical. And then even when batch norm came along, we didn't know how to fine-tune models with batch norm. 01:01:59.000 |
At that point, we didn't realize that you had to fine tune the batch norm layers as well. 01:02:05.000 |
So I remember emailing François, the creator of Keras, and I was saying to him, like: I'm trying to fine-tune your Keras model and it's, like, 01:02:17.000 |
bizarrely bad; like, why is that? He said, well, you're probably doing the wrong thing, here's the documentation, whatever. Like, no, I'm pretty sure I'm doing the right thing. And I 01:02:26.000 |
spent, like, three months trying to answer this question. Eventually I realized, it's like, holy shit, it's the batch norm layers. 01:02:33.000 |
I sent him an email and said: oh, we can't fine-tune Keras models like this; you actually have to fine-tune the batch norm layers. Which I don't think they changed for years. 01:02:47.000 |
Anyway, so those changes are why, I guess, this whole normalization thing is much less interesting than we thought, which is why we hadn't really noticed it wasn't working before. 01:03:11.000 |
Anybody else have any questions before we wrap up. 01:03:21.000 |
See you. Let's see... well, good luck with the U-Net.