Live coding 10
Chapters
0:00 Questions
6:00 Steps for Entering a Standard Image Recognition Competition on Kaggle
8:40 The best models for fine tuning image recognition
12:00 Thomas Capelle's script to run experiments
14:00 GitHub Gist
16:00 Weights and Biases API
17:00 Automating Gist generation
20:30 Summarising and ranking models for fine tuning
23:00 Scatter plot of performance by model family
25:40 Best models for images that don't look like ImageNet
33:00 Pretrained models - Model Zoo, Papers With Code, Hugging Face
37:30 Applying learning on Paddy notebook with small models
46:00 Applying learning on large models
47:00 Gradient accumulation to prevent out of memory
52:50 Majority vote
00:00:00.000 |
I've got a question. Yeah, it's to do with, is there a way that machine learning can actually 00:00:08.800 |
find the sort of conditional probabilistic segments that are say in sort of heterogeneous data? 00:00:18.240 |
I am having trouble parsing that question. Can you give like an example or something? Yeah, 00:00:25.120 |
okay. All right. Well, I'm wrangling with road surface friction, with road risk rather. And 00:00:33.600 |
quite immediately there's this set of stereotypes in road analysis. And we all know that there's 00:00:42.720 |
highways, freeways, urban arterials. And they actually go through a series of stages, 00:00:53.040 |
almost like states. And each of the states has got a sort of conditional probabilistic 00:01:00.480 |
relationship between the set of predictors and the actual response variable, 00:01:06.480 |
the crash response variable. Is there anything like that in deep learning? 00:01:14.800 |
So how is that different to a normal predictive model? Like, I mean, all predictive models are 00:01:22.480 |
conditional probabilities, right? What's the... Well, I mean, if you take something like XGBoost, 00:01:30.880 |
for example, and you want to predict the risk of a given road, so it'll give you a value. 00:01:37.600 |
But then you've got no idea as to what's happening inside of the model. And 00:01:42.640 |
we're really interested in that because once you find the distributions, 00:01:51.680 |
you can start to do some quality testing on whether they actually follow the domain or 00:01:57.600 |
whether your segmentation process that actually determines your predictions is good or not. 00:02:04.320 |
And so, in a way, rather than, say, predicting some sort of 00:02:14.880 |
crash rate or risk or whatever, I'm really looking for those probabilistic distributions 00:02:24.080 |
and learning beneath the surface. So all deep learning models will return a set of probabilities. 00:02:33.520 |
That's what their final layer returns. And then we decode them by taking the argmax 00:02:40.160 |
across them. But there's nothing to stop you using those probabilities directly. 00:02:44.800 |
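A minimal sketch of that point, assuming an already-trained fastai Learner called `learn`:

```python
# Sketch: a trained classifier's final layer gives per-class probabilities; argmax is
# just one way to decode them. `learn` is assumed to be an already-trained fastai Learner.
probs, targets = learn.get_preds()     # probs: one row per item, one column per class
hard_preds = probs.argmax(dim=1)       # the usual decoded prediction
# ...but nothing stops you working with `probs` directly (thresholds, calibration, averaging).
```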
But I'm probably misunderstanding your question. It's a little abstract for me to understand. 00:02:53.520 |
Like, I mean, I know there's lots of things you can do with 00:02:57.120 |
confidence intervals and whatnot, but it really depends a great deal on the 00:03:07.520 |
specific details of the application, what you're trying to do and how you're trying to do it. 00:03:12.320 |
Good question, Daniel. I'm just talking about probability of an incident or risk 00:03:20.560 |
related to the road surface. So you're going to need some sort of tabular data that has 00:03:27.920 |
the occurrences with each road surface that you're trying to. 00:03:34.320 |
And why wouldn't XGBoost give you that if you had a predictive model of incidents? 00:03:43.760 |
In my mind, one of the disadvantages of XGBoost is the fact that it only gives you a single set 00:03:53.920 |
of variable effects. Whereas in what we're dealing with, we've got some really high crash roads. 00:04:05.280 |
We've got a different conditional probability relationship between the predictors and the 00:04:11.600 |
response compared to, say, the average. XGBoost does an excellent job in making the predictions, 00:04:21.120 |
but you've got no idea as to the group of instances that they're actually making the 00:04:29.520 |
prediction on, or the actual variable effects. Okay, so I think I understand your question now, 00:04:38.160 |
and I think the answer is actually it does. And what I suggest you do, if you haven't already, 00:04:44.240 |
is read the chapter of the fastai book on tabular modeling, and it will cover something 00:04:51.440 |
very similar, which is random forests, which is another ensemble of decision trees, and it will 00:04:56.080 |
show you how to get exactly the kind of insights that I think you're looking for. And all of the 00:05:05.680 |
techniques there would work equally well for random forests, and they also work equally well 00:05:09.280 |
for deep learning. So maybe after you've done that, you can come back and let us know whether 00:05:13.200 |
that helped. Yeah, well, I've sort of played with random forests. It doesn't really 00:05:20.400 |
give you what I'm looking for. I strongly suggest you read the chapter before you say that. I will. 00:05:27.360 |
Because I'm pretty sure it will. And if it doesn't, that would be very interesting to me. 00:05:35.840 |
In fact, I mentioned to you last time, but I'm really looking forward to the tabular data. 00:05:42.480 |
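For reference, the kind of insights that chapter walks through (feature importance, partial dependence, per-row probabilities) can be sketched like this; the dataset, column names, and target below are purely hypothetical, and scikit-learn is used here only as an illustration:

```python
# Sketch of the interpretation tools mentioned above, using scikit-learn as an
# illustration. "roads.csv" and all column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

df = pd.read_csv("roads.csv")
X = df[["friction", "speed_limit", "curvature"]]   # hypothetical predictors
y = df["had_crash"]                                # hypothetical binary response

rf = RandomForestClassifier(n_estimators=200, oob_score=True).fit(X, y)

print(rf.feature_importances_)                                   # which predictors matter overall
PartialDependenceDisplay.from_estimator(rf, X, ["speed_limit"])  # how predicted risk varies with one predictor
probs = rf.predict_proba(X)[:, 1]                                # per-road probabilities, not just labels
```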
Cool. Great. I'll show you guys what I've been working on, which has been fun. 00:05:59.760 |
So the first thing I did, you know, after I got off our last call was I basically just 00:06:08.880 |
threw together the kind of like most obvious basic steps one would do for 00:06:23.280 |
a standard image recognition competition, just in order to show people that that can be quite good. 00:06:30.800 |
And it was actually a little embarrassing because I didn't mean to do this. When I submitted it, 00:06:36.800 |
it turned out I got first on the leaderboard. So now I feel like I'm going to have to 00:06:42.320 |
write down exactly what I did because, you know, during an active competition, everybody 00:06:51.440 |
needs to share what they're doing, if they share it with anybody, 00:06:55.360 |
publicly. So I thought I'd show you what I did here. But I think this is about to go up 00:07:01.920 |
quite a lot, because, you know, what we've got here are interesting images 00:07:15.760 |
for a couple of reasons. One is that they're kind of like things that you see in ImageNet, 00:07:21.200 |
like they're pictures of natural objects, they're photos. But I don't think ImageNet has any kind 00:07:29.600 |
of like categories about diseases, you know, they have categories about like, what's the main 00:07:36.480 |
object in this? So they might have a category about like, I don't know if they do like some 00:07:40.240 |
different kinds of grass, or even some different types of, you know, 00:07:48.080 |
fields or something, but I'm pretty sure they don't have anything about different kinds of 00:07:52.320 |
crop disease. So it's a bit different to ImageNet, which is what most of our pre-trained models are 00:07:58.880 |
trained on. But it's not that different. And it's also interesting because nearly all of the images 00:08:05.760 |
are the same shape and size. So we can kind of try to take advantage of that. 00:08:15.600 |
And, you know, so when we fine-tune a pre-trained model, 00:08:19.840 |
there's, so let me pull up this Kaggle notebook I just created. 00:08:42.320 |
Kind of look at what are the best vision models for fine-tuning. And so I kind of realized that 00:08:46.640 |
there are two key dimensions that really seem to impact how well a model can be fine-tuned, 00:08:52.320 |
you know, whether it works well or not, or how it's different. So one is what I just talked about, 00:08:57.280 |
which is how similar is your data set to the data set used for the pre-trained model. 00:09:05.920 |
If it's really similar, like pets to ImageNet, then like the critical factor is how well does 00:09:14.960 |
the fine-tuning of the model maintain the weights that are pre-trained, you know, because you're 00:09:20.720 |
probably not going to be changing very, very much. And you're probably going to be able to take 00:09:24.160 |
advantage of really big, accurate models because they've already learned to do almost the exact 00:09:29.600 |
thing you're trying to do. On the other hand, so that's the pets data set. On the other hand, 00:09:35.840 |
there's a data set called the Planet data set, which consists of satellite images. 00:09:46.080 |
And these are not really at all like anything that ImageNet ever saw, you know, they're taken 00:09:52.880 |
from above, they're taken from much further away, there's no single main object. So a lot of the 00:10:03.280 |
weights of a pre-trained model are going to be useless for fine-tuning this because they've 00:10:08.560 |
learned specific features like, you know, what does text look like, what do eyeballs look like, 00:10:14.800 |
what does fur look like, you know, which none of which are going to be very 00:10:18.880 |
useful. So that's the first dimension. The second dimension is just how big the data set is. 00:10:24.560 |
So on a big data set, you've got time, you've got epochs to 00:10:29.600 |
take advantage of having lots of parameters in the model, to learn to use them effectively. 00:10:40.400 |
And if you don't have much data, then you don't have much ability to do that. 00:10:47.360 |
So you might imagine that deep learning practitioners already know these 00:10:53.200 |
answers of, you know, what are the best models for fine-tuning. But in fact, 00:10:56.960 |
we don't, as far as I know, nobody's ever done an analysis before of which models 00:11:01.440 |
are the best for fine-tuning. So that's what I did over the weekend. 00:11:04.720 |
And not just over the weekend, but really over the last couple of weeks. 00:11:09.200 |
And I did this with Thomas Capelle, who works at Weights and Biases, another 00:11:15.840 |
fast.ai community member/alumnus. And so what we did was we tried fine-tuning lots of models 00:11:24.160 |
on two data sets, one which has over 10 times fewer images and where those images are not at all like 00:11:32.480 |
ImageNet, that being the Kaggle Planet sample, and one which is a lot like ImageNet and has a 00:11:39.280 |
lot more images, that being IIIT Pets. And I kind of figured like if we get some insights from those 00:11:46.400 |
two, perhaps they'll be something that we can leverage more generally. 00:11:49.360 |
So Thomas wrote this script, which is 86 lines, but really there's only like three or four lines 00:12:06.880 |
and they'll all be lines you recognize, right? The lines are untar_data, ImageDataLoaders.from_blah, 00:12:14.480 |
and then vision_learner(dls, model), etc. So there's the normal like three or four lines of 00:12:23.920 |
code we see over and over again. And then, you know, the rest of it basically lets you 00:12:29.440 |
pass into the script different choices about batch size, epochs, and so forth. 00:12:35.680 |
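Thomas's actual script isn't reproduced here, but a minimal sketch of that shape of script, with illustrative argument names and defaults, might look like this:

```python
# Sketch of a small fine-tuning script of the kind described above. The CLI
# arguments and defaults are illustrative; the real script exposes more knobs
# (resize method, pooling type, ...) and logs everything to Weights and Biases.
import argparse
from fastai.vision.all import *

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--model", default="convnext_tiny")
    p.add_argument("--epochs", type=int, default=5)
    p.add_argument("--bs", type=int, default=64)
    p.add_argument("--lr", type=float, default=0.008)
    args = p.parse_args()

    path = untar_data(URLs.PETS)/"images"
    dls = ImageDataLoaders.from_name_re(
        path, get_image_files(path), pat=r"^(.*)_\d+.jpg$",   # pet breed from the file name
        valid_pct=0.2, item_tfms=Resize(224), bs=args.bs)
    learn = vision_learner(dls, args.model, metrics=error_rate)
    learn.fine_tune(args.epochs, args.lr)

if __name__ == "__main__":
    main()
```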
And that's about it. So this is like how simple the script was that we used. And then 00:12:47.280 |
partly because Thomas works at Weights and Biases, and partly because Weights and Biases is pretty 00:12:54.240 |
cool. We used Weights and Biases then to feed in different values for each of those parameters. 00:13:03.360 |
So this is a YAML file that Weights and Biases uses where you can say, okay, try each of these 00:13:10.320 |
different learning rates, try each of these different models, try, let's see if I can find 00:13:16.480 |
another one, try each of these different resize methods, each of these different pooling methods, 00:13:23.520 |
this distribution of learning rates, you know, whatever, and it goes away and tries them. 00:13:30.240 |
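The session drove this with a YAML file and the wandb CLI; roughly the same configuration can be expressed through the Python API like this (the parameter names and values are illustrative, not the actual grid, and `train_one_run` is a hypothetical wrapper around the training script):

```python
# Sketch: a Weights and Biases sweep definition, analogous to the YAML file described
# above. Parameter names/values are illustrative; train_one_run is a hypothetical
# function wrapping the training script.
import wandb

sweep_config = {
    "method": "grid",                                 # or "random" / "bayes"
    "metric": {"name": "error_rate", "goal": "minimize"},
    "parameters": {
        "model_name":    {"values": ["convnext_tiny", "vit_small_patch16_224"]},
        "learning_rate": {"values": [0.002, 0.008]},
        "resize_method": {"values": ["squish", "crop"]},
        "pool":          {"values": ["concat", "avg"]},
    },
}

def train_one_run():
    wandb.init()
    cfg = wandb.config                 # the sampled hyperparameters for this run
    # ... call the fine-tuning script/function with cfg here ...
    wandb.log({"error_rate": 0.05})    # placeholder for the real metric

sweep_id = wandb.sweep(sweep_config, project="fine-tune-benchmark")  # project name is a placeholder
wandb.agent(sweep_id, function=train_one_run)
```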
And then you can use their Web GUI to look at like the training results. So then you basically say, 00:13:37.520 |
okay, start training and it trains each of these models on each of these datasets with each of these 00:13:42.400 |
pool values and each of these resize methods and a few different selections from this distribution 00:13:46.320 |
of learning rates and creates a Web GUI that you can dive into. I personally hate Web GUIs. I would 00:13:54.240 |
much rather use Python, but they also thankfully have an API. So yeah, so once we ran that script 00:14:01.360 |
for a few hours, I then checked the results into a GIST. So a GIST is just a place to check 00:14:19.360 |
text files basically, if you haven't used it before. So I checked my CSV file in here. 00:14:31.920 |
As you can see, it kind of displays it in a nice way, or you can just click on 00:14:36.640 |
to see the raw data. So I find that quite a nice place just to check things which I'm just going to 00:14:44.240 |
share publicly. And so then I can share the URL to the gist. 00:15:38.000 |
So I just kind of like everything to be automated so I can always easily redo it because I always 00:15:51.040 |
assume my first effort is going to be crap, and it always is. And normally my second, 00:15:54.640 |
third efforts are crap as well. So here's my little notebook I put together. 00:16:01.600 |
So basically, each time you do one of these sweeps on weights and biases, it generates a new ID. And 00:16:09.760 |
so we ended up kind of doing five different ones as we realized we were able to add different models 00:16:14.080 |
and change things a little bit. And so they have this API that you can use. And so you basically 00:16:21.600 |
can go through and say, go through each of the sweep IDs and ask the API for that sweep and grab 00:16:28.560 |
the runs from it. And then for each one create a dictionary containing a summary and the model name. 00:16:34.800 |
So the details don't matter too much, but you kind of get the idea, hopefully, and then turn that into 00:16:38.640 |
a data frame. And so I kind of end up with this data frame that contains all the different 00:16:46.800 |
configuration parameters along with their loss and their speed, their accuracy, GPU, 00:17:00.560 |
maximum memory usage, and so forth. So that's basically what I wanted to chuck into a GIST. 00:17:08.160 |
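A rough sketch of that loop over sweeps (the entity, project, sweep IDs and key names are placeholders):

```python
# Sketch: pulling run results out of the Weights and Biases API into a DataFrame,
# along the lines described above. Entity/project/sweep ids and key names are placeholders.
import pandas as pd
import wandb

api = wandb.Api()
sweep_ids = ["abc123", "def456"]           # the five sweep ids would go here

rows = []
for sid in sweep_ids:
    sweep = api.sweep(f"my-entity/my-project/{sid}")
    for run in sweep.runs:
        rows.append({
            "dataset":    run.config["dataset"],      # key names are illustrative
            "model_name": run.config["model_name"],
            "error_rate": run.summary["error_rate"],
            "fit_time":   run.summary["fit_time"],
            "GPU_mem":    run.summary["GPU_mem"],
        })

df = pd.DataFrame(rows)
```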
And so specifically, I really wanted this subset of the columns. So these are the columns I wanted. 00:17:12.560 |
So I can grab those columns and put them into a CSV. Now, one thing you might not realize is 00:17:19.120 |
I would say for most Python libraries, or at least most well-written ones, 00:17:25.520 |
anyway, anywhere you can put a file name. And normally when you say to_csv, you put here a file name or a path. 00:17:30.240 |
You could instead put something called a string IO object, which is something that behaves exactly 00:17:35.440 |
like a file, but it actually just stores it into a string. Because I don't want this stored into 00:17:44.080 |
a file, I've put it into a string. So if you then call .getvalue(), I actually get the string. 00:17:49.680 |
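In code, that trick is just:

```python
# Writing a DataFrame "to CSV" into an in-memory string rather than a file on disk.
import io
import pandas as pd

df = pd.DataFrame({"model_name": ["convnext_tiny"], "error_rate": [0.044]})  # toy data

buf = io.StringIO()            # behaves like a file, but keeps everything in memory
df.to_csv(buf, index=False)
csv_text = buf.getvalue()      # the CSV contents as a plain Python string
```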
And so even things like creating the GIST, I want to do that automatically. So there's a 00:17:55.120 |
library I'm very fond of (I'm very biased, because I made it) called ghapi, which is an API for GitHub, 00:18:04.800 |
where we can do things like, say, create GIST. And you give it a description. And here's the text, 00:18:10.640 |
which is the contents of the CSV. And the file name, make it public. And then you can get the HTML, 00:18:18.000 |
URL of the GIST. So that's how I used, in this case, a notebook as my kind of, you know, 00:18:26.400 |
interactive REPL, read-eval-print loop, for manipulating this data set, putting it together, 00:18:34.720 |
and then uploading it to GitHub. Jeremy, I had a doubt on this data frame. Here you have 00:18:41.600 |
it, like, in your gist, and it had in the dataset column entries for Planet 00:18:48.240 |
and that other dataset as well, the Pets dataset. So how did you populate it? So what's your question? 00:18:55.680 |
How did I populate this data set? Yeah. Just here. So I passed it a list of dictionaries. 00:19:05.760 |
The list of dictionaries I created using a list comprehension. Okay. Containing a bunch of 00:19:12.240 |
dictionaries. Okay. Got it. And so that's going to make each key. So that means all the dictionaries 00:19:21.200 |
should have, you know, roughly the same keys. Anyone sort of missing are going to end up being 00:19:26.160 |
NA. And then I just fiddled around with it slightly. So, for example, make sure everything had an error 00:19:33.680 |
rate that was equal to one minus the accuracy. On the planet data set, it's not called accuracy. 00:19:38.880 |
So I copied accuracy_multi into accuracy. Yeah, nothing very exciting. Thank you. 00:19:46.960 |
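The gist-creation step described a moment ago looks roughly like this with ghapi (going from the description in the session; check the ghapi docs for the exact call, and a GitHub token is assumed to be available in the environment):

```python
# Sketch: creating a public gist from the CSV text via ghapi, roughly as described
# above. A GITHUB_TOKEN environment variable is assumed for authentication; see the
# ghapi docs for the exact signature.
from ghapi.all import GhApi

api = GhApi()                                              # picks up GITHUB_TOKEN from the environment
gist = api.create_gist("fine-tuning benchmark results",    # description
                       csv_text,                           # contents (e.g. from the StringIO above)
                       filename="results.csv",
                       public=True)
print(gist.html_url)                                       # the URL you can then share
```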
Jeremy, what's the actual goal of this? Let me show you. So what we've now got 00:20:06.800 |
is also very helpful. Okay. A CSV, which I can then use pandas' pivot table functionality 00:20:19.520 |
to group by the data set, the model family and name, and calculate the mean of error rate, 00:20:26.560 |
fit time and GPU memory. And I can then take the pets subset of that 00:20:36.560 |
sort by score, where score represents a combination of error rate and speed, and take the top 15. 00:20:45.040 |
And this now shows me the top 15 best models for fine-tuning on pets. 00:20:55.840 |
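That summarisation step looks roughly like this (column names approximate those in the gist, and the score formula is one reading of the "error rate times fit time plus 80" weighting mentioned just below):

```python
# Sketch: summarising the sweep results with a pandas pivot table and ranking by a
# combined accuracy/speed score. Column names approximate those in the gist; the
# "+ 80" constant is the arbitrary weighting discussed below.
import pandas as pd

df = pd.read_csv("results.csv")     # the CSV checked into the gist

pt = df.pivot_table(values=["error_rate", "fit_time", "GPU_mem"],
                    index=["dataset", "family", "model_name"],
                    aggfunc="mean").reset_index()

pt["score"] = pt.error_rate * (pt.fit_time + 80)

top15 = (pt[pt.dataset == "pets"]
         .sort_values("score")
         .head(15)
         .reset_index(drop=True))
```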
And this is, this is gold, in my opinion. I don't think anybody's ever done anything 00:21:00.080 |
like this before. There's never been a list of like, here are the best models for fine-tuning. 00:21:04.960 |
Sorry, I have a question. So you fine-tuned different models with pets and then collected 00:21:16.000 |
this information. Is that correct? That's correct. And then based on the information that you collected 00:21:21.920 |
from the fine-tuning of five or whatever number of iterations. We did three runs for each model. Yes. 00:21:28.800 |
And then you collected this information to find out which one is the best-behaved model for this 00:21:37.280 |
specific case. Correct, correct, correct, correct. Exactly. And best is going to involve two things. 00:21:43.680 |
It's going to be which ones have the lowest error rate and which ones are the fastest. 00:21:47.440 |
Now, I created this kind of arbitrary scoring function where I multiplied the error rate 00:21:53.200 |
times fit time plus 80. Just because I felt like that particular value of that constant gave me an 00:22:00.800 |
ordering that I was reasonably comfortable with. But you can kind of look through here and see like, 00:22:05.440 |
okay, well, ViT Base has a much better error rate than ConvNeXt Tiny. But it's also much slower. 00:22:15.360 |
Like, you can decide for your needs where you want to trade off. So that's what I kind of, 00:22:22.160 |
the first thing I did was to create this kind of top 15. And it's interesting looking at the family, 00:22:27.120 |
right? The family is like each of these different architectures, you know, is kind of from, you know, 00:22:32.720 |
from, you know, different sizes of a smaller subset of families, right? So there's ConvNeXt Tiny, 00:22:39.280 |
ConvNeXt Base, ConvNeXt Tiny in-22k, and so forth. So you can kind of get a sense of like, 00:22:46.080 |
if you want to learn more about architectures, which ones seem most interesting and, you know, 00:22:49.920 |
for fine tuning on pets, it looks like ConvNeXt, ViT, Swin, and ResNet are the main ones. 00:22:58.800 |
So that, you know, the first thing I did, the second thing I then did was to 00:23:04.480 |
take those most interesting families, actually also added this one called ResNetX and created 00:23:14.240 |
a scatterplot of them, colored by family. And so you can kind of see, like, for example, ConvNeXt, 00:23:23.680 |
which I'm rather fond of, is this kind of blue line, these blue ones, right? And so you can see 00:23:33.680 |
that the very best error rate actually was a ConvNeXt. So they're pretty good. You can see this one 00:23:44.880 |
here, which is ResNetX, seems to be, had some pretty nice values. They're like super fast, 00:23:56.800 |
seems like these tiny Swins seem to be pretty good. So it kind of gives you a sense of like, 00:24:02.480 |
you know, depending on how much time you've got to run or how accurate you want to be, 00:24:05.600 |
what families are likely to be most useful. And then the last thing I did for pets was I 00:24:15.600 |
grabbed a subset of basically the ones which are in the top half, basically smaller than the median 00:24:24.000 |
and faster than the median, because these are the ones I generally care about most of the time, 00:24:27.840 |
because most of the time I'm going to be training quick iterations. And then I just ordered those 00:24:34.320 |
by error rate. And so ConvNeXt Tiny has got the best error rate of those which are in the upper 00:24:42.080 |
half of both speed and accuracy. >> What's GPU memory in this context? 00:24:51.520 |
>> That's the maximum amount of GPU memory that was used. I can't remember what 00:24:57.360 |
the units of measure are, but they don't matter too much because it'll be different 00:25:04.800 |
for your dataset; what matters is the relative usage. And so if you want something, 00:25:13.280 |
you know, if you try to use this and it's actually uses too much GPU memory, 00:25:18.960 |
you could try ResNet 50D, for example, or, you know, it's interesting that like ResNet 26 00:25:28.240 |
is really good for memory and speed. Or if you want something really lightweight on memory, 00:25:36.560 |
RegNetY-004. But the error rates are getting much worse once you get out to here, as you can see. 00:25:44.160 |
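A sketch of the kind of scatter plot being described (Plotly-style; the family names and styling are illustrative, and `pt` is the summary table sketched earlier):

```python
# Sketch: error rate vs fit time, coloured by architecture family, roughly as in the
# charts being described. `pt` is the summary table from the earlier sketch; the
# family labels are illustrative.
import plotly.express as px

subset = pt[pt.family.isin(["convnext", "vit", "swin", "resnet"])]
fig = px.scatter(subset, x="fit_time", y="error_rate", color="family",
                 hover_name="model_name", log_x=True)
fig.show()
```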
So then I looked at Planet. And so as I said, Planet's kind of as different a dataset 00:25:51.920 |
as you're going to get in one sense, or it's very different. And so not surprisingly, 00:26:00.400 |
its top 15 is also very different. And interestingly, all of the top six are from the same family. 00:26:09.120 |
So this ViT family, these are a kind of model called transformers models. And what this is 00:26:15.760 |
basically showing is that these models are particularly good at rapidly identifying 00:26:23.360 |
features of data types it hasn't seen before. So, you know, if you were doing something like 00:26:28.880 |
medical imaging or satellite imagery or something like that, these would probably be a good thing 00:26:33.920 |
to try. And Swin, by the way, is kind of another transformer-based model, which, as you can see, 00:26:41.680 |
is actually the most accurate of all, but it's also the smallest. This is Swin V2. 00:26:48.160 |
So I thought that was pretty interesting. And, you know, these ViT models, there are ones with 00:26:59.360 |
pretty good error rates that also have very little memory use and also run very quickly. 00:27:03.760 |
So I did the same thing for Planet. And so perhaps not surprisingly, but interestingly for Planet, 00:27:12.960 |
these lines don't necessarily go down, which is to say that the really big models, 00:27:19.440 |
the big slow models don't necessarily have better error rates. And that makes sense, right? Because 00:27:26.000 |
if they've got heaps of parameters, but they're trying to learn something they've never seen 00:27:29.680 |
before on very little data, it's unlikely we're going to be able to take advantage of those 00:27:34.160 |
parameters. So when you're doing stuff that doesn't really look much like ImageNet, 00:27:41.760 |
you might want to be down more towards this end. So here's the VIT, for example. 00:27:52.880 |
And here's that really good Swin model. And there's ConvNeXt Tiny. So then we can do the 00:28:01.600 |
same thing again of like, okay, let's take the top half, both in terms of speed and memory use. 00:28:06.640 |
Yeah, ConvNeXt Tiny still looks good. These ViT models are 224. Yeah, this is because you can 00:28:16.400 |
only run these models on images of size 224 by 224. You can't use different sizes, 00:28:24.160 |
whereas the ConvNeXt models, you can use any size. So it's also interesting to see the 00:28:31.360 |
classic ResNet still. Again, they do pretty well. Yeah, so I'm pretty excited about this. 00:28:41.760 |
It feels like exactly what we need to kick us on this Paddy Doctor competition, or indeed 00:28:52.240 |
any kind of computer vision classification task needs this. And I ran this sweep on 00:29:08.240 |
three consumer RTX GPUs in 12 hours or something. Like this is not big institutional resources 00:29:19.120 |
required. And one of the reasons why is because I didn't try every possible level of everything, 00:29:30.160 |
right? I tried a couple of, you know, so Thomas did a kind of a quick learning rate sweep to kind 00:29:40.560 |
of get a sense of the broad range of learning rates that seemed pretty good. And then we just 00:29:44.000 |
tried a couple of learning rates and a couple of the best resize methods and a couple of the best 00:29:48.640 |
pooling types across a few broadly different kinds of models across the two different datasets 00:29:58.880 |
to kind of see if there was any common features. And we found in every single case the same learning 00:30:04.160 |
rate, the same resize method and the same pooling type was the best. So we didn't need to try every 00:30:09.120 |
possible combination of everything, you know. And this is where like a lot of the stuff you see from 00:30:15.040 |
like Google and stuff, they tend to do hundreds of thousands of experiments, because I guess they 00:30:21.280 |
have no need to do things efficiently, right? Yeah, but you don't have to do it the Google way. You 00:30:30.320 |
can do it the fast.ai way. Quick question, Jeremy. Which cards did you use? And another question 00:30:43.520 |
is, which cards did you say? Yeah, the GPU cards. Oh, RTX 3090. Oh, okay. So were they all three 00:30:52.960 |
different? They're all RTX 3090s. Okay. And you reset the index after the query? Why? Oh, just 00:31:04.080 |
because otherwise, it shows the numeric ID here will be the numeric ID from the original dataset. 00:31:11.360 |
And I wanted to be able to quickly kind of say, what's number six? What's number 10? What's number 00:31:14.560 |
three? That's all. So visually. Yeah. Okay. Jeremy, getting back to the Earth 00:31:22.720 |
satellite images, when you say, you know, like the classification, what is it trying to classify? 00:31:32.640 |
We have some examples. Basically, they try to classify for each area of the satellite imagery. 00:31:59.680 |
What's it a picture of? Is it forest or farmland or town or whatever? 00:32:06.720 |
And what weather conditions were observed, if I remember correctly. 00:32:10.880 |
Question in this image space is, is it just these two major datasets? Or how do you find other 00:32:23.200 |
models that are trained on, besides the Planet dataset and ImageNet? 00:32:27.440 |
Oh, you mean beside planet and pets? Sorry. Yep. That's the answer. What was your question? How 00:32:34.400 |
do you do what with them? How do you find other pre-trained models that have been trained 00:32:40.800 |
on different data sets? These all use pre-trained models, pre-trained on ImageNet. These are only 00:32:46.560 |
using pre-trained models, pre-trained on ImageNet. So how do you find pre-trained models, 00:32:52.320 |
pre-trained on other things? Mainly, you don't. There aren't many. But, you know, just Google. 00:33:00.160 |
Depends what you're interested in. And academic papers. 00:33:07.040 |
There is, I don't know how it's doing these days, but there is a Model Zoo. 00:33:20.720 |
Which I've never had much success with, to be honest. 00:33:25.600 |
So these are a range of pre-trained models that you can download. 00:33:35.920 |
Yeah. But as I say, I haven't found it particularly successful, to be honest. 00:33:41.520 |
You could also try Papers with Code. 00:33:45.440 |
And I think these, yeah, they have a link to the paper and the code. That doesn't necessarily mean 00:33:59.760 |
they've got a pre-trained model. And then you can just click on the code and see. 00:34:08.480 |
And of course, for NLP models, there's the Hugging Face Model Hub, which we've seen before. 00:34:22.160 |
And that is an easy answer for NLP. Lots of different pre-trained models are on that hub. 00:34:29.200 |
Jeremy, since you touch on academic papers and papers with code, 00:34:36.560 |
first question, will this comparison, do you or Thomas intend to publish it? 00:34:42.880 |
If not, if you were to do that, what would you go for, actually? What kind of journal would you look at? 00:34:52.720 |
So I'm not a good person to ask that question because I very rarely publish anything. 00:34:59.120 |
Which is partly a philosophical thing. I find academia overly exclusive and I don't 00:35:08.560 |
love PDFs as a publication form. And I don't love the writing style, which is kind of required if 00:35:15.280 |
you're going to get published as being rather difficult to follow. I have published a couple 00:35:26.000 |
of papers, but like only really one significant deep learning one. And that was because 00:35:32.720 |
a guy named Sebastian Ruder was doing his PhD at the time. And he said it'd be really helpful to 00:35:39.600 |
him if we could co-publish something and that he would kind of take the lead on writing the paper. 00:35:45.920 |
And so that was good because I'm always very happy to help students. And 00:35:52.080 |
he did a good job and he was a terrific researcher to work with. The other time I've written a paper, 00:35:59.680 |
the main time was when I wanted to get that message out about masks. And I felt like it's 00:36:05.360 |
probably not going to be taken seriously unless it's in an exclusive academic paper because 00:36:09.520 |
medical people are very into exclusive things. So I don't know. I'd say this kind of thing, 00:36:19.120 |
I suspect would be quite hard to publish because most deep learning academic venues are very 00:36:26.320 |
focused on things with kind of reasonably strong theoretical pieces. And this kind of 00:36:33.840 |
field of like trying things and seeing what works, being experiment-based, is certainly a 00:36:46.160 |
very important part of science in other areas. But in the deep learning world, 00:36:49.760 |
it hasn't really yet been recognized as a valid source of research, as far as I can tell. 00:36:55.840 |
I concur with that in other domains and feel the same quandary, to be honest. 00:37:03.920 |
Hydrology, but more the computational science part of it. 00:37:25.760 |
I mean, this is kind of a bit at the same time, but I went back to Paddy 00:37:35.360 |
and I wanted to try out a few of these interesting looking models reasonably quickly. 00:37:46.080 |
So what I did was I kind of took our standard, well, in this case, three lines of code because I've 00:38:01.280 |
already untarred it earlier, took our three lines of code. So I could basically say train and pass 00:38:09.520 |
in an architecture and pass in some per item pre-processing, in this case resizing everything 00:38:19.280 |
to the same square using Squish and some per batch pre-processing, which in this case is the standard 00:38:25.280 |
fast AI data augmentation transforms targeting a final size of 224, which is what most models tend 00:38:32.560 |
to be trained at. And so then train a model using those parameters. And then finally, it would use 00:38:42.720 |
test time augmentation. So test time augmentation, which I think we briefly mentioned last time, 00:38:49.920 |
is where, in this case, on the validation set, I basically run the fine-tuned model four times 00:39:01.920 |
using random data augmentations each time. And then I run it one more time with no data 00:39:09.360 |
augmentations at all and take an average of all of those five predictions basically. 00:39:13.760 |
And that gives me some predictions. And then I take an error rate for TTA for the test time 00:39:20.560 |
augmentation. So that basically spits out a number, which is an error rate for Paddy. 00:39:28.000 |
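A sketch of what that train-and-evaluate helper might look like (the Paddy path, the folder-per-label layout, the learning rate and epoch count are assumptions; the overall shape of dataloaders, fine_tune, then TTA follows the description):

```python
# Sketch of the train-then-TTA helper described above. The path, folder-per-label
# layout, learning rate and epoch count are assumptions; the general shape
# (dataloaders -> fine_tune -> TTA -> error rate) follows the description.
from fastai.vision.all import *

trn_path = Path("paddy-disease-classification/train_images")   # hypothetical local path

def train(arch, item, batch, epochs=12):
    dls = ImageDataLoaders.from_folder(
        trn_path, valid_pct=0.2, seed=42,        # fixed seed -> same validation set every run
        item_tfms=item, batch_tfms=batch)
    learn = vision_learner(dls, arch, metrics=error_rate).to_fp16()
    learn.fine_tune(epochs, 0.01)
    # test-time augmentation: average several augmented predictions plus a plain one
    tta_preds, targs = learn.tta(dl=dls.valid)
    print(error_rate(tta_preds, targs))
    return learn

# e.g. squish to a square, then the standard fastai augmentations at 224 pixels
learn = train("convnext_small_in22k",
              item=Resize(480, method="squish"),
              batch=aug_transforms(size=224, min_scale=0.75))
```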
And I use a fixed random seed when picking out my validation set. So each time I run this, 00:39:37.200 |
it's going to be with the same validation set. And so I can compare. So I've got a few different 00:39:42.320 |
ConvNeXt Small models I've run. First of all, by squishing when I resize and then by cropping 00:39:51.680 |
when I resize. So that was 235. This is also 235. And then instead of resizing to a square, 00:40:04.160 |
I resize to a rectangle. In theory, this wouldn't have been necessary. I thought they were all 480 00:40:13.120 |
by 640. But when I ran this, I got an error. And then I looked back at the results of that 00:40:19.680 |
parallel image sizing thing we ran. And I realized there was actually three or four images that were 00:40:24.880 |
the opposite aspect ratio. So that's why. So the vast majority of the images, 00:40:32.240 |
this resizing does nothing at all. But it's three or four that are the opposite aspect ratio. 00:40:37.440 |
And then for the augmentation, yeah, pick a size based on 224 00:40:45.680 |
of a similar aspect ratio. But what I'm actually aiming for here is something that is a 00:40:54.800 |
multiple of 32 on both edges. And the reason for that we'll kind of get into later when we learn 00:41:01.840 |
about how convolutional networks really work. But it basically turns out that the kind of the 00:41:07.200 |
final patch size in a convnet is 32 by 32 pixels. So you generally want both of your sides. Normally 00:41:14.560 |
you want them to be multiples of 32. So this one got a pretty similar result again, 240. And then 00:41:23.360 |
I wasn't sure about my contention that they need to be multiples of 32. I thought maybe it's better 00:41:28.320 |
if there's like a really crisp resizing by using an exact multiple. So I tried that as well. 00:41:35.920 |
And that, as I suspected, was a bit worse. Oh, what's this? I've got some which, 00:41:53.520 |
which ones are the right way around? Now I'm confused. I think, let's check. 00:42:01.600 |
Some of these, originally I had my aspect ratio backwards. That's why I've got both. It looks 00:42:14.960 |
like I never got around to removing the ones that were unnecessary. Oops, wrong button. 00:42:52.640 |
Oops, pad mode. This makes it a bit easier to see what's going on if you do padding 00:43:18.800 |
There we go. Okay, yeah, so you can clearly see this is the one way around, 00:43:28.320 |
right? I've tried to make them wide, but actually they were tall. So the best way around is actually 00:43:34.640 |
640 by 480. That's more like it. So 640 by 480 is best. So let's get 00:43:48.640 |
rid of the ones that were the wrong way around. Okay, all right. 00:43:56.800 |
Yeah, so that was all, you know, various different transforms, pre-processing for 00:44:06.080 |
ConvNeXt Small, and then I did the same thing for one of the ViTs, ViT Small. 00:44:15.840 |
Now ViT, remember I mentioned it can only work on 224 by 224 images, so these rectangular 00:44:22.880 |
approaches aren't going to be possible. So I've just got the squish and the crop versions. 00:44:30.720 |
The crop version doesn't look very good. The squish version looks pretty good. 00:44:37.280 |
And I also tried a pad version, which looks pretty good. 00:44:55.680 |
And then, yeah, I also tried Swin, so here's Swin V2. 00:45:09.840 |
So I had to go down to the 192 pixel version, but actually it seems to work very well. 00:45:19.920 |
This is the first time we've had one that's better than 0.02, which is interesting. 00:45:32.800 |
This one's also very good. So it's interesting that this slow memory intensive model works 00:45:40.640 |
better even on smaller size, 192 pixel size, which I think is pretty interesting. 00:45:46.480 |
And then there was one more Swin, which seemed to do pretty well, so I included that, 00:45:59.440 |
OK results. So I kind of did that for all these different small models. And as you can see, 00:46:09.440 |
they run pretty quickly, right? 5 or 10 minutes. And so then I picked out the ones that look 00:46:18.560 |
pretty fast, pretty accurate, and created just a copy of that, which I called 00:46:30.880 |
Paddy Large. And this time I just replaced small with large. 00:46:35.360 |
And actually, I've made a mistake. I'm going to have to rerun this because there should not be a 00:46:45.680 |
seed equals 42. I actually want to run this on a different subset each time. And the reason why is 00:46:51.680 |
my plan is to train. So basically what I did was I deleted the ones that were less good 00:46:59.760 |
in Paddy Small. And so now I'm just running the large ones. Now some of these, particularly 00:47:07.680 |
something like this one, which is 288 by 224, they ran out of memory. They were too big for my 00:47:14.960 |
graphics card. And a lot of people at this point say, oh, I need to go buy a more expensive graphics 00:47:21.040 |
card. But that's not true. You don't. So if you guys remember our training loop, we get the 00:47:34.880 |
gradients. We add the gradients times the learning rate to the weights. And then we zero the gradients. 00:47:44.080 |
What you could do is half the batch size. So for example, from 64 to 32. And then only zero the 00:47:52.880 |
gradients every two iterations. And only do the update every two iterations. So basically you can 00:48:01.360 |
calculate in two batches what you used to calculate in one batch. And it will be mathematically 00:48:06.960 |
identical. And that's called gradient accumulation. And so for the ones which ran out of memory, 00:48:13.440 |
I added this little accum equals true, which is here in my function. And I said, yeah, I said if 00:48:21.200 |
accum equals true, then set the batch size to 32. Because by default it's 64. And add this thing 00:48:29.760 |
called a callback. Callbacks are basically things that change the behavior of the training. And 00:48:34.960 |
there's a thing called GradientAccumulation callback, which does gradient accumulation. And this 00:48:52.880 |
is like just for people that are interested. This is not like massively complex stuff. The entire 00:49:00.720 |
gradient accumulation callback is that many lines of code. Right? These are not big things. And 00:49:07.520 |
like literally all it does is it keeps a count of how many iterations it's been. And it 00:49:17.760 |
adds to, you know, keeps track of the count. And as long as we're not up to the point where we've 00:49:29.360 |
reached the number of accumulations we want, we skip the step and the zero gradient basically. 00:49:36.320 |
So it's, yeah, things like gradient accumulation, they sound like big complex things. But they, 00:49:44.880 |
yeah, turn out not to be. At least when you have a nice code base like fastai's. 00:49:56.720 |
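Concretely, using it from fastai looks roughly like this (a sketch; the architecture name and transforms are illustrative, `trn_path` is as in the earlier sketch, and GradientAccumulation(64) here means "step the optimiser once 64 samples' worth of gradients have accumulated"):

```python
# Sketch: gradient accumulation in fastai. Halve the batch size to 32 and accumulate
# gradients until 64 samples' worth have been seen before each optimiser step, rather
# than buying a bigger GPU. Architecture name and transforms are illustrative.
from fastai.vision.all import *

dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42,
                                   item_tfms=Resize(480, method="squish"),
                                   batch_tfms=aug_transforms(size=224),
                                   bs=32)                          # half the usual 64
learn = vision_learner(dls, "convnext_large_in22k", metrics=error_rate,
                       cbs=GradientAccumulation(64)).to_fp16()     # step every 64 samples
learn.fine_tune(12, 0.01)
```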
Jeremy, can I ask a question here? How exactly do the batch size manipulations work? 00:50:04.880 |
So we will get into that in detail in the course. And certainly we get into it in detail in the book. 00:50:15.360 |
But basically all that happens is we randomly shuffle the dataset and we grab, so if the batch 00:50:25.120 |
size is 64, we grab the next 64 images. We resize them all to be the same size and we stack them 00:50:34.800 |
on top of each other. So if it's black and white images, for example, we would have 64, 00:50:41.600 |
whatever, 640 by 480 images. And so we would end up with a 64 by 640 by 480 00:50:53.840 |
tensor. And pretty much all the functionality provided by PyTorch will work fine for a mini 00:51:06.320 |
batch of things, just as it would for a single thing on the whole. 00:51:13.600 |
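In tensor terms, that stacking step is just:

```python
# Sketch: stacking 64 single-channel 640x480 images into one mini-batch tensor.
import torch

images = [torch.rand(640, 480) for _ in range(64)]   # 64 "black and white" images
batch = torch.stack(images)
print(batch.shape)                                   # torch.Size([64, 640, 480])
```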
So in the larger scheme of things, you know, like some huge process that it's trying to characterise, 00:51:23.200 |
what role does the batch sort of play? Well, it's just about trying to get the 00:51:31.440 |
most out of your GPU. Your GPU can do 10,000 things at once. So if you just give it one image 00:51:38.240 |
at a time, you can't really use it. So if you give it 64 things, it can do, you know, a thing on each 00:51:46.560 |
image and then on each channel in that image, and then there are a few other kinds of 00:51:51.120 |
degrees of parallelisation it can do. And so that's why, you know, we saw that nvidia-smi 00:51:57.360 |
dmon command that shows you the utilisation of your symmetric multiprocessors. Yeah, if you use 00:52:05.280 |
a batch size of one, you'll see that SM will be like 1%, 2% and everything will be useless. 00:52:10.640 |
It's a bit tricky at inference time, you know, in production or whatever, because, 00:52:16.320 |
you know, most of the time you only get one thing to do at a time. And so often inference is done 00:52:22.400 |
on CPU rather than GPU, because we don't get to benefit from batching. 00:52:32.960 |
Or, you know, people will queue a few of them up and run them through the model on the GPU at once. And, 00:52:37.840 |
you know, stuff like that. But yeah, for training, it's pretty easy to take advantage of many batches. 00:52:45.840 |
Jeremy, you've trained so many models. Will you consider using a majority vote or something like 00:52:56.880 |
that? No, I wouldn't, because a majority vote throws away information, it throws away 00:53:04.400 |
the probabilities. So I pretty much always find I get better results by averaging the probabilities. 00:53:12.400 |
So each of them, each of the models after I've trained it, I'm exporting 00:53:18.160 |
to a uniquely named model, which is going to be the name of the architecture, then an underscore, 00:53:25.440 |
and then some description, which is just the thing I pass in. And so that way, 00:53:29.840 |
yeah, when I'm done training, I can just have a little loop which opens each of those models up, 00:53:36.560 |
grabs the TTA predictions, sticks them into a list. And then at the end, I'll average those 00:53:46.320 |
TTA predictions across the models. And that will be my ensemble prediction. 00:53:53.440 |
So that's my next step. I'm not up to that yet. Okay. 00:53:57.040 |
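That planned ensembling step would look roughly like this (the exported file names, test folder and vocab handling here are illustrative; averaging the TTA probabilities rather than voting is the point):

```python
# Sketch: averaging TTA probabilities across several exported models instead of
# taking a majority vote. File names and the test folder are illustrative.
from fastai.vision.all import *

model_files = ["convnext_small_in22k_squish.pkl",
               "vit_small_patch16_224_squish.pkl"]    # arch_description naming, as described
tst_files = get_image_files(Path("paddy-disease-classification/test_images"))  # hypothetical path

all_preds = []
for fname in model_files:
    learn = load_learner(fname)
    tst_dl = learn.dls.test_dl(tst_files)
    preds, _ = learn.tta(dl=tst_dl)                   # TTA probabilities from this model
    all_preds.append(preds)

ens_preds = torch.stack(all_preds).mean(0)            # average the probabilities
idxs = ens_preds.argmax(dim=1)
labels = [learn.dls.vocab[int(i)] for i in idxs]      # decode to class names for a submission
```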
All right. Well, I think that's it. So that's really more of a like little update on what 00:54:07.120 |
I've been doing over my weekend. But hopefully, yeah, gives you some ideas for things to try. 00:54:17.200 |
And hopefully, you find the Kaggle notebook useful. 00:54:23.040 |
So Jeremy, so how many hours did you spend on all these experiments? Because you ran a lot of 00:54:34.240 |
experiments here. So, you know, it's like a week or two of work to do the fine tuning experiments, 00:54:43.600 |
but that was like a few hours here and a few hours there. The final sweep was probably 00:54:50.480 |
maybe six hours on three GPUs. The Paddy competition stuff was maybe four hours a day 00:55:09.440 |
over the last four days since I last saw you guys. And writing the notebook was maybe another four 00:55:16.160 |
hours. Thanks. It helps. No worries. All right. Bye, everybody. Nice to see you all.