
Live coding 10


Chapters

0:00 Questions
6:00 Steps for entering a standard image recognition competition on Kaggle
8:40 The best models for fine-tuning image recognition
12:00 Thomas Capelle's script to run experiments
14:00 GitHub Gist
16:00 Weights and Biases API
17:00 Automating gist generation
20:30 Summarising and ranking models for fine-tuning
23:00 Scatter plot of performance by model family
25:40 Best models for images that don't look like ImageNet
33:00 Pretrained models - Model Zoo, Papers with Code, Hugging Face
37:30 Applying the learnings to the Paddy notebook with small models
46:00 Applying the learnings to large models
47:00 Gradient accumulation to prevent out-of-memory errors
52:50 Majority vote

Whisper Transcript

00:00:00.000 | I've got a question. Yeah, it's to do with, is there a way that machine learning can actually
00:00:08.800 | find the sort of conditional probabilistic segments that are say in sort of heterogeneous data?
00:00:18.240 | I am having trouble parsing that question. Can you give like an example or something? Yeah,
00:00:25.120 | okay. All right. Well, I'm working with road surface friction, with road risk rather. And
00:00:33.600 | quite immediately there's this set of stereotypes in road analysis. And we all know that there's
00:00:42.720 | highways, freeways, urban arterials. And they actually go through a series of stages,
00:00:53.040 | almost like states. And each of the states has got a sort of conditional probabilistic
00:01:00.480 | relationship between the set of predictors and the actual response variable,
00:01:06.480 | the crash response variable. Is there anything like that in deep learning?
00:01:14.800 | So how is that different to a normal predictive model? Like, I mean, all predictive models are
00:01:22.480 | conditional probabilities, right? What's the... Well, I mean, if you take something like XGBoost,
00:01:30.880 | for example, and you want to predict the risk of a given road, so it'll give you a value.
00:01:37.600 | But then you've got no idea as to what's happening inside of the model. And
00:01:42.640 | we're really interested in that because once you find the distributions,
00:01:51.680 | you can start to do some quality testing on whether they actually follow the domain or
00:01:57.600 | whether your segmentation process that actually determines your predictions is good or not.
00:02:04.320 | And so, in a way, rather than, say, predicting some sort of
00:02:14.880 | crash rate or risk or whatever, I'm really looking for those probabilistic distributions
00:02:24.080 | and learning beneath the surface. So all deep learning models will return a set of probabilities.
00:02:33.520 | That's what their final layer returns. And then we decode them by taking the argmax
00:02:40.160 | across them. But there's nothing to stop you using those probabilities directly.
00:02:44.800 | But I'm probably misunderstanding your question. It's a little abstract for me to understand.
00:02:53.520 | Like, I mean, I know there's lots of things you can do with
00:02:57.120 | confidence intervals and whatnot, but it really depends a great deal on the
00:03:07.520 | specific details of the application, what you're trying to do and how you're trying to do it.
00:03:12.320 | Good question, Daniel. I'm just talking about probability of an incident or risk
00:03:20.560 | related to the road surface. So you're going to need some sort of tabular data that has
00:03:27.920 | the occurrences with each road surface that you're trying to.
00:03:34.320 | And why wouldn't XGBoost give you that if you had a predictive model of incidents?
00:03:43.760 | In my mind, one of the disadvantages of XGBoost is the fact that it only gives you a single set
00:03:53.920 | of variable effects. Whereas in what we're dealing with, we've got some really high crash roads.
00:04:05.280 | We've got a different conditional probability relationship between the predictors and the
00:04:11.600 | response compared to, say, the average. XGBoost does an excellent job in making the predictions,
00:04:21.120 | but you've got no idea as to the group of instances that they're actually making the
00:04:29.520 | prediction or the actual variable effects. Okay, so I think I understand your question now,
00:04:38.160 | and I think the answer is actually it does. And what I suggest you do, if you haven't already,
00:04:44.240 | is read the chapter of the fastai book on tabular modeling, and it will cover something
00:04:51.440 | very similar, which is random forests, which is another ensemble of decision trees, and it will
00:04:56.080 | show you how to get exactly the kind of insights that I think you're looking for. And all of the
00:05:05.680 | techniques there would work equally well for random forests, and they also work equally well
00:05:09.280 | for deep learning. So maybe after you've done that, you can come back and let us know whether
00:05:13.200 | that helped. Yeah, well, I've sort of played with random forests. It doesn't really
00:05:20.400 | give you what I'm looking for. I strongly suggest you read the chapter before you say that. I will.
00:05:27.360 | Because I'm pretty sure it will. And if it doesn't, that would be very interesting to me.
00:05:35.840 | In fact, I mentioned to you last time, but I'm really looking forward to the tabular data.
00:05:42.480 | Cool. Great. I'll show you guys what I've been working on, which has been fun.
00:05:59.760 | So the first thing I did, you know, after I got off our last call was I basically just
00:06:08.880 | threw together the kind of like most obvious basic steps one would do for
00:06:23.280 | a standard image recognition competition, just in order to show people that that can be quite good.
00:06:30.800 | And it was actually a little embarrassing because I didn't mean to do this. When I submitted it,
00:06:36.800 | it turned out I got first on the leaderboard. So now I feel like I'm going to have to
00:06:42.320 | write down exactly what I did because, you know, during an active competition, everybody
00:06:51.440 | needs to share what they're doing, if they share it with anybody,
00:06:55.360 | at least somewhat publicly. So I thought I'd show you what I did here. But I think this is about to go up
00:07:01.920 | quite a lot. Because, you know, what we're working with here are interesting images
00:07:15.760 | for a couple of reasons. One is that they're kind of like things that you see in ImageNet,
00:07:21.200 | like, they're pictures of natural objects, they're photos. But I don't think ImageNet has any kind
00:07:29.600 | of like categories about diseases, you know, they have categories about like, what's the main
00:07:36.480 | object in this? So they might have a category about like, I don't know if they do like some
00:07:40.240 | different kinds of grass, or some different types of even some different types of, you know,
00:07:48.080 | fields or something, but I'm pretty sure they don't have anything about different kinds of
00:07:52.320 | crop disease. So it's a bit different to ImageNet, which is what most of our pre-trained models are
00:07:58.880 | trained on. But it's not that different. And it's also interesting because nearly all of the images
00:08:05.760 | are the same shape and size. So we can kind of try to take advantage of that.
00:08:15.600 | And, you know, so when we fine-tune a pre-trained model,
00:08:19.840 | there's, so let me pull up this Kaggle notebook I just created.
00:08:26.400 | So I just published this yesterday.
00:08:42.320 | Kind of look at what are the best vision models for fine-tuning. And so I kind of realized that
00:08:46.640 | there are two key dimensions that really seem to impact how well a model can be fine-tuned,
00:08:52.320 | you know, whether it works well or not, or how it's different. So one is what I just talked about,
00:08:57.280 | which is how similar is your data set to the data set used for the pre-trained model.
00:09:05.920 | If it's really similar, like pets to ImageNet, then like the critical factor is how well does
00:09:14.960 | the fine-tuning of the model maintain the weights that are pre-trained, you know, because you're
00:09:20.720 | probably not going to be changing very, very much. And you're probably going to be able to take
00:09:24.160 | advantage of really big, accurate models because they've already learned to do almost the exact
00:09:29.600 | thing you're trying to do. On the other hand, so that's the pets data set. On the other hand,
00:09:35.840 | there's a data set called the planet data set, which is a set of satellite images.
00:09:46.080 | And these are not really at all like anything that ImageNet ever saw, you know, they're taken
00:09:52.880 | from above, they're taken from much further away, there's no single main object. So a lot of the
00:10:03.280 | weights of a pre-trained model are going to be useless for fine-tuning this because they've
00:10:08.560 | learned specific features like, you know, what does text look like, what do eyeballs look like,
00:10:14.800 | what does fur look like, you know, which none of which are going to be very
00:10:18.880 | useful. So that's the first dimension. The second dimension is just how big the data set is.
00:10:24.560 | So on a big data set, you've got time, you've got epochs
00:10:29.600 | to take advantage of having lots of parameters in the model to learn to use them effectively.
00:10:40.400 | And if you don't have much data, then you don't have much ability to do that.
00:10:47.360 | So you might imagine that deep learning practitioners already know these
00:10:53.200 | answers of how do we, you know, what's the best models for fine-tuning. But in fact,
00:10:56.960 | we don't, as far as I know, nobody's ever done an analysis before of which models
00:11:01.440 | are the best for fine-tuning. So that's what I did over the weekend.
00:11:04.720 | And not just over the weekend, but really over the last couple of weeks.
00:11:09.200 | And I did this with Thomas Capelle, who works at Weights and Biases, another
00:11:15.840 | fast AI community member/alumni. And so what we did was we tried fine-tuning lots of models
00:11:24.160 | on two data sets, one which has over 10 times fewer images and where those images are not at all like
00:11:32.480 | ImageNet, that being the Kaggle Planet sample, and one which is a lot like ImageNet and has a
00:11:39.280 | lot more images, that being IIIT Pets. And I kind of figured like if we get some insights from those
00:11:46.400 | two, perhaps they'll be something that we can leverage more generally.
00:11:49.360 | So Thomas wrote this script, which is 86 lines, but really there's only like three or four lines
00:12:06.880 | and they'll all be lines you recognize, right? The lines are untar_data, ImageDataLoaders.from_ something,
00:12:14.480 | and then vision_learner(dls, model), etc. So there's the normal like three or four lines of
00:12:23.920 | code we see over and over again. And then, you know, the rest of it basically lets you
00:12:29.440 | pass into the script different choices about batch size, epochs, and so forth.
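
A minimal sketch of the kind of script being described, using fastai with its timm integration; the function signature, defaults, and the choice of the Pets dataset are illustrative rather than the actual script:

```python
from fastai.vision.all import *

def train(model_name="convnext_tiny", epochs=5, bs=64, lr=0.008,
          resize_method="squish", size=224):
    # the familiar three or four lines: get the data, build DataLoaders,
    # build a learner from a timm model name, and fine-tune it
    path = untar_data(URLs.PETS)
    dls = ImageDataLoaders.from_name_re(
        path, get_image_files(path/"images"), pat=r"(.+)_\d+.jpg", bs=bs,
        item_tfms=Resize(480, method=resize_method),
        batch_tfms=aug_transforms(size=size, min_scale=0.75))
    learn = vision_learner(dls, model_name, metrics=error_rate)
    learn.fine_tune(epochs, lr)
    return learn
```
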
00:12:35.680 | And that's about it. So this is like how simple the script was that we used. And then
00:12:47.280 | partly because Thomas works at Weights and Biases, and partly because Weights and Biases is pretty
00:12:54.240 | cool, we used Weights and Biases to feed in different values for each of those parameters.
00:13:03.360 | So this is a YAML file that Weights and Biases uses where you can say, okay, try each of these
00:13:10.320 | different learning rates, try each of these different models, try, let's see if I can find
00:13:16.480 | another one, try each of these different resize methods, each of these different pooling methods,
00:13:23.520 | this distribution of learning rates, you know, whatever, and it goes away and tries them.
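
A rough sketch of what such a sweep configuration might look like, written here with the wandb Python API rather than the YAML file shown on screen; the parameter names, values, and project name are illustrative:

```python
import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "error_rate", "goal": "minimize"},
    "parameters": {
        "model_name":    {"values": ["convnext_tiny", "vit_small_patch16_224", "swinv2_tiny_window8_256"]},
        "resize_method": {"values": ["squish", "crop"]},
        "pool":          {"values": ["concat", "avg"]},
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-3, "max": 1e-1},
    },
}

sweep_id = wandb.sweep(sweep_config, project="fine-tune-bench")  # hypothetical project name
wandb.agent(sweep_id, function=train)  # the function passed in would call wandb.init() and read wandb.config
```
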
00:13:30.240 | And then you can use their Web GUI to look at like the training results. So then you basically say,
00:13:37.520 | okay, start training and it trains each of these models of each of these datasets with each of these
00:13:42.400 | pool values and each of these resize methods and a few different selections from this distribution
00:13:46.320 | of learning rates and creates a Web GUI that you can dive into. I personally hate Web GUIs. I would
00:13:54.240 | much rather use Python, but they also thankfully have an API. So yeah, so once we ran that script
00:14:01.360 | for a few hours, I then checked the results into a GIST. So a GIST is just a place to check
00:14:19.360 | text files basically, if you haven't used it before. So I checked my CSV file in here.
00:14:31.920 | As you can see, it kind of displays it in a nice way, or you can just click on
00:14:36.640 | to see the raw data. So I find that quite a nice place just to check things which I'm just going to
00:14:44.240 | share publicly. And so then I can grab the URL to the gist.
00:14:49.120 | And maybe let me show you how I did that.
00:15:02.160 | Right.
00:15:22.960 | Right.
00:15:38.000 | So I just kind of like everything to be automated so I can always easily redo it because I always
00:15:51.040 | assume my first effort is going to be crap, and it always is. And normally my second,
00:15:54.640 | third efforts are crap as well. So here's my little notebook I put together.
00:16:01.600 | So basically, each time you do one of these sweeps on weights and biases, it generates a new ID. And
00:16:09.760 | so we ended up kind of doing five different ones as we realized we were able to add different models
00:16:14.080 | and change things a little bit. And so they have this API that you can use. And so you basically
00:16:21.600 | can go through and say, go through each of the sweep IDs and ask the API for that sweep and grab
00:16:28.560 | the runs from it. And then for each one create a dictionary containing a summary and the model name.
00:16:34.800 | So the details don't matter too much, but you kind of get the idea, hopefully, and then turn that into
00:16:38.640 | a data frame. And so I kind of end up with this data frame that contains all the different
00:16:46.800 | configuration parameters along with their loss and their speed, their accuracy, GPU,
00:17:00.560 | maximum memory usage, and so forth. So that's basically what I wanted to chuck into a GIST.
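
A sketch of pulling the sweep runs down into a pandas DataFrame with the wandb API; the entity, project, and sweep ids are placeholders, and the way the summary is converted to a plain dict may differ between wandb versions:

```python
import pandas as pd
import wandb

api = wandb.Api()
sweep_ids = ["abc123", "def456"]  # hypothetical sweep ids

rows = []
for sid in sweep_ids:
    sweep = api.sweep(f"my-team/fine-tune-bench/{sid}")
    for run in sweep.runs:
        # run.config holds the swept parameters, run.summary the final metrics
        rows.append({**run.config, **run.summary._json_dict})

df = pd.DataFrame(rows)
```
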
00:17:08.160 | And so specifically, I really wanted this subset of the columns. So these are the columns I wanted.
00:17:12.560 | So I can grab those columns and put them into a CSV. Now, one thing you might not realize is
00:17:19.120 | I would say for most Python libraries, or at least most well-written ones
00:17:25.520 | anyway, anywhere you can put a file name (so when you call to_csv, you'd normally put a file name or a path here),
00:17:30.240 | you could instead put something called a StringIO object, which is something that behaves exactly
00:17:35.440 | like a file, but it actually just stores everything into a string. Because I don't want this stored into
00:17:44.080 | a file, it puts it into a string. So if you then call .getvalue(), I actually get the string.
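
A sketch of the StringIO trick, together with the gist upload described just below; df is the results frame from above, the column subset is illustrative, and the create_gist call (which needs a GitHub token in the environment) should be checked against the ghapi docs:

```python
import io
from ghapi.all import GhApi

cols = ["model_name", "dataset", "error_rate", "fit_time", "GPU_mem"]  # illustrative column subset
buf = io.StringIO()              # behaves like a file, but writes into an in-memory string
df[cols].to_csv(buf, index=False)
csv_text = buf.getvalue()        # the CSV contents as a plain string

api = GhApi()                    # reads a GitHub token from the environment
gist = api.create_gist("Fine-tuning benchmark results", csv_text,
                       filename="results.csv", public=True)
print(gist.html_url)
```
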
00:17:49.680 | And so even things like creating the GIST, I want to do that automatically. So there's a
00:17:55.120 | library I'm very fond of. I'm very biased because I made it. It's called ghapi, which is an API for GitHub,
00:18:04.800 | where we can do things like, say, create GIST. And you give it a description. And here's the text,
00:18:10.640 | which is the contents of the CSV. And the file name, make it public. And then you can get the HTML,
00:18:18.000 | URL of the GIST. So that's how I used, in this case, a notebook as my kind of, you know,
00:18:26.400 | interactive REPL, read-eval-print loop, for manipulating this data set, putting it together,
00:18:34.720 | and then uploading it to GitHub. Jeremy, I had a doubt about this data frame. Here you have
00:18:41.600 | it, like, in the gist you put up, and it had entries for the Planet dataset
00:18:48.240 | and the other dataset as well, the Pets dataset. So how did you populate it? So what's your question?
00:18:55.680 | How did I populate this data set? Yeah. Just here. So I passed it a list of dictionaries.
00:19:05.760 | The list of dictionaries I created using a list comprehension. Okay. Containing a bunch of
00:19:12.240 | dictionaries. Okay. Got it. And so that's going to make each key a column. So that means all the dictionaries
00:19:21.200 | should have, you know, roughly the same keys. Any that are missing are going to end up being
00:19:26.160 | NA. And then I just fiddled around with it slightly. So, for example, make sure everything had an error
00:19:33.680 | rate that was equal to one minus the accuracy. On the planet data set, it's not called accuracy.
00:19:38.880 | So I copied accuracy_multi into accuracy. Yeah, nothing very exciting. Thank you.
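
A small made-up illustration of what's being described: pandas turns a list of dicts into columns, missing keys become NaN, and the Planet runs' accuracy_multi is copied across before deriving the error rate:

```python
import pandas as pd

rows = [
    {"model_name": "convnext_tiny",         "dataset": "pets",   "accuracy": 0.946},
    {"model_name": "vit_small_patch16_224", "dataset": "planet", "accuracy_multi": 0.964},
]
df = pd.DataFrame(rows)                                # dict keys become columns; gaps become NaN
df["accuracy"] = df["accuracy"].fillna(df["accuracy_multi"])
df["error_rate"] = 1 - df["accuracy"]
```
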
00:19:46.960 | Jeremy, what's the actual goal of this? Let me show you. So what we've now got
00:19:53.200 | is a CSV, which I can then
00:20:06.800 | also very helpful. Okay. A CSV, which I can then use Pandas' pivot table functionality
00:20:19.520 | to group by the data set, the model family and name, and calculate the mean of error rate,
00:20:26.560 | fit time, and GPU memory. And I can then take the pets subset of that
00:20:36.560 | sort by score, where score represents a combination of error and speed and take the top 15.
00:20:45.040 | And this now shows me the top 15 best models for fine-tuning on pets.
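
A sketch of that summarising step; the CSV file name and column names are assumptions, and the score line is one reading of the "error rate times fit time plus 80" formula he describes a little further on:

```python
import pandas as pd

df = pd.read_csv("fine_tune_results.csv")   # e.g. downloaded from the public gist

pt = df.pivot_table(
    index=["dataset", "family", "model_name"],
    values=["error_rate", "fit_time", "GPU_mem"],
    aggfunc="mean",
).reset_index()

pets = pt[pt.dataset == "pets"].copy()
pets["score"] = pets.error_rate * (pets.fit_time + 80)            # arbitrary accuracy/speed trade-off
top15 = pets.sort_values("score").reset_index(drop=True).head(15)
print(top15)
```
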
00:20:55.840 | And this is gold, in my opinion. I don't think anybody's ever done anything
00:21:00.080 | like this before. There's never been a list of like, here are the best models for fine-tuning.
00:21:04.960 | Sorry, I have a question. So you fine-tuned different models with pets and then collected
00:21:16.000 | this information. Is that correct? That's correct. And then based on the information that you collected
00:21:21.920 | from the fine-tuning of five or whatever number of iterations. We did three runs for each model. Yes.
00:21:28.800 | And then you collected this information to find out which one is the best-behaved model for this
00:21:37.280 | specific case. Correct, correct, correct, correct. Exactly. And best is going to involve two things.
00:21:43.680 | It's going to be which ones have the lowest error rate and which ones are the fastest.
00:21:47.440 | Now, I created this kind of arbitrary scoring function where I multiplied the error rate
00:21:53.200 | times fit time plus 80. Just because I felt like that particular value of that constant gave me an
00:22:00.800 | ordering that I was reasonably comfortable with. But you can kind of look through here and see like,
00:22:05.440 | okay, well, VIT base has a much better error rate than conv next tiny. But it's also much slower.
00:22:15.360 | Like, you can decide for your needs where you want to trade off. So that's what I kind of,
00:22:22.160 | the first thing I did was to create this kind of top 15. And it's interesting looking at the family,
00:22:27.120 | right? The family is like each of these different architectures, you know, is kind of
00:22:32.720 | from different sizes of a smaller subset of families, right? So there's conv next tiny,
00:22:39.280 | conv next base, conv next tiny in-22k and so forth. So you can kind of get a sense of like,
00:22:46.080 | if you want to learn more about architectures, which ones seem most interesting and, you know,
00:22:49.920 | for fine tuning on pets, it looks like conv next, VIT, SWIN, ResNet are the main ones.
00:22:58.800 | So that, you know, the first thing I did, the second thing I then did was to
00:23:04.480 | take those most interesting families, actually also added this one called ResNetX and created
00:23:14.240 | a scatterplot of them, colored by family. And so you can kind of see, like, for example, conv next,
00:23:23.680 | which I'm rather fond of, is this kind of blue line, these blue ones, right? And so you can see
00:23:33.680 | that the very best error rate actually was a conv next. So they're pretty good. You can see this one
00:23:44.880 | here, which is ResNetX, seems to be, had some pretty nice values. They're like super fast,
00:23:56.800 | seems like these tiny SWINs seem to be pretty good. So it kind of gives you a sense of like,
00:24:02.480 | you know, depending on how much time you've got to run or how accurate you want to be,
00:24:05.600 | what families are likely to most useful. And then the last thing I did for pets was I
00:24:15.600 | grabbed a subset of basically the ones which are in the top half: smaller than the median
00:24:24.000 | and faster than the median, because these are the ones I generally care about most of the time,
00:24:27.840 | because most of the time I'm going to be training quick iterations. And then I just ordered those
00:24:34.320 | by error rate. And so conv next tiny has got the best error rate of those which are in the upper
00:24:42.080 | half of both speed and accuracy. >> What's GPU memory in this context?
00:24:51.520 | >> That's the maximum amount of GPU memory that was used. I can't remember what
00:24:57.360 | the units of measure are, but they don't matter too much because it'll be different
00:25:04.800 | for your dataset. All that matters is the relative usage. And so if you want something,
00:25:13.280 | you know, if you try to use this and it's actually uses too much GPU memory,
00:25:18.960 | you could try ResNet 50D, for example, or, you know, it's interesting that like ResNet 26
00:25:28.240 | is really good for memory and speed. Or if you want something really lightweight on memory,
00:25:36.560 | RegNet Y004. But the error rates are getting much worse once you get out to here, as you can see.
00:25:44.160 | So then I looked at Planet. And so as I said, Planet's kind of as different a dataset
00:25:51.920 | as you're going to get in one sense, or it's very different. And so not surprisingly,
00:26:00.400 | its top 15 is also very different. And interestingly, all of the top six are from the same family.
00:26:09.120 | So this ViT family, these are a kind of model called transformer models. And what this is
00:26:15.760 | basically showing is that these models are particularly good at rapidly identifying
00:26:23.360 | features of data types it hasn't seen before. So, you know, if you were doing something like
00:26:28.880 | medical imaging or satellite imagery or something like that, these would probably be a good thing
00:26:33.920 | to try. And Swin, by the way, is kind of another transformer-based model, which, as you can see,
00:26:41.680 | it's actually the most accurate of all, but it's also the smallest. This is Swin V2.
00:26:48.160 | So I thought that was pretty interesting. And, you know, these VIT models, there are ones with
00:26:59.360 | pretty good error rates that also have very little memory use and also run very quickly.
00:27:03.760 | So I did the same thing for Planet. And so perhaps not surprisingly, but interestingly for Planet,
00:27:12.960 | these lines don't necessarily go down, which is to say that the really big models,
00:27:19.440 | the big slow models don't necessarily have better error rates. And that makes sense, right? Because
00:27:26.000 | if they've got heaps of parameters, but they're trying to learn something they've never seen
00:27:29.680 | before on very little data, it's unlikely we're going to be able to take advantage of those
00:27:34.160 | parameters. So when you're doing stuff that doesn't really look much like ImageNet,
00:27:41.760 | you might want to be down more towards this end. So here's the VIT, for example.
00:27:52.880 | And here's that really good Swin model. And there's ConvNeXt Tiny. So then we can do the
00:28:01.600 | same thing again of like, okay, let's take the top half, both in terms of speed and memory use.
00:28:06.640 | Yeah, ConvNeXt Tiny still looks good. These ViT models are 224. Yeah, this is because you can
00:28:16.400 | only run these models on images of size 224 by 224. You can't use different sizes,
00:28:24.160 | whereas the ConvNeXt models, you can use any size. So it's also interesting to see the
00:28:31.360 | classic ResNet still. Again, they do pretty well. Yeah, so I'm pretty excited about this.
00:28:41.760 | It feels like exactly what we need to kick on with this Paddy Doctor competition; or indeed,
00:28:52.240 | any kind of computer vision classification task needs this. And I ran this sweep on
00:29:08.240 | three consumer RTX GPUs in 12 hours or something. Like this is not big institutional resources
00:29:19.120 | required. And one of the reasons why is because I didn't try every possible level of everything,
00:29:30.160 | right? I tried a couple of, you know, so Thomas did a kind of a quick learning rate sweep to kind
00:29:40.560 | of get a sense of the broad range of learning rates that seemed pretty good. And then we just
00:29:44.000 | tried a couple of learning rates and a couple of the best resize methods and a couple of the best
00:29:48.640 | pooling types across a few broadly different kinds of models across the two different datasets
00:29:58.880 | to kind of see if there was any common features. And we found in every single case the same learning
00:30:04.160 | rate, the same resize method and the same pooling type was the best. So we didn't need to try every
00:30:09.120 | possible combination of everything, you know. And this is where like a lot of the stuff you see from
00:30:15.040 | like Google and stuff, they tend to do hundreds of thousands of experiments, because I guess they
00:30:21.280 | have no need to do things efficiently, right? Yeah, but you don't have to do it the Google way. You
00:30:30.320 | can do it the fast AI way. Quick question, Jeremy. Which cards did you use? And another question
00:30:43.520 | is, which cards did you say? Yeah, the GPU cards. Oh, RTX 3090. Oh, okay. So they were all three
00:30:52.960 | different? They're all RTX 3090s. Okay. And you reset the index after the query? Why? Oh, just
00:31:04.080 | because otherwise, it shows the numeric ID here will be the numeric ID from the original dataset.
00:31:11.360 | And I wanted to be able to quickly kind of say, what's number six? What's number 10? What's number
00:31:14.560 | three? That's all. So visually. Yeah. Okay. Jeremy, getting back to the earth,
00:31:22.720 | satellite images, when you say, you know, like the classification, what is it trying to classify?
00:31:29.120 | In this case, the planet competition.
00:31:32.640 | We have some examples. Basically, they try to classify for each area of the satellite imagery.
00:31:59.680 | What's it a picture of? Is it forest or farmland or town or whatever?
00:32:06.720 | And what weather conditions were observed, if I remember correctly.
00:32:10.880 | Question in this image space is, is it just these two major datasets? Or how do you find other
00:32:23.200 | models that are trained on, beside the Planet and ImageNet ones?
00:32:27.440 | Oh, you mean beside planet and pets? Sorry. Yep. That's the answer. What was your question? How
00:32:34.400 | do you do what with them? How do you find other trained pre-trained models that have been worked
00:32:40.800 | on different data sets? These all use pre-trained models, pre-trained on ImageNet. These are only
00:32:46.560 | using pre-trained models, pre-trained on ImageNet. So how do you find pre-trained models,
00:32:52.320 | pre-trained on other things? Mainly, you don't. There aren't many. But, you know, just Google.
00:33:00.160 | Depends what you're interested in. And academic papers.
00:33:07.040 | There is a, I don't know how it's doing these days, but there is a Model Zoo.
00:33:20.720 | Which I've never had much success with, to be honest.
00:33:25.600 | So these are a range of pre-trained models that you can download.
00:33:35.920 | Yeah. But as I say, I haven't found it particularly successful, to be honest.
00:33:41.520 | You could also try Papers with Code.
00:33:45.440 | And I think these, yeah, they have a link to the paper and the code. That doesn't necessarily mean
00:33:59.760 | they've got a pre-trained model. And then you can just click on the code and see.
00:34:08.480 | And of course, for NLP models, there's the Hugging Face Model Hub, which we've seen before.
00:34:22.160 | And that is an easy answer for NLP. Lots of different pre-trained models are on that hub.
00:34:29.200 | Jeremy, since you touch on academic papers and papers with code,
00:34:36.560 | first question, will this comparison, do you or Thomas intend to publish it?
00:34:42.880 | If not, if you were to do that, what would you go for, actually? What kind of journal would you look at?
00:34:52.720 | So I'm not a good person to ask that question because I very rarely publish anything.
00:34:59.120 | Which is partly a philosophical thing. I find academia overly exclusive and I don't
00:35:08.560 | love PDFs as a publication form. And I don't love the writing style, which is kind of required if
00:35:15.280 | you're going to get published as being rather difficult to follow. I have published a couple
00:35:26.000 | of papers, but like only really one significant deep learning one. And that was because
00:35:32.720 | a guy named Sebastian Ruder was doing his PhD at the time. And he said it'd be really helpful to
00:35:39.600 | him if we could co-publish something and that he would kind of take the lead on writing the paper.
00:35:45.920 | And so that was good because I'm always very happy to help students. And
00:35:52.080 | he did a good job and he was a terrific researcher to work with. The other time I've written a paper,
00:35:59.680 | the main time was when I wanted to get that message out about masks. And I felt like it's
00:36:05.360 | probably not going to be taken seriously unless it's in an exclusive academic paper because
00:36:09.520 | medical people are very into exclusive things. So I don't know. I'd say this kind of thing,
00:36:19.120 | I suspect would be quite hard to publish because most deep learning academic venues are very
00:36:26.320 | focused on things with kind of reasonably strong theoretical pieces. And this kind of
00:36:33.840 | field of like trying things and seeing what works is experiment-based is certainly a
00:36:46.160 | very important part of science in other areas. But in the deep learning world,
00:36:49.760 | it hasn't really yet been recognized as a valid source of research, as far as I can tell.
00:36:55.840 | I could concur with all the domains and feel the same quandary to be honest.
00:37:01.760 | Fair enough. What's your domain?
00:37:03.920 | Hydrology, but more the computational science part of it.
00:37:14.480 | Okay. So then what I did was I,
00:37:25.760 | I mean, this is kind of a bit at the same time, but I went back to Paddy
00:37:35.360 | and I wanted to try out a few of these interesting looking models reasonably quickly.
00:37:46.080 | So what I did was I kind of took our standard, well, in this case, three lines of code because I've
00:38:01.280 | already untarred it earlier, took our three lines of code. So I could basically say train and pass
00:38:09.520 | in an architecture and pass in some per item pre-processing, in this case resizing everything
00:38:19.280 | to the same square using Squish and some per batch pre-processing, which in this case is the standard
00:38:25.280 | fast AI data augmentation transforms targeting a final size of 224, which is what most models tend
00:38:32.560 | to be trained at. And so then train a model using those parameters. And then finally, it would use
00:38:42.720 | test time augmentation. So test time augmentation is where I think we briefly mentioned it last time.
00:38:49.920 | We, in this case, on the validation set, I basically run the fine-tuned model four times
00:39:01.920 | using random data augmentations each time. And then I run it one more time with no data
00:39:09.360 | augmentations at all and take an average of all of those five predictions basically.
00:39:13.760 | And that gives me some predictions. And then I take an error rate for TTA for the test time
00:39:20.560 | augmentation. So that basically spits out a number, which is an error rate for Paddy.
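
A sketch of the kind of train-and-evaluate helper being described: build DataLoaders with the given per-item and per-batch transforms, fine-tune, then score with test-time augmentation. The path, epoch count, learning rate, and model name are illustrative:

```python
from fastai.vision.all import *

trn_path = Path("paddy-disease-classification/train_images")   # hypothetical local path

def train(arch, item, batch, epochs=12):
    dls = ImageDataLoaders.from_folder(
        trn_path, valid_pct=0.2, seed=42, bs=64,               # fixed seed -> same validation split
        item_tfms=item, batch_tfms=batch)
    learn = vision_learner(dls, arch, metrics=error_rate)
    learn.fine_tune(epochs, 0.01)
    preds, targs = learn.tta(dl=dls.valid)                     # several augmented passes plus one plain pass
    print(error_rate(preds, targs))
    return learn

# e.g. squish to a square at load time, then augment down to 224 for the batch
learn = train("convnext_small_in22k",
              item=Resize(480, method="squish"),
              batch=aug_transforms(size=224, min_scale=0.75))
```
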
00:39:28.000 | And I use a fixed random seed when picking out my validation set. So each time I run this,
00:39:37.200 | it's going to be with the same validation set. And so I can compare. So I've got a few different
00:39:42.320 | conf next small models I've run. First of all, by squishing when I resize and then by cropping
00:39:51.680 | when I resize. So that was 235. This is also 235. And then instead of resizing to a square,
00:40:04.160 | I resize to a rectangle. In theory, this wouldn't have been necessary. I thought they were all 480
00:40:13.120 | by 640. But when I ran this, I got an error. And then I looked back at the results of that
00:40:19.680 | parallel image sizing thing we ran. And I realized there was actually three or four images that were
00:40:24.880 | the opposite aspect ratio. So that's why. So the vast majority of the images,
00:40:32.240 | this resizing does nothing at all. But it's three or four that are the opposite aspect ratio.
00:40:37.440 | And then for the augmentation, yeah, pick a size based on 224
00:40:45.680 | of a similar aspect ratio. But what I'm actually aiming for here is something that is a
00:40:54.800 | multiple of 32 on both edges. And the reason for that we'll kind of get into later when we learn
00:41:01.840 | about how convolutional networks really work. But it basically turns out that the kind of the
00:41:07.200 | final patch size in a convnet is 32 by 32 pixels. So you generally want both of your sides. Normally
00:41:14.560 | you want them to be multiples of 32. So this one got a pretty similar result again, 240. And then
00:41:23.360 | I wasn't sure about my contention that they need to be multiples of 32. I thought maybe it's better
00:41:28.320 | if they get a really crisp resizing by using an exact multiple. So I tried that as well.
00:41:35.920 | And that, as I suspected, was a bit worse. Oh, what's this? I've got some which,
00:41:53.520 | which ones are the right way around? Now I'm confused. I think, let's check.
00:42:01.600 | Some of these, originally I had my aspect ratio backwards. That's why I've got both. It looks
00:42:14.960 | like I never got around to removing the ones that were unnecessary. Oops, wrong button.
00:42:23.200 | Copy paste size.
00:42:29.120 | Paste.
00:42:36.080 | Leave those off.
00:42:49.680 | Method equals pad.
00:42:52.640 | Oops, pad mode. This makes it a bit easier to see what's going on if you do padding
00:43:05.680 | with black around them.
00:43:18.800 | There we go. Okay, yeah, so you can clearly see this is the one way around,
00:43:28.320 | right? I've tried to make them wide, but actually they were tall. So the best way around is actually
00:43:34.640 | 640 by 480. That's more like it. So 640 by 480 is best. So let's get
00:43:48.640 | rid of the ones that were the wrong way around. Okay, all right.
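
A sketch of that visual sanity check: pad everything to the target rectangle with black borders so any image with the opposite aspect ratio stands out; the path and target size are illustrative:

```python
from fastai.vision.all import *

trn_path = Path("paddy-disease-classification/train_images")   # hypothetical local path
dls = ImageDataLoaders.from_folder(
    trn_path, valid_pct=0.2, seed=42,
    item_tfms=Resize((640, 480), method=ResizeMethod.Pad, pad_mode=PadMode.Zeros))
dls.show_batch(max_n=6)
```
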
00:43:56.800 | Yeah, so that was all, you know, various different transforms, pre-processing for
00:44:06.080 | ConvNeXt Small, and then I did the same thing for one of the ViTs, ViT Small.
00:44:15.840 | Now VIT, remember I mentioned it can only work on 224 by 224 images, so these rectangular
00:44:22.880 | approaches aren't going to be possible. So I've just got the squish and the crop versions.
00:44:30.720 | The crop version doesn't look very good. The squish version looks pretty good.
00:44:37.280 | And I also tried a pad version, which looks pretty good.
00:44:55.680 | And then, yeah, I also tried SWIN, so here's SWIN V2.
00:45:03.760 | And this one is slow and memory intensive.
00:45:09.840 | So I had to go down to the 192 pixel version, but actually it seems to work very well.
00:45:19.920 | This is the first time we've had one that's better than 0.02, which is interesting.
00:45:32.800 | This one's also very good. So it's interesting that this slow memory intensive model works
00:45:40.640 | better even on smaller size, 192 pixel size, which I think is pretty interesting.
00:45:46.480 | And then there was one more SWIN, which seemed to do pretty well, so I included that,
00:45:54.240 | which I was able to do at 224. That one had
00:45:59.440 | OK results. So I kind of did that for all these different small models. And as you can see,
00:46:09.440 | they run pretty quickly, right? 5 or 10 minutes. And so then I picked out the ones that look
00:46:18.560 | pretty fast, pretty accurate, and created just a copy of that, which I called
00:46:30.880 | paddy-large. And this time I just replaced small with large.
00:46:35.360 | And actually, I've made a mistake. I'm going to have to rerun this because there should not be a
00:46:45.680 | seed equals 42. I actually want to run this on a different subset each time. And the reason why is
00:46:51.680 | my plan is to train. So basically what I did was I deleted the ones that were less good
00:46:59.760 | in paddy-small. And so now I'm just running the large ones. Now some of these, particularly
00:47:07.680 | something like this one, which is 288 by 224, they ran out of memory. They were too big for my
00:47:14.960 | graphics card. And a lot of people at this point say, oh, I need to go buy a more expensive graphics
00:47:21.040 | card. But that's not true. You don't. So if you guys remember our training loop, we get the
00:47:34.880 | gradients. We subtract the gradients times the learning rate from the weights. And then we zero the gradients.
00:47:44.080 | What you could do is half the batch size. So for example, from 64 to 32. And then only zero the
00:47:52.880 | gradients every two iterations. And only do the update every two iterations. So basically you can
00:48:01.360 | calculate in two batches what you used to calculate in one batch. And it will be mathematically
00:48:06.960 | identical. And that's called gradient accumulation. And so for the ones which ran out of memory,
00:48:13.440 | I added this little accum=True, which is here in my function. And I said, yeah, I said if
00:48:21.200 | accum is True, then set the batch size to 32, because by default it's 64, and add this thing
00:48:29.760 | called a callback. Callbacks are basically things that change the behavior of the training. And
00:48:34.960 | there's a thing called the GradientAccumulation callback, which does gradient accumulation. And this
00:48:52.880 | is like just for people that are interested. This is not like massively complex stuff. The entire
00:49:00.720 | gradient accumulation callback is that many lines of code. Right? These are not big things. And
00:49:07.520 | like literally all it does is it keeps a count of how many iterations it's been. And it
00:49:17.760 | adds to, you know, keeps track of the count. And as long as we're not up to the point where we've
00:49:29.360 | got the number of accumulations we want, it skips the step and the zero gradient, basically.
00:49:36.320 | So it's, yeah, things like gradient accumulation, they sound like big complex things. But they,
00:49:44.880 | yeah, turn out not to be. At least when you have a nice code base like fastai's.
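
Two sketches of the idea: first the manual version with a toy model, then the fastai callback as used here. The convnext_large model name is illustrative, and dls is assumed to be built as in the earlier sketch but with bs=32:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_func = nn.CrossEntropyLoss()
xs, ys = torch.randn(64, 10), torch.randint(0, 2, (64,))
dl = list(zip(xs.split(32), ys.split(32)))     # two half-size batches of 32

accum = 2                                      # number of half-size batches per optimiser step
for i, (xb, yb) in enumerate(dl):
    loss = loss_func(model(xb), yb) / accum    # scale the mean loss so the summed gradient matches a full batch
    loss.backward()                            # gradients accumulate across backward() calls
    if (i + 1) % accum == 0:
        opt.step()                             # one optimiser step per 64 samples
        opt.zero_grad()

# fastai's version: keep bs=32 in the DataLoaders and let the callback accumulate
# gradients until 64 samples have been seen before each optimiser step
from fastai.vision.all import GradientAccumulation, vision_learner, error_rate
learn = vision_learner(dls, "convnext_large_in22k", metrics=error_rate,
                       cbs=GradientAccumulation(64))
learn.fine_tune(12, 0.01)
```
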
00:49:56.720 | Jeremy, can I get a question in here? How exactly do the batch size manipulations work?
00:50:04.880 | So we will get into that in detail in the course. And certainly we get into it in detail in the book.
00:50:15.360 | But basically all that happens is we randomly shuffle the dataset and we grab, so if the batch
00:50:25.120 | size is 64, we grab the next 64 images. We resize them all to be the same size and we stack them
00:50:34.800 | on top of each other. So if it's black and white images, for example, we would have 64,
00:50:41.600 | whatever, 640 by 480 images. And so we would end up with a 64 by 640 by 480
00:50:53.840 | tensor. And pretty much all the functionality provided by PyTorch will work fine for a mini
00:51:06.320 | batch of things, just as it would for a single thing on its own.
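
A tiny illustration of a mini-batch: images resized to the same shape and stacked along a new leading axis (here 64 three-channel 640 by 480 images):

```python
import torch

imgs = [torch.rand(3, 640, 480) for _ in range(64)]   # 64 RGB images as (channels, height, width)
xb = torch.stack(imgs)                                # stack along a new batch dimension
print(xb.shape)                                       # torch.Size([64, 3, 640, 480])
```
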
00:51:13.600 | So in the larger scheme of things, you know, like some huge processes that's trying to characterise,
00:51:23.200 | what role does the batch sort of play? Well, it's just about trying to get the
00:51:31.440 | most out of your GPU. Your GPU can do 10,000 things at once. So if you just give it one image
00:51:38.240 | at a time, you can't really use it. So if you give it 64 things, it can do one, you know, a thing on each
00:51:46.560 | image and then on each channel in that image, and then you've got another few other kinds of
00:51:51.120 | degrees of parallelization it can do. And so that's why, you know, we saw that nvidia-smi
00:51:57.360 | dmon command that shows you the utilisation of your streaming multiprocessors. Yeah, if you use
00:52:05.280 | a batch size of one, you'll see that SM will be like 1%, 2% and everything will be useless.
00:52:10.640 | It's a bit tricky at inference time, you know, in production or whatever, because,
00:52:16.320 | you know, most of the time you only get one thing to do at a time. And so often inference is done
00:52:22.400 | on CPU rather than GPU, because we don't get to benefit from batching.
00:52:32.960 | Or, you know, people will queue a few of them up and stick them through the model on the GPU at once. And,
00:52:37.840 | you know, stuff like that. But yeah, for training, it's pretty easy to take advantage of mini-batches.
00:52:42.560 | Okay, thank you. No worries.
00:52:45.840 | Jeremy, you've trained so many models. Will you consider using a majority vote or something like
00:52:56.880 | that? No, I wouldn't, because a majority vote throws away information, it throws away
00:53:04.400 | the probabilities. So I pretty much always find I get better results by averaging the probabilities.
00:53:12.400 | So each of them, each of the models after I've trained it, I'm exporting
00:53:18.160 | to a uniquely named model, which is going to be the name of the architecture, then an underscore,
00:53:25.440 | and then some description, which is just the thing I pass in. And so that way,
00:53:29.840 | yeah, when I'm done training, I can just have a little loop which opens each of those models up,
00:53:36.560 | grabs the TTA predictions, sticks them into a list. And then at the end, I'll average those
00:53:46.320 | TTA predictions across the models. And that will be my ensemble prediction.
00:53:53.440 | So that's my next step. I'm not up to that yet. Okay.
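
A sketch of that ensembling loop: load each uniquely named exported model, collect its TTA probabilities on the test set, and average them rather than majority-voting the argmaxes. The file names and test folder are illustrative:

```python
from fastai.vision.all import *
import torch

tst_path = Path("paddy-disease-classification/test_images")    # hypothetical test folder
model_files = ["convnext_small_in22k_squish.pkl",              # e.g. saved earlier via
               "vit_small_patch16_224_pad.pkl"]                # learn.export(f"{arch}_{desc}.pkl")

all_preds = []
for f in model_files:
    learn = load_learner(f)
    tst_dl = learn.dls.test_dl(get_image_files(tst_path))
    preds, _ = learn.tta(dl=tst_dl)
    all_preds.append(preds)

avg_preds = torch.stack(all_preds).mean(0)   # average the probabilities across models
idxs = avg_preds.argmax(dim=1)               # decode to class indices once, at the end
```
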
00:53:57.040 | All right. Well, I think that's it. So that's really more of a like little update on what
00:54:07.120 | I've been doing over my weekend. But hopefully, yeah, gives you some ideas for things to try.
00:54:17.200 | And hopefully, you find the Kaggle notebook useful.
00:54:23.040 | So Jeremy, so how many hours did you spend on all these explorations? Because you ran a lot of
00:54:34.240 | experiments here. So, you know, it's like a week or two of work to do the fine-tuning experiments,
00:54:43.600 | but that was like a few hours here and a few hours there. The final sweep was probably
00:54:50.480 | maybe six hours on three GPUs. The Paddy competition stuff was maybe four hours a day
00:55:09.440 | over the last four days since I last saw you guys. And writing the notebook was maybe another four
00:55:16.160 | hours. Thanks. It helps. No worries. All right. Bye, everybody. Nice to see you all.
00:55:26.320 | Thanks so much. Thanks, Jeremy. Bye, everyone.