
Live coding 10


Chapters

0:00 Questions
6:00 Steps for Entering a Standard Image Recognition Competition on Kaggle
8:40 The best models for fine-tuning image recognition
12:00 Thomas Capelle's script to run experiments
14:00 GitHub Gist
16:00 Weights and Biases API
17:00 Automating Gist generation
20:30 Summarising and ranking models for fine-tuning
23:00 Scatter plot of performance by model family
25:40 Best models for images that don't look like ImageNet
33:00 Pretrained models - Model Zoo, Papers With Code, Hugging Face
37:30 Applying learning on Paddy notebook with small models
46:00 Applying learning on large models
47:00 Gradient accumulation to prevent out of memory
52:50 Majority vote

Transcript

I've got a question. Yeah, it's to do with — is there a way that machine learning can actually find the sort of conditional probabilistic segments that are, say, in sort of heterogeneous data? I am having trouble parsing that question. Can you give, like, an example or something? Yeah, okay. All right.

Well, I'm wrangling with road surface friction — with road risk, rather. And quite immediately there's this set of stereotypes in road analysis. And we all know that there's highways, freeways, urban arterials. And they actually go through a series of stages, almost like states. And each of the states has got a sort of conditional probabilistic relationship between the set of predictors and the actual response variable, the crash response variable.

Is there anything like that in deep learning? So how is that different to a normal predictive model? Like, I mean, all predictive models are conditional probabilities, right? What's the... Well, I mean, if you take something like XGBoost, for example, and you want to predict the risk of a given road, it'll give you a value.

But then you've got no idea as to what's happening inside of the model. And we're really interested in that because once you find the distributions, you can start to do some quality testing on whether they actually follow the domain or whether your segmentation process that actually determines your predictions is good or not.

And so, in a way, rather than, say, predicting some sort of crash rate or risk or whatever, I'm really looking for those probabilistic distributions and learning beneath the surface. So all deep learning models will return a set of probabilities. That's what their final layer returns. And then we decode them by taking the argmax across them.

But there's nothing to stop you using those probabilities directly. But I'm probably misunderstanding your question. It's a little abstract for me to understand. Like, I mean, I know there's lots of things you can do with confidence intervals and whatnot, but it really depends a great deal on the specific details of the application, what you're trying to do and how you're trying to do it.

Good question, Daniel. I'm just talking about the probability of an incident, or risk, related to the road surface. So you're going to need some sort of tabular data that has the occurrences for each road surface that you're trying to predict. And why wouldn't XGBoost give you that, if you had a predictive model of incidents?

In my mind, one of the disadvantages of XGBoost is the fact that it only gives you a single set of variable effects. Whereas in what we're dealing with, we've got some really high-crash roads where there's a different conditional probability relationship between the predictors and the response compared to, say, the average.

XGBoost does an excellent job of making the predictions, but you've got no idea as to the group of instances it's actually making the prediction for, or the actual variable effects. Okay, so I think I understand your question now, and I think the answer is actually that it does. And what I suggest you do, if you haven't already, is read the chapter of the fastai book on tabular modeling. It covers something very similar — random forests, which are another ensemble of decision trees — and it will show you how to get exactly the kind of insights that I think you're looking for.

And all of the techniques there would work equally well for random forests, and they also work equally well for deep learning. So maybe after you've done that, you can come back and let us know whether that helped. Yeah, well, I've sort of played with random forests. It doesn't really give you what I'm looking for.

I strongly suggest you read the chapter before you say that. I will. Because I'm pretty sure it will. And if it doesn't, that would be very interesting to me. In fact, I mentioned to you last time, but I'm really looking forward to the tabular data. Cool. Great. I'll show you guys what I've been working on, which has been fun.

So the first thing I did, you know, after I got off our last call was I basically just threw together the kind of like most obvious basic steps one would do for a standard image recognition competition, just in order to show people that that can be quite good. And it was actually a little embarrassing because I didn't mean to do this.

When I submitted it, it turned out I got first on the leaderboard. So now I feel like I'm going to have to write down exactly what I did, because, you know, during an active competition, if you share what you're doing with anybody, you have to share it publicly.

So I thought I'd show you what I did here. But I think this is about to go up quite a lot. Because, you know, what we're working with here are interesting images for a couple of reasons. One is that they're kind of like things that you see in ImageNet — they're pictures of natural objects, they're photos.

But I don't think ImageNet has any kind of categories about diseases, you know — they have categories about, like, what's the main object in this? So they might have a category about, I don't know, some different kinds of grass, or even some different types of, you know, fields or something, but I'm pretty sure they don't have anything about different kinds of crop disease.

So it's a bit different to ImageNet, which is what most of our pre-trained models are trained on. But it's not that different. And it's also interesting because nearly all of the images are the same shape and size. So we can kind of try to take advantage of that. And, you know, so when we fine-tune a pre-trained model, there's, so let me pull up this Kaggle notebook I just created.

So I just published this yesterday. It kind of looks at what are the best vision models for fine-tuning. And so I kind of realized that there are two key dimensions that really seem to impact how well a model can be fine-tuned — you know, whether it works well or not, or how it behaves differently.

So one is what I just talked about, which is how similar is your data set to the data set used for the pre-trained model. If it's really similar, like pets to ImageNet, then like the critical factor is how well does the fine-tuning of the model maintain the weights that are pre-trained, you know, because you're probably not going to be changing very, very much.

And you're probably going to be able to take advantage of really big, accurate models, because they've already learned to do almost the exact thing you're trying to do. So that's the Pets data set. On the other hand, there's a data set called the Planet data set, which is satellite imagery.

And these are not really at all like anything that ImageNet ever saw, you know, they're taken from above, they're taken from much further away, there's no single main object. So a lot of the weights of a pre-trained model are going to be useless for fine-tuning this because they've learned specific features like, you know, what does text look like, what do eyeballs look like, what does fur look like, you know, which none of which are going to be very useful.

So that's the first dimension. The second dimension is just how big the data set is. On a big data set, you've got time — you've got epochs — to take advantage of having lots of parameters in the model and to learn to use them effectively. And if you don't have much data, then you don't have much ability to do that.

So you might imagine that deep learning practitioners already know the answers here — you know, what are the best models for fine-tuning? But in fact, we don't. As far as I know, nobody's ever done an analysis before of which models are the best for fine-tuning. So that's what I did over the weekend.

And not just over the weekend, but really over the last couple of weeks. And I did this with Thomas Capelle, who works at Weights and Biases, another fastai community member/alumnus. And so what we did was we tried fine-tuning lots of models on two data sets: one which has over 10 times fewer images, and where those images are not at all like ImageNet, that being the Kaggle Planet sample; and one which is a lot like ImageNet and has a lot more images, that being IIIT Pets.

And I kind of figured, like, if we get some insights from those two, perhaps there'll be something that we can leverage more generally. So Thomas wrote this script, which — it's 86 lines, but really there's only like three or four lines, and they'll all be lines you recognize, right? The lines are untar_data, ImageDataLoaders.from_ something, and then vision_learner(dls, model), etc.

So there's the normal, like, three or four lines of code we see over and over again. And then, you know, the rest of it basically lets you pass into the script different choices about batch size, epochs, and so forth. And that's about it. So this is, like, how simple the script was that we used.
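For readers following along, here is a minimal sketch of the kind of script being described — not Thomas's actual code. The parameter names and defaults are assumptions, and only the Pets case is shown (the real script handled the Planet sample too):

```python
from fastai.vision.all import *
from fastcore.script import call_parse

@call_parse
def main(
    model: str = "convnext_tiny",    # any timm model name
    lr: float = 2e-3,
    epochs: int = 5,
    bs: int = 64,
    resize_method: str = "squish",   # "squish", "crop", or "pad"
    pool: str = "concat",            # "concat" or "avg"
):
    "Fine-tune `model` on the Oxford-IIIT Pets images and report the error rate."
    path = untar_data(URLs.PETS) / "images"
    dls = ImageDataLoaders.from_name_re(
        path, get_image_files(path), pat=r"^(.+)_\d+\.jpg$", bs=bs,
        item_tfms=Resize(460, method=resize_method),
        batch_tfms=aug_transforms(size=224))
    learn = vision_learner(dls, model, metrics=error_rate,
                           concat_pool=(pool == "concat"))
    learn.fine_tune(epochs, lr)
```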

And then, partly because Thomas works at Weights and Biases, and partly because Weights and Biases is pretty cool, we used Weights and Biases to feed in different values for each of those parameters. So this is a YAML file that Weights and Biases uses, where you can say, okay, try each of these different learning rates, try each of these different models, try — let's see if I can find another one — try each of these different resize methods, each of these different pooling methods, this distribution of learning rates, you know, whatever, and it goes away and tries them.

And then you can use their web GUI to look at the training results. So then you basically say, okay, start training, and it trains each of these models on each of these datasets with each of these pool values and each of these resize methods and a few different selections from this distribution of learning rates, and creates a web GUI that you can dive into.
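As a hedged illustration of what such a sweep might look like when driven from Python rather than YAML — the project name, the parameter values and the `train_one_run` wrapper are all made up for the example:

```python
import wandb

sweep_config = {
    "method": "grid",
    "metric": {"name": "error_rate", "goal": "minimize"},
    "parameters": {
        "model":         {"values": ["convnext_tiny", "vit_small_patch16_224",
                                     "swin_tiny_patch4_window7_224", "resnet26"]},
        "learning_rate": {"values": [2e-3, 8e-3]},
        "resize_method": {"values": ["squish", "crop"]},
        "pool":          {"values": ["concat", "avg"]},
        "dataset":       {"values": ["pets", "planet"]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="fine-tune-timm")  # project name is invented
wandb.agent(sweep_id, function=train_one_run)  # train_one_run would wrap the training script above
```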

I personally hate web GUIs — I would much rather use Python — but thankfully they also have an API. So yeah, once we ran that script for a few hours, I then checked the results into a gist. A gist is just a place to check in text files, basically, if you haven't used it before.

So I checked my CSV file in here. As you can see, it kind of displays it in a nice way, or you can just click to see the raw data. So I find it quite a nice place just to check in things which I'm going to share publicly.

And so then I can get the URL to the gist. And maybe let me show you how I did that. Right. So I just kind of like everything to be automated, so I can always easily redo it, because I always assume my first effort is going to be crap — and it always is.

And normally my second, third efforts are crap as well. So here's my little notebook I put together. So basically, each time you do one of these sweeps on weights and biases, it generates a new ID. And so we ended up kind of doing five different ones as we realized we were able to add different models and change things a little bit.

And so they have this API that you can use. And so you basically can go through each of the sweep IDs, ask the API for that sweep, and grab the runs from it. And then for each one create a dictionary containing the summary and the model name.

So the details don't matter too much, but you kind of get the idea, hopefully — and then turn that into a data frame. And so I kind of end up with this data frame that contains all the different configuration parameters along with the loss, the speed, the accuracy, the maximum GPU memory usage, and so forth.
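A sketch of pulling those results back through the Weights and Biases public API into a DataFrame — the entity/project path and sweep IDs below are placeholders:

```python
import pandas as pd
import wandb

api = wandb.Api()
sweep_ids = ["abc123", "def456"]   # placeholder IDs; there were five sweeps in total

rows = []
for sid in sweep_ids:
    sweep = api.sweep(f"my-entity/fine-tune-timm/{sid}")   # made-up entity/project
    for run in sweep.runs:
        rows.append({
            **run.summary._json_dict,                       # logged metrics: loss, accuracy, GPU memory, ...
            **{k: v for k, v in run.config.items() if not k.startswith("_")},  # hyperparameters
            "model_name": run.config.get("model"),
        })

df = pd.DataFrame(rows)
```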

So that's basically what I wanted to chuck into a gist. And specifically, I really wanted this subset of the columns. So these are the columns I wanted, and I can grab those columns and put them into a CSV. Now, one thing you might not realize is — for most Python libraries, or at least most well-written ones anyway — you don't have to pass a file name.

Normally when you call to_csv, you put a file name or a path here. But you could instead pass something called a StringIO object, which is something that behaves exactly like a file, but actually just stores the data in a string. Because I don't want this stored in a file — I want it put into a string.

So if you then call .getvalue(), you actually get the string. And even things like creating the gist, I want to do that automatically. So there's a library I'm very fond of — I'm very biased, because I made it — called ghapi, which is an API client for GitHub, where we can do things like, say, create a gist.

And you give it a description, and here's the text, which is the contents of the CSV, and the file name, and make it public. And then you can get the HTML URL of the gist. So that's how I used, in this case, a notebook as my kind of, you know, interactive REPL — read-eval-print loop — for manipulating this data set, putting it together, and then uploading it to GitHub.
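Putting those two pieces together, a sketch of the StringIO-plus-ghapi step might look like this — the column names, description and file name are assumptions, and `df` is the DataFrame built above:

```python
from io import StringIO
from ghapi.all import GhApi

cols = ["dataset", "family", "model_name", "error_rate", "fit_time", "GPU_mem"]  # assumed column names

buf = StringIO()                       # behaves like a file, but writes into an in-memory string
df[cols].to_csv(buf, index=False)

api = GhApi()                          # authenticates via the GITHUB_TOKEN environment variable
gist = api.gists.create(
    description="timm fine-tuning results",
    files={"results.csv": {"content": buf.getvalue()}},
    public=True)
print(gist.html_url)
```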

Jeremy, I had a doubt about this data frame. Here you have it, like, in your gist, and it had entries for the Planet dataset and for that other dataset as well, the Pets dataset. So how did you populate it?

So what's your question — how did I populate this data set? Yeah. Just here. So I passed it a list of dictionaries. The list of dictionaries I created using a list comprehension, containing a bunch of dictionaries. Okay. Got it. And so that's going to make each key a column. So that means all the dictionaries should have, you know, roughly the same keys.

Any that are missing are going to end up being NA. And then I just fiddled around with it slightly. So, for example, I made sure everything had an error rate that was equal to one minus the accuracy. On the Planet data set, it's not called accuracy, so I copied accuracy_multi into accuracy.
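A toy illustration of that construction — invented rows, just to show how missing keys become NA and how the two accuracy columns get unified:

```python
import pandas as pd

rows = [
    {"model_name": "convnext_tiny", "dataset": "pets", "accuracy": 0.956},
    {"model_name": "vit_small_patch16_224", "dataset": "planet", "accuracy_multi": 0.975},
]
df = pd.DataFrame(rows)                # keys missing from a dict become NaN

# Planet logs `accuracy_multi`, so copy it into `accuracy`, then derive one error-rate column.
df["accuracy"] = df["accuracy"].fillna(df["accuracy_multi"])
df["error_rate"] = 1 - df["accuracy"]
```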

Yeah, nothing very exciting. Thank you. Jeremy, what's the actual goal of this? Let me show you. So what we've now got is a CSV, which I can then use pandas' pivot table functionality on to group by the data set, the model family and name, and calculate the mean of error rate, fit time, and GPU memory.

And I can then take the Pets subset of that, sort by score — where score represents a combination of error and speed — and take the top 15. And this now shows me the top 15 best models for fine-tuning on Pets. And this is gold, in my opinion. I don't think anybody's ever done anything like this before.
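Something along these lines would reproduce that summary — the file name and column names are assumptions, the score formula is one reading of the "error rate times fit time plus 80" heuristic mentioned just below, and the below-the-median filtering discussed a bit later is included as well:

```python
import pandas as pd

df = pd.read_csv("fine_tune_timm_results.csv")   # the CSV from the gist, downloaded locally

pt = (df.pivot_table(index=["dataset", "family", "model_name"],
                     values=["error_rate", "fit_time", "GPU_mem"],
                     aggfunc="mean")
        .reset_index())

# Arbitrary accuracy/speed trade-off, as described below.
pt["score"] = pt.error_rate * (pt.fit_time + 80)

pets = pt[pt.dataset == "pets"]
top15 = pets.sort_values("score").head(15).reset_index(drop=True)

# The "smaller and faster than the median" subset, ranked by error rate.
fast_and_small = pets[(pets.fit_time < pets.fit_time.median()) &
                      (pets.GPU_mem < pets.GPU_mem.median())]
best_quick = fast_and_small.sort_values("error_rate")
```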

There's never been a list of, like, here are the best models for fine-tuning. Sorry, I have a question. So you fine-tuned different models with Pets and then collected this information — is that correct? That's correct. And then based on the information that you collected from the fine-tuning, of five or whatever number of iterations.

We did three runs for each model. Yes. And then you collected this information to find out which one is the best-behaved model for this specific case. Correct, correct. Exactly. And best is going to involve two things: which ones have the lowest error rate, and which ones are the fastest.

Now, I created this kind of arbitrary scoring function where I multiplied the error rate by the fit time plus 80, just because I felt like that particular value of that constant gave me an ordering that I was reasonably comfortable with. But you can kind of look through here and see, like, okay, well, ViT Base has a much better error rate than ConvNeXt Tiny.

But it's also much slower. Like, you can decide for your needs where you want to trade off. So the first thing I did was to create this kind of top 15. And it's interesting looking at the family, right? The family is — like, each of these different architectures, you know, is kind of a different size from a smaller set of families, right?

So there's convnext_tiny, convnext_base, convnext_tiny_in22k and so forth. So you can kind of get a sense of, like, if you want to learn more about architectures, which ones seem most interesting. And, you know, for fine-tuning on Pets, it looks like ConvNeXt, ViT, Swin and ResNet are the main ones.

So that was, you know, the first thing I did. The second thing I then did was to take those most interesting families — I actually also added this one called ResNetX — and created a scatterplot of them, colored by family. And so you can kind of see, like, for example, ConvNeXt, which I'm rather fond of, is this kind of blue — these blue ones, right?

And so you can see that the very best error rate actually was a ConvNeXt. So they're pretty good. You can see this one here, which is ResNetX, seems to have had some pretty nice values — they're like super fast. And these tiny Swins seem to be pretty good.
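One way to produce that kind of plot (Plotly Express shown here; the file and column names are the same assumptions as before):

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("fine_tune_timm_results.csv")
families = ["convnext", "vit", "swin", "resnet"]   # plus whichever extra family was added
subset = df[df.family.isin(families)]

fig = px.scatter(subset, x="fit_time", y="error_rate", color="family",
                 hover_name="model_name", log_x=True,
                 title="Pets fine-tuning: speed vs error rate by family")
fig.show()
```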

So it kind of gives you a sense of, you know, depending on how much time you've got to run or how accurate you want to be, which families are likely to be most useful. And then the last thing I did for Pets was I grabbed a subset of basically the ones which are smaller than the median in memory and faster than the median, because these are the ones I generally care about most of the time, because most of the time I'm going to be training quick iterations.

And then I just ordered those by error rate. And so convnext_tiny has got the best error rate of those which are in the better half of both speed and memory use. >> What's GPU memory in this context? >> That's the maximum amount of GPU memory that was used.

I can't remember what the units of measure are, but they don't matter too much because it'll be different for your dataset — what matters is the relative usage. And so if you try to use this and it actually uses too much GPU memory, you could try resnet50d, for example. Or, you know, it's interesting that resnet26 is really good for memory and speed.

Or if you want something really lightweight on memory, regnety_004. But the error rates are getting much worse once you get out to here, as you can see. So then I looked at Planet. And as I said, Planet's kind of as different a dataset as you're going to get, in one sense — it's very different.

And so not surprisingly, its top 15 is also very different. And interestingly, all of the top six are from the same family — this ViT family. These are a kind of model called transformer models. And what this is basically showing is that these models are particularly good at rapidly identifying features of data types they haven't seen before.

So, you know, if you were doing something like medical imaging or satellite imagery or something like that, these would probably be a good thing to try. And Swin, by the way, is another transformer-based model, which, as you can see, is actually the most accurate of all, but it's also the smallest.

This is Swin V2. So I thought that was pretty interesting. And, you know, these ViT models — there are ones with pretty good error rates that also have very little memory use and also run very quickly. So I did the same thing for Planet. And perhaps not surprisingly, but interestingly, for Planet these lines don't necessarily go down — which is to say that the really big, slow models don't necessarily have better error rates.

And that makes sense, right? Because if they've got heaps of parameters, but they're trying to learn something they've never seen before on very little data, it's unlikely we're going to be able to take advantage of those parameters. So when you're doing stuff that doesn't really look much like ImageNet, you might want to be down more towards this end.

So here's the ViT, for example. And here's that really good Swin model. And there's ConvNeXt Tiny. So then we can do the same thing again: okay, let's take the top half, both in terms of speed and memory use. Yeah, ConvNeXt Tiny still looks good. These ViT models — it's 224?

Yeah, that's because you can only run these models on images of size 224 by 224. You can't use different sizes, whereas with the ConvNeXt models, you can use any size. It's also interesting to see the classic ResNets still there — again, they do pretty well. Yeah, so I'm pretty excited about this.

It feels like exactly what we need to kick on with this Paddy Doctor competition — and indeed, any kind of computer vision classification task needs this. And I ran this sweep on three consumer RTX GPUs in 12 hours or something. Like, this does not require big institutional resources. And one of the reasons why is because I didn't try every possible level of everything, right?

I tried a couple of — you know, so Thomas did a kind of quick learning rate sweep to get a sense of the broad range of learning rates that seemed pretty good. And then we just tried a couple of learning rates and a couple of the best resize methods and a couple of the best pooling types, across a few broadly different kinds of models, across the two different datasets, to see if there were any common features.

And we found in every single case the same learning rate, the same resize method and the same pooling type was the best. So we didn't need to try every possible combination of everything, you know. And this is where, like, a lot of the stuff you see from Google and such — they tend to do hundreds of thousands of experiments, because I guess they have no need to do things efficiently, right?

Yeah, but you don't have to do it the Google way. You can do it the fast.ai way. Quick question, Jeremy. Which cards did you use? Sorry, which cards did you say? Yeah, the GPU cards. Oh, RTX 3090. Oh, okay. So they were all three different?

They're all RTX 3090s. Okay. And you reset the index after the query — why? Oh, just because otherwise the numeric ID shown here will be the numeric ID from the original dataset, and I wanted to be able to quickly say, what's number six? What's number 10? What's number three?

That's all — so, visually. Yeah. Okay. Jeremy, getting back to the Earth satellite images — when you say, you know, like, the classification, what is it trying to classify? In this case, the Planet competition — we have some examples — basically, they try to classify, for each area of the satellite imagery, what's it a picture of?

Is it forest or farmland or town or whatever? And what weather conditions are observed, if I remember correctly. A question in this image space: is it just these two major datasets? Or how do you find other models that are trained on things besides Planet and ImageNet? Oh, you mean besides Planet and Pets?

Sorry. Yep, that's it. So what was your question — how do you do what with them? How do you find other pre-trained models that have been trained on different data sets? These are all using pre-trained models pre-trained on ImageNet. So how do you find pre-trained models pre-trained on other things?

Mainly, you don't. There aren't many. But, you know, just Google — it depends what you're interested in — and academic papers. There's — I don't know how it's doing these days — there is a Model Zoo, which I've never had much success with, to be honest. So these are a range of pre-trained models that you can download.

Yeah. But as I say, I haven't found it particularly successful, to be honest. You could also try Papers With Code. And I think these — yeah, they have a link to the paper and the code. That doesn't necessarily mean they've got a pre-trained model, but you can just click on the code and see.

And of course, for NLP models, there's the Hugging Face Model Hub, which we've seen before. And that is an easy answer for NLP — lots of different pre-trained models are on that hub. Jeremy, since you touched on academic papers and Papers With Code: first question, about this comparison — do you or Thomas intend to publish it?

Or if you were to do that, what would you go for, actually? What kind of journal would you look at? So I'm not a good person to ask that question, because I very rarely publish anything, which is partly a philosophical thing. I find academia overly exclusive, and I don't love PDFs as a publication format.

And I don't love the writing style that's kind of required if you're going to get published — it tends to be rather difficult to follow. I have published a couple of papers, but only really one significant deep learning one, and that was because a guy named Sebastian Ruder was doing his PhD at the time.

And he said it'd be really helpful to him if we could co-publish something and that he would kind of take the lead on writing the paper. And so that was good because I'm always very happy to help students. And he did a good job and he was a terrific researcher to work with.

The other time I've written a paper — the main time — was when I wanted to get that message out about masks. And I felt like it probably wasn't going to be taken seriously unless it was in an exclusive academic venue, because medical people are very into exclusive things. So I don't know.

I'd say this kind of thing, I suspect, would be quite hard to publish, because most deep learning academic venues are very focused on things with reasonably strong theoretical pieces. And this kind of field of trying things and seeing what works — experiment-based work — is certainly a very important part of science in other areas.

But in the deep learning world, it hasn't really yet been recognized as a valid form of research, as far as I can tell. I'd concur — other domains feel the same quandary, to be honest. Fair enough. What's your domain? Hydrology, but more the computational science part of it.

Okay. So then what I did was — I mean, this was kind of happening at the same time — I went back to Paddy, and I wanted to try out a few of these interesting-looking models reasonably quickly. So what I did was I took our standard — well, in this case, three lines of code, because I'd already untarred the data earlier — took our three lines of code.

So I could basically say train, and pass in an architecture, and pass in some per-item pre-processing — in this case resizing everything to the same square using squish — and some per-batch pre-processing, which in this case is the standard fastai data augmentation transforms, targeting a final size of 224, which is what most models tend to be trained at.

And so then it trains a model using those parameters. And then finally, it uses test time augmentation. Test time augmentation — I think we briefly mentioned it last time — is where, in this case on the validation set, I basically run the fine-tuned model four times using random data augmentations each time.

And then I run it one more time with no data augmentations at all, and take an average of all of those five predictions, basically. And that gives me some predictions. And then I take an error rate for TTA — for the test time augmentation. So that basically spits out a number, which is an error rate for Paddy.
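A rough sketch of the kind of train function being described — `trn_path` (the folder of training images organised by label), the learning rate and the argument names are assumptions, not the actual notebook code:

```python
from fastai.vision.all import *

def train(arch, item, batch, epochs=12):
    "Fine-tune `arch` with the given per-item and per-batch transforms, then report the TTA error rate."
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42,
                                       item_tfms=item, batch_tfms=batch)
    learn = vision_learner(dls, arch, metrics=error_rate).to_fp16()
    learn.fine_tune(epochs, 0.01)
    preds, targs = learn.tta(dl=dls.valid)   # several augmented passes plus one plain pass, averaged
    print(error_rate(preds, targs))
    return learn

# The kind of call being described, e.g. squish-resize then augment down to 224:
# train("convnext_small_in22k",
#       item=Resize(480, method="squish"),
#       batch=aug_transforms(size=224, min_scale=0.75))
```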

And I use a fixed random seed when picking out my validation set, so each time I run this, it's going to be with the same validation set, and so I can compare. So I've got a few different ConvNeXt Small models I've run: first of all, by squishing when I resize, and then by cropping when I resize.

So that was 0.0235, and this is also 0.0235. And then instead of resizing to a square, I resized to a rectangle. In theory, this wouldn't have been necessary — I thought they were all 480 by 640. But when I ran this, I got an error. And then I looked back at the results of that parallel image sizing thing we ran.

And I realized there were actually three or four images that were the opposite aspect ratio. So that's why. For the vast majority of the images, this resizing does nothing at all — it's just the three or four that are the opposite aspect ratio. And then for the augmentation, yeah, I pick a size based on 224 with a similar aspect ratio.

But what I'm actually aiming for here is something that is a multiple of 32 on both edges. And the reason for that we'll get into later, when we learn about how convolutional networks really work. But it basically turns out that the final patch size in a convnet is 32 by 32 pixels.

So you generally want both of your sides — normally you want them to be multiples of 32. So this one got a pretty similar result again, 0.0240. And then I wasn't sure about my contention that they need to be multiples of 32. I thought maybe it's better if they get a really crisp resizing by using an exact multiple.
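As a quick check of that sizing arithmetic: scaling a 640 by 480 image so the short side is about 224 and snapping both sides to multiples of 32 gives the 288 by 224 shape that comes up again later.

```python
h, w = 640, 480
scale = 224 / w                                      # make the short side roughly 224
print([round(d * scale / 32) * 32 for d in (h, w)])  # -> [288, 224]
```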

So I tried that as well. And that, as I suspected, was a bit worse. Oh, what's this? I've got some — which ones are the right way around? Now I'm confused. Let's check. Some of these — originally I had my aspect ratio backwards. That's why I've got both.

It looks like I never got around to removing the ones that were unnecessary. Oops, wrong button. Copy, paste size. Paste. Leave those off. method equals pad. Oops, pad_mode. This makes it a bit easier to see what's going on if you do padding with black around them. There we go.

Okay, yeah, so you can clearly see this is the wrong way around, right? I've tried to make them wide, but actually they were tall. So the best way around is actually 640 by 480 — that's more like it. So 640 by 480 is best. So let's get rid of the ones that were the wrong way around.

Okay, all right. Yeah, so that was all, you know, various different transforms and pre-processing for ConvNeXt Small, and then I did the same thing for one of the ViTs, ViT Small. Now ViT, remember, I mentioned it can only work on 224 by 224 images, so these rectangular approaches aren't going to be possible.

So I've just got the squish and the crop versions. The crop version doesn't look very good. The squish version looks pretty good. And I also tried a pad version, which looks pretty good. And then, yeah, I also tried Swin, so here's Swin V2. And this one is slow and memory intensive.

So I had to go down to the 192 pixel version, but actually it seems to work very well. This is the first time we've had one that's better than 0.02, which is interesting. This one's also very good. So it's interesting that this slow, memory-intensive model works better even at a smaller size — 192 pixels — which I think is pretty interesting.

And then there was one more SWIN, which seemed to do pretty well, so I included that, which I was able to do at 224. That one had OK results. So I kind of did that for all these different small models. And as you can see, they run pretty quickly, right?

5 or 10 minutes. And so then I picked out the ones that looked pretty fast and pretty accurate, and created just a copy of that, which I called paddy-large. And this time I just replaced small with large. And actually, I've made a mistake — I'm going to have to rerun this, because there should not be a seed equals 42.

I actually want to run this on a different subset each time. And the reason why is my plan is to train... so basically what I did was I deleted the ones that were less good in paddy-small, and so now I'm just running the large ones. Now some of these, particularly something like this one, which is 288 by 224, ran out of memory.

They were too big for my graphics card. And a lot of people at this point say, oh, I need to go buy a more expensive graphics card. But that's not true — you don't. So if you guys remember our training loop: we get the gradients, we subtract the gradients times the learning rate from the weights.

And then we zero the gradients. What you could do is halve the batch size — so, for example, from 64 to 32 — and then only zero the gradients every two iterations, and only do the update every two iterations. So basically you can calculate in two batches what you used to calculate in one batch, and it will be mathematically identical. And that's called gradient accumulation.
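A minimal sketch of that idea in a plain PyTorch loop, using a toy model and random data — note that dividing the loss by the number of accumulation steps is what makes two mean-loss half-batches match one full batch:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_func = nn.CrossEntropyLoss()
xs, ys = torch.randn(64, 10), torch.randint(0, 2, (64,))
batches = [(xs[i:i + 32], ys[i:i + 32]) for i in range(0, 64, 32)]   # two half-size batches

accum_steps = 2
opt.zero_grad()
for i, (xb, yb) in enumerate(batches):
    loss = loss_func(model(xb), yb) / accum_steps   # scale so the summed gradient matches one batch of 64
    loss.backward()                                 # gradients accumulate in .grad across iterations
    if (i + 1) % accum_steps == 0:
        opt.step()                                  # update only every `accum_steps` mini-batches
        opt.zero_grad()                             # ...and only then clear the gradients
```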

And so for the ones which ran out of memory, I added this little accum equals true, which is here in my function. And I said, if accum is true, then set the batch size to 32, because by default it's 64.

And add this thing called a callback. Callbacks are basically things that change the behavior of the training. And there's a thing called the GradientAccumulation callback, which does gradient accumulation. And this is — just for people that are interested — not massively complex stuff.

The entire GradientAccumulation callback is that many lines of code, right? These are not big things. And literally all it does is keep a count of how many samples it has seen, and as long as we're not yet up to the number of accumulations we want, it skips the step and the zero-grad, basically.
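In fastai terms, the switch being described might look something like this — a variant of the earlier train sketch where `accum` halves the batch size and adds the GradientAccumulation callback so there is still one optimiser step per 64 images (names and defaults are assumptions, as before):

```python
from fastai.vision.all import *

def train(arch, item, batch, epochs=12, accum=False):
    kwargs = {"bs": 32} if accum else {}              # halve the batch size when accumulating
    cbs = GradientAccumulation(64) if accum else []   # still step once per 64 samples
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, seed=42,
                                       item_tfms=item, batch_tfms=batch, **kwargs)
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    learn.fine_tune(epochs, 0.01)
    return learn
```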

So, yeah, things like gradient accumulation sound like big, complex things, but they turn out not to be — at least when you have a nice code base like fastai's. Jeremy, can I ask a question here? How exactly does the mini-batching work? So we will get into that in detail in the course.

And certainly we get into it in detail in the book. But basically all that happens is we randomly shuffle the dataset and we grab, so if the batch size is 64, we grab the next 64 images. We resize them all to be the same size and we stack them on top of each other.

So if it's black and white images, for example, we would have 64 — whatever — 640 by 480 images. And so we would end up with a 64 by 640 by 480 tensor. And pretty much all the functionality provided by PyTorch will work fine for a mini-batch of things, just as it would for a single thing, on the whole.
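A toy illustration of what that stacking looks like in PyTorch:

```python
import torch

imgs = [torch.rand(640, 480) for _ in range(64)]   # 64 single-channel "images", already the same size
batch = torch.stack(imgs)                          # one tensor the GPU can chew through in parallel
print(batch.shape)                                 # torch.Size([64, 640, 480])
```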

So in the larger scheme of things — you know, like, the huge process it's trying to characterise — what role does the batch sort of play? Well, it's just about trying to get the most out of your GPU. Your GPU can do 10,000 things at once. So if you just give it one image at a time, you're barely using it.

So if you give it 64 things, it can do, you know, a thing on each image, and then on each channel in that image, and then there are a few other kinds of degrees of parallelisation it can do. And so that's where — you know, we saw that nvidia-smi dmon command that shows you the utilisation of your streaming multiprocessors.

Yeah, if you use a batch size of one, you'll see that SM utilisation will be like 1%, 2%, and it'll basically be going to waste. It's a bit tricky at inference time, you know, in production or whatever, because most of the time you only get one thing to do at a time.

And so often inference is done on CPU rather than GPU, because we don't get to benefit from batching. Or, you know, people will queue a few of them up and run them through the model on the GPU at once, and, you know, stuff like that. But yeah, for training, it's pretty easy to take advantage of mini-batches.

Okay, thank you. No worries. Jeremy, you've trained so many models. Will you consider using a majority vote or something like that? No, I wouldn't, because a majority vote throws away information, it throws away the probabilities. So I pretty much always find I get better results by averaging the probabilities.

So each of the models, after I've trained it, I'm exporting to a uniquely named file, which is going to be the name of the architecture, then an underscore, and then some description, which is just the thing I pass in. And so that way, yeah, when I'm done training, I can just have a little loop which opens each of those models up, grabs the TTA predictions, and sticks them into a list.

And then at the end, I'll average those TTA predictions across the models. And that will be my ensemble prediction. So that's my next step. I'm not up to that yet. Okay. All right. Well, I think that's it. So that's really more of a like little update on what I've been doing over my weekend.
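A sketch of that ensembling loop — the exported file names and the `dls` used for the TTA predictions are assumptions for illustration:

```python
from fastai.vision.all import *

model_files = ["convnext_small_in22k_squish.pkl",
               "vit_small_patch16_224_pad.pkl"]    # made-up exported learner names

all_preds = []
for fname in model_files:
    learn = load_learner(fname)
    preds, targs = learn.tta(dl=dls.valid)         # same validation data for every model
    all_preds.append(preds)

ens_preds = torch.stack(all_preds).mean(0)         # average the probabilities rather than voting
print(error_rate(ens_preds, targs))
```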

But hopefully, yeah, it gives you some ideas for things to try. And hopefully you find the Kaggle notebook useful. So Jeremy, how many hours did you spend on all these experiments? Because you ran a lot of experiments here. So, you know, it's like a week or two of work to do the fine-tuning experiments, but that was like a few hours here and a few hours there.

The final sweep was probably maybe six hours on three GPUs. The Paddy competition stuff was maybe four hours a day over the last four days since I last saw you guys. And writing the notebook was maybe another four hours. Thanks. It helps. No worries. All right. Bye, everybody. Nice to see you all.

Thanks so much. Thanks, Jeremy. Bye, everyone.