Live coding 15


Chapters

0:00 Questions
2:00 Running notebooks in the background from scripts
4:00 Connecting to your server using xrdp
7:00 Can you connect to Paperspace machines remotely?
13:30 Dealing with startup issue in Paperspace
16:20 Native windows in tmux with tmux -CC a
18:30 Getting mouse support working in tmux
20:00 Experimenting with closing notebooks while fine-tuning
24:30 Progressive resizing recap
26:00 Building a weighted model
29:00 Picking out the images that are difficult to classify
33:30 Weighting with 1/np.sqrt(value_counts())
36:00 Merging the weights with the image description dataframe
38:00 Building a datablock
40:20 How we want weighted dataloaders to work
41:30 Datasets for indexing into to get an x,y pair
44:00 Sorting list of weights and image files to match them up
47:30 Batch transforms not applied because datasets didn’t have this method
49:00 Reviewing errors
55:00 Python debugger pdb %debug
56:50 List comprehension to assign weight to images
59:00 Use set_index instead of sort_values
59:30 Review weighted dataloaders with show_batch
60:40 Review of how weighted sampling will work
63:30 Updating the fastai library to make weighted sampling easier
64:47 Is WeightedDL a callback?
66:20 Modifying weighted_dataloaders function
70:00 Fixing tests for weighted_dataloaders
73:40 Editable pip install of fastai
76:15 Modifying PETS notebooks to work on splitter
78:07 How do splitters work in datablock
80:25 Modifying weighted dataloader by using weights from Dataset
83:36 Running tests in fastai and creating GitHub issues
84:33 Fixing the failing test
89:40 Creating a commit to fix the issue
91:30 nbdev hook to clean notebooks

Transcript

So you've got a new Paddy submission. Let's take a look. Kaggle competition. By the way, it's really beautiful to see over the last week or two all these fast.ai people just pop up at the top of that leaderboard. It's so cool. Okay. fast.ai, fast.ai, fast.ai, fast.ai.

Who's this person? Is this fast.ai? At least the top five. Yeah, like most of the top five or top ten are following you in these walkthroughs. You've all got the same score though. Somebody's got to, like, you know, Kurian's got something. Secret sauce. Well, I've got a few ideas I can show you guys today if you want to try and take it a bit further.

Which I bet you do. Anybody have any comments or questions in the meantime? All right. Share screen. And that's the right screen. And I'll move you guys onto the other screen. And now I can see. All right. So, Paddy leaderboard. There we are. Where's Radek? Not here. Serada? I see.

One thing that I guess would be nice if it wasn't so, I don't know, a little bit mysterious: I set this up in Paperspace, started running it, and then I went to bed, because it was taking so long. I just have a fear that if my browser goes to sleep it'll basically stop the session, even though there are more hours left and the processor in the workstation is still running.

I wasn't sure. So, I mean, it, it shouldn't. But what happens is it queues up, you know, for when your browser comes back. But the problem is, there is some limit to how much it will queue. So, although it'll have run, if you've hit that limit, you won't see all the outputs, which is nearly just as bad.

So, you know, there's a few things you can do. The most obvious one would be to use nbdev to export the notebook to a script, and then run the script in tmux. Because then you can close it down, come back, reattach tmux, and there it is. Okay, interesting. Now something, yeah, so maybe we'll look at that sometime.

Yeah, something I don't... well, Paperspace Gradient, like you have, doesn't let you SSH in with a suitable IP, does it? I'm not sure. If you've got your own GPU at home, you know, or on AWS or GCP or whatever, then what I do is I run xrdp on it, which is a remote desktop server.

And then I can connect to it, like so, and run Firefox. And so this is my, yeah, this is my server's screen, you know, remote desktopping in. So if I now go in and run something. Paddy, I remember from last time. Okay, so I can set this running, and then I can close it down, go to sleep, come back the next day, reconnect to that screen, and it's still been running.

So that's the, that's a preferred way to do it. But I, yeah, as I say, I don't know if it's possible on Paperspace Gradient. Sorry, go on. Machines seem to have a limit of six hours, from what I've seen so far. If you subscribe to their Pro or whatever, you can bump it up or get rid of it altogether.

So it's this tab here. Machine tab, you can change the auto shutdown. Okay, looks like a week's the maximum. I know there's no limit there as well. You're paying, but I mean, you know, I think it's like eight bucks a month. Yeah, eight bucks a month. You may as well.

Yeah, I've got the Pro, but I don't know about when you pick a free machine. Oh, yeah. Right, Free-P5000. Right. Maximum six hours. Yep. So, Jeremy. Yeah? Sorry to interrupt. Paperspace, in their support channels, they talk about how you can assign a public IP to a machine, and then SSH to it.

So you could SSH and then tmux. Is that to a Gradient machine though? Well, good question. Look, I'm not sure that it would be. No, it's not. And so they also have this thing called Core, right, which are more like AWS or Google servers, which absolutely let you do a static IP.

Actually, I don't even know if you need a static IP necessarily; you could use a dynamic IP just as well. Bit cheaper. The thing is, though, I reckon they're pretty expensive. Yeah, cool product. So these are very basic GPUs. So that's not bad, 45 cents an hour.

I guess they're not too terrible. If you want an RTX, I guess they're the same price really, 56 cents, so I take that back. I guess the thing I found expensive was the CPU pricing for running it all the time. Yeah. So, Jeremy, with this RDP solution that you showed, how does that work?

Do you have... Close here. Okay. So how does that work? I didn't get what computer you're RDPing into. My own GPU machine, but it could just as well be an AWS machine or GCP machine. This is basically the same as VNC, if you've come across VNC before.

RDP is the kind of Microsoft version of that. I like it. It's generally quite a lot better. And much to my surprise, the Mac RDP client is better than the Windows RDP client. It even shows you a little mini screenshot, you know, of the screen. This has now finished training.

No, no, no, it's finished halfway through training, whatever. But was this tricky to set up, because you're, like, running an X server? Not even slightly tricky to set up. So, yeah, it's called xrdp since it's RDP for X Windows. You just go apt install. Yeah, I hate installing this kind of thing.

It drives me crazy. But this is it. You just sudo apt install, sudo adduser, sudo systemctl restart. And then you might also want to run sudo systemctl enable, which will cause it to automatically start when you start your computer. And I don't think I... oh, you know, if you've got a firewall, you have to let it in.

So it's port 3389. Basically this line of code and I think I did have a firewall so I also ran this. Yeah, that was it. It just used my username and password that I had on the machine. Yeah, so very surprisingly not annoying. And then I think I just installed Microsoft Remote Desktop from the Mac App Store or on Windows.

I think it comes with Windows. So that was easy. Yeah, nobody seems to talk about it much. People mainly talk about VNC, which is also fine, but I find it a bit slower and a little bit more awkward. All right. I mean, one weird thing, I guess, is that my machine, and this is pretty common, I haven't really set up to be a graphical workstation.

I always use it from the console. So I actually don't really have much of a window manager here. I can't even like, I can do a little bit. I don't know. I don't know what the whole window manager is even using. But often you'll find like there is no window manager or whatever running.

But, you know, a bit of Googling will show you how to apt install, you know, whatever. KDE or stuff. Okay. Since we're on the installation topic, could I ask a question? So I think I kind of brought it up a little bit, but I can't launch a machine that runs fastai, whereas a PyTorch one would work.

So what suggestions would you have? So that means that your prerun.sh file has got a problem. So maybe comment it out from PyTorch, just start it up. Yeah, open up your PyTorch, open up a PyTorch machine, move prerun.sh to prerun.back or something. Or just open it and see, like, it might be obvious what's wrong with it.

Yeah, I couldn't see anything. When you say it's not working, what's not working? Well, it just says error when I try to start it up, just says error. And I tried to reach out to Paperspace support a couple of times, but maybe it's too abstract a question.

But I'll try that. People are putting stuff in the text chat. Please try to say things in the verbal chat if you can, because it's way nicer for me, and I don't have to check multiple windows. I know it's not possible for everybody. Okay, so sorry, Jeremy, there is a way to SSH into a Gradient machine, but you have to trigger the virtual machine to be built from the command line.

So you have to initiate the job and there's a space to have a GitHub repo. And is there any reason to do that? Like, that sounds complicated, like, would you just run a... It's way more effort than it's worth. Just run a Paperspace Core machine if you want to, I guess.

Yeah, exactly. So you can do it. It's just, why would you? So yeah, I mean, so for Paperspace, the issue around the notebook closing, I would like start running something, close the notebook, and then reopen it just to see what happens. Ghetto. And, you know, let's try it here, right?

Now, what was that thing we learned the other day? It was shifting. Then go to the other one. Oh, that was my one. Okay, I've got to learn how to... Hey, Jeremy. Yeah. Can you... Are you using iTerm2? Because you can do tmux -CC and you'll get native windows in tmux instead of the little sort of terminal ones.

That sounds interesting. Let me try that. Yeah, I'm addicted to that. It's awesome. Minus capital C, capital C. Unknown option C. Does that have to be before the a? Yeah, so it'll be tmux -CC a. Yeah, there you go. Okay. And what are the benefits of this approach?

They're native windows. You can click and drag them and move them around, pop them out. Yeah, all that stuff. You can click and drag tmux windows as well. Okay, this is all the same as... Like, if you've got to have mouse mode on for them to work... If the shortcut is like Command + Shift + D will split panes.

You don't have to go into, I think it's a colon or something, and command something. It's just, like, less vimmy. You just use Control + B. Maybe it's exactly the same. Yeah. I mean, you'll have the same shortcuts. Control + B doesn't work anymore, so are tmux shortcuts not going to work anymore?

How do I detach now? Yeah, I think they're different. Escape, I think, or... If you go back to the original window that launched it, it'll have like a... Okay. Okay. Yeah, I'm not convinced it's going to help my workflow, but I think, yeah, for people who are less familiar with tmux shortcuts, that could be cool.

Thanks for the tip. What's going on down here? It's really good. The trick to get mouse support working, so that, for example, my scroll wheel works nicely in this normal tmux window, as you can see, is to have a .tmux.conf file that contains set-option -g mouse on. And then you can also increase your history limit.

And, yeah, that's how come I can scroll. I think the thing like, you know, or a thing I like about tmux is it's very integrated with my kind of the normal way of doing things in Unix, you know? So, for example, if I want to search through my previous session, I could just hit question mark to search up, and I could search for makefile, for example.

And I, you know, hit n, just like I would in vim, hit slash to look forwards. You know, it's like my terminal works the same way as vim or whatever, which I, yeah, which I really like. And I think, yeah, that way I don't have to know, like, oh, the iTerm sort of shortcuts and some other sort of shortcuts.

It's just this kind of general Unix-y way of doing things, I guess. And, of course, they'll also all work on the Paperspace terminal as well. Yeah, so let's try this. So if we start running this. Okay. Close that. Leave it for a few seconds. And you can see here it says in my console, starting buffering.

So it's remembering things that were sent to me. So if I click now back here, there we go. It's, let's see. Hmm. That didn't seem to work, did it? That's interesting. Okay, so let's try something different. So I don't think you can just close it and reopen it. Let's try something else.

What if we fake a network disconnection by closing SSH? Okay, so now, all right, connection's failed. So I'll leave that window open, and then we reconnect. And, yep, okay, so that worked. So there's some of our answer. But yeah, I think there's something where, if you leave it long enough, it says I've stopped listening for events because there've been too many, and tells you there's some configuration option you can change to make it bigger.

Should probably be a useful thing to know about. Let me just go and turn this alarm off. Hang on. Okay. Okay. Yeah. Sorry about that. My daughter likes to be permanently entertained, so any gaps in her homeschooling schedule, she likes to be amused. She doesn't like the fact that I'm doing this and Rachel's at CrossFit.

Okay. So we had a look the other day at progressive resizing, right. And so this is where I got to, I think like progressive resizing one interesting thing you can do is like you can go crazy like you can go extra lunch. And, you know, we start out with some teeny tiny images and train for a while.

And then combine that with gradient accumulation to then go up to big images that don't have to train so long. I think this is a good trick for probably particularly for code competitions on Kaggle where you've got serious resource constraints, you know, or just wanting to do more with less time.

So I think, yeah, on Kaggle you would have needed an accumulation level of four rather than two to make this fit, because they've got 16 gig cards, whereas I have a 24 gig card. So then something else that we started talking about was weighted models. That's weird. What happened to my weighted model?
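As a rough illustration of why gradient accumulation lets a bigger image size fit on a smaller card, here is a minimal NumPy sketch. It is not fastai code: the linear model, the sizes, and the helper name `mse_grad` are all made up for the example. It shows that averaging the gradients of four micro-batches of 16 gives the same gradient as one batch of 64, so only 16 items need to be in memory at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))   # one "full" batch of 64 items (toy data)
y = rng.normal(size=64)
w = np.zeros(3)                # toy linear model weights

def mse_grad(w, x, y):
    """Gradient of mean squared error for a linear model (hypothetical helper)."""
    return 2 * x.T @ (x @ w - y) / len(y)

# Gradient computed on all 64 items at once.
full_grad = mse_grad(w, x, y)

# Same gradient, accumulated over 4 micro-batches of 16 items each.
accum = 4
accum_grad = np.zeros_like(w)
for xb, yb in zip(np.split(x, accum), np.split(y, accum)):
    accum_grad += mse_grad(w, xb, yb) / accum
```

With an accumulation level of four instead of two, each micro-batch is half the size, which is the kind of trade a 16 gig card would need compared with a 24 gig one.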

Did I move it to course 22? Well, that's fine. So the question I think we had yesterday was about unbalanced data sets and would it be a good idea to balance our data set. So let's start with a nice small model to use as a base case, something we've done before.

ConvNeXt. Okay, let's use this one. So actually there's no point copying progressive, I guess. Let's copy small models. Okay. Rename and so this is going to be for weighted. Okay. So the resizing that we needed on my machine but since we'll be putting it on Kaggle and as well.

Okay, so that's going to be our base case. So for weighting, we can do df.label.value_counts(). So there's our level of unbalancedness. So it's not too bad. There's a lot of normals, a lot of blast. Not many of these bacterial thingies. Nick, I don't know if you're around. I mean, I can see you are around.

I don't know if you're able to talk. But if you are, you might be able to tell us about what you found because I know you've been looking at these, which of these are hard to kind of visually see the difference between. Yeah, yeah, for sure. I'm sorry. I dropped out earlier because we had a power cut here, but I'm back now.

So are you intentionally videoless, or...? I am. I'm not intentionally videoless, but that's the breaks at the moment. Sorry about that. But yeah, like one thing that I did just to, I guess, get a better handle on the data set was going through them and having a look at the different types.

It's really hard to pick even what the difference was between a normal image and, you know, say, downy mildew or whatever. It could be quite hard to pick out. And so one thing I thought it would be fun to do was to almost like segment or mask the images, playing with the color channel, to see if they would come out a bit better.

And then when I did that, I was able to take kind of, I guess, the yellow dead bits or diseased parts, and I could see them better when they were, you know, in bright red. And the thing is, when I've trained them, I find that there is a handful of images, really like 20 to 25 images, that are very difficult to classify.

And it tends to be these actually from these imbalance classes where it tends to categorize them as blast when it's not. And I think you're all ones tend to get. Yeah, in fact, let me just pull up in one of my notebooks share your screen. Yeah, let me see if I like when you look at this.

Are you able to see? Would it help to make these bigger? Are you able to see the disease in these? Because I don't know what I'm looking for. Yeah. How do we make this bigger? Probably there's like a figure size in matplotlib, isn't there?

So like a figsize. figsize, yes. figsize equals... I don't know. Which way around is it? We can't hear you by the way, Nick. I don't know if we lost you. I also tried to look into the image using the confusion matrix and then the most loss to put it over.

It's just too hard. It's beyond my domain and I was planning to do that today, actually. So that's yeah. I don't know what happened to Nick. Maybe he's having some internet problems again. I wonder if it's just like red spots or something. So, yeah, I mean, it's an anyway.

It's interesting that Nick said he found these ones difficult. So, yeah, there's basically two reasons to weight different rows differently. One is that some of them are harder and you want them to be shown more often to give the computer more of a chance to learn them.

And the other is some are less common, and same thing. So, you know, one possible weighting for these would be to take their reciprocal. And so then, you know, normal is going to be shown less often if we weight all the normal ones by this amount, and all the bacterial panicle blight ones by this amount, so you're going to get more of these.

So that's like one approach we could use. I feel like that might be overkill. So I'd be inclined to kind of like not do it quite that much. So like another approach would be to like take the square root, maybe one over the square root, kind of like that.
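To make the difference between the two weightings concrete, here is a small pandas sketch. The label counts are made up for illustration (they are not the competition's real numbers); only `value_counts()` and the `1/np.sqrt(...)` idea come from the session.

```python
import numpy as np
import pandas as pd

# Made-up label counts, for illustration only.
df = pd.DataFrame({
    "label": ["normal"] * 1600 + ["blast"] * 900 + ["bacterial_panicle_blight"] * 100
})

counts = df.label.value_counts()

# Reciprocal weighting: fully evens out the classes.
inverse_wgts = 1 / counts
# Gentler weighting: one over the square root of the count.
sqrt_wgts = 1 / np.sqrt(counts)

# How much more often a rare row is picked than a "normal" row.
ratio_inverse = inverse_wgts["bacterial_panicle_blight"] / inverse_wgts["normal"]
ratio_sqrt = sqrt_wgts["bacterial_panicle_blight"] / sqrt_wgts["normal"]
```

With these toy counts a rare row is 16 times as likely to be picked as a normal row under the reciprocal weighting, but only 4 times as likely under the square-root weighting, which is why the square root is the less aggressive choice.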

So then these are going to be shown about twice as often as these, you know. So maybe like let's start with trying this as our set of weightings. Jeremy, if I could ask a question at this point. So the weighting, and when you talk about weighting such that images are shown more or less often.

I wonder, in cases where it's very imbalanced, whether that could lead to some classes being overfitted to, because the model learns about the images themselves. I came across this in looking at classification. Yeah. And whether there was a way to... I read about how to deal with imbalances. And I've seen some recommendations to try to weight when calculating the losses, rather than resampling the input.

So I just wondered whether it was possible. It's different, right? So in the end, you want it to be able to recognize the features of the images you care about. And there's no substitute for like having them see the images enough times to recognize them. However, when it does that, it is then going to, because it sees the rare cases more often, it's going to think that those rare cases are more probable than they actually are.

So you have to reverse that then when you make predictions. So that's, yeah, that's something to be careful of. So I mean, I think it would probably just help to take a look at it to see what that looks like. So yeah, so here's our weights, right? I would be inclined to probably... can we merge things directly?
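One common way to do that reversal is sketched here with made-up numbers. This was not shown in the session, and the function name `undo_sampling_bias` is hypothetical. The idea is to rescale each predicted class probability by the ratio of the true class frequency to the frequency the model saw under weighted sampling, then renormalize.

```python
import numpy as np

def undo_sampling_bias(probs, true_prior, sampled_prior):
    """Rescale probabilities from a model trained on re-weighted samples.

    Hypothetical helper: multiplies by true_prior / sampled_prior and
    renormalizes so the probabilities sum to one again.
    """
    adjusted = probs * (true_prior / sampled_prior)
    return adjusted / adjusted.sum(axis=-1, keepdims=True)

true_prior = np.array([0.9, 0.1])      # real class frequencies (made up)
sampled_prior = np.array([0.5, 0.5])   # frequencies seen during weighted training (made up)
probs = np.array([0.5, 0.5])           # model output for one image

corrected = undo_sampling_bias(probs, true_prior, sampled_prior)
```

In this toy case the model is undecided between the two classes, so after the correction the answer falls back to the real class frequencies, 0.9 and 0.1.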

Let's take a look. So if I go df.merge, which is kind of like a way of doing a join in pandas. And the right hand side, yeah, the right hand side can be a series. Cool. So merge on weights. What does that look like? Nope. Why not? And then, okay.

Left. I see. So left. Okay, so on. Left. left_on. left_on equals label. And right... I think that's called right_index. I'm not a pandas expert. I don't know if anybody is. There we go. Okay, so that's added these weights here. It's given it a slightly weird name, but that's okay.
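The merge being worked out here can be sketched like this. The data frame contents, the weights, and the column name `wgt` are made up for the example; `left_on` and `right_index` are real `DataFrame.merge` parameters. Naming the series `wgt` avoids the "slightly weird name" that pandas generates when the series name clashes with an existing column.

```python
import pandas as pd

# Toy version of the image description data frame (made-up rows).
df = pd.DataFrame({
    "image_id": ["100001.jpg", "100002.jpg", "100003.jpg"],
    "label": ["normal", "blast", "normal"],
})

# One weight per label, indexed by label. The series needs a name to be merged.
wgts = pd.Series({"normal": 0.5, "blast": 1.0}, name="wgt")

# Join each row's `label` column against the index of the weights series.
wdf = df.merge(wgts, left_on="label", right_index=True)
```

Every row of `wdf` now carries the weight for its own label in the `wgt` column.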

So if we call that wdf. And so then we could take out a little function and move them over here. And I think what we want to do is use data blocks at this point. It's often a good idea. Do we have a data blocks version? We can certainly make one otherwise.

Okay, here's a data block. So let's get a data block. Got an image block and a category block. get_y is parent_label. Okay, item transforms is... Item transforms is this. Jeremy, I think you're in the wrong notebook; it should be weighted. Thank you. Thank you. Yes, I had these here, but thank you.

Okay. Okay, and batch transforms. Let's use the same ones we had here to make it fair. Okay. So there's our data block. We actually use this resizing. Okay, Jeremy. Yeah? Sorry to interrupt there. So with this approach, we're going to use the data block to even out the numbers of what's being sampled, so that we get more augmentations of the same images for the less represented samples?

So, it's nothing to do with the data block. We're going to use things called weighted data loaders, and the weighted data loader is going to use these numbers here as, basically, probabilities of how likely it is to pick that row when it grabs a row for a batch.

Yeah, it's going to add them all up and divide each of these by the sum, so they add to one. So why we need data blocks is because the weighted data loaders method is a method of data block. It's not something we get in the, you know, quick and dirty image data loaders thing that doesn't have as much flexibility.
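The "divide by the sum, then pick rows with those probabilities" step can be sketched in plain NumPy. This is only an illustration of the idea with made-up weights, not fastai's actual implementation.

```python
import numpy as np

# Made-up per-row weights for a tiny 4-row training set.
wgts = np.array([1.0, 1.0, 2.0, 4.0])

# Normalize so the weights add up to one and can be used as probabilities.
probs = wgts / wgts.sum()

# Draw the row indexes for one batch of 64, with replacement,
# picking each row in proportion to its probability.
rng = np.random.default_rng(0)
batch_idxs = rng.choice(len(wgts), size=64, replace=True, p=probs)
```

Here the last row has probability 0.5, so on average it fills about half of every batch.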

So now that we've got a data block we can type that data block dot... Import it. Import fastai.callback dot... What was it in again? I don't remember. fastai weighted data loader. It's a data callback. Oh, okay, so it's actually a method of data sets.

So we can get a data sets object from a data block. Like so. And we pass in source. So that would be our list of image files. So we can files equals get image files. In our training set, pass those in and there's our training set and there's a validation set.

So they're data sets. So these are the things that remember we can index into and get a single xy pair. And so weighted data loaders is then something we can pass data sets to and give it weights and a batch size. Okay. And the weights are for the training set.

Okay, we're gonna have to be careful about this. So we should go to dss.weighted_dataloaders. And so the source code... Yes, it calls... I said to do. Which is here. Okay. What's called weights. All right, I'm not 100% sure how this is going to work but let's try it.

So our weighted data frame. So this is the weight for each row, right. And then we've got our files. Yeah, we've got to be a bit careful here, right, because they're in different orders. So we actually need a way to get a list of weights where the two orders are going to match each other.

Can you do it by key lookup? Yeah, we could do it by key lookup. I'm actually thinking of something a little lazier, which is just to sort them both. Okay, so although I don't know what's going on here. It doesn't have them all. Are they not contiguous?

So sort values by image ID. They are contiguous. So where is image 100001? The sorting must be by folder first. Yes, of course, that's exactly what it is. Thank you. Okay. So we could use a key. That looks hopeful: it says here if the key is a string, use an attribute getter, so I think I can just pass in the key name.

Ah, that is magic. That is the magic of fastcore right there. There we go. So that's sorting by name. And we can do the same thing for this one. Like so. And so now they're sorted by the same thing. So that's a good step. So the weights are basically wdf.label_y.
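The sorting problem here (paths sort by folder first, so they do not line up with the data frame sorted by image ID) can be reproduced with the standard library alone. This sketch uses `operator.attrgetter` as a stand-in for passing a string key in fastcore; the file paths are made up.

```python
from operator import attrgetter
from pathlib import Path

# Made-up file list: the folder is the label, the file name is the image ID.
files = [
    Path("train_images/normal/100003.jpg"),
    Path("train_images/blast/100002.jpg"),
    Path("train_images/normal/100001.jpg"),
]

# Default sort compares the whole path, so the folder name decides first.
by_path = sorted(files)

# Sorting on the `name` attribute ignores the folder and orders by image ID.
by_name = sorted(files, key=attrgetter("name"))
```

Sorting on `name` gives the same order as sorting the data frame by its image ID column, which is what lets the two lists be matched up position by position.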

Now that's a pandas series, which, yes, to_numpy would turn into an array. I'm not quite sure whether this has to be just for the training set or is for both. We'll find out in a moment. If I run that, it doesn't like it. That's interesting. Of course, so the batch transforms actually didn't end up getting applied, because we used data sets, which doesn't apply batch transforms.

So we would need to now apply them here. So that's quite confusing. So presumably, I don't see it here but I would expect to be able to go batch transforms at this point. This is all quite awkward, isn't it? So data loader keyword arguments equals batch. So if we're creating a data loader, a weighted data loader.

You know what would be a good idea would probably be to look at the data block data loaders source code to see how that does it. Data sets data loaders. Here we go. after_batch is what it is. after_batch... that's not it. Let's see. Okay, it's calling data loaders, passing in the keyword arguments, and... Okay, data loaders does not call it after batch.

That's dot data sets. Yeah, so okay, so datasets.dataloaders is this thing here, and that doesn't call it after_batch. Oh, and I think I know why. I think that's because when we looked the other day at data block, we noticed that it, like, adds... Oh, yes, yes, yes, the image block, that adds IntToFloatTensor as a batch transform.

So we might need to add that as well. Okay. So it's getting PIL images. So the fact it's getting PIL images means it's never been converted to a tensor. So data block... I think there's something that calls ToTensor or something at some point. Oh, there is, here: item transforms.

So why isn't that getting called? Because... Oh, item transforms, I think, are also done at the data loaders stage. Item transforms. Let's see. Item transforms. Yes, that's also done. Okay. So basically, using data sets instead of data loaders is quite awkward. I think we need to fix this in fastai because, yes, it's not being done for us.

But you know, what we could do actually is the same thing that data block does, which is just to use these self.item_tfms and self.batch_tfms. So if we have a look at our data block... Oops-a-daisy. Okay, I think this is all going to become clear in a moment.

Hopefully it's got these item transforms in it. And it's got these batch transforms in it. And so what we actually want to do when we create a data loaders is say that after batch is whatever the data block says the batch transforms are and after item. Is whatever the data block.

Says the item transforms are. Okay, that's ugly. So that's something I think we should try to make easier. So hopefully by the time people see this video, this will all be easier. So there's some data loaders. Okay. So my guess is that here is we've given the wrong number of weights.

I'm guessing this needs to be weights just for the training set. So the way I would check this is I would type %debug, and that puts us into the Python debugger, and the Python debugger is a very, very cool thing. It's called pdb, and you definitely want to know how to use it.

h gives you the help and w shows you where in the stack you are. So you can see this is the line of code I'm about to run. And so I can print out with p self dot n and I can print out with p self dot weights, and you don't actually normally need to even say p.

It just assumes that, so I can just say self dot weights dot shape. And so there's the problem. So it's expecting eight thousand three hundred and twenty six weights, not ten thousand four hundred and seven weights. And so that's because, and to be fair, the documentation warned us about this.

It's expecting weights just for the training set, not for both training and validation sets. Okay, no problem. Could you pre-determine your split by adding another column in the same data set there to put the weights in? Yeah, I could do that. But actually, and somebody actually asked about this the other day.

This is our training set. And items tells you the the file names, actually. So we just need to look each of these up. In the data frame. So what we could do is we could say weights. Equals. And so we could go through each of those. So that's going to be all of our files.

And then we need to look up the image ID. And I think something you could possibly do here is set the index to image ID, which is this kind of pandas idea. wdf equals... And then we say loc of 100001.jpg. There it is. For label_y.

There it is. So if we copy that over to here, and replace that with our o.name. Look at that. Okay, so. Okay, so we don't want to sort values. We want to set index. I should probably make more use of indices in pandas.

I guess I still don't have a great sense in my head of quite how they work. So I tend to under use them. Okay, so weights should now be the right length. For the training set. Okay, so now. Our weights here. It's just weights. Cool. And then what I'd be inclined to do is to do a few more.
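The lookup built here can be sketched as follows. The rows, paths, and weights are made up; `label_y` is the merged weight column named in the session, and `o.name` is the file name of each training path. Indexing the data frame by image ID means the order of the files no longer matters.

```python
from pathlib import Path

import pandas as pd

# Toy weighted data frame, indexed by image ID (made-up rows and weights).
wdf = pd.DataFrame({
    "image_id": ["100001.jpg", "100002.jpg", "100003.jpg", "100004.jpg"],
    "label": ["normal", "blast", "normal", "blast"],
    "label_y": [0.5, 1.0, 0.5, 1.0],
}).set_index("image_id")

# Training files only, in arbitrary order (100003.jpg is held out for validation).
train_files = [
    Path("train_images/blast/100004.jpg"),
    Path("train_images/normal/100001.jpg"),
    Path("train_images/blast/100002.jpg"),
]

# One weight per training file, looked up by file name.
wgts = [wdf.loc[o.name, "label_y"] for o in train_files]
```

The resulting list has exactly one weight per training item, which is the length the weighted data loader was expecting.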

And what I find encouraging here is that we've got a lot of bacterials. But, yeah, you know, this seems like a good mix, right? So then. We should just be able to. Add those to a learner. Fine tune for five epochs. All right, sorry that was a bit more awkward than I would have liked and definitely used a whole bunch of concepts which we haven't covered before.

So don't worry if you're feeling lost about the implementation here. Yeah, I mean, just about the how the sampling works. We've got weights and that's creating. How is that actually sampled from the training set is it. Do we have a number of rows or number of images that we're trying to create a sample.

Yeah, so what happens is it creates batches. So each batch will have 64 things in. And so it's going to grab at random 64 images. But it's a weighted random sample where each row is weighted by this weight. And so an epoch's not exactly an epoch anymore, in that it won't necessarily see every image once. An epoch is just equal to the total number of rows in the data set; that's how many rows it's seen.

But, you know, we'll see a lot of the less common ones multiple times. And so there's a definite danger of overfitting. The weighted sampling is not done for the validation set. So we should be able to compare these. Let's take a look. So five point six versus four point six.
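A quick simulation, using only the standard library, shows what an "epoch" means under weighted sampling. The row counts and weights are made up: 990 common rows with weight 1 and 10 rare rows with weight 10.

```python
import random
from collections import Counter

random.seed(0)

n_common, n_rare = 990, 10
wgts = [1.0] * n_common + [10.0] * n_rare   # the last 10 rows are the rare ones
n = len(wgts)

# One "epoch": as many draws as there are rows, sampled with replacement.
draws = random.choices(range(n), weights=wgts, k=n)
seen = Counter(draws)

# How many draws landed on rare rows, and how many rows were never drawn at all.
rare_draws = sum(seen[i] for i in range(n_common, n))
never_seen = n - len(seen)
```

The rare rows get drawn several times each, while a good share of the common rows are not seen at all in that epoch, which is where both the benefit and the overfitting risk come from.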

Now, you know, this is expected that where this might be interesting would be like. Do all of our training and then maybe at the very end do a few epochs with weighted training, you know, at the point that it's already really good just to show it a few more examples of the less common ones or just train it for longer with more data augmentation.

But yeah, I mean, you know, you would expect the error rate at this point to be worse, I think, because the most common types, which you particularly want it to care about because they're the ones it's mainly going to have in the training set, it hasn't seen very much. So the overall error has gone up.

But yeah, I think there may well be ways to use this. Jeremy? Yeah. Is it possible you could quickly explain where the deficiency was in this random weighted API, and how you would prefer that to look? You said you'd... Oh, yeah, sure. ...fix it up later.

But I mean, I think the way this ought to look would be that I can say dls = dblock.weighted_dataloaders, like that. In fact, you know, we could fix it up now, reusing the existing after_batch and after_item already.

And yeah, we can fix it up now if you're interested. Yeah, I'd love to see how to commit a change. So, you know, the first thing I'd do before I change the fastai library is make sure I've got the latest version of it by doing a git pull, because nobody likes conflicts.

All right, it's up to date. So then I would go into the notebooks, and it was in the data callbacks, callback.data. And so here's where the weighted data loaders are. Jeremy, this is maybe a bit of a silly question, but is it a callback, or is it just kind of like a transform within the actual data block?

Should it be that if you send weights to a data block, then it just does it? Is it a callback? No, it's not a callback. So it's in a strange place. It's not a callback. What it is, actually, is a data loader, and a patch to Datasets.

So there's, you know, something I like very much in fastcore called patch, which allows us to add a method with this name to this class. And I want to add something to the DataBlock class, like so. But yeah, I think that the docstring is correct.
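As a rough illustration of the patching idea, here is a plain-Python sketch. It is not fastcore's code: patch_sketch, Block, and describe are hypothetical names, and the only behaviour borrowed is that the target class is read from the type annotation on the function's first parameter.

```python
import inspect

def patch_sketch(f):
    "Hypothetical stand-in: attach `f` as a method on the class annotating its first parameter."
    first_param = next(iter(inspect.signature(f).parameters.values()))
    cls = first_param.annotation      # the class written after `self:`
    setattr(cls, f.__name__, f)       # the method takes the function's own name
    return f

class Block:
    def __init__(self, name): self.name = name

@patch_sketch
def describe(self: Block):
    return f"block named {self.name}"

print(Block("demo").describe())
```

The appeal of this style is that a method can be added to an existing class from a different notebook or module without editing the class definition.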

And I would then be inclined to just grab this here and copy and paste it in here. OK, and so this would be calling... Yeah, so we're calling the data block's... So I guess we're going to do the two steps manually, aren't we? So we're just going to go: the datasets.

And so that means we need to be passed in the items, called source, and I'd be inclined to grab all that. OK, so this thing in DataBlock is going to need a source, it's going to need the weights, it's going to need a batch size. Apparently there's something called verbose.

I don't know what that means, but that's fine. So the datasets is self.datasets, passing in the source, and verbose=verbose. And then we called dss.dataloaders when we did that. OK, so now we're going to be doing dss.weighted_dataloaders.

Weighted data loaders. And that's basically... Oops, what happened there? And then we pass in the weights. So weighted_dataloaders gets the weights, and then the batch size, and then the things we added, and any additional keyword arguments. And this will delegate down to Datasets.weighted_dataloaders, which is where the keyword arguments get passed to.
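The two-step shape described here (build the datasets from the source, then hand the weights, batch size, transforms, and extra keyword arguments to the datasets-level method) can be sketched with toy classes. ToyDatasets and ToyDataBlock are hypothetical stand-ins, not fastai classes, and the returned dict only records what was passed along.

```python
class ToyDatasets:
    "Hypothetical stand-in for a datasets object that already builds weighted loaders."
    def __init__(self, items): self.items = list(items)
    def weighted_dataloaders(self, wgts, bs=64, **kwargs):
        # Just record the arguments so the delegation is visible.
        return {"n_items": len(self.items), "wgts": wgts, "bs": bs, **kwargs}

class ToyDataBlock:
    "Hypothetical stand-in for a data block that holds the transforms to reuse."
    def __init__(self, item_tfms=None, batch_tfms=None):
        self.item_tfms, self.batch_tfms = item_tfms, batch_tfms
    def datasets(self, source, verbose=False):
        return ToyDatasets(source)
    def weighted_dataloaders(self, source, wgts, bs=64, verbose=False, **kwargs):
        dss = self.datasets(source, verbose=verbose)    # step 1: build the datasets
        return dss.weighted_dataloaders(                # step 2: delegate, passing kwargs through
            wgts, bs=bs, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)

dblock = ToyDataBlock(item_tfms="resize")
dls = dblock.weighted_dataloaders(["a.jpg", "b.jpg"], wgts=[0.7, 0.3], bs=2)
print(dls)
```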

OK, so as far as I can tell, these same tests should all work. We don't need these labels anymore. It is valid. We've already got a data block. So previously we called datasets and item transforms and weights manually. So that is our source. So we could get rid of all this.

And we're now going to go dblock.weighted_dataloaders. And we've got to pass in our source. OK, and we've got to pass in our weights, which were called weights. And we don't need that. And we don't need that anymore. OK, why did I get zero? That's slightly surprising to me.

I can get zero. Yeah, that's fine. Yeah, I get zero or one, because it depends how it... Why is it slightly random? I'm sure something's slightly random. But anyway, it's working. OK, then again, for this one, we shouldn't need to do the datasets stuff. We should be able to go dblock.weighted_dataloaders.

And we should be able to pass in our items and our weights. OK, what did I do wrong there? dblock.weighted_dataloaders. Oh, it's got it. OK, let's see. We've got our source, weights. Why doesn't it like that? source equals... So let's see how it's different to what this one said.

Datasets. OK, this doesn't use a data block. So, OK, I can't replicate that. That's fine. OK, so that's our test. There we go. So what I would then do is I would export it. And the reason I don't have to rebuild or reinstall the fastai library or anything like that is because I have it installed using something called an editable install.

So if you haven't seen that before, or maybe you have, basically, when you go pip install -e . in a git repo, that creates like a symlink from your Python library to this folder.

And so when I import fastai, it's actually going to import it from this folder. And so now back over here in my weighted thingy, if I do all this, DataBlock, we should find that there's now a dblock.weighted_dataloaders, which I can pass source and weights.

And my source is files and my weights is... Wait. And my weights... Okay, so that's interesting. Wait. Yes, we don't have datasets yet. So that's a very interesting point. So how do we know what our weights are? We don't, because they haven't been split. So the... Could it go through one of the blocks, get it from a column, and then use that? Because then it would be linked quite intimately with the actual row.

Well, we don't need to. I think what we need to do is pass in all the weights, and then this thing here should be responsible for grabbing the subset for the training set. And that would actually be much more convenient, which after all is what we want.

So we should determine the weights based on the distribution across the classes rather than just... We should split the weights based on the splitter into training and test set? So then we don't need any of this. So then weights will actually simply be... That's our weighted data frame.

So basically what I would do here is, actually, we'll go back to saying this is sort_values. And then our weights will be wdf.label_y. That's actually our weights, as a number. Silly question: could you not just add the functionality for weights to the standard data block, and if it doesn't get any, then it does nothing?

Potentially, we could. I kind of like this though, because, yeah, I don't know. Like, if the weights were all one as a default, then you could use the one solution for... Yeah, yeah, you could. I just find it's a little bit too coupled for me. I don't love it, but it would be doable.

It's an unnecessary multiplication, I suppose. You know, I like how nicely decoupled this is. So I think this is what I want it to look like. So I would look at how the splitters work. So the splitter... OK, so the splits get created here in Datasets. Cool. And then...

I wonder if Datasets remembers what those splits are. Oh, I don't have tags here. What do you mean, no tags file? OK, there we go. Datasets. So that's Ctrl-], to remind you, to jump to a symbol in Vim. I see. And that's actually mainly happening in the class it inherits from. That superclass is where...

This is the split stuff here, yes, splits. I see. There is a splits. So dss.splits. Oh, OK. dss.splits. Yeah, so there's the indices of the training and test sets. And so that's the indices of the training set. So the actual weights we want are those ones.

So over here we can say training weights. So we'll change this to data set from training set. And so this will be the weights at those indices. And that's what we'd use. Like so. self.splits? Thank you so much. No, dss.splits. self is a data block, and it's actually the dss that has the splits.
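Picking out the weights at the training indices can be sketched in NumPy as follows. The weights and the 80/20 random split are made up, and splits here is just a tuple of index arrays standing in for what the datasets object holds.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 10
all_wgts = np.linspace(0.1, 1.0, n)    # made-up weights, one per item in the full source

# Stand-in for a random splitter: shuffle the indices, hold out 20% for validation.
perm = rng.permutation(n)
n_valid = int(0.2 * n)
splits = (perm[n_valid:], perm[:n_valid])    # (training indices, validation indices)

trn_wgts = all_wgts[splits[0]]    # keep only the weights for the training rows
print(len(trn_wgts), len(splits[1]))
```

Passing in all the weights and subsetting afterwards means the caller never needs to know which rows ended up in the training set.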

The data block has a function that knows how to split, but the split doesn't happen until you create the datasets. That way you can get different random splits each time if you want them. Thank you for checking though. OK, so I'll export that. It would probably be good to have autoreload going, but we don't.

So be it. OK. Now, we did miss a self, but it's not the one you thought of. This one here. Yeah. OK. I guess actually if I just comment this out, then we can just run all above without worrying. Okay, things are happening. So dls equals that. OK, that looks pretty good.

OK, so I think we've created our feature. So then the next thing I would do is, it would be very, very weird if any tests broke, but I would go ahead and run the tests. I would then create an issue for my feature. And so I've got a bunch of tiny little aliases and functions. One's called enhancement, which creates an issue with the enhancement label.

So I'll go enhancement, add DataBlock.weighted_dataloaders. That creates the issue as 3706. So if you were interested, you could take a look at that issue. Not the world's most interesting issue, but there it is. All right, looks like the tests are basically... Oh, no, we've got an issue.

There we go. So we've got a test that's failed. Range indices must be integers or slices. Yes, right. So I'm glad we checked. OK, so the problem here is that I sliced into my weights on the assumption that this is something I can slice into, which would only be true if it was a tensor or an array.
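A small sketch of that failure mode, using made-up values: indexing a plain Python list or a range with a list of indices raises a TypeError, while a NumPy array accepts it.

```python
import numpy as np

idxs = [0, 2, 3]    # e.g. training-set indices from a split

errors = []
for wgts in ([0.5, 0.1, 0.3, 0.1], range(4)):
    try:
        wgts[idxs]    # lists and ranges only accept an integer or a slice
    except TypeError as e:
        errors.append(str(e))
        print(type(wgts).__name__, "->", e)

arr = np.array([0.5, 0.1, 0.3, 0.1])
picked = arr[idxs]    # NumPy arrays accept a list of indices
print(picked)
```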

But in this case, actually, my weights are not either of those things. So what would I do to fix that? Yeah. When you split, you only get back the indices of the training and validation data sets. And how can you know the weights, because you haven't actually done the calculation, the one over square root kind of thing?

The weights are being passed in as a parameter. And so we calculated the weights up here, and then we passed them in here. What's the incorrect type that's coming through in the test? It's not that it's an incorrect type. It's that, see how here I'm indexing into the weights using my splits?

This here is a list or an array. You can't index into a Python list with a list. You can only do that with tensors or NumPy arrays. Yeah, I mean, what we actually want to do is check whether it's an array type. Is there an is_listy function or something?

There is, but that's not quite it. I think we want the opposite, which is: is this the kind of thing that one could expect to be able to do NumPy-style indexing on? And I believe the correct way to do that might be to look for this thing. Yeah, so I would be inclined to say...

There may well already be something in fastai that knows how to check for this, to be honest. Okay, so this, what's this thing? Oh, that's something that's commented out. All right, so I guess I don't have anything which checks for that. So we'll just do it manually. So if weights has the __array__ attribute...

I'm pretty sure that tensors have that as well. Yeah, they do. So if it has that attribute, then I think we're good to go. Otherwise, we can use a list comprehension. Oh, you know what we could do? Yeah, okay. What we'll do is we'll just say, if it doesn't have that...

I don't know if this is too rude, to change their values, but I think this is fine. If it's not a NumPy-type array, it's probably going to benefit from being converted to one anyway, right? Yeah, I mean, I don't see a downside. Passes our test.
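The check-then-convert step can be sketched like this. as_indexable is a hypothetical helper name used only for illustration; the logic is the one described: leave the weights alone if they expose __array__, otherwise convert them to a NumPy array.

```python
import numpy as np

def as_indexable(wgts):
    "Hypothetical helper: return `wgts` unchanged if it has __array__, else convert to a NumPy array."
    if not hasattr(wgts, "__array__"):
        wgts = np.array(wgts)
    return wgts

trn_idxs = [1, 3]
for wgts in (range(5), [0.1, 0.2, 0.3, 0.4, 0.5], np.arange(5)):
    # After the conversion, a list of indices works for all three inputs.
    print(type(wgts).__name__, as_indexable(wgts)[trn_idxs])
```

One design consequence is that inputs which are already array-like are passed through untouched, so nothing is copied unless a conversion is actually needed.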

Passes all of our tests. Okay, so that was our only test that failed, which is now passing. So I would now say we've fixed issue 3706. So I've got a little fixes function that does that: fixes 3706. Okay, and so now, if we look at that issue, you'll see that it's been resolved using this commit.

Yeah, but what do you commit from the notebook? Do you sort of reset it with empty cells, or do you run the cells? I commit them basically however they are, but with unnecessary metadata removed. So there's a hook that automatically runs this function, which is the thing that removes stuff like the execution count, unnecessary notebook metadata, stuff like that.

So, the idea is that the notebooks want to have all the outputs in place, because they get turned into documentation. And we wouldn't want to run them all in continuous integration to create documentation because they can like involve like spending 10 hours training an NLP model, for example. We don't remove the outputs for that reason and also because I want people to be able to look at the notebooks in GitHub and see, you know, all the pictures and stuff.
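A rough sketch of that kind of cleaning, working on a notebook already loaded as a Python dict. This is not nbdev's implementation; clean_nb_sketch and the tiny example notebook are hypothetical, and exactly which metadata gets dropped is an assumption.

```python
import json

def clean_nb_sketch(nb):
    "Hypothetical cleaner: blank execution counts and drop cell metadata, keeping outputs."
    for cell in nb["cells"]:
        cell["metadata"] = {}
        if cell["cell_type"] == "code":
            cell["execution_count"] = None
            for out in cell.get("outputs", []):
                if "execution_count" in out:
                    out["execution_count"] = None
    return nb

# Minimal made-up notebook with one executed code cell.
nb = {"cells": [{
    "cell_type": "code", "execution_count": 7, "metadata": {"scrolled": True},
    "source": ["1+1"],
    "outputs": [{"output_type": "execute_result", "execution_count": 7,
                 "data": {"text/plain": ["2"]}, "metadata": {}}],
}], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}

cleaned = clean_nb_sketch(nb)
print(json.dumps(cleaned["cells"][0], indent=1))
```

Keeping the outputs while blanking the counters means the diff between two commits reflects real changes to code or results, not just the order in which cells were run.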

All right, I better stop there. Oh, that's interesting today. Okay, I guess I don't have my hook installed, so I'm glad I ran that manually, so you can see exactly what it does, right? It empties out the execution counts and removes the metadata. Sorry, another question. I'm just trying to find it.

Is that git hook available in the repo, or do you... Yeah, so if you go nbdev_install_git_hooks, it installs the hook. And specifically, it's going to... Whoopsie daisy. Is that under the nbs folder? No, this is part of nbdev. Oh, okay. Right. So once that package is installed, it's built in.

And so that then installs a filter here. I'll read more about it. And it also installs a git hook to trust the notebooks, which calls nbdev_trust_nbs. Anyway, yeah, that's all in the nbdev docs. And then what's going to happen now on the fastai side, on the GitHub side, is it's now busily running all the tests again.

And one of the things it checks is to make sure that the notebooks are clean and that the export's been run. Then it checks all the notebooks, somewhat in parallel. Yeah. All right, I'm gonna go. See you all. Thanks. Bye. Thank you. Bye-bye.