Live coding 7
Chapters
0:0
0:42 Background for Kaggle Competitions
10:0 Setting up for Kaggle competitions on your local machine
14:30 Create API token for Kaggle
18:0 Download kaggle competition zip file
25:0 Using the pipe output to head
28:55 Back to Paperspace
29:30 Remove pip from /storage
30:0 Install kaggle and update symlinks
32:0 Upload kaggle API json
33:20 Download kaggle competition file to Paperspace
35:0 Install unzip to persistent conda env
36:45 Unzipping kaggle file in notebooks is too slow
40:0 Unzip kaggle file in home directory for speed
41:20 Create an executable script for unzipping kaggle file
43:10 Create a notebook to explore kaggle data
48:0 Browse image files
51:0 Review image metadata
53:0 Image data loaders and labelling function
56:30 Create a learner
57:0 Monitor training with nvidia-smi dmon
62:0 Summary
00:00:00.000 |
Okay, so let's, yeah, let's look at how we can automate a process like the one that we 00:00:17.540 |
hacked together yesterday. And actually what we might do is have a look at something that 00:00:32.160 |
Radek put on the forums because I kind of want to have all the stuff we want to include 00:00:39.840 |
there and then we can automate the whole thing. Yeah, here's Radek's thing. Okay, so Radek, 00:00:48.960 |
do you want to tell us about what this thing is? What your project, your goal, or what 00:00:55.400 |
is, yeah, what's this forum post about? This is another of Radek's crazy inventions, 00:01:02.840 |
but essentially this is how I learned, right? So if I, this is the fastai way of learning. 00:01:11.600 |
I think of this as the fastai way of learning. So for those who haven't read it yet, can 00:01:17.520 |
you just take us through what it is? Oh, yeah, absolutely. So yesterday in the walkthrough, 00:01:24.800 |
we covered some material and I wanted to find a fun way to practice it. So I went to Kaggle, 00:01:31.680 |
I looked for competitions and there was a competition that seemed like participating 00:01:38.080 |
in it could allow me to practice what we learned in the walkthrough. So I created this 00:01:44.200 |
as a resource for others. You know, I've been creating such things for a couple of years 00:01:48.840 |
now and maybe it requires also a little bit of understanding of Kaggle, which somebody 00:01:56.160 |
who has not played around with Kaggle yet, they might not have it. So I thought, Hey, 00:02:00.880 |
let me put this together, share it with others. So this is a getting started part with a community 00:02:08.800 |
competition. It features images of plants. And you're supposed to detect one of nine 00:02:19.560 |
or one of 10 classes of plant diseases that they can be affected with. I haven't looked 00:02:28.760 |
too much at the data. Essentially, I relied on fastai's functionality, where if I present 00:02:36.400 |
the image data loaders, and then by that token, if I present the learner with data that seems 00:02:46.560 |
to be appropriately formatted, then the learner will do the rest. So I tried to breeze through 00:02:53.200 |
this as quickly as I could just to put it together and to practice on my own as well 00:02:59.680 |
while doing this. And hopefully this can be useful to others. 00:03:04.800 |
Great. Shall we try going through it together? Somebody's got a fastai one here 00:03:11.440 |
already. Sounds fun. Let's try it. So that's great. And I like the fact that you're looking 00:03:27.560 |
to run it on Paperspace rather than on Kaggle. I think that's good practice. So let's do 00:03:37.640 |
the same thing that Radek did. So on Kaggle, you can see all the competitions that are 00:03:50.480 |
running, active competitions. So there are various different types, right? There's kind 00:03:59.440 |
of the normal ones that have money involved. And as well as having money involved, they 00:04:09.040 |
also have ranking points. So that gives you the opportunity to try to become a master 00:04:17.160 |
or grandmaster or whatever by getting ranking points. Then there are some which are just 00:04:26.280 |
for kudos. So there's no money involved and no ranking points. And you can kind of use 00:04:41.660 |
the little buttons at the top to find getting started competitions. So those are just knowledge. 00:04:53.880 |
Some have prizes. I wonder what prizes they have. A TPU star. 20 extra hours of TPU time 00:05:02.320 |
for four weeks. There you go. Kind of sounds like a drug. Like you use TPUs to win and 00:05:13.640 |
then you need more to keep going. I love it. And then there's also playground competitions 00:05:22.760 |
which they kind of repeat each month. All right. So looks like Radek's picked out a kudos competition. 00:05:32.640 |
So we're not going to get any money or ranking points. We're just doing it for the enjoyment 00:05:37.800 |
and the learning. So we have to click, before you can download the data, you have to click 00:05:44.160 |
join competition. And this is a really common mistake people make is they try to download 00:05:49.840 |
the data without doing that. You'll get an error. So as it says you can download the 00:05:56.720 |
data by running this command here. Now that's not going to be installed yet. So we need 00:06:04.480 |
to install it. So we can install it with pip install minus minus user etc. I feel like 00:06:19.080 |
given how much we're typing pip install minus minus user whatever it might be worth creating 00:06:24.320 |
an alias for. This is taking a long time. >> Do you like Paperspace so far generally? 00:06:49.560 |
>> Very very much. Yeah. It's the first platform I found which I feel like I can use this you 00:07:00.480 |
know. >> Nice. I have to give another try then. >> Not opening my cheaper 11. Yeah, 00:07:12.480 |
make sure you do all the previous walkthroughs because it really does take you through like 00:07:18.120 |
how to take advantage of this and it's yeah, otherwise it's not particularly exciting. 00:07:23.440 |
All right. I'm having some trouble with JupyterLab. So I'm just going to go to and use their 00:07:29.360 |
rather unappealing GUI. Hopefully that'll work. No, even that's not working. >> Do they 00:07:37.080 |
have the ability to SSH into these machines? Do you know? >> They do not. But because you've 00:07:42.920 |
got Jupyter Lab installed, you get a full terminal. So it doesn't really make any difference. 00:07:47.920 |
And you can also connect VS Code to them. Anyway, for some reason, this is the first 00:07:58.600 |
time this has happened. It's not liking me. >> I can't open a Jupyter Lab notebook either. 00:08:09.800 |
It's timing out. I'll see if I can fire it up in VS Code then. It's unusually sluggish 00:08:22.600 |
today. >> This is my experience whenever I've tried 00:08:30.680 |
it before. Maybe it's just I'm here, so the curse came with me. 00:08:34.520 |
>> It used to happen to me all the time. And they seem to have really improved. I'm going 00:08:44.280 |
to try a paid one. >> I've never had one before. >> I'll try a 00:09:10.440 |
green dot. It's promising. >> I've got two tabs. The green dot is the 00:09:34.240 |
free one. >> I can't connect through VS Code either. 00:09:40.840 |
>> Not wanting to start. >> It should be a red dot. It's not working. 00:09:47.880 |
>> No worries. Change of plans. We'll do things locally. We will yell at Paperspace. What 00:10:00.240 |
the hell just happened? Let me just switch to my other user. Let's see if Kaggle's 00:10:24.520 |
installed. There's no Kaggle. Here I am with no Kaggle. I'll start by creating a tmux 00:10:32.920 |
session. I like to be able to run a few things at the same time. I would run pip install 00:10:45.920 |
minus minus user Kaggle. There we go. Now, Kaggle is not just a Python library, but it also 00:11:04.560 |
has a command line tool. And because I did minus minus user, it installed the command 00:11:11.160 |
line tool into my home directory, into the dot local folder. And binaries, things you 00:11:19.640 |
can execute, are generally put in bin. But dot local slash bin in my home directory isn't 00:11:26.480 |
in my path, and therefore I can't type Kaggle. As we know, to fix that, if you're on Paperspace, 00:11:36.200 |
you would modify slash storage slash bash dot local, or here in my local machine, 00:11:43.400 |
I would just modify dot bashrc in my home directory. >> On Paperspace, if you 00:11:52.880 |
modify bash dot local, it will not run before Jupyter notebook runs. >> Correct. 00:11:59.840 |
Which is fine. >> Okay. >> Yes. Because Jupyter notebook doesn't 00:12:04.560 |
need access to this, unless you want to do exclamation mark Kaggle. If you want to put 00:12:12.240 |
exclamation mark Kaggle in Jupyter, you would need to put it in pre dash run dot sh. Is 00:12:22.200 |
>> Cool. Cool. Great. Yeah. So, something in bash dot local will only execute in a 00:12:29.400 |
new terminal. Sorry. Go on. >> So, I suppose, like, in some operating systems, 00:12:36.000 |
I think the local bin directory is on the path by default. So, maybe, do you think that 00:12:42.200 |
it's just whatever reason? >> I've never seen that. But it's possible. 00:12:46.040 |
There's a lot of distributions around. >> I could be confusing with, like, user local 00:12:52.800 |
bin or whatever. There's some of them on Mac. >> Yeah, slash user slash local slash bin 00:12:58.960 |
is not in your home directory, and that is always part of your path. But this is something 00:13:08.400 |
>> All right. Okay. So, by default, Ubuntu has a bunch of stuff in your dot bashrc, by the 00:13:18.800 |
way. So, I'm just going to go to the bottom. So, to go to the bottom in Vim, it's shift 00:13:22.560 |
G to go to the bottom. And then o to open up a new line underneath this one. So, insert 00:13:31.040 |
beneath. And so, we will export path equals tilde slash dot local slash bin. And then colon 00:13:44.920 |
and then everything that's already in your path. So, that prepends it to our path. So, 00:13:49.800 |
I could close and reopen my dot bashrc, or I can re-execute it. And any exported variables 00:13:57.920 |
I want to go into my shell. So, to execute stuff and put variables into your current 00:14:02.160 |
shell, you can type source. So, source dot bashrc is going to save me having to close 00:14:07.520 |
and reopen my terminal. And dot bashrc is the last thing on the last line. So, I can 00:14:12.880 |
just do that. And so now we can Kaggle. Okay. So, the next thing we need is somewhere to 00:14:23.320 |
authenticate. And Kaggle uses something called Kaggle dot JSON to do that. So, if I go to 00:14:38.720 |
Kaggle, you can grab it. There we are. You can grab it by clicking create new API token. 00:15:06.680 |
And what that will do is it will download a file called Kaggle dot JSON to your computer. 00:15:31.200 |
And so, once it's downloaded, depending on where you are, on Mac, it might be in your 00:15:36.760 |
tilde slash downloads directory. And on Windows, it will be in slash mnt slash c, which 00:15:42.440 |
is your Windows C drive. And it will be in your Users, user name, downloads directory. 00:15:53.080 |
So, they say it needs to be in a directory called dot Kaggle. So, I'll go make dot Kaggle. 00:16:06.880 |
I think it's probably just created that for us when we tried to run it. That's good. So, 00:16:10.160 |
now I can copy it. And in this case, I think what I'm going to do is just copy it from 00:16:22.520 |
my other account. Copy dot Kaggle slash there's my JSON. And I'll copy it into my by the way, 00:16:36.280 |
so I want to get the jph00's home directory. So, tilde jph00 refers to the home directory 00:16:42.920 |
belonging to jph00. Tilde on its own means the current user's home directory. So, I'm 00:16:48.600 |
going to copy it over to dot Kaggle. There we go. And change its ownership so it's owned 00:16:58.120 |
by jph00. So, you won't have to do this because you'll be downloading it and copying it from 00:17:05.480 |
downloads. I'm just doing this because I'm copying it from a different user. All right. 00:17:18.840 |
So, yep, that now belongs to jph00. So, now I should be able to 00:17:30.440 |
go back into that username and type Kaggle. Okay, great. So, I've got Kaggle installed. 00:17:38.120 |
And we'll do a check from time to time to see 00:18:02.520 |
said we can download it with this command. So, I'll copy that. 00:18:13.320 |
And let's create a directory for the competition. Paddy. And run that command. 00:18:27.000 |
Nice. Gigabyte of data. All right. Did anybody have any questions or anything 00:18:43.720 |
So, Jeremy, we can use Mamba install here since you're doing it on local, right? 00:18:58.200 |
If it's on, yes, it is on conda-forge. So, yeah, Mamba install Kaggle should be fine. 00:19:09.560 |
Although, you know, to be honest, like, for simple pure Python stuff like this, I 00:19:18.200 |
often just use pip anyway. Because things like this, like, pretty much most tools are 00:19:28.760 |
Python libraries. Like, pip is the main thing people are kind of targeting. So, 00:19:35.880 |
you can be sure that that's going to be the most recent version. 00:19:39.320 |
Unless the documentation explicitly says, like, we provide conda packages as well, 00:19:44.280 |
there's often a good chance that the conda packages will be behind. So, if I was going to 00:19:49.720 |
do a mamba install, I would be inclined to, like, double check that this is actually the 00:19:58.360 |
most recent version. But, yeah, as you see, I just use pip anyway, I suspect, for something like this. 00:20:10.920 |
>> This used to be the case that you -- I remember something about, like, cookies 00:20:18.360 |
and there was a browser extension and maybe you had your own tool for this. Or am I just 00:20:23.160 |
hallucinating? Did it used to be this way? From, like, an older fastai? Okay. 00:20:29.880 |
>> Okay. So, there are always things so we can unzip it. 00:20:36.120 |
Okay. So, I hate it when that happens. Because it actually takes ages. So, 00:20:42.920 |
minus Q for quietly unzip it. Right. Okay. So, that's going to give us our data. 00:20:54.920 |
I guess one thing is for getting our Kaggle.json onto Paperspace, 00:21:03.640 |
the easiest way is to click the file upload button in JupyterLab. So, there's just a little 00:21:11.160 |
upward pointing arrow button. If you click that, it'll upload it. And then, yes, copy it to 00:21:20.920 |
tilde slash dot Kaggle. And it does have to have the correct permissions. 00:21:32.680 |
Which is hopefully you might be able to recognize this. So, that's 4 plus 2 is 6. And then 0 0. So, 00:21:39.720 |
chmod 600 on that file will give you the correct permissions. 00:21:43.000 |
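The permission arithmetic can be tried on a throwaway file. This is a minimal sketch rather than the exact commands typed in the session: `demo_dir` and the empty stand-in `kaggle.json` are invented for illustration, and the real token would sit in `~/.kaggle/kaggle.json`.

```shell
# Work on a scratch file so no real credentials are touched (demo_dir is hypothetical).
demo_dir=$(mktemp -d)
touch "$demo_dir/kaggle.json"

# 600 = owner: 4 (read) + 2 (write) = 6; group: 0; everyone else: 0.
chmod 600 "$demo_dir/kaggle.json"

# The first ten characters of `ls -l` show the file type and permission bits.
perms=$(ls -l "$demo_dir/kaggle.json" | cut -c1-10)
echo "$perms"   # -rw-------
```

On the real file the equivalent step would be `chmod 600 ~/.kaggle/kaggle.json`.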
Okay. So, now, the only problem is that this is my desktop, which does not have a GPU. So, 00:21:58.120 |
that was actually a stupid place to put this. So, I've got to copy this to my GPU server. 00:22:05.960 |
So, to copy files from one Linux or Mac thing to another, a very easy way to do it is SCP, 00:22:15.640 |
secure copy, and type the name of the file. And then type where you want to send it to. 00:22:25.000 |
Oh, except I don't have that set up here. All right. So, I'm just going to go back to my 00:22:34.760 |
normal user. So, you know, copy tilde jph00, get paddy disease classification. I'm going to copy 00:22:45.720 |
that here. So, you can use SCP to copy a file to another machine. And off that goes. 00:23:11.400 |
So, how does it know what local colon is? So, there's a very underutilized handy 00:23:20.360 |
file called .ssh/config where you can type things like host local. And when I SSH to that, 00:23:32.360 |
it will actually SSH to this host name. And it will actually use this user name. And it will set up, 00:23:39.320 |
we haven't talked about SSH forwarding, but if you know about that, it will set up SSH forwarding. 00:23:43.320 |
So, this is just a little trick for people who do use SSH, that using the SSH config file is great. 00:23:50.600 |
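An entry of the kind described might look like the sketch below. The host name and user are placeholders, not the values from the session, and `ForwardAgent` is only a guess at what "SSH forwarding" refers to (it could equally be a port forward such as `LocalForward`). The entry is written to a scratch file here so the real `~/.ssh/config` is left alone.

```shell
# Write an example entry to a scratch file rather than the real ~/.ssh/config.
# HostName and User are placeholders; ForwardAgent is an assumption.
config_file=$(mktemp)
cat > "$config_file" <<'EOF'
Host local
    HostName gpu-box.example.com
    User exampleuser
    ForwardAgent yes
EOF

cat "$config_file"

# With such an entry in ~/.ssh/config, both of these resolve "local" the same way:
#   ssh local
#   scp paddy-disease-classification.zip local:
```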
And it's not just for SSH, it's also for anything that uses SSH, including SCP. SCP is a secure copy 00:23:58.280 |
over SSH. All right. So, now that's done, I can log in to that machine. And now we're on a GPU 00:24:09.880 |
machine. So, to check your GPUs, you can type nvidia-smi. And so, this has got three GPUs. 00:24:20.760 |
And I can move that file, I just copy it into here, into here. So, should we use SCP or Rsync? 00:24:30.520 |
Oh, that's fine. Yeah. I use SCP just because I don't have to type any flags to it. Strictly 00:24:39.480 |
speaking, SCP is kind of considered deprecated nowadays, but it actually works fine. 00:24:48.280 |
Unzip that. Cool. Okay. Making good progress. Let's see what we've got. Okay. So, 00:25:02.040 |
there's a sample submission.csv, there's a train.csv, train images, test images. 00:25:10.920 |
So, ls train images, if this has got like 10,000 things in it, that's going to be annoying. 00:25:17.640 |
So, if you pipe to head, so remember this vertical bar is called pipe, 00:25:23.080 |
means take the output of this program and pass it in as the input to this 00:25:28.600 |
program. And this program shows you the first 10 lines of its input. Okay. So, actually, it turns 00:25:35.480 |
out that's got folders for each category. So, I don't really need to pipe it to head. Okay. And 00:25:41.400 |
so then we could do the same thing with one of these, bacterial leaf blight and pipe that to head. 00:25:48.520 |
There we go. So, now we might want to know like how many of those are there. So, instead of piping 00:25:54.120 |
to head, we can pipe it to word count, which is wc. But despite the name, it doesn't only count 00:26:00.280 |
words. If you pass in L for line, it'll do a line count. So, that's how many bacterial leaf blight 00:26:08.040 |
images there are. So, it's really useful to play around with these things you can pipe into. So, 00:26:16.600 |
head, wc, another useful one is tail, which is the last 10 lines. And then one we've seen before 00:26:24.760 |
is grep. So, not particularly useful, but show me all the ones with the number 33 in it. Okay. 00:26:34.440 |
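The pipes described here can be tried without the competition data. The sketch below builds a small stand-in folder with invented file names and then pipes `ls` into `head`, `tail`, `wc -l` and `grep`; the variable names are illustrative only.

```shell
# Build a small stand-in for one label folder (file names are invented).
demo=$(mktemp -d)
mkdir "$demo/bacterial_leaf_blight"
for i in $(seq 100001 100040); do
  touch "$demo/bacterial_leaf_blight/$i.jpg"
done

ls "$demo/bacterial_leaf_blight" | head     # first 10 lines of the listing
ls "$demo/bacterial_leaf_blight" | tail     # last 10 lines of the listing

# wc -l counts lines, so here it counts files.
n_files=$(ls "$demo/bacterial_leaf_blight" | wc -l)

# grep keeps only the lines containing "33"; wc -l then counts them.
n_33=$(ls "$demo/bacterial_leaf_blight" | grep 33 | wc -l)

echo "files: $n_files, names containing 33: $n_33"
```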
And you can use head and tail also on files. So, head is very useful for csv files. If you're in 00:26:41.960 |
your Jupyter notebook and it's screaming at you that it cannot read a csv file, it cannot parse a 00:26:47.640 |
csv file, you can just jump into the console or even from Jupyter notebook, just do head. Yeah. Well, 00:26:53.320 |
let's try it, right? Because so, I think we know that if you type cat and a file name, it will 00:27:01.000 |
send it to the output, which by default prints it to the screen. So, we could pipe that to head, 00:27:12.440 |
right? Now, real Unix gurus will say, well, that was silly because actually if you look at the man 00:27:20.120 |
page for head, if you pass it a file name, it does the same thing. But to me, I prefer to learn a 00:27:25.800 |
smaller number of composable things. So, piping stuff to head is not a bad idea. And we 00:27:35.320 |
could even, and, you know, another nice thing about cat is I can pipe it into grep and search for 00:27:43.720 |
everything with, I don't know, how many of these ADT45s are there? That's grep for ADT45 and then 00:27:52.360 |
pipe that into word count but count lines. Yeah. So, you can quickly get some information at the 00:28:05.240 |
console, which, yeah, I think can be quite useful. All right. So, next thing to do, I reckon, is to 00:28:18.440 |
fire up a Jupyter. So, let's see the get Jupyter notebook. 00:28:35.160 |
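Before moving on to Jupyter, the console checks on the CSV just described can be sketched on a tiny made-up file. The column names and rows below are illustrative stand-ins, not the actual contents of `train.csv`.

```shell
# A tiny made-up CSV standing in for train.csv (columns and rows are illustrative).
csv=$(mktemp)
cat > "$csv" <<'EOF'
image_id,label,variety
100330.jpg,bacterial_leaf_blight,ADT45
100365.jpg,bacterial_leaf_blight,KarnatakaPonni
100382.jpg,blast,ADT45
100632.jpg,normal,ADT45
EOF

# cat sends the file to standard output; head keeps only the first lines.
cat "$csv" | head -n 3

# Count the rows that mention ADT45: grep filters, wc -l counts lines.
n_adt45=$(cat "$csv" | grep ADT45 | wc -l)
echo "ADT45 rows: $n_adt45"
```

As noted in the session, `head "$csv"` gives the same result as `cat "$csv" | head`; the piped form is shown only because it composes with the other tools.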
Excuse me, Jeremy, if you're interested, my Paperspace Jupyter instance has started up now. 00:28:45.960 |
Hooray. So, I don't know if yours would have to. Look at that. 00:28:53.080 |
Fantastic. All right. So, it's probably worth just quickly going through the exact same process 00:29:08.200 |
one more time, I guess, isn't it? So, we open up the terminal. Pip install Kaggle minus minus user. 00:29:20.200 |
Ah, that's interesting. So, this is because I installed stuff to that Conda directory the 00:29:31.400 |
other day. And so, if I go which pip, it's actually finding that one. And I don't want 00:29:37.240 |
there to be a pip there. So, we'll remove it. 00:29:50.360 |
Do I have to reopen this terminal? How confused is it? Which pip? There we go. Okay, now it's happy. 00:30:06.680 |
Control R, install. To find the last thing I typed, saying install. 00:30:25.560 |
Radek's approach. I'm putting it in pre-run, so that way we have the ability to use this if we wish 00:30:35.320 |
in Jupyter. Not echo. So, export path equals dot local 00:30:44.520 |
bin and then the current path. One of the confusing things I find about 00:30:53.400 |
Bash, and it got me a couple of times, is that if you export something, a variable name, 00:31:02.440 |
you need to have the equality sign straight after the variable name. Oh, yeah, no space. 00:31:08.760 |
It won't work. And, you know, it's just one of these little quirks where things are different. 00:31:14.840 |
Yeah, you know, Bash is a very old program, and it has these weird old quirks about 00:31:22.840 |
white space sensitivity. So, that's a really important point to mention. Thank you. 00:31:26.760 |
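A minimal sketch of the export being discussed, including the whitespace pitfall. `DEMO_VAR` and `first_entry` are illustrative names; the failing form is run in a subshell so it cannot disturb the current session.

```shell
# Prepend ~/.local/bin so tools installed with `pip install --user` are found first.
export PATH="$HOME/.local/bin:$PATH"

# The first colon-separated entry is now the directory just added.
first_entry=${PATH%%:*}
echo "$first_entry"

# With spaces around "=", the shell sees three separate words instead of one
# assignment, and export rejects "=" as a variable name.
if ! (export DEMO_VAR = value) 2>/dev/null; then
  echo "export with spaces around = failed"
fi
```

Putting the `export` line in `~/.bashrc` (or, on Paperspace, in the pre-run script mentioned in the session) makes it apply to every new shell.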
And I'll run it here as well, rather than restarting. And so, now Kaggle 00:31:38.440 |
should exist. It does. It runs. That's good. All right. And so, 00:31:56.120 |
let's copy this into my downloads directory. Or else, I guess what I could do... 00:32:05.560 |
Yeah, let's just do that. Copy tilde slash dot Kaggle 00:32:14.840 |
Kaggle slash mount slash the user's J downloads. 00:32:25.240 |
And so, we should be able to now upload it from my downloads directory. 00:32:42.840 |
it's created a dot Kaggle directory for us. Wait, oh, sorry, this is my wrong. Sorry, 00:32:53.960 |
let's do that again. CD tilde slash dot Kaggle. Yeah, it's created a Kaggle directory for us. 00:33:01.960 |
And so, we should be able to move the thing that we just uploaded to slash notebooks 00:33:06.920 |
into here. And the permissions will be wrong. So, we can fix them. 00:33:21.560 |
Okay. And so, let's see if it works here as well. 00:33:25.560 |
It does. And Paperspace's network is faster than my connection in Australia, not surprisingly. 00:33:44.760 |
Although, you know, mine wasn't bad, actually. 00:33:49.480 |
Okay, so... Oh, that was a dumb place to put it, obviously. I don't want to put it 00:34:03.080 |
in paddy disease classification. You know, we're only going to use this for this notebook, 00:34:09.720 |
I guess. So, maybe move that to slash notebooks. 00:34:33.800 |
So, that means... Okay, that's interesting. There's no unzip, but we know how to deal with that. 00:34:54.360 |
Oh, because Control-R does a refresh. Oh, that's annoying, isn't it? 00:35:00.280 |
So, how do we search our history in these terminals? 00:35:08.200 |
Oh, well. That's fine. I will just type it in manually and we will figure out 00:35:18.280 |
how to make Control-R work at some other point. So, micromamba minus c conda-forge 00:35:32.200 |
minus prefix tilde slash conda install. Probably need the install first. 00:35:44.280 |
Yeah, a lot of the keyboard shortcuts don't work in the browser-based terminal, 00:35:52.440 |
which is actually pretty annoying. They work a bit better on Mac than on Windows, 00:36:00.840 |
because on Windows, the Control key is both used for the Linux terminal commands, and it's also 00:36:07.480 |
used for the normal browser commands. Whereas on Mac, they use command for the browser commands, 00:36:13.320 |
and so the Control key doesn't get overwritten. So, this would probably be a better experience 00:36:17.080 |
on Mac, actually, than Windows. Okay, so we're going to install Unzip. 00:36:32.200 |
And hopefully, by the time people watch this video, if it's like 00:36:38.920 |
July or later, things like Mamba and Unzip will already be installed. Okay, let's check. 00:36:48.680 |
Okay, we have an Unzip. That's good. Okay, so that is on its way. 00:37:01.240 |
So, that's going to use up a gigabyte of space in my persistent storage, which 00:37:11.000 |
you might not want to do that, right? And if you don't want to do that, then instead, you should 00:37:16.680 |
unzip it into your home directory. If you unzip it into your home directory, it won't be there if 00:37:22.520 |
you close it down and reopen it, right? So, you might want to create a little script for yourself 00:37:26.840 |
that does the Kaggle download and the Unzip on your notebook, and then you can run that 00:37:32.520 |
each time you start it up. So, these are the issues. I mean, look, having said that, 00:37:39.080 |
the average cost on Paperspace for storage, I believe, is $0.29 per gigabyte per month. 00:37:45.240 |
So, your convenience of putting it in storage is probably worth $0.29 for the one month. You're 00:37:52.040 |
probably going to want it there. So, maybe that's just a better plan. I do know, though, that the... 00:37:58.760 |
Well, maybe this is a problem, actually, because I do know the Paperspace 00:38:04.520 |
/notebooks and /storage are very, very, very, very slow. And we can actually see that 00:38:12.920 |
when we're unzipping this. So, maybe this is a bad idea. Maybe we shouldn't put data, 00:38:20.600 |
at least when there's lots of files. Because this is painful. I'm going to cancel it and see 00:38:28.440 |
how far it got. du minus sh train images. 426. And how about test images? 00:38:47.800 |
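The size check used at this point is presumably `du -sh`. A small sketch on a scratch directory, where the 1 MiB file simply stands in for the images:

```shell
# Scratch directory with one 1 MiB file, standing in for train_images.
demo=$(mktemp -d)
mkdir "$demo/train_images"
head -c 1048576 /dev/zero > "$demo/train_images/blob.bin"

# -s: a single summary line for the whole directory
# -h: human-readable units (K, M, G)
size=$(du -sh "$demo/train_images" | cut -f1)
echo "$size"
```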
Wouldn't you know it? It was nearly finished. But, yeah, I think this is actually slower. 00:39:15.480 |
And I have a strong feeling if we move it back to our home directory, 00:39:19.400 |
it's going to be faster. I sure hope so. And the reason I care is not so much for the unzipping 00:39:26.680 |
speed, but when it comes to training a model, we don't want it to be taking ages to open up 00:39:32.120 |
each of those files. You see, even RM minus RF takes a long time. So, while that's running, 00:39:44.840 |
let's move the paddy zip file, pop it into our home directory. 00:39:58.680 |
There we go. And then cd to our home directory. 00:40:09.240 |
So, in terms of the steps we're going to do, it would be first we would make a directory for it. 00:40:25.160 |
We would then do the Kaggle download. Actually, which we can just copy easily enough from Kaggle. 00:40:35.720 |
And then we would unzip. Let's see how long it takes. So, the time Unix command 00:40:42.360 |
runs whatever command you put after it and tells you how long it took. 00:40:51.560 |
Did I not move it there? Oof.dot/paddy. Oh, I didn't move it there. Yeah. Time, unzip, quietly, 00:41:01.800 |
paddy. So, yeah. So, I think what I would do, now I think about it, is 00:41:19.080 |
I would have a paddy directory in my notebooks. I wouldn't store anything big here. I just have 00:41:25.640 |
my notebooks here. And I would put a script here called get data, say. And 00:41:37.000 |
it will just have each of the steps I need. So, the steps would be cd to my home directory, 00:41:47.080 |
make the paddy folder, cd to the paddy folder, do the wget, 00:42:03.240 |
or not wget, Kaggle competitions download, I should say, unzip it, paddy disease. 00:42:17.800 |
Yeah. And I think that's it, right? So, we can make that executable with chmod u+x to add the 00:42:32.120 |
executable permission to it. And so, yeah. So, then all I have to do is run that thing each time I 00:42:43.480 |
start up Paperspace. And, yeah, it's only going to take eight seconds to unzip, and it took about 00:42:48.920 |
five seconds to download. So, actually, that's not going to be really any trouble at all, is it? 00:42:56.600 |
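The script described above might look roughly like this. It is a sketch under assumptions: the file name `get_data.sh` is a guess at what "get data" was saved as, `set -e` and `mkdir -p` are additions for safety, and the script is only written and made executable here, not run, since running it needs the `kaggle` tool, a valid token, and `unzip`.

```shell
# Write the helper script (the name get_data.sh is an assumption).
cat > get_data.sh <<'EOF'
#!/usr/bin/env bash
set -e          # stop at the first failing step (an addition, not from the session)
cd ~            # the home directory is fast but not persistent on Paperspace
mkdir -p paddy  # -p: no error if the folder already exists
cd paddy
kaggle competitions download -c paddy-disease-classification
unzip -q paddy-disease-classification.zip   # -q: quiet, no per-file output
EOF

# Add execute permission for the owner so it can be run as ./get_data.sh
chmod u+x get_data.sh
```

Keeping the script under `/notebooks` means it survives restarts, while the data it fetches lands in the faster, non-persistent home directory.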
Cool. And that's, you know, /notebooks, remember, is persistent on this machine. 00:43:07.240 |
So, that's all good. So, now we can create a notebook for it. 00:43:13.160 |
And so, my first step is always just to import computer vision functionality in general. 00:43:23.720 |
Which is the same thing we used yesterday. And now you know exactly what that does. 00:43:30.600 |
And then my second step is to look at the data. So, it's easiest to look at the data if we set a path 00:43:38.680 |
to it. So, it's going to be in our home directory. And it's going to be called paddy. 00:44:04.920 |
Well, that's okay. It's just /paddy, right? It can go Path.home(). Wow. I didn't know that. 00:44:10.520 |
That's quite neat. Yeah, it is quite neat. So, that's that. Okay. So, we can path.ls 00:44:20.600 |
tells me what's in there. And if you remember my trick from yesterday, I also like to set that 00:44:30.680 |
to be the Path.BASE_PATH just so that my LSs look a bit easier to read. 00:44:39.720 |
There we go. So, at this point, we could create a data frame by reading in the CSV of path/train.csv. 00:44:58.760 |
Okay. So, we've got 10,000 rows. Each one is a JPEG. Each one's got a label. And so, 00:45:09.320 |
let's take a look at one of the images, shall we? 00:45:23.320 |
Oh, yeah, PILImage. Path/train/ actually, you know, let's make life a little bit easier 00:45:48.040 |
for ourselves by creating a train path. Because, you know, it's just so good to be lazy. 00:46:01.320 |
/100330.jpg. Oh, no, because then they're inside the label directory. 00:46:27.000 |
Yes. So, what we actually probably should have done would be to say train path equals path slash 00:46:41.080 |
train_images. And that's another good reason to put it in a variable. So, you only have to change 00:46:48.840 |
it in one place. And so, there we have that. And so, let's create, I don't know, let's call it the 00:46:57.000 |
bacterial_leaf_blight_path=train_path/bacterial_leaf_blight. 00:47:11.960 |
So, now we should be able to go BLB and look at that image. 00:47:30.120 |
All right. So, might be nice to, like, find out a bit about this. 00:47:48.680 |
Maybe look at the size. So, it's a 480 by 640 image. 00:47:56.360 |
Great. You know, another way we can take a look at an image, you might remember from yesterday, 00:48:13.080 |
you can go files=get_image_files and pass in a path. 00:48:21.080 |
And this will be recursive. So, I can do this. 00:48:29.160 |
As you can see. So, this has got the 10,000. Okay. And that number there matches that number there. 00:48:36.840 |
So, that's a good sign. And so, another way to do that would have been to go 00:48:49.880 |
Okay. And we could even take a look at a few, right? So, if we wanted to check that the image 00:49:02.120 |
size seems reasonably consistent, we could go o.size for o in, well, actually, PILImage 00:49:11.000 |
dot create o dot size for o in files 10, for example. 00:49:28.120 |
So, you know, this is not particularly rigorous, but it looks like they're generally 480 by 640 00:49:33.080 |
files. They're all the same size, which is handy. That's interesting. 00:49:47.080 |
And they're probably bigger than we normally need. You know, we normally use images that are about 00:49:55.720 |
224 or so. Having said that, I don't know if, like, presumably this is some disease 00:50:04.360 |
thing, paddy disease competition. So, it's rice. 00:50:16.280 |
Classify the images according to their disease. So, I can't even tell that this thing has a disease. 00:50:26.280 |
So, I don't know how big it needs to be to see the disease. So, it is possible it'll turn out 00:50:35.880 |
that we actually need full-sized images. So, like, I would start by using smaller images 00:50:46.840 |
Anyway, 640 by 480 is not giant. So, we should be fine. 00:51:03.160 |
has got one extra bit of information, which is the variety. 00:51:10.440 |
Radek, did you find out what this variety thing is about? 00:51:13.800 |
From the doubt, I didn't even know that that CSV file existed. 00:51:19.240 |
But it's fun because we can build a multimodal model from data. 00:51:32.360 |
as opposed to the type of disease. Yeah. So, maybe 00:51:38.440 |
the different diseases might look different depending on what type of rice it's on. 00:51:46.280 |
My guess is that we wouldn't need to use that information because given how many images there 00:51:52.680 |
are, I would guess that it's going to do a perfectly good job of recognizing the varieties 00:52:00.760 |
by itself without us telling it. Unless there's a whole lot of different types of varieties, 00:52:05.880 |
which we can check easily enough by checking the data frame, grabbing the variety, 00:52:14.200 |
and doing a .value_counts(). And we can see how many there are of each. 00:52:23.800 |
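In the session this is a one-liner on a pandas data frame (grab the variety column and call `.value_counts()`). As a dependency-free sketch of the same idea, the snippet below tallies a column with the standard library. The sample rows, the label names, and the column names other than `variety` are made up for illustration and are not the competition's data.

```python
import csv
import io
from collections import Counter


def value_counts(rows, column):
    """Return (value, count) pairs for `column`, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


# Illustrative rows only; not taken from the real CSV file.
SAMPLE_CSV = """image_id,label,variety
a.jpg,label_1,ADT45
b.jpg,label_2,ADT45
c.jpg,label_1,variety_b
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(value_counts(rows, "variety"))
```

If one value dominates the counts, as it does here, the column is unlikely to add much information beyond what the images already provide.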
Okay. So, look, I mean, there's a couple of tiny varieties, but on the whole, 00:52:29.880 |
most of it is ADT45 and quite a bit of KarnatakaPonni. It does seem like a bit of a rice session today, 00:52:40.440 |
doesn't it? Lots of rice going on. Yeah. So, I think it's very unlikely that this variety 00:52:49.000 |
field is going to help because there's so many examples of the main one anyway that it's going 00:52:57.080 |
to be able to recognize it. I mean, at some point we can try it, but I would be making 00:53:06.520 |
that a pretty low priority for this competition. And so, yeah, given we're doing a practice walk 00:53:14.120 |
through, I'd be inclined to fire up Fastbook and the intro and see if we can just basically do the 00:53:25.480 |
same thing that we did last time. So I'm going to merge these back together again. We've already 00:53:34.280 |
got those, too. We've got those. Well, there's not much there, is there? Oh, I'm in APL mode. 00:53:53.960 |
I wonder why things aren't working. I don't know how that happened. I haven't used APL today. 00:53:59.320 |
Copy, paste. Okay. So this is how we did cats. So we needed a labeling function. Now, in our case, 00:54:10.600 |
the labels are very easy. Each image is inside the directory, which is its label. So the parent 00:54:17.480 |
folder name is the label. And so we already have a function to label from folders. So we can actually 00:54:29.240 |
just do ImageDataLoaders.from_folder, because that's all we need. So we're still going to need 00:54:35.960 |
the path. Train and valid actually have different names. So let's fill all of those in. So we're 00:54:42.200 |
going to have path train equals train underscore, what was it? Images? Yep. And test images. Train 00:54:58.600 |
underscore images. Valid percent. So that's fine. We'll do that the same as last time. 00:55:14.680 |
It's expecting to have train and valid subfolders. Or valid percent. So hopefully that'll work. 00:55:28.680 |
Let's try it. And we'll use the same resize as last time. 00:55:43.640 |
All right. Oh, no, that did. Well, did that work? No, it didn't work because we've got, 00:56:02.440 |
that's interesting. Test images. So my guess is it's got confused by the fact. 00:56:08.120 |
Yes. Okay. So possibly what we should instead do is use train path here. 00:56:21.480 |
And use valid percent instead. I wonder if that'll fix that problem. 00:56:29.320 |
There we go. That fixed that problem. Okay. Great. So we should then be able to create a learner. 00:56:47.800 |
And learn.fine_tune. Let's just do one epoch to start with. 00:57:01.160 |
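Put together, the steps above might look roughly like the sketch below. It assumes a recent fastai release that provides `vision_learner`. The wrapper name `train_one_epoch`, the `resnet34` architecture, the 0.2 validation fraction, and the seed are assumptions, not values confirmed in the session. `trn_path` stands for the folder whose subfolders are the class labels.

```python
def train_one_epoch(trn_path, valid_pct=0.2, seed=42):
    """Build data loaders from label-named folders and fine-tune for one epoch."""
    # Imported lazily so that defining this function does not require fastai.
    from fastai.vision.all import (
        ImageDataLoaders,
        Resize,
        error_rate,
        resnet34,
        vision_learner,
    )

    # Each image's parent folder name is used as its label; a random
    # fraction of the images is held out for validation.
    dls = ImageDataLoaders.from_folder(
        trn_path,
        valid_pct=valid_pct,
        seed=seed,
        item_tfms=Resize(224),
    )
    learn = vision_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(1)
    return learn
```

Pointing the loader at the training folder itself, rather than at the dataset root, keeps any sibling folders (such as the test images) from being treated as classes.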
There it goes. So it can be useful to kind of make sure it's 00:57:11.400 |
being reasonably productive as it's training. And we can do that with nvidia-smi. 00:58:16.920 |
Okay. That's just finished. So while it was running, so this is something people often say 00:58:33.640 |
to use watch nvidia-smi to like have it refresh. But actually I don't think most people know that 00:58:40.600 |
there's a dmon subcommand where you can use that just as you can see it shows you every second how 00:58:45.160 |
it's going. And it's showing me the most important thing is this column SM. SM stands for streaming 00:58:52.120 |
multiprocessor. That's kind of what they call it instead of a CPU for their GPUs. And it's showing 00:58:57.800 |
me that it's being used 70 to 90 percent kind of effectively if you like. And that's a good sign. 00:59:06.600 |
That's fine. So if this was like under 50, then that would be a problem. But it looks like it's 00:59:13.240 |
using my GPU reasonably effectively. Yeah, and it's got the error rate down to 13 percent. 00:59:20.440 |
So we are successfully training a model. So that sounds good. So, Jeremy, just a quick question. 00:59:29.240 |
When you're saying that like if it's under 50 percent, then that can be a problem. Is that 00:59:33.560 |
because you've oversized the GPU like when you selected it or like just just want to clarify 00:59:39.480 |
what you know about that? What that would mean. Yeah, thanks. It's a good question. 00:59:45.320 |
Just rename this. It would probably mean that we're not able to read and process the images 00:59:54.760 |
fast enough. And so in particular, my guess is that if they're in slash storage or slash notebooks, 01:00:02.200 |
you would see the SM percent be really low because I think it would be taking a really 01:00:05.880 |
long time to open each image because it's coming from a network storage. And so generally, yeah, 01:00:11.720 |
a low SM means that your IO, your input output, your reading or processing time is too high. 01:00:17.960 |
And so the ways to fix that would be a few. One would be to move the images onto the local machine 01:00:24.120 |
so they're not on a network drive. A second would be to resize the images ahead of time 01:00:29.480 |
to make them a more reasonable size. And a third would be to decrease the amount of kind of 01:00:35.560 |
augmentation that you're doing. Or another would be to pick a different instance type with more CPUs. 01:00:41.880 |
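The first of these fixes, moving the images off network storage, can be sketched with the standard library as below. The function name `copy_to_local` and both paths in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path


def copy_to_local(src, dst):
    """Copy the dataset folder `src` to `dst` once and return `dst`."""
    src, dst = Path(src), Path(dst)
    if not dst.exists():
        # copytree creates `dst` and copies the whole folder tree.
        shutil.copytree(src, dst)
    return dst


# Hypothetical usage: copy from a network-mounted folder to the home directory.
#   local_path = copy_to_local("/storage/some_dataset", Path.home() / "some_dataset")
```

Skipping the copy when the destination already exists means the call is cheap to repeat at the top of a notebook.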
So those are basically the things. All right. Okay. Just to add, the nvidia-smi 01:00:49.880 |
command also has a lot of useful information like your CUDA version and stuff like that. So 01:00:57.800 |
you know, it's also a useful command even without dmon to know that it exists. Yep. A lot of details 01:01:08.680 |
here. So if you're looking for the IDX of your GPU, it might be GPUs. And some of the variables 01:01:16.040 |
here are a little bit more descriptive. So it might be easier to get started with that command 01:01:22.440 |
or to at least use it every now and then. And if you'd like to have this one running in a loop, 01:01:27.400 |
which is what I generally do, just do nvidia-smi -l. Yeah. Yeah. I mean, I agree this is useful, 01:01:36.920 |
but I would suggest in a loop to use dmon because there's only two columns you care about. 01:01:41.800 |
And this one does not show you SM, right? So if you want to actually see it's being utilized, 01:01:48.440 |
you need to use dmon. And you can also see the percentage memory utilization. 01:01:52.280 |
So just look at these two columns. The other ones you can actually ignore. 01:01:57.880 |
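For anyone who wants to log those two columns rather than watch them, here is a rough sketch that parses `nvidia-smi dmon` output. The helper names are hypothetical, and the sample text is illustrative rather than captured from the session; the exact set of columns can differ between driver versions, which is why the parser reads column names from the header line.

```python
import subprocess


def parse_dmon(text):
    """Parse `nvidia-smi dmon` output into a list of {column: value} dicts."""
    header = None
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first '#' line holds the column names; later '#' lines
            # are units or repeated headers and are skipped.
            if header is None:
                header = line.lstrip("#").split()
            continue
        if header is not None:
            samples.append(dict(zip(header, line.split())))
    return samples


def mean_utilization(samples, column="sm"):
    """Average a percentage column, ignoring missing ('-') readings."""
    values = [int(s[column]) for s in samples if s.get(column, "-").isdigit()]
    return sum(values) / len(values) if values else None


def sample_gpu(count=5):
    """Collect `count` samples from nvidia-smi (requires an NVIDIA GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "dmon", "-c", str(count)],
        capture_output=True, text=True, check=True,
    )
    return parse_dmon(result.stdout)


# Illustrative output only; real columns and values will differ.
SAMPLE = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0   120    60     -    80    40     0     0  5000  1500
    0   125    61     -    90    45     0     0  5000  1500
"""

print(mean_utilization(parse_dmon(SAMPLE), "sm"))
```

An average `sm` reading well below 50 over a training run would point to the input pipeline, not the GPU, as the bottleneck.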
Yeah. Okay. I think that's a pretty good place to stop. I'm glad you put us onto this competition 01:02:05.640 |
Radek. It looks fun. And I feel like we've got a reasonable start. So yeah, maybe next time we can 01:02:16.040 |
try doing a submission. And we could also try creating a Kaggle notebook for other people to 01:02:27.640 |
see. How does that sound? Sounds excellent. One thing I also like about this is that 01:02:39.160 |
we're coming up across problems as we go and jumping through those hoops. And these are the 01:02:49.080 |
beginner sorts of roadblocks that we'll have to face, I guess. Exactly. And if you guys, you know, 01:02:56.680 |
repeat these steps or do it on another dataset or whatever and hit some roadblocks, then it's really 01:03:04.440 |
helpful. If you solve them, you know, come back tomorrow and tell us what happened and how you 01:03:09.480 |
solved it. And if you didn't, come back tomorrow and tell us to fix it for you. I think they're 01:03:14.280 |
both useful things to do. So things like Radek's example of like doing a bash environment variable 01:03:20.760 |
and having a space next to the equal sign, you know, that kind of stuff. I forget even to mention 01:03:26.200 |
it, but really useful information. You know, this competition is nice because it's relatively small, 01:03:33.080 |
like 10,000 images, and it's aligned with what you're doing in the course. But if you'd like to 01:03:37.800 |
try something out on a competition that is not active right now, you can still do this. Because Kaggle 01:03:43.800 |
allows you to do this late submission thing. And this opens up many competitions to play around with. 01:03:54.440 |
The current competitions that are, how do they call it, ranked competitions, so they award you 01:04:01.640 |
points and there are prizes, they are not on images. So if you explore something on your own to 01:04:09.480 |
try the methods on another image competition, that might be something quite useful. 01:04:16.120 |
So to find those, you need to scroll to the bottom and click explore all competitions. 01:04:22.280 |
And yeah, this will let you see closed competitions as well. 01:04:34.920 |
And you can even see, I guess, here you go, you can find out which were the ones with the 01:04:44.520 |
most popular of all time. That can be interesting. Crypto forecasting. Well, of course it would be. 01:04:50.840 |
That's a bit sad, but there you go. That's interesting. This patent phrase one is super 01:04:59.480 |
popular. That's good to see. Instant gratification. All right. Thanks all. See you next time. Bye.