Live coding 7


Chapters

0:00
0:42 Background for Kaggle competitions
10:00 Setting up for Kaggle competitions on your local machine
14:30 Create API token for Kaggle
18:00 Download Kaggle competition zip file
25:00 Piping output to head
28:55 Back to Paperspace
29:30 Remove pip from /storage
30:00 Install kaggle and update symlinks
32:00 Upload Kaggle API JSON
33:20 Download Kaggle competition file to Paperspace
35:00 Install unzip to persistent conda env
36:45 Unzipping Kaggle file in notebooks is too slow
40:00 Unzip Kaggle file in home directory for speed
41:20 Create an executable script for unzipping Kaggle file
43:10 Create a notebook to explore Kaggle data
48:00 Browse image files
51:00 Review image metadata
53:00 Image data loaders and labelling function
56:30 Create a learner
57:00 Monitor training with nvidia-smi dmon
62:00 Summary

Whisper Transcript

00:00:00.000 | Okay, so let's, yeah, let's look at how we can automate a process like the one that we
00:00:17.540 | hacked together yesterday. And actually what we might do is have a look at something that
00:00:32.160 | Radek put on the forums because I kind of want to have all the stuff we want to include
00:00:39.840 | there and then we can automate the whole thing. Yeah, here's Radek's thing. Okay, so Radek,
00:00:48.960 | do you want to tell us about what this thing is? What your project, your goal, or what
00:00:55.400 | is, yeah, what's this forum post about? This is another of Radek's crazy inventions,
00:01:02.840 | but essentially this is how I learned, right? So if I, this is the fast.ai way of learning.
00:01:11.600 | I think of this as the fast.ai way of learning. So for those who haven't read it yet, can
00:01:17.520 | you just take us through what it is? Oh, yeah, absolutely. So yesterday in the walkthrough,
00:01:24.800 | we covered some material and I wanted to find a fun way to practice it. So I went to Kaggle,
00:01:31.680 | I looked for competitions and there was a competition that seemed like participating
00:01:38.080 | in it could allow me to practice what we learned in the walkthrough. So I created this
00:01:44.200 | as a resource for others. You know, I've been creating such things for a couple of years
00:01:48.840 | now and maybe it requires also a little bit of understanding of Kaggle, which somebody
00:01:56.160 | who has not played around with Kaggle yet, they might not have it. So I thought, Hey,
00:02:00.880 | let me put this together, share it with others. So this is a getting started part with a community
00:02:08.800 | competition. It features images of plants. And you're supposed to detect one of nine
00:02:19.560 | or one of 10 classes of plant diseases that they can be affected with. I haven't looked
00:02:28.760 | too much at the data. Essentially, I relied on fast.ai's functionality, where if I present
00:02:36.400 | the image data loaders, and then by that token, if I present the learner with data that seems
00:02:46.560 | to be appropriately formatted, then the learner will do the rest. So I tried to breeze through
00:02:53.200 | this as quickly as I could just to put it together and to practice on my own as well
00:02:59.680 | while doing this. And hopefully this can be useful to others.
00:03:04.800 | Great. Shall we try going through it together? Somebody's got a fast.ai one here
00:03:11.440 | already. Sounds fun. Let's try it. So that's great. And I like the fact that you're looking
00:03:27.560 | to run it on Paperspace rather than on Kaggle. I think that's good practice. So let's do
00:03:37.640 | the same thing that Radek did. So on Kaggle, you can see all the competitions that are
00:03:50.480 | running, active competitions. So there are various different types, right? There's kind
00:03:59.440 | of the normal ones that have money involved. And as well as having money involved, they
00:04:09.040 | also have ranking points. So that gives you the opportunity to try to become a master
00:04:17.160 | or grandmaster or whatever by getting ranking points. Then there are some which are just
00:04:26.280 | for kudos. So there's no money involved and no ranking points. And you can kind of use
00:04:41.660 | the little buttons at the top to find getting started competitions. So those are just knowledge.
00:04:53.880 | Some have prizes. I wonder what prizes they have. A TPU star. 20 extra hours of TPU time
00:05:02.320 | for four weeks. There you go. Kind of sounds like a drug. Like you use TPUs to win and
00:05:13.640 | then you need more to keep going. I love it. And then there's also playground competitions
00:05:22.760 | which they kind of repeat each month. All right. So looks like Radek's picked out a kudos competition.
00:05:32.640 | So we're not going to get any money or ranking points. We're just doing it for the enjoyment
00:05:37.800 | and the learning. So we have to click, before you can download the data, you have to click
00:05:44.160 | join competition. And this is a really common mistake people make is they try to download
00:05:49.840 | the data without doing that. You'll get an error. So as it says you can download the
00:05:56.720 | data by running this command here. Now that's not going to be installed yet. So we need
00:06:04.480 | to install it. So we can install it with pip install --user, etc. I feel like
00:06:19.080 | given how much we're typing pip install --user whatever, it might be worth creating
00:06:24.320 | an alias for it. This is taking a long time. >> Do you like Paperspace so far generally?
00:06:49.560 | >> Very, very much. Yeah. It's the first platform I've found which I feel like I can use, you
00:07:00.480 | know. >> Nice. I'll have to give it another try then. >> It's not opening my JupyterLab. Yeah,
00:07:12.480 | make sure you do all the previous walkthroughs because it really does take you through like
00:07:18.120 | how to take advantage of this and it's yeah, otherwise it's not particularly exciting.
00:07:23.440 | All right. I'm having some trouble with JupyterLab. So I'm just going to go and use their
00:07:29.360 | rather unappealing GUI. Hopefully that'll work. No, even that's not working. >> Do they
00:07:37.080 | have the ability to SSH into these machines? Do you know? >> They do not. But because you've
00:07:42.920 | got Jupyter Lab installed, you get a full terminal. So it doesn't really make any difference.
00:07:47.920 | And you can also connect VS Code to them. Anyway, for some reason, this is the first
00:07:58.600 | time this has happened. It's not liking me. >> I can't open a Jupyter Lab notebook either.
00:08:09.800 | It's timing out. I'll see if I can fire it up in VS Code then. It's unusually sluggish
00:08:22.600 | today. >> This is my experience whenever I've tried
00:08:30.680 | it before. Maybe it's just I'm here, so the curse came with me.
00:08:34.520 | >> It used to happen to me all the time. And they seem to have really improved. I'm going
00:08:44.280 | to try a paid one. >> I've never had one before. >> I'll try a
00:09:10.440 | green dot. It's promising. >> I've got two tabs. The green dot is the
00:09:34.240 | free one. >> I can't connect through VS Code either.
00:09:40.840 | >> Not wanting to start. >> It should be a red dot. It's not working.
00:09:47.880 | >> No worries. Change of plans. We'll do things locally. We will yell at Paperspace. What
00:10:00.240 | the hell just happened? Let me just switch to my other user. Let's see if Kaggle's
00:10:24.520 | installed. There's no Kaggle. Here I am with no Kaggle. I'll start by creating a tmux
00:10:32.920 | session. I like to be able to run a few things at the same time. I would run pip install
00:10:45.920 | --user kaggle. There we go. Now, Kaggle is not just a Python library, but it also
00:11:04.560 | has a command line tool. And because I did --user, it installed the command
00:11:11.160 | line tool into my home directory, into the .local folder. And binaries, things you
00:11:19.640 | can execute, are generally put in bin. But .local/bin in my home directory isn't
00:11:26.480 | in my path, and therefore I can't type kaggle. As we know, to fix that, if you're on
00:11:36.200 | Paperspace, you would modify /storage/.bash.local, or here on my local machine,
00:11:43.400 | I would just modify the .bashrc in my home directory. >> On Paperspace, if you do this, if you
00:11:52.880 | modify .bash.local, it will not run before Jupyter notebook runs. >> Correct.
00:11:59.840 | Which is fine. >> Okay. >> Yes. Because Jupyter notebook doesn't
00:12:04.560 | need access to this, unless you want to do exclamation mark kaggle. If you want to put
00:12:12.240 | exclamation mark kaggle in Jupyter, you would need to put it in pre-run.sh. Is
00:12:17.400 | that your point there? >> It is, yes.
00:12:22.200 | >> Cool. Cool. Great. Yeah. So, something in .bash.local will only execute in a
00:12:29.400 | new terminal. Sorry. Go on. >> So, I suppose, like, in some operating systems,
00:12:36.000 | I think the local bin directory is on the path by default. So, maybe, do you think that
00:12:42.200 | it's just whatever reason? >> I've never seen that. But it's possible.
00:12:46.040 | There's a lot of distributions around. >> I could be confusing it with, like, /usr/local/bin
00:12:52.800 | or whatever. There's some of them on Mac. >> Yeah, /usr/local/bin
00:12:58.960 | is not in your home directory, and that is always part of your path. But this is something
00:13:02.960 | in your home directory. >> Oh, yeah. Okay.
00:13:08.400 | >> All right. Okay. So, by default, Ubuntu has a bunch of stuff in your .bashrc, by the
00:13:18.800 | way. So, I'm just going to go to the bottom. So, to go to the bottom in Vim, it's Shift-G
00:13:22.560 | to go to the bottom. And then O to open up a new line underneath this one. So, insert
00:13:31.040 | beneath. And so, we will export PATH equals ~/.local/bin. And then a colon
00:13:44.920 | and then everything that's already in your path. So, that prepends it to our path. So,
00:13:49.800 | I could close and reopen my .bashrc, or I can re-execute it. And any exported variables
00:13:57.920 | I want to go into my shell. So, to execute stuff and put variables into your current
00:14:02.160 | shell, you can type source. So, source .bashrc is going to save me having to close
00:14:07.520 | and reopen my terminal. And .bashrc is the last thing on the last line. So, I can
00:14:12.880 | just do that. And so now we can run kaggle. Okay. So, the next thing we need is somewhere to
00:14:23.320 | authenticate. And Kaggle uses something called kaggle.json to do that. So, if I go to
00:14:38.720 | Kaggle, you can grab it. There we are. You can grab it by clicking create new API token.
00:15:06.680 | And what that will do is it will download a file called kaggle.json to your computer.
00:15:31.200 | And so, once it's downloaded, depending on where you are: on Mac, it might be in your
00:15:36.760 | ~/Downloads directory. And on Windows, /mnt/c
00:15:42.440 | is your Windows C drive, and it will be in your Users, username, Downloads directory.
00:15:53.080 | So, they say it needs to be in a directory called .kaggle. So, I'll go make .kaggle.
00:16:06.880 | I think it's probably just created that for us when we tried to run it. That's good. So,
00:16:10.160 | now I can copy it. And in this case, I think what I'm going to do is just copy it from
00:16:22.520 | my other account. Copy .kaggle/, there's my JSON. And I'll copy it into my... by the way,
00:16:36.280 | so I want to get jph00's home directory. So, ~jph00 refers to the home directory
00:16:42.920 | belonging to jph00. Tilde on its own means the current user's home directory. So, I'm
00:16:48.600 | going to copy it over to .kaggle. There we go. And change its ownership so it's owned
00:16:58.120 | by jph00. So, you won't have to do this because you'll be downloading it and copying it from
00:17:05.480 | Downloads. I'm just doing this because I'm copying it from a different user. All right.
00:17:18.840 | So, yep, that now belongs to jph00. So, now I should be able to
00:17:30.440 | go back into that username and type kaggle. Okay, great. So, I've got Kaggle installed.
00:17:38.120 | And we'll do a check from time to time to see
00:17:42.360 | whether anything's working. Not really.
00:17:58.200 | Okay. So, the Kaggle competition
00:18:02.520 | said we can download it with this command. So, I'll copy that.
00:18:13.320 | And let's create a directory for the competition. Paddy. And run that command.
00:18:27.000 | Nice. Gigabyte of data. All right. Did anybody have any questions or anything
00:18:41.160 | about this as we wait for that to download?
00:18:43.720 | So, Jeremy, we can use mamba install here since you're doing it on local, right?
00:18:55.720 | It's just demonstrating Paperspace.
00:18:58.200 | If it's on, yes, it is on conda-forge. So, yeah, mamba install kaggle should be fine.
00:19:09.560 | Although, you know, to be honest, like, for simple pure Python stuff like this, I
00:19:18.200 | often just use pip anyway. Because things like this, like, pretty much most tools are
00:19:28.760 | Python libraries. Like, pip is the main thing people are kind of targeting. So,
00:19:35.880 | you can be sure that that's going to be the most recent version.
00:19:39.320 | Unless the documentation explicitly says, like, we provide conda packages as well,
00:19:44.280 | there's often a good chance that the conda packages will be behind. So, if I was going to
00:19:49.720 | do a mamba install, I would be inclined to, like, double check that this is actually the
00:19:58.360 | most recent version. But, yeah, as you see, I just use pip anyway, I suspect, for something like this.
00:20:10.920 | >> This used to be the case that you -- I remember something about, like, cookies
00:20:18.360 | and there was a browser extension and maybe you had your own tool for this. Or am I just
00:20:23.160 | hallucinating? Did it used to be this way? From, like, an older fast.ai? Okay.
00:20:29.880 | >> Okay. So, there are always things so we can unzip it.
00:20:36.120 | Okay. So, I hate it when that happens. Because it actually takes ages. So,
00:20:42.920 | -q to quietly unzip it. Right. Okay. So, that's going to give us our data.
00:20:54.920 | I guess one thing is for getting our kaggle.json onto Paperspace,
00:21:03.640 | the easiest way is to click the file upload button in JupyterLab. So, there's just a little
00:21:11.160 | upward pointing arrow button. If you click that, it'll upload it. And then, yes, copy it to
00:21:20.920 | ~/.kaggle. And it does have to have the correct permissions.
00:21:32.680 | Which is hopefully you might be able to recognize this. So, that's 4 plus 2 is 6. And then 0 0. So,
00:21:39.720 | chmod 600 on that file will give you the correct permissions.
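A minimal sketch of the permission step just described, run against a scratch directory rather than a real ~/.kaggle. The scratch location and the JSON contents are placeholders for illustration, not a real token.

```shell
# Work in a scratch directory so no real ~/.kaggle is touched (demo assumption).
demo_home=$(mktemp -d)
mkdir -p "$demo_home/.kaggle"

# Placeholder contents; a real kaggle.json comes from "Create New API Token".
printf '{"username":"example","key":"not-a-real-key"}\n' > "$demo_home/.kaggle/kaggle.json"

# 600 = owner read (4) + write (2) = 6, then 0 for group and 0 for others.
chmod 600 "$demo_home/.kaggle/kaggle.json"

# The first ten characters of `ls -l` show the resulting permission bits.
perms=$(ls -l "$demo_home/.kaggle/kaggle.json" | cut -c1-10)
echo "$perms"
```

On a real machine the same chmod would be pointed at ~/.kaggle/kaggle.json.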
00:21:43.000 | Okay. So, now, the only problem is that this is my desktop, which does not have a GPU. So,
00:21:58.120 | that was actually a stupid place to put this. So, I've got to copy this to my GPU server.
00:22:05.960 | So, to copy files from one Linux or Mac thing to another, a very easy way to do it is SCP,
00:22:15.640 | secure copy, and type the name of the file. And then type where you want to send it to.
00:22:25.000 | Oh, except I don't have that set up here. All right. So, I'm just going to go back to my
00:22:34.760 | normal user. So, you know, copy from ~jph00, get paddy-disease-classification. I'm going to copy
00:22:45.720 | that here. So, you can use SCP to copy a file to another machine. And off that goes.
00:23:11.400 | So, how does it know what local: is? So, there's a very underutilized handy
00:23:20.360 | file called .ssh/config where you can type things like Host local. And when I SSH to that,
00:23:32.360 | it will actually SSH to this host name. And it will actually use this user name. And it will set up,
00:23:39.320 | we haven't talked about SSH forwarding, but if you know about that, it will set up SSH forwarding.
00:23:43.320 | So, this is just a little trick for people who do use SSH, that using the SSH config file is great.
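A rough sketch of what such a config entry could look like. The hostname and username are hypothetical, the file is written to a scratch directory instead of ~/.ssh/config, and the forwarding settings mentioned above are left out because the transcript does not say which ones are used.

```shell
# Write an example entry to a scratch file (a real one lives in ~/.ssh/config).
demo=$(mktemp -d)
cat > "$demo/config" <<'EOF'
Host local
    HostName gpu-box.example.com
    User exampleuser
EOF

# With those lines in ~/.ssh/config, both of these would resolve the alias
# "local" to the HostName and User above (left as comments: they need a real server):
#   ssh local
#   scp paddy-disease-classification.zip local:

# Confirm the alias is defined exactly once in the file we wrote.
alias_count=$(grep -c '^Host local$' "$demo/config")
echo "$alias_count"
```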
00:23:50.600 | And it's not just for SSH, it's also for anything that uses SSH, including SCP. SCP is a secure copy
00:23:58.280 | over SSH. All right. So, now that's done, I can log in to that machine. And now we're on a GPU
00:24:09.880 | machine. So, to check your GPUs, you can type nvidia-smi. And so, this has got three GPUs.
00:24:20.760 | And I can move that file, I just copy it into here. >> So, should we use SCP or rsync?
00:24:30.520 | Oh, that's fine. Yeah. I use SCP just because I don't have to type any flags to it. Strictly
00:24:39.480 | speaking, SCP is kind of considered deprecated nowadays, but it actually works fine.
00:24:48.280 | Unzip that. Cool. Okay. Making good progress. Let's see what we've got. Okay. So,
00:25:02.040 | there's a sample_submission.csv, there's a train.csv, train_images, test_images.
00:25:10.920 | So, ls train_images, if this has got like 10,000 things in it, that's going to be annoying.
00:25:17.640 | So, if you pipe to head, so remember this vertical bar is called pipe, it
00:25:23.080 | means take the output of this program and pass it in as the input to this
00:25:28.600 | program. And this program shows you the first 10 lines of its input. Okay. So, actually, it turns
00:25:35.480 | out that's got folders for each category. So, I don't really need to pipe it to head. Okay. And
00:25:41.400 | so then we could do the same thing with one of these, bacterial_leaf_blight, and pipe that to head.
00:25:48.520 | There we go. So, now we might want to know like how many of those are there. So, instead of piping
00:25:54.120 | to head, we can pipe it to word count, which is wc. But despite the name, it doesn't only count
00:26:00.280 | words. If you pass in -l for line, it'll do a line count. So, that's how many bacterial leaf blight
00:26:08.040 | images there are. So, it's really useful to play around with these things you can pipe into. So,
00:26:16.600 | head, wc, another useful one is tail, which is the last 10 lines. And then one we've seen before
00:26:24.760 | is grep. So, not particularly useful, but show me all the ones with the number 33 in it. Okay.
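The pipes just described can be tried on a small scratch folder. This is only a sketch: the 40 empty .jpg files and their numbering are made up for the demo and are not the competition data.

```shell
# Build a scratch folder of 40 empty files named 100001.jpg .. 100040.jpg.
demo=$(mktemp -d)
mkdir "$demo/bacterial_leaf_blight"
for i in $(seq 100001 100040); do
  : > "$demo/bacterial_leaf_blight/$i.jpg"
done

# head: first 10 lines of the listing; tail: last 10 lines.
ls "$demo/bacterial_leaf_blight" | head
ls "$demo/bacterial_leaf_blight" | tail

# wc -l counts lines, so this is the number of files.
count=$(ls "$demo/bacterial_leaf_blight" | wc -l)

# grep keeps only the lines containing "33"; piping on to wc -l counts them.
matches=$(ls "$demo/bacterial_leaf_blight" | grep 33 | wc -l)

echo "files: $count, names containing 33: $matches"
```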
00:26:34.440 | And you can use head and tail also on files. So, head is very useful for CSV files. If you're in
00:26:41.960 | your Jupyter notebook and it's screaming at you that it cannot read a CSV file, it cannot parse a
00:26:47.640 | CSV file, you can just jump into the console or, even from Jupyter notebook, just do head. Yeah. Well,
00:26:53.320 | let's try it, right? Because so, I think we know that if you type cat and a file name, it will
00:27:01.000 | send it to the output, which by default prints it to the screen. So, we could pipe that to head,
00:27:12.440 | right? Now, real Unix gurus will say, well, that was silly because actually if you look at the man
00:27:20.120 | page for head, if you pass it a file name, it does the same thing. But to me, I prefer to learn a
00:27:25.800 | smaller number of composable things. So, piping stuff to head is not a bad idea. And we
00:27:35.320 | could even, and, you know, another nice thing about cat is I can pipe it into grep and search for
00:27:43.720 | everything with, I don't know, how many of these ADT45s are there? That's grep for ADT45 and then
00:27:52.360 | pipe that into word count but count lines. Yeah. So, you can quickly get some information at the
00:28:05.240 | console, which, yeah, I think can be quite useful. All right. So, next thing to do, I reckon, is to
00:28:18.440 | fire up a Jupyter. So, let's see the get Jupyter notebook.
00:28:35.160 | Excuse me, Jeremy, if you're interested, my Paperspace Jupyter instance has started up now.
00:28:45.960 | Hooray. So, I don't know if yours would have to. Look at that.
00:28:53.080 | Fantastic. All right. So, it's probably worth just quickly going through the exact same process
00:29:08.200 | one more time, I guess, isn't it? So, we open up the terminal. pip install kaggle --user.
00:29:20.200 | Ah, that's interesting. So, this is because I installed stuff to that conda directory the
00:29:31.400 | other day. And so, if I go which pip, it's actually finding that one. And I don't want
00:29:37.240 | there to be a pip there. So, we'll remove it.
00:29:39.640 | Conda, oh, in my home directory.
00:29:45.880 | Okay, let's try that again.
00:29:50.360 | Do I have to reopen this terminal? How confused is it? Which pip? There we go. Okay, now it's happy.
00:30:06.680 | Ctrl-R, install. To find the last thing I typed containing install.
00:30:10.280 | Okay, we've got the path issue again.
00:30:16.440 | So, vim slash... I think I prefer
00:30:25.560 | Radek's approach of putting it in pre-run, so that way we have the ability to use this if we wish
00:30:35.320 | in Jupyter. Not echo. So, export PATH equals ~/.local/bin,
00:30:44.520 | a colon, and then the current path. One of the confusing things I find about
00:30:53.400 | Bash, and it got me a couple of times: if you export something, a variable name,
00:31:02.440 | you need to have the equals sign straight after the variable name. Oh, yeah, no space.
00:31:08.760 | Otherwise it won't work. And, you know, it's just one of these little quirks where things are different.
00:31:14.840 | Yeah, you know, Bash is a very old program, and it has these weird old quirks about
00:31:22.840 | white space sensitivity. So, that's a really important point to mention. Thank you.
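A small sketch of the PATH fix and the no-spaces rule. It uses a scratch stand-in for ~/.local/bin and a made-up command called demo-tool, both assumptions for the demo, so nothing on the real system changes.

```shell
# Scratch stand-in for a home directory with a user-level bin folder.
demo_home=$(mktemp -d)
mkdir -p "$demo_home/.local/bin"

# A tiny fake command-line tool, standing in for a pip --user installed script.
printf '#!/bin/sh\necho "hello from a user-installed tool"\n' > "$demo_home/.local/bin/demo-tool"
chmod u+x "$demo_home/.local/bin/demo-tool"

# Prepend the folder to PATH. There must be no spaces around "=":
# "export PATH = ..." is parsed as several separate words and fails.
export PATH="$demo_home/.local/bin:$PATH"

# The shell can now find the tool by name.
found=$(command -v demo-tool)
echo "$found"

# The equivalent line for a real ~/.bashrc or pre-run.sh would be:
#   export PATH=~/.local/bin:$PATH
```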
00:31:26.760 | And I'll run it here as well, rather than restarting. And so, now kaggle
00:31:38.440 | should exist. It does. It runs. That's good. All right. And so,
00:31:56.120 | let's copy this into my downloads directory. Or else, I guess what I could do...
00:32:05.560 | Yeah, let's just do that. Copy ~/.kaggle/
00:32:14.840 | kaggle.json to /mnt/c/Users/J/Downloads.
00:32:25.240 | And so, we should be able to now upload it from my downloads directory.
00:32:36.840 | There it is. Okay. And so,
00:32:42.840 | it's created a .kaggle directory for us. Wait, oh, sorry, this is the wrong one. Sorry,
00:32:53.960 | let's do that again. cd ~/.kaggle. Yeah, it's created a .kaggle directory for us.
00:33:01.960 | And so, we should be able to move the thing that we just uploaded to /notebooks
00:33:06.920 | into here. And the permissions will be wrong. So, we can fix them.
00:33:21.560 | Okay. And so, let's see if it works here as well.
00:33:25.560 | It does. And Paperspace's network is faster than my connection in Australia, not surprisingly.
00:33:44.760 | Although, you know, mine wasn't bad, actually.
00:33:49.480 | Okay, so... Oh, that was a dumb place to put it, obviously. I don't want to put it
00:34:03.080 | in paddy disease classification. You know, we're only going to use this for this notebook,
00:34:09.720 | I guess. So, maybe move that to /notebooks.
00:34:14.680 | And so, let's create a paddy folder.
00:34:24.440 | Pop it in there. And unzip it.
00:34:33.800 | So, that means... Okay, that's interesting. There's no unzip, but we know how to deal with that.
00:34:41.400 | Why isn't Control-R working for me?
00:34:54.360 | Oh, because Control-R does a refresh. Oh, that's annoying, isn't it?
00:35:00.280 | So, how do we search our history in these terminals?
00:35:08.200 | Oh, well. That's fine. I will just type it in manually and we will figure out
00:35:18.280 | how to make Ctrl-R work at some other point. So, micromamba -c conda-forge
00:35:32.200 | --prefix ~/conda install. Probably need the install first.
00:35:44.280 | Yeah, a lot of the keyboard shortcuts don't work in the browser-based terminal,
00:35:52.440 | which is actually pretty annoying. They work a bit better on Mac than on Windows,
00:36:00.840 | because on Windows, the Control key is both used for the Linux terminal commands, and it's also
00:36:07.480 | used for the normal browser commands. Whereas on Mac, they use command for the browser commands,
00:36:13.320 | and so the Control key doesn't get overwritten. So, this would probably be a better experience
00:36:17.080 | on Mac, actually, than Windows. Okay, so we're going to install unzip.
00:36:32.200 | And hopefully, by the time people watch this video, if it's like
00:36:38.920 | July or later, things like mamba and unzip will already be installed. Okay, let's check.
00:36:48.680 | Okay, we have an unzip. That's good. Okay, so that is on its way.
00:37:01.240 | So, that's going to use up a gigabyte of space in my persistent storage, which
00:37:11.000 | you might not want to do that, right? And if you don't want to do that, then instead, you should
00:37:16.680 | unzip it into your home directory. If you unzip it into your home directory, it won't be there if
00:37:22.520 | you close it down and reopen it, right? So, you might want to create a little script for yourself
00:37:26.840 | that does the Kaggle download and the Unzip on your notebook, and then you can run that
00:37:32.520 | each time you start it up. So, these are the issues. I mean, look, having said that,
00:37:39.080 | the average cost on Paperspace for storage, I believe, is $0.29 per gigabyte per month.
00:37:45.240 | So, your convenience of putting it in storage is probably worth $0.29 for the one month. You're
00:37:52.040 | probably going to want it there. So, maybe that's just a better plan. I do know, though, that the...
00:37:58.760 | Well, maybe this is a problem, actually, because I do know that Paperspace's
00:38:04.520 | /notebooks and /storage are very, very, very, very slow. And we can actually see that
00:38:12.920 | when we're unzipping this. So, maybe this is a bad idea. Maybe we shouldn't put data there,
00:38:20.600 | at least when there's lots of files. Because this is painful. I'm going to cancel it and see
00:38:28.440 | how far it got. du -sh train_images. 426. And how about test_images?
00:38:47.800 | Wouldn't you know it? It was nearly finished. But, yeah, I think this is actually slower.
00:38:55.480 | So, I'm going to remove that.
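For reference, a small sketch of the du check used above to see how much had been unzipped. The scratch folder and the dummy file are made up for the demo; the exact size printed depends on the filesystem.

```shell
# Scratch folder with one dummy file of about 200 KB.
demo=$(mktemp -d)
mkdir "$demo/train_images"
head -c 200000 /dev/zero > "$demo/train_images/a.jpg"

# -s: one summary line for the whole folder; -h: human-readable units (K, M, G).
usage=$(du -sh "$demo/train_images")
echo "$usage"
```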
00:39:15.480 | And I have a strong feeling if we move it back to our home directory,
00:39:19.400 | it's going to be faster. I sure hope so. And the reason I care is not so much for the unzipping
00:39:26.680 | speed, but when it comes to training a model, we don't want it to be taking ages to open up
00:39:32.120 | each of those files. You see, even rm -rf takes a long time. So, while that's running,
00:39:44.840 | let's move the paddy zip file, pop it into our home directory.
00:39:58.680 | There we go. And then cd to our home directory.
00:40:03.240 | Okay, that's now finished.
00:40:09.240 | So, in terms of the steps we're going to do, it would be first we would make a directory for it.
00:40:25.160 | We would then do the Kaggle download. Actually, which we can just copy easily enough from Kaggle.
00:40:35.720 | And then we would unzip. Let's see how long it takes. So, the Unix time command
00:40:42.360 | runs whatever command you put after it and tells you how long it took.
00:40:51.560 | Did I not move it there? Oof. ./paddy. Oh, I didn't move it there. Yeah. time unzip -q
00:41:01.800 | paddy. So, yeah. So, I think what I would do, now I think about it, is
00:41:19.080 | I would have a paddy directory in my notebooks. I wouldn't store anything big here. I'd just have
00:41:25.640 | my notebooks here. And I would put a script here called get_data, say. And
00:41:37.000 | it will just have each of the steps I need. So, the steps would be cd to my home directory,
00:41:47.080 | make the paddy folder, cd to the paddy folder, do the wget,
00:42:03.240 | or not wget, kaggle competitions download, I should say, unzip it, paddy disease.
00:42:17.800 | Yeah. And I think that's it, right? So, we can make that executable with chmod u+x to add the
00:42:32.120 | executable permission to it. And so, yeah. So, then all I have to do is run that thing each time I
00:42:43.480 | start up Paperspace. And, yeah, it's only going to take eight seconds to unzip, and it took about
00:42:48.920 | five seconds to download. So, actually, that's not going to be really any trouble at all, is it?
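The script described above could look roughly like this. It is a sketch, not the exact file from the session: the name get_data.sh, the set -e line, and the zip file name are assumptions, and here the script is only written to a scratch folder and made executable, not run, since running it needs the kaggle tool, a valid kaggle.json, and network access.

```shell
# Write the script to a scratch folder (in the session it lives under /notebooks).
demo=$(mktemp -d)
script="$demo/get_data.sh"

cat > "$script" <<'EOF'
#!/usr/bin/env bash
set -e                      # stop at the first failing step
cd ~                        # home directory: fast, but not persistent
mkdir -p paddy              # -p: no error if the folder already exists
cd paddy
kaggle competitions download -c paddy-disease-classification
unzip -q paddy-disease-classification.zip
EOF

# Add the executable permission for the owning user.
chmod u+x "$script"
ls -l "$script"
```

Each time the machine starts, running the script once would recreate ~/paddy with the unzipped data.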
00:42:56.600 | Cool. And that's, you know, /notebooks, remember, is persistent on this machine.
00:43:07.240 | So, that's all good. So, now we can create a notebook for it.
00:43:13.160 | And so, my first step is always just to import computer vision functionality in general.
00:43:23.720 | Which is the same thing we used yesterday. And now you know exactly what that does.
00:43:30.600 | And then my second step is to look at the data. So, it's easiest to look at the data if we set a path
00:43:38.680 | to it. So, it's going to be in our home directory. And it's going to be called paddy.
00:44:04.920 | Well, that's okay. It's just /paddy, right? >> You can go Path.home(). >> Wow. I didn't know that.
00:44:10.520 | That's quite neat. Yeah, it is quite neat. So, that's that. Okay. So, path.ls()
00:44:20.600 | tells me what's in there. And if you remember my trick from yesterday, I also like to set that
00:44:30.680 | to be Path.BASE_PATH just so that my ls outputs look a bit easier to read.
00:44:39.720 | There we go. So, at this point, we could create a data frame by reading in the CSV of path/train.csv.
00:44:58.760 | Okay. So, we've got 10,000 rows. Each one is a JPEG. Each one's got a label. And so,
00:45:09.320 | let's take a look at one of the images, shall we?
00:45:23.320 | Oh, yeah, PILImage. path/train/... actually, you know, let's make life a little bit easier
00:45:48.040 | for ourselves by creating a train path. Because, you know, it's just so good to be lazy.
00:46:01.320 | /100330.jpg. Oh, no, because then they're inside the label directory.
00:46:27.000 | Yes. So, what we actually probably should have done would be to say train_path.ls().
00:46:34.360 | paddy/train, is that not right?
00:46:41.080 | train_images. And that's another good reason to put it in a variable. So, you only have to change
00:46:48.840 | it in one place. And so, there we have that. And so, let's create, I don't know, let's call it the
00:46:57.000 | bacterial_leaf_blight_path = train_path/'bacterial_leaf_blight'.
00:47:11.960 | So, now we should be able to go BLB and look at that image.
00:47:25.320 | Oh, there we go. So, we have an image. Yay.
00:47:30.120 | All right. So, might be nice to, like, find out a bit about this.
00:47:48.680 | Maybe look at the size. So, it's a 480 by 640 image.
00:47:56.360 | Great. You know, another way we can take a look at an image, you might remember from yesterday,
00:48:13.080 | you can go files=get_image_files and pass in a path.
00:48:21.080 | And this will be recursive. So, I can do this.
00:48:29.160 | As you can see. So, this has got the 10,000. Okay. And that number there matches that number there.
00:48:36.840 | So, that's a good sign. And so, another way to do that would have been to go
00:48:42.840 | img = PILImage.create(files[0]).
00:48:49.880 | Okay. And we could even take a look at a few, right? So, if we wanted to check that the image
00:49:02.120 | size seems reasonably consistent, we could go o.size for o in, well, actually, PILImage
00:49:11.000 | .create(o).size for o in files[:10], for example.
00:49:28.120 | So, you know, this is not particularly rigorous, but it looks like they're generally 480 by 640
00:49:33.080 | files. They're all the same size, which is handy. That's interesting.
00:49:47.080 | And they're probably bigger than we normally need. You know, we normally use images that are about
00:49:55.720 | 224 or so. Having said that, I don't know if, like, presumably this is some disease
00:50:04.360 | thing, paddy disease competition. So, it's rice.
00:50:16.280 | Classify the images according to their disease. So, I can't even tell that this thing has a disease.
00:50:26.280 | So, I don't know how big it needs to be to see the disease. So, it is possible it'll turn out
00:50:35.880 | that we actually need full-sized images. So, like, I would start by using smaller images
00:50:43.560 | and kind of see how we go.
00:50:46.840 | Anyway, 640 by 480 is not giant. So, we should be fine.
00:50:50.680 | The CSV file
00:51:03.160 | has got one extra bit of information, which is the variety.
00:51:10.440 | Radek, did you find out what this variety thing is about?
00:51:13.800 | From the doubt, I didn't even know that that CSV file existed.
00:51:19.240 | But it's fun because we can build a multimodal model from data.
00:51:29.560 | I see. It's the type of rice
00:51:32.360 | as opposed to the type of disease. Yeah. So, maybe
00:51:38.440 | the different diseases might look different depending on what type of rice it's on.
00:51:46.280 | My guess is that we wouldn't need to use that information because given how many images there
00:51:52.680 | are, I would guess that it's going to do a perfectly good job of recognizing the varieties
00:52:00.760 | by itself without us telling it. Unless there's a whole lot of different types of varieties,
00:52:05.880 | which we can check easily enough by checking the data frame, grabbing the variety,
00:52:14.200 | and doing a .value_counts(). And we can see how many there are of each.
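The counting step described here is a one-liner in pandas (something like `df.variety.value_counts()`). As a rough sketch of what that call computes, here is a stdlib-only version. The rows, the column names, and the `value_counts` helper below are made up for illustration and are not taken from the competition's CSV.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; the column names are assumptions,
# not copied from the real competition file.
SAMPLE_CSV = """image_id,label,variety
a.jpg,disease_a,ADT45
b.jpg,disease_b,ADT45
c.jpg,disease_a,other_variety
d.jpg,disease_c,ADT45
"""

def value_counts(rows, column):
    """Count how often each value of `column` occurs, most common first."""
    return Counter(row[column] for row in rows).most_common()

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for variety, n in value_counts(rows, "variety"):
    print(variety, n)
```

If one or two values dominate the counts, as they do in the session, the extra column is unlikely to add much beyond what the images already carry.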
00:52:23.800 | Okay. So, look, I mean, there's a couple of tiny varieties, but on the whole,
00:52:29.880 | most of it is ADT45 and quite a bit of KarnatakaPonni. It does seem like a bit of a rice session today,
00:52:40.440 | doesn't it? Lots of rice going on. Yeah. So, I think it's very unlikely that this variety
00:52:49.000 | field is going to help because there's so many examples of the main one anyway that it's going
00:52:57.080 | to be able to recognize it. I mean, at some point we can try it, but I would be making
00:53:06.520 | that a pretty low priority for this competition. And so, yeah, given we're doing a practice walk
00:53:14.120 | through, I'd be inclined to fire up Fastbook and the intro and see if we can just basically do the
00:53:25.480 | same thing that we did last time. So I'm going to merge these back together again. We've already
00:53:34.280 | got those, too. We've got those. Well, there's not much there, is there? Oh, I'm in APL mode.
00:53:53.960 | I wonder why things aren't working. I don't know how that happened. I haven't used APL today.
00:53:59.320 | Copy, paste. Okay. So this is how we did cats. So we needed a labeling function. Now, in our case,
00:54:10.600 | the labels are very easy. Each image is inside the directory, which is its label. So the parent
00:54:17.480 | folder name is the label. And so we already have a function to label from folders. So we can actually
00:54:29.240 | just do ImageDataLoaders.from_folder, because that's all we need. So we're still going to need
00:54:35.960 | the path. Train and valid actually have different names. So let's fill all of those in. So we're
00:54:42.200 | going to have path, train equals train underscore, what was it? Images? Yep. And test_images. So
00:54:58.600 | train_images. And valid_pct. So that's fine. We'll do that the same as last time.
00:55:12.360 | Okay.
00:55:14.680 | It's expecting to have train and valid subfolders. Oh, or valid_pct. So hopefully that'll work.
00:55:28.680 | Let's try it. And we'll use the same resize as last time.
00:55:35.880 | Okay.
00:55:43.640 | All right. Oh, no, that did. Well, did that work? No, it didn't work because we've got,
00:56:02.440 | that's interesting. Test images. So my guess is it's got confused by the fact.
00:56:08.120 | Yes. Okay. So possibly what we should instead do is use train_path here.
00:56:21.480 | And use valid_pct instead. I wonder if that'll fix that problem.
00:56:29.320 | There we go. That's fixed that problem. Okay. Great. So we should then be able to create a learner.
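The approach that ended up working (point at the training folder, take the label from each image's parent folder, and hold out a random percentage for validation) can be sketched without fastai. This is a minimal stdlib illustration of the idea only, not fastai's implementation; `find_images`, `parent_label`, `random_split`, and the `label_a`/`label_b` folders are hypothetical names for the example.

```python
import random
import tempfile
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def find_images(root):
    """Recursively collect image files under `root`, sorted for reproducibility."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in IMAGE_EXTS)

def parent_label(path):
    """The label is the name of the folder that directly contains the image."""
    return path.parent.name

def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle a copy of `items` and hold out `valid_pct` of them for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_valid = int(len(items) * valid_pct)
    return items[n_valid:], items[:n_valid]

# Build a tiny throwaway tree shaped like train_images/<label>/<file>.jpg
with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "train_images"
    for label in ("label_a", "label_b"):
        (train_path / label).mkdir(parents=True)
        for i in range(5):
            (train_path / label / f"{i}.jpg").touch()

    files = find_images(train_path)
    train, valid = random_split(files, valid_pct=0.2)
    labels = sorted({parent_label(f) for f in files})
    print(len(train), len(valid), labels)
```

Pointing the search at the training folder rather than its parent matters for the same reason it did in the session: a sibling folder of unlabelled test images would otherwise be picked up and treated as a label.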
00:56:47.800 | And learn.fine_tune. Let's just do one epoch to start with.
00:57:01.160 | There it goes. So it can be useful to kind of make sure it's
00:57:11.400 | being reasonably productive as it's training. And we can do that with nvidia-smi.
00:57:18.040 | nvidia-smi --help. Oh, so much help.
00:57:30.120 | So there's, let's take a look here.
00:57:39.160 | We've only got one GPU, so that's fine.
00:57:46.520 | Loop query.
00:57:53.800 | Okay. We're not modifying anything.
00:58:07.080 | dmon. I think that's the one we want.
00:58:08.840 | nvidia-smi dmon.
00:58:16.920 | Okay. That's just finished. So while it was running, so this is something people often say:
00:58:33.640 | use watch nvidia-smi to have it refresh. But actually I don't think most people know that
00:58:40.600 | there's a dmon subcommand, which, as you can see, shows you every second how
00:58:45.160 | it's going. And the most important thing it's showing me is this column SM. SM stands for streaming
00:58:52.120 | multiprocessor. That's kind of what they call it instead of a CPU for their GPUs. And it's showing
00:58:57.800 | me that it's being used 70 to 90 percent kind of effectively if you like. And that's a good sign.
00:59:06.600 | That's fine. So if this was like under 50, then that would be a problem. But it looks like it's
00:59:13.240 | using my GPU reasonably effectively. Yeah, and it's got the error rate down to 13 percent.
00:59:20.440 | So we are successfully training a model. So that sounds good. So, Jeremy, just a quick question.
00:59:29.240 | When you're saying that like if it's under 50 percent, then that can be a problem. Is that
00:59:33.560 | because you've oversized the GPU like when you selected it or like just just want to clarify
00:59:39.480 | what you know about that? What that would mean. Yeah, thanks. It's a good question.
00:59:45.320 | Just rename this. It would probably mean that we're not able to read and process the images
00:59:54.760 | fast enough. And so in particular, my guess is that if they're in /storage or /notebooks,
01:00:02.200 | you would see the SM percent be really low because I think it would be taking a really
01:00:05.880 | long time to open each image because it's coming from a network storage. And so generally, yeah,
01:00:11.720 | a low SM means that your IO, your input output, your reading or processing time is too high.
01:00:17.960 | And so the ways to fix that would be a few. One would be to move the images onto the local machine
01:00:24.120 | so they're not on a network drive. A second would be to resize the images ahead of time
01:00:29.480 | to make them a more reasonable size. And a third would be to decrease the amount of kind of
01:00:35.560 | augmentation that you're doing. Or another would be to pick a different instance type with more CPUs.
01:00:41.880 | So those are basically the things. All right. Okay. Just to add, the plain nvidia-smi
01:00:49.880 | command also has a lot of useful information like your CUDA version and stuff like that. So
01:00:57.800 | you know, it's also a useful command even without dmon, to know that it exists. Yep. A lot of details
01:01:08.680 | here. So if you're looking for the IDX of your GPU, it might be GPUs. And some of the variables
01:01:16.040 | here are a little bit more descriptive. So it might be easier to get started with that command
01:01:22.440 | or to at least use it every now and then. And if you'd like to have this one running in a loop,
01:01:27.400 | which is what I generally do, just do nvidia-smi -l. Yeah. Yeah. I mean, I agree this is useful,
01:01:36.920 | but I would suggest in a loop to use dmon because there's only two columns you care about.
01:01:41.800 | And this one does not show you SM, right? So if you want to actually see it's being utilized,
01:01:48.440 | you need to use dmon. And you can also see the percentage memory utilization.
01:01:52.280 | So just look at these two columns. The other ones you can actually ignore.
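To make the "watch the SM column" advice concrete, here is a small sketch that pulls the `sm` values out of `nvidia-smi dmon` style text and flags samples under the 50 percent mark mentioned earlier. The sample text is hand-written to resemble dmon output and is not a real log; the exact set of columns varies by driver version, which is why the parser looks the column up in the header. `parse_dmon` and `sample_gpu` are hypothetical helper names.

```python
import shutil
import subprocess

# Hand-written sample shaped like dmon output; NOT a real log.
SAMPLE = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    70    60     -    85    40     0     0  5000  1500
    0    72    61     -    45    20     0     0  5000  1500
"""

def parse_dmon(text):
    """Return the integer values of the `sm` column from dmon-style text."""
    sm_index = None
    values = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # Header lines start with '#'; the first one names the columns.
            cols = stripped.lstrip("#").split()
            if sm_index is None and "sm" in cols:
                sm_index = cols.index("sm")
            continue
        parts = stripped.split()
        if sm_index is not None and len(parts) > sm_index and parts[sm_index].isdigit():
            values.append(int(parts[sm_index]))
    return values

def sample_gpu(count=5):
    """Collect `count` dmon samples, or return [] if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(["nvidia-smi", "dmon", "-c", str(count)],
                         capture_output=True, text=True, check=True).stdout
    return parse_dmon(out)

sm = parse_dmon(SAMPLE)
low = [v for v in sm if v < 50]  # threshold taken from the discussion above
print(sm, low)
```

On a machine with a GPU, `sample_gpu()` would feed live readings through the same parser; persistently low values point to the data pipeline rather than the GPU as the bottleneck.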
01:01:57.880 | Yeah. Okay. I think that's a pretty good place to stop. I'm glad you put us onto this competition
01:02:05.640 | Radek. It looks fun. And I feel like we've got a reasonable start. So yeah, maybe next time we can
01:02:16.040 | try doing a submission. And we could also try creating a Kaggle notebook for other people to
01:02:27.640 | see. How does that sound? Sounds excellent. One thing I also like about this is that
01:02:39.160 | we're coming up against problems as we go and jumping through those hoops. And these are the
01:02:49.080 | beginner sorts of roadblocks that we'll have to face, I guess. Exactly. And if you guys, you know,
01:02:56.680 | repeat these steps or do it on another dataset or whatever and hit some roadblocks, then it's really
01:03:04.440 | helpful. If you solve them, you know, come back tomorrow and tell us what happened and how you
01:03:09.480 | solved it. And if you didn't, come back tomorrow and tell us to fix it for you. I think they're
01:03:14.280 | both useful things to do. So things like Radek's example of like doing a bash environment variable
01:03:20.760 | and having a space next to the equal sign, you know, that kind of stuff. I forget even to mention
01:03:26.200 | it, but really useful information. You know, this competition is nice because it's relatively small,
01:03:33.080 | like 10,000 images, and it's aligned with what you're doing in the course. But if you'd like to
01:03:37.800 | try something out on a competition that is not active right now, you can still do this, because Kaggle
01:03:43.800 | allows you to do this late submission thing. And this opens up many competitions to play around with.
01:03:54.440 | The current competitions that are, how do they call it, ranked competitions, so they award you
01:04:01.640 | points and there are prizes, they are not on images. So exploring something on your own, to
01:04:09.480 | try the methods on another image competition, might be something quite useful.
01:04:16.120 | So to find those, you need to scroll to the bottom and click explore all competitions.
01:04:22.280 | And yeah, this will let you see closed competitions as well.
01:04:34.920 | And you can even see, I guess, here you go, you can find out which were the
01:04:44.520 | most popular of all time. That can be interesting. Crypto forecasting. Well, of course it would be.
01:04:50.840 | That's a bit sad, but there you go. That's interesting. This patent phrase one is super
01:04:59.480 | popular. That's good to see. Instant gratification. All right. Thanks all. See you next time. Bye.