Lesson 3: Deep Learning 2018

00:00:00.000 | Welcome back everybody

00:00:02.000 | I'm sure you've noticed

00:00:05.640 | But there's been a lot of cool activity on the forum this week and one of the things that's been really great to see

00:00:11.400 | Is that a lot of you have started creating

00:00:14.040 | Really helpful materials both for your classmates to better understand stuff and also for you to better understand stuff by

00:00:22.120 | Trying to teach what you've learned. I just wanted to highlight a few I've actually

00:00:29.040 | Posted to the wiki thread a few of these, but there's lots more

00:00:33.220 | Reshma has posted a whole bunch of nice introductory tutorials so for example if you're having any trouble getting connected with AWS

00:00:44.200 | She's got a whole step-by-step

00:00:47.000 | How to go about logging in and getting everything working which I think is a really terrific thing and so it's a kind of thing

00:00:54.840 | that if you're

00:00:57.600 | Writing some notes for yourself to remind you how to do it

00:01:00.520 | You may as well post them for others to use as well, and by using a markdown file like this

00:01:06.280 | It's actually good practice if you haven't used github before if you put it up on github

00:01:10.700 | Everybody can now use it or of course you can just put it in the forum

00:01:14.960 | so

00:01:16.640 | a more advanced

00:01:18.200 | Thing that Reshma wrote up is that she noticed that I like using tmux

00:01:22.600 | Which is a handy little thing which lets me

00:01:27.480 | Basically have a window. Let's see if I've got one. I'll show you

00:01:31.400 | So as soon as I log into my computer

00:01:34.200 | If I run tmux

00:01:37.280 | You'll see that all of my windows pop straight up

00:01:39.820 | Basically, and I can continue running stuff in the background. I've got vim over here

00:01:45.760 | And I can kind of zoom into it, or I can move over to the top, which is where I've got

00:01:50.600 | Jupyter running, and so forth. So if that sounds interesting, Reshma has a

00:01:56.400 | Tutorial here on how you can use tmux

00:01:58.520 | And she's actually got a whole bunch of stuff in her github, so that's really cool

00:02:04.520 | Apil Tamang has written a very nice kind of summary basically of our last lesson

00:02:12.160 | Which kind of covers

00:02:15.880 | What are the key things we did and why did we do them, so if you're kind of

00:02:20.060 | Wondering like how does it fit together? I think this is a really helpful summary

00:02:26.160 | Like what did those couple of hours look like if we summarize it all into a page or two?

00:02:30.800 | I

00:02:33.520 | also really like that Pavel has

00:02:36.080 | kind of done a deep dive on the learning rate finder

00:02:41.420 | which is a

00:02:44.200 | Topic that a lot of you have been interested in learning more about, particularly

00:02:47.680 | Those of you who have done deep learning before have realized that this is like a solution to a problem that you've been having for

00:02:54.120 | A long time and haven't seen before, and so it's kind of something which hasn't really been blogged about before, so this is the first

00:02:59.940 | Time I've seen this blogged about so when I put this on Twitter a link to

00:03:03.800 | Pavel's post it's been shared now hundreds of times

00:03:07.280 | It's been really really popular and viewed many thousands of times, so that's some great content

00:03:12.960 | Radek has posted lots of cool stuff. I really like this practitioners guide to pytorch which again

00:03:20.360 | This is more for more advanced students, but it's aimed at people who have never used pytorch before but know a bit about

00:03:27.200 | Numerical programming in general and it's a quick introduction to how pytorch is different

00:03:33.080 | And then there's been some interesting little bits of research like what's the relationship between learning rate and batch size so one of the

00:03:41.080 | Students actually asked me this before class and I said well one of the other students has written an analysis of exactly that

00:03:49.080 | so what he's done is basically looked through and tried different batch sizes and different learning rates and tried to see how they seem to

00:03:54.960 | Relate together and these are all like cool experiments, which you know you can try yourself

00:03:59.960 | Radek again, he's written something, again kind of a research into this question. I made a claim that

00:04:07.600 | Stochastic gradient descent with restarts finds more generalizable

00:04:14.240 | Parts of the function surface because they're kind of flatter, and he's been trying to figure out. Is there a way to measure that more directly?

00:04:20.440 | Not quite successful yet, but a really interesting piece of research

00:04:24.080 | got some

00:04:27.120 | introductions to convolutional neural networks

00:04:29.120 | and

00:04:31.120 | then

00:04:33.000 | something that we'll be learning about towards the end of this course, but I'm sure you've noticed we're using something called ResNet and

00:04:39.560 | Anand Saha actually posted a pretty impressive analysis of like what's a ResNet and why is it interesting?

00:04:46.400 | And this one's actually already been shared very widely around the internet, I've seen

00:04:51.280 | So more advanced students who are interested in

00:04:55.000 | Jumping ahead can look at that, and Apil Tamang also has done something similar

00:05:00.840 | so lots of

00:05:03.600 | Yeah, lots of stuff going on on the forums. I'm sure you've also noticed we have a beginner forum now

00:05:09.760 | specifically for you know asking questions which

00:05:12.880 | You know

00:05:15.760 | It's always the case that there are no

00:05:17.760 | Dumb questions, but when there's lots of people around you talking about advanced topics. It might not feel that way

00:05:23.320 | so hopefully the beginners forum is just a

00:05:25.320 | less intimidating space and

00:05:27.800 | If you're a more advanced

00:05:30.960 | Student who can help answer those questions, please do but remember when you do answer those questions try to answer in a way

00:05:37.560 | That's friendly to people that maybe you know have no more than a year of programming experience and haven't done any machine learning before

00:05:43.580 | So you know I hope

00:05:48.760 | Other people in the class

00:05:51.720 | Feel like you can contribute as well and just remember all of the people we just looked at or many of them

00:05:56.460 | I believe have never

00:05:58.520 | Posted anything to the internet before right I mean you don't have to be a particular kind of person to be allowed to blog

00:06:04.760 | or something, you can just jot down your notes, throw them up there, and

00:06:08.880 | One handy thing is if you just put it on the forum, and you're not quite sure of some of the details

00:06:16.800 | Then you know you have an opportunity to get feedback, and people say like ah well

00:06:20.720 | That's not quite how that works

00:06:22.000 | You know actually it works this way instead or or that's a really interesting insight had you thought about taking this further and so forth

00:06:29.480 | So what we've done so far is a kind of an introduction, just as a practitioner, to

00:06:35.920 | Convolutional neural networks for images, and we haven't really talked much at all about

00:06:42.460 | The theory or why they work or the math of them, but on the other hand what we have done is seen

00:06:49.200 | how to

00:06:51.360 | Build a model which actually works exceptionally well in fact world-class level models

00:06:59.240 | and we'll kind of review a little bit of that today and

00:07:03.600 | Then also today

00:07:06.440 | We're going to dig in a little bit, quite a lot more actually, into the underlying theory of like

00:07:10.180 | What is a CNN? What's a convolution?

00:07:12.880 | How does this work, and then we're going to kind of go through this cycle where

00:07:18.260 | We're going to do a little intro into a whole bunch of application areas, using neural nets for structured data

00:07:25.120 | so kind of like logistics or forecasting or you know financial data or that kind of thing and then looking at

00:07:33.080 | language applications NLP applications using recurrent neural nets and then

00:07:38.880 | collaborative filtering for

00:07:41.720 | Recommendation systems and so these will all be like

00:07:46.200 | Similar to what we've done for CNNs for images

00:07:49.800 | It'll be like here's how you can get a state-of-the-art result without digging into the theory

00:07:53.800 | But knowing how to actually make it work

00:07:55.880 | And then we're going to go back through those almost in reverse order

00:08:01.160 | So then we're going to dig right into collaborative filtering in a lot of detail and see how to write the code

00:08:06.960 | Underneath and how the math works underneath and then we're going to do the same thing for structured data analysis

00:08:12.760 | We're going to do the same thing for convnets for images and finally an in-depth deep dive into recurrent neural networks

00:08:19.560 | So that's kind of where we're going

00:08:23.240 | so let's start by

00:08:25.240 | Doing a little bit of a review and I want to

00:08:29.280 | Also provide a bit more detail on some steps that we only briefly skimmed over

00:08:36.040 | So I want to make sure that we're all able to complete

00:08:38.920 | Kind of last week's assignment, which was the dog breeds

00:08:44.080 | I mean, to basically apply what you've learned to another data set, and I thought the easiest one to do would be the dog

00:08:49.520 | Breeds Kaggle competition and so I want to make sure everybody has everything you need to do this right now

00:08:54.280 | So and the first thing is to make sure that you know how to download

00:08:58.800 | Data, and so there's two main places at the moment we're kind of downloading data from. One is from Kaggle

00:09:05.600 | And the other is from like anywhere else

00:09:08.720 | And so I'll first of all do the Kaggle version

00:09:13.840 | So to download from Kaggle

00:09:17.080 | We use something called kaggle-cli

00:09:20.360 | Which is here, and to install it, I think it's already in, let's just double check

00:09:29.120 | Yeah, so it should already be in your

00:09:36.000 | environment

00:09:38.600 | But to make sure: one thing that happens is, because this is downloading from the Kaggle website through like screen scraping, every time Kaggle changes

00:09:45.420 | The website it breaks, so anytime you try to use it and

00:09:48.940 | Kaggle's website has changed recently, you'll need to make sure you get the most recent version, so you can always go pip

00:09:56.500 | install

00:09:59.160 | kaggle-cli

00:10:01.280 | --upgrade, and so that'll just make sure that you've got the latest version of it and everything that it depends on

00:10:10.720 | okay, and

00:10:12.800 | so then having done that you can

00:10:15.420 | Follow the instructions. Actually, I think Reshma was kind enough to, there we go, there's a kaggle-cli

00:10:20.820 | Feel like everything you need to know can be found at Reshma's

00:10:24.140 | GitHub

00:10:27.620 | So basically to do the next step you go

00:10:32.160 | kg download

00:10:35.540 | And then you provide your username with -u, you provide your password with -p, and then with -c you give the competition name

00:10:44.400 | And a lot of people in the forum have been confused about what to enter here

00:10:47.680 | And so the key thing to note is that when you're at a Kaggle competition

00:10:51.220 | After the /c/ there's a specific name, planet-understanding-etc. Right? That's the name you need
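
As a rough illustration of which part of the URL is "the name you need", here is a small Python sketch. The helper names competition_slug and kg_download_command are made up for this example, and the username and password are placeholders; only the kg download flags (-u, -p, -c) come from the lesson.

```python
from urllib.parse import urlparse

def competition_slug(url):
    """Return the part of a Kaggle competition URL that follows /c/."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "c":
        raise ValueError("not a competition URL: " + url)
    return parts[1]

def kg_download_command(username, password, competition):
    """Build the argument list for `kg download` using the flags from the lesson."""
    return ["kg", "download", "-u", username, "-p", password, "-c", competition]

url = "https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/data"
slug = competition_slug(url)
# Placeholder credentials; substitute your own.
cmd = kg_download_command("my_user", "my_password", slug)
print(" ".join(cmd))
```

The list in cmd could be handed to subprocess.run, or simply typed at the shell in the same order.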

00:10:59.200 | Okay

00:11:01.560 | the other thing you'll need to make sure is that you've

00:11:04.280 | On your own computer attempted to click download at least once, because when you do it will ask you to accept the rules

00:11:11.000 | If you've forgotten to do that

00:11:14.580 | kg download will give you a hint, it'll say it looks like you might have forgotten

00:11:17.800 | the rules. If you log into Kaggle with like a

00:11:21.700 | Google account, like anything other than a username and password, this won't work

00:11:25.980 | So you'll need to click forgot password on Kaggle and get them to send you a normal password

00:11:31.300 | So that's the Kaggle version

00:11:33.700 | Right and so when you do that you end up with a whole folder created for you with all of that competition data in it

00:11:41.960 | So a couple of reasons you might want to not use that

00:11:44.980 | The first is that you're using a data set that's not on Kaggle

00:11:48.220 | The second is that you don't want all of the data sets in a Kaggle competition for example the planet competition

00:11:54.700 | That we've been looking at a little bit. We'll look at again today

00:11:57.380 | Has data in two formats TIFF and JPEG the TIFF is 19 gigabytes and the JPEG is 600 megabytes

00:12:06.420 | So you probably don't want to download both

00:12:09.460 | So I'll show you a really cool tip, which actually somebody on the forum taught me

00:12:14.040 | I think it was one of the MSAN students here at USF. There's a

00:12:17.860 | Chrome extension called CurlWget

00:12:22.480 | So you can just search for CurlWget

00:12:26.220 | And then you install it by just clicking on install, if you haven't installed an extension before, and then from now on

00:12:33.520 | Every time you try to download something, so I'll try and download this file

00:12:38.460 | and

00:12:40.460 | I'll just go ahead and cancel it right and now you see this little yellow button. That's added up here

00:12:46.340 | There's a whole command here

00:12:48.940 | All right, so I can copy that and

00:12:52.060 | Paste it

00:12:55.940 | into my

00:12:57.700 | window and

00:12:59.420 | Hit go, and there it goes, okay

00:13:02.980 | So what that does is like all of your cookies and headers and everything else needed to download that file is saved in that command
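
To make the "cookies and headers" idea concrete, here is a minimal Python sketch of the same trick: attaching a session cookie to a request so the server treats it like your logged-in browser. The URL, cookie value and the helper name build_authenticated_request are all hypothetical; the extension generates the real values for you.

```python
import urllib.request

def build_authenticated_request(url, cookie, user_agent="Mozilla/5.0"):
    """Attach the browser's cookie and user agent so the server sees a logged-in session."""
    return urllib.request.Request(
        url, headers={"Cookie": cookie, "User-Agent": user_agent}
    )

# Hypothetical values; copy the real ones from your own browser session.
request = build_authenticated_request(
    "https://example.com/data/train-jpg.tar.7z",
    cookie="sessionid=abc123",
)
print(request.full_url)
print(request.get_header("Cookie"))
# To actually download, pass `request` to urllib.request.urlopen and
# write the response bytes to a file.
```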

00:13:09.620 | So this is not just useful for

00:13:12.060 | downloading data

00:13:13.980 | It's also useful if you're trying to download some, I don't know, TV show or something, anything where it's hidden behind a

00:13:20.220 | Login or something, you can grab it, and actually that is very useful for data science because quite often we want to

00:13:27.620 | Analyze things like videos on our consoles

00:13:31.140 | So this is a good trick. All right, so there's two ways to get the data

00:13:34.500 | So then

00:13:38.380 | Having got the data you then need to

00:13:42.020 | Build your model, right?

00:13:45.140 | So what I tend to do like you'll notice that I tend to assume that the data is in a directory called data

00:13:51.860 | That's a subdirectory of wherever your notebook is, right?

00:13:55.620 | Now you don't necessarily

00:13:59.380 | Actually want to put your data there

00:14:00.860 | You might want to put it directly in your home directory or you might want to put it on another drive or whatever

00:14:05.260 | so what I do is, if you look inside my courses/dl1 folder, you'll see that data is actually a

00:14:13.020 | symbolic link

00:14:15.660 | To a different drive, right? So you can put it anywhere you like and then you can just add a symbolic link

00:14:20.820 | Or you can just put it there directly. It's up to you

00:14:24.660 | If you haven't used symlinks before, they're like aliases on the Mac or shortcuts on Windows

00:14:30.340 | Very handy and there's some threads on the forum about how to use them if you want help with that

00:14:35.980 | That, for example, is also how we actually have the

00:14:39.420 | fast AI modules

00:14:41.660 | Available from the same place as our notebooks. It's just a symlink to where they come from

00:14:46.540 | anytime you want to see like

00:14:50.340 | Where things actually point to in Linux, you can just use the -l flag when listing a directory

00:14:57.340 | And it'll show you where the symlinks

00:14:59.340 | Exist and also show you which things are directories and so forth
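
Here is a small Python sketch of the symlink idea, using a temporary directory so it does not touch real data. The folder names big_drive and courses/dl1 are only stand-ins; on the command line the equivalents would be ln -s and ls -l.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch does not touch real data.
root = tempfile.mkdtemp()
real_data = os.path.join(root, "big_drive", "data")   # where the data really lives
notebook_dir = os.path.join(root, "courses", "dl1")   # where the notebooks live
os.makedirs(real_data)
os.makedirs(notebook_dir)

# Equivalent of `ln -s <real_data> data` run inside the notebook folder.
link = os.path.join(notebook_dir, "data")
os.symlink(real_data, link)

# Rough equivalent of `ls -l`: show which entries are symlinks and where they point.
for name in sorted(os.listdir(notebook_dir)):
    path = os.path.join(notebook_dir, name)
    if os.path.islink(path):
        print(name, "->", os.readlink(path))
    elif os.path.isdir(path):
        print(name + "/")
    else:
        print(name)
```

Code that opens data/... from the notebook folder follows the link transparently, which is why the notebooks can keep assuming a local data directory.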

00:15:03.040 | Okay, so one thing which

00:15:06.580 | May be a little unclear based on what we've done so far is like

00:15:15.220 | How little code you actually need to do this end-to-end, so what I've got here in a single window is an entire

00:15:22.860 | End-to-end process to get a state-of-the-art result for cats versus dogs, right?

00:15:28.260 | The only step I've skipped is the bit where we've downloaded it from Kaggle and then where we unzipped it, right?

00:15:35.660 | so

00:15:37.220 | These are literally all the steps

00:15:39.460 | and so we

00:15:42.660 | Import our libraries, and actually if you import this one, conv_learner, that basically imports everything else

00:15:48.900 | So that's that. We need to tell it the path of where things are, the size that we want, the batch size that we want

00:15:56.540 | alright

00:15:58.500 | So then and we're going to learn a lot more about what these do very shortly

00:16:02.340 | But basically we say how do we want to transform our data so we want to transform it in a way

00:16:07.500 | That's suitable to this particular kind of model, and it assumes that the photos are side-on photos

00:16:13.420 | And that we're going to zoom in up to 10% each time

00:16:16.220 | We say that we want to get some data

00:16:19.500 | Based on paths and so remember this is this idea that there's a path called cats and a path called dogs

00:16:25.180 | And they're inside a path called train and a path called valid

00:16:28.340 | Note that you can always

00:16:33.500 | Override these with other things, so if your things are in differently named folders you could either rename them or, you can see here

00:16:40.340 | There's like a trn_name and a val_name, you can always pick something else here

00:16:45.020 | Also notice there's a test_name

00:16:48.820 | So if you want to submit something to Kaggle you'll need to fill in the name of the folder where the test

00:16:54.380 | Set is, and obviously those won't be labeled
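
The folder-per-class layout described above can be sketched with a tiny fake data set. The helper count_images and the file names are invented for this example; the point is only the shape train/<class>/<image>, valid/<class>/<image>, and an unlabeled test folder.

```python
import tempfile
from pathlib import Path

def count_images(root, split_names=("train", "valid")):
    """Return {split: {class_name: number_of_files}} for a folder-per-class layout."""
    counts = {}
    for split in split_names:
        counts[split] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted((Path(root) / split).iterdir())
            if class_dir.is_dir()
        }
    return counts

# Build a tiny fake dataset: <root>/<split>/<class>/<image>
root = Path(tempfile.mkdtemp())
for split, n in (("train", 3), ("valid", 1)):
    for label in ("cats", "dogs"):
        folder = root / split / label
        folder.mkdir(parents=True)
        for i in range(n):
            (folder / f"{label}.{i}.jpg").touch()

# The test set has no class folders because its images are unlabeled.
(root / "test").mkdir()
(root / "test" / "0.jpg").touch()

layout = count_images(root)
print(layout)
```

Running a check like this before training is a cheap way to catch a misnamed or empty class folder.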

00:17:00.220 | So then we create a model from a pretrained model. It's from a ResNet-50 model using this data

00:17:06.900 | And then we call fit and remember by default

00:17:10.380 | That has all of the layers, but the last few frozen and again, we'll learn a lot more about what that means

00:17:16.380 | And so that's what that does, so that

00:17:19.500 | That took two and a half minutes

00:17:22.220 | Notice here I didn't say precompute=True. Again

00:17:27.300 | There's been some confusion on the forums about like what that means

00:17:30.260 | It's only something that makes it a little faster for this first step, right, so you can always skip it

00:17:37.620 | And if you're at all confused about it, or it's causing you any problems. Just leave it off right because it's just a

00:17:43.700 | It's just a shortcut which caches some of the intermediate steps so they don't have to be recalculated each time

00:17:52.020 | Okay, and remember that when we are using precomputed activations, data augmentation doesn't work, right, so even if you ask for data

00:18:00.420 | augmentation, if you've got precompute=True

00:18:02.860 | It doesn't actually do any data augmentation because it's using the cached

00:18:06.940 | non-augmented

00:18:09.220 | activations
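
A toy model of why cached activations and augmentation do not mix. Nothing here is fast AI code: frozen_layers, augment and the Activations class are stand-ins invented for this sketch, with tiny lists of numbers playing the role of images.

```python
import random

def frozen_layers(image):
    """Stand-in for the frozen pretrained layers: a fixed function of the pixels."""
    return sum(i * p for i, p in enumerate(image))

def augment(image, rng):
    """Stand-in for data augmentation: flip the image half of the time."""
    return image[::-1] if rng.random() < 0.5 else image

class Activations:
    def __init__(self, images, precompute):
        self.images = images
        # With precompute, the frozen layers run once on the un-augmented images.
        self.cache = [frozen_layers(img) for img in images] if precompute else None

    def epoch(self, rng):
        if self.cache is not None:
            return list(self.cache)  # augmentation is never applied
        return [frozen_layers(augment(img, rng)) for img in self.images]

images = [[1, 2, 3], [4, 0, 9], [7, 7, 1]]
rng = random.Random(0)
cached = Activations(images, precompute=True)
live = Activations(images, precompute=False)
cached_epochs = {tuple(cached.epoch(rng)) for _ in range(20)}
live_epochs = {tuple(live.epoch(rng)) for _ in range(20)}
print(len(cached_epochs), len(live_epochs) > 1)
```

The cached path returns the same activations every epoch, so asking for augmentation changes nothing; only the uncached path ever sees a flipped image.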

00:18:10.540 | So in this case, to keep this as simple as possible, I have no precomputing going on

00:18:15.140 | so I do three cycles of length one and

00:18:20.220 | Then I can then unfreeze

00:18:22.900 | So it's now going to train the whole thing

00:18:25.500 | something we haven't seen before and we'll learn about in the second half is

00:18:29.620 | called bn_freeze. For now all you need to know is that if you're using a

00:18:34.940 | model like a

00:18:37.140 | Bigger, deeper model like ResNet-50 or ResNeXt-101 on a data set

00:18:42.780 | That's very very similar to ImageNet, like this cats and dogs data set, in other words

00:18:48.140 | Like side on photos of standard objects

00:18:51.780 | You know, of a similar size to ImageNet, like somewhere between 200 and 500 pixels

00:18:57.300 | You should probably add this line when you unfreeze. For those of you that are more advanced, what it's doing is it's

00:19:06.020 | Causing the batch normalization

00:19:08.460 | Moving averages to not be updated but in the second half of this course you're going to learn all about why we do that

00:19:14.340 | It's something that's not supported by any other library

00:19:17.020 | But it turns out to be super important anyway, so we do one more epoch

00:19:21.660 | with

00:19:23.820 | training the whole network

00:19:25.820 | And then at the end we use test time augmentation

00:19:29.540 | To ensure that we get the best predictions we can, and that gives us 99.45%
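
The steps just described might look roughly like the sketch below. It assumes the fast AI library as it existed around the time of this course (the 0.7-era API), and the path, image size, batch size and learning rates are placeholder assumptions, since the lesson does not state them here. The imports sit inside the function only so the file can be loaded without fast AI installed.

```python
# Assumed values; the lesson does not state them, so adjust for your data.
PATH = "data/dogscats/"   # folder containing train/ and valid/
SZ = 299                  # image size in pixels
BS = 28                   # batch size

def train_cats_vs_dogs(path=PATH, sz=SZ, bs=BS):
    # Imported lazily so this file can be loaded without fastai installed.
    import numpy as np
    from fastai.conv_learner import (
        ConvLearner, ImageClassifierData, accuracy_np, resnet50,
        tfms_from_model, transforms_side_on,
    )

    arch = resnet50
    # Side-on photos, zooming in by up to 10%.
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_paths(path, tfms=tfms, bs=bs)
    learn = ConvLearner.pretrained(arch, data)      # precompute left off
    learn.fit(1e-2, 3, cycle_len=1)                 # 3 cycles of length 1, last layers only
    learn.unfreeze()                                # now train the whole network
    learn.bn_freeze(True)
    learn.fit([1e-5, 1e-4, 1e-2], 1, cycle_len=1)   # learning rates are assumptions
    log_preds, y = learn.TTA()                      # test time augmentation
    probs = np.mean(np.exp(log_preds), axis=0)      # average over augmented copies
    return accuracy_np(probs, y)
```

Calling train_cats_vs_dogs() needs a GPU machine with the data in place; the shape returned by TTA changed between library versions, so the averaging line may need adjusting.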

00:19:37.180 | So that's it, right, so when you try a new data set these are basically the minimum set of steps

00:19:46.260 | That you would need to follow

00:19:48.260 | You'll notice this is assuming. I already know what learning rate to use so you'd use a learning rate finder for that

00:19:54.260 | It's assuming that I know the directory layout

00:19:57.620 | and so forth

00:20:00.820 | So that's kind of a minimum set now one of the things that I wanted to make sure

00:20:05.020 | You had an understanding of how to do is how to use other libraries other than fast AI

00:20:11.780 | And so I feel like the best thing to look at is to look at Keras because Keras is a library

00:20:18.020 | Just like fast AI sits on top of PyTorch

00:20:20.820 | Keras sits on top of actually a whole variety of different back ends. Mainly people nowadays use it with TensorFlow

00:20:28.480 | There's also an MXNet version. There's also a Microsoft CNTK version

00:20:35.020 | So what I've got if you do a git pull you'll see that there's a

00:20:40.300 | something

00:20:42.220 | Called Keras lesson one where I've attempted to replicate at least parts of lesson one in Keras

00:20:49.020 | Just to give you a sense of how that works

00:20:52.580 | I'm not going to talk more about batch norm freeze now other than to say

00:21:01.880 | if you're using

00:21:04.700 | something

00:21:06.060 | Which has got a number larger than 34 at the end, so like ResNet-50 or ResNeXt-101, and you're

00:21:12.620 | Training on a data set that is very similar to ImageNet

00:21:17.500 | So it's like normal photos of normal sizes where the thing of interest takes up most of the frame

00:21:22.780 | Then you probably should add bn_freeze(True) after unfreeze

00:21:27.180 | If in doubt, try training it with and then try training it without
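
For the more advanced students, here is a toy picture of what "moving averages not being updated" means. This is not how fast AI implements bn_freeze; RunningMean is an invented class that mimics the running statistic a batch norm layer keeps, with a flag that stops it from changing.

```python
class RunningMean:
    """Toy version of the moving average a batch norm layer keeps for its input mean."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum
        self.value = 0.0
        self.frozen = False

    def update(self, batch_mean):
        # When frozen, the statistic learned during pretraining is left untouched.
        if not self.frozen:
            self.value = (1 - self.momentum) * self.value + self.momentum * batch_mean
        return self.value

stat = RunningMean()
stat.update(10.0)     # moves toward the batch mean: 0.9 * 0 + 0.1 * 10
stat.frozen = True    # the analogue of freezing the batch norm statistics
stat.update(50.0)     # ignored while frozen
print(stat.value)
```

The intuition is that when your images look a lot like ImageNet, the statistics the pretrained model already has are worth keeping.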

00:21:32.700 | More advanced students can certainly talk about it on the forums this week

00:21:36.480 | And we will be talking about the details of it in the second half of the course when we come back to our

00:21:42.740 | CNN in-depth section in the second last lesson

00:21:47.440 | So with Keras

00:21:54.300 | again, we import a bunch of stuff and

00:22:00.940 | Remember I mentioned that this idea that you've got a thing called train and a thing called valid and inside that you've got a

00:22:06.180 | Thing called dogs and a thing called cats, is a standard way of providing

00:22:10.420 | image

00:22:12.620 | Labeled images, so Keras does that too, right, so we're going to tell it where the training set and the validation set are

00:22:18.780 | What size images and what batch size to use

00:22:22.820 | Now you'll notice in Keras we need much much much more

00:22:28.540 | code to do the same thing

00:22:30.660 | More importantly each part of that code has many many many more things you have to set and if you set them wrong

00:22:37.860 | everything breaks, right, so

00:22:40.300 | I'll give you a summary of what they are. So basically rather than creating a single

00:22:47.700 | Data object, in Keras we first of all have to define something called a data

00:22:52.860 | Generator to say how to generate the data, and so for a data generator

00:22:57.140 | We basically have to say what kind of data augmentation

00:23:00.820 | we want to do and

00:23:03.620 | We also actually have to say what kind of

00:23:07.340 | Normalization we want to do, so whereas with fast AI we just say

00:23:13.180 | Whatever resnet 50 requires just do that for me, please

00:23:16.780 | We actually have to kind of know a little bit about what's expected of us

00:23:20.860 | Generally speaking copy and pasting Keras code from the internet is a good way to make sure you've got the right

00:23:26.660 | The right stuff to make that work

00:23:28.660 | And again, it doesn't have a kind of a standard set of like here are the best data augmentation parameters to use for photos

00:23:36.020 | So, you know, I've copied and pasted all of this from the Keras

00:23:39.780 | documentation

00:23:42.620 | So I don't know if it's I don't think it's the best set to use at all, but it's the set that they're using in their

00:23:47.620 | Docs

00:23:48.500 | So having said this is how I want to generate data, so horizontally flip sometimes, you know, zoom sometimes, shear sometimes

00:23:55.860 | We then create a generator from that by taking that data generator and saying I want to generate

00:24:02.300 | Images by looking from a directory and we pass in the directory which is of the same

00:24:07.700 | directory structure that fast AI uses and

00:24:10.660 | You'll see there's some overlaps with kind of how fast AI works here

00:24:14.780 | You tell it what size images you want to create you tell it what batch size you want in your mini batches

00:24:20.100 | And then there's something here not to worry about too much

00:24:23.340 | But basically if you've just got two possible outcomes you would generally say binary here

00:24:28.300 | If you've got multiple possible outcomes you would say categorical. Yeah, so we've only got cats or dogs. So it's binary

00:24:34.460 | So an example of like where things get a little more complex is you have to do the same thing for the validation set

00:24:42.300 | So it's up to you to create a data generator

00:24:44.300 | That doesn't have data augmentation because obviously for the validation set unless you're using TTA that's going to stuff things up

00:24:52.740 | you also

00:24:54.380 | When you train?

00:24:56.140 | You randomly reorder the images so that they're always shown in different orders to make it more random

00:25:01.540 | but with validation it's

00:25:04.060 | Vital that you don't do that, because if you shuffle the validation set you then can't track how well you're doing

00:25:10.020 | It's in a different order from the labels

00:25:12.420 | Basically, these are the kind of steps you have to do every time with Keras
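
A sketch of those Keras data steps, assuming the standalone keras package of that era with a ResNet-50 preprocessing function. The function names augmentation_settings and make_generators, the 224 pixel size, the batch size and the augmentation values are assumptions for illustration, not the notebook's actual numbers.

```python
def augmentation_settings():
    """Augmentation options of the kind described above (values are assumptions)."""
    return {"shear_range": 0.2, "zoom_range": 0.2, "horizontal_flip": True}

def make_generators(train_dir, valid_dir, size=224, batch_size=64):
    # Imported lazily so the sketch loads even where Keras is not installed.
    from keras.applications.resnet50 import preprocess_input
    from keras.preprocessing.image import ImageDataGenerator

    # Training: normalize the way ResNet-50 expects, plus random augmentation.
    train_datagen = ImageDataGenerator(
        preprocessing_function=preprocess_input, **augmentation_settings())
    # Validation: same normalization, no augmentation.
    valid_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

    train_generator = train_datagen.flow_from_directory(
        train_dir, target_size=(size, size),
        batch_size=batch_size, class_mode="binary")   # shuffled by default
    valid_generator = valid_datagen.flow_from_directory(
        valid_dir, target_size=(size, size),
        batch_size=batch_size, class_mode="binary",
        shuffle=False)                                 # keep order aligned with labels
    return train_generator, valid_generator

print(augmentation_settings())
```

Note the two things that are easy to forget: the validation generator has no augmentation, and it sets shuffle=False.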

00:25:20.340 | So again, the reason I was using ResNet-50 is Keras doesn't have ResNet-34, unfortunately

00:25:26.120 | So I just wanted to compare like with like, so we've got to use ResNet-50 here

00:25:29.680 | There isn't the same idea with Keras of saying like construct a model that is suitable for this data set for me

00:25:39.260 | So you have to do it by hand, right?

00:25:40.940 | So the way you do it is to basically say this is my base model and then you have to construct on top of that manually

00:25:48.700 | The layers that you want to add and so by the end of this course, you'll understand why it is that these

00:25:53.780 | particular three layers are the layers that we add

00:25:57.060 | So having done that in Keras you basically say okay

00:26:02.460 | this is my model and then again there isn't like a

00:26:05.980 | Concept of like automatically freezing things or an API for that

00:26:10.680 | so you just have to loop through the layers that you want to freeze and

00:26:15.700 | Set trainable equals false on them

00:26:18.840 | In Keras, there's a concept we don't have in fast AI or pytorch of compiling a model

00:26:25.640 | So basically once your models ready to use you have to compile it

00:26:28.720 | Passing in what kind of optimizer to use what kind of loss to look for or what metrics so again with fast AI

00:26:35.920 | You don't have to pass this in because we know what loss is the right loss to use you can always override it

00:26:42.620 | But for a particular model we give you good defaults
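
Putting the base model, the hand-built layers, the freezing loop and the compile step together might look like this sketch. It assumes the standalone keras package; the helper names, the 1024-unit layer and the rmsprop optimizer are assumptions rather than a record of the notebook. The small demo at the bottom uses stand-in objects so the freezing loop can be seen without Keras.

```python
from types import SimpleNamespace

def freeze_layers(layers):
    """Mark every layer as not trainable; works on any objects with a `trainable` attribute."""
    for layer in layers:
        layer.trainable = False

def build_model():
    # Imported lazily so the sketch loads even where Keras is not installed.
    from keras.applications.resnet50 import ResNet50
    from keras.layers import Dense, GlobalAveragePooling2D
    from keras.models import Model

    base_model = ResNet50(weights="imagenet", include_top=False)
    # The layers added on top by hand (sizes are assumptions).
    x = GlobalAveragePooling2D()(base_model.output)
    x = Dense(1024, activation="relu")(x)
    predictions = Dense(1, activation="sigmoid")(x)   # one output: cat or dog
    model = Model(inputs=base_model.input, outputs=predictions)

    freeze_layers(base_model.layers)                  # train only the new layers
    model.compile(optimizer="rmsprop", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Demo of the freezing loop on stand-in layer objects.
demo_layers = [SimpleNamespace(name=f"layer{i}", trainable=True) for i in range(3)]
freeze_layers(demo_layers)
print([layer.trainable for layer in demo_layers])
```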

00:26:45.980 | Okay, so having done all that

00:26:47.980 | Rather than calling fit you call fit_generator

00:26:50.980 | Passing in those two generators that you saw earlier the train generator and the validation generator

00:26:56.500 | For reasons I don't quite understand Keras expects you to also tell it how many batches there are per epoch

00:27:04.000 | So the number of batches is equal to the size of the generator

00:27:08.340 | Divided by the batch size you can tell it how many epochs

00:27:13.420 | just like in

00:27:15.420 | Fast AI you can say how many

00:27:17.420 | Processes or how many workers to use for pre-processing?

00:27:20.900 | Unlike fast AI the default in Keras is basically not to use any

00:27:27.500 | So to get good speed you've got to make sure you include this
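
The batches-per-epoch arithmetic is simple enough to write down. The sample counts below are illustrative only, and rounding up is a choice made for this sketch; integer division would instead drop the final partial batch.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

# Illustrative numbers only; use your generator's sample count and batch size.
print(steps_per_epoch(23000, 64))
print(steps_per_epoch(2000, 64))
```

The result is what you would pass as the steps-per-epoch argument when calling fit_generator, alongside the number of epochs and workers.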

00:27:32.620 | And so that's basically enough to start fine-tuning the last layers

00:27:42.820 | So as you can see I got to a validation accuracy of 95%

00:27:46.140 | But as you can also see something really weird happened where after one it was like 49 and then it was 69 and then 95

00:27:53.040 | I don't know

00:27:54.900 | Why these are so low. That's not normal. There may be a bug in Keras. There may be a bug in my code

00:28:01.500 | I reached out on Twitter to see if anybody could figure it out, but they couldn't I guess this is one of the challenges with using

00:28:08.700 | Something like this, and one of the reasons I wanted to use fast AI for this course is it's much harder to screw things up

00:28:14.740 | So I don't know if I screwed something up or somebody else did. Yes?

00:28:18.700 | This is using the TensorFlow back end, yeah. And if you want to run this to try it out yourself

00:28:28.780 | You can just go pip install

00:28:32.500 | tensorflow-gpu

00:28:36.940 | keras

00:28:38.500 | Okay, because it's not part of the fast AI environment by default

00:28:42.720 | But that should be all you need to do to get that working

00:28:47.540 | So then

00:28:54.060 | There isn't a concept of like layer groups or differential learning rates or partial unfreezing or whatever

00:29:00.420 | So you have to decide like I had to print out all of the layers and decide manually

00:29:04.980 | How many I wanted to fine-tune, so I decided to fine-tune everything from layer 140 onwards

00:29:10.280 | So that's why I just looped through like this

00:29:12.280 | After you change that you have to recompile the model
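
The "everything from layer 140 onwards" loop can be sketched on stand-in objects. The function name fine_tune_from and the count of 150 layers are invented for this example; with Keras you would pass model.layers and then compile again.

```python
from types import SimpleNamespace

def fine_tune_from(layers, first_trainable):
    """Freeze layers before `first_trainable` and unfreeze that layer and all later ones."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= first_trainable

# Stand-in layer objects; with Keras you would pass model.layers instead,
# then call model.compile(...) again so the change takes effect.
layers = [SimpleNamespace(trainable=False) for _ in range(150)]
fine_tune_from(layers, 140)
print(sum(layer.trainable for layer in layers))
```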

00:29:15.540 | And then after that I then ran another step and again

00:29:19.540 | I don't know what happened here, the accuracy of the training set stayed about the same but the validation set totally fell in a hole

00:29:25.380 | But I mean the main thing to note is even if we put aside the validation set

00:29:32.340 | We're getting I mean, I guess the main thing is there's a hell of a lot more code here

00:29:36.300 | Which is kind of annoying, but also the performance is very different. So here, even on the training set

00:29:42.860 | We're getting like 97% after four epochs that took a total of about eight minutes

00:29:48.420 | you know over here we had

00:29:51.140 | 99.5% on the validation set and it ran a lot faster. So it was like

00:29:58.100 | four or five minutes

00:30:00.940 | right

00:30:02.860 | so

00:30:04.860 | Depending on what you do particularly if you end up wanting to deploy stuff to mobile devices at the moment

00:30:12.880 | The kind of PyTorch on mobile situation is very early

00:30:18.020 | So you may find yourself wanting to use tensorflow or you may work for a company that's kind of settled on tensorflow

00:30:24.340 | So if you need to convert something like redo something you've learned here in tensorflow

00:30:30.980 | You probably want to do it with Keras, but just recognize

00:30:35.160 | you know, it's going to take a bit more work to get there and

00:30:38.700 | By default it's much harder to get, I mean, to get the same state-of-the-art results you get with fastai

00:30:46.140 | You'd have to like replicate all of the state-of-the-art

00:30:49.620 | Algorithms that are in fastai so it's hard to get the same

00:30:53.300 | Level of results, but you can see the basic ideas are similar

00:30:59.140 | Okay, and it's certainly

00:31:01.140 | It's certainly possible, you know, like there's nothing I'm doing in fastai that like would be impossible

00:31:07.380 | But like you would have to implement stochastic gradient descent with restarts. You would have to

00:31:11.260 | Implement differential learning rates, you would have to implement batch norm freezing

00:31:16.820 | Which you probably don't want to do. Well, that's not quite true

00:31:20.940 | I think at least one person on the forum is

00:31:23.100 | Attempting to create a Keras-compatible, or TensorFlow-compatible, version of fastai

00:31:28.380 | Which I think, I hope, will get there

00:31:30.620 | I actually spoke to Google about this a few weeks ago, and they're very interested in getting fastai ported to TensorFlow

00:31:36.420 | So maybe by the time you're looking at this on the MOOC, maybe that will exist. I certainly hope so

00:31:41.820 | We will see

00:31:44.580 | Anyway, so Keras and TensorFlow are certainly not

00:31:49.900 | You know

00:31:52.940 | That difficult to handle and so I don't think you should worry if you're told you have to learn them

00:31:57.900 | After this course for some reason it'll only take you a couple of days. I'm sure

00:32:02.020 | So that's kind of most of the stuff you would need to

00:32:10.780 | Kind of complete this assignment from last week

00:32:14.460 | Which was like try to do everything you've seen already, but on the dog breeds data set, and just to remind you

00:32:21.300 | In the last few minutes of last week's lesson I showed you how to do much of that

00:32:28.940 | Including like how I actually explored the data to find out like what the classes were and how big the images were and stuff like

00:32:37.860 | That right so if you've forgotten that or didn't quite follow it all last week check out the video from last week to see

00:32:45.380 | One thing that we didn't talk about is how do you actually submit to Kaggle? So how do you actually get predictions?

00:32:51.200 | So I just wanted to show you that last piece as well

00:32:54.160 | And on the wiki thread this week. I've already put a little image of this to show you these steps

00:32:59.980 | But if you go to the Kaggle

00:33:02.980 | Website for every competition there's a section called evaluation and they tell you what to submit and so I just copied and pasted these

00:33:10.900 | Two lines from there, and so it says we're expected to submit a file where the first line

00:33:17.060 | Contains the word id and then a comma separated list of all of the possible dog breeds

00:33:24.300 | And then every line after that will contain the ID itself

00:33:28.700 | Followed by all the probabilities of all the different dog breeds

00:33:32.500 | so

00:33:34.860 | How do you create that?

00:33:37.700 | So recognize that inside our data object there's a .classes

00:33:41.400 | Which has got, in alphabetical order, all of the classes

00:33:47.560 | and then

00:33:50.460 | So it's got all of the different classes and then inside

00:33:54.580 | data.test_ds, the test data set, you can also see there's all the file names

00:34:00.460 | So just to remind you

00:34:04.180 | dogs and cats, sorry, not dogs and cats, dog breeds

00:34:08.100 | Was not provided in the kind of Keras style format where the dogs and cats are in different folders

00:34:15.260 | But instead it was provided as a CSV file of labels, right? So when you get a CSV file of labels you use

00:34:22.780 | ImageClassifierData.from_csv rather than ImageClassifierData.from_paths

00:34:30.900 | There isn't an equivalent in Keras, so you'll see like on the Kaggle forums people

00:34:35.100 | Share scripts for how to convert it to Keras-style folders

00:34:39.380 | But in our case we don't have to, we just go ImageClassifierData.from_csv passing in that CSV file

00:34:44.860 | And so the CSV file has automatically told the data, you know, what the classes are

00:34:52.100 | And then also we can see from the folder of test images what the file names of those are

00:35:00.680 | So with those two pieces of information

00:35:02.680 | We're ready to go so I always think it's a good idea to use TTA

00:35:08.040 | As you saw with that dogs and cats example just now it can really improve things particularly when your model is less good

00:35:15.240 | So I can say learn.TTA and

00:35:26.080 | If you pass in is_test=True

00:35:29.600 | Then it's going to give you predictions on the test set rather than the validation set okay, and now obviously we can't now get

00:35:37.480 | An accuracy or anything because by definition. We don't know the labels for the test set right

00:35:43.880 | So by default most

00:35:48.580 | PyTorch models give you back the log of the predictions

00:35:53.240 | So then we just have to go exp of that to get back our probabilities

00:35:57.720 | So in this case the test set had ten thousand three hundred and fifty seven

00:36:01.680 | Images in it, and there are 120 possible breeds, all right, so we get back a matrix of that size
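A minimal sketch of why exp gets the probabilities back, using NumPy and a made-up 2 by 3 score matrix instead of the real 10,357 by 120 one. The `log_softmax` helper here is an illustrative stand-in written for this example; it is not the fastai or PyTorch code, and it assumes the model's final layer is a log softmax.

```python
import numpy as np

def log_softmax(scores):
    # Hypothetical helper: log of the softmax, computed stably row by row.
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Made-up raw scores for 2 images and 3 classes.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])

log_preds = log_softmax(scores)  # what the model is assumed to hand back
probs = np.exp(log_preds)        # exp undoes the log

print(probs.shape)        # one row per image, one column per class
print(probs.sum(axis=1))  # each row sums to approximately 1
```

The log values are all zero or negative, which is a quick way to spot that you are looking at log predictions rather than probabilities.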

00:36:08.680 | and so we now need to turn that into

00:36:11.680 | Something that looks like this and

00:36:15.400 | So the easiest way to do that is with pandas if you're not familiar with pandas

00:36:20.160 | There's lots of information online about it, or check out the machine learning course, Intro to Machine Learning, that we have

00:36:25.520 | Where we do lots of stuff with pandas

00:36:27.280 | but basically we can just go pd.DataFrame and pass in that matrix and

00:36:32.200 | then we can say the names of the columns are equal to data.classes and

00:36:37.080 | Then finally we can insert a new column at position zero called id that contains the file names

00:36:44.080 | But you'll notice that the file names contain

00:36:49.360 | Five letters at the start we don't want and four letters at the end we don't want, so I just

00:36:55.240 | Subset it like so, right? So at that point

00:37:00.280 | I've got a data frame that looks like this

00:37:04.800 | Which is what we want so you can now

00:37:08.640 | Call the data frame... I should have used df, not ds

00:37:14.240 | Let's fix it now

00:37:17.000 | data

00:37:19.000 | Frame

00:37:23.240 | Okay, so you can now call the data frame's to_csv and

00:37:27.400 | Quite often you'll find these files actually get quite big

00:37:32.080 | so it's a good idea to say compression='gzip' and that'll zip it up on the server for you, and that's going to create a

00:37:38.920 | zipped up

00:37:41.680 | CSV file on the server, wherever you're running this Jupyter notebook
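The pandas steps described here can be sketched as follows. The class names, file names, and probabilities are made up for illustration; in the lesson they come from the data object and from the test-set predictions. The output path and `index=False` are assumptions of this sketch as well, since the transcript does not spell them out.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-ins for the class list, the test file names, and the
# probability matrix (one row per image, one column per class).
classes = ["affenpinscher", "afghan_hound", "airedale"]
fnames = ["test/0a1b2c.jpg", "test/3d4e5f.jpg"]
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])

df = pd.DataFrame(probs)
df.columns = classes
# Drop the 5-character "test/" prefix and the 4-character ".jpg" suffix.
df.insert(0, "id", [name[5:-4] for name in fnames])

# Hypothetical output location; index=False (an assumption) keeps the
# pandas row numbers out of the file.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv.gz")
df.to_csv(out_path, compression="gzip", index=False)

print(df)
```

Reading the file back with `pd.read_csv(out_path)` is a cheap way to double check the header and the first few rows before uploading.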

00:37:46.920 | You now need to get that back to your computer so you can upload it

00:37:49.860 | Or you can use Kaggle CLI, so you can type kg submit and do it that way. I

00:37:56.640 | Generally download it to my computer because I like I often like to just like double check it all looks okay

00:38:02.520 | So to do that there's a cool little thing called FileLink, and if you run FileLink

00:38:08.800 | With a path on your server it gives you back a URL

00:38:12.320 | Which you can click on and it'll download that file from the server onto your computer

00:38:19.000 | so if I click on that now I

00:38:22.040 | Can go ahead and save it and then I can see in my downloads

00:38:29.480 | There it is here's my submission file

00:38:36.600 | If you want to open there yeah, and as you can see it's exactly what I asked for there's my

00:38:46.420 | id and the 120 different dog breeds and

00:38:49.760 | Then here's my first row containing the file name and the 120 different probabilities

00:38:54.740 | Okay, so then you can go ahead and submit that to Kaggle

00:38:58.600 | Through their regular form, and so you can see we've now got a good way of both

00:39:06.240 | Grabbing any file off the internet and getting it to our AWS instance or paper space or whatever by using

00:39:12.520 | the

00:39:14.640 | Cool little extension in Chrome, and we've also got a way of grabbing stuff off our server easily

00:39:20.000 | those of you that are more

00:39:22.520 | Command-line oriented you can also use SCP of course, but I kind of like doing everything through the notebook

00:39:28.720 | All right

00:39:32.880 | One other question. I had during the week was like what if I want to just get a single a

00:39:38.600 | single file

00:39:41.080 | that I want to

00:39:42.360 | You know get a prediction for so for example you know maybe I want to get the first file from my validation set

00:39:49.060 | So there's its name

00:39:51.080 | So you can always look at a file just by calling Image.open

00:39:54.520 | That just uses the regular

00:39:57.520 | Python imaging library

00:40:01.200 | and

00:40:02.520 | So what you can do is, I'll show you the shortest version

00:40:06.320 | You can just call

00:40:08.880 | learn.predict_array

00:40:10.880 | Passing in your image

00:40:15.600 | Okay, now the image needs to have been

00:40:19.320 | transformed

00:40:21.800 | So you've seen tfms_from_model before

00:40:27.120 | Normally, we just put it all in one variable, but actually behind the scenes it was returning two things

00:40:32.220 | It was returning training transforms and validation transforms, so I can actually split them apart

00:40:36.840 | And so here you can see I'm actually applying, for example, my training transforms, or probably more likely I want to apply

00:40:44.040 | validation transforms

00:40:46.760 | That gives me back an array containing the image the transformed image

00:40:51.400 | Which I can then pass to predict array

00:40:55.920 | Everything that gets passed to or returned from our models is

00:41:00.560 | Generally assumed to be a mini batch right generally assumed to be a bunch of images

00:41:05.780 | So we'll talk more about some numpy tricks later, but basically in this case. We only have one image

00:41:12.220 | So we have to turn that into a mini batch of images so in other words. We need to create a tensor

00:41:17.520 | That basically is not just

00:41:20.960 | Rows by columns by channels, but it's number of images by rows by columns by channels, and it has one image

00:41:27.980 | So it basically becomes a four-dimensional tensor, so there's a cool little trick in numpy that if you index

00:41:34.360 | Into an array with None, that basically adds an additional unit axis to the start

00:41:40.760 | So it turns it from an image into a mini batch of one image, and so that's why we had to do that

00:41:46.000 | So if you basically find you're trying to do things with a single image

00:41:51.360 | With any kind of PyTorch or fastai thing, this is just something you might find: if it says like expecting four

00:41:59.160 | Dimensions, only got three, it probably means that. Or if you get back a return

00:42:04.420 | Value from something that has like some weird first axis, that's probably why: it's probably giving you back a mini batch

00:42:12.200 | Okay, and so we'll learn a lot more about this, but it's just something to be aware of
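The None-indexing trick can be sketched like this, with a made-up all-zeros array standing in for a real transformed image. The axis order follows the rows by columns by channels description above and is only an assumption of the example; the trick works the same for any layout.

```python
import numpy as np

# A made-up single "image": rows x columns x channels.
img = np.zeros((4, 5, 3), dtype=np.float32)

# Indexing with None adds a unit axis at the start
# (it is the same as img[np.newaxis]).
batch = img[None]
print(img.shape)    # (4, 5, 3)
print(batch.shape)  # (1, 4, 5, 3)

# Going the other way: pull the single item back out of a mini batch.
single = batch[0]
print(single.shape)  # (4, 5, 3)
```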

00:42:16.040 | Okay, so that's kind of everything you need to do in practice

00:42:25.360 | So now we're going to kind of get into a little bit of theory

00:42:30.480 | What's actually going on behind the scenes with these convolutional neural networks, and you might remember in

00:42:38.040 | back in lesson one

00:42:41.960 | We

00:42:43.960 | actually saw

00:42:45.960 | Our first little bit of theory

00:42:49.240 | Which we stole from this fantastic website, setosa.io/ev, Explained Visually

00:42:55.260 | And we learned that a convolution is something where we basically have a little matrix

00:43:01.320 | In deep learning nearly always three by three a little matrix that we basically multiply every element of that matrix

00:43:08.920 | By every element of a three by three section of an image

00:43:12.600 | Add them all together to get the result of that convolution at one point, right? Now

00:43:19.140 | Let's see how that all gets turned together

00:43:22.960 | to create these

00:43:25.400 | These various layers that we saw in the Zeiler and Fergus paper, and to do that again

00:43:31.720 | I'm going to steal off somebody who's much smarter than I am

00:43:34.080 | we're going to steal from a

00:43:37.520 | Guy called Otavio Good. Otavio Good was the guy who created Word Lens

00:43:43.240 | Which nowadays is part of Google Translate. If on Google Translate you've ever done that thing where you point your camera

00:43:51.680 | At something which has any kind of foreign language on it and in real time it overlays it with the translation

00:43:57.520 | That was Otavio's company that built that

00:44:00.160 | And so he was kind enough to share this fantastic video he created. He's at Google now

00:44:08.320 | And I want to kind of step you through it because I think it explains really really well

00:44:11.940 | What's going on, and then after we look at the video we're going to see how to implement a whole

00:44:17.160 | Sequence, an entire set of layers of a convolutional neural network, in Microsoft Excel

00:44:22.960 | So whether you're a visual learner or a spreadsheet learner, hopefully you'll be able to understand all this

00:44:28.520 | So we're going to start with an image

00:44:31.480 | And something that we're going to do later in the course is we're going to learn to recognize digits

00:44:35.920 | So we'll do it like end-to-end. We'll do the whole thing. So this is pretty similar

00:44:39.840 | So we're going to try and recognize in this case letters

00:44:43.200 | So here's an a which obviously it's actually a grid of numbers, right?

00:44:48.440 | And so there's the grid of numbers. And so what we do is we take our first

00:44:52.800 | Convolutional filter, and this is assuming that these are already learnt

00:44:58.760 | Right and you can see this one. It's got white down the right hand side, right and black down the left

00:45:04.440 | So it's like 0 0 0 or maybe negative 1 negative 1 negative 1 0 0 0 1 1 1 and so we're taking each

00:45:10.720 | 3 by 3 part of the image and multiplying it by that 3 by 3

00:45:15.380 | Matrix, not as a matrix product but an element-wise product, and so you can see what happens is

00:45:21.520 | everywhere where the white edge is

00:45:25.120 | Matching the edge of the a and the black edge isn't we're getting green

00:45:30.160 | We're getting a positive and everywhere where it's the opposite. We're getting a negative

00:45:34.280 | We're getting a red right and so that's the first filter creating the first

00:45:39.520 | The result of the first kernel right and so here's a new kernel

00:45:44.740 | This one has got a white stripe along the top, right? So we literally scan it through every three by three part of the matrix

00:45:52.400 | multiplying those

00:45:55.280 | Nine bits of the A by the nine bits of the filter to find out whether it's red or green and how red or green it is

00:46:01.880 | Okay, and so this is assuming we had two filters one was a bottom edge

00:46:05.880 | One was a left edge and you can see here the top edge not surprisingly

00:46:09.960 | It's red here. Sorry bottom edge was red here and green here the right edge red here and green here

00:46:15.560 | And then in the next step we add a non-linearity

00:46:18.320 | Okay, the rectified linear unit which literally means throw away the negatives so here the reds all gone

00:46:26.000 | Okay, so here's layer one the input here's layer two the result of two convolutional filters

00:46:31.960 | Here's layer three, which is throw away all of the red stuff

00:46:36.640 | And that's called a rectified linear unit, and then layer four is something called a max pool

00:46:42.320 | At layer four we replace every

00:46:45.000 | two by two

00:46:47.200 | Part of this grid with its maximum, right? So it basically makes it half the size

00:46:53.560 | It's basically the same thing, but half the size and then we can go through and do exactly the same thing

00:46:58.840 | We can have some new

00:47:00.600 | Filter three by three filter that we put through each of the two results of the previous layer

00:47:06.480 | Okay

00:47:08.200 | And again, we can throw away the red bits

00:47:10.520 | Right so get rid of all the negatives so we just keep the positives. That's called applying a rectified linear unit

00:47:16.480 | and

00:47:18.800 | That gets us to our next layer of this convolutional neural network

00:47:22.840 | So you can see that by you know at this layer back here. It was kind of very interpretable

00:47:29.180 | It's like we've either got bottom edges or left edges, but then the next layer was combining

00:47:34.360 | The results of convolution so it's starting to become a lot less clear like intuitively what's happening

00:47:40.480 | But it's doing the same thing, and then we do another max pool, right? So we replace every two by two or three by three

00:47:47.680 | Section with a single digit so here this two by two. It's all black so we replaced it with a black

00:47:53.800 | All right, and then we go and we take that and we compare it

00:47:58.200 | To basically a kind of a template of what we would expect to see if it was an A

00:48:04.020 | It was a B. It was a C. It was a D

00:48:05.860 | It was an E, and we see how closely it matches, and we can do it in exactly the same way

00:48:11.500 | We can multiply every one of the values in this four by eight matrix with every one of the four by eight in this one

00:48:19.520 | And this one and this one, and we just add them together to say like how often does it match

00:48:24.720 | Versus how often does it not match, and then that could be converted to give us a percentage

00:48:30.720 | Probability that this is an A. So in this case this particular template matched well with A

00:48:38.720 | So notice we're not doing any training here, right? This is how it would work if we have a pre trained model

00:48:45.040 | All right

00:48:45.920 | So when we download a pre trained ImageNet model off the internet and use it on an image without any changes to it

00:48:51.820 | This is what's happening or if we take a model that you've trained and you're applying it to some test set or to some new image

00:48:58.840 | This is what it's doing, right? It's basically taking it through, it's applying a convolution to each layer, well, multiple

00:49:07.080 | convolutional filters to each layer

00:49:09.080 | And then doing the rectified linear unit, so throw away the negatives, and then do the max pool

00:49:16.360 | And then repeat that a bunch of times and so then we can do it with a new

00:49:21.840 | Letter a or letter B or whatever and keep going through

00:49:26.440 | That process, right?

00:49:29.480 | So as you can see that's a far nicer visualization than I could have created, because I'm not Otavio

00:49:35.360 | So thanks to him for sharing this with us because it's totally awesome

00:49:39.520 | This is not done by hand. He actually wrote a piece of computer software to actually do these convolutions

00:49:45.740 | This is actually being done dynamically. It's pretty cool

00:49:50.240 | So I'm more of a spreadsheet guy personally. I'm a simple person

00:49:55.200 | So here is the same thing now in spreadsheet form right and so you'll find this in the github repo, so you can either

00:50:04.360 | git clone the repo to your own computer to open up the spreadsheet

00:50:08.320 | or you can just go to github.com slash fastai and

00:50:11.520 | Click on this it sits inside

00:50:14.560 | If you go to our repo

00:50:22.320 | And just go to courses as usual, go to dl1 as usual, you'll see there's an excel section there

00:50:28.480 | Okay, and so here they all are, so you can just download them by clicking them

00:50:31.920 | Or you can clone the whole repo, and we're looking at conv-example, the convolution example

00:50:37.280 | right, so you can see I have here an

00:50:41.600 | Input, right? So in this case the input is the number seven, so I grabbed this from a data set called

00:50:49.760 | MNIST, which we'll be looking at in a lot of detail

00:50:52.960 | and I just took one of those digits at random and I put it into Excel and so you can see every

00:51:00.560 | Pixel is actually just a number between naught and one

00:51:03.720 | okay, very often actually it'll be a

00:51:07.480 | Byte between naught and 255

00:51:11.120 | Or sometimes it might be a float between naught and one, it doesn't really matter, by the time it gets to PyTorch

00:51:18.160 | We're generally dealing with floats

00:51:20.280 | So one of the steps we often will take will be to convert it to a number between naught and one
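A minimal sketch of that conversion step with NumPy, using a handful of made-up byte values rather than a real MNIST digit:

```python
import numpy as np

# Made-up pixel values stored as bytes (0 to 255).
pixels = np.array([[0, 128, 255],
                   [64, 32, 16]], dtype=np.uint8)

# Convert to floats and rescale so every value lies between 0 and 1.
scaled = pixels.astype(np.float32) / 255

print(scaled.dtype)
print(scaled.min(), scaled.max())
```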

00:51:28.320 | So you can see I've just used conditional formatting in Excel to kind of make the higher numbers more red

00:51:34.480 | So you can clearly see that this is a seven

00:51:38.400 | But but it's just a bunch of numbers that have been imported into Excel okay, so here's our input

00:51:46.040 | So remember what Otavio did was he then applied two filters

00:51:54.600 | Right with different shapes so here. I've created a filter which is designed to detect top edges

00:52:00.860 | So this is a 3 by 3 filter

00:52:03.760 | Okay, and I've got ones along the top zeros in the middle minus ones at the bottom right so let's take a look at an example

00:52:11.720 | That's here, right? And so if I hit F2 you can see here highlighted

00:52:18.060 | This is the 3 by 3 part of the input that this particular thing is calculating right

00:52:24.000 | so here you can see it's got 1 1 1 are all being multiplied by 1 and

00:52:29.560 | 0.1 0 0 are all being multiplied by negative 1

00:52:34.840 | Okay, so in other words all the positive bits are getting a lot of positive the negative bits are getting nearly nothing at all

00:52:41.540 | So we end up with a high number
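The arithmetic for that one cell can be sketched in NumPy. The filter and the top and bottom rows of the patch follow the values read out above; the middle row of the patch is made up, which does not change the answer because the filter multiplies it by zero.

```python
import numpy as np

# The filter described above: ones on top, zeros in the middle, minus ones below.
kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]], dtype=np.float64)

# One 3x3 patch of the input. The middle row is a made-up placeholder.
patch = np.array([[1.0, 1.0, 1.0],
                  [0.6, 0.4, 0.2],
                  [0.1, 0.0, 0.0]])

# Element-wise multiply, then add everything up: one activation.
activation = (patch * kernel).sum()
print(activation)  # roughly 2.9
```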

00:52:43.720 | Okay, whereas on the other side of this bit of the seven

00:52:48.880 | Right you can see how you know this is basically zeros here or perhaps more interestingly on the top of it

00:52:57.060 | Right

00:53:01.320 | Here we've got

00:53:03.320 | High numbers at the top, but we've also got high numbers at the bottom which are negating it

00:53:07.800 | Okay, so you can see that the only place that we end up

00:53:11.340 | activating is

00:53:14.000 | Where we're actually at an edge

00:53:17.760 | So in this case this here this number three

00:53:20.320 | This is called an activation

00:53:23.200 | Okay, so when I say an activation I mean a number, a

00:53:30.320 | Number that is calculated, and it is calculated by taking

00:53:37.000 | some numbers from the input and

00:53:40.040 | applying some kind of linear operation in this case a convolutional kernel to

00:53:47.740 | Calculate an output, right?

00:53:49.740 | You'll notice that other than going

00:53:52.940 | Inputs multiplied by kernel and summing it together

00:53:58.740 | Right. So here's my sum and here's my multiply

00:54:03.060 | I then take that and I go max of zero comma that and

00:54:07.940 | So that's my rectified linear unit. So it sounds very fancy

00:54:13.220 | Rectified linear unit, but what they actually mean is open up Excel and type equals max zero comma thing. Okay

00:54:19.540 | That's all a ReLU is, and you'll see people in the biz sort of say ReLU, okay

00:54:26.020 | So ReLU means rectified linear unit means max zero comma thing, and I'm not like simplifying it

00:54:33.700 | I really mean it, like if I'm simplifying I always say I'm simplifying

00:54:38.060 | But if I'm not saying I'm simplifying, that's the entirety. Okay, so a rectified linear unit in its entirety is this

00:54:44.460 | And a convolution in its entirety is this

00:54:50.060 | Okay, so a single layer of a convolutional neural network is being implemented in its entirety

00:54:58.940 | Here in Excel, okay, and so you can see what it's done is it's deleted pretty much the vertical edges

00:55:08.020 | And highlighted the horizontal edges
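Here is a rough NumPy version of that single layer: scan the filter over the input, then take max of zero and the result. The 6 by 6 input is a made-up horizontal bar rather than the spreadsheet's digit, and `conv2d` and `relu` are helper names chosen for this sketch, not functions from fastai or PyTorch.

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over every position where it fully fits,
    # multiply element-wise, and sum.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = (image[r:r + kh, c:c + kw] * kernel).sum()
    return out

def relu(x):
    # Rectified linear unit: max(0, x), applied to every element.
    return np.maximum(0, x)

# Made-up input: a bar of ones in rows 2-3, columns 1-4.
image = np.zeros((6, 6))
image[2:4, 1:5] = 1.0

kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]])

activations = relu(conv2d(image, kernel))
print(activations)
```

The output is 4 by 4 rather than 6 by 6 because this sketch only evaluates positions where the whole 3 by 3 filter fits inside the input.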

00:55:10.580 | so again, this is assuming that

00:55:13.580 | our network is trained and

00:55:15.900 | That at the end of training it had created a convolutional filter with these specific nine numbers in

00:55:22.020 | And so here is a second convolutional filter

00:55:26.860 | It's just a different nine numbers

00:55:29.860 | Now PyTorch doesn't store them as two separate nine digit arrays

00:55:36.500 | It stores it as a tensor. Remember a tensor just means an array with

00:55:42.660 | More dimensions. Okay, you can use the word array as well

00:55:48.280 | It's the same thing, but in PyTorch they always use the word tensor. So I'm going to say tensor

00:55:54.700 | Okay, so it's just a tensor with an additional axis which allows us to stack

00:56:00.180 | Each of these filters together

00:56:02.780 | right a filter and kernel

00:56:06.260 | Pretty much mean the same thing. Yeah, right it refers to one of these three by three

00:56:11.100 | Matrices or one of these three by three

00:56:14.340 | slices of a three dimensional tensor
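The stacking idea can be sketched with NumPy arrays standing in for tensors. The second filter is simply the first one transposed, chosen here for illustration; it is not necessarily the exact second filter in the spreadsheet.

```python
import numpy as np

horizontal = np.array([[ 1,  1,  1],
                       [ 0,  0,  0],
                       [-1, -1, -1]], dtype=np.float32)
vertical = horizontal.T  # made-up second filter: the first one transposed

# Stack the two 3x3 filters along a new leading axis.
filters = np.stack([horizontal, vertical])
print(filters.shape)  # (2, 3, 3)

# Each filter is one 3x3 slice of the 3-dimensional array.
print(np.array_equal(filters[0], horizontal))  # True
print(np.array_equal(filters[1], vertical))    # True
```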

00:56:18.380 | So if I take this one and here I've literally just copied the formulas in Excel from above

00:56:23.980 | Okay, and so you can see this one is now finding a vertical edge as we would expect. Okay, so

00:56:34.860 | We've now created

00:56:36.860 | One

00:56:39.500 | Layer right this here is a layer and specifically we'd say it's a hidden layer

00:56:44.500 | Which is it's not an input layer and it's not an output layer. So everything else is a hidden layer. Okay, and

00:56:51.060 | this particular hidden layer is

00:56:55.180 | Size 2 on this dimension, right? Because it has two

00:57:00.060 | Filters

00:57:03.220 | Right two kernels

00:57:05.220 | So what happens next

00:57:08.740 | Well

00:57:11.260 | Let's do another one

00:57:12.900 | Okay, so as we kind of go along things can

00:57:16.340 | Multiply a little bit in complexity right because my next filter is going to have to contain

00:57:24.060 | Two of these three by threes, because I'm going to have to say how do I want to

00:57:30.900 | Weight these three things, and at the same time, how do I want to weight the corresponding three things down here?

00:57:37.220 | But because in PyTorch

00:57:39.260 | This is going to be this whole thing here is going to be stored as a multi-dimensional tensor, right?

00:57:45.900 | So you shouldn't really think of this now as two three by three kernels, but one

00:57:51.660 | two by three by three kernel

00:57:54.980 | Okay, so to calculate this value here

00:58:00.420 | I've got the sum product of all of that plus

00:58:05.180 | the sum product of

00:58:08.260 | Scroll down

00:58:11.940 | All of that

00:58:14.420 | Okay, and

00:58:16.620 | So the top ones are being multiplied by this part of the kernel and the bottom ones are being multiplied by this part of the

00:58:22.420 | kernel and so over time

00:58:25.060 | You want to start to get very comfortable with the idea of these like higher dimensional?

00:58:31.020 | Linear combinations, right?

00:58:33.820 | Like it's it's harder to draw it on the screen like I had to put one above the other

00:58:39.340 | But conceptually just stack it in your mind like this. That's really how you want to think

00:58:44.660 | Right, and actually Geoffrey Hinton in his original

00:58:47.880 | 2012 neural nets

00:58:50.860 | Coursera class has a tip which is how all computer scientists deal with like very high dimensional spaces

00:58:57.660 | Which is that they basically just visualize the two-dimensional space and then say like 12 dimensions really fast in their head lots of times

00:59:06.080 | So that's it right we can see two dimensions on the screen, and then you just got to try to trust

00:59:11.620 | That you can have more dimensions like the concepts just you know

00:59:17.220 | There's there's nothing different about them, and so you can see in Excel

00:59:20.420 | You know Excel doesn't have the ability to handle three-dimensional tensors, so I had to like say okay, take this two-dimensional

00:59:26.860 | Dot product add on this two-dimensional dot product right, but if there was some kind of 3d excel

00:59:34.460 | I could have just done that in a single formula

00:59:36.940 | And then again apply max 0 comma, otherwise known as rectified linear unit, otherwise known as ReLU
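What a "3D Excel" would do in one formula can be sketched in NumPy: treat the two 3 by 3 pieces as one 2 by 3 by 3 kernel and sum over all three axes at once. The inputs and weights below are random placeholders, and `conv_multi` is a name made up for this example.

```python
import numpy as np

def conv_multi(inputs, kernel):
    # inputs: (channels, height, width); kernel: (channels, kh, kw).
    # Each output value sums the element-wise product over ALL channels.
    _, kh, kw = kernel.shape
    out_h = inputs.shape[1] - kh + 1
    out_w = inputs.shape[2] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = inputs[:, r:r + kh, c:c + kw]
            out[r, c] = (window * kernel).sum()
    return out

rng = np.random.default_rng(0)
layer1 = rng.random((2, 6, 6))          # placeholder outputs of two first-layer filters
kernel = rng.standard_normal((2, 3, 3))  # one 2x3x3 kernel with placeholder weights

# Single "3D" formula, followed by ReLU.
combined = np.maximum(0, conv_multi(layer1, kernel))

# The spreadsheet's way: one 2D sum-product per channel, added together.
separate = sum(conv_multi(layer1[i:i + 1], kernel[i:i + 1]) for i in range(2))

print(combined.shape)
print(np.allclose(combined, np.maximum(0, separate)))
```

Both routes give the same numbers; the 2 by 3 by 3 view just avoids writing the per-channel sums out by hand.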

00:59:45.460 | Okay, so here is my second layer, and so when people create different

00:59:51.600 | architectures right and architecture means

00:59:55.140 | Like how big is your kernel at layer one, how many filters are in your kernel at layer one. So here

01:00:03.280 | I've got a 3 by 3

01:00:05.100 | There's number one, and a 3 by 3, there's number two, so like this architecture

01:00:11.180 | I've created starts off with two three by three convolutional kernels and

01:00:16.900 | then my

01:00:19.940 | Second layer has another two kernels of size two by three by three

01:00:25.900 | So there's the first one and then down here. Here's the second two by three by three kernel, okay, and so

01:00:32.960 | Remember any one of these numbers is an activation

01:00:39.500 | Okay, so this activation is being calculated from these three things here and the other three things up there

01:00:46.460 | And we're using this two by three by three

01:00:49.580 | kernel okay

01:00:52.020 | And so what tends to happen is people generally give names to their layers, so I say okay

01:00:57.780 | Let's call this layer here conv1

01:01:02.740 | and

01:01:06.860 | This layer here conv2, right? So that's, you know

01:01:11.680 | Generally, you'll just see that like when you print out a summary of a network every layer will have some kind of name

01:01:18.840 | Okay, and so then what happens next?

01:01:22.740 | Well part of the architecture is like do you have some max pooling?

01:01:27.940 | Whereabouts does that max pooling happen? In this architecture we're inventing, the next step

01:01:33.980 | is to do max pooling. Okay, max pooling is a little hard to

01:01:38.660 | Kind of show in Excel, but we've got it

01:01:41.980 | So max pooling if I do a two by two max pooling it's going to halve the resolution both height and width

01:01:49.980 | So you can see here that I've replaced

01:01:52.660 | These four numbers

01:01:57.340 | with the maximum of those four numbers

01:02:00.740 | Right and so because I'm halving the resolution it only makes sense to actually have something every two cells

01:02:05.980 | Okay, so you can see here the way. I've got kind of the same

01:02:11.500 | Looking shape as I had back here, okay, but it's now half the resolution because I've replaced every

01:02:17.860 | two by two

01:02:19.860 | With its max and you'll notice like it's not every possible two by two I skip over from here

01:02:25.620 | So this is like starting at BQ and then the next one starts at

01:02:31.020 | BS

01:02:32.380 | Right, so they're like non overlapping. That's why it's decreasing the resolution
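
The non-overlapping two by two max pooling just described can be sketched as follows. The 4 by 4 values are made up for illustration, and `max_pool_2x2` is a hypothetical helper that assumes even height and width.

```python
import numpy as np

def max_pool_2x2(a):
    """Non-overlapping 2x2 max pooling on an (H, W) array with even H and W."""
    h, w = a.shape
    # Split the grid into 2x2 blocks, then take the max inside each block
    blocks = a.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

acts = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]])
pooled = max_pool_2x2(acts)
print(pooled)  # [[4 2]
               #  [2 8]]
```

Because each block starts two cells after the previous one, the output has half the height and half the width of the input.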

01:02:36.540 | Okay, so anybody who's comfortable with spreadsheets

01:02:40.800 | You know you can open this and have a look and so after our max pooling

01:02:45.860 | There's a number of different things we could do next and I'm going to show you a kind of

01:02:56.620 | Classic old style approach nowadays in fact what generally happens nowadays is we do a max pool where we kind of like max across the

01:03:04.860 | entire size right

01:03:06.860 | But on older architectures and also on all the structured data stuff we do

01:03:11.400 | We actually do something called a fully connected layer, and so here's a fully connected layer

01:03:17.100 | I'm going to take every single one of these activations, and I'm going to give every single one of them a weight

01:03:24.980 | Right and so then I'm going to take over here

01:03:28.900 | here is the sum product of every one of the activations by every one of the weights for both of the

01:03:38.340 | Two

01:03:41.580 | Levels of my three-dimensional tensor right and so this is called a fully connected layer notice. It's different to a convolution

01:03:48.820 | I'm not going through a few at a time

01:03:50.860 | Right, but I'm creating a really big weight matrix right so rather than having a couple of little three by three kernels

01:03:58.380 | My weight matrix is now as big as the entire input
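
A fully connected output of this kind is a single sum-product over the whole input. The sketch below assumes a 2 by 5 by 5 max-pooled tensor with random values; the sizes are illustrative, not those of the spreadsheet.

```python
import numpy as np

rng = np.random.default_rng(0)
pooled = rng.standard_normal((2, 5, 5))   # stand-in for the two-channel max-pooled activations
weights = rng.standard_normal((2, 5, 5))  # one weight per activation, same shape as the input

# Like Excel's SUMPRODUCT: multiply element-wise, then add everything up
fc_out = np.sum(pooled * weights)

# The same number written as a dot product of the flattened tensors
fc_out_flat = pooled.ravel() @ weights.ravel()
print(fc_out, fc_out_flat)
```

Unlike a convolution, no small window is reused across positions, so the number of weights grows with the size of the input.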

01:04:01.260 | And so as you can imagine

01:04:04.060 | Architectures that make heavy use of fully connected layers can have a lot of weights

01:04:11.860 | Which means they can have trouble with overfitting and they can also be slow and so you're going to see a lot

01:04:19.420 | An architecture called VGG because it was the first kind of successful deeper architecture

01:04:25.060 | It has up to 19 layers and VGG

01:04:27.660 | Actually contains a fully connected layer with 4,096 weights

01:04:33.020 | Connected to at a hidden layer with 4,000 sorry 4,096

01:04:38.060 | activations connected to a hidden layer with 4,096 activations, so you've got like 4,096 by

01:04:46.900 | 4,096 multiplied by remember multiplied by the number of kind of kernels that we've calculated

01:04:53.700 | so in VGG

01:04:56.540 | there's

01:04:58.940 | This I think it's like 300 million

01:05:01.260 | Weights of which something like 250 million of them are in these fully connected layers
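
A back-of-the-envelope count shows why these layers dominate. The layer sizes below are an assumption, taken from the commonly published VGG-16 layout (a 7 by 7 by 512 feature map feeding 4,096, then 4,096, then 1,000 units) rather than from the lecture, and biases are ignored.

```python
# Assumed VGG-16 style fully connected sizes (not stated in the lecture)
feature_map = 7 * 7 * 512       # activations coming out of the last pooling layer
hidden = 4096
n_classes = 1000

fc1_weights = feature_map * hidden   # every activation connected to every hidden unit
fc2_weights = hidden * hidden        # the 4,096 by 4,096 layer
fc3_weights = hidden * n_classes

fc_total = fc1_weights + fc2_weights + fc3_weights
print(f"{fc1_weights:,} + {fc2_weights:,} + {fc3_weights:,} = {fc_total:,}")
```

Under that assumed layout the first fully connected layer alone holds over 100 million weights, far more than a handful of three by three kernels.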

01:05:07.740 | So we'll learn later on in the course about how we can kind of avoid using these big fully connected layers and behind the scenes

01:05:15.860 | All the stuff that you've seen us using like res net and res next none of them use very large

01:05:21.620 | Fully connected layers. Yannet, you had a question

01:05:24.580 | So you tell us more about for example if we had like three channels of the input what would be the

01:05:35.740 | The shape yeah these filters right so that's a great question

01:05:41.500 | So if we had three channels of input it would look exactly like conv1 right conv1 kind of has two channels

01:05:49.740 | Right and so you can see with conv1. We had two channels so therefore our filters

01:05:55.820 | had to have like two channels per filter and so you could like

01:06:00.460 | Imagine that this input didn't exist you know and actually this was the input right so when you have a multi-channel input

01:06:08.140 | It just means that your filters look like this and so images are often full color

01:06:14.480 | They have three red green and blue sometimes. They also have an alpha channel

01:06:19.020 | So however many you have that's how many inputs you need and so something which I know

01:06:24.660 | Yannet was playing with recently was like using a full color image net model

01:06:30.540 | In medical imaging for something called bone age calculations

01:06:34.860 | Which has a single channel and so what she did was basically take the the input

01:06:40.940 | The single channel input and make three copies of it

01:06:44.820 | So you end up with basically like one two three versions of the same thing which is like

01:06:51.460 | It's kind of it's not ideal like it's kind of redundant information that we don't quite want

01:06:58.260 | But it does mean that then if you had a something that expected a three channel

01:07:04.180 | convolutional filter

01:07:05.460 | You can use it right and so at the moment. There's a Kaggle competition for iceberg detection

01:07:11.820 | using

01:07:13.820 | Some funky satellite specific data format that has two channels

01:07:17.980 | So here's how you could do that you could

01:07:21.220 | Either copy one of those two channels into the third channel

01:07:25.100 | Or I think what people on Kaggle are doing is to take the average of the two

01:07:30.420 | Again, it's not ideal, but it's a way that you can use pre-trained networks
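
The channel tricks mentioned above can be sketched with NumPy arrays in (channels, height, width) layout. The 8 by 8 size and random pixel values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-channel image: make three copies of it
gray = rng.random((1, 8, 8))
gray_3ch = np.repeat(gray, 3, axis=0)                 # shape (3, 8, 8)

# Two-channel image, option 1: copy one of the channels into the third slot
two = rng.random((2, 8, 8))
copy_3ch = np.concatenate([two, two[:1]], axis=0)     # shape (3, 8, 8)

# Option 2: use the average of the two channels as the third
mean_3ch = np.concatenate([two, two.mean(axis=0, keepdims=True)], axis=0)

print(gray_3ch.shape, copy_3ch.shape, mean_3ch.shape)
```

In every case the result has three channels, so it fits a network whose first-layer filters expect three-channel input.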

01:07:34.740 | Yeah, I've done a lot of

01:07:38.700 | fiddling around like that you can also actually I've actually done things where I wanted to use a

01:07:44.340 | Three channel image net network on four channel data. I had a satellite data where the fourth channel was near infrared

01:07:51.460 | And so basically I added an extra

01:07:57.380 | kind of

01:07:58.780 | Level to my convolutional kernels that were all zeros and so basically like started off by ignoring the near infrared band
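
Adding an all-zero level to each kernel can be sketched like this. The filter count (16) and the random "pretrained" values are assumptions for illustration; the point is only that the zero slice leaves the response unchanged at the start.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pretrained first-layer filters: (filters, channels, height, width)
rgb_kernels = rng.standard_normal((16, 3, 3, 3))

# Append a fourth, all-zero slice to every filter for the extra band
extra_slice = np.zeros((16, 1, 3, 3))
four_ch_kernels = np.concatenate([rgb_kernels, extra_slice], axis=1)   # (16, 4, 3, 3)

# On any four-channel patch, the response equals the three-channel response
patch = rng.standard_normal((4, 3, 3))
rgb_response = np.sum(rgb_kernels * patch[:3], axis=(1, 2, 3))
four_ch_response = np.sum(four_ch_kernels * patch, axis=(1, 2, 3))
print(np.allclose(rgb_response, four_ch_response))  # True
```

Training can then move the new weights away from zero if the fourth channel turns out to be useful.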

01:08:06.860 | And so what happens it basically and you'll see this next week is

01:08:11.380 | That rather than having these like carefully trained filters when you're actually training something from scratch

01:08:18.820 | We're actually going to start with random numbers

01:08:21.420 | That's actually what we do we actually start with random numbers

01:08:24.300 | And then we use this thing called stochastic gradient descent which we've kind of seen

01:08:28.140 | Conceptually to slightly improve those random numbers to make them less random and we basically do that again and again and again

01:08:35.460 | Okay, great. Let's take a seven-minute break, and we'll come back at 7:50

01:08:41.820 | All right, so what happens next so we've got as far as

01:08:53.260 | as

01:08:55.260 | Doing a

01:08:57.100 | Fully connected layer right so we had the results of our max pooling layer got fed to a fully connected layer

01:09:03.420 | And you might notice those of you that remember your linear algebra the fully connected layer is actually doing a classic

01:09:10.860 | traditional matrix product

01:09:13.260 | Okay, so it's basically just going through each pair in turn multiplying them together and then adding them up to do a matrix product

01:09:23.100 | now

01:09:25.100 | In practice if we want to calculate which one of the 10 digits we're looking at

01:09:36.900 | This single number we've calculated isn't enough

01:09:42.900 | We would actually calculate

01:09:46.100 | 10 numbers so what we would have is rather than just having

01:09:50.860 | one set of

01:09:52.860 | Fully connected weights like this and I say set because remember. There's like a whole

01:09:58.340 | 3d kind of tensor of them we would actually need

01:10:02.460 | 10 of those

01:10:05.300 | Right so you can see that these tensors start to get a little bit

01:10:08.520 | High dimensional right and so this is where my patience with doing it in Excel ran out

01:10:15.220 | But imagine that I had done this 10 times I could now have 10 different numbers all being calculated here

01:10:21.660 | Using exactly the same process right it just be 10 of these

01:10:25.220 | fully connected

01:10:28.620 | Two by n by n

01:10:32.540 | Arrays basically

01:10:36.220 | and

01:10:37.620 | So then we would have 10 numbers being spat out, so what happens next?
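
Having ten sets of fully connected weights amounts to one matrix product. A minimal sketch, again assuming a 2 by 5 by 5 input with random values:

```python
import numpy as np

rng = np.random.default_rng(0)
pooled = rng.standard_normal((2, 5, 5))        # stand-in for the max-pooled activations
weights = rng.standard_normal((10, 2, 5, 5))   # ten weight tensors, one per digit

# Flatten each weight tensor into a row; one matrix-vector product gives all ten numbers
W = weights.reshape(10, -1)      # shape (10, 50)
outputs = W @ pooled.ravel()     # shape (10,)
print(outputs.shape)
```

Each of the ten outputs is the same sum-product as before, just with its own set of weights.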

01:10:43.860 | So next up

01:10:45.580 | We can open up a different Excel

01:10:47.580 | worksheet

01:10:49.940 | entropy_example.xls that's got two

01:10:52.540 | different

01:10:55.020 | Worksheets one of them is called softmax

01:10:57.340 | And what happens here? I'm sorry I've changed domains rather than predicting whether it's a number from nought to nine

01:11:05.620 | I'm going to predict whether something is a cat, a dog, a plane, a fish or a building okay, so out of that fully connected layer

01:11:13.660 | We've got in this case. We'd have five numbers and notice at this point

01:11:18.340 | There's no ReLU okay, in the last layer there's no ReLU okay, so I can have negatives

01:11:24.980 | Okay, so I want to turn these five numbers

01:11:30.140 | Each into a probability I want to turn it into a probability from nought to one that it's a cat

01:11:37.380 | That it's a dog, that it's a plane, that it's a fish, that it's a building and

01:11:42.220 | I want those probabilities to have a couple of characteristics first is that each of them should be between zero and one and

01:11:47.860 | The second is that they together should add up to one right? It's definitely one of these five things

01:11:54.380 | Okay, so to do that we use a different kind of activation function

01:11:59.420 | What's an activation function an activation function is a function that is applied to activations?

01:12:07.380 | so for example max zero comma

01:12:11.740 | something is a

01:12:13.740 | function that I applied to an activation

01:12:16.940 | So an activation function always takes in

01:12:20.420 | One number and spits out one number so max of zero comma X

01:12:26.540 | Takes in a number X and spits out some different number value of X

01:12:30.900 | That's all an activation function is and if you remember back to that PowerPoint we saw in

01:12:41.420 | Lesson one

01:12:43.420 | Each of our layers

01:12:48.020 | Was just a linear

01:12:50.300 | Function and then after every layer

01:12:53.860 | We said we needed some non-linearity

01:12:56.860 | Right because if you stack a bunch of

01:12:59.980 | linear layers together

01:13:02.780 | Right then all you end up with is a linear layer

01:13:05.260 | right

01:13:07.260 | So if somebody's talking can you not it's slightly distracting. Thank you

01:13:11.300 | If you stack a number of linear

01:13:15.460 | Functions together you just end up with a linear function and nobody does any cool deep learning with just linear functions

01:13:22.540 | All right, but remember we also learned

01:13:24.700 | that by stacking linear functions

01:13:28.300 | With in between each one a non-linearity we could create like arbitrarily complex shapes

01:13:35.060 | and so the non-linearity that we're using after every hidden layer is a ReLU, rectified linear unit. A

01:13:42.100 | non-linearity is an activation function an

01:13:46.180 | Activation function is a non-linearity within deep learning. Obviously, there's lots of other non-linearities in the world, but in deep learning

01:13:55.140 | This is what we mean
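
The claim that stacked linear functions collapse into one linear function can be checked on a tiny example. The matrices below are made up; biases are left out to keep it short.

```python
import numpy as np

x = np.array([1.0, -1.0])
W1 = np.array([[2.0, 1.0],
               [1.0, 3.0]])
W2 = np.array([[1.0, 1.0]])

# Two linear layers in a row...
two_layers = W2 @ (W1 @ x)
# ...give exactly what one linear layer with the combined matrix gives
one_layer = (W2 @ W1) @ x

# Putting max(0, x) in between breaks that equivalence
with_relu = W2 @ np.maximum(0, W1 @ x)

print(two_layers, one_layer, with_relu)  # [-1.] [-1.] [1.]
```

Only the version with the non-linearity in the middle computes something that a single matrix could not.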

01:13:57.460 | So an activation function is any function that takes some activation in that's a single number and spits out some new activation

01:14:05.220 | like max of 0 comma

01:14:07.220 | So I'm now going to tell you about a different activation function. It's slightly more complicated than

01:14:12.940 | ReLU, but not too much. It's called softmax

01:14:16.660 | softmax only ever occurs in the final layer at the very end and the reason why is that softmax always spits out

01:14:25.140 | Numbers as an activation function that always spits out a number between 0 and 1 and it always spits out a bunch of numbers

01:14:33.300 | That add to one

01:14:34.540 | So a softmax gives us what we want, right?

01:14:37.500 | in theory

01:14:40.140 | This isn't strictly necessary right like we could ask our neural net to learn a set of

01:14:47.260 | kernels

01:14:49.220 | Which have you know, which which give probabilities that line up as closely as possible with what we want

01:14:54.980 | But in general with deep learning if you can construct your architecture so that the desired

01:15:01.300 | characteristics are as easy to express as possible

01:15:04.420 | You'll end up with better models like they'll learn more quickly with less parameters

01:15:09.460 | So in this case, we know that our probabilities should end up being between 0 and 1

01:15:14.940 | We know that they should end up adding to one

01:15:17.780 | So if we construct an activation function, which always has those features

01:15:22.140 | Then we're going to make our neural network do a better job. It's going to make it easier for it

01:15:27.820 | It doesn't have to learn to do those things because it all happened automatically

01:15:31.140 | Okay, so in order to make this work

01:15:35.580 | We first of all have to get rid of all of the negatives

01:15:39.340 | Right, like we can't have negative probabilities

01:15:42.700 | So to make things not be negative one way we could do it is just go e to the power of

01:15:47.940 | Right. So here you can see my first step is to go exp of

01:15:52.300 | the previous one right and I think I've mentioned this before but

01:15:57.740 | Of all the math that you just need to be super familiar with to do deep learning

01:16:02.500 | The one you really need is logarithms and exps right all of deep learning and all of machine learning

01:16:09.500 | They appear all the time, right? So

01:16:12.300 | For example

01:16:16.100 | You absolutely need to know that

01:16:20.500 | log of

01:16:23.580 | x times y

01:16:26.140 | equals log of x

01:16:28.360 | plus log of y

01:16:31.980 | Right and like not just know that that's a formula that exists but have a sense of like what does that mean?

01:16:38.180 | Why is that interesting? Oh, I can turn multiplications into additions. That could be really handy, right and therefore

01:16:46.140 | log of x over y

01:16:48.860 | equals log of x minus log of y

01:16:55.260 | Again, that's going to come in pretty handy, you know rather than dividing I can just subtract things, right?

01:17:00.220 | And also remember that if I've got log of x

01:17:04.140 | equals y

01:17:06.580 | Then that means e to the y

01:17:08.580 | Equals x in other words

01:17:11.780 | Log and e to the are the inverse of each other

01:17:18.980 | Okay again, you just you need to really really understand these things and like so if you if you haven't spent much time with logs

01:17:26.180 | and exps for a while

01:17:28.020 | You try plotting them in Excel or a little notebook have a sense of what shape they are how they combine together

01:17:34.420 | Just make sure you're really comfortable with them. So
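
The identities above are easy to check numerically in a notebook. This sketch uses the natural logarithm (`math.log`) and `math.exp`; the values 8 and 2 are arbitrary.

```python
import math

x, y = 8.0, 2.0

# Multiplication turns into addition
product_ok = math.isclose(math.log(x * y), math.log(x) + math.log(y))
# Division turns into subtraction
quotient_ok = math.isclose(math.log(x / y), math.log(x) - math.log(y))
# exp undoes log, and log undoes exp
inverse_ok = math.isclose(math.exp(math.log(x)), x) and math.isclose(math.log(math.exp(y)), y)

print(product_ok, quotient_ok, inverse_ok)  # True True True
```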

01:17:37.240 | We're using it here, right?

01:17:40.620 | We're using it here. So one of the things that we know is e to the power of something is positive

01:17:47.580 | Okay, so that's great. The other thing you'll notice about e to the power of something is because it's a power

01:17:52.860 | Numbers that are slightly bigger than other numbers like 4 is a little bit bigger than 2.8

01:17:59.260 | When you go e to the power of it really accentuates that difference

01:18:03.080 | Okay, so we're going to take advantage of both of these features for the purpose of deep learning. Okay, so we take our

01:18:09.180 | The results of this fully connected layer we go e to the power of for each of them

01:18:16.420 | and then we're going to

01:18:19.060 | And then we're going to add them up

01:18:25.260 | Okay, so here is the sum of e to the power of

01:18:29.540 | So then here

01:18:32.820 | We're going to take

01:18:34.460 | e to the power of divided by the sum of e to the power of so if you take

01:18:40.140 | All of these things divided by their sum then by definition all of those things must add up to 1 and

01:18:47.420 | Furthermore since we're dividing by their sum

01:18:52.060 | They must always vary between 0 and 1 because they're always positive

01:18:57.100 | Alright, and that's it. So that's what softmax is
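
Put together, the steps are: exponentiate each output, add the results up, and divide each one by that sum. A minimal sketch in plain Python, with made-up outputs for the five classes:

```python
import math

def softmax(activations):
    """exp of each activation divided by the sum of all the exps."""
    exps = [math.exp(a) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "plane", "fish", "building"]
outputs = [-1.2, 4.0, 2.8, 0.3, -3.5]   # made-up fully connected outputs, negatives allowed
probs = softmax(outputs)

for name, p in zip(classes, probs):
    print(f"{name:>8}: {p:.3f}")
print("sum:", sum(probs))
```

Here 4.0 is only a little bigger than 2.8, yet its share comes out e to the power of 1.2, roughly 3.3 times, larger. Library implementations usually subtract the largest activation before exponentiating to avoid overflow, which does not change the result.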

01:19:00.740 | Okay, so I've got this kind of

01:19:06.020 | Doing random numbers each time right and so you can see like as I look through

01:19:11.460 | My softmax generally has quite a few things that are so close to 0 that they round down to 0 and you know

01:19:19.140 | Maybe one thing that's nearly 1 right and the reason for that is what we just talked about that is with the exp

01:19:25.300 | Just having one number a bit bigger than the others tends to like push it out further, right?

01:19:31.780 | So even though my inputs here are random numbers between negative 5 and 5

01:19:36.420 | Right my outputs from the softmax don't really look that random at all in the sense that

01:19:42.460 | They tend to have one big number and a bunch of small numbers

01:19:47.260 | and

01:19:49.460 | Now that's what we want

01:19:51.460 | Right. We want to say like in terms of like is this a cat a dog a plane a fish or a building

01:19:55.860 | We really want it to say like it's it's that you know

01:19:59.260 | It's it's a dog or it's a plane not like I don't know

01:20:04.180 | Okay, so softmax has lots of these cool

01:20:07.900 | Properties right it's going to return a probability that adds up to one and it's going to tend to want to pick one thing

01:20:15.700 | particularly strongly

01:20:18.660 | Okay, so that's softmax. Yannet, could you pass actually bust me up?

01:20:26.420 | we how would we do something that as let's say you have an image and you want to kind of categorize as like cat and

01:20:33.460 | The dog or like as multiple things

01:20:35.460 | What what kind of function would we try to use?

01:20:38.540 | So it happens we're going to do that right now

01:20:41.300 | so

01:20:43.740 | So we have to think about why we might want to do that and so one reason we might want to do that is to do

01:20:50.060 | multi-label

01:20:51.460 | classification so we're looking now at lesson 2 image models and specifically we're going to take a look at the

01:20:57.780 | planet competition satellite imaging competition

01:21:01.260 | Now the satellite imaging competition has

01:21:05.620 | Some similarities to stuff we've seen before right so before we've seen a cat versus dog and these images are a cat or a dog

01:21:16.340 | They're not neither. They're not both right, but the satellite imaging competition

01:21:21.860 | Has data as images that look like this and in fact every single one of the images is classified by weather

01:21:29.600 | There's four kinds of weather one of which is haze and another of which is clear

01:21:34.940 | In addition to which there is a list of features that may be present including agriculture

01:21:41.860 | Which is like some some cleared area used for agriculture

01:21:45.980 | Primary which means primary rainforest and water which means a river or a creek so here is a clear day

01:21:53.700 | Satellite image showing some agriculture some primary rainforest and some water features

01:22:00.020 | And here's one which is in haze and is entirely primary rainforest

01:22:05.300 | So in this case we're going to want to be able to show

01:22:11.380 | We're going to be able to predict multiple things and so softmax wouldn't be good because softmax doesn't like

01:22:17.640 | Predicting multiple things and like I would definitely recommend

01:22:22.340 | Anthropomorphizing your activation functions right they have personalities

01:22:26.860 | Okay, and the personality of the softmax is it wants to pick a thing

01:22:31.780 | Okay, and people forget this all the time. I've seen many people even well regarded researchers in famous academic papers

01:22:41.480 | Using like softmax for multi-label classification it happens all the time, right?

01:22:47.480 | And it's kind of ridiculous because they're not

01:22:50.840 | understanding the personality of their activation function, so

01:22:56.200 | For multi-label classification where each sample can belong to one or more classes. We have to change a few things

01:23:03.980 | But here's the good news in fastai. We don't have to change anything

01:23:09.840 | Right so fastai will look at the labels in the CSV and if there is more than one label ever

01:23:17.840 | for any

01:23:20.720 | Item it will automatically switch into like multi-label mode

01:23:24.680 | So I'm going to show you how it works behind the scenes, but the good news is you don't actually have to care

01:23:30.180 | It happens anyway

01:23:32.560 | so if

01:23:34.560 | You have multi-label

01:23:39.280 | Images multi-label objects you obviously can't use the classic Keras style approach where things are in folders

01:23:47.120 | Because something can't conveniently be in multiple folders at the same time

01:23:52.380 | Right, so that's why you basically have to use the from_csv

01:23:59.200 | Approach right so if we look at

01:24:06.720 | an example

01:24:08.720 | Actually, I'll show you I tend to take you through it right so we can say okay

01:24:14.840 | This is the CSV file containing our labels

01:24:16.980 | This looks exactly the same as it did before but rather than side on it's top down

01:24:22.400 | And top down I've mentioned before that it can do

01:24:25.820 | Vertical flips it actually does more than that there's actually eight possible symmetries for a square

01:24:31.520 | Which is it can be rotated through 90 180 270 or 0 degrees

01:24:36.280 | And for each of those it can be flipped and if you think about it for a while you'll realize that that's a complete

01:24:42.600 | enumeration of everything that you can do

01:24:45.560 | In terms of symmetries to a square, so they're called it's called the dihedral group of eight

01:24:52.360 | So if you see in the code, there's actually a transform called dihedral. That's why it's called that
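
The eight symmetries can be enumerated directly: four rotations, each with or without a flip. This is a sketch of the idea, not the library's own transform; `dihedral` here is a hypothetical helper written for illustration.

```python
import numpy as np

def dihedral(img, k):
    """Symmetry k in 0..7: rotate by (k % 4) * 90 degrees, then flip left-right if k >= 4."""
    out = np.rot90(img, k % 4)
    return np.fliplr(out) if k >= 4 else out

img = np.arange(9).reshape(3, 3)   # a small "image" with no symmetry of its own
variants = [dihedral(img, k) for k in range(8)]
distinct = {v.tobytes() for v in variants}
print(len(distinct))  # 8
```

All eight results differ from one another, and any further rotation or flip of one of them lands back on another member of the set.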

01:24:57.960 | So this transforms will basically do the full set of eight symmetric

01:25:04.520 | dihedral

01:25:06.160 | rotations and flips

01:25:08.160 | Plus everything which we can do to dogs and cats you know small 10-degree rotations little bit of zooming

01:25:14.920 | a little bit of contrast and brightness adjustment

01:25:17.680 | So these images are of size 256 by 256

01:25:21.880 | So I just created a little function here to let me quickly grab you know a

01:25:26.760 | Data loader of any size so here's a 256 by 256

01:25:31.880 | Once you've got a data object inside it

01:25:36.000 | We've already seen that there's things called val_ds, test_ds, trn_ds

01:25:41.000 | They're things that you can just index into and grab a particular image so you just use square brackets zero

01:25:46.560 | You'll also see that all of those things have a DL. That's a data loader

01:25:50.920 | So DS is data set DL is data loader. These are concepts from pytorch

01:25:55.680 | So if you google pytorch data set or pytorch data loader

01:25:59.600 | You can basically see what it means, but the basic idea is a data set gives you a single image or a single

01:26:06.880 | object back a data loader gives you back a mini-batch and

01:26:10.720 | Specifically it gives you back a transformed mini-batch, so that's why when we create our

01:26:17.320 | data object we can pass in num_workers and

01:26:21.560 | Transforms like how many processes do you want to use what transforms?

01:26:26.080 | Do you want and so with a data loader you can't ask for an individual image?

01:26:31.320 | You can only get back a mini-batch and you can't get back a particular mini-batch

01:26:36.160 | You can only get back the next mini-batch so something we can do is loop through

01:26:41.560 | Grabbing a mini-batch at a time and so in Python

01:26:45.420 | The thing that does that is called a generator right or an iterator, they're slightly different versions

01:26:51.600 | Of the same thing so to turn a data loader into an iterator you use the standard Python function called iter

01:26:57.360 | That's a Python function just a regular part of the Python

01:27:00.860 | Basic language that returns an iterator and an iterator is something that you can pass to the standard

01:27:08.920 | Python

01:27:11.080 | Function or statement next and that just says give me another batch from this iterator

01:27:19.280 | So we're basically this is one of the things I really like about pytorch is it really leverages?

01:27:24.160 | modern pythons

01:27:26.760 | Kind of stuff you know in tensorflow they invent their whole new world of ways of doing things

01:27:33.560 | And so it's kind of more

01:27:36.680 | In a sense. It's more like cross-platform, but another sense like it's not a good fit to any platform

01:27:42.880 | So it's nice if you if you know Python well

01:27:47.880 | Pytorch comes very naturally if you don't know Python well pytorch is a good reason to learn Python well a

01:27:54.800 | PyTorch nn.Module, neural network module, is a standard Python class for example

01:28:02.240 | So any work you put into learning Python better will pay off with Pytorch so here. I am using standard

01:28:08.720 | Python

01:28:10.480 | Iterators and next to grab my next mini-batch

01:28:15.040 | From the validation sets data loader, and that's going to return two things

01:28:18.720 | It's going to return the images in the mini-batch and the labels in the mini-batch so standard Python approach

01:28:24.500 | I can pull them apart like so and so here is

01:28:28.760 | one mini-batch of labels
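
The `iter` and `next` pattern works on any Python iterable, so it can be tried without a real data loader. In this sketch `mini_batches` is a hypothetical generator standing in for one, and the "images" are just placeholder strings.

```python
def mini_batches(images, labels, batch_size):
    """Generator standing in for a data loader: yields (images, labels) one mini-batch at a time."""
    for start in range(0, len(images), batch_size):
        stop = start + batch_size
        yield images[start:stop], labels[start:stop]

images = [f"img_{i}" for i in range(10)]   # placeholder items, not real image tensors
labels = [i % 2 for i in range(10)]

batches = iter(mini_batches(images, labels, batch_size=4))   # iter() gives back an iterator
x, y = next(batches)      # first mini-batch, unpacked into images and labels
x2, y2 = next(batches)    # there is no way to ask for batch 2 directly, only "the next one"

print(x, y)
print(x2, y2)
```

A `for` loop over the same object calls `next` behind the scenes until the batches run out.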

01:28:31.520 | And so not surprisingly since I said that my batch size

01:28:42.480 | Actually, it's the batch size by default is 64 so I didn't pass in a batch size

01:28:48.240 | So just remember shift tab to see like what are the things you can pass and what are the defaults so by default?

01:28:54.920 | My batch size is 64, so I've got back something of size 64 by

01:28:59.720 | 17 so there are 17 of the possible

01:29:03.080 | classes right

01:29:05.960 | So let's take a look at the

01:29:09.560 | zeroth

01:29:11.880 | Set of labels so the zeroth images labels

01:29:14.480 | So I can zip again standard Python things it takes two lists and combines it so you get the zeroth thing from the first

01:29:22.840 | List the zeroth thing from the second list and the first thing from the first list the first thing from the second list and so

01:29:29.200 | Forth so I can zip them together and that way I can find out

01:29:32.640 | For the zeroth image in the validation set it's agriculture

01:29:37.720 | It's clear

01:29:40.040 | It's primary rainforest. It's slash and burn. It's water
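
Pairing class names with one row of labels via `zip` can be sketched as below. The class list is shortened and hypothetical (the real one has 17 entries), and the label row is made up to match the image described above.

```python
# Shortened, hypothetical class list and one image's label row
classes = ["agriculture", "clear", "cloudy", "primary", "slash_burn", "water"]
labels0 = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0]

pairs = list(zip(classes, labels0))                         # [("agriculture", 1.0), ("clear", 1.0), ...]
present = [name for name, flag in pairs if flag == 1.0]     # keep the classes that are switched on

print(present)  # ['agriculture', 'clear', 'primary', 'slash_burn', 'water']
```

More than one entry is 1, which is what makes this a multi-label target.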

01:29:44.560 | okay, so as you can see here, this is a

01:29:48.800 | multi label

01:29:51.320 | You see here's a way to do multi label classification

01:29:54.120 | So by the same token right if we go back to our single label classification

01:30:01.960 | It's a cat, dog, plane, fish or building

01:30:03.960 | Behind the scenes we haven't actually looked at it, but behind the scenes

01:30:09.080 | Fastai and Pytorch are turning our labels into something called one hot encoded

01:30:16.800 | Labels and so if it was actually a dog then the actual values

01:30:21.400 | Would be like that right so these are like the actuals

01:30:26.760 | Okay, so do you remember at the very end of Otavio's video?

01:30:31.800 | He showed how like the template had to match to one of the like five a b c d or e templates

01:30:37.640 | And so what it's actually doing is it's comparing

01:30:41.440 | When I said it's basically doing a dot product. It's actually a fully connected layer at the end right that calculates an

01:30:48.520 | output activation that goes through a softmax and

01:30:53.360 | Then the softmax is compared to the one hot encoded label right so if it was a dog there would be a one here

01:31:02.800 | And then we take the difference between the actuals and the softmax

01:31:07.520 | Activations and add up those differences to say how much error is there essentially?

01:31:13.280 | We're skipping over something called a loss function that we'll learn about next week, but essentially we're basically doing that

01:31:19.260 | Now if it's one hot encoded like if there's only one thing which has a one in it

01:31:27.620 | then actually storing it as 0 1 0 0 0 is

01:31:32.800 | terribly inefficient

01:31:34.720 | Right like we can basically say what are the index of each of these things?

01:31:38.860 | Right so we can say it's like 0 1 2 3 4 like so right and so rather than storing it as 0 1

01:31:47.400 | 0 0 0 we actually just store the index value

01:31:52.160 | Right so if you look at the the y values for the cats and dogs competition or the dog breeds competition

01:32:00.400 | You won't actually see a big lists of ones and zeros like this. You'll see a single integer

01:32:05.340 | Right, which is like. What's what class index is it right and

01:32:09.680 | internally

01:32:12.160 | Inside Pytorch it will actually turn that into a one hot encoded vector, but like you will literally never see it
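
The two ways of storing a single label, as an index or as a one hot vector, carry the same information. A minimal sketch, where `one_hot` is a hypothetical helper and not a library call:

```python
classes = ["cat", "dog", "plane", "fish", "building"]

def one_hot(index, n_classes):
    """Expand a class index into a list with a single 1."""
    return [1 if i == index else 0 for i in range(n_classes)]

label_index = classes.index("dog")                    # stored form: one integer
label_vector = one_hot(label_index, len(classes))     # expanded form

# Going back the other way just asks where the 1 is
recovered = label_vector.index(1)

print(label_index, label_vector, recovered)  # 1 [0, 1, 0, 0, 0] 1
```

Storing the index needs one number per item instead of one number per class.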

01:32:19.320 | Okay, and and Pytorch has different loss functions where you basically say this thing's one

01:32:26.600 | This thing is one hot encoded or this thing is not and it uses different loss functions

01:32:31.400 | That's all hidden by the fast AI library right so like you don't have to worry about it

01:32:37.260 | But it's but the the cool thing to realize is that this approach for multi-label encoding with these ones and zeros

01:32:45.920 | Behind the scenes the exact same thing happens for single-label classification

01:32:54.760 | Does it make sense to change the pickiness of the sigmoid of the softmax function by changing the base?

01:33:01.080 | No because when you change the

01:33:05.880 | more math

01:33:09.400 | Log base a of B

01:33:14.200 | equals

01:33:17.040 | log B over

01:33:19.040 | log A

01:33:22.080 | so changing the base is just a linear scaling and

01:33:25.200 | Linear scaling is something which the neural net can learn very easily
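
The answer can be checked numerically: using base 10 instead of e gives the same probabilities as keeping base e and multiplying every input by ln(10). `softmax_base` is a hypothetical helper and the activations are made up.

```python
import math

def softmax_base(activations, base):
    """Softmax with an arbitrary base in place of e."""
    powers = [base ** a for a in activations]
    total = sum(powers)
    return [p / total for p in powers]

acts = [1.0, 2.0, 0.5]
scale = math.log(10)   # ln(10), because 10**a == e**(a * ln 10)

base10 = softmax_base(acts, 10)
base_e_scaled = softmax_base([a * scale for a in acts], math.e)

print(base10)
print(base_e_scaled)
```

Since the layer feeding the softmax can learn to scale its own outputs, choosing a different base adds nothing the network cannot already express.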

01:33:31.240 | Good question

01:33:37.960 | Okay, so here is that image right here is the image with slash and burn water etc etc

01:33:46.380 | One of the things to notice here is like when I first displayed this image it was

01:33:51.560 | So washed out I really couldn't see it right but remember images

01:33:58.680 | Now you know we know images are just

01:34:01.480 | Matrices of numbers and so you can see here. I just said times 1.4

01:34:06.280 | Just to make it more visible right so like now that you're kind of it's the kind of thing

01:34:12.480 | I want you to get familiar with is the idea that this stuff you're dealing with they're just matrices of numbers

01:34:17.400 | Then you can fiddle around with them, so if you're looking at something like oh, it's a bit washed out

01:34:21.480 | You can just multiply it by something to

01:34:23.480 | Brighten it up a bit okay, so here. We can see I guess this is the slash and burn

01:34:28.760 | Here's the river. That's the water. Here's the primary rainforest. Maybe that's the agriculture and so forth okay, so
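
As a rough illustration of "images are just matrices of numbers", the sketch below brightens an array by multiplying it by 1.4, the factor mentioned above. The random array stands in for a loaded image, and the clipping step is an added assumption to keep values in a displayable range; neither is taken from the notebook.

```python
import numpy as np

# Stand-in for a loaded image: height x width x 3 channels, floats in [0, 1].
rng = np.random.default_rng(0)
img = rng.uniform(0.0, 0.5, size=(4, 4, 3))  # deliberately dark

# Multiply every pixel by 1.4, then clip so nothing exceeds the valid range.
brighter = np.clip(img * 1.4, 0.0, 1.0)

print(img.mean(), brighter.mean())
```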

01:34:36.640 | So you know with all that background how do we actually use this?

01:34:44.840 | Exactly the same way as everything we've done before right so you know size you know and and

01:34:49.760 | The interesting thing about playing around with this planet competition is that these images are not at all like image net and I

01:34:58.600 | Would guess that the vast majority of the stuff that the vast majority of you do

01:35:03.560 | involving convolutional neural nets

01:35:06.520 | Won't actually be anything like image net you know it'll be it'll be medical imaging

01:35:13.400 | Or it'll be like classifying different kinds of steel tube or figuring out whether a weld

01:35:19.520 | You know is going to break or not or or looking at satellite images, or you know whatever right so?

01:35:27.080 | It's it's good to experiment with stuff like this planet

01:35:32.640 | Competition to get a sense of kind of what you want to do and so you'll see here

01:35:37.480 | I start out by resizing my data to 64 by 64

01:35:42.880 | It starts out at 256 by 256 right now

01:35:46.320 | I wouldn't want to do this for the cats and dogs competition because in the cats and dogs competition

01:35:51.120 | We start with a pre trained image net network. It's it's nearly it's it's it starts off nearly perfect

01:35:57.440 | Right so if we resized everything to 64 by 64 and then retrained the whole set

01:36:03.840 | We basically destroy the weights that are already pre trained to be very good

01:36:09.360 | Remember image net most image net models are trained at either 224 by 224 or

01:36:14.400 | 299 by 299 right so if we like retrain them at 64 by 64. We're going to we're going to kill it on the other hand

01:36:22.840 | There's nothing in image net that looks anything like this

01:36:26.560 | You know there's no satellite images

01:36:29.200 | So the only useful bits of the image net network for us

01:36:35.600 | are kind of layers like this one

01:36:38.800 | You know finding edges and gradients and this one you know finding kind of textures and repeating patterns

01:36:45.160 | And maybe these ones of kind of finding more complex textures, but that's probably about it right so

01:36:54.280 | so in other words

01:36:56.680 | You know starting out by training very small images

01:37:00.560 | Works pretty well when you're using stuff like satellites

01:37:04.160 | So in this case I started right back at 64 by 64

01:37:07.080 | grabbed some data

01:37:09.960 | Built my model found out what learning rate to use interestingly it turned out to be quite high

01:37:17.520 | It seems that because like it's so unlike image net I

01:37:23.960 | Needed to do quite a bit more fitting with just that last layer before it started to flatten out

01:37:30.840 | Then I unfroze it and again. This is the difference to

01:37:34.380 | Image net like

01:37:37.760 | Data sets is my learning rate in the initial layer

01:37:41.760 | I set to divided by 9 the middle layers I set to divided by 3

01:37:45.640 | Whereas for stuff that's like image net I had a multiple of 10 for each of those

01:37:51.160 | You know again the idea being that the earlier layers

01:37:55.000 | Probably are not as close to what they need to be compared to the image net

01:38:01.000 | like data sets
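
The two learning-rate patterns described above can be written down as one rate per layer group. This is only a sketch of the arithmetic: the base rate `lr = 0.2` is a hypothetical value (the lecture only says the rate turned out "quite high"), and the variable names are made up for illustration.

```python
lr = 0.2  # hypothetical base learning rate for the newly added last layers

# One learning rate per layer group: [early layers, middle layers, last layers].
lrs_imagenet_like = [lr / 100, lr / 10, lr]  # a factor of 10 between groups
lrs_satellite = [lr / 9, lr / 3, lr]         # a factor of 3 between groups

print(lrs_imagenet_like)
print(lrs_satellite)
```

The smaller spread for the satellite data reflects the idea that the early layers need to move further from their pre-trained values than they would for ImageNet-like images.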

01:38:03.000 | So again unfreeze train for a while

01:38:06.160 | And you can kind of see here. You know there's cycle one. There's cycle two. There's cycle three

01:38:13.060 | And then I kind of increased double the size of my images

01:38:17.880 | Fit for a while

01:38:20.720 | Unfreeze fit for a while double the size of the images again fit for a while unfreeze fit for a while

01:38:26.640 | And then add TTA and so as I mentioned last time we looked at this this process ends up

01:38:31.920 | You know getting us about 30th place in this competition

01:38:35.180 | Which is really cool because people you know a lot of very very smart people

01:38:39.520 | Just a few months ago worked very very hard on this competition

01:38:43.000 | Couple of things people have asked about one is

01:38:49.160 | What does this data dot resize do

01:38:55.120 | so a

01:38:57.120 | Couple of different pieces here the first is that when we say

01:39:00.680 | Back here

01:39:04.960 | What transforms do we apply and here's our transforms we actually pass in a size right?

01:39:10.840 | So one of the things that that one of the things that data loader does is to resize the images like on demand every time

01:39:17.720 | It sees them

01:39:19.720 | This has got nothing to do with that dot resize method right so

01:39:24.900 | This is the thing that happens at the end like whatever's passed in, before our data

01:39:30.580 | Loader spits it out, it's going to resize it to this size

01:39:33.380 | If the initial input is like a thousand by a thousand

01:39:39.100 | Reading that JPEG and resizing it to 64 by 64

01:39:44.560 | Turns out to actually take more time than training the convnet does for each batch

01:39:50.940 | Right so basically all resize does is it says hey

01:39:55.820 | I'm not going to be using any images bigger than size times 1.3

01:40:00.260 | So just go through once and create new JPEGs of this size

01:40:05.900 | Right and and they're rectangular right so new JPEGs where the smallest

01:40:11.100 | Edge is of this size and again. It's like you never have to do this

01:40:16.180 | There's no reason to ever use it if you don't want to it's just a speed up

01:40:20.860 | okay, but if you've got really big images coming in it saves you a lot of time and you'll often see on like Kaggle kernels or

01:40:27.580 | forum posts or whatever people will have like

01:40:30.900 | Bash scripts stuff like that to like loop through and resize images to save time you never have to do that right just you can

01:40:39.500 | Just say dot resize and it'll just

01:40:41.980 | Create you know once off it'll go through and create that if it's already there

01:40:47.180 | It'll use the resized ones for you. Okay, so it's just a it's just a

01:40:51.820 | Speed up convenience function no more
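
To make the "smallest edge is of this size" idea concrete, here is a small sketch that only computes the dimensions a pre-resized copy would have. `resized_dims` is a hypothetical helper written for this note, not the library's implementation, and the example image sizes are made up.

```python
def resized_dims(width, height, target):
    """Scale (width, height) so the smaller edge equals `target`, keeping aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

# A large square image and a large rectangular one, both shrunk once up front.
square = resized_dims(1000, 1000, 64)
wide = resized_dims(1500, 1000, 64)
print(square, wide)
```

Doing this once and saving the smaller files means the data loader no longer has to decode and shrink a very large JPEG every time it sees the image.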

01:40:55.180 | Okay, so for those of you that are kind of past dog breeds

01:41:03.020 | I

01:41:05.620 | Would be looking at planet

01:41:07.620 | Next you know like try like play around

01:41:10.540 | with

01:41:13.460 | With trying to get a sense of like how can you get this as an accurate model?

01:41:17.380 | One thing to mention, and I'm not really going to go into it in detail

01:41:21.580 | It's nothing to do with deep learning particularly is that I'm using a different metric. I didn't use metrics equals accuracy

01:41:28.140 | But I said metrics equals f2

01:41:30.820 | Just remember from last week that confusion matrix that like two by two you know correct incorrect for each of dogs and cats

01:41:43.180 | There's a lot of different ways you could turn that confusion matrix into a score

01:41:49.100 | You know do you care more about false negatives, or do you care more about false positives, and how do you weight them?

01:41:54.780 | And how do you combine them together right?

01:41:56.780 | There's a base. There's basically a function called f beta

01:42:01.300 | Where the beta says how much do you weight false negatives versus false positives and so f2

01:42:08.540 | Is f beta with beta equals 2 and it's basically a particular way of weighting false negatives and false positives

01:42:15.620 | And the reason we use it is because Kaggle told us that planet who were running this competition

01:42:20.300 | Wanted to use this particular

01:42:23.100 | f-beta metric

01:42:25.540 | The important thing for you to know is that you can create

01:42:30.060 | Custom metrics so in this case you can see here

01:42:32.820 | It says from planet import f2 and really I've got this here so that you can see how to do it

01:42:38.260 | Right so if you look inside

01:42:40.260 | courses/dl1

01:42:45.220 | You can see there's something called planet dot py

01:42:49.180 | Right and so if I look at planet dot py

01:42:52.820 | you'll see there's a

01:42:55.540 | function there called f2

01:42:57.980 | right and so f2

01:43:00.980 | simply calls f beta score from

01:43:04.920 | scikit

01:43:07.100 | Or scipy, I can't remember where it came from

01:43:09.220 | And does a couple of little tweaks that aren't particularly important

01:43:13.900 | But the important thing is like you can write any metric you like right as long as it takes in

01:43:21.620 | set of predictions and

01:43:24.380 | a set of targets

01:43:26.180 | They're both going to be numpy arrays one-dimensional numpy arrays, and then you return back a number

01:43:32.380 | Okay, and so as long as you create a function that takes two vectors and returns a number

01:43:37.940 | You can call it as a metric and so then when we said

01:43:42.220 | Learn metrics equals and then passed in that array which just contains a single function f2

01:43:55.980 | Then it's just going to be printed out

01:43:58.260 | After every epoch for you, okay, so in general like the the fast AI library

01:44:04.020 | Everything is customizable so kind of the idea is that everything is

01:44:09.540 | Everything is

01:44:13.940 | Kind of gives you what you might want by default, but also everything can be changed as well
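
The "takes two vectors and returns a number" contract for a metric can be sketched in plain Python. This is not the `f2` from planet.py (which, per the lecture, wraps an existing f-beta score function and adds some tweaks); it is a from-scratch illustration using the standard F-beta formula, and it assumes the predictions have already been thresholded to 0 or 1.

```python
def f_beta(preds, targs, beta=2.0):
    """F-beta score for two equal-length sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(preds, targs) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, targs) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, targs) if p == 0 and t == 1)
    if tp == 0:
        return 0.0  # no true positives, so the score is defined as zero here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def f2(preds, targs):
    """Metric-shaped function: two vectors in, one number out."""
    return f_beta(preds, targs, beta=2.0)

# Predicting everything positive: perfect recall, poor precision.
preds = [1, 1, 1, 1]
targs = [1, 0, 0, 0]
print(f2(preds, targs), f_beta(preds, targs, beta=1.0))
```

With beta equal to 2 the score weights recall more heavily than precision, so false negatives cost more than false positives.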

01:44:21.260 | Yes, you know

01:44:24.900 | We have a little bit of confusion about the difference between

01:44:27.780 | multi label and

01:44:30.940 | Just single label. Uh-huh. Do you by any chance have an example in which you compute?

01:44:35.580 | similarly to the example of the

01:44:38.180 | They just show us. Oh, I didn't get to that activation function. Yeah, so

01:44:43.700 | So I'm so sorry. I said I'd do that and then I didn't so the activation the output activation function for a single label

01:44:53.100 | Classification is softmax for all the reasons that we talked about

01:44:56.380 | but if we were trying to predict something that was like

01:45:00.380 | 00110

01:45:03.700 | Then softmax would be a terrible choice because it's very hard to come up with something where both of these are high

01:45:09.860 | In fact, it's impossible because they have to add up to one. So the closest they could be would be 0.5

01:45:14.940 | so for multi label classification

01:45:18.920 | activation function is called

01:45:22.260 | Sigmoid okay, and again the fast AI library does this automatically for you if it notices you have a multi label

01:45:30.100 | Problem and it does that by checking your data set to see if anything has more than one label applied to it

01:45:36.700 | and so sigmoid is a function which is equal to

01:45:41.740 | It's basically the same thing

01:45:44.900 | Except we never add up

01:45:48.660 | All of these exps but instead we just take this exp and we say it's just equal to it

01:45:54.460 | divided by

01:45:57.540 | 1 plus

01:45:59.540 | It

01:46:02.260 | And so the nice thing about that is that now like multiple things can be high at once

01:46:12.020 | Right and so generally then if something is less than zero its sigmoid is going to be less than 0.5

01:46:20.300 | If it's greater than 0 its sigmoid is going to be greater than 0.5
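
A small sketch of the contrast described above, using only the standard library. The logits are made-up numbers chosen so that the third and fourth entries should both be "on", matching the 0 0 1 1 0 target mentioned earlier; the function names are illustrative.

```python
import math

def softmax(xs):
    """Each exp divided by the sum of all exps, so the outputs add up to one."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Each exp divided by one plus itself, independently of the other outputs."""
    e = math.exp(x)
    return e / (1 + e)

logits = [-2.0, -1.0, 3.0, 3.0, -2.0]  # made-up activations for target 0 0 1 1 0

soft = softmax(logits)
sig = [sigmoid(x) for x in logits]

print([round(p, 3) for p in soft])
print([round(p, 3) for p in sig])
```

Softmax has to split the probability between the two high entries, so neither can reach 0.5, while sigmoid lets both be close to one at the same time.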

01:46:24.500 | And so the important thing to know about a sigmoid function is that its shape

01:46:30.760 | is

01:46:36.420 | Something which asymptotes the top to one and asymptotes. Oh, I drew that

01:46:42.660 | Asymptotes at the bottom

01:46:48.300 | To zero and so therefore it's a good thing to model a probability with

01:46:54.100 | Anybody who has done any?

01:46:57.380 | logistic regression

01:47:00.660 | Will be familiar with this is what we do in logistic regression

01:47:04.420 | So it kind of appears everywhere in machine learning, and you'll see that kind of a sigmoid and a softmax. They're very close

01:47:10.820 | to each other

01:47:13.500 | Conceptually, but this is what we want as our activation function for multi label

01:47:18.420 | And this is what we want for single label and again fast AI does it all for you. There was a question over here. Yes

01:47:25.200 | I

01:47:31.340 | have a question about

01:47:33.140 | The initial training that you do if I understand correctly you have we have frozen the

01:47:38.580 | The pre-trained model and you only did initially try to train the last

01:47:45.700 | Layer, right? Right

01:47:48.860 | But on the other hand we said that only the initial layer

01:47:53.500 | So probably the first layer is like important to us and the other two

01:47:59.340 | Are more like features that are image net related and we didn't apply in this case. Well, it's that they

01:48:04.540 | The layers are very important

01:48:07.900 | But the pre-trained weights in them aren't so it's the later layers that we really want to train the most

01:48:15.500 | so earlier layers

01:48:17.700 | Are likely to be like already

01:48:19.740 | Closer to what we want

01:48:22.620 | Okay, so you start with the latest one and then you go right so if you go back to our quick dogs and cats

01:48:28.260 | right

01:48:30.140 | when we create a model from a pre trained model it returns something where all of the convolutional layers are frozen and

01:48:38.020 | some randomly set

01:48:40.900 | Fully connected layers we add to the end

01:48:43.780 | Unfrozen and so when we go fit

01:48:47.100 | At first it just trains

01:48:49.980 | The randomly initialized fully connected layers, right?

01:48:56.620 | And if something is like really close to image net that's often all we need

01:49:02.220 | Because the early layers are already good at finding edges gradients repeating patterns for

01:49:10.020 | ears and dogs heads

01:49:12.820 | So then when we unfreeze

01:49:17.180 | We set the learning rates for the early layers to be really low

01:49:22.020 | Because we don't want to change them much whereas the later ones we set them to be higher

01:49:26.940 | Whereas for satellite data

01:49:29.740 | right

01:49:31.860 | This is no longer true. You know the early layers are still like

01:49:35.420 | Better than the later layers, but we still probably need to change them quite a bit

01:49:41.380 | So that's right. This learning rate is nine times smaller than the final learning rate rather than a thousand times smaller

01:49:50.980 | than the final learning rate

01:49:52.980 | Okay, so you play with with the weights of the layers with the learning rates. Yeah, normally

01:49:58.780 | Most of the stuff you see online if they talk about this at all, they'll talk about unfreezing

01:50:05.000 | different subsets of layers

01:50:07.620 | And indeed we do unfreeze our randomly generated ones

01:50:11.780 | But what I found is although in the fast AI library you can type learn dot freeze to and just freeze a subset of layers

01:50:20.140 | this approach of using differential learning rates seems to be like

01:50:23.780 | More flexible to the point that I never find myself unfreezing subsets of layers

01:50:29.700 | So but but I don't understand is that I would expect you to start with that

01:50:33.540 | with a differential the different

01:50:36.620 | Learning rates rather than trying to learn the last layer. So the reason okay, so you could skip

01:50:44.500 | this

01:50:47.180 | Training just the last layers and just go straight to differential learning rates

01:50:51.060 | But you probably don't want to the reason you probably don't want to is that there's a difference the convolutional layers all contain

01:50:58.980 | Pre trained weights, so they're like they're not random for things that are close to image net

01:51:05.260 | They're actually really good for things that are not close to image net. They're better than nothing

01:51:09.980 | All of our fully connected layers, however are totally random

01:51:16.260 | So therefore you would always want to make the fully connected weights better than random by training them a bit first

01:51:23.020 | Because otherwise if you go straight to unfreeze

01:51:26.460 | Then you're actually going to be like fiddling around with those early layer weights when the later ones are still random

01:51:35.060 | That's probably not what you want. I

01:51:37.060 | Think there's another question here

01:51:39.060 | So when we unfreeze

01:51:43.420 | What are the things we're trying to change there?

01:51:48.140 | will it change the

01:51:51.300 | kernels themselves

01:51:53.460 | That that's always what SGD does. Yeah, so the only thing

01:51:58.340 | what training means is

01:52:00.980 | setting these numbers

01:52:04.380 | right and

01:52:07.300 | These numbers and

01:52:10.700 | These numbers the weights

01:52:16.460 | so the weights are the weights of the fully connected layers and

01:52:20.820 | The weights in those kernels in the convolutions. So that's what training means

01:52:26.140 | It's and we'll learn about how to do it with SGD. But training literally is setting those numbers

01:52:32.500 | these numbers on the other hand

01:52:35.940 | Activations they're calculated. They're calculated from the weights and the previous layers

01:52:42.660 | activations or inputs

01:52:45.780 | I have a question. So can you lift that up higher and speak loudly? So in your example of training the satellite image

01:52:52.980 | Example so you start with very small size, 64 by 64

01:52:57.340 | Yeah, so does it literally mean that you know the model takes a small area from the entire image?

01:53:03.300 | That is 64 by 64

01:53:05.420 | So how do we get that 64 by 64 depends on?

01:53:09.700 | the transforms

01:53:12.260 | by default our transform takes the smallest edge and

01:53:18.340 | Resize zooms the whole thing out

01:53:21.860 | Resamples it so the smallest edge is the size 64 and then it takes a center crop

01:53:27.260 | of that, okay, although

01:53:32.020 | When we're using data augmentation it actually takes a randomly chosen

01:53:36.700 | crop
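
The center crop versus random crop distinction can be sketched as two functions that only compute crop coordinates. Both helpers and the 96 by 64 example are hypothetical, written for this note rather than taken from the library's transforms.

```python
import random

def center_crop_box(width, height, size):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size

def random_crop_box(width, height, size, rng):
    """Return a size x size crop at a random position inside the image."""
    left = rng.randint(0, width - size)  # randint includes both endpoints
    top = rng.randint(0, height - size)
    return left, top, left + size, top + size

# An image already resampled so its smallest edge is 64 pixels.
center = center_crop_box(96, 64, 64)
rand = random_crop_box(96, 64, 64, random.Random(0))
print(center, rand)
```

A center crop always drops the same border regions, whereas random crops taken over many epochs will eventually cover content near the edges.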

01:53:39.460 | In the case where the image has multiple objects like in this case

01:53:43.580 | Like would it be possible like you would just lose the other things that they try to forget?

01:53:49.740 | Yeah, which is why data augmentation is important. So by and particularly their

01:53:54.620 | Test time augmentation is going to be particularly important because you would you wouldn't want to you know

01:54:00.620 | That there may be an artisanal mine out in the corner, which if you take a center crop you you don't see

01:54:07.220 | So data augmentation becomes very important. Yeah

01:54:10.740 | Sure

01:54:14.820 | So when we talk about metrics that we use here, say accuracy or F2

01:54:18.820 | That's not really what the model tries to that's a great point. That's not the loss function

01:54:24.620 | Yeah, right. The loss function is something we'll be learning about next week

01:54:29.020 | And it uses a cross entropy or otherwise known as like negative log likelihood

01:54:34.500 | The metric is just the thing that's printed so we can see what's going on

01:54:39.900 | Just next to that

01:54:43.460 | So in the context of multi-class

01:54:45.940 | Modeling, does the training data also have to be multi-class?

01:54:50.460 | So can I train on just like images of pure cats and pure dogs and expect it at prediction time to?

01:54:56.260 | Predict if I give it a picture having both a cat and a dog

01:54:58.880 | I've never tried that and I've never seen an example of something that needed it. I

01:55:08.140 | Guess conceptually there's no reason it wouldn't work

01:55:12.300 | But it's kind of out there

01:55:15.740 | And you still use a sigmoid activation? You would have to make sure you're using a sigmoid loss function

01:55:20.340 | So in this case fast AI's default would not work because by default fast AI would say your training data

01:55:25.700 | Never has both a cat and a dog, so you would have to override the loss function

01:55:29.260 | When you use the differential learning rates

01:55:38.080 | Those three learning rates do they just kind of spread evenly across the layers?

01:55:43.420 | Yeah, we'll talk more about this later in the course, but in the fast AI library

01:55:49.540 | There's a concept of layer groups so in something like a resnet 50

01:55:54.580 | You know there's hundreds of layers, and I figured you don't want to write down hundreds of learning rates, so I've

01:56:00.940 | basically decided for you how to split them and

01:56:04.420 | The last one always refers just to the fully connected layers that we've randomly initialized and added to the end

01:56:12.780 | And then these ones are split generally about halfway through

01:56:18.260 | Basically, I've tried to make it so that

01:56:20.260 | These you know these ones are kind of the ones which you hardly want to change at all

01:56:24.500 | And these are the ones you might want to change a little bit, and I don't think we'll cover it in the course

01:56:29.420 | But if you're interested we can talk about in the forum

01:56:31.260 | There are ways you can override this behavior to define your own layer groups if you want to

01:56:35.640 | And is there any way to visualize the model easily or like dump dump the layers of the model?

01:56:41.820 | Yeah, absolutely

01:56:43.900 | You can

01:56:45.900 | Make sure we've got one here

01:56:48.420 | Okay

01:56:50.420 | So if you just type learn it doesn't tell you much at all, but what you can do is go

01:56:56.800 | learn summary and

01:56:59.980 | That spits out

01:57:03.900 | basically

01:57:05.580 | everything

01:57:07.020 | There's all the layers and so you can see in this case

01:57:09.980 | These are the names I mentioned how they all got names right so the first layer is called Conv2d-1

01:57:18.100 | And it's going to take as input

01:57:20.100 | This is useful to actually look at it's taking 64 by 64 images. Which is what we told it

01:57:27.060 | We're going to transform things to. This is three channels. Pytorch

01:57:30.700 | Like most things have channels at the end would say 64 by 64 by 3. Pytorch moves it to the front

01:57:38.700 | So it's 3 by 64 by 64

01:57:41.300 | That's because it turns out that some of the GPU computations run faster when it's in that order

01:57:47.260 | Okay, but that happens all behind the scenes automatically so part of that transformation stuff

01:57:52.780 | That's kind of all done automatically is to do that

01:57:55.340 | -1

01:57:58.580 | Means however big the batch size is

01:58:01.540 | In Keras they use a special number none

01:58:07.100 | In Pytorch they use -1 so this is a four-dimensional mini batch

01:58:11.860 | the number of

01:58:14.380 | Images in the mini batch is dynamic you can change that the number of channels is 3

01:58:20.660 | The size of the images is 64 by 64. Okay, and so then you can basically see that this particular convolutional kernel

01:58:28.740 | Apparently has 64 kernels in it

01:58:32.220 | And it's also halving we haven't talked about this but convolutions can have something called a stride

01:58:37.100 | That, like max pooling, changes the size. So it's returning a 32 by 32 by 64 kernel

01:58:44.780 | Tensor and so on and so forth
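
The shapes read off the summary above can be reproduced with the standard convolution output-size formula. This is a sketch under assumptions: the kernel size 7 and padding 3 are the usual values for the first layer of a ResNet and are not stated in the lecture, which only says the layer has 64 kernels and a stride that halves the size.

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

batch = -1                      # placeholder meaning "whatever the batch size is"
in_shape = (batch, 3, 64, 64)   # channels-first: 3 channels, 64 x 64 pixels

# Assumed first-layer settings: 64 kernels, 7x7, stride 2, padding 3.
side = conv_out_size(64, kernel=7, stride=2, padding=3)
out_shape = (batch, 64, side, side)

print(in_shape, out_shape)
```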

01:58:48.140 | So that's summary and we'll learn all about what that's doing in detail in the second half of the course

01:58:56.020 | one more I

01:58:59.100 | Collected my own data set and I tried to use the and it's a really small data set these currencies from

01:59:04.740 | images and I tried to do a

01:59:07.780 | Learning rate find and then the plot and it just it gave me some numbers which I didn't understand on the learning rate find

01:59:14.980 | Yeah, and then the plot was empty. So yeah, I mean let's let's talk about that on the forum

01:59:19.900 | but basically

01:59:21.020 | The learning rate finder is going to go through a mini batch at a time if you've got a tiny data set

01:59:26.460 | There's just not enough mini batches. So the trick is to make your batch size really small

01:59:31.740 | Like try making it like four or eight or something
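
A quick way to see why a tiny data set starves the learning rate finder is to count mini batches. The data set size of 100 and the batch size of 64 below are hypothetical numbers picked for illustration.

```python
import math

def n_batches(n_items, batch_size):
    """Number of mini batches needed to go through the data once."""
    return math.ceil(n_items / batch_size)

n_images = 100  # hypothetical tiny data set

for bs in (64, 8, 4):
    print(bs, n_batches(n_images, bs))
```

With only a couple of mini batches there are only a couple of points to plot, whereas a batch size of 4 or 8 gives the finder many more steps to try increasing learning rates on.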

01:59:34.460 | Okay, they were great questions nothing online to add in it

01:59:41.900 | They were great questions we've got a little bit past where I hope to but let's let's quickly talk about

01:59:49.060 | Structured data so we can start thinking about it for next week

01:59:55.300 | so

01:59:57.300 | This is really weird right to me. There's basically two types of data set we use in machine learning. There's a type of data

02:00:04.780 | like audio

02:00:07.340 | images

02:00:09.740 | natural language text

02:00:11.740 | where all of the all of the things inside an object like all of the pixels inside an image are

02:00:18.180 | All the same kind of thing. They're all pixels or they're all

02:00:23.780 | amplitudes of a waveform or

02:00:25.780 | They're all words

02:00:28.180 | I call this kind of data unstructured and then there's data sets like a

02:00:34.140 | profit-and-loss statement or

02:00:37.060 | the information about a Facebook user

02:00:39.660 | Where each column is like?

02:00:42.460 | Structurally quite different, you know one thing is representing like how many page views last month another one is their sex

02:00:49.620 | Another one is what zip code they're in and I call this structured data

02:00:53.860 | That particular terminology is not

02:00:57.180 | Unusual like lots of people use that terminology, but lots of people don't there's no

02:01:02.980 | Particularly agreed upon

02:01:05.980 | terminology so when I say structured data

02:01:09.020 | I'm referring to kind of columnar data as you might find in a database or a spreadsheet where different columns

02:01:16.700 | represent different kinds of things and each row represents an observation and

02:01:21.740 | So structured data is probably what most of you

02:01:26.180 | Are analyzing most of the time

02:01:30.180 | Funnily enough you know academics in the deep learning world don't really give a shit about structured data

02:01:38.340 | Because it's pretty hard to get published in fancy conference proceedings

02:01:43.060 | If you're like if you've got a better logistics model, you know, it's the thing that makes the world go round

02:01:48.620 | It's a thing that makes everybody you know money and efficiency and makes stuff work

02:01:54.020 | But it's largely ignored sadly

02:01:57.600 | So we're not going to ignore it because we're practical deep learning

02:02:02.140 | And Kaggle doesn't ignore it either because people put prize money up on Kaggle to solve real-world problems

02:02:08.940 | So there are some great Kaggle competitions we can look at there's one running right now

02:02:13.400 | Which is the grocery sales forecasting competition for Ecuador's largest chain?

02:02:19.080 | It's always a little I've got to be a little careful about how much I show you about currently running competitions because I don't want

02:02:28.660 | To you know help you cheat, but it so happens. There was a competition a year or two ago

02:02:34.620 | For one of Germany's largest grocery chains, which is almost identical. So I'm going to show you how to do that

02:02:40.640 | So that was called the Rossman stores data

02:02:46.780 | and

02:02:48.740 | So I would suggest you know, first of all try practicing what we're learning on Rossman, right?

02:02:54.860 | but then see if you can get it working on on grocery because currently

02:03:00.340 | On the leaderboard no one seems to basically know what they're doing in the groceries competition. If you look at the leaderboard

02:03:05.820 | The

02:03:09.540 | See here

02:03:11.220 | These ones around 0.529, 0.530 are people that are literally finding like group averages and submitting those

02:03:17.940 | I know because those are the kernels that they're using so, you know the basically the people around 20th place

02:03:23.840 | Aren't actually doing any machine learning

02:03:28.500 | So yeah, let's see if we can improve things

02:03:30.500 | So you'll see there's a lesson three Rossman

02:03:35.300 | Notebook. Make sure you git pull. Okay, in fact, you know just a reminder, you know before you start working

02:03:41.220 | Git pull in your fast AI repo and from time to time

02:03:45.500 | Conda env update. For you guys doing the in-person course, the conda env update

02:03:51.540 | You should do it more often because we're kind of changing things a little bit folks in the MOOC

02:03:57.100 | You know more like once a month should be fine

02:03:59.820 | So anyway, I just I just changed this a little bit so make sure you git pull to get lesson three Rossman

02:04:06.500 | And there's a couple of new libraries here one is fast AI dot structured

02:04:12.500 | Fast AI dot structured contains stuff, which is actually not at all Pytorch specific

02:04:18.940 | And we actually use that in the machine learning course as well for doing random forests with no Pytorch at all

02:04:24.620 | I mentioned that because you can use that particular library without any of the other parts of fast AI

02:04:31.500 | So that can be handy

02:04:34.300 | And then we're also going to use fast AI dot column data

02:04:37.460 | Which is basically some stuff that allows us to do fast AI Pytorch stuff with

02:04:43.440 | columnar structured data

02:04:46.220 | For structured data we need to use pandas a lot

02:04:52.060 | Anybody who's used R data frames will be very familiar with pandas. Pandas is basically an attempt to kind of replicate

02:04:58.300 | data frames in Python

02:05:01.140 | You know and a bit more

02:05:04.080 | If you're not entirely familiar with pandas

02:05:09.100 | There's a great book

02:05:12.340 | Which I think I might have mentioned before

02:05:20.580 | Python for data analysis by Wes McKinney. There's a new edition that just came out a couple of weeks ago

02:05:26.180 | Obviously being by the pandas author its coverage of pandas is excellent, but it also covers

02:05:33.340 | numpy

02:05:35.460 | scipy

02:05:36.660 | matplotlib

02:05:37.740 | scikit learn

02:05:39.460 | IPython and Jupyter really well, okay, and so I'm kind of going to assume

02:05:46.500 | That you know your way around these libraries to some extent

02:05:51.020 | Also, there was the workshop we did before this started and there's a video of that online where we kind of have a brief mention

02:05:58.100 | of all of those tools

02:06:00.340 | Structured data is generally shared as CSV files. It was no different in this competition

02:06:07.460 | As you'll see, there's a hyperlink to the Rossman data set here

02:06:11.860 | All right now if you look at the bottom of my screen you'll see this goes to files.fast.ai

02:06:17.060 | Because this doesn't require any login or anything to grab this data set. It's as simple as right clicking

02:06:22.740 | copy link address

02:06:25.540 | Head over to wherever you want it and just type

02:06:29.680 | wget and

02:06:33.180 | The URL okay, so that's because you know, it's it's not behind a login or anything

02:06:42.060 | so you can grab the grab it from there and

02:06:46.100 | You can always read a CSV file with just pandas dot read CSV now in this particular case. There's a lot of

02:06:53.460 | Pre-processing that we do and what I've actually done here is I've

02:06:59.300 | I've actually

02:07:02.180 | Stolen the entire pipeline from the third-place winner of Rossmann. Okay, so they made all their data

02:07:09.980 | They're really great. You know, they've got a GitHub available with everything that we need and I've ported it all across and simplified it and

02:07:16.860 | Tried to make it pretty easy to understand
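
As a minimal sketch of the CSV-reading step mentioned above: the column names and values below are a small illustrative sample (an assumption, not the actual competition files), and an in-memory string stands in for a downloaded file so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for a downloaded CSV file.
csv_text = """Store,Date,Sales,Customers,Promo,SchoolHoliday
1,2015-07-31,5263,555,1,1
2,2015-07-31,6064,625,1,1
1,2015-07-30,5020,546,1,1
"""

# read_csv accepts a file path or any file-like object;
# parse_dates converts the Date column to datetime values on load.
train = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(train.dtypes)
print(train.head())
```

With a real file you would pass its path instead of the `StringIO` object.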

02:07:19.260 | this

02:07:21.900 | Course is about deep learning not about data processing. So I'm not going to go through it

02:07:26.800 | But we will be going through it in the machine learning course in some detail because feature engineering is really important

02:07:33.820 | So if you're interested

02:07:35.820 | You know check out the machine learning course

02:07:38.980 | for that I

02:07:40.980 | will however show you

02:07:42.980 | Kind of what it looks like. So once we read the CSVs in

02:07:46.580 | You can see basically what's there so the key one is

02:07:51.500 | For a particular store

02:07:57.380 | We have the

02:08:02.900 | We have the date and we have the sales

02:08:09.620 | For that particular store. We know whether that

02:08:13.060 | Thing is on promo or not

02:08:16.100 | We know the number of customers that that particular store had

02:08:20.900 | We know whether that date was a school holiday

02:08:24.540 | We also know

02:08:34.260 | What kind of store it is so like this is pretty common right you'll often get

02:08:38.720 | Data sets where there's some column with like just some kind of code. We don't really know what the code means

02:08:44.460 | Most of the time I find it doesn't matter what it means

02:08:48.200 | Like normally you get given a data dictionary when you start on a project and obviously if you're working on an internal project

02:08:54.540 | You can ask the people at your company what this column means. I

02:08:57.780 | Kind of stay away from learning too much about it. I prefer to like see what the data says

02:09:04.020 | first

02:09:06.020 | There's something about what kind of product are we selling in this particular row?

02:09:10.940 | And then there's information about like how far away is the nearest competitor how long have they been open for

02:09:21.140 | How long has the promo been on for

02:09:30.500 | Each store we can find out what state it's in for each state we can find out the name of the state

02:09:35.980 | this is in Germany and

02:09:37.980 | Interestingly they were allowed to download any external data

02:09:42.460 | They wanted in this competition

02:09:43.740 | It's very common as long as you share it with everybody else and so some folks tried downloading data from

02:09:50.340 | Google Trends

02:09:53.180 | I'm not sure exactly what it was that they were checking the trend of but we have this information from Google Trends

02:09:59.940 | Somebody downloaded the weather for every day in Germany for every state

02:10:03.780 | And yeah, that's about it right so

02:10:12.580 | You can get a data frame summary with pandas which kind of lets you see how many

02:10:22.520 | Observations and means and standard deviations

02:10:25.180 | Again, I don't do a hell of a lot with that early on

02:10:29.260 | But it's nice to know it's there
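
A quick sketch of getting that kind of summary: this uses the built-in `DataFrame.describe()` as a stand-in (an assumption; the notebook may use a different summary helper), on a small made-up table.

```python
import pandas as pd

# Hypothetical sales rows for illustration only.
sales = pd.DataFrame({
    "Store": [1, 2, 1, 2],
    "Sales": [5263, 6064, 5020, 5567],
    "Customers": [555, 625, 546, 601],
})

# describe() reports count, mean, std, min, quartiles and max
# for each numeric column.
summary = sales.describe()

print(summary.loc[["count", "mean", "std"]])
```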

02:10:31.260 | So what we do, you know, this is called a relational data set a relational data set is one where there's quite a few tables

02:10:38.300 | We have to join together. It's very easy to do that in pandas

02:10:41.960 | There's a thing called merge so I create a little function to do that

02:10:45.020 | And so I just started joining everything together join in the weather the Google Trends

02:10:48.300 | the stores

02:10:51.500 | Yeah, that's about everything I guess
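
The joining step described above can be sketched roughly like this. `left_join` is a hypothetical helper name and the tables are made-up samples. A left join keeps every row of the main table even when the lookup table has no match, which makes missing matches easy to spot afterwards.

```python
import pandas as pd


def left_join(left, right, left_on, right_on=None, suffix="_y"):
    """Left-join `right` onto `left` (hypothetical helper, not library code)."""
    if right_on is None:
        right_on = left_on
    # Columns that clash keep their name on the left and get `suffix` on the right.
    return left.merge(right, how="left", left_on=left_on,
                      right_on=right_on, suffixes=("", suffix))


# Made-up sample tables: store 3 has no entry in the stores table.
sales = pd.DataFrame({"Store": [1, 2, 3], "Sales": [5263, 6064, 8314]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

joined = left_join(sales, stores, "Store")
print(joined)

# Count rows that found no match in the right-hand table.
print(joined["StoreType"].isnull().sum())
```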

02:10:58.060 | You'll see there's one thing that I'm using from the fast.ai library, which is called add_datepart

02:11:03.740 | We talk about this a lot in the machine learning course

02:11:06.340 | But basically this is going to take a date and pull out of it a bunch of columns: day of week,

02:11:11.580 | Is it the start of a quarter, month of year, so on and so forth, and add them all into the data set

02:11:17.820 | Okay, so this is all standard pre-processing
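
To make the idea of pulling columns out of a date concrete, here is a rough sketch using the pandas `.dt` accessor. `add_date_columns` is a hypothetical stand-in written for illustration; it is not the fast.ai function, and the set of columns it adds is an assumption.

```python
import pandas as pd


def add_date_columns(df, col):
    """Return a copy of `df` with a few date-derived columns (illustrative only)."""
    dates = df[col].dt
    out = df.copy()
    out[col + "_year"] = dates.year
    out[col + "_month"] = dates.month
    out[col + "_day"] = dates.day
    out[col + "_dayofweek"] = dates.dayofweek  # Monday is 0
    out[col + "_is_quarter_start"] = dates.is_quarter_start
    return out


df = pd.DataFrame({"Date": pd.to_datetime(["2015-07-01", "2015-07-31"])})
expanded = add_date_columns(df, "Date")
print(expanded)
```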

02:11:23.380 | As we join everything together we fiddle around with some of the dates a little bit some of them are in month and year

02:11:28.700 | Format we turn it into date format

02:11:30.700 | We spend a lot of time

02:11:32.980 | Trying to

02:11:35.860 | Take information about for example holidays and add a column for like how long until the next holiday

02:11:41.920 | How long has it been since the last holiday?

02:11:44.100 | ditto for promos

02:11:46.580 | So on and so forth. Okay, so we do all that and at the very end
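
One possible way to build the "how long since the last holiday" and "how long until the next holiday" columns is sketched below, for a single store with one row per day. The column names and data are assumptions for illustration, and this is not necessarily how the original pipeline does it. With many stores you would apply the same logic per store.

```python
import pandas as pd

# Made-up daily rows for one store, sorted by date.
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=6, freq="D"),
    "SchoolHoliday": [0, 1, 0, 0, 1, 0],
})

# Keep the date only on holiday rows; other rows become NaT.
holiday_dates = df["Date"].where(df["SchoolHoliday"] == 1)

# Forward-fill gives the most recent holiday, back-fill gives the next one.
last_holiday = holiday_dates.ffill()
next_holiday = holiday_dates.bfill()

# Rows with no earlier (or later) holiday end up as NaN.
df["days_since_holiday"] = (df["Date"] - last_holiday).dt.days
df["days_until_holiday"] = (next_holiday - df["Date"]).dt.days

print(df)
```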

02:11:51.900 | We basically save a big structured data file that contains all that stuff

02:11:57.020 | Something that those of you that use pandas may not be aware of is that there's a very cool new format called feather

02:12:03.420 | Which you can save a pandas

02:12:06.100 | Data frame into this feather format

02:12:08.380 | It kind of pretty much takes it as it sits in RAM and dumps it to the disk

02:12:13.180 | and so it's like really really really fast the reason that you need to know this is because the

02:12:19.580 | Ecuadorian grocery competition that's on now has 350 million records

02:12:24.120 | So you will care about how long things take it took I believe about six seconds for me to save

02:12:30.820 | 350 million records to feather format, so it's pretty cool

02:12:34.380 | So at the end of all that I'd save it as feather format and for the rest of this discussion

02:12:39.740 | I'm just going to take it as given that we've got this nicely

02:12:43.700 | Processed feature-engineered file and I can just go read_feather. Okay, but for you to play along at home

02:12:49.780 | You will have to run those previous cells. Oh

02:12:53.020 | except the

02:12:55.660 | See these ones are commented out

02:12:57.940 | You don't have to run those because the file that you download from files.fast.ai has already done that for you, okay?

02:13:04.700 | All right

02:13:07.820 | So we basically have

02:13:09.820 | all these columns

02:13:12.780 | So it basically is going to tell us

02:13:15.460 | You know how many of this thing was sold on?

02:13:20.980 | This date at this store and so the goal of this competition is to find out

02:13:28.020 | How many things will be sold for each store for each type of thing in the future?

02:13:34.460 | Okay, and so that's basically what we're going to be trying to do

02:13:39.860 | And so here's an example of what some of the data looks like

02:13:42.580 | And so

02:13:46.420 | Next week we're going to see how to go through these steps

02:13:50.380 | But basically what we're going to learn is we're going to learn to split the columns into two types

02:13:56.500 | some columns we're going to treat as

02:13:59.340 | categorical, which is to say

02:14:01.900 | Store ID 1 and store ID 2 are not numerically related to each other; they're categories

02:14:09.700 | Right, we're going to treat day of week like that too: Monday and Tuesday, day zero and day one, not numerically

02:14:16.160 | Whereas distance in kilometers to the nearest competitor,

02:14:22.140 | That's a number that we're going to treat numerically

02:14:25.020 | Right so in other words the categorical variables. We basically are going to one hot encode them

02:14:30.580 | You can think of it as one hot encoding them, whereas the continuous variables we're going to be feeding into fully connected layers

02:14:38.700 | Just as is
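
The split into categorical and continuous columns can be sketched as follows. The column names and values are assumptions for illustration, and `pd.get_dummies` is used only to show what one-hot encoding looks like; the lecture says you can think of the categorical handling as one-hot encoding, not that this exact call is used.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Store": [1, 2, 1, 3],
    "DayOfWeek": [0, 1, 1, 0],
    "CompetitionDistance": [1270.0, 570.0, 1270.0, 14130.0],
})

cat_vars = ["Store", "DayOfWeek"]     # treated as categories, not numbers
cont_vars = ["CompetitionDistance"]   # treated as a real-valued number

# Mark the categorical columns so pandas stops treating them as numbers.
for name in cat_vars:
    df[name] = df[name].astype("category")

# Each category gets an integer code, which is just a label.
codes = df["Store"].cat.codes

# One-hot encoding: one indicator column per distinct store.
one_hot = pd.get_dummies(df["Store"], prefix="Store")

# Continuous columns are passed through as floats.
continuous = df[cont_vars].astype("float32")

print(codes.tolist())
print(one_hot)
print(continuous.dtypes)
```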

02:14:40.460 | Okay

02:14:42.460 | So what we'll be doing is we'll be basically

02:14:44.780 | creating a

02:14:47.820 | Validation set and you'll see like a lot of these are starting to look familiar

02:14:50.900 | This is the same function we used on planet and dog breeds to create a validation set
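
As a rough idea of what a function that picks validation rows can look like, here is a sketch that draws a random 20% of row indices. `random_val_idxs` is a hypothetical name and this is an assumption about the general approach, not the library function referred to above.

```python
import numpy as np


def random_val_idxs(n, val_pct=0.2, seed=42):
    """Return a random sample of row indices to hold out (illustrative only)."""
    rng = np.random.RandomState(seed)
    n_val = int(n * val_pct)
    # Shuffle all row indices and keep the first n_val of them.
    return rng.permutation(n)[:n_val]


n_rows = 1000
val_idx = random_val_idxs(n_rows)

# Everything not picked for validation stays in the training set.
train_mask = np.ones(n_rows, dtype=bool)
train_mask[val_idx] = False

print(len(val_idx), train_mask.sum())
```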

02:14:55.180 | There's some stuff that you haven't seen before

02:14:59.060 | where we're going to

02:15:01.940 | Basically rather than saying image data dot from CSV. We're going to say columnar data

02:15:08.560 | From data frame right so you can see like the basic API concepts will be the same, but they're a little different, right?

02:15:15.680 | but just like before we're going to get a learner and

02:15:19.600 | we're going to go lr_find

02:15:22.880 | to find our best learning rate and

02:15:25.120 | Then we're going to go dot fit with a metric

02:15:28.600 | with a cycle length

02:15:31.440 | Okay, so the basic sequence is going to end up looking hopefully very familiar. Okay, so we're out of time

02:15:39.480 | so what I suggest you do this week is like

02:15:42.720 | try to

02:15:45.360 | Enter as many Kaggle image competitions as possible, like try to really get this feel for like

02:15:51.640 | cycle lengths learning rates

02:15:54.560 | plotting things

02:15:57.760 | You know that

02:16:01.360 | That post I showed you at the start of class today that kind of took you through lesson one like

02:16:07.560 | Really go through that on as many image data sets as you can to just feel

02:16:12.620 | Really comfortable with it, right?

02:16:15.360 | because you want to get to the point where next week when we start talking about structured data that this idea of like how

02:16:21.960 | Learners kind of work and data works and data loaders and data sets and looking at pictures should be really you know intuitive

02:16:30.320 | Alright, good luck. See you next week

02:16:32.320 | (audience applauding)


Lesson 3: Deep Learning 2018

Chapters