Lesson 3: Deep Learning 2018

Chapters
0:0 
1:21 tmux
8:27 Review
38:5 File Link
47:13 Applying a Rectified Linear Unit
53:21 Activation
63:13 Fully Connected Layer
74:15 Softmax
76:3 Logarithms
89:52 Multi-Label Classification
102:29 Create Custom Metrics
104:26 Difference between Multi-Label and a Single Label
105:21 Sigmoid
114:25 Loss Function
115:50 Layer Groups
116:56 Learn Summary
119:49 Structured Data
120:0 Two Types of Data Set We Use in Machine Learning
124:49 Pandas
00:00:05.640 | 
But there's been a lot of cool activity on the forum this week and one of the things that's been really great to see 00:00:14.040 | 
Really helpful materials both for your classmates to better understand stuff and also for you to better understand stuff by 00:00:22.120 | 
Trying to teach what you've learned. I just wanted to highlight a few I've actually 00:00:29.040 | 
Posted to the wiki thread a few of these, but there's lots more 00:00:33.220 | 
Reshma has posted a whole bunch of nice introductory tutorials so for example if you're having any trouble getting connected with AWS 00:00:47.000 | 
How to go about logging in and getting everything working which I think is a really terrific thing and so it's a kind of thing 00:00:57.600 | 
Writing some notes for yourself to remind you how to do it 00:01:00.520 | 
You may as well post them for others to use as well, and by using a markdown file like this 00:01:06.280 | 
It's actually good practice if you haven't used github before if you put it up on github 00:01:10.700 | 
Everybody can now use it or of course you can just put it in the forum 00:01:18.200 | 
Thing that Reshma wrote up is that she noticed that I like using tmux 00:01:27.480 | 
Let me basically have a window. Let's see if I've got one. I'll show you 00:01:37.280 | 
You'll see that all of my windows pop straight up 00:01:39.820 | 
Basically, and I can continue running stuff in the background. I've got vim over here 00:01:45.760 | 
And I can kind of zoom into it, or I can move over to the top, which is, here's my Jupyter 00:01:50.600 | 
kernel running, and so forth. So if that sounds interesting, Reshma has a 00:01:58.520 | 
And it's actually got a whole bunch of stuff in her github, so that's really cool 00:02:04.520 | 
Apil Tamang has written a very nice kind of summary basically of our last lesson 00:02:15.880 | 
What are the key things we did and why did we do them? So if you're kind of 00:02:20.060 | 
Wondering like how does it fit together? I think this is a really helpful summary 00:02:26.160 | 
Like what did those couple of hours look like if we summarize it all into a page or two? 00:02:36.080 | 
Pavel has kind of done a deep dive on the learning rate finder 00:02:44.200 | 
Topic that a lot of you have been interested in learning more about particularly 00:02:47.680 | 
Those of you who have done deep learning before have realized that this is like a solution to a problem that you've been having for 00:02:54.120 | 
A long time and haven't seen before, and so it's kind of something which hasn't really been blogged about before, so this is the first 00:02:59.940 | 
Time I've seen this blogged about so when I put this on Twitter a link to 00:03:03.800 | 
Pavel's post it's been shared now hundreds of times 00:03:07.280 | 
It's been really really popular and viewed many thousands of times, so that's some great content 00:03:12.960 | 
Radek has posted lots of cool stuff. I really like this practitioner's guide to PyTorch which again 00:03:20.360 | 
This is for more advanced students, but it's aimed at people who have never used PyTorch before but know a bit about 00:03:27.200 | 
Numerical programming in general, and it's a quick introduction to how PyTorch is different 00:03:33.080 | 
And then there's been some interesting little bits of research like what's the relationship between learning rate and batch size so one of the 00:03:41.080 | 
Students actually asked me this before class and I said well one of the other students has written an analysis of exactly that 00:03:49.080 | 
so what he's done is basically looked through and tried different batch sizes and different learning rates and tried to see how they seem to 00:03:54.960 | 
Relate together and these are all like cool experiments, which you know you can try yourself 00:03:59.960 | 
Radek again, he's written something again a kind of a research into this question. I made a claim that 00:04:07.600 | 
The stochastic gradient descent with restarts finds more generalizable 00:04:14.240 | 
Parts of the function surface because they're kind of flatter, and he's been trying to figure out. Is there a way to measure that more directly? 00:04:20.440 | 
Not quite successful yet, but a really interesting piece of research 00:04:27.120 | 
introductions to convolutional neural networks 00:04:33.000 | 
something that we'll be learning about towards the end of this course, but I'm sure you've noticed we're using something called ResNet and 00:04:39.560 | 
Anand Saha actually posted a pretty impressive analysis of like what's a ResNet and why is it interesting? 00:04:46.400 | 
And this one's actually already been shared very widely around the internet, I've seen 00:04:51.280 | 
So more advanced students who are interested in 00:04:55.000 | 
Jumping ahead can look at that, and Apil Tamang also has done something similar 00:05:03.600 | 
Yeah, lots of stuff going on on the forums. I'm sure you've also noticed we have a beginner forum now 00:05:09.760 | 
specifically for you know asking questions which 00:05:17.760 | 
Dumb questions, but when there's lots of people around you talking about advanced topics. It might not feel that way 00:05:30.960 | 
Student who can help answer those questions, please do but remember when you do answer those questions try to answer in a way 00:05:37.560 | 
That's friendly to people that maybe you know have no more than a year of programming experience and haven't done any machine learning before 00:05:51.720 | 
Feel like you can contribute as well, and just remember all of the people we just looked at, or many of them, had never 00:05:58.520 | 
Posted anything to the internet before, right? I mean you don't have to be a particular kind of person to be allowed to blog 00:06:04.760 | 
or something. You can just jot down your notes, throw it up there and 00:06:08.880 | 
One handy thing is if you just put it on the forum, and you're not quite sure of some of the details then 00:06:16.800 | 
You know, you have an opportunity to get feedback and say like, ah well 00:06:22.000 | 
You know actually it works this way instead or or that's a really interesting insight had you thought about taking this further and so forth 00:06:29.480 | 
So what we've done so far is a kind of an introduction, just as a practitioner, to 00:06:35.920 | 
Convolutional neural networks for images, and we haven't really talked much at all about 00:06:42.460 | 
The theory or why they work or the math of them, but on the other hand what we have done is seen how to 00:06:51.360 | 
Build a model which actually works exceptionally well, in fact world-class level models 00:06:59.240 | 
and we'll kind of review a little bit of that today and 00:07:06.440 | 
We're going to dig in a little bit, quite a lot more actually, into the underlying theory of like 00:07:10.180 | 
What is a CNN? What's a convolution? 00:07:12.880 | 
How does this work? And then we're going to kind of go through this cycle where we're going to dig 00:07:18.260 | 
We're going to do a little intro into a whole bunch of application areas using neural nets for structured data 00:07:25.120 | 
so kind of like logistics or forecasting or you know financial data or that kind of thing and then looking at 00:07:33.080 | 
language applications NLP applications using recurrent neural nets and then 00:07:41.720 | 
Recommendation systems and so these will all be like 00:07:49.800 | 
It'll be like here's how you can get a state-of-the-art result without digging into the theory 00:07:55.880 | 
And then we're going to go back through those almost in reverse order 00:08:01.160 | 
So then we're going to dig right into collaborative filtering in a lot of detail and see how to write the code 00:08:06.960 | 
Underneath and how the math works underneath and then we're going to do the same thing for structured data analysis 00:08:12.760 | 
We're going to do the same thing for convnets for images and finally an in-depth deep dive into recurrent neural networks 00:08:29.280 | 
Also provide a bit more detail on some steps that we only briefly skimmed over 00:08:36.040 | 
So I want to make sure that we're all able to complete 00:08:38.920 | 
Kind of last week's assignment, which was the dog breeds 00:08:44.080 | 
I mean to basically apply what you've learned to another data set and I thought the easiest one to do would be the dog 00:08:49.520 | 
Breeds Kaggle competition and so I want to make sure everybody has everything you need to do this right now 00:08:54.280 | 
So and the first thing is to make sure that you know how to download 00:08:58.800 | 
Data, and so there's two main places at the moment we're kind of downloading data from. One is from Kaggle 00:09:08.720 | 
And so I'll first of all do the Kaggle version 00:09:20.360 | 
Which is here and to install it I think it's already in let's just double check 00:09:29.120 | 
Yeah, so it's or it should already be in your 00:09:38.600 | 
But to make sure one thing that happens is because this is downloading from the Kaggle website through like screen scraping every time Kaggle changes 00:09:45.420 | 
The website it breaks so anytime you try to use it and 00:09:48.940 | 
If Kaggle's websites changed recently you'll need to make sure you get the most recent version so you can always go to pip 00:10:01.280 | 
--upgrade, and so that'll just make sure that you've got the latest version of it and everything that it depends on 00:10:15.420 | 
Follow the instructions. Actually, I think Reshma was kind enough to, there we go, there's a Kaggle CLI 00:10:20.820 | 
Feel like everything you need to know can be found at Reshma's 00:10:35.540 | 
And then you provide your username with -u, you provide your password with -p, and then -c and the competition name 00:10:44.400 | 
And a lot of people in the forum have been confused about what to enter here 00:10:47.680 | 
And so the key thing to note is that when you're at a Kaggle competition 00:10:51.220 | 
After the /c there's a specific name, planet-understanding-etc. Right? That's the name you need 00:11:01.560 | 
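As a rough illustration of "the name after the /c", here is a minimal Python sketch that pulls that segment out of a competition URL. The helper name `competition_name` and the example URL are hypothetical, made up for this note; they are not part of the Kaggle CLI.

```python
from urllib.parse import urlparse


def competition_name(url: str) -> str:
    """Return the path segment that follows /c/ in a competition URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "c" not in parts or parts.index("c") + 1 >= len(parts):
        raise ValueError(f"no /c/<name> segment in {url!r}")
    return parts[parts.index("c") + 1]


# Hypothetical URL, used only for illustration.
name = competition_name("https://www.kaggle.com/c/some-competition-name/data")
print(name)
```

Whatever this returns is the string you would pass after -c.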
the other thing you'll need to make sure is that you've 00:11:04.280 | 
On your own computer have attempted to click download at least once because when you do it will ask you to accept the rules 00:11:14.580 | 
kg download will give you a hint, it'll say it looks like you might have forgotten 00:11:21.700 | 
Google account like anything other than a username password this won't work 00:11:25.980 | 
So you'll need to click forgot password on Kaggle and get them to send you a normal password 00:11:33.700 | 
Right and so when you do that you end up with a whole folder created for you with all of that competition data in it 00:11:41.960 | 
So a couple of reasons you might want to not use that 00:11:44.980 | 
The first is that you're using a data set that's not on Kaggle 00:11:48.220 | 
The second is that you don't want all of the data sets in a Kaggle competition for example the planet competition 00:11:54.700 | 
That we've been looking at a little bit. We'll look at again today 00:11:57.380 | 
Has data in two formats TIFF and JPEG the TIFF is 19 gigabytes and the JPEG is 600 megabytes 00:12:09.460 | 
So I'll show you a really cool kit, which actually somebody on the forum taught me 00:12:14.040 | 
I think was one of the MSAN students here at USF. There's a 00:12:26.220 | 
And then you install it by just clicking on install, if you haven't installed an extension before, and then from now on 00:12:33.520 | 
Every time you try to download something, so I'll try and download this file 00:12:40.460 | 
I'll just go ahead and cancel it right and now you see this little yellow button. That's added up here 00:13:02.980 | 
So what that does is like all of your cookies and headers and everything else needed to download that file is like save 00:13:13.980 | 
It's also useful if you're trying to download some, I don't know, TV show or something, anything where you're hidden behind a 00:13:20.220 | 
Login or something, you can grab it, and actually that is very useful for data science because quite often we want to 00:13:27.620 | 
Analyze things like videos on our consoles 00:13:31.140 | 
So this is a good trick. All right, so there's two ways to get the data 00:13:45.140 | 
So what I tend to do like you'll notice that I tend to assume that the data is in a directory called data 00:13:51.860 | 
That's a subdirectory of wherever your notebook is, right? 00:14:00.860 | 
You might want to put it directly in your home directory or you might want to put it on another drive or whatever 00:14:05.260 | 
so what I do is if you look inside my courses/dl1 folder, you'll see that data is actually a 00:14:15.660 | 
To a different drive, right? So you can put it anywhere you like and then you can just add a symbolic link 00:14:20.820 | 
Or you can just put it there directly. It's up to you 00:14:24.660 | 
If you haven't used symlinks before, they're like aliases or shortcuts on the Mac or Windows 00:14:30.340 | 
Very handy and there's some threads on the forum about how to use them if you want help with that 00:14:35.980 | 
That, for example, is also how we actually have the 00:14:41.660 | 
Available from the same place as our notebooks. It's just a symlink to where they come from 00:14:50.340 | 
Where things actually point to. In Linux you can just use the -l flag when listing a directory 00:14:59.340 | 
Exist and also show you which things are directories and so forth 00:15:06.580 | 
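To make the symlink idea concrete, here is a small Python sketch using only the standard library. The directory names (`big_drive`, `courses/dl1`) are placeholders created inside a temporary folder, so nothing on your real disk is touched; on Windows, creating symlinks may need extra permissions.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Placeholder locations: where the files really live, and where the notebooks live.
    real_data = os.path.join(root, "big_drive", "data")
    course_dir = os.path.join(root, "courses", "dl1")
    os.makedirs(real_data)
    os.makedirs(course_dir)

    # Make courses/dl1/data point at the real data directory.
    link = os.path.join(course_dir, "data")
    os.symlink(real_data, link, target_is_directory=True)

    # Roughly what a long directory listing tells you: is it a link, and where does it point?
    is_link = os.path.islink(link)
    target = os.readlink(link)
    print(is_link, target == real_data)
```

The notebook can keep reading from `data/` while the files sit on whatever drive has room.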
May be a little unclear based on what we've done so far is like 00:15:15.220 | 
How little code you actually need to do this end-to-end. So what I've got here in a single window is an entire 00:15:22.860 | 
End-to-end process to get a state-of-the-art result for cats versus dogs, right? 00:15:28.260 | 
The only step I've skipped is the bit where we downloaded it from Kaggle and then where we unzipped it, right? 00:15:42.660 | 
Import our libraries, and actually if you import this one, conv_learner, that basically imports everything else 00:15:48.900 | 
So that's that. We need to tell it the path of where things are, the size that we want, the batch size that we want 00:15:58.500 | 
So then and we're going to learn a lot more about what these do very shortly 00:16:02.340 | 
But basically we say how do we want to transform our data so we want to transform it in a way 00:16:07.500 | 
That's suitable to this particular kind of model and it assumes that the photos are side on photos 00:16:13.420 | 
And that we're going to zoom in up to 10% each time 00:16:19.500 | 
Based on paths and so remember this is this idea that there's a path called cats and a path called dogs 00:16:25.180 | 
And they're inside a path called train and a path called valid 00:16:33.500 | 
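As a sketch of that folder convention, the snippet below builds a tiny train/valid layout in a temporary directory and reads the labels back from the folder names. The helpers `build_layout` and `labelled_files` are made up for illustration; they only mimic the idea and are not how any particular library implements it.

```python
import os
import tempfile


def build_layout(root, splits=("train", "valid"), classes=("cats", "dogs"), n=2):
    """Create root/<split>/<class>/ folders, each holding n empty placeholder files."""
    for split in splits:
        for cls in classes:
            folder = os.path.join(root, split, cls)
            os.makedirs(folder)
            for i in range(n):
                open(os.path.join(folder, f"{cls}.{i}.jpg"), "w").close()


def labelled_files(split_dir):
    """Return (classes, [(relative_path, class_index), ...]) using folder names as labels."""
    classes = sorted(d for d in os.listdir(split_dir)
                     if os.path.isdir(os.path.join(split_dir, d)))
    items = []
    for idx, cls in enumerate(classes):
        for fname in sorted(os.listdir(os.path.join(split_dir, cls))):
            items.append((os.path.join(cls, fname), idx))
    return classes, items


with tempfile.TemporaryDirectory() as root:
    build_layout(root)
    classes, items = labelled_files(os.path.join(root, "train"))

print(classes)     # ['cats', 'dogs']
print(len(items))  # 4
```

The label of each image is simply the name of the folder it sits in, which is why no separate label file is needed for this layout.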
Override these with other things, so if your things are in different named folders you could either rename them or you can see here 00:16:40.340 | 
There's like a trn_name and a val_name, you can always pick something else here 00:16:48.820 | 
So if you want to submit something to Kaggle you'll need to fill in the name of the folder where the test 00:16:54.380 | 
Set is, and obviously those won't be labeled 00:17:00.220 | 
So then we create a model from a pre trained model. It's from a ResNet 50 model using this data 00:17:10.380 | 
That has all of the layers, but the last few frozen and again, we'll learn a lot more about what that means 00:17:22.220 | 
Notice here I didn't say precompute=True. Again 00:17:27.300 | 
There's been some confusion on the forums about like what that means 00:17:30.260 | 
It's only something that makes it a little faster for this first step, right, so you can always skip it 00:17:37.620 | 
And if you're at all confused about it, or it's causing you any problems, just leave it off, right, because it's just a 00:17:43.700 | 
It's just a shortcut which caches some of the intermediate steps so that they don't have to be recalculated each time 00:17:52.020 | 
Okay, and remember that when we are using precomputed activations data augmentation doesn't work, right, so even if you ask for data 00:18:00.420 | 
augmentation, if you've got precompute=True 00:18:02.860 | 
It doesn't actually do any data augmentation because it's using the cached 00:18:10.540 | 
So in this case to keep this as simple as possible I have no precomputed anything going on 00:18:25.500 | 
something we haven't seen before and we'll learn about in the second half is 00:18:29.620 | 
called bn_freeze. For now all you need to know is that if you're using a 00:18:37.140 | 
Bigger, deeper model like ResNet-50 or ResNeXt-101 on a data set 00:18:42.780 | 
That's very very similar to ImageNet, like these cats and dogs data sets, in other words 00:18:51.780 | 
You know, of a similar size to ImageNet, like somewhere between 200 and 500 pixels 00:18:57.300 | 
You should probably add this line when you unfreeze. For those of you that are more advanced, what it's doing is it's 00:19:08.460 | 
Moving averages to not be updated, but in the second half of this course you're going to learn all about why we do that 00:19:14.340 | 
It's something that's not supported by any other library 00:19:17.020 | 
But it turns out to be super important anyway, so we do one more epoch 00:19:25.820 | 
And then at the end we use test time augmentation 00:19:29.540 | 
To ensure that we get the best predictions we can, and that gives us 99.45% 00:19:37.180 | 
So that's it, right? So when you try a new data set, that's basically the minimum set of steps 00:19:48.260 | 
You'll notice this is assuming I already know what learning rate to use, so you'd use a learning rate finder for that 00:19:54.260 | 
It's assuming that I know the directory layout 00:20:00.820 | 
So that's kind of a minimum set now one of the things that I wanted to make sure 00:20:05.020 | 
You had an understanding of how to do is how to use other libraries other than fast AI 00:20:11.780 | 
And so I feel like the best thing to look at is to look at Keras because Keras is a library 00:20:20.820 | 
Keras sits on top of actually a whole variety of different back ends; mainly people nowadays use it with TensorFlow 00:20:28.480 | 
There's also an MXNet version. There's also a Microsoft CNTK version 00:20:35.020 | 
So what I've got if you do a git pull you'll see that there's a 00:20:42.220 | 
Called Keras lesson one where I've attempted to replicate at least parts of lesson one in Keras 00:20:52.580 | 
I'm not going to talk more about batch norm freeze now other than to say 00:21:06.060 | 
Which has got a number larger than 34 at the end, so like ResNet-50 or ResNeXt-101, and you're 00:21:12.620 | 
Training on a data set that is very similar to ImageNet 00:21:17.500 | 
So it's like normal photos of normal sizes where the thing of interest takes up most of the frame 00:21:22.780 | 
Then you probably should add bn_freeze(True) after unfreeze 00:21:27.180 | 
If in doubt, try training it with and then try training it without 00:21:32.700 | 
More advanced students can certainly talk about it on the forums this week 00:21:36.480 | 
And we will be talking about the details of it in the second half of the course when we come back to our 00:21:42.740 | 
CNN in-depth section in the second last lesson 00:22:00.940 | 
Remember I mentioned that this idea that you've got a thing called train and a thing called valid and inside that you've got a 00:22:06.180 | 
Thing called dogs and a thing called cats is a standard way of providing 00:22:12.620 | 
Labeled images, so Keras does that too, right? So we're going to tell it where the training set and the validation set are 00:22:22.820 | 
Now you'll notice that in Keras we need much much much more 00:22:30.660 | 
More importantly each part of that code has many many many more things you have to set and if you set them wrong 00:22:40.300 | 
I'll give you a summary of what they are. So you're basically rather than creating a single 00:22:47.700 | 
Data object in Keras we first of all have to define something called a data 00:22:52.860 | 
Generator to say how to generate the data and so a data generator 00:22:57.140 | 
We basically have to say what kind of data augmentation 00:23:07.340 | 
Normalization do we want to do, so whereas with fast AI we just say 00:23:13.180 | 
Whatever resnet 50 requires just do that for me, please 00:23:16.780 | 
We actually have to kind of know a little bit about what's expected of us 00:23:20.860 | 
Generally speaking copy and pasting Keras code from the internet is a good way to make sure you've got the right 00:23:28.660 | 
And again, it doesn't have a kind of a standard set of like here the best data augmentation parameters to use for photos 00:23:36.020 | 
So, you know, I've copied and pasted all of this from the Keras 00:23:42.620 | 
So I don't know if it's I don't think it's the best set to use at all, but it's the set that they're using in their 00:23:48.500 | 
So having said this is how I want to generate data, so horizontally flip sometimes, you know, zoom sometimes, shear sometimes 00:23:55.860 | 
We then create a generator from that by taking that data generator and saying I want to generate 00:24:02.300 | 
Images by looking from a directory and we pass in the directory which is of the same 00:24:10.660 | 
You'll see there's some overlaps with kind of how fast AI works here 00:24:14.780 | 
You tell it what size images you want to create you tell it what batch size you want in your mini batches 00:24:20.100 | 
And then there's something here not to worry about too much 00:24:23.340 | 
But basically if you've just got two possible outcomes you would generally say binary here 00:24:28.300 | 
If you've got multiple possible outcomes you would say categorical. Yeah, so we've only got cats or dogs. So it's binary 00:24:34.460 | 
So an example of like where things get a little more complex is you have to do the same thing for the validation set 00:24:44.300 | 
That doesn't have data augmentation because obviously for the validation set unless you're using TTA that's going to stuff things up 00:24:56.140 | 
You randomly reorder the images so that they're always shown in different orders to make it more random 00:25:04.060 | 
Vital that you don't do that because if you shuffle the validation set you then can't track how well you're doing 00:25:10.020 | 
It's in a different order for the labels. That's a 00:25:12.420 | 
Basically, these are the kind of steps you have to do every time with Keras 00:25:20.340 | 
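A toy illustration of why shuffling the validation set breaks the bookkeeping: if predictions come back in a shuffled order but the labels stay in file order, even a perfect model scores badly. This is a plain-Python sketch with made-up data, not Keras code.

```python
import random


def accuracy(preds, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


labels = [0, 1] * 50            # 100 made-up ground-truth labels, in file order
preds_in_order = list(labels)   # a "perfect" model, predictions in the same order

# Simulate a shuffled validation generator: same predictions, different order.
rng = random.Random(0)
preds_shuffled = preds_in_order[:]
rng.shuffle(preds_shuffled)

acc_ordered = accuracy(preds_in_order, labels)
acc_shuffled = accuracy(preds_shuffled, labels)
print(acc_ordered, acc_shuffled)
```

The model has not changed between the two numbers; only the alignment between predictions and labels has.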
So again, the reason I was using ResNet-50 is that Keras doesn't have ResNet-34, unfortunately 00:25:26.120 | 
So I just wanted to compare like with like, so we've got to use ResNet-50 here 00:25:29.680 | 
There isn't the same idea with Keras of saying like construct a model that is suitable for this data set for me 00:25:40.940 | 
So the way you do it is to basically say this is my base model and then you have to construct on top of that manually 00:25:48.700 | 
The layers that you want to add and so by the end of this course, you'll understand why it is that these 00:25:53.780 | 
particular three layers are the layers that we add 00:25:57.060 | 
So having done that in Keras you basically say okay 00:26:02.460 | 
this is my model and then again there isn't like a 00:26:05.980 | 
Concept of like automatically freezing things or an API for that 00:26:10.680 | 
so you just have to loop through the layers that you want to freeze and 00:26:18.840 | 
In Keras, there's a concept we don't have in fast AI or PyTorch of compiling a model 00:26:25.640 | 
So basically once your model's ready to use you have to compile it 00:26:28.720 | 
Passing in what kind of optimizer to use what kind of loss to look for or what metrics so again with fast AI 00:26:35.920 | 
You don't have to pass this in because we know what loss is the right loss to use you can always override it 00:26:42.620 | 
But for a particular model we give you good defaults 00:26:47.980 | 
Rather than calling fit you call fit_generator 00:26:50.980 | 
Passing in those two generators that you saw earlier the train generator and the validation generator 00:26:56.500 | 
For reasons I don't quite understand Keras expects you to also tell it how many batches there are per epoch 00:27:04.000 | 
So the number of batches is equal to the size of the generator 00:27:08.340 | 
Divided by the batch size you can tell it how many epochs 00:27:17.420 | 
Processes or how many workers to use for pre-processing? 00:27:20.900 | 
Unlike fast AI the default in Keras is basically not to use any 00:27:27.500 | 
So to get good speed you've got to make sure you include this 00:27:32.620 | 
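The "number of batches per epoch" arithmetic can be sketched in a couple of lines. The function name `batches_per_epoch` and the sample counts are made up for illustration; rounding up with `math.ceil` is an assumption added here so that a final partial batch is still counted.

```python
import math


def batches_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


print(batches_per_epoch(1000, 64))  # 16  (15 full batches plus one partial batch)
print(batches_per_epoch(128, 64))   # 2
```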
And so that's basically enough to start fine-tuning the last layers 00:27:42.820 | 
So as you can see I got to a validation accuracy of 95% 00:27:46.140 | 
But as you can also see something really weird happened where after one it was like 49 and then it was 69 and then 95 00:27:54.900 | 
Why these are so low? That's not normal. There may be a bug in Keras. There may be a bug in my code 00:28:01.500 | 
I reached out on Twitter to see if anybody could figure it out, but they couldn't I guess this is one of the challenges with using 00:28:08.700 | 
Something like this is one of the reasons I wanted to use fast AI for this course is it's much harder to screw things up 00:28:14.740 | 
So I don't know if I screwed something up or somebody else did yes, you know 00:28:18.700 | 
This is using the tensorflow back end yeah, yeah, and if you want to run this to try it out yourself 00:28:38.500 | 
Okay, because it's not part of the fast AI environment by default 00:28:42.720 | 
But that should be all you need to do to get that working 00:28:54.060 | 
There isn't a concept of like layer groups or differential learning rates or partial unfreezing or whatever 00:29:00.420 | 
So you have to decide like I had to print out all of the layers and decide manually 00:29:04.980 | 
How many I wanted to fine-tune, so I decided to fine-tune everything from layer 140 onwards 00:29:10.280 | 
So that's why I just looped through like this 00:29:12.280 | 
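The "loop through the layers" step can be sketched without any deep learning library. The `Layer` class below is a stand-in invented for this note; it only mirrors the idea of a per-layer `trainable` flag, and the total of 175 layers is an arbitrary number chosen so the split at 140 is visible.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Stand-in for a framework layer; only the trainable flag matters here."""
    name: str
    trainable: bool = True


def fine_tune_from(layers, split_at):
    """Freeze every layer before split_at and leave the rest trainable."""
    for layer in layers[:split_at]:
        layer.trainable = False
    for layer in layers[split_at:]:
        layer.trainable = True


layers = [Layer(f"layer_{i}") for i in range(175)]  # 175 is an arbitrary count
fine_tune_from(layers, 140)

n_trainable = sum(layer.trainable for layer in layers)
print(n_trainable)  # 35
```

In a real framework you would then recompile (or otherwise rebuild the optimizer) so the change in trainable layers takes effect.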
After you change that you have to recompile the model 00:29:15.540 | 
And then after that I then ran another step and again 00:29:19.540 | 
I don't know what happened here the accuracy of the training set stayed about the same but the validation set totally fell in the hole 00:29:25.380 | 
But I mean the main thing to note is even if we put aside the validation set 00:29:32.340 | 
We're getting I mean, I guess the main thing is there's a hell of a lot more code here 00:29:36.300 | 
Which is kind of annoying but also the performance is very different. So we're also here even on the training set 00:29:42.860 | 
We're getting like 97% after four epochs that took a total of about eight minutes 00:29:51.140 | 
99.5% on the validation set and it ran a lot faster. So it was like 00:30:04.860 | 
Depending on what you do particularly if you end up wanting to deploy stuff to mobile devices at the moment 00:30:12.880 | 
The kind of PyTorch on mobile situation is very early 00:30:18.020 | 
So you may find yourself wanting to use tensorflow or you may work for a company that's kind of settled on tensorflow 00:30:24.340 | 
So if you need to convert something like redo something you've learned here in tensorflow 00:30:30.980 | 
You probably want to do it with Keras, but just recognize 00:30:35.160 | 
you know, it's going to take a bit more work to get there and 00:30:38.700 | 
By default it's much harder to get, I mean, to get the same state-of-the-art results you get with fast AI 00:30:46.140 | 
You'd have to like replicate all of the state-of-the-art 00:30:49.620 | 
Algorithms that are in fast AI so it's hard to get the same 00:30:53.300 | 
Level of results, but you can see the basic ideas are similar 00:31:01.140 | 
It's certainly possible, you know, like there's nothing I'm doing in fast AI that like would be impossible 00:31:07.380 | 
But like you would have to implement stochastic gradient descent with restarts. You would have to 00:31:11.260 | 
Implement differential learning rates, you would have to implement batch norm freezing 00:31:16.820 | 
Which you probably don't want to do. I know, well, that's not quite true 00:31:20.940 | 
I think one person at least on the forum is 00:31:23.100 | 
Attempting to create a Keras-compatible version, or TensorFlow-compatible version, of fast AI 00:31:30.620 | 
I actually spoke to Google about this a few weeks ago, and they're very interested in getting fast AI ported to tensorflow 00:31:36.420 | 
So maybe by the time you're looking at this on the MOOC, maybe that will exist. I certainly hope so 00:31:44.580 | 
Anyway, so Keras and TensorFlow are certainly not 00:31:52.940 | 
That difficult to handle and so I don't think you should worry if you're told you have to learn them 00:31:57.900 | 
After this course for some reason it'll only take you a couple of days. I'm sure 00:32:02.020 | 
So that's kind of most of the stuff you would need to 00:32:10.780 | 
Kind of complete this assignment from last week 00:32:14.460 | 
Which was like try to do everything you've seen already, but on the dog breeds data set, and just to remind you 00:32:21.300 | 
In the kind of last few minutes of last week's lesson I showed you how to do much of that 00:32:28.940 | 
Including like how I actually explored the data to find out like what the classes were and how big the images were and stuff like 00:32:37.860 | 
That right so if you've forgotten that or didn't quite follow it all last week check out the video from last week to see 00:32:45.380 | 
One thing that we didn't talk about is how do you actually submit to Kaggle? So how do you actually get predictions? 00:32:51.200 | 
So I just wanted to show you that last piece as well 00:32:54.160 | 
And on the wiki thread this week. I've already put a little image of this to show you these steps 00:33:02.980 | 
Website for every competition there's a section called evaluation and they tell you what to submit and so I just copied and pasted these 00:33:10.900 | 
Two lines from there, and so it says we're expected to submit a file where the first line 00:33:17.060 | 
Contains the word ID and then a comma separated list of all of the possible dog breeds 00:33:24.300 | 
And then every line after that will contain the ID itself 00:33:28.700 | 
Followed by all the probabilities of all the different dog breeds 00:33:37.700 | 
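Here is a minimal sketch of that file format using Python's built-in csv module. The breed names, image ids and probabilities are invented placeholders, the lowercase `id` header is an assumption about how the word is spelled in the file, and `write_submission` is a hypothetical helper name.

```python
import csv
import io


def write_submission(out, classes, ids, probs):
    """Write a header row (id + class names), then one row of probabilities per image."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id"] + list(classes))
    for image_id, row in zip(ids, probs):
        if len(row) != len(classes):
            raise ValueError("each row needs one probability per class")
        writer.writerow([image_id] + [f"{p:.6f}" for p in row])


# Made-up placeholder data: 2 images, 3 classes.
classes = ["breed_a", "breed_b", "breed_c"]
ids = ["img_001", "img_002"]
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

buf = io.StringIO()
write_submission(buf, classes, ids, probs)
text = buf.getvalue()
print(text)
```

Writing to a real file instead of a `StringIO` buffer only needs `open(path, "w", newline="")` in place of `buf`.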
So I recognize that inside our data object there's a .classes 00:33:41.400 | 
Which has got in alphabetical order all of the classes 00:33:50.460 | 
So it's got all of the different classes and then inside 00:33:54.580 | 
data.test_ds you can also see there's all the file names 00:34:08.100 | 
Was not provided in the kind of Keras style format where the dogs and cats are in different folders 00:34:15.260 | 
But instead it was provided as a CSV file of labels, right? So when you get a CSV file of labels you use 00:34:22.780 | 
ImageClassifierData.from_csv rather than ImageClassifierData.from_paths 00:34:30.900 | 
There isn't an equivalent in Keras, so you'll see like on the Kaggle forums people 00:34:35.100 | 
Share scripts for how to convert it to Keras-style folders 00:34:39.380 | 
But in our case we don't have to, we just go ImageClassifierData.from_csv passing in that CSV file 00:34:44.860 | 
And so the CSV file has automatically told the data, you know, what the classes are 00:34:52.100 | 
And then also we can see from the folder of test images what the file names of those are 00:35:02.680 | 
We're ready to go so I always think it's a good idea to use TTA 00:35:08.040 | 
As you saw with that dogs and cats example just now it can really improve things particularly when your model is less good 00:35:15.240 | 
So I can say learn.TTA and if you pass in 00:35:29.600 | 
Then it's going to give you predictions on the test set rather than the validation set okay, and now obviously we can't now get 00:35:37.480 | 
An accuracy or anything because by definition. We don't know the labels for the test set right 00:35:48.580 | 
PyTorch models give you back the log of the predictions 00:35:53.240 | 
So then we just have to go exp of that to get back our probabilities 00:35:57.720 | 
So in this case the test set had 10,357 00:36:01.680 | 
Images in it, and there are 120 possible breeds, all right, so we get back a matrix of that size 00:36:15.400 | 
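As a rough illustration of that exp step, here is a minimal numpy sketch. The `logits` array is made up, and `log_preds` is only a stand-in for whatever the model returns; in the lecture the real matrix is 10,357 rows by 120 columns.

```python
import numpy as np

# Hypothetical raw scores for 4 images and 3 classes (not real model output).
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0],
                   [-3.0, 1.0, 1.0],
                   [5.0, -5.0, 0.0]])

# Build log-probabilities so we have something shaped like the model's output.
log_preds = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# exp undoes the log: each row is now a set of probabilities.
probs = np.exp(log_preds)

print(probs.shape)        # one row per image, one column per class
print(probs.sum(axis=1))  # each row sums to (approximately) 1
```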
So the easiest way to do that is with pandas if you're not familiar with pandas 00:36:20.160 | 
There's lots of information online about it or check out the machine learning course intro to machine learning that we have 00:36:27.280 | 
But basically we can just go pd.DataFrame and pass in that matrix and 00:36:32.200 | 
Then we can say the names of the columns are equal to data.classes and 00:36:37.080 | 
Then finally we can insert a new column at position zero called id that contains the file names 00:36:44.080 | 
But you'll notice that the file names contain 00:36:49.360 | 
Five letters at the start that we don't want and four letters at the end that we don't want, so I just slice those off 00:37:08.640 | 
I called the data frame ds. I should have used df, not ds 00:37:23.240 | 
Okay, so you can now call to_csv on the data frame and 00:37:27.400 | 
Quite often you'll find these files actually get quite big 00:37:32.080 | 
So it's a good idea to say compression='gzip' and that'll zip it up on the server for you, and that's going to create a 00:37:41.680 | 
CSV file on the server, on wherever you're running this Jupyter notebook 00:37:46.920 | 
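The pandas steps just described can be sketched roughly as follows. The class names, file names and probabilities are invented stand-ins for `data.classes`, the test file names and the `probs` matrix, and the output goes to a temporary directory rather than the course's data folder.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for data.classes, the test file names, and probs.
classes = ['affenpinscher', 'afghan_hound', 'airedale']
fnames = ['test/aaa111.jpg', 'test/bbb222.jpg']
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])

df = pd.DataFrame(probs)
df.columns = classes
# Drop the 5 leading characters ('test/') and the 4 trailing ones ('.jpg').
df.insert(0, 'id', [f[5:-4] for f in fnames])

# Write a gzip-compressed CSV; index=False keeps the row numbers out of the file.
out_path = os.path.join(tempfile.mkdtemp(), 'subm.gz')
df.to_csv(out_path, compression='gzip', index=False)

# Read it back to double check the layout before uploading.
check = pd.read_csv(out_path, compression='gzip')
print(check)
```

Reading the file back is only a sanity check that the header row is `id` followed by the breed names.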
You now need to get that back to your computer so you can upload it 00:37:49.860 | 
Or you can use kaggle-cli, so you can type kg submit and do it that way. I 00:37:56.640 | 
Generally download it to my computer because I often like to just double check it all looks okay 00:38:02.520 | 
So to do that there's a cool little thing called FileLink, and if you run FileLink 00:38:08.800 | 
With a path on your server it gives you back a URL 00:38:12.320 | 
Which you can click on and it'll download that file from the server onto your computer 00:38:22.040 | 
Can go ahead and save it and then I can see in my downloads 00:38:36.600 | 
If you want to open there yeah, and as you can see it's exactly what I asked for there's my 00:38:49.760 | 
Then here's my first row containing the file name and the 120 different probabilities 00:38:54.740 | 
Okay, so then you can go ahead and submit that to Kaggle 00:38:58.600 | 
Through their regular form, and so you can see we've now got a good way of both 00:39:06.240 | 
Grabbing any file off the internet and getting it to our AWS instance or Paperspace or whatever by using that 00:39:14.640 | 
Cool little extension in Chrome, and we've also got a way of grabbing stuff off our server easily 00:39:22.520 | 
If you're more command-line oriented you can also use scp of course, but I kind of like doing everything through the notebook 00:39:32.880 | 
One other question I had during the week was, what if I want to just get a prediction for a single 00:39:42.360 | 
Image? So for example, you know, maybe I want to get the first file from my validation set 00:39:51.080 | 
So you can always look at a file just by calling Image.open 00:40:02.520 | 
So what you can do is there's actually I'll show you the shortest version 00:40:21.800 | 
So you've seen tfms_from_model before 00:40:27.120 | 
Normally we just put it all in one variable, but actually behind the scenes it was returning two things 00:40:32.220 | 
It was returning training transforms and validation transforms, so I can actually split them apart 00:40:36.840 | 
And so here you can see I'm actually applying, for example, my training transforms, or probably more likely I want to apply my validation transforms 00:40:46.760 | 
That gives me back an array containing the image the transformed image 00:40:55.920 | 
Everything that gets passed to or returned from our models is 00:41:00.560 | 
Generally assumed to be a mini batch right generally assumed to be a bunch of images 00:41:05.780 | 
So we'll talk more about some numpy tricks later, but basically in this case. We only have one image 00:41:12.220 | 
So we have to turn that into a mini-batch of images. So in other words, we need to create a tensor that is not just 00:41:20.960 | 
Rows by columns by channels, but number of images by rows by columns by channels, and it has one image 00:41:27.980 | 
So it basically becomes a four-dimensional tensor. So there's a cool little trick in numpy that if you index 00:41:34.360 | 
Into an array with None, that basically adds an additional unit axis to the start 00:41:40.760 | 
So it turns it from an image into a mini-batch of one image, and so that's why we had to do that 00:41:46.000 | 
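A minimal numpy sketch of that None-indexing trick, using a made-up blank image; the 224 by 224 size is only an assumption for illustration.

```python
import numpy as np

# Hypothetical single image: rows x columns x channels.
im = np.zeros((224, 224, 3))

# Indexing with None adds a new axis of length 1 at the front,
# turning one image into a mini-batch containing one image.
batch = im[None]

print(im.shape)     # three dimensions
print(batch.shape)  # four dimensions, the first of size 1
```

`im[None]` is equivalent to `im[np.newaxis]`; no data is copied, only the shape changes.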
So if you basically find you're trying to do things with a single image 00:41:51.360 | 
With any kind of PyTorch or fastai thing, this is just something you might find: it says like expecting four 00:41:59.160 | 
Dimensions, only got three. It probably means that. Or if you get back a return 00:42:04.420 | 
Value from something that has like some weird first axis, that's probably why: it's probably giving you back a mini-batch 00:42:12.200 | 
Okay, and so we'll learn a lot more about this, but it's just something to be aware of 00:42:16.040 | 
Okay, so that's kind of everything you need to do in practice 00:42:25.360 | 
So now we're going to kind of get into a little bit of theory 00:42:30.480 | 
What's actually going on behind the scenes with these convolutional neural networks, and you might remember in 00:42:49.240 | 
Which we stole from this fantastic website, setosa.io/ev, Explained Visually 00:42:55.260 | 
And we learned that a convolution is something where we basically have a little matrix 00:43:01.320 | 
In deep learning nearly always three by three, a little matrix where we basically multiply every element of that matrix 00:43:08.920 | 
By every element of a three by three section of an image 00:43:12.600 | 
Add them all together to get the result of that convolution at one point, right. Now 00:43:25.400 | 
These various layers that we saw in the Zeiler and Fergus paper, and to do that again 00:43:31.720 | 
I'm going to steal off somebody who's much smarter than I am 00:43:37.520 | 
Guy called Otavio Good. Otavio Good was the guy who created Word Lens 00:43:43.240 | 
Which nowadays is part of Google Translate. If on Google Translate you've ever done that thing where you point your camera 00:43:51.680 | 
At something which has any kind of foreign language on it and in real time it overlays it with the translation 00:43:57.520 | 
That was Otavio's company that built that 00:44:00.160 | 
And so he was kind enough to share this fantastic video he created. He's at Google now 00:44:08.320 | 
And I want to kind of step you through it because I think it explains really, really well 00:44:11.940 | 
What's going on, and then after we look at the video we're going to see how to implement a whole 00:44:17.160 | 
Sequence of convolutional layers, an entire set of layers of a convolutional neural network, in Microsoft Excel 00:44:22.960 | 
So whether you're a visual learner or a spreadsheet learner, hopefully you'll be able to understand all this 00:44:31.480 | 
And something that we're going to do later in the course is we're going to learn to recognize digits 00:44:35.920 | 
So we'll do it like end-to-end. We'll do the whole thing. So this is pretty similar 00:44:39.840 | 
So we're going to try and recognize in this case letters 00:44:43.200 | 
So here's an a which obviously it's actually a grid of numbers, right? 00:44:48.440 | 
And so there's the grid of numbers. And so what we do is we take our first 00:44:52.800 | 
Convolutional filter, so this is assuming that these are already learnt 00:44:58.760 | 
Right, and you can see this one, it's got white down the right-hand side and black down the left 00:45:04.440 | 
So it's like 0 0 0, or maybe negative 1 negative 1 negative 1, 0 0 0, 1 1 1, and so we're taking each 00:45:10.720 | 
3 by 3 part of the image and multiplying it by that 3 by 3 00:45:15.380 | 
Matrix, not as a matrix product but an element-wise product, and so you can see what happens is 00:45:25.120 | 
Matching the edge of the a and the black edge isn't we're getting green 00:45:30.160 | 
We're getting a positive and everywhere where it's the opposite. We're getting a negative 00:45:34.280 | 
We're getting a red right and so that's the first filter creating the first 00:45:39.520 | 
The result of the first kernel, right, and so here's a new kernel 00:45:44.740 | 
This one has got a white stripe along the top, right, so we literally scan it through every three by three part of the matrix 00:45:55.280 | 
Nine bits of the a by the nine bits of the filter to find out whether it's red or green and how red or green it is 00:46:01.880 | 
Okay, and so this is assuming we had two filters one was a bottom edge 00:46:05.880 | 
One was a left edge and you can see here the top edge not surprisingly 00:46:09.960 | 
It's red here. Sorry bottom edge was red here and green here the right edge red here and green here 00:46:15.560 | 
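To make the scanning concrete, here is a small numpy sketch of sliding a 3 by 3 filter over an image and taking the element-wise product and sum at each position. The tiny image and the helper name `conv2d_valid` are made up for illustration; real libraries do this far more efficiently.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`; at each position multiply element-wise and sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]
            out[r, c] = (window * kernel).sum()  # element-wise product, not a matrix product
    return out

# Hypothetical image: dark (0) on the left, bright (1) on the right.
image = np.array([[0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1]], dtype=float)

# Filter with -1 down the left column and +1 down the right column.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

result = conv2d_valid(image, kernel)             # positive where dark meets bright
flipped = conv2d_valid(image[:, ::-1], kernel)   # negative where bright meets dark
print(result)
print(flipped)
```

Positive outputs correspond to the green cells in the video and negative outputs to the red ones.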
And then in the next step we add a non-linearity 00:46:18.320 | 
Okay, the rectified linear unit which literally means throw away the negatives so here the reds all gone 00:46:26.000 | 
Okay, so here's layer one the input here's layer two the result of two convolutional filters 00:46:31.960 | 
Here's layer three, which is throw away all of the red stuff 00:46:36.640 | 
And that's called a rectified linear unit, and then layer four is something called a max pool 00:46:47.200 | 
Part of this grid and we replace it with its maximum right so it basically makes it half the size 00:46:53.560 | 
It's basically the same thing, but half the size and then we can go through and do exactly the same thing 00:47:00.600 | 
Filter three by three filter that we put through each of the two results of the previous layer 00:47:10.520 | 
Right so get rid of all the negatives so we just keep the positives. That's called applying a rectified linear unit 00:47:18.800 | 
That gets us to our next layer of this convolutional neural network 00:47:22.840 | 
So you can see that by you know at this layer back here. It was kind of very interpretable 00:47:29.180 | 
It's like we've either got bottom edges or left edges, but then the next layer was combining 00:47:34.360 | 
The results of convolution so it's starting to become a lot less clear like intuitively what's happening 00:47:40.480 | 
But it's doing the same thing, and then we do another max pool, right, so we replace every two by two or three by three 00:47:47.680 | 
Section with a single digit so here this two by two. It's all black so we replaced it with a black 00:47:53.800 | 
All right, and then we go and we take that and we we compare it 00:47:58.200 | 
To basically a kind of a template of what we would expect to see if it was an A 00:48:05.860 | 
If it was a B, and we see how closely it matches, and we can do it in exactly the same way 00:48:11.500 | 
We can multiply every one of the values in this four by eight matrix with every one of the four by eight in this one 00:48:19.520 | 
And this one and this one and we add we just add them together to say like how often does it match? 00:48:24.720 | 
Versus how often does it not match, and then that could be converted to give us a percentage 00:48:30.720 | 
Probability that this is an A. So in this case this particular template matched well with A 00:48:38.720 | 
So notice we're not doing any training here, right? This is how it would work if we have a pre trained model 00:48:45.920 | 
So when we download a pre-trained ImageNet model off the internet and use it on an image without any changes to it 00:48:51.820 | 
This is what's happening or if we take a model that you've trained and you're applying it to some test set or to some new image 00:48:58.840 | 
This is what it's doing, right: it's basically taking it through, it's applying a convolution to each layer, well, multiple 00:49:09.080 | 
And then doing the rectified linear unit, so throw away the negatives, and then do the max pool 00:49:16.360 | 
And then repeat that a bunch of times and so then we can do it with a new 00:49:21.840 | 
Letter a or letter B or whatever and keep going through 00:49:29.480 | 
So as you can see that's a far nicer visualization than I could have created, because I'm not Otavio 00:49:35.360 | 
So thanks to him for sharing this with us because it's totally awesome 00:49:39.520 | 
This is not done by hand. He actually wrote a piece of computer software to actually do these convolutions 00:49:45.740 | 
This is actually being done dynamically. It's pretty cool 00:49:50.240 | 
So I'm more of a spreadsheet guy personally. I'm a simple person 00:49:55.200 | 
So here is the same thing now in spreadsheet form right and so you'll find this in the github repo, so you can either 00:50:04.360 | 
git clone the repo to your own computer to open up the spreadsheet 00:50:08.320 | 
Or you can just go to github.com/fastai and 00:50:22.320 | 
Just go to courses as usual, go to dl1 as usual, and you'll see there's an excel section there 00:50:28.480 | 
Okay, and so here they all are so you can just download them by clicking them 00:50:31.920 | 
Or you can clone the whole repo, and we're looking at conv-example, convolution example 00:50:41.600 | 
Input, right, so in this case the input is the number seven, so I grabbed this from a dataset called 00:50:49.760 | 
MNIST, which we'll be looking at in a lot of detail 00:50:52.960 | 
and I just took one of those digits at random and I put it into Excel and so you can see every 00:51:00.560 | 
Pixel is actually just a number between naught and one 00:51:11.120 | 
Or sometimes it might be a float between nought and one. It doesn't really matter; by the time it gets to PyTorch 00:51:20.280 | 
One of the steps we often will take will be to convert it to a number between nought and one 00:51:28.320 | 
So you can see I've just used conditional formatting in Excel to kind of make the higher numbers more red 00:51:34.480 | 
So you can clearly see that this is a seven 00:51:38.400 | 
But it's just a bunch of numbers that have been imported into Excel. Okay, so here's our input 00:51:46.040 | 
So remember what Otavio did was he then applied two filters 00:51:54.600 | 
Right with different shapes so here. I've created a filter which is designed to detect top edges 00:52:03.760 | 
Okay, and I've got ones along the top zeros in the middle minus ones at the bottom right so let's take a look at an example 00:52:11.720 | 
That's here right and so if I hit that - you can see here highlighted 00:52:18.060 | 
This is the 3 by 3 part of the input that this particular thing is calculating right 00:52:24.000 | 
so here you can see it's got 1 1 1 are all being multiplied by 1 and 00:52:29.560 | 
0.1 0 0 are all being multiplied by negative 1 00:52:34.840 | 
Okay, so in other words all the positive bits are getting a lot of positive the negative bits are getting nearly nothing at all 00:52:43.720 | 
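The arithmetic for that one highlighted cell can be written out directly. The top and bottom rows below are the values read out in the lecture; the middle row is a made-up placeholder, which does not matter because the filter multiplies it by zero.

```python
# 3x3 patch of the input under the filter.
window = [[1.0, 1.0, 1.0],   # top row, as read out in the lecture
          [0.9, 0.8, 0.7],   # hypothetical middle row (ignored by the zeros in the filter)
          [0.1, 0.0, 0.0]]   # bottom row, as read out in the lecture

# Top-edge filter: ones along the top, zeros in the middle, minus ones at the bottom.
kernel = [[1, 1, 1],
          [0, 0, 0],
          [-1, -1, -1]]

# Multiply matching cells and add everything up (Excel's SUMPRODUCT).
activation = sum(w * k
                 for w_row, k_row in zip(window, kernel)
                 for w, k in zip(w_row, k_row))

print(activation)  # roughly 3 from the top row minus 0.1 from the bottom row
```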
Okay, whereas on the other side of this bit of the seven 00:52:48.880 | 
Right you can see how you know this is basically zeros here or perhaps more interestingly on the top of it 00:53:03.320 | 
High numbers at the top, but we've also got high numbers at the bottom which are negating it 00:53:07.800 | 
Okay, so you can see that the only place that we end up 00:53:23.200 | 
Okay, so when I say an activation I mean a number, a 00:53:30.320 | 
Number that is calculated, and it is calculated by taking 00:53:40.040 | 
applying some kind of linear operation in this case a convolutional kernel to 00:53:52.940 | 
Inputs multiplied by kernel and summing it together 00:53:58.740 | 
Right. So here's my sum and here's my multiply 00:54:03.060 | 
I then take that and I go max of zero comma that and 00:54:07.940 | 
So that's my rectified linear unit. So it sounds very fancy 00:54:13.220 | 
Rectified linear unit, but what they actually mean is open up Excel and type =MAX(0, thing). Okay 00:54:19.540 | 
That's all a ReLU is, and you'll see people in the biz say ReLU, okay 00:54:26.020 | 
So ReLU means rectified linear unit means max(0, thing), and I'm not like simplifying it 00:54:33.700 | 
I really mean it. If I'm simplifying I'll always say I'm simplifying 00:54:38.060 | 
But if I'm not saying I'm simplifying, that's the entirety. Okay, so a rectified linear unit in its entirety is this 00:54:50.060 | 
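The same thing in Python rather than Excel, as a minimal sketch; the sample values are arbitrary.

```python
def relu(x):
    """Rectified linear unit: keep positives, replace negatives with zero."""
    return max(0.0, x)

# Arbitrary example activations, some negative and some positive.
values = [-2.5, -0.1, 0.0, 0.7, 3.0]
activated = [relu(v) for v in values]

print(activated)
```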
Okay, so a single layer of a convolutional neural network is being implemented in its entirety 00:54:58.940 | 
Here in Excel, okay, and so you can see what it's done is it's deleted pretty much the vertical edges 00:55:15.900 | 
That at the end of training it had created a convolutional filter with these specific nine numbers in 00:55:29.860 | 
Now PyTorch doesn't store them as two separate nine-digit arrays 00:55:36.500 | 
It stores it as a tensor. Remember a tensor just means an array with 00:55:42.660 | 
More dimensions. Okay, you can use the word array as well 00:55:48.280 | 
It's the same thing, but in PyTorch they always use the word tensor, so I'm going to say tensor 00:55:54.700 | 
Okay, so it's just a tensor with an additional axis which allows us to stack 00:56:06.260 | 
Pretty much mean the same thing. Yeah, right it refers to one of these three by three 00:56:18.380 | 
So if I take this one and here I've literally just copied the formulas in Excel from above 00:56:23.980 | 
Okay, and so you can see this one is now finding a vertical edge as we would expect. Okay, so 00:56:39.500 | 
Layer right this here is a layer and specifically we'd say it's a hidden layer 00:56:44.500 | 
Which is it's not an input layer and it's not an output layer. So everything else is a hidden layer. Okay, and 00:56:55.180 | 
A size 2 on this dimension, right because it has two 00:57:16.340 | 
Multiply a little bit in complexity right because my next filter is going to have to contain 00:57:24.060 | 
Two of these three by threes, because I'm going to have to say, how do I want to 00:57:30.900 | 
Weight these three things, and at the same time, how do I want to weight the corresponding three things down here? 00:57:39.260 | 
This is going to be this whole thing here is going to be stored as a multi-dimensional tensor, right? 00:57:45.900 | 
So you shouldn't really think of this now as two three by three kernels, but one 00:58:16.620 | 
So the top ones are being multiplied by this part of the kernel and the bottom ones are being multiplied by this part of the 00:58:25.060 | 
You want to start to get very comfortable with the idea of these like higher dimensional? 00:58:33.820 | 
Like it's it's harder to draw it on the screen like I had to put one above the other 00:58:39.340 | 
But conceptually just stack it in your mind like this. That's really how you want to think 00:58:44.660 | 
Right, and actually Geoffrey Hinton in his original 00:58:50.860 | 
Coursera class has a tip which is how all computer scientists deal with like very high dimensional spaces 00:58:57.660 | 
Which is that they basically just visualize the two-dimensional space and then say like 12 dimensions really fast in their head lots of times 00:59:06.080 | 
So that's it right we can see two dimensions on the screen, and then you just got to try to trust 00:59:11.620 | 
That you can have more dimensions like the concepts just you know 00:59:17.220 | 
There's there's nothing different about them, and so you can see in Excel 00:59:20.420 | 
You know, Excel doesn't have the ability to handle three-dimensional tensors, so I had to say, okay, take this two-dimensional 00:59:26.860 | 
Dot product, add on this two-dimensional dot product, right, but if there was some kind of 3D Excel 00:59:34.460 | 
I could have just done that in a single formula 00:59:36.940 | 
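numpy plays the role of that "3D Excel". The sketch below uses random made-up numbers to show that adding two 2-D sum-products gives the same answer as one sum-product over a 2 by 3 by 3 tensor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values: a 3x3 window from each of the 2 channels of the previous layer,
# and one kernel with a 3x3 slice per channel.
window = rng.random((2, 3, 3))
kernel = rng.standard_normal((2, 3, 3))

# The Excel way: one 2-D sum-product per channel, then add them.
two_step = (window[0] * kernel[0]).sum() + (window[1] * kernel[1]).sum()

# The "3D Excel" way: a single sum-product over the whole tensor.
one_step = (window * kernel).sum()

# Then apply the rectified linear unit, exactly as in the first layer.
activation = max(0.0, one_step)
print(two_step, one_step, activation)
```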
And then again apply max(0, thing), otherwise known as rectified linear unit, otherwise known as ReLU 00:59:45.460 | 
Okay, so here is my second layer, and so when people create different 00:59:55.140 | 
Like how big is your kernel at layer one how many filters are in your kernel at layer one so here? 01:00:05.100 | 
There's number one, and a 3 by 3, there's number two, so like this architecture 01:00:11.180 | 
I've created starts off with two three by three convolutional kernels and 01:00:19.940 | 
Second layer has another two kernels of size two by three by three 01:00:25.900 | 
So there's the first one and then down here. Here's the second two by three by three kernel, okay, and so 01:00:32.960 | 
Remember, any one of these numbers is an activation 01:00:39.500 | 
Okay, so this activation is being calculated from these three things here and other three things up there 01:00:46.460 | 
And we're using these this two by three by three 01:00:52.020 | 
And so what tends to happen is people generally give names to their layers, so I say okay 01:00:57.780 | 
Let's call this layer here conv1 and 01:01:06.860 | 
This layer here conv2, right, so that's, you know 01:01:11.680 | 
Generally, you'll just see that like when you print out a summary of a network every layer will have some kind of name 01:01:22.740 | 
Well part of the architecture is like do you have some max pooling? 01:01:27.940 | 
Whereabouts does that max pooling happen? So in this architecture we're inventing, the next step 01:01:33.980 | 
Is to do max pooling. Okay, max pooling is a little hard to 01:01:41.980 | 
So max pooling: if I do a two by two max pooling, it's going to halve the resolution, both height and width 01:02:00.740 | 
Right, and so because I'm halving the resolution it only makes sense to actually have something every two cells 01:02:05.980 | 
Okay, so you can see here the way. I've got kind of the same 01:02:11.500 | 
Looking shape as I had back here, okay, but it's now half the resolution because I've replaced every 01:02:19.860 | 
With its max and you'll notice like it's not every possible two by two I skip over from here 01:02:25.620 | 
So this is like starting at BQ and then the next one starts at 01:02:32.380 | 
Right, so they're like non overlapping. That's why it's decreasing the resolution 01:02:36.540 | 
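A small numpy sketch of non-overlapping two by two max pooling. The 4 by 4 input and the function name `max_pool_2x2` are made up, and the reshape trick assumes the height and width are both even.

```python
import numpy as np

def max_pool_2x2(x):
    """Replace each non-overlapping 2x2 block with its maximum (assumes even height and width)."""
    h, w = x.shape
    # Axes become (block row, row within block, block column, column within block).
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Hypothetical 4x4 grid of activations.
x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 1],
              [5, 0, 2, 2],
              [0, 0, 2, 9]], dtype=float)

pooled = max_pool_2x2(x)
print(pooled)  # half the height and half the width of x
```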
Okay, so anybody who's comfortable with spreadsheets 01:02:40.800 | 
You know you can open this and have a look and so after our max pooling 01:02:45.860 | 
There's a number of different things we could do next and I'm going to show you a kind of 01:02:56.620 | 
Classic old style approach nowadays in fact what generally happens nowadays is we do a max pool where we kind of like max across the 01:03:06.860 | 
But on older architectures and also on all the structured data stuff we do 01:03:11.400 | 
We actually do something called a fully connected layer, and so here's a fully connected layer 01:03:17.100 | 
I'm going to take every single one of these activations, and I'm going to give every single one of them a weight 01:03:24.980 | 
Right and so then I'm going to take over here 01:03:28.900 | 
here is the sum product of every one of the activations by every one of the weights for both of the 01:03:41.580 | 
Levels of my three-dimensional tensor right and so this is called a fully connected layer notice. It's different to a convolution 01:03:50.860 | 
Right, but I'm creating a really big weight matrix right so rather than having a couple of little three by three kernels 01:03:58.380 | 
My weight matrix is now as big as the entire input 01:04:04.060 | 
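Here is a rough numpy sketch of that fully connected step for a single output. The 2 by 4 by 4 shape and the random values are assumptions for illustration only; the point is that there is one weight per incoming activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical max-pooled output: 2 channels, each 4x4.
activations = rng.random((2, 4, 4))

# Fully connected: one weight for every single activation, so the same shape as the input.
weights = rng.standard_normal((2, 4, 4))

# Sum-product of every activation with its own weight gives one output number.
output = (activations * weights).sum()

# The same thing seen as a dot product of two flattened vectors.
as_dot = activations.ravel() @ weights.ravel()
print(output, as_dot)
```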
Architectures that make heavy use of fully connected layers can have a lot of weights 01:04:11.860 | 
Which means they can have trouble with overfitting and they can also be slow and so you're going to see a lot 01:04:19.420 | 
An architecture called VGG because it was the first kind of successful deeper architecture 01:04:27.660 | 
Actually contains a fully connected layer with 4,096 weights 01:04:33.020 | 
Sorry, 4,096 01:04:38.060 | 
Activations, connected to a hidden layer with 4,096 activations, so you've got like 4,096 by 01:04:46.900 | 
4,096, multiplied by, remember, the number of kernels that we've calculated 01:05:01.260 | 
Weights of which something like 250 million of them are in these fully connected layers 01:05:07.740 | 
So we'll learn later on in the course about how we can kind of avoid using these big fully connected layers and behind the scenes 01:05:15.860 | 
All the stuff that you've seen us using like res net and res next none of them use very large 01:05:21.620 | 
Fully connected layers you know you had a question 01:05:24.580 | 
Could you tell us more about, for example, if we had like three channels of the input, what would be the 01:05:35.740 | 
Shape of these filters? Right, so that's a great question 01:05:41.500 | 
So if we had three channels of input it would look exactly like conv1 right conv1 kind of has two channels 01:05:49.740 | 
Right and so you can see with conv1. We had two channels so therefore our filters 01:05:55.820 | 
had to have like two channels per filter and so you could like 01:06:00.460 | 
Imagine that this input didn't exist you know and actually this was the input right so when you have a multi-channel input 01:06:08.140 | 
It just means that your filters look like this, and so images are often full color 01:06:14.480 | 
They have three, red, green, and blue. Sometimes they also have an alpha channel 01:06:19.020 | 
So however many you have that's how many inputs you need and so something which I know 01:06:24.660 | 
Yannet was playing with recently was using a full color ImageNet model 01:06:30.540 | 
In medical imaging for something called bone age calculations 01:06:34.860 | 
Which has a single channel, and so what she did was basically take the input 01:06:40.940 | 
The single channel input, and make three copies of it 01:06:44.820 | 
So you end up with basically like one two three versions of the same thing which is like 01:06:51.460 | 
It's kind of it's not ideal like it's kind of redundant information that we don't quite want 01:06:58.260 | 
But it does mean that then if you had a something that expected a three channel 01:07:05.460 | 
You can use it right and so at the moment. There's a Kaggle competition for iceberg detection 01:07:13.820 | 
Some funky satellite specific data format that has two channels 01:07:21.220 | 
Either copy one of those two channels into the third channel 01:07:25.100 | 
Or I think what people on Kaggle are doing is to take the average of the two 01:07:30.420 | 
Again, it's not ideal, but it's a way that you can use pre-trained networks 01:07:38.700 | 
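Both workarounds can be sketched in a few lines of numpy. The tiny arrays are invented, and the channels-last layout (rows, columns, channels) is just an assumption for the example.

```python
import numpy as np

# Hypothetical single-channel image (rows x columns).
gray = np.arange(12, dtype=float).reshape(3, 4)

# Single channel -> three channels by making three copies.
three = np.stack([gray, gray, gray], axis=-1)

# Hypothetical two-channel image (e.g. two satellite bands).
two = np.stack([gray, gray * 2], axis=-1)

# Two channels -> three channels by appending the average of the two.
third = two.mean(axis=-1, keepdims=True)
three_from_two = np.concatenate([two, third], axis=-1)

print(three.shape, three_from_two.shape)
```

Either way the extra channel carries no new information; it only makes the input match the shape a pre-trained three-channel network expects.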
Fiddling around like that. I've actually done things where I wanted to use a 01:07:44.340 | 
Three-channel ImageNet network on four-channel data. I had satellite data where the fourth channel was near infrared 01:07:58.780 | 
Level to my convolutional kernels that were all zeros, and so basically it started off by ignoring the near-infrared band 01:08:06.860 | 
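A sketch of that zero-padding idea with made-up numbers: the "pre-trained" weights here are random, and the (filters, channels, height, width) layout is an assumption. With an all-zero slice for the fourth channel, the first-layer outputs start out identical to the three-channel ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-trained first-layer kernels: 8 filters, 3 input channels, 3x3 each.
pretrained = rng.standard_normal((8, 3, 3, 3))

# Add a fourth input-channel slice of zeros to every filter.
extended = np.concatenate([pretrained, np.zeros((8, 1, 3, 3))], axis=1)

# One 3x3 window of a four-channel image: RGB plus a near-infrared band.
rgb_window = rng.random((3, 3, 3))
nir_window = rng.random((1, 3, 3))
four_window = np.concatenate([rgb_window, nir_window], axis=0)

# Because the new weights are zero, the extra band is ignored at first.
out_rgb = (pretrained * rgb_window).sum(axis=(1, 2, 3))
out_four = (extended * four_window).sum(axis=(1, 2, 3))
print(np.allclose(out_rgb, out_four))
```

Training can then move those zeros away from zero if the extra band turns out to be useful.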
And so what happens it basically and you'll see this next week is 01:08:11.380 | 
That rather than having these like carefully trained filters when you're actually training something from scratch 01:08:18.820 | 
We're actually going to start with random numbers 01:08:21.420 | 
That's actually what we do we actually start with random numbers 01:08:24.300 | 
And then we use this thing called stochastic gradient descent which we've kind of seen 01:08:28.140 | 
Conceptually to slightly improve those random numbers to make them less random and we basically do that again and again and again 01:08:35.460 | 
Okay, great. Let's take a seven-minute break, and we'll come back at 7:50 01:08:41.820 | 
All right, so what happens next so we've got as far as 01:08:57.100 | 
Fully connected layer, right, so the results of our max pooling layer got fed to a fully connected layer 01:09:03.420 | 
And you might notice those of you that remember your linear algebra the fully connected layer is actually doing a classic 01:09:13.260 | 
Okay, so it's basically just going through each pair in turn multiplying them together and then adding them up to do a matrix product 01:09:25.100 | 
In practice if we want to calculate which one of the 10 digits we're looking at 01:09:36.900 | 
This single number we've calculated isn't enough 01:09:46.100 | 
10 numbers so what we would have is rather than just having 01:09:52.860 | 
Fully connected weights like this and I say set because remember. There's like a whole 01:09:58.340 | 
3d kind of tensor of them we would actually need 01:10:05.300 | 
Right so you can see that these tensors start to get a little bit 01:10:08.520 | 
High dimensional, right, and so this is where my patience with doing it in Excel ran out 01:10:15.220 | 
But imagine that I had done this 10 times I could now have 10 different numbers all being calculated here 01:10:21.660 | 
Using exactly the same process, right, it'd just be 10 of these 01:10:37.620 | 
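Doing it ten times is where the matrix product shows up. In this sketch (random made-up values, assumed 2 by 4 by 4 input) ten sets of weights produce ten numbers, and flattening shows it is an ordinary matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical max-pooled activations: 2 channels of 4x4.
activations = rng.random((2, 4, 4))

# Ten sets of fully connected weights, one set per digit.
weights = rng.standard_normal((10, 2, 4, 4))

# Ten sum-products, one per set of weights.
outputs = (weights * activations).sum(axis=(1, 2, 3))

# The same calculation as a (10 x 32) matrix times a length-32 vector.
as_matrix = weights.reshape(10, -1) @ activations.reshape(-1)
print(outputs.shape)
```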
So then we would have 10 numbers being spat out, so what happens next? 01:10:57.340 | 
And what happens here? I'm sorry, I've changed domains: rather than predicting whether it's a number from nought to nine 01:11:05.620 | 
I'm going to predict whether something is a cat, a dog, a plane, a fish, or a building. Okay, so out of that fully connected layer 01:11:13.660 | 
We've got, in this case, five numbers, and notice at this point 01:11:18.340 | 
There's no ReLU, okay, in the last layer there's no ReLU, okay, so I can have negatives 01:11:30.140 | 
Each into a probability. I want to turn it into a probability from nought to one that it's a cat 01:11:37.380 | 
That it's a dog, that it's a plane, that it's a fish, that it's a building, and 01:11:42.220 | 
I want those probabilities to have a couple of characteristics first is that each of them should be between zero and one and 01:11:47.860 | 
The second is that they together should add up to one right? It's definitely one of these five things 01:11:54.380 | 
Okay, so to do that we use a different kind of activation function 01:11:59.420 | 
What's an activation function an activation function is a function that is applied to activations? 01:12:20.420 | 
One number and spits out one number, so max(0, x) 01:12:26.540 | 
Takes in a number x and spits out some different number, ReLU of x 01:12:30.900 | 
That's all an activation function is and if you remember back to that PowerPoint we saw and 01:13:02.780 | 
Right then all you end up with is a linear layer 01:13:07.260 | 
If somebody's talking, can you not? It's slightly distracting. Thank you 01:13:15.460 | 
Functions together you just end up with a linear function and nobody does any cool deep learning with just linear functions 01:13:28.300 | 
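That claim is easy to check numerically. In this sketch the small weight matrices are made up: two stacked linear layers collapse into one, while putting a ReLU between them gives a different answer.

```python
import numpy as np

# Hypothetical input and two linear layers (no biases, to keep it short).
x = np.array([1.0, -2.0])
w1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])      # 2 inputs -> 3 hidden activations
w2 = np.array([[1.0, 1.0, 1.0]])  # 3 hidden activations -> 1 output

# Two linear layers applied one after the other...
stacked = w2 @ (w1 @ x)

# ...give exactly what a single linear layer with weights w2 @ w1 gives.
collapsed = (w2 @ w1) @ x

# With a ReLU in between, the result is no longer that single linear function.
with_relu = w2 @ np.maximum(0, w1 @ x)

print(stacked, collapsed, with_relu)
```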
With in between each one a non-linearity we could create like arbitrarily complex shapes 01:13:35.060 | 
and so the non-linearity that we're using after every hidden layer is a value rectified linear unit a 01:13:46.180 | 
Activation function is a non-linearity in in it within deep learning. Obviously, there's lots of other non-linearities in the world, but in deep learning 01:13:57.460 | 
So an activation function is any function that takes some activation in that's a single number and spits out some new activation 01:14:07.220 | 
So I'm now going to tell you about a different activation function. It's slightly more complicated than 01:14:12.940 | 
ReLU, but not too much. It's called softmax 01:14:16.660 | 
softmax only ever occurs in the final layer at the very end and the reason why is that softmax always spits out 01:14:25.140 | 
Numbers as an activation function that always spits out a number between 0 and 1 and it always spits out a bunch of numbers 01:14:40.140 | 
This isn't strictly necessary right like we could ask our neural net to learn a set of 01:14:49.220 | 
Which, you know, give probabilities that line up as closely as possible with what we want 01:14:54.980 | 
But in general with deep learning if you can construct your architecture so that the desired 01:15:01.300 | 
characteristics are as easy to express as possible 01:15:04.420 | 
You'll end up with better models, like they'll learn more quickly with fewer parameters 01:15:09.460 | 
So in this case, we know that our probabilities should end up being between 0 and 1 01:15:14.940 | 
We know that they should end up adding to one 01:15:17.780 | 
So if we construct an activation function, which always has those features 01:15:22.140 | 
Then we're going to make our neural network do a better job. It's going to make it easier for it 01:15:27.820 | 
It doesn't have to learn to do those things because it all happened automatically 01:15:35.580 | 
We first of all have to get rid of all of the negatives 01:15:39.340 | 
Right, like we can't have negative probabilities 01:15:42.700 | 
So to make things not be negative, one way we could do it is just go e to the power of 01:15:47.940 | 
Right. So here you can see my first step is to go exp of 01:15:52.300 | 
the previous one, right, and I think I've mentioned this before but 01:15:57.740 | 
Of all the math that you just need to be super familiar with to do deep learning 01:16:02.500 | 
The one you really need is logarithms and exps, right, all of deep learning and all of machine learning 01:16:31.980 | 
Right and like not just know that that's a formula that exists but have a sense of like what does that mean? 01:16:38.180 | 
Why is that interesting? Oh, I can turn multiplications into additions. That could be really handy, right and therefore 01:16:55.260 | 
Again, that's going to come in pretty handy, you know rather than dividing I can just subtract things, right? 01:17:18.980 | 
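The two properties mentioned here (logs turn multiplication into addition and division into subtraction) can be checked numerically. A minimal sketch in plain Python; the numbers 4.0 and 2.8 are arbitrary positive values chosen for illustration:

```python
import math

x, y = 4.0, 2.8  # arbitrary positive numbers

# The log of a product is the sum of the logs.
product_ok = math.isclose(math.log(x * y), math.log(x) + math.log(y))

# The log of a quotient is the difference of the logs.
quotient_ok = math.isclose(math.log(x / y), math.log(x) - math.log(y))

# exp is the inverse of log, so it takes us back to where we started.
inverse_ok = math.isclose(math.exp(math.log(x)), x)

print(product_ok, quotient_ok, inverse_ok)
```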
Okay again, you just you need to really really understand these things and like so if you if you haven't spent much time with logs 01:17:28.020 | 
You try plotting them in Excel or a little notebook have a sense of what shape they are how they combine together 01:17:34.420 | 
Just make sure you're really comfortable with them. So 01:17:40.620 | 
We're using it here. So one of the things that we know is e to the power of something is positive 01:17:47.580 | 
Okay, so that's great. The other thing you'll notice about e to the power of something is, because it's a power 01:17:52.860 | 
Numbers that are slightly bigger than other numbers, like 4 is a little bit bigger than 2.8 01:17:59.260 | 
When you go e to the power of it, it really accentuates that difference 01:18:03.080 | 
Okay, so we're going to take advantage of both of these features for the purpose of deep learning. Okay, so we take our 01:18:09.180 | 
The results of this fully connected layer, we go e to the power of for each of them 01:18:25.260 | 
Okay, so here is the sum of e to the power of 01:18:34.460 | 
e to the power of divided by the sum of e to the power of, so if you take 01:18:40.140 | 
All of these things divided by their sum then by definition all of those things must add up to 1 and 01:18:47.420 | 
Furthermore since we're dividing by their sum 01:18:52.060 | 
They must always vary between 0 and 1 because they're always positive 01:18:57.100 | 
Alright, and that's it. So that's what softmax is 01:19:06.020 | 
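The two steps just described (exponentiate each output, then divide by the sum of the exponentials) can be sketched in plain Python. This is an illustration only, not the spreadsheet from the lesson, and the five input values are made up:

```python
import math

def softmax(activations):
    """Exponentiate each activation, then divide by the sum of the exponentials."""
    exps = [math.exp(a) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of the last fully connected layer (negatives are allowed).
labels = ["cat", "dog", "plane", "fish", "building"]
outputs = [-1.2, 4.0, 2.8, 0.3, -3.5]
probs = softmax(outputs)

for label, p in zip(labels, probs):
    print(f"{label:>8}: {p:.3f}")
print("sum:", round(sum(probs), 6))
```

Every result is positive and they sum to one, and the gap between 4.0 and 2.8 is stretched so that the largest input takes most of the probability. Library implementations usually subtract the maximum input before exponentiating to avoid overflow; that detail is left out here.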
Doing random numbers each time right and so you can see like as I look through 01:19:11.460 | 
My softmax generally has quite a few things that are so close to 0 that they round down to 0 and you know 01:19:19.140 | 
Maybe one thing that's nearly 1, right, and the reason for that is what we just talked about, that is with the exp 01:19:25.300 | 
Just having one number a bit bigger than the others tends to like push it out further, right? 01:19:31.780 | 
So even though my inputs here are random numbers between negative 5 and 5 01:19:36.420 | 
Right my outputs from the softmax don't really look that random at all in the sense that 01:19:42.460 | 
They tend to have one big number and a bunch of small numbers 01:19:51.460 | 
Right. We want to say like in terms of like is this a cat a dog a plane a fish or a building 01:19:55.860 | 
We really want it to say like it's it's that you know 01:19:59.260 | 
It's it's a dog or it's a plane not like I don't know 01:20:07.900 | 
Properties right it's going to return a probability that adds up to one and it's going to tend to want to pick one thing 01:20:18.660 | 
Okay, so that's softmax your net. Could you pass actually bust me up? 01:20:26.420 | 
How would we do something where, let's say, you have an image and you want to kind of categorize it as like cat and 01:20:35.460 | 
What what kind of function would we try to use? 01:20:43.740 | 
So have to think about why we might want to do that and so one reason we might want to do that is to do 01:20:51.460 | 
classification, so we're looking now at lesson 2 image models, and specifically we're going to take a look at the 01:20:57.780 | 
planet competition satellite imaging competition 01:21:05.620 | 
Some similarities to stuff we've seen before right so before we've seen a cat versus dog and these images are a cat or a dog 01:21:16.340 | 
They're not neither. They're not both right, but the satellite imaging competition 01:21:21.860 | 
Has data as images that look like this and in fact every single one of the images is classified by weather 01:21:29.600 | 
There's four kinds of weather one of which is haze and another of which is clear 01:21:34.940 | 
In addition to which there is a list of features that may be present including agriculture 01:21:41.860 | 
Which is like some some cleared area used for agriculture 01:21:45.980 | 
Primary which means primary rainforest and water which means a river or a creek so here is a clear day 01:21:53.700 | 
Satellite image showing some agriculture some primary rainforest and some water features 01:22:00.020 | 
And here's one which is in haze and is entirely primary rainforest 01:22:05.300 | 
So in this case we're going to want to be able to show 01:22:11.380 | 
We're going to be able to predict multiple things and so softmax wouldn't be good because softmax doesn't like 01:22:17.640 | 
Predicting multiple things and like I would definitely recommend 01:22:22.340 | 
Anthropomorphizing your activation functions right they have personalities 01:22:26.860 | 
Okay, and the personality of the softmax is it wants to pick a thing 01:22:31.780 | 
Okay, and people forget this all the time. I've seen many people even well regarded researchers in famous academic papers 01:22:41.480 | 
Using like softmax for multi-label classification it happens all the time, right? 01:22:47.480 | 
And it's kind of ridiculous because they're not 01:22:50.840 | 
understanding the personality of their activation function, so 01:22:56.200 | 
For multi-label classification where each sample can belong to one or more classes. We have to change a few things 01:23:03.980 | 
But here's the good news in fastai. We don't have to change anything 01:23:09.840 | 
Right so fastai will look at the labels in the CSV and if there is more than one label ever 01:23:20.720 | 
Item it will automatically switch into like multi-label mode 01:23:24.680 | 
So I'm going to show you how it works behind the scenes, but the good news is you don't actually have to care 01:23:39.280 | 
Images multi-label objects you obviously can't use the classic Keras style approach where things are in folders 01:23:47.120 | 
Because something can't conveniently be in multiple folders at the same time 01:23:52.380 | 
Right, so that's why you basically have to use the from_csv 01:24:08.720 | 
Actually, I'll show you I tend to take you through it right so we can say okay 01:24:16.980 | 
This looks exactly the same as it did before but rather than side on it's top down 01:24:22.400 | 
And top down I've mentioned before that it can do 01:24:25.820 | 
Vertical flips it actually does more than that there's actually eight possible symmetries for a square 01:24:31.520 | 
Which is, it can be rotated through 90, 180, 270 or 0 degrees 01:24:36.280 | 
And for each of those it can be flipped, and if you think about it for a while you'll realize that that's a complete 01:24:45.560 | 
In terms of symmetries of a square, so it's called the dihedral group of eight 01:24:52.360 | 
So if you see in the code, there's actually a transform called dihedral. That's why it's called that 01:24:57.960 | 
So these transforms will basically do the full set of eight symmetric 01:25:08.160 | 
Plus everything which we can do to dogs and cats you know small 10-degree rotations little bit of zooming 01:25:14.920 | 
a little bit of contrast and brightness adjustment 01:25:21.880 | 
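The eight symmetries (four rotations, each with or without a flip) can be enumerated on a tiny grid. This is a sketch in plain Python to show the idea, not the library's dihedral transform; the function names and the 2x2 tile are made up for illustration:

```python
def rot90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]

def dihedral_symmetries(grid):
    """Return the 8 symmetries: 4 rotations, each with and without a flip."""
    results = []
    current = [list(row) for row in grid]
    for _ in range(4):
        results.append(current)
        results.append(flip(current))
        current = rot90(current)
    return results

tile = [[1, 2],
        [3, 4]]
symmetries = dihedral_symmetries(tile)

# Count how many of the arrangements are actually different from each other.
distinct = {tuple(map(tuple, s)) for s in symmetries}
print(len(symmetries), len(distinct))
```

For a top-down image such as a satellite tile, any of these eight orientations is an equally plausible view, which is why they are useful as augmentations there and not for side-on photos.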
So I just created a little function here to let me quickly grab you know a 01:25:26.760 | 
Data loader of any size so here's a 256 by 256 01:25:36.000 | 
We've already seen that there's things called val_ds, test_ds, trn_ds 01:25:41.000 | 
They're things that you can just index into and grab a particular image, so you just use square brackets zero 01:25:46.560 | 
You'll also see that all of those things have a dl. That's a data loader 01:25:50.920 | 
So ds is dataset, dl is data loader. These are concepts from PyTorch 01:25:55.680 | 
So if you google PyTorch Dataset or PyTorch DataLoader 01:25:59.600 | 
You can basically see what it means, but the basic idea is a data set gives you a single image or a single 01:26:06.880 | 
object back a data loader gives you back a mini-batch and 01:26:10.720 | 
Specifically it gives you back a transformed mini-batch, so that's why when we create our 01:26:21.560 | 
Transforms like how many processes do you want to use what transforms? 01:26:26.080 | 
Do you want, and so with a data loader you can't ask for an individual image 01:26:31.320 | 
You can only get back a mini-batch, and you can't get back a particular mini-batch 01:26:36.160 | 
You can only get back the next mini-batch, so what we do is loop through 01:26:41.560 | 
Grabbing a mini-batch at a time, and so in Python 01:26:45.420 | 
The thing that does that is called a generator, right, or an iterator. They're slightly different versions 01:26:51.600 | 
Of the same thing, so to turn a data loader into an iterator you use the standard Python function called iter 01:26:57.360 | 
That's a Python function, just a regular part of the Python 01:27:00.860 | 
Basic language, that returns an iterator, and an iterator is something that you can pass to the standard 01:27:11.080 | 
Function or statement next, and that just says give me another batch from this iterator 01:27:19.280 | 
So this is one of the things I really like about PyTorch: it really leverages 01:27:26.760 | 
Kind of stuff. You know, in TensorFlow they invent their whole new world of ways of doing things 01:27:36.680 | 
In a sense it's more like cross-platform, but in another sense like it's not a good fit to any platform 01:27:47.880 | 
PyTorch comes very naturally. If you don't know Python well, PyTorch is a good reason to learn Python well. A 01:27:54.800 | 
PyTorch nn.Module, neural network module, is a standard Python class, for example 01:28:02.240 | 
So any work you put into learning Python better will pay off with Pytorch so here. I am using standard 01:28:10.480 | 
Iterators and next to grab my next mini-batch 01:28:15.040 | 
From the validation sets data loader, and that's going to return two things 01:28:18.720 | 
It's going to return the images in the mini-batch and the labels in the mini-batch so standard Python approach 01:28:31.520 | 
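The iter and next pattern can be tried without any deep learning library. In the sketch below a plain generator stands in for the data loader; the `batches` function, the fake image names and the batch size are all made up for illustration:

```python
def batches(images, labels, batch_size):
    """Stand-in for a data loader: yields (images, labels) mini-batches in order."""
    for start in range(0, len(images), batch_size):
        end = start + batch_size
        yield images[start:end], labels[start:end]

images = [f"img_{i}" for i in range(10)]   # fake image names
labels = [i % 2 for i in range(10)]        # fake 0/1 labels

# iter() turns the loader into an iterator; next() asks it for one more batch.
batch_iter = iter(batches(images, labels, batch_size=4))
x, y = next(batch_iter)      # first mini-batch, unpacked into images and labels
x2, y2 = next(batch_iter)    # the only way forward is the next mini-batch

print(x, y)
print(x2, y2)
```

There is no way to ask the iterator for batch number 7 directly; you can only keep calling next, which matches the behaviour described for the data loader.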
And so not surprisingly since I said that my batch size 01:28:42.480 | 
Actually, it's the batch size by default is 64 so I didn't pass in a batch size 01:28:48.240 | 
So just remember shift tab to see like what are the things you can pass and what are the defaults so by default? 01:28:54.920 | 
My batch size is 64, so I've got back something of size 64 by 01:29:14.480 | 
So I can zip, again standard Python thing, it takes two lists and combines them so you get the zeroth thing from the first 01:29:22.840 | 
List, the zeroth thing from the second list, and the first thing from the first list, the first thing from the second list, and so 01:29:29.200 | 
Forth so I can zip them together and that way I can find out 01:29:32.640 | 
For the zeroth image in the validation set it's agriculture 01:29:40.040 | 
It's primary rainforest. It's slash and burn. It's water 01:29:51.320 | 
You see here's a way to do multi label classification 01:29:54.120 | 
So by the same token right if we go back to our single label classification 01:30:03.960 | 
Behind the scenes we haven't actually looked at it, but behind the scenes 01:30:09.080 | 
Fastai and Pytorch are turning our labels into something called one hot encoded 01:30:16.800 | 
Labels and so if it was actually a dog then the actual values 01:30:21.400 | 
Would be like that right so these are like the actuals 01:30:26.760 | 
Okay, so do you remember at the very end of Otavio's video 01:30:31.800 | 
He showed how like the template had to match to one of the like five a b c d or e templates 01:30:37.640 | 
And so what it's actually doing is it's comparing 01:30:41.440 | 
When I said it's basically doing a dot product. It's actually a fully connected layer at the end right that calculates an 01:30:48.520 | 
output activation that goes through a softmax and 01:30:53.360 | 
Then the softmax is compared to the one hot encoded label right so if it was a dog there would be a one here 01:31:02.800 | 
And then we take the difference between the actuals and the softmax 01:31:07.520 | 
Activations, and add up those differences to say how much error is there, essentially 01:31:13.280 | 
We're skipping over something called a loss function that we'll learn about next week, but essentially we're basically doing that 01:31:19.260 | 
Now if it's one hot encoded, like if there's only one thing which has a one in it 01:31:34.720 | 
Right, like we can basically say what is the index of each of these things 01:31:38.860 | 
Right so we can say it's like 0 1 2 3 4 like so right and so rather than storing it as 0 1 01:31:52.160 | 
Right so if you look at the the y values for the cats and dogs competition or the dog breeds competition 01:32:00.400 | 
You won't actually see a big lists of ones and zeros like this. You'll see a single integer 01:32:05.340 | 
Right, which is like, what class index is it, right, and 01:32:12.160 | 
Inside PyTorch it will actually turn that into a one hot encoded vector, but like you will literally never see it 01:32:19.320 | 
Okay, and PyTorch has different loss functions where you basically say 01:32:26.600 | 
This thing is one hot encoded or this thing is not, and it uses different loss functions 01:32:31.400 | 
That's all hidden by the fastai library, right, so like you don't have to worry about it 01:32:37.260 | 
But the cool thing to realize is that this approach for multi-label encoding with these ones and zeros 01:32:45.920 | 
Behind the scenes the exact same thing happens for single-label classification 01:32:54.760 | 
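The relationship between a class index, a one-hot vector and a multi-label vector of ones and zeros can be written out directly. A minimal sketch in plain Python; the helper names and the shortened list of planet classes are made up for illustration:

```python
classes = ["cat", "dog", "plane", "fish", "building"]

def one_hot(index, n_classes):
    """Single-label: a vector with exactly one 1, at the class index."""
    return [1 if i == index else 0 for i in range(n_classes)]

def multi_hot(present, class_names):
    """Multi-label: a 1 for every class that is present, 0 elsewhere."""
    return [1 if name in present else 0 for name in class_names]

# Single-label: what gets stored is just the integer index of the class.
dog_index = classes.index("dog")
dog_vector = one_hot(dog_index, len(classes))
print(dog_index, dog_vector)

# Multi-label: several entries can be 1 at once (hypothetical subset of classes).
planet_classes = ["agriculture", "clear", "haze", "primary", "slash_burn", "water"]
target = multi_hot({"agriculture", "primary", "water"}, planet_classes)
print(target)

# zip pairs each class name with its flag, so the labels can be read back.
print([name for name, flag in zip(planet_classes, target) if flag])
```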
Does it make sense to change the pickiness of the sigmoid, of the softmax function, by changing the base? 01:33:22.080 | 
So changing the base is just a linear scaling and 01:33:25.200 | 
Linear scaling is something which the neural net can learn very easily 01:33:37.960 | 
Okay, so here is that image right here is the image with slash and burn water etc etc 01:33:46.380 | 
One of the things to notice here is like when I first displayed this image it was 01:33:51.560 | 
So washed out I really couldn't see it right but remember images 01:34:01.480 | 
Matrices of numbers and so you can see here. I just said times 1.4 01:34:06.280 | 
Just to make it more visible, right, so it's the kind of thing 01:34:12.480 | 
I want you to get familiar with is the idea that this stuff you're dealing with they're just matrices of numbers 01:34:17.400 | 
Then you can fiddle around with them, so if you're looking at something like oh, it's a bit washed out 01:34:23.480 | 
Brighten it up a bit okay, so here. We can see I guess this is the slash and burn 01:34:28.760 | 
Here's the river. That's the water. Here's the primary rainforest. Maybe that's the agriculture and so forth okay, so 01:34:36.640 | 
So you know with all that background how do we actually use this? 01:34:44.840 | 
Exactly the same way as everything we've done before right so you know size you know and and 01:34:49.760 | 
The interesting thing about playing around with this planet competition is that these images are not at all like ImageNet, and I 01:34:58.600 | 
Would guess that the vast majority of the stuff that the vast majority of you do 01:35:06.520 | 
Won't actually be anything like ImageNet. You know, it'll be medical imaging 01:35:13.400 | 
Or it'll be like classifying different kinds of steel tube, or figuring out whether a weld 01:35:19.520 | 
You know, is going to break or not, or looking at satellite images, or, you know, whatever, right, so 01:35:27.080 | 
It's good to experiment with stuff like this planet 01:35:32.640 | 
Competition to get a sense of kind of what you want to do and so you'll see here 01:35:46.320 | 
I wouldn't want to do this for the cats and dogs competition, because in the cats and dogs competition 01:35:51.120 | 
We start with a pre-trained ImageNet network. It starts off nearly perfect 01:35:57.440 | 
Right so if we resized everything to 64 by 64 and then retrained the whole set 01:36:03.840 | 
We'd basically destroy the weights that are already pre-trained to be very good 01:36:09.360 | 
Remember, most ImageNet models are trained at either 224 by 224 or 01:36:14.400 | 
299 by 299, right, so if we like retrain them at 64 by 64 we're going to kill it. On the other hand 01:36:22.840 | 
There's nothing in ImageNet that looks anything like this 01:36:29.200 | 
So the only useful bits of the ImageNet network for us 01:36:38.800 | 
You know finding edges and gradients and this one you know finding kind of textures and repeating patterns 01:36:45.160 | 
And maybe these ones of kind of finding more complex textures, but that's probably about it right so 01:36:56.680 | 
You know starting out by training very small images 01:37:00.560 | 
Works pretty well when you're using stuff like satellites 01:37:04.160 | 
So in this case I started right back at 64 by 64 01:37:09.960 | 
Built my model found out what learning rate to use interestingly it turned out to be quite high 01:37:17.520 | 
It seems that because like it's so unlike ImageNet I 01:37:23.960 | 
Needed to do quite a bit more fitting with just that last layer before it started to flatten out 01:37:30.840 | 
Then I unfroze it, and again, this is the difference to 01:37:37.760 | 
Data sets is my learning rate in the initial layer 01:37:41.760 | 
I set to divided by 9, the middle layers I set to divided by 3 01:37:45.640 | 
Whereas for stuff that's like ImageNet I had a multiple of 10 for each of those 01:37:51.160 | 
You know again the idea being that the earlier layers 01:37:55.000 | 
Probably are not as close to what they need to be compared to the ImageNet 01:38:06.160 | 
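The arithmetic behind those per-layer-group learning rates is small enough to write down. This sketch only builds the list of three rates; the helper name and the base rate of 0.2 are assumptions made for illustration, not values taken from the lesson:

```python
def layer_group_lrs(lr, factor):
    """Learning rates for (early, middle, last) layer groups, each `factor` apart."""
    return [lr / factor**2, lr / factor, lr]

lr = 0.2  # hypothetical base learning rate for the last layers

# Data unlike ImageNet: early layers still need to move, so the gap is small.
satellite_lrs = layer_group_lrs(lr, 3)       # lr/9, lr/3, lr

# Data like ImageNet: early layers are already good, so they get much smaller rates.
imagenet_like_lrs = layer_group_lrs(lr, 10)  # lr/100, lr/10, lr

print(satellite_lrs)
print(imagenet_like_lrs)
```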
And you can kind of see here. You know there's cycle one. There's cycle two. There's cycle three 01:38:13.060 | 
And then I kind of doubled the size of my images 01:38:20.720 | 
Unfreeze fit for a while double the size of the images again fit for a while unfreeze fit for a while 01:38:26.640 | 
And then add TTA and so as I mentioned last time we looked at this this process ends up 01:38:31.920 | 
You know getting us about 30th place in this competition 01:38:35.180 | 
Which is really cool because people you know a lot of very very smart people 01:38:39.520 | 
Just a few months ago worked very very hard on this competition 01:38:43.000 | 
Couple of things people have asked about one is 01:38:57.120 | 
Couple of different pieces here the first is that when we say 01:39:04.960 | 
What transforms do we apply and here's our transforms we actually pass in a size right? 01:39:10.840 | 
So one of the things that the data loader does is to resize the images like on demand every time 01:39:19.720 | 
This has got nothing to do with that .resize method, right, so 01:39:24.900 | 
This is the thing that happens at the end, like whatever's passed in, before our data 01:39:30.580 | 
Loader spits it out, it's going to resize it to this size 01:39:33.380 | 
If the initial input is like a thousand by a thousand 01:39:39.100 | 
Reading that JPEG and resizing it to 64 by 64 01:39:44.560 | 
Turns out to actually take more time than training the convnet does for each batch 01:39:50.940 | 
Right so basically all resize does is it says hey 01:39:55.820 | 
I'm not going to be using any images bigger than size times 1.3 01:40:00.260 | 
So just go through once and create new JPEGs of this size 01:40:05.900 | 
Right, and they're rectangular, right, so new JPEGs where the smallest 01:40:11.100 | 
Edge is of this size, and again, it's like you never have to do this 01:40:16.180 | 
There's no reason to ever use it if you don't want to it's just a speed up 01:40:20.860 | 
okay, but if you've got really big images coming in it saves you a lot of time and you'll often see on like Kaggle kernels or 01:40:27.580 | 
forum posts or whatever people will have like 01:40:30.900 | 
Bash scripts stuff like that to like loop through and resize images to save time you never have to do that right just you can 01:40:41.980 | 
Create you know once off it'll go through and create that if it's already there 01:40:47.180 | 
It'll use the resized ones for you. Okay, so it's just a it's just a 01:40:55.180 | 
Okay, so for those of you that are kind of past dog breeds 01:41:13.460 | 
With trying to get a sense of like how can you get this as an accurate model? 01:41:17.380 | 
One thing to mention, and I'm not really going to go into it in detail 01:41:21.580 | 
It's nothing to do with deep learning particularly is that I'm using a different metric. I didn't use metrics equals accuracy 01:41:30.820 | 
Just remember from last week that confusion matrix that like two by two you know correct incorrect for each of dogs and cats 01:41:43.180 | 
There's a lot of different ways you could turn that confusion matrix into a score 01:41:49.100 | 
You know, do you care more about false negatives, or do you care more about false positives, and how do you weight them? 01:41:56.780 | 
There's basically a function called F-beta 01:42:01.300 | 
Where the beta says how much do you weight false negatives versus false positives, and so F2 01:42:08.540 | 
Is F-beta with beta equals 2, and it's basically a particular way of weighting false negatives and false positives 01:42:15.620 | 
And the reason we use it is because Kaggle told us that planet, who were running this competition 01:42:25.540 | 
The important thing for you to know is that you can create 01:42:30.060 | 
Custom metrics so in this case you can see here 01:42:32.820 | 
It says from planet import f2, and really I've got this here so that you can see how to do it 01:42:45.220 | 
You can see there's something called planet.py 01:43:07.100 | 
Or SciPy, I can't remember where it came from 01:43:09.220 | 
And does a couple of little tweaks that are particularly important 01:43:13.900 | 
But the important thing is like you can write any metric you like right as long as it takes in 01:43:26.180 | 
They're both going to be numpy arrays one-dimensional numpy arrays, and then you return back a number 01:43:32.380 | 
Okay, and so as long as you create a function that takes two vectors and returns a number 01:43:37.940 | 
You can use it as a metric, and so then when we said 01:43:42.220 | 
Learn metrics equals and then passed in that array which just contains a single function f2 01:43:58.260 | 
After every epoch for you, okay, so in general like the the fast AI library 01:44:04.020 | 
Everything is customizable so kind of the idea is that everything is 01:44:13.940 | 
Kind of gives you what you might want by default, but also everything can be changed as well 01:44:24.900 | 
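A metric in this sense is just a function that takes two vectors and returns one number. The sketch below is a plain-Python F-beta on already thresholded 0/1 values; it is not the f2 from planet.py, and the example predictions and targets are made up:

```python
def f_beta(preds, targs, beta=2.0):
    """F-beta for 0/1 predictions and targets; beta > 1 weights false negatives more."""
    tp = sum(1 for p, t in zip(preds, targs) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, targs) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, targs) if p == 0 and t == 1)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Made-up example: two labels are missed (false negatives), nothing is over-predicted.
preds = [1, 0, 0, 0, 1]
targs = [1, 1, 1, 0, 1]

print("F2  :", round(f_beta(preds, targs, beta=2.0), 4))
print("F1  :", round(f_beta(preds, targs, beta=1.0), 4))
print("F0.5:", round(f_beta(preds, targs, beta=0.5), 4))
```

With only false negatives in the example, the score drops as beta grows, which shows that beta equals 2 punishes missed labels harder than beta equals 1 does.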
We have a little bit of confusion about the difference between 01:44:30.940 | 
Just single label. Uh-huh. Do you by any chance have an example in which you compute 01:44:38.180 | 
They just show us. Oh, I didn't get to that activation function. Yeah, so 01:44:43.700 | 
So I'm so sorry. I said I'd do that and then I didn't so the activation the output activation function for a single label 01:44:53.100 | 
Classification is softmax for all the reasons that we talked about 01:44:56.380 | 
but if we were trying to predict something that was like 01:45:03.700 | 
Then softmax would be a terrible choice because it's very hard to come up with something where both of these are high 01:45:09.860 | 
In fact, it's impossible because they have to add up to one. So the closest they could be would be 0.5 01:45:22.260 | 
Sigmoid, okay, and again the fastai library does this automatically for you if it notices you have a multi-label 01:45:30.100 | 
Problem and it does that by checking your data set to see if anything has more than one label applied to it 01:45:36.700 | 
And so sigmoid is a function which is equal to 01:45:48.660 | 
All of these exps, but instead we just take this exp and we say it's just equal to it 01:46:02.260 | 
And so the nice thing about that is that now like multiple things can be high at once 01:46:12.020 | 
Right and so generally then if something is less than zero its sigmoid is going to be less than 0.5 01:46:20.300 | 
If it's greater than 0 its sigmoid is going to be greater than 0.5 01:46:24.500 | 
And so the important thing to know about a sigmoid function is that its shape 01:46:36.420 | 
Something which asymptotes at the top to one and asymptotes, oh, I drew that 01:46:48.300 | 
To zero and so therefore it's a good thing to model a probability with 01:47:00.660 | 
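The difference between the two output activations shows up as soon as two classes both have high activations. A minimal sketch in plain Python, writing sigmoid as e^x / (1 + e^x); the three activation values are made up:

```python
import math

def sigmoid(x):
    """e^x / (1 + e^x): each activation is squashed on its own, ignoring the others."""
    return math.exp(x) / (1 + math.exp(x))

def softmax(xs):
    """e^x divided by the sum of e^x over all activations, so the outputs compete."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up activations where the first two classes are both strongly present.
activations = [3.0, 3.0, -2.0]

sig = [sigmoid(a) for a in activations]
soft = softmax(activations)

print("sigmoid:", [round(s, 3) for s in sig])
print("softmax:", [round(s, 3) for s in soft])
```

Sigmoid lets both of the first two outputs be close to one, while softmax cannot push either of them above 0.5 because its outputs have to add up to one.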
Will be familiar with this is what we do in logistic regression 01:47:04.420 | 
So it kind of appears everywhere in machine learning, and you'll see that kind of a sigmoid and a softmax. They're very close 01:47:13.500 | 
Conceptually, but this is what we want as our activation function for multi-label 01:47:18.420 | 
And this is what we want for single label, and again fastai does it all for you. There was a question over here. Yes 01:47:33.140 | 
The initial training that you do if I understand correctly you have we have frozen the 01:47:38.580 | 
The pre-trained model and you only did initially try to train the latest 01:47:48.860 | 
But from the other hand we said that only the initial layer 01:47:53.500 | 
So let's last probably the first layer is like important to us and the other two 01:47:59.340 | 
Are more like features that are ImageNet related and we didn't apply in this case. Well, it's that they 01:48:07.900 | 
But the pre-trained weights in them aren't so it's the later layers that we really want to train the most 01:48:22.620 | 
Okay, so you start with the latest one and then you go right so if you go back to our quick dogs and cats 01:48:30.140 | 
When we create a model from a pre-trained model it returns something where all of the convolutional layers are frozen and 01:48:49.980 | 
The randomly initialized fully connected layers, right? 01:48:56.620 | 
And if something is like really close to ImageNet that's often all we need 01:49:02.220 | 
But because the early layers are already good at finding edges, gradients, repeating patterns for 01:49:17.180 | 
We set the learning rates for the early layers to be really low 01:49:22.020 | 
Because we don't want to change them much, whereas the later ones we set them to be higher 01:49:31.860 | 
This is no longer true. You know the early layers are still like 01:49:35.420 | 
Better than the later layers, but we still probably need to change them quite a bit 01:49:41.380 | 
So that's right. This learning rate is nine times smaller than the final learning rate rather than a thousand times smaller 01:49:52.980 | 
Okay, so you play with with the weights of the layers with the learning rates. Yeah, normally 01:49:58.780 | 
Most of the stuff you see online if they talk about this at all, they'll talk about unfreezing 01:50:07.620 | 
And indeed we do unfreeze our randomly generated ones 01:50:11.780 | 
But what I found is, although in the fastai library you can type learn.freeze_to and just freeze a subset of layers 01:50:20.140 | 
this approach of using differential learning rates seems to be like 01:50:23.780 | 
More flexible to the point that I never find myself unfreezing subsets of layers 01:50:29.700 | 
So what I don't understand is that I would expect you to start with that 01:50:36.620 | 
Learning rates rather than trying to learn the last layer. So the reason okay, so you could skip 01:50:47.180 | 
Training just the last layers and just go straight to differential learning rates 01:50:51.060 | 
But you probably don't want to. The reason you probably don't want to is that there's a difference: the convolutional layers all contain 01:50:58.980 | 
Pre-trained weights, so they're not random. For things that are close to ImageNet 01:51:05.260 | 
They're actually really good. For things that are not close to ImageNet, they're better than nothing 01:51:09.980 | 
All of our fully connected layers, however are totally random 01:51:16.260 | 
So therefore you would always want to make the fully connected weights better than random by training them a bit first 01:51:23.020 | 
Because otherwise if you go straight to unfreeze 01:51:26.460 | 
Then you're actually going to be like fiddling around with those early layer weights when the later ones are still random 01:51:43.420 | 
What are the things we're trying to change there? 01:51:53.460 | 
That that's always what SGD does. Yeah, so the only thing 01:52:16.460 | 
so the weights are the weights of the fully connected layers and 01:52:20.820 | 
The weights in those kernels in the convolutions. So that's what training means 01:52:26.140 | 
It's and we'll learn about how to do it with SGD. But training literally is setting those numbers 01:52:35.940 | 
Activations they're calculated. They're calculated from the weights and the previous layers 01:52:45.780 | 
I have a question. So can you lift that up higher and speak loudly? So in your example of training the satellite image 01:52:52.980 | 
Example, so you start with very small size, 64 01:52:57.340 | 
Yeah, so does it literally mean that you know the model takes a small area from the entire image? 01:53:12.260 | 
by default our transform takes the smallest edge and 01:53:21.860 | 
Resamples it so the smallest edge is the size 64 and then it takes a center crop 01:53:32.020 | 
When we're using data augmentation it actually takes a randomly chosen 01:53:39.460 | 
In the case where the image has multiple objects, like in this case 01:53:43.580 | 
Would it be possible that you would just lose the other things that it's trying to predict? 01:53:49.740 | 
Yeah, which is why data augmentation is important, and particularly 01:53:54.620 | 
Test time augmentation is going to be particularly important, because you wouldn't want to, you know 01:54:00.620 | 
There may be an artisanal mine out in the corner, which if you take a center crop you don't see 01:54:07.220 | 
So data augmentation becomes very important. Yeah 01:54:14.820 | 
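The resize-then-center-crop behaviour described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the library's actual transform code; the function name `resize_then_center_crop` and the example image size are assumptions made up for this sketch.

```python
def resize_then_center_crop(width, height, size=64):
    """Scale so the smallest edge equals `size`, then take a centered square crop.

    Returns (resized_width, resized_height, crop_box), where crop_box is
    (left, top, right, bottom) in the resized image's pixel coordinates.
    """
    scale = size / min(width, height)
    new_w = max(size, round(width * scale))
    new_h = max(size, round(height * scale))
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return new_w, new_h, (left, top, left + size, top + size)


# Hypothetical 256 x 128 image: it is resized to 128 x 64, then the
# middle 64 x 64 square is kept and the left and right edges are discarded.
result = resize_then_center_crop(256, 128, size=64)
print(result)
```

Anything that sits only in the discarded edges is invisible to the model with a center crop, which is why random crops and test time augmentation matter here.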
So when we talk about the metrics that we use here, like accuracy or F2 01:54:18.820 | 
That's not really what the model tries to... That's a great point. That's not the loss function 01:54:24.620 | 
Yeah, right. The loss function is something we'll be learning about next week 01:54:29.020 | 
And it uses cross entropy, otherwise known as negative log likelihood 01:54:34.500 | 
The metric is just the thing that's printed so we can see what's going on 01:54:45.940 | 
Modeling: does the training data also have to be multi-class? 01:54:50.460 | 
So can I train on just images of pure cats and pure dogs and expect it at prediction time to 01:54:56.260 | 
Predict if I give it a picture having both a cat and a dog? 01:54:58.880 | 
I've never tried that and I've never seen an example of something that needed it. I 01:55:08.140 | 
Guess conceptually there's no reason it wouldn't work 01:55:15.740 | 
And you still use a sigmoid activation? You would have to make sure you're using a sigmoid loss function 01:55:20.340 | 
So in this case fastai's default would not work, because by default fastai would say your training data 01:55:25.700 | 
Never has both a cat and a dog, so you would have to override the loss function 01:55:38.080 | 
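To see why the sigmoid matters for the cat-and-dog question, here is a minimal sketch in plain Python. The logits are made-up numbers, and the helper names are hypothetical; this is not the fastai loss code. Softmax forces the class probabilities to sum to one, so it cannot say "both"; sigmoid scores each label independently.

```python
import math


def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent per-label probabilities, each between 0 and 1."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def binary_cross_entropy(probs, targets):
    """Mean per-label loss, the kind of loss that pairs with a sigmoid output."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)


logits = [3.0, 3.0]            # hypothetical raw scores for (cat, dog)
soft = softmax(logits)         # each class gets 0.5: it cannot be confident in both
sig = sigmoid(logits)          # both labels get a high probability
loss_both = binary_cross_entropy(sig, [1, 1])
loss_cat_only = binary_cross_entropy(sig, [1, 0])
print(soft, sig, loss_both, loss_cat_only)
```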
Those three learning rates, do they just kind of spread evenly across the layers? 01:55:43.420 | 
Yeah, we'll talk more about this later in the course, but in the fastai library 01:55:49.540 | 
There's a concept of layer groups. So in something like a ResNet-50 01:55:54.580 | 
You know there's hundreds of layers, and I figured you don't want to write down hundreds of learning rates, so I've 01:56:00.940 | 
basically decided for you how to split them and 01:56:04.420 | 
The last one always refers just to the fully connected layers that we've randomly initialized and added to the end 01:56:12.780 | 
And then these ones are split generally about halfway through 01:56:20.260 | 
These ones are kind of the ones which you hardly want to change at all 01:56:24.500 | 
And these are the ones you might want to change a little bit. I don't think we'll cover it in the course 01:56:29.420 | 
But if you're interested we can talk about it in the forum 01:56:31.260 | 
There are ways you can override this behavior to define your own layer groups if you want to 01:56:35.640 | 
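As a rough illustration of the layer-group idea, the sketch below maps three learning rates onto three groups: the first half of the pretrained body, the second half, and the newly added head. The layer names, the helper `assign_learning_rates`, and the learning-rate values are all hypothetical; the real library decides the split points itself.

```python
def assign_learning_rates(body_layers, head_layers, lrs):
    """Map three learning rates onto (early body, late body, head) groups."""
    if len(lrs) != 3:
        raise ValueError("expected exactly three learning rates")
    half = len(body_layers) // 2
    groups = [body_layers[:half], body_layers[half:], head_layers]
    return {layer: lr for group, lr in zip(groups, lrs) for layer in group}


body = [f"conv_block_{i}" for i in range(6)]   # stand-ins for pretrained layers
head = ["fc_1", "fc_2"]                        # stand-ins for the new random layers
rates = assign_learning_rates(body, head, lrs=[1e-4, 1e-3, 1e-2])
for layer, lr in rates.items():
    print(layer, lr)
```

The earliest layers get the smallest rate because they are the ones you hardly want to change; the head gets the largest because it starts out random.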
And is there any way to visualize the model easily, or dump the layers of the model? 01:56:50.420 | 
So if you just type learn it doesn't tell you much at all, but what you can do is go learn.summary() 01:57:07.020 | 
There's all the layers, and so you can see in this case 01:57:09.980 | 
These are the names. I mentioned how they all got names, right? So the first layer is called Conv2d-1 01:57:20.100 | 
This is useful to actually look at. It's taking 64 by 64 images, which is what we told it 01:57:27.060 | 
We're going to transform things to. This is three channels 01:57:30.700 | 
Most things have channels at the end and would say 64 by 64 by 3; PyTorch moves it to the front 01:57:41.300 | 
That's because it turns out that some of the GPU computations run faster when it's in that order 01:57:47.260 | 
Okay, but that happens all behind the scenes automatically, so part of that transformation stuff 01:57:52.780 | 
That's all done automatically is to do that 01:58:01.540 | 
In Keras they use a special value, None 01:58:07.100 | 
In PyTorch they use -1. So this is a four-dimensional mini batch 01:58:14.380 | 
The number of images in the mini batch is dynamic, you can change that. The number of channels is 3 01:58:20.660 | 
The size of the images is 64 by 64. Okay, and so then you can basically see that this particular convolutional kernel 01:58:32.220 | 
And it's also halving. We haven't talked about this, but convolutions can have something called a stride 01:58:37.100 | 
That, like max pooling, changes the size. So it's returning a 32 by 32 by 64 output 01:58:48.140 | 
So that's summary, and we'll learn all about what that's doing in detail in the second half of the course 01:58:59.100 | 
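The halving from 64 to 32 follows from the standard formula for the output size of a strided convolution. The sketch below assumes a 7 by 7 kernel with stride 2 and padding 3, which is how the first convolution of a torchvision ResNet is configured; the transcript itself does not state the kernel size or padding, so treat those numbers as an assumption.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Assumed first-layer settings: 7x7 kernel, stride 2, padding 3.
out = conv_output_size(64, kernel_size=7, stride=2, padding=3)
print(out)

# With stride 1 and "same"-style padding, the size is unchanged.
same = conv_output_size(64, kernel_size=3, stride=1, padding=1)
print(same)
```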
I collected my own data set, a really small data set of currencies, and I tried to use the 01:59:07.780 | 
Learning rate finder and then the plot, and it gave me some numbers which I didn't understand on the learning rate finder 01:59:14.980 | 
Yeah, and then the plot was empty. So yeah, let's talk about that on the forum 01:59:21.020 | 
The learning rate finder is going to go through a mini batch at a time. If you've got a tiny data set 01:59:26.460 | 
There's just not enough mini batches. So the trick is to make your batch size really small 01:59:31.740 | 
Like try making it four or eight or something 01:59:34.460 | 
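The effect of batch size on a tiny data set is easy to quantify. In this sketch the data set size of 100 images is a made-up number; the point is only that the learning rate finder takes one step per mini batch, so a handful of batches gives it almost nothing to plot.

```python
import math


def batches_per_epoch(n_items, batch_size):
    """Number of mini batches needed to see every item once."""
    return math.ceil(n_items / batch_size)


n_images = 100  # hypothetical tiny data set
for bs in (64, 8, 4):
    print(bs, batches_per_epoch(n_images, bs))
```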
Okay, they were great questions. Nothing online to add? 01:59:41.900 | 
We've got a little bit past where I hoped to, but let's quickly talk about 01:59:49.060 | 
Structured data so we can start thinking about it for next week 01:59:57.300 | 
This is really weird right to me. There's basically two types of data set we use in machine learning. There's a type of data 02:00:11.740 | 
where all of the things inside an object, like all of the pixels inside an image, are 02:00:18.180 | 
All the same kind of thing. They're all pixels or they're all 02:00:28.180 | 
I call this kind of data unstructured, and then there's data sets like a 02:00:42.460 | 
Structurally quite different: one thing is representing how many page views last month, another one is their sex 02:00:49.620 | 
Another one is what zip code they're in, and I call this structured data 02:00:57.180 | 
Unusual. Lots of people use that terminology, but lots of people don't. There's no 02:01:09.020 | 
I'm referring to columnar data, as you might find in a database or a spreadsheet, where different columns 02:01:16.700 | 
represent different kinds of things and each row represents an observation 02:01:21.740 | 
So structured data is probably what most of you 02:01:30.180 | 
Funnily enough, academics in the deep learning world don't really give a shit about structured data 02:01:38.340 | 
Because it's pretty hard to get published in fancy conference proceedings 02:01:43.060 | 
If you've got a better logistics model. But it's the thing that makes the world go round 02:01:48.620 | 
It's the thing that makes everybody money and efficiency and makes stuff work 02:01:57.600 | 
So we're not going to ignore it, because we're practical deep learning 02:02:02.140 | 
And Kaggle doesn't ignore it either because people put prize money up on Kaggle to solve real-world problems 02:02:08.940 | 
So there are some great Kaggle competitions we can look at. There's one running right now 02:02:13.400 | 
Which is the grocery sales forecasting competition for Ecuador's largest chain 02:02:19.080 | 
I've got to be a little careful about how much I show you about currently running competitions because I don't want 02:02:28.660 | 
To help you cheat, but it so happens there was a competition a year or two ago 02:02:34.620 | 
For one of Germany's largest grocery chains, which is almost identical. So I'm going to show you how to do that 02:02:48.740 | 
So I would suggest, first of all, try practicing what we're learning on Rossmann 02:02:54.860 | 
But then see if you can get it working on grocery, because currently 02:03:00.340 | 
On the leaderboard no one seems to basically know what they're doing in the groceries competition. If you look at the leaderboard 02:03:11.220 | 
These ones around .529, .530 are people that are literally finding group averages and submitting those 02:03:17.940 | 
I know because those are the kernels that they're using. So basically the people around 20th place 02:03:35.300 | 
Notebook, make sure you git pull. Okay, in fact, just a reminder: before you start working 02:03:41.220 | 
Git pull in your fastai repo and from time to time 02:03:45.500 | 
Conda env update. For you guys doing the in-person course, the conda env update 02:03:51.540 | 
You should do more often because we're changing things a little bit. Folks in the MOOC 02:03:57.100 | 
More like once a month should be fine 02:03:59.820 | 
So anyway, I just changed this a little bit, so make sure you git pull to get lesson3-rossman 02:04:06.500 | 
And there's a couple of new libraries here. One is fastai.structured 02:04:12.500 | 
fastai.structured contains stuff which is actually not at all PyTorch specific 02:04:18.940 | 
And we actually use that in the machine learning course as well for doing random forests with no PyTorch at all 02:04:24.620 | 
I mention that because you can use that particular library without any of the other parts of fastai 02:04:34.300 | 
And then we're also going to use fastai.column_data 02:04:37.460 | 
Which is basically some stuff that allows us to do fastai PyTorch stuff with 02:04:46.220 | 
For structured data we need to use pandas a lot 02:04:52.060 | 
Anybody who's used R data frames will be very familiar with pandas. Pandas is basically an attempt to kind of replicate 02:05:20.580 | 
Python for Data Analysis by Wes McKinney. There's a new edition that just came out a couple of weeks ago 02:05:26.180 | 
Obviously, being by the pandas author, its coverage of pandas is excellent, but it also covers 02:05:39.460 | 
IPython and Jupyter really well. Okay, and so I'm kind of going to assume 02:05:46.500 | 
That you know your way around these libraries to some extent 02:05:51.020 | 
Also, there was the workshop we did before this started and there's a video of that online where we kind of have a brief mention 02:06:00.340 | 
Structured data is generally shared as CSV files. It was no different in this competition 02:06:07.460 | 
As you'll see, there's a hyperlink to the Rossmann data set here 02:06:11.860 | 
All right now if you look at the bottom of my screen you'll see this goes to files.fast.ai 02:06:17.060 | 
Because this doesn't require any login or anything to grab this data set, it's as simple as right clicking 02:06:25.540 | 
Head over to wherever you want it and just type 02:06:33.180 | 
The URL. Okay, so that's because it's not behind a login or anything 02:06:46.100 | 
You can always read a CSV file with just pandas.read_csv. Now in this particular case there's a lot of 02:06:53.460 | 
Pre-processing that we do, and what I've actually done here is I've 02:07:02.180 | 
Stolen the entire pipeline from the third-place winner of Rossmann. Okay, so they made all their data 02:07:09.980 | 
They're really great. They've got a GitHub available with everything that we need, and I've ported it all across and simplified it and 02:07:21.900 | 
Course is about deep learning, not about data processing. So I'm not going to go through it 02:07:26.800 | 
But we will be going through it in the machine learning course in some detail because feature engineering is really important 02:07:35.820 | 
You know check out the machine learning course 02:07:42.980 | 
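For anyone who has not used pandas before, reading a CSV into a data frame looks roughly like this. The rows below are invented sample values in the general shape of a store-sales table, read from an in-memory string so the sketch runs without downloading anything; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up rows for illustration only, not the competition data.
csv_text = """Store,Date,Sales,Customers,SchoolHoliday
1,2015-07-31,5263,555,1
2,2015-07-31,6064,625,1
"""

# parse_dates turns the Date column into real datetime values.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(df.dtypes)
print(df.head())
```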
Kind of what it looks like. So once we read the CSVs in 02:07:46.580 | 
You can see basically what's there so the key one is 02:08:09.620 | 
For that particular store. We know whether that 02:08:16.100 | 
We know the number of customers that that particular store had 02:08:20.900 | 
We know whether that date was a school holiday 02:08:34.260 | 
What kind of store it is. So this is pretty common, right? You'll often get 02:08:38.720 | 
Data sets where there's some column with just some kind of code. We don't really know what the code means 02:08:44.460 | 
Most of the time I find it doesn't matter what it means 02:08:48.200 | 
Normally you get given a data dictionary when you start on a project, and obviously if you're working on an internal project 02:08:54.540 | 
You can ask the people at your company what this column means. I 02:08:57.780 | 
Kind of stay away from learning too much about it. I prefer to like see what the data says 02:09:06.020 | 
There's something about what kind of product are we selling in this particular row? 02:09:10.940 | 
And then there's information about how far away the nearest competitor is and how long they have been open for 02:09:30.500 | 
For each store we can find out what state it's in; for each state we can find out the name of the state 02:09:37.980 | 
Interestingly, they were allowed to download any external data 02:09:43.740 | 
It's very common, as long as you share it with everybody else, and so some folks tried downloading data from 02:09:53.180 | 
I'm not sure exactly what it was that they were checking the trend of but we have this information from Google Trends 02:09:59.940 | 
Somebody downloaded the weather for every day in Germany for every state 02:10:12.580 | 
You can get a data frame summary with pandas which kind of lets you see how many 02:10:22.520 | 
Observations and means and standard deviations 02:10:25.180 | 
Again, I don't do a hell of a lot with that early on 02:10:31.260 | 
So this is called a relational data set. A relational data set is one where there's quite a few tables 02:10:38.300 | 
We have to join together. It's very easy to do that in pandas 02:10:41.960 | 
There's a thing called merge, so I create a little function to do that 02:10:45.020 | 
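A small join helper built on `DataFrame.merge` might look like the sketch below. The function name `left_join`, the table contents, and the column names are made up for illustration; the notebook's own helper may differ. A left join keeps every row of the main table, and counting nulls afterwards is a quick check for rows that found no match.

```python
import pandas as pd


def left_join(left, right, on, suffix="_y"):
    """Left-join `right` onto `left`, keeping every row of `left`.

    Columns that exist in both tables keep their name from `left`;
    the copy from `right` gets `suffix` appended.
    """
    return left.merge(right, how="left", on=on, suffixes=("", suffix))


# Invented example tables.
sales = pd.DataFrame({"Store": [1, 2, 3], "Sales": [5263, 6064, 8314]})
stores = pd.DataFrame({"Store": [1, 2], "State": ["HE", "TH"]})

joined = left_join(sales, stores, on="Store")
unmatched = joined["State"].isnull().sum()  # rows with no matching store record
print(joined)
print("unmatched rows:", unmatched)
```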
And so I just started joining everything together: joining in the weather, the Google Trends 02:10:58.060 | 
You'll see there's one thing that I'm using from the fastai library, which is called add_datepart 02:11:03.740 | 
We talk about this a lot in the machine learning course 02:11:06.340 | 
But basically this is going to take a date and pull out of it a bunch of columns: day of week 02:11:11.580 | 
Is it the start of a quarter, month of year, so on and so forth, and add them all in to the data set 02:11:23.380 | 
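The idea of expanding a date into several columns can be sketched with pandas' `.dt` accessor. This is not the fastai implementation; `add_date_columns` is a hypothetical stand-in that produces only a few of the columns mentioned, using invented dates.

```python
import pandas as pd


def add_date_columns(df, col):
    """Hypothetical helper: expand a datetime column into feature columns."""
    dates = df[col].dt
    out = df.copy()
    out["Year"] = dates.year
    out["Month"] = dates.month
    out["Day"] = dates.day
    out["DayOfWeek"] = dates.dayofweek          # Monday=0 ... Sunday=6
    out["IsQuarterStart"] = dates.is_quarter_start
    return out


df = pd.DataFrame({"Date": pd.to_datetime(["2015-07-01", "2015-07-31"])})
expanded = add_date_columns(df, "Date")
print(expanded)
```

Each new column gives the model a chance to pick up a different periodic pattern, such as weekly or quarterly effects, that a raw timestamp hides.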
As we join everything together we fiddle around with some of the dates a little bit. Some of them are in month and year 02:11:35.860 | 
Take information about, for example, holidays and add a column for how long until the next holiday 02:11:46.580 | 
So on and so forth. Okay, so we do all that and at the very end 02:11:51.900 | 
We basically save a big structured data file that contains all that stuff 02:11:57.020 | 
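The "how long until the next holiday" column mentioned above can be computed with a single backward pass over the days. This is a plain Python sketch of the idea with a made-up holiday pattern, assuming one entry per consecutive day; the notebook's own feature code works on data frames and is not reproduced here.

```python
def days_until_next_event(flags):
    """For each day, count the days until the next flagged day.

    The value is 0 on a flagged day itself and None when no flagged day follows.
    """
    result = [None] * len(flags)
    days_left = None
    for i in range(len(flags) - 1, -1, -1):   # walk backwards in time
        if flags[i]:
            days_left = 0
        elif days_left is not None:
            days_left += 1
        result[i] = days_left
    return result


holiday = [False, False, True, False, True, False]  # hypothetical consecutive days
until_next = days_until_next_event(holiday)
print(until_next)
```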
Something that those of you that use pandas may not be aware of is that there's a very cool new format called feather 02:12:08.380 | 
It pretty much takes it as it sits in RAM and dumps it to the disk 02:12:13.180 | 
And so it's really, really fast. The reason that you need to know this is because the 02:12:19.580 | 
Ecuadorian grocery competition that's on now has 350 million records 02:12:24.120 | 
So you will care about how long things take. It took, I believe, about six seconds for me to save 02:12:30.820 | 
350 million records to feather format, so it's pretty cool 02:12:34.380 | 
So at the end of all that I save it in feather format, and for the rest of this discussion 02:12:39.740 | 
I'm just going to take it as given that we've got this nicely 02:12:43.700 | 
Processed, feature-engineered file and I can just go read_feather. Okay, but for you to play along at home 02:12:49.780 | 
You will have to run those previous cells. Oh 02:12:57.940 | 
You don't have to run those because the file that you download from files.fast.ai has already done that for you, okay? 02:13:20.980 | 
This date at this store and so the goal of this competition is to find out 02:13:28.020 | 
How many things will be sold for each store for each type of thing in the future 02:13:34.460 | 
Okay, and so that's basically what we're going to be trying to do 02:13:39.860 | 
And so here's an example of what some of the data looks like 02:13:46.420 | 
Next week we're going to see how to go through these steps 02:13:50.380 | 
But basically what we're going to learn is to split the columns into two types 02:14:01.900 | 
Store ID 1 and store ID 2 are not numerically related to each other. They're categories 02:14:09.700 | 
Right, we're going to treat day of week like that too: Monday and Tuesday, day zero and day one, not numerically 02:14:16.160 | 
Whereas distance in kilometers to the nearest competitor 02:14:22.140 | 
That's a number that we're going to treat numerically 02:14:25.020 | 
Right, so in other words, the categorical variables we basically are going to one-hot encode 02:14:30.580 | 
You can think of it as one-hot encoding them, whereas the continuous variables we're going to be feeding into fully connected layers 02:14:47.820 | 
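A tiny plain Python sketch of the split: the categorical column becomes a one-hot vector, while the continuous column is passed through as a number. The row values and the helper name `one_hot` are invented for illustration, and this shows only the "think of it as one-hot encoding" mental model rather than what the library does internally.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]


days = [0, 1, 2, 3, 4, 5, 6]                                # Monday=0 ... Sunday=6
row = {"DayOfWeek": 1, "CompetitionDistance": 1270.0}        # hypothetical row

# Categorical: Tuesday is not "one more than" Monday, so encode it as a position.
# Continuous: the distance is a genuine quantity, so keep it as a number.
features = one_hot(row["DayOfWeek"], days) + [row["CompetitionDistance"]]
print(features)
```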
Validation set, and you'll see a lot of these start to look familiar 02:14:50.900 | 
This is the same function we used on planet and dog breeds to create a validation set 02:14:55.180 | 
There's some stuff that you haven't seen before 02:15:01.940 | 
Basically, rather than saying ImageClassifierData.from_csv, we're going to say ColumnarModelData.from_data_frame 02:15:08.560 | 
So you can see the basic API concepts will be the same, but they're a little different, right? 02:15:15.680 | 
but just like before we're going to get a learner and 02:15:31.440 | 
Okay, so the basic sequence is going to end up looking hopefully very familiar. Okay, so we're out of time 02:15:45.360 | 
Enter as many Kaggle image competitions as possible. Try to really get this feel for 02:16:01.360 | 
That post I showed you at the start of class today that kind of took you through lesson one 02:16:07.560 | 
Really go through that on as many image data sets as you can, to just feel 02:16:15.360 | 
Because you want to get to the point where next week, when we start talking about structured data, this idea of how 02:16:21.960 | 
Learners kind of work and data works and data loaders and data sets and looking at pictures should be really, you know, intuitive