
Lesson 3: Deep Learning 2018


Chapters

0:0
1:21 tmux
8:27 Review
38:5 File Link
47:13 Applying a Rectified Linear Unit
53:21 Activation
63:13 Fully Connected Layer
74:15 Soft Max
76:3 Logarithms
89:52 Multi-Label Classification
102:29 Create Custom Metrics
104:26 Difference between Multi-Label and a Single Label
105:21 Sigmoid
114:25 Loss Function
115:50 Layer Groups
116:56 Learn Summary
119:49 Structured Data
120:0 Two Types of Data Set We Use in Machine Learning
124:49 Pandas


00:00:00.000 | Welcome back everybody
00:00:02.000 | I'm sure you've noticed
00:00:05.640 | But there's been a lot of cool activity on the forum this week and one of the things that's been really great to see
00:00:11.400 | is that a lot of you have started creating
00:00:14.040 | Really helpful materials both for your classmates to better understand stuff and also for you to better understand stuff by
00:00:22.120 | Trying to teach what you've learned. I just wanted to highlight a few I've actually
00:00:29.040 | Posted to the wiki thread a few of these, but there's lots more
00:00:33.220 | Reshma has posted a whole bunch of nice introductory tutorials so for example if you're having any trouble getting connected with AWS
00:00:44.200 | She's got a whole step-by-step
00:00:47.000 | How to go about logging in and getting everything working which I think is a really terrific thing and so it's a kind of thing
00:00:54.840 | that if you're
00:00:57.600 | writing some notes for yourself to remind you how to do it,
00:01:00.520 | you may as well post them for others to use as well, and by using a Markdown file like this
00:01:06.280 | It's actually good practice if you haven't used github before if you put it up on github
00:01:10.700 | Everybody can now use it or of course you can just put it in the forum
00:01:16.640 | A more advanced
00:01:18.200 | thing that Reshma wrote up: she noticed that I like using tmux,
00:01:22.600 | which is a handy little thing which lets me
00:01:27.480 | basically have a window. Let's see if I've got one. I'll show you
00:01:31.400 | So as soon as I log into my computer
00:01:34.200 | If I run tmux
00:01:37.280 | You'll see that all of my windows pop straight up
00:01:39.820 | Basically, I can continue running stuff in the background. I've got vim over here,
00:01:45.760 | and I can kind of zoom into it, or I can move over to the top, where there's Jupyter
00:01:50.600 | kernels all running, and so forth. So if that sounds interesting, Reshma has a
00:01:56.400 | Tutorial here on how you can use tmux
00:01:58.520 | And it's actually got a whole bunch of stuff in her github, so that's that's really cool
00:02:04.520 | Apil Tamang has written a very nice kind of summary basically of our last lesson
00:02:12.160 | Which kind of covers
00:02:15.880 | what are the key things we did and why did we do them. So if you're kind of
00:02:20.060 | wondering how it all fits together, I think this is a really helpful summary
00:02:26.160 | Like what did those couple of hours look like if we summarize it all into a page or two?
00:02:33.520 | I also really like that Pavel has
00:02:36.080 | kind of done a deep dive on the learning rate finder,
00:02:41.420 | which is a
00:02:44.200 | topic that a lot of you have been interested in learning more about, particularly
00:02:47.680 | those of you who have done deep learning before and have realized that this is like a solution to a problem that you've been having for
00:02:54.120 | a long time and haven't seen before. It's something which hasn't really been blogged about before, so this is the first
00:02:59.940 | time I've seen this blogged about. When I put a link to
00:03:03.800 | Pavel's post on Twitter, it was shared hundreds of times
00:03:07.280 | It's been really really popular and viewed many thousands of times, so that's some great content
00:03:12.960 | Radek has posted lots of cool stuff. I really like this practitioners guide to pytorch which again
00:03:20.360 | This is for more advanced students, but it's aimed at people who have never used PyTorch before but know a bit about
00:03:27.200 | Numerical programming in general and it's a quick introduction to how pytorch is different
00:03:33.080 | And then there's been some interesting little bits of research like what's the relationship between learning rate and batch size so one of the
00:03:41.080 | Students actually asked me this before class and I said well one of the other students has written an analysis of exactly that
00:03:49.080 | so what he's done is basically looked through and tried different batch sizes and different learning rates and tried to see how they seem to
00:03:54.960 | Relate together and these are all like cool experiments, which you know you can try yourself
00:03:59.960 | Radek again: he's written up a kind of research into this question. I made a claim that
00:04:07.600 | stochastic gradient descent with restarts finds more generalizable
00:04:14.240 | parts of the function surface because they're kind of flatter, and he's been trying to figure out whether there's a way to measure that more directly
00:04:20.440 | Not quite successful yet, but a really interesting piece of research
00:04:24.080 | We've got some
00:04:27.120 | introductions to convolutional neural networks,
00:04:33.000 | something that we'll be learning about towards the end of this course, but I'm sure you've noticed we're using something called ResNet, and
00:04:39.560 | Anand Saha actually posted a pretty impressive analysis of what a ResNet is and why it's interesting
00:04:46.400 | And this one has already been shared very widely around the internet, I've seen
00:04:51.280 | So more advanced students who are interested in
00:04:55.000 | jumping ahead can look at that, and Apil Tamang has also done something similar
00:05:00.840 | so lots of
00:05:03.600 | Yeah, lots of stuff going on on the forums. I'm sure you've also noticed we have a beginner forum now
00:05:09.760 | specifically for you know asking questions which
00:05:12.880 | You know
00:05:15.760 | It's always the case that there are no
00:05:17.760 | dumb questions, but when there's lots of people around you talking about advanced topics, it might not feel that way
00:05:23.320 | so hopefully the beginners forum is just a
00:05:25.320 | less intimidating space and
00:05:27.800 | If you're a more advanced
00:05:30.960 | Student who can help answer those questions, please do but remember when you do answer those questions try to answer in a way
00:05:37.560 | That's friendly to people that maybe you know have no more than a year of programming experience and haven't done any machine learning before
00:05:43.580 | So you know I hope
00:05:48.760 | Other people in the class
00:05:51.720 | Feel like you can contribute as well and just remember all of the people we just looked at or many of them
00:05:56.460 | I believe have never
00:05:58.520 | Posted anything to the internet before right I mean you don't have to be a particular kind of person to be allowed to blog
00:06:04.760 | or something; you can just jot down your notes, throw them up there, and
00:06:08.880 | one handy thing is if you just put it on the forum, and you're not quite sure of some of the details,
00:06:16.800 | then you have an opportunity to get feedback, with people saying, ah, well,
00:06:20.720 | That's not quite how that works
00:06:22.000 | You know actually it works this way instead or or that's a really interesting insight had you thought about taking this further and so forth
00:06:29.480 | So what we've done so far is kind of an introduction, just as a practitioner, to
00:06:35.920 | Convolutional neural networks for images, and we haven't really talked much at all about
00:06:42.460 | The theory or why they work or the math of them, but on the other hand what we have done is seen
00:06:49.200 | how to
00:06:51.360 | Build a model which actually works exceptionally well in fact world-class level models
00:06:59.240 | and we'll kind of review a little bit of that today and
00:07:03.600 | Then also today
00:07:06.440 | we're going to dig in quite a lot more, actually, into the underlying theory:
00:07:10.180 | What is a CNN? What's a convolution?
00:07:12.880 | How does this work? And then we're going to go through this cycle where
00:07:18.260 | we're going to do a little intro into a whole bunch of application areas, using neural nets for structured data,
00:07:25.120 | so kind of like logistics or forecasting or you know financial data or that kind of thing and then looking at
00:07:33.080 | language applications NLP applications using recurrent neural nets and then
00:07:38.880 | collaborative filtering for
00:07:41.720 | Recommendation systems and so these will all be like
00:07:46.200 | similar to what we've done for CNNs for images
00:07:49.800 | It'll be like: here's how you can get a state-of-the-art result without digging into the theory,
00:07:53.800 | but knowing how to actually make it work
00:07:55.880 | And then we're going to go back through those almost in reverse order
00:08:01.160 | So then we're going to dig right into collaborative filtering in a lot of detail and see how to write the code
00:08:06.960 | underneath and how the math works underneath, and then we're going to do the same thing for structured data analysis
00:08:12.760 | We're going to do the same thing for convnets for images, and finally an in-depth deep dive into recurrent neural networks
00:08:19.560 | So that's kind of where we're heading
00:08:23.240 | so let's start by
00:08:25.240 | Doing a little bit of a review and I want to
00:08:29.280 | also provide a bit more detail on some steps that we only briefly skimmed over
00:08:36.040 | So I want to make sure that we're all able to complete
00:08:38.920 | last week's assignment, which was the dog breeds one
00:08:44.080 | I mean, to basically apply what you've learned to another dataset, and I thought the easiest one to do would be the dog
00:08:49.520 | Breeds Kaggle competition and so I want to make sure everybody has everything you need to do this right now
00:08:54.280 | So and the first thing is to make sure that you know how to download
00:08:58.800 | Data and so there's there's two main places at the moment. We're kind of downloading data from one is from Kaggle
00:09:05.600 | And the other is from like anywhere else
00:09:08.720 | And so I'll first of all do the the Kaggle version
00:09:13.840 | So to download from Kaggle
00:09:17.080 | we use something called kaggle-cli,
00:09:20.360 | which is here, and to install it... I think it's already in, let's just double check
00:09:29.120 | Yeah, so it should already be in your
00:09:36.000 | environment
00:09:38.600 | But to make sure: one thing that happens is that because this downloads from the Kaggle website through screen scraping, every time Kaggle changes
00:09:45.420 | the website it breaks. So anytime you try to use it,
00:09:48.940 | if Kaggle's website has changed recently, you'll need to make sure you have the most recent version, so you can always go pip
00:09:56.500 | install
00:09:59.160 | kaggle-cli
00:10:01.280 | --upgrade, and that'll just make sure that you've got the latest version of it and everything that it depends on
00:10:10.720 | okay, and
00:10:12.800 | so then having done that you can
00:10:15.420 | follow the instructions. Actually, I think Reshma was kind enough to... there you go, there's a Kaggle CLI
00:10:20.820 | file. Everything you need to know can be found on Reshma's
00:10:24.140 | GitHub
00:10:27.620 | So basically, for the next step you go
00:10:32.160 | kg download
00:10:35.540 | and then you provide your username with -u, you provide your password with -p, and then with -c you give the competition name
00:10:44.400 | A lot of people on the forum have been confused about what to enter here
00:10:47.680 | So the key thing to note is that when you're at a Kaggle competition,
00:10:51.220 | after the /c/ there's a specific name, planet-understanding-etc., right? That's the name you need
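The step above (take the name that follows /c/ in the competition URL and pass it with -c) can be sketched in Python. This is only an illustration: the URL, username and password below are made-up placeholders, the helper names are hypothetical, and the command is built as a string rather than executed.

```python
import shlex
from urllib.parse import urlparse


def competition_slug(url: str) -> str:
    """Return the path segment that follows /c/ in a Kaggle competition URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "c" not in parts or parts.index("c") + 1 >= len(parts):
        raise ValueError(f"no /c/<name> segment in {url!r}")
    return parts[parts.index("c") + 1]


def kg_download_command(username: str, password: str, url: str) -> str:
    """Build (but do not run) the kg download command line described above."""
    argv = ["kg", "download", "-u", username, "-p", password,
            "-c", competition_slug(url)]
    return " ".join(shlex.quote(a) for a in argv)


# Hypothetical URL and credentials, for illustration only.
example_url = "https://www.kaggle.com/c/some-competition-name/data"
print(kg_download_command("my_user", "my_password", example_url))
```

Quoting each argument with shlex.quote keeps the command safe to paste into a shell even if the password contains spaces or punctuation.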
00:11:01.560 | The other thing you'll need to make sure of is that you have,
00:11:04.280 | on your own computer, attempted to click download at least once, because when you do it will ask you to accept the rules
00:11:11.000 | If you've forgotten to do that,
00:11:14.580 | kg download will give you a hint; it'll say it looks like you might have forgotten
00:11:17.800 | the rules. If you log into Kaggle with like a
00:11:21.700 | Google account like anything other than a username password this won't work
00:11:25.980 | So you'll need to click forgot password on Kaggle and get them to send you a normal password
00:11:31.300 | So that's the Kaggle version
00:11:33.700 | Right and so when you do that you end up with a whole folder created for you with all of that competition data in it
00:11:41.960 | So a couple of reasons you might want to not use that
00:11:44.980 | The first is that you're using a data set that's not on Kaggle
00:11:48.220 | The second is that you don't want all of the data sets in a Kaggle competition for example the planet competition
00:11:54.700 | That we've been looking at a little bit. We'll look at again today
00:11:57.380 | Has data in two formats TIFF and JPEG the TIFF is 19 gigabytes and the JPEG is 600 megabytes
00:12:06.420 | So you probably don't want to download both
00:12:09.460 | So I'll show you a really cool trick, which actually somebody on the forum taught me
00:12:14.040 | I think it was one of the MSAN students here at USF. There's a
00:12:17.860 | Chrome extension called CurlWget
00:12:22.480 | So you can just search for CurlWget
00:12:26.220 | and then you install it by just clicking on install, if you haven't installed an extension before, and then from now on
00:12:33.520 | Every time you try to download something, so I'll try and download this file
00:12:40.460 | I'll just go ahead and cancel it right and now you see this little yellow button. That's added up here
00:12:46.340 | There's a whole command here
00:12:48.940 | All right, so I can copy that and
00:12:52.060 | Paste it
00:12:55.940 | into my
00:12:57.700 | window and
00:12:59.420 | hit go, and there it goes. Okay
00:13:02.980 | So what that does is all of your cookies and headers and everything else needed to download that file are saved
00:13:09.620 | So this is not just useful for
00:13:12.060 | downloading data
00:13:13.980 | It's also useful if you're trying to download, I don't know, some TV show or something, anything where it's hidden behind a
00:13:20.220 | login or something. You can grab it, and actually that is very useful for data science because quite often we want to
00:13:27.620 | analyze things like videos on our consoles
00:13:31.140 | So this is a good trick. All right, so there's two ways to get the data
00:13:34.500 | So then
00:13:38.380 | Having got the data you then need to
00:13:42.020 | Build your model, right?
00:13:45.140 | So what I tend to do like you'll notice that I tend to assume that the data is in a directory called data
00:13:51.860 | That's a subdirectory of wherever your notebook is, right?
00:13:55.620 | Now you don't necessarily
00:13:59.380 | Actually want to put your data there
00:14:00.860 | You might want to put it directly in your home directory or you might want to put it on another drive or whatever
00:14:05.260 | So what I do is, if you look inside my courses/dl1 folder, you'll see that data is actually a
00:14:13.020 | symbolic link
00:14:15.660 | to a different drive, right? So you can put it anywhere you like and then just add a symbolic link,
00:14:20.820 | or you can just put it there directly. It's up to you
00:14:24.660 | If you haven't used symlinks before, they're like aliases or shortcuts on the Mac or Windows
00:14:30.340 | Very handy, and there are some threads on the forum about how to use them if you want help with that
00:14:35.980 | That, for example, is also how we actually have the
00:14:39.420 | fastai modules
00:14:41.660 | available from the same place as our notebooks. It's just a symlink to where they come from
00:14:46.540 | Anytime you want to see
00:14:50.340 | where things actually point to in Linux, you can just use the -l flag when listing a directory,
00:14:57.340 | and it'll show you where the symlinks
00:14:59.340 | point and also show you which things are directories, and so forth
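The symlink arrangement described above can be tried out safely in a temporary directory. The sketch below assumes a POSIX-style system (creating symlinks on Windows may need extra privileges), and the folder names are placeholders rather than the actual course layout. The final loop is only a rough imitation of what listing a directory with -l shows.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    real_data = root / "big_drive" / "data"   # where the files really live
    notebook_dir = root / "courses" / "dl1"   # where the notebook runs
    real_data.mkdir(parents=True)
    notebook_dir.mkdir(parents=True)
    (real_data / "sample.txt").write_text("hello")

    # Make notebook_dir/data point at the real data directory.
    link = notebook_dir / "data"
    link.symlink_to(real_data, target_is_directory=True)

    link_is_symlink = link.is_symlink()
    link_target = os.readlink(link)
    seen_through_link = (link / "sample.txt").read_text()

    # Rough imitation of a long listing: flag links and directories,
    # and show where each link points.
    for entry in sorted(notebook_dir.iterdir()):
        kind = "l" if entry.is_symlink() else ("d" if entry.is_dir() else "-")
        arrow = f" -> {os.readlink(entry)}" if entry.is_symlink() else ""
        print(kind, entry.name + arrow)
```

Code that opens files under data/ works unchanged whether data is a real folder or a link, which is why the notebooks can assume a data subdirectory.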
00:15:03.040 | Okay, so one thing which
00:15:06.580 | May be a little unclear based on what we've done so far is like
00:15:15.220 | how little code you actually need to do this end-to-end. So what I've got here in a single window is an entire
00:15:22.860 | end-to-end process to get a state-of-the-art result for cats versus dogs, right?
00:15:28.260 | The only step I've skipped is the bit where we downloaded it from Kaggle and then unzipped it, right?
00:15:37.220 | These are literally all the steps
00:15:39.460 | And so we
00:15:42.660 | import our libraries, and actually if you import this one, conv_learner, that basically imports everything else
00:15:48.900 | So that's that. We need to tell it the path of where things are, the size that we want, the batch size that we want
00:15:56.540 | alright
00:15:58.500 | So then and we're going to learn a lot more about what these do very shortly
00:16:02.340 | But basically we say how do we want to transform our data so we want to transform it in a way
00:16:07.500 | That's suitable to this particular kind of model and it assumes that the photos are side on photos
00:16:13.420 | And that we're going to zoom in up to 10% each time
00:16:16.220 | We say that we want to get some data
00:16:19.500 | Based on paths and so remember this is this idea that there's a path called cats and a path called dogs
00:16:25.180 | And they're inside a path called train and a path called valid
00:16:28.340 | Note that you can always
00:16:33.500 | override these with other things, so if your things are in differently named folders you could either rename them or, as you can see here,
00:16:40.340 | there's like a train name and a val name; you can always pick something else here
00:16:45.020 | Also notice there's a test name
00:16:48.820 | So if you want to submit something to Kaggle you'll need to fill in the name of the folder where the test
00:16:54.380 | set is, and obviously those won't be labeled
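The folder convention described above (train and valid folders, each containing one folder per class, plus an unlabeled test folder) can be illustrated with a small standard-library sketch. The file names are invented, and labelled_files is a hypothetical helper, not a function from any library.

```python
import tempfile
from pathlib import Path


def labelled_files(split_dir: Path):
    """Return (filename, label) pairs, taking each label from the parent folder name."""
    return sorted(
        (f.name, class_dir.name)
        for class_dir in split_dir.iterdir() if class_dir.is_dir()
        for f in class_dir.iterdir() if f.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build train/cats, train/dogs, valid/cats, valid/dogs with one file each.
    for split in ("train", "valid"):
        for label in ("cats", "dogs"):
            d = root / split / label
            d.mkdir(parents=True)
            (d / f"{label[:-1]}.0.jpg").touch()
    # Test images sit directly in their folder, with no class subfolders.
    (root / "test").mkdir()
    (root / "test" / "unknown.0.jpg").touch()

    train_pairs = labelled_files(root / "train")
    valid_pairs = labelled_files(root / "valid")
    test_pairs = labelled_files(root / "test")
    print(train_pairs)
```

Because the test folder has no class subfolders, no labels can be read from it, which matches the point that the test set is unlabeled.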
00:17:00.220 | So then we create a model from a pre trained model. It's from a ResNet 50 model using this data
00:17:06.900 | And then we call fit and remember by default
00:17:10.380 | That has all of the layers, but the last few frozen and again, we'll learn a lot more about what that means
00:17:16.380 | And so that's that's what that does so that
00:17:19.500 | That took two and a half minutes
00:17:22.220 | Notice here I didn't say precompute=True. Again,
00:17:27.300 | there's been some confusion on the forums about what that means
00:17:30.260 | It's only something that makes it a little faster for this first step, right, so you can always skip it
00:17:37.620 | And if you're at all confused about it, or it's causing you any problems, just leave it off, because it's just a
00:17:43.700 | shortcut which caches some of the intermediate steps so that they don't have to be recalculated each time
00:17:52.020 | Okay, and remember that when we are using precomputed activations, data augmentation doesn't work, right? So even if you ask for data
00:18:00.420 | augmentation, if you've got precompute=True
00:18:02.860 | it doesn't actually do any data augmentation, because it's using the cached
00:18:06.940 | non-augmented
00:18:09.220 | activations
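Why cached activations and data augmentation don't mix can be shown with a toy model. Everything here is a stand-in: the "images" are short lists of numbers, augment is a random flip, and frozen_features plays the role of the frozen pretrained layers. It is a sketch of the idea only, not how the library implements it.

```python
import random


def augment(image, rng):
    """Toy augmentation: randomly flip a 1-D 'image' (a list of numbers)."""
    return image[::-1] if rng.random() < 0.5 else image


def frozen_features(image):
    """Stand-in for the frozen pretrained layers: an order-sensitive summary."""
    return sum((i + 1) * px for i, px in enumerate(image))


images = {"img0": [1, 2, 3], "img1": [4, 0, 9]}
rng = random.Random(0)

# Precompute: run the frozen layers once on the un-augmented images and cache.
cache = {name: frozen_features(img) for name, img in images.items()}


def activations(name, precompute):
    if precompute:
        return cache[name]  # the augmentation step is never reached
    return frozen_features(augment(images[name], rng))


cached_runs = {activations("img0", precompute=True) for _ in range(20)}
live_runs = {activations("img0", precompute=False) for _ in range(20)}
print(len(cached_runs), len(live_runs))
```

With the cache, every request returns the same value; without it, the flipped and unflipped versions produce different activations, so the later layers see more variety.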
00:18:10.540 | So in this case to keep this as simple as possible. I have no pre computed anything going on
00:18:15.140 | so I do three cycles of length one and
00:18:20.220 | Then I can then unfreeze
00:18:22.900 | So it's now going to train the whole thing
00:18:25.500 | something we haven't seen before and we'll learn about in the second half is
00:18:29.620 | called bn_freeze. For now, all you need to know is that if you're using a
00:18:34.940 | model like a
00:18:37.140 | bigger, deeper model like ResNet-50 or ResNeXt-101 on a dataset
00:18:42.780 | that's very, very similar to ImageNet, like these cats and dogs datasets, in other words
00:18:48.140 | side-on photos of standard objects,
00:18:51.780 | you know, of a similar size to ImageNet, like somewhere between 200 and 500 pixels,
00:18:57.300 | you should probably add this line when you unfreeze. For those of you that are more advanced, what it's doing is
00:19:06.020 | causing the batch normalization
00:19:08.460 | moving averages to not be updated, but in the second half of this course you're going to learn all about why we do that
00:19:14.340 | It's something that's not supported by any other library
00:19:17.020 | But it turns out to be super important anyway, so we do one more epoch
00:19:23.820 | training the whole network
00:19:25.820 | And then at the end we use test time augmentation
00:19:29.540 | To ensure that we get the best predictions we can and that gives us ninety nine point four five percent
00:19:37.180 | So that's it, right? When you try a new dataset, that's basically the minimum set of steps
00:19:46.260 | that you would need to follow
00:19:48.260 | You'll notice this is assuming. I already know what learning rate to use so you'd use a learning rate finder for that
00:19:54.260 | It's assuming that I know the the directory layout
00:19:57.620 | and so forth
00:20:00.820 | So that's kind of a minimum set. Now, one of the things that I wanted to make sure
00:20:05.020 | you had an understanding of is how to use libraries other than fast AI
00:20:11.780 | And so I feel like the best thing to look at is Keras, because Keras is a library where,
00:20:18.020 | just as fast AI sits on top of PyTorch,
00:20:20.820 | Keras sits on top of a whole variety of different backends. Mainly people nowadays use it with TensorFlow
00:20:28.480 | There's also an MXNet version. There's also a Microsoft CNTK version
00:20:35.020 | So what I've got if you do a git pull you'll see that there's a
00:20:40.300 | something
00:20:42.220 | Called Keras lesson one where I've attempted to replicate at least parts of lesson one in Keras
00:20:49.020 | Just to give you a sense of how that works
00:20:52.580 | I'm not going to talk more about batch norm freeze now other than to say
00:21:01.880 | if you're using
00:21:04.700 | something
00:21:06.060 | which has got a number larger than 34 at the end, so like ResNet-50 or ResNeXt-101, and you're
00:21:12.620 | training on a dataset that is very similar to ImageNet,
00:21:17.500 | so it's like normal photos of normal sizes where the thing of interest takes up most of the frame,
00:21:22.780 | then you probably should add bn_freeze(True) after unfreeze
00:21:27.180 | If in doubt, try training it with, and then try training it without
00:21:32.700 | More advanced students can certainly talk about it on the forums this week,
00:21:36.480 | and we will be talking about the details of it in the second half of the course when we come back to our
00:21:42.740 | CNN in-depth section in the second-last lesson
00:21:47.440 | So with Keras
00:21:54.300 | again, we import a bunch of stuff and
00:22:00.940 | Remember I mentioned this idea that you've got a thing called train and a thing called valid, and inside that you've got a
00:22:06.180 | thing called dogs and a thing called cats; that's a standard way of providing
00:22:10.420 | labeled
00:22:12.620 | images, so Keras does that too, right? So we're going to tell it where the training set and the validation set are,
00:22:18.780 | what size to use, and what batch size to use
00:22:22.820 | Now you'll notice that in Keras we need much, much, much more
00:22:28.540 | code to do the same thing
00:22:30.660 | More importantly each part of that code has many many many more things you have to set and if you set them wrong
00:22:37.860 | everything breaks, right, so
00:22:40.300 | I'll give you a summary of what they are. So rather than creating a single
00:22:47.700 | data object, in Keras we first of all have to define something called a data
00:22:52.860 | generator to say how to generate the data, and so for a data generator
00:22:57.140 | we basically have to say what kind of data augmentation
00:23:00.820 | we want to do, and
00:23:03.620 | we actually also have to say what kind of
00:23:07.340 | normalization we want to do. So whereas with fast AI we just say
00:23:13.180 | whatever ResNet-50 requires, just do that for me, please,
00:23:16.780 | here we actually have to know a little bit about what's expected of us
00:23:20.860 | Generally speaking, copying and pasting Keras code from the internet is a good way to make sure you've got the
00:23:26.660 | right stuff to make that work
00:23:28.660 | And again, it doesn't have a kind of standard set of, like, here are the best data augmentation parameters to use for photos
00:23:36.020 | So, you know, I've copied and pasted all of this from the Keras
00:23:39.780 | documentation
00:23:42.620 | So I don't think it's the best set to use at all, but it's the set that they're using in their
00:23:48.500 | So having said this is how I want to generate data, so horizontally flip sometimes, you know, zoom sometimes, shear sometimes,
00:23:55.860 | We then create a generator from that by taking that data generator and saying I want to generate
00:24:02.300 | Images by looking from a directory and we pass in the directory which is of the same
00:24:07.700 | directory structure that fast AI uses and
00:24:10.660 | You'll see there's some overlaps with kind of how fast AI works here
00:24:14.780 | You tell it what size images you want to create you tell it what batch size you want in your mini batches
00:24:20.100 | And then there's something here not to worry about too much
00:24:23.340 | But basically if you've just got two possible outcomes you would generally say binary here
00:24:28.300 | If you've got multiple possible outcomes you would say categorical. Here we've only got cats or dogs, so it's binary
00:24:34.460 | So an example of like where things get a little more complex is you have to do the same thing for the validation set
00:24:42.300 | So it's up to you to create a data generator
00:24:44.300 | That doesn't have data augmentation because obviously for the validation set unless you're using TTA that's going to stuff things up
00:24:52.740 | Also,
00:24:54.380 | when you train,
00:24:56.140 | you randomly reorder the images so that they're always shown in different orders to make it more random,
00:25:01.540 | but with a validation set it's
00:25:04.060 | vital that you don't do that, because if you shuffle the validation set you then can't track how well you're doing
00:25:10.020 | It's in a different order from the labels
00:25:12.420 | Basically, these are the kind of steps you have to do every time with Keras
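The point about not shuffling the validation set can be seen with a toy example in plain Python. The file names and the "perfect" model below are invented for illustration, and a fixed reordering stands in for a random shuffle so that the result is repeatable.

```python
labels = ["cat", "cat", "dog", "dog", "cat", "dog"]   # in directory order
files = [f"img{i}_{lab}" for i, lab in enumerate(labels)]


def predict(filename):
    """Toy 'perfect' model: reads the answer off the file name."""
    return filename.split("_")[1]


def accuracy(preds, targets):
    """Fraction of positions where prediction and target agree."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


# Predictions made in the same order as the labels line up correctly.
acc_ordered = accuracy([predict(f) for f in files], labels)

# Predictions made in a different order get compared with the wrong labels.
reordered = files[1:] + files[:1]
acc_reordered = accuracy([predict(f) for f in reordered], labels)

print(acc_ordered, acc_reordered)
```

Even though every individual prediction is right, the measured accuracy drops once the order of predictions no longer matches the order of the labels.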
00:25:20.340 | So again, the reason I was using ResNet-50 is that Keras doesn't have ResNet-34, unfortunately
00:25:26.120 | So I just wanted to compare like with like, so we've got to use ResNet-50 here
00:25:29.680 | There isn't the same idea with Keras of saying like construct a model that is suitable for this data set for me
00:25:39.260 | So you have to do it by hand, right?
00:25:40.940 | So the way you do it is to basically say this is my base model and then you have to construct on top of that manually
00:25:48.700 | The layers that you want to add and so by the end of this course, you'll understand why it is that these
00:25:53.780 | particular three layers are the layers that we add
00:25:57.060 | So having done that in Keras you basically say okay
00:26:02.460 | this is my model and then again there isn't like a
00:26:05.980 | Concept of like automatically freezing things or an API for that
00:26:10.680 | so you just have to loop through the layers that you want to freeze and
00:26:15.700 | set trainable equals False on them
00:26:18.840 | In Keras, there's a concept we don't have in fast AI or pytorch of compiling a model
00:26:25.640 | So basically once your model's ready to use you have to compile it,
00:26:28.720 | Passing in what kind of optimizer to use what kind of loss to look for or what metrics so again with fast AI
00:26:35.920 | You don't have to pass this in because we know what loss is the right loss to use you can always override it
00:26:42.620 | But for a particular model we give you good defaults
00:26:45.980 | Okay, so having done all that
00:26:47.980 | rather than calling fit you call fit_generator,
00:26:50.980 | Passing in those two generators that you saw earlier the train generator and the validation generator
00:26:56.500 | For reasons I don't quite understand Keras expects you to also tell it how many batches there are per epoch
00:27:04.000 | So the number of batches is equal to the size of the generator
00:27:08.340 | Divided by the batch size you can tell it how many epochs
00:27:13.420 | Just like in
00:27:15.420 | fast AI, you can say how many
00:27:17.420 | processes or how many workers to use for pre-processing
00:27:20.900 | Unlike fast AI, the default in Keras is basically not to use any
00:27:27.500 | So to get good speed you've got to make sure you include this
00:27:32.620 | And so that's basically enough to start fine-tuning the last layers
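The batches-per-epoch figure mentioned above is simple arithmetic. The sketch below uses made-up dataset sizes. Rounding up so that a final partial batch is counted is an assumption on my part; plain integer division, which drops that last partial batch, is shown alongside for comparison.

```python
import math


def batches_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


# Hypothetical sizes, for illustration only.
n_train, n_valid, batch_size = 23000, 2000, 64

train_steps = batches_per_epoch(n_train, batch_size)
valid_steps = batches_per_epoch(n_valid, batch_size)
train_steps_floor = n_train // batch_size   # drops the final partial batch

print(train_steps, valid_steps, train_steps_floor)
```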
00:27:42.820 | So as you can see I got to a validation accuracy of 95%
00:27:46.140 | But as you can also see, something really weird happened, where after one epoch it was like 49, and then it was 69, and then 95
00:27:53.040 | I don't know
00:27:54.900 | why these are so low. That's not normal. There may be a bug in Keras. There may be a bug in my code
00:28:01.500 | I reached out on Twitter to see if anybody could figure it out, but they couldn't. I guess this is one of the challenges with using
00:28:08.700 | something like this, and one of the reasons I wanted to use fast AI for this course is that it's much harder to screw things up
00:28:14.740 | So I don't know if I screwed something up or somebody else did. Yes?
00:28:18.700 | This is using the TensorFlow backend, yeah. And if you want to run this to try it out yourself,
00:28:28.780 | you can just go pip install
00:28:32.500 | tensorflow-gpu
00:28:36.940 | keras
00:28:38.500 | Okay, because it's not part of the fast AI environment by default
00:28:42.720 | But that should be all you need to do to get that working
00:28:47.540 | So then
00:28:54.060 | There isn't a concept of like layer groups or differential learning rates or partial unfreezing or whatever
00:29:00.420 | So I had to print out all of the layers and decide manually
00:29:04.980 | how many I wanted to fine-tune, so I decided to fine-tune everything from layer 140 onwards
00:29:10.280 | So that's why I just looped through like this
00:29:12.280 | After you change that you have to recompile the model
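The manual freezing and partial unfreezing described above comes down to looping over a list of layers and flipping a trainable flag. The sketch below uses a small stand-in Layer class instead of a real framework, and the layer counts (175 base layers plus 3 added ones) are hypothetical; only the split at layer 140 comes from the talk.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Stand-in for a framework layer; only the trainable flag matters here."""
    name: str
    trainable: bool = True


def set_trainable_from(layers, split_idx):
    """Freeze layers before split_idx and make the rest trainable."""
    for layer in layers[:split_idx]:
        layer.trainable = False
    for layer in layers[split_idx:]:
        layer.trainable = True


# Hypothetical model: 175 pretrained base layers plus 3 added head layers.
base = [Layer(f"base_{i}") for i in range(175)]
head = [Layer(f"head_{i}") for i in range(3)]
model_layers = base + head

# Stage 1: train only the newly added layers.
set_trainable_from(model_layers, len(base))
stage1_trainable = sum(layer.trainable for layer in model_layers)

# Stage 2: fine-tune everything from layer 140 onwards.
set_trainable_from(model_layers, 140)
stage2_trainable = sum(layer.trainable for layer in model_layers)

print(stage1_trainable, stage2_trainable)
```

In a real Keras model the flag change only takes effect after the model is compiled again, as noted in the talk.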
00:29:15.540 | And then after that I then ran another step and again
00:29:19.540 | I don't know what happened here the accuracy of the training set stayed about the same but the validation set totally fell in the hole
00:29:25.380 | But I mean the main thing to note is even if we put aside the validation set
00:29:32.340 | We're getting I mean, I guess the main thing is there's a hell of a lot more code here
00:29:36.300 | Which is kind of annoying but also the performance is very different. So we're also here even on the training set
00:29:42.860 | We're getting like 97% after four epochs that took a total of about eight minutes
00:29:48.420 | you know over here we had
00:29:51.140 | 99.5% on the validation set and it ran a lot faster. So it was like
00:29:58.100 | four or five minutes
00:30:00.940 | right
00:30:04.860 | Depending on what you do particularly if you end up wanting to deploy stuff to mobile devices at the moment
00:30:12.880 | the PyTorch-on-mobile situation is very early,
00:30:18.020 | so you may find yourself wanting to use TensorFlow, or you may work for a company that's kind of settled on TensorFlow.
00:30:24.340 | So if you need to redo something you've learned here in TensorFlow,
00:30:30.980 | You probably want to do it with Keras, but just recognize
00:30:35.160 | you know, it's going to take a bit more work to get there and
00:30:38.700 | by default it's much harder to get the same state-of-the-art results you get with fastai.
00:30:46.140 | You'd have to replicate all of the state-of-the-art
00:30:49.620 | algorithms that are in fastai, so it's hard to get the same
00:30:53.300 | Level of results, but you can see the basic ideas are similar
00:30:59.140 | Okay, and it's certainly
00:31:01.140 | It's certainly possible, you know, there's nothing I'm doing in fastai that would be impossible,
00:31:07.380 | but you would have to implement stochastic gradient descent with restarts, you would have to
00:31:11.260 | implement differential learning rates, you would have to implement batch norm freezing,
00:31:16.820 | which you probably don't want to do. Well, that's not quite true.
00:31:20.940 | I think at least one person on the forum is
00:31:23.100 | attempting to create a Keras-compatible or TensorFlow-compatible version of fastai,
00:31:28.380 | which I hope will get there.
00:31:30.620 | I actually spoke to Google about this a few weeks ago, and they're very interested in getting fastai ported to TensorFlow.
00:31:36.420 | So maybe by the time you're looking at this on the MOOC, that will exist. I certainly hope so.
00:31:41.820 | We will see
00:31:44.580 | Anyway, so Keras and TensorFlow are certainly not
00:31:49.900 | You know
00:31:52.940 | That difficult to handle and so I don't think you should worry if you're told you have to learn them
00:31:57.900 | After this course for some reason it'll only take you a couple of days. I'm sure
00:32:02.020 | So that's most of the stuff you would need to
00:32:10.780 | complete this assignment from last week,
00:32:14.460 | which was to try to do everything you've seen already, but on the dog breeds dataset. And just to remind you,
00:32:21.300 | in the last few minutes of last week's lesson I showed you how to do much of that,
00:32:28.940 | including how I actually explored the data to find out what the classes were and how big the images were and stuff like
00:32:37.860 | that. So if you've forgotten that or didn't quite follow it all last week, check out the video from last week to see.
00:32:45.380 | One thing that we didn't talk about is how do you actually submit to Kaggle? So how do you actually get predictions?
00:32:51.200 | So I just wanted to show you that last piece as well
00:32:54.160 | And on the wiki thread this week. I've already put a little image of this to show you these steps
00:32:59.980 | But if you go to the Kaggle
00:33:02.980 | Website for every competition there's a section called evaluation and they tell you what to submit and so I just copied and pasted these
00:33:10.900 | two lines from there, and so it says we're expected to submit a file where the first line
00:33:17.060 | contains the word id and then a comma-separated list of all of the possible dog breeds,
00:33:24.300 | And then every line after that will contain the ID itself
00:33:28.700 | Followed by all the probabilities of all the different dog breeds
00:33:34.860 | How do you create that?
00:33:37.700 | So I recognized that inside our data object there's a .classes,
00:33:41.400 | which has got, in alphabetical order, all of the classes.
00:33:47.560 | And then,
00:33:50.460 | so it's got all of the different classes, and then inside
00:33:54.580 | data.test_ds, the test dataset, you can also see there's all the file names.
00:34:00.460 | So just to remind you,
00:34:04.180 | dogs and cats, sorry, dog breeds
00:34:08.100 | was not provided in the kind of Keras-style format where the dogs and cats are in different folders,
00:34:15.260 | but instead it was provided as a CSV file of labels, right? So when you get a CSV file of labels you use
00:34:22.780 | ImageClassifierData.from_csv rather than ImageClassifierData.from_paths.
00:34:30.900 | There isn't an equivalent in Keras, so you'll see like on the Kaggle forums people
00:34:35.100 | share scripts for how to convert it to Keras-style folders.
00:34:39.380 | But in our case we don't have to. We just go ImageClassifierData.from_csv, passing in that CSV file.
00:34:44.860 | And so the CSV file has automatically told the data object what the classes are.
00:34:52.100 | And then also we can see from the folder of test images what the file names of those are
00:35:00.680 | So with those two pieces of information
00:35:02.680 | We're ready to go so I always think it's a good idea to use TTA
00:35:08.040 | As you saw with that dogs and cats example just now it can really improve things particularly when your model is less good
00:35:15.240 | So I can say learn.TTA, and
00:35:26.080 | if you pass in is_test=True,
00:35:29.600 | then it's going to give you predictions on the test set rather than the validation set. And obviously we can't now get
00:35:37.480 | An accuracy or anything because by definition. We don't know the labels for the test set right
00:35:43.880 | So by default most
00:35:48.580 | PyTorch models give you back the log of the predictions,
00:35:53.240 | so then we just have to go exp of that to get back our probabilities.
00:35:57.720 | So in this case the test set had 10,357
00:36:01.680 | images in it, and there are 120 possible breeds, so we get back a matrix of that size.
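As a rough sketch of that exp step, assuming only NumPy and a made-up 2 by 3 matrix standing in for the real 10,357 by 120 output (the names `log_preds` and `probs` are illustrative, not taken from the notebook):

```python
import numpy as np

# Hypothetical stand-in for the model output: log-probabilities for
# 2 test images and 3 breeds (the real matrix would be 10357 x 120).
log_preds = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.1, 0.8]]))

# exp undoes the log, recovering ordinary probabilities.
probs = np.exp(log_preds)

row_sums = probs.sum(axis=1)   # each row should sum to 1
best = probs.argmax(axis=1)    # index of the most likely breed per image
```

Each row of `probs` is one test image and each column is one breed, which is the shape the submission file needs.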
00:36:08.680 | and so we now need to turn that into
00:36:11.680 | Something that looks like this and
00:36:15.400 | So the easiest way to do that is with pandas. If you're not familiar with pandas,
00:36:20.160 | there's lots of information online about it, or check out the Intro to Machine Learning course that we have,
00:36:25.520 | where we do lots of stuff with pandas.
00:36:27.280 | But basically we can just go pd.DataFrame and pass in that matrix, and
00:36:32.200 | then we can say the names of the columns are equal to data.classes, and
00:36:37.080 | then finally we can insert a new column at position zero called id that contains the file names.
00:36:44.080 | But you'll notice that the file names contain
00:36:49.360 | five letters at the start we don't want and four letters at the end we don't want, so I just
00:36:55.240 | subset them like so. So at that point
00:37:00.280 | I've got a data frame that looks like this
00:37:04.800 | Which is what we want so you can now
00:37:08.640 | call, on the data frame... I should have used df, not ds.
00:37:14.240 | Let's fix it now
00:37:19.000 | Frame
00:37:23.240 | Okay, so you can now call the data frame's to_csv, and
00:37:27.400 | quite often you'll find these files actually get quite big,
00:37:32.080 | so it's a good idea to say compression='gzip', and that'll zip it up on the server for you. That's going to create a
00:37:38.920 | zipped up
00:37:41.680 | CSV file on the server, wherever you're running this Jupyter notebook.
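The data frame steps described here can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: the class names, file names, and probabilities are all invented, the `test/` prefix and `.jpg` suffix are assumed so that the five-character and four-character trimming makes sense, and `index=False` is added so pandas does not write its own row numbers.

```python
import gzip
import os
import tempfile

import numpy as np
import pandas as pd

# All values below are made up for illustration; in the lesson they come
# from data.classes, the test file names, and the exp'd predictions.
classes = ["affenpinscher", "afghan_hound", "airedale"]
fnames = ["test/0a1b2c.jpg", "test/3d4e5f.jpg"]
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])

df = pd.DataFrame(probs)
df.columns = classes
# Drop the 5-character "test/" prefix and the 4-character ".jpg" suffix.
df.insert(0, "id", [f[5:-4] for f in fnames])

# Write a gzip-compressed CSV without the pandas row index.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv.gz")
df.to_csv(out_path, compression="gzip", index=False)

# Read the header line back to check the layout.
with gzip.open(out_path, "rt") as f:
    header = f.readline().strip()
```

The header line comes out as `id` followed by the class names, which matches the format the competition's evaluation section asks for.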
00:37:46.920 | You now need to get that back to your computer so you can upload it,
00:37:49.860 | or you can use kaggle-cli, so you can type kg submit and do it that way. I
00:37:56.640 | generally download it to my computer because I often like to just double check it all looks okay.
00:38:02.520 | So to do that there's a cool little thing called FileLink, and if you run FileLink
00:38:08.800 | With a path on your server it gives you back a URL
00:38:12.320 | Which you can click on and it'll download that file from the server onto your computer
00:38:19.000 | so if I click on that now I
00:38:22.040 | Can go ahead and save it and then I can see in my downloads
00:38:29.480 | There it is here's my submission file
00:38:36.600 | If you want to open it, yeah, and as you can see it's exactly what I asked for. There's my
00:38:46.420 | id and the 120 different dog breeds, and
00:38:49.760 | then here's my first row containing the file name and the 120 different probabilities.
00:38:54.740 | Okay, so then you can go ahead and submit that to Kaggle
00:38:58.600 | through their regular form. And so you can see we've now got a good way of both
00:39:06.240 | grabbing any file off the internet and getting it to our AWS instance or Paperspace or whatever by using that
00:39:14.640 | cool little extension in Chrome, and we've also got a way of grabbing stuff off our server easily.
00:39:20.000 | For those of you that are more
00:39:22.520 | command-line oriented, you can also use scp of course, but I kind of like doing everything through the notebook.
00:39:28.720 | All right
00:39:32.880 | One other question I had during the week was, what if I want to just get a
00:39:38.600 | single file
00:39:41.080 | that I want to
00:39:42.360 | get a prediction for? So for example, maybe I want to get the first file from my validation set.
00:39:49.060 | So there's its name.
00:39:51.080 | So you can always look at a file just by calling Image.open.
00:39:54.520 | That just uses the regular
00:39:57.520 | Python Imaging Library.
00:40:02.520 | So what you can do is, I'll show you the shortest version,
00:40:06.320 | you can just call
00:40:08.880 | learn.predict_array,
00:40:10.880 | passing in your image.
00:40:15.600 | Okay, now the image needs to have been
00:40:19.320 | transformed
00:40:21.800 | So you've seen tfms_from_model before.
00:40:27.120 | Normally we just put it all in one variable, but actually behind the scenes it was returning two things.
00:40:32.220 | It was returning training transforms and validation transforms, so I can actually split them apart.
00:40:36.840 | And so here you can see I'm applying, for example, my training transforms, or probably more likely I want to apply
00:40:44.040 | validation transforms.
00:40:46.760 | That gives me back an array containing the transformed image,
00:40:51.400 | which I can then pass to predict_array.
00:40:55.920 | Everything that gets passed to or returned from our models is
00:41:00.560 | Generally assumed to be a mini batch right generally assumed to be a bunch of images
00:41:05.780 | So we'll talk more about some NumPy tricks later, but basically in this case we only have one image,
00:41:12.220 | so we have to turn that into a mini-batch of images. In other words, we need to create a tensor
00:41:17.520 | that basically is not just
00:41:20.960 | rows by columns by channels, but number of images by rows by columns by channels, and it has one image.
00:41:27.980 | So it basically becomes a four-dimensional tensor. There's a cool little trick in NumPy that if you index
00:41:34.360 | into an array with None, that basically adds an additional unit axis to the start,
00:41:40.760 | so it turns it from an image into a mini-batch of one image, and so that's why we had to do that.
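A small sketch of that indexing trick, assuming only NumPy and a blank 224 by 224 by 3 array in place of a real transformed image (the shape and the name `img` are made up for illustration):

```python
import numpy as np

# Hypothetical transformed image: rows x columns x channels.
img = np.zeros((224, 224, 3), dtype=np.float32)

# Indexing with None adds a new axis of length 1 at the front,
# turning one image into a mini-batch containing one image.
batch = img[None]

# Taking element 0 of the mini-batch gets the single image back.
first = batch[0]
```

Here `batch` has four dimensions with a leading size of 1, and `first` has the original three.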
00:41:46.000 | So if you find you're trying to do things with a single image
00:41:51.360 | with any kind of PyTorch or fastai thing, this is just something you might find. If it says something like expecting four
00:41:59.160 | dimensions, only got three, it probably means that. Or if you get back a return
00:42:04.420 | value from something that has some weird first axis, that's probably why: it's probably giving you back a mini-batch.
00:42:12.200 | Okay, and so we'll learn a lot more about this, but it's just something to be aware of
00:42:16.040 | Okay, so that's kind of everything you need to do in practice
00:42:25.360 | So now we're going to kind of get into a little bit of theory
00:42:30.480 | what's actually going on behind the scenes with these convolutional neural networks. And you might remember,
00:42:38.040 | back in lesson one,
00:42:43.960 | we actually saw
00:42:45.960 | our first little bit of theory,
00:42:49.240 | which we stole from this fantastic website, setosa.io/ev, Explained Visually.
00:42:55.260 | And we learned that a convolution is something where we basically have a little matrix,
00:43:01.320 | in deep learning nearly always three by three, and we multiply every element of that matrix
00:43:08.920 | by the corresponding element of a three by three section of an image, and
00:43:12.600 | add them all together to get the result of that convolution at one point. Right, now
00:43:19.140 | Let's see how that all gets put together
00:43:22.960 | to create these
00:43:25.400 | various layers that we saw in the Zeiler and Fergus paper. And to do that, again,
00:43:31.720 | I'm going to steal off somebody who's much smarter than I am.
00:43:34.080 | We're going to steal from a
00:43:37.520 | guy called Otavio Good. Otavio Good was the guy who created Word Lens,
00:43:43.240 | which nowadays is part of Google Translate. If on Google Translate you've ever done that thing where you point your camera
00:43:51.680 | at something which has any kind of foreign language on it and in real time it overlays it with the translation,
00:43:57.520 | that was Otavio's company that built that.
00:44:00.160 | And so he was kind enough to share this fantastic video he created. He's at Google now.
00:44:08.320 | And I want to kind of step you through it because I think it explains really, really well
00:44:11.940 | what's going on. And then after we look at the video, we're going to see how to implement a whole
00:44:17.160 | sequence of layers of a convolutional neural network in Microsoft Excel.
00:44:22.960 | So whether you're a visual learner or a spreadsheet learner, hopefully you'll be able to understand all this
00:44:28.520 | So we're going to start with an image
00:44:31.480 | And something that we're going to do later in the course is we're going to learn to recognize digits
00:44:35.920 | So we'll do it like end-to-end. We'll do the whole thing. So this is pretty similar
00:44:39.840 | So we're going to try and recognize in this case letters
00:44:43.200 | So here's an a which obviously it's actually a grid of numbers, right?
00:44:48.440 | And so there's the grid of numbers. And so what we do is we take our first
00:44:52.800 | convolutional filter. This is assuming that these are already learnt,
00:44:58.760 | right? And you can see this one's got white down the right-hand side and black down the left,
00:45:04.440 | so it's like 0 0 0, or maybe negative 1 negative 1 negative 1, 0 0 0, 1 1 1. And so we're taking each
00:45:10.720 | 3 by 3 part of the image and multiplying it by that 3 by 3
00:45:15.380 | matrix, not as a matrix product but an element-wise product. And so you can see what happens is
00:45:21.520 | everywhere where the white edge is
00:45:25.120 | matching the edge of the A and the black edge isn't, we're getting green,
00:45:30.160 | we're getting a positive. And everywhere where it's the opposite, we're getting a negative,
00:45:34.280 | we're getting a red. And so that's the first filter creating
00:45:39.520 | the result of the first kernel. And so here's a new kernel.
00:45:44.740 | This one has got a white stripe along the top, so we literally scan it through every three by three part of the matrix,
00:45:52.400 | multiplying those
00:45:55.280 | nine bits of the A by the nine bits of the filter to find out whether it's red or green and how red or green it is.
00:46:01.880 | Okay, and so this is assuming we had two filters: one was a bottom edge,
00:46:05.880 | one was a left edge. And you can see here the top edge, not surprisingly,
00:46:09.960 | it's red here. Sorry, bottom edge was red here and green here, the right edge red here and green here.
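The scanning described above can be sketched in a few lines of NumPy. This is an illustrative loop rather than how any library implements it: the function name `conv2d`, the 5 by 5 image, and the kernel values are assumptions chosen to mimic the "black on the left, white on the right" filter in the video.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over every position where it fully fits in `image`."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise product, then add everything up.
            out[i, j] = (patch * kernel).sum()
    return out

# Made-up 5x5 image: dark on the left, bright on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Black (-1) down the left, white (+1) down the right, as in the video.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

result = conv2d(image, kernel)                 # positive where the edge matches
flipped = conv2d(image[:, ::-1], kernel)       # negative for the opposite edge
```

Positive outputs correspond to the green cells in the video and negative outputs to the red ones.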
00:46:15.560 | And then in the next step we add a non-linearity,
00:46:18.320 | the rectified linear unit, which literally means throw away the negatives. So here the red's all gone.
00:46:26.000 | Okay, so here's layer one, the input. Here's layer two, the result of two convolutional filters.
00:46:31.960 | Here's layer three, which is throw away all of the red stuff,
00:46:36.640 | and that's called a rectified linear unit. And then layer four is something called a max pool.
00:46:42.320 | At layer four we replace every
00:46:45.000 | two by two
00:46:47.200 | Part of this grid and we replace it with its maximum right so it basically makes it half the size
00:46:53.560 | It's basically the same thing, but half the size and then we can go through and do exactly the same thing
00:46:58.840 | We can have some new
00:47:00.600 | Filter three by three filter that we put through each of the two results of the previous layer
00:47:08.200 | And again, we can throw away the red bits
00:47:10.520 | Right so get rid of all the negatives so we just keep the positives. That's called applying a rectified linear unit
00:47:18.800 | That gets us to our next layer of this convolutional neural network
00:47:22.840 | So you can see that by you know at this layer back here. It was kind of very interpretable
00:47:29.180 | It's like we've either got bottom edges or left edges, but then the next layer was combining
00:47:34.360 | The results of convolution so it's starting to become a lot less clear like intuitively what's happening
00:47:40.480 | But it's doing the same thing, and then we do another max pool, so we replace every two by two or three by three
00:47:47.680 | section with a single digit. So here this two by two is all black, so we replaced it with a black.
00:47:53.800 | All right, and then we go and we take that and we compare it
00:47:58.200 | to basically a kind of a template of what we would expect to see if it was an A,
00:48:04.020 | if it was a B, if it was a C, if it was a D,
00:48:05.860 | if it was an E, and we see how closely it matches. And we can do it in exactly the same way:
00:48:11.500 | we can multiply every one of the values in this four by eight matrix with every one of the four by eight in this one,
00:48:19.520 | and this one, and this one, and we just add them together to say how often does it match
00:48:24.720 | versus how often does it not match. And then that could be converted to give us a percentage
00:48:30.720 | probability that this is an A. So in this case this particular template matched well with A.
00:48:38.720 | So notice we're not doing any training here, right? This is how it would work if we have a pre trained model
00:48:45.040 | All right
00:48:45.920 | So when we download a pre-trained ImageNet model off the internet and use it on an image without any changes to it,
00:48:51.820 | this is what's happening. Or if we take a model that you've trained and you're applying it to some test set or to some new image,
00:48:58.840 | this is what it's doing: it's applying a convolution to each layer, well, multiple
00:49:07.080 | convolutional filters to each layer,
00:49:09.080 | and then doing the rectified linear unit, so throw away the negatives, and then do the max pool,
00:49:16.360 | And then repeat that a bunch of times and so then we can do it with a new
00:49:21.840 | Letter a or letter B or whatever and keep going through
00:49:26.440 | That process, right?
00:49:29.480 | So as you can see that's a far nicer visualization than I could have created, because I'm not Otavio.
00:49:35.360 | So thanks to him for sharing this with us because it's totally awesome.
00:49:39.520 | This is not done by hand. He actually wrote a piece of computer software to actually do these convolutions.
00:49:45.740 | This is actually being done dynamically. It's pretty cool.
00:49:50.240 | So I'm more of a spreadsheet guy personally. I'm a simple person
00:49:55.200 | So here is the same thing now in spreadsheet form, and you'll find this in the GitHub repo, so you can either
00:50:04.360 | git clone the repo to your own computer to open up the spreadsheet,
00:50:08.320 | or you can just go to github.com/fastai and
00:50:11.520 | click on this. It sits inside,
00:50:14.560 | if you go to our repo
00:50:22.320 | and just go to courses as usual, go to dl1 as usual, you'll see there's an excel section there.
00:50:28.480 | Okay, and so here they all are, so you can just download them by clicking them,
00:50:31.920 | or you can clone the whole repo. And we're looking at conv-example, the convolution example.
00:50:37.280 | Right, so you can see I have here an
00:50:41.600 | input. In this case the input is the number seven. I grabbed this from a dataset called
00:50:49.760 | MNIST, which we'll be looking at in a lot of detail,
00:50:52.960 | and I just took one of those digits at random and I put it into Excel and so you can see every
00:51:00.560 | Pixel is actually just a number between naught and one
00:51:03.720 | Okay, very often actually it'll be a
00:51:07.480 | byte between naught and 255,
00:51:11.120 | or sometimes it might be a float between naught and one. It doesn't really matter. By the time it gets to PyTorch
00:51:18.160 | we're generally dealing with floats,
00:51:20.280 | so one of the steps we often will take will be to convert it to a number between naught and one.
00:51:28.320 | So you can see I've just used conditional formatting in Excel to kind of make the higher numbers more red,
00:51:34.480 | so you can clearly see that this is a seven.
00:51:38.400 | But but it's just a bunch of numbers that have been imported into Excel okay, so here's our input
00:51:46.040 | So remember what Otavio did was he then applied two filters
00:51:54.600 | Right with different shapes so here. I've created a filter which is designed to detect top edges
00:52:00.860 | So this is a 3 by 3 filter
00:52:03.760 | Okay, and I've got ones along the top zeros in the middle minus ones at the bottom right so let's take a look at an example
00:52:11.720 | That's here, and if I hit F2 you can see here highlighted
00:52:18.060 | This is the 3 by 3 part of the input that this particular thing is calculating right
00:52:24.000 | So here you can see it's got 1, 1, 1, which are all being multiplied by 1, and
00:52:29.560 | 0.1, 0, 0, which are all being multiplied by negative 1.
00:52:34.840 | Okay, so in other words all the positive bits are getting a lot of positive, the negative bits are getting nearly nothing at all,
00:52:41.540 | so we end up with a high number.
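To make that arithmetic concrete, here is a minimal sketch of the single cell being described. The top and bottom rows of the patch are the values read out above; the middle row is invented (it is multiplied by zero, so it has no effect), and the variable names are illustrative only.

```python
import numpy as np

# Top-edge filter from the spreadsheet: ones, zeros, minus ones.
top_edge = np.array([[ 1,  1,  1],
                     [ 0,  0,  0],
                     [-1, -1, -1]], dtype=float)

# 3x3 patch of the input. The top and bottom rows are the values read
# out in the lesson; the middle row is invented and does not matter
# because it is multiplied by zero.
patch = np.array([[1.0, 1.0, 1.0],
                  [0.5, 0.5, 0.5],
                  [0.1, 0.0, 0.0]])

# Multiply matching cells and add them up: 3 from the top row,
# minus 0.1 from the bottom row.
activation = (patch * top_edge).sum()
```

The result is about 2.9, a high number because the bright row lines up with the +1 row of the filter and almost nothing lines up with the -1 row.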
00:52:43.720 | Okay, whereas on the other side of this bit of the seven,
00:52:48.880 | Right you can see how you know this is basically zeros here or perhaps more interestingly on the top of it
00:52:57.060 | Right
00:53:01.320 | Here we've got
00:53:03.320 | High numbers at the top, but we've also got high numbers at the bottom which are negating it
00:53:07.800 | Okay, so you can see that the only place that we end up
00:53:11.340 | activating is
00:53:14.000 | Where we're actually at an edge
00:53:17.760 | So in this case this here this number three
00:53:20.320 | This is called an activation
00:53:23.200 | Okay, so when I say an activation I mean a
00:53:30.320 | number that is calculated, and it is calculated by taking
00:53:37.000 | some numbers from the input and
00:53:40.040 | applying some kind of linear operation in this case a convolutional kernel to
00:53:47.740 | Calculate an output, right?
00:53:49.740 | You'll notice that other than going
00:53:52.940 | Inputs multiplied by kernel and summing it together
00:53:58.740 | Right. So here's my sum and here's my multiply
00:54:03.060 | I then take that and I go max of zero comma that and
00:54:07.940 | So that's my rectified linear unit. So it sounds very fancy
00:54:13.220 | rectified linear unit, but what they actually mean is open up Excel and type =MAX(0, thing). Okay,
00:54:19.540 | that's all a ReLU is, and you'll see people in the biz say ReLU.
00:54:26.020 | So ReLU means rectified linear unit, means max(0, thing), and I'm not simplifying it.
00:54:33.700 | I really mean it. If I'm simplifying, I'll always say I'm simplifying,
00:54:38.060 | but if I'm not saying I'm simplifying, that's the entirety. Okay, so a rectified linear unit in its entirety is this,
00:54:44.460 | and a convolution in its entirety is this.
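The same max(0, thing) idea, written as a NumPy sketch instead of an Excel formula. The helper name `relu` and the sample values are made up for illustration:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied to every element."""
    return np.maximum(0, x)

# Made-up convolution outputs, some negative and some positive.
conv_out = np.array([[-2.0, 0.0, 3.0],
                     [ 1.5, -0.5, 2.9]])

activations = relu(conv_out)
```

Every negative value becomes zero and every positive value passes through unchanged.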
00:54:50.060 | Okay, so a single layer of a convolutional neural network is being implemented in its entirety
00:54:58.940 | Here in Excel, okay, and so you can see what it's done is it's deleted pretty much the vertical edges
00:55:08.020 | And highlighted the horizontal edges
00:55:10.580 | so again, this is assuming that
00:55:13.580 | our network is trained and
00:55:15.900 | That at the end of training it had created a convolutional filter with these specific nine numbers in
00:55:22.020 | And so here is a second convolutional filter
00:55:26.860 | It's just a different nine numbers
00:55:29.860 | Now PyTorch doesn't store them as two separate nine-number arrays.
00:55:36.500 | It stores it as a tensor. Remember, a tensor just means an array with
00:55:42.660 | more dimensions. Okay, you can use the word array as well.
00:55:48.280 | It's the same thing, but in PyTorch they always use the word tensor, so I'm going to say tensor.
00:55:54.700 | Okay, so it's just a tensor with an additional axis which allows us to stack
00:56:00.180 | each of these filters together.
00:56:02.780 | Right, a filter and a kernel
00:56:06.260 | pretty much mean the same thing. It refers to one of these three by three
00:56:11.100 | matrices, or one of these three by three
00:56:14.340 | slices of a three-dimensional tensor.
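A small sketch of that stacking, using NumPy arrays as a stand-in for PyTorch tensors. The filter values are illustrative horizontal-edge and vertical-edge detectors, not numbers read from the spreadsheet:

```python
import numpy as np

# Two 3x3 filters; values chosen to look like a horizontal-edge and a
# vertical-edge detector (illustrative, not read from the spreadsheet).
top_edge = np.array([[ 1,  1,  1],
                     [ 0,  0,  0],
                     [-1, -1, -1]], dtype=float)
left_edge = np.array([[1, 0, -1],
                      [1, 0, -1],
                      [1, 0, -1]], dtype=float)

# Stack along a new leading axis: one 2x3x3 tensor instead of two arrays.
filters = np.stack([top_edge, left_edge])

# Each filter is a 3x3 slice of the 3-dimensional tensor.
second = filters[1]
```

The extra leading axis is the "number of filters" dimension, so `filters[0]` and `filters[1]` are the two kernels.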
00:56:18.380 | So if I take this one and here I've literally just copied the formulas in Excel from above
00:56:23.980 | Okay, and so you can see this one is now finding a vertical edge as we would expect. Okay, so
00:56:34.860 | We've now created a
00:56:39.500 | layer. This here is a layer, and specifically we'd say it's a hidden layer,
00:56:44.500 | which means it's not an input layer and it's not an output layer. Everything else is a hidden layer. Okay, and
00:56:51.060 | this particular hidden layer has
00:56:55.180 | a size of 2 on this dimension, because it has two
00:57:00.060 | Filters
00:57:03.220 | Right two kernels
00:57:05.220 | So what happens next
00:57:11.260 | Let's do another one
00:57:12.900 | Okay, so as we kind of go along things can
00:57:16.340 | multiply a little bit in complexity, because my next filter is going to have to contain
00:57:24.060 | two of these three by threes, because I'm going to have to say how do I want to
00:57:30.900 | weight these three things, and at the same time, how do I want to weight the corresponding three things down here?
00:57:37.220 | But because in PyTorch
00:57:39.260 | This is going to be this whole thing here is going to be stored as a multi-dimensional tensor, right?
00:57:45.900 | So you shouldn't really think of this now as two three by three kernels, but one
00:57:51.660 | two by three by three kernel
00:57:54.980 | Okay, so to calculate this value here
00:58:00.420 | I've got the sum product of all of that plus
00:58:05.180 | the sum product of
00:58:08.260 | Scroll down
00:58:11.940 | All of that
00:58:14.420 | Okay, and
00:58:16.620 | So the top ones are being multiplied by this part of the kernel and the bottom ones are being multiplied by this part of the
00:58:22.420 | kernel and so over time
00:58:25.060 | you want to start to get very comfortable with the idea of these higher dimensional
00:58:31.020 | Linear combinations, right?
00:58:33.820 | Like it's it's harder to draw it on the screen like I had to put one above the other
00:58:39.340 | But conceptually just stack it in your mind like this. That's really how you want to think
00:58:44.660 | And actually Geoffrey Hinton in his original
00:58:47.880 | 2012 neural nets
00:58:50.860 | Coursera class has a tip, which is how all computer scientists deal with very high dimensional spaces,
00:58:57.660 | which is that they basically just visualize the two-dimensional space and then say 12 dimensions really fast in their head lots of times.
00:59:06.080 | So that's it right we can see two dimensions on the screen, and then you just got to try to trust
00:59:11.620 | That you can have more dimensions like the concepts just you know
00:59:17.220 | There's there's nothing different about them, and so you can see in Excel
00:59:20.420 | Excel doesn't have the ability to handle three-dimensional tensors, so I had to say, okay, take this two-dimensional
00:59:26.860 | dot product, add on this two-dimensional dot product. But if there was some kind of 3D Excel,
00:59:34.460 | I could have just done that in a single formula,
00:59:36.940 | and then again apply max(0, thing), otherwise known as rectified linear unit, otherwise known as ReLU.
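The "3D Excel" idea can be checked with a short NumPy sketch. Everything here is made up for illustration (random values, the names `window` and `kernel`); the point is only that adding two 2-D sums of products gives the same number as one sum of products over the whole 2 by 3 by 3 block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up values: a 3x3 window from each of the two first-layer outputs,
# stacked into one 2x3x3 block, plus one 2x3x3 kernel.
window = rng.random((2, 3, 3))
kernel = rng.standard_normal((2, 3, 3))

# The spreadsheet way: one 2-D sum of products per channel, added together.
excel_style = (window[0] * kernel[0]).sum() + (window[1] * kernel[1]).sum()

# The "3D Excel" way: a single sum of products over the whole block.
single_formula = (window * kernel).sum()

# Then the rectified linear unit.
activation = max(0.0, single_formula)
```

Thinking of the kernel as one 2 by 3 by 3 object rather than two separate 3 by 3 grids is just a change of bookkeeping; the arithmetic is identical.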
00:59:45.460 | Okay, so here is my second layer, and so when people create different
00:59:51.600 | architectures right and architecture means
00:59:55.140 | like how big is your kernel at layer one, how many filters are in your kernel at layer one. So here
01:00:03.280 | I've got a 3 by 3,
01:00:05.100 | there's number one, and a 3 by 3, there's number two. So this architecture
01:00:11.180 | I've created starts off with two three by three convolutional kernels and
01:00:16.900 | then my
01:00:19.940 | Second layer has another two kernels of size two by three by three
01:00:25.900 | So there's the first one and then down here. Here's the second two by three by three kernel, okay, and so
01:00:32.960 | remember, any one of these numbers is an activation.
01:00:39.500 | Okay, so this activation is being calculated from these three things here and the other three things up there,
01:00:46.460 | and we're using this two by three by three
01:00:49.580 | kernel okay
01:00:52.020 | And so what tends to happen is people generally give names to their layers, so I say okay
01:00:57.780 | let's call this layer here conv1, and this layer here,
01:01:02.740 | and this,
01:01:06.860 | this layer here, conv2. So that's, you know,
01:01:11.680 | Generally, you'll just see that like when you print out a summary of a network every layer will have some kind of name
01:01:18.840 | Okay, and so then what happens next?
01:01:22.740 | Well, part of the architecture is, do you have some max pooling?
01:01:27.940 | Whereabouts does that max pooling happen? So in this architecture we're inventing, our next step
01:01:33.980 | is to do max pooling. Max pooling is a little hard to
01:01:38.660 | show in Excel, but we've got it.
01:01:41.980 | So if I do a two by two max pooling, it's going to halve the resolution, both height and width.
01:01:49.980 | So you can see here that I've replaced
01:01:52.660 | These four numbers
01:01:57.340 | with the maximum of those four numbers
01:02:00.740 | And because I'm halving the resolution, it only makes sense to actually have something every two cells.
01:02:05.980 | Okay, so you can see here the way. I've got kind of the same
01:02:11.500 | Looking shape as I had back here, okay, but it's now half the resolution because I've replaced every
01:02:17.860 | two by two
01:02:19.860 | With its max and you'll notice like it's not every possible two by two I skip over from here
01:02:25.620 | So this is like starting at BQ and then the next one starts at
01:02:32.380 | Right, so they're like non overlapping. That's why it's decreasing the resolution
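A minimal sketch of the two by two max pooling just described, written in plain Python instead of the spreadsheet. The function name `max_pool_2x2` and the grid values are invented for illustration; the point is only that each non-overlapping block of four numbers is replaced by its maximum, which halves both height and width.

```python
def max_pool_2x2(grid):
    """Non-overlapping 2x2 max pooling; grid is a list of rows with even height and width."""
    pooled = []
    for r in range(0, len(grid), 2):          # step by 2 so windows never overlap
        row = []
        for c in range(0, len(grid[0]), 2):
            window = (grid[r][c], grid[r][c + 1],
                      grid[r + 1][c], grid[r + 1][c + 1])
            row.append(max(window))           # keep only the largest of the four
        pooled.append(row)
    return pooled


# Illustrative 4x4 grid of activations (values are made up).
activations = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 6, 2, 8],
]
pooled = max_pool_2x2(activations)
print(pooled)  # [[4, 5], [6, 8]]
```

The 4 by 4 input becomes a 2 by 2 output, which is the "half the resolution" effect from the spreadsheet.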
01:02:36.540 | Okay, so anybody who's comfortable with spreadsheets
01:02:40.800 | You know you can open this and have a look and so after our max pooling
01:02:45.860 | There's a number of different things we could do next and I'm going to show you a kind of
01:02:56.620 | Classic old style approach nowadays in fact what generally happens nowadays is we do a max pool where we kind of like max across the
01:03:04.860 | entire size right
01:03:06.860 | But on older architectures and also on all the structured data stuff we do
01:03:11.400 | We actually do something called a fully connected layer, and so here's a fully connected layer
01:03:17.100 | I'm going to take every single one of these activations, and I'm going to give every single one of them a weight
01:03:24.980 | Right and so then I'm going to take over here
01:03:28.900 | here is the sum product of every one of the activations by every one of the weights for both of the
01:03:41.580 | Levels of my three-dimensional tensor right and so this is called a fully connected layer notice. It's different to a convolution
01:03:48.820 | I'm not going through a few at a time
01:03:50.860 | Right, but I'm creating a really big weight matrix right so rather than having a couple of little three by three kernels
01:03:58.380 | My weight matrix is now as big as the entire input
01:04:01.260 | And so as you can imagine
01:04:04.060 | architectures that make heavy use of fully connected layers can have a lot of weights,
01:04:11.860 | which means they can have trouble with overfitting, and they can also be slow. And so you're going to see a lot of
01:04:19.420 | an architecture called VGG, because it was the first kind of successful deeper architecture.
01:04:25.060 | It has up to 19 layers, and VGG
01:04:27.660 | actually contains a fully connected layer with 4,096 weights,
01:04:33.020 | sorry, 4,096
01:04:38.060 | activations, connected to a hidden layer with 4,096 activations, so you've got like 4,096 by
01:04:46.900 | 4,096, multiplied by, remember, the number of kind of kernels that we've calculated.
01:04:53.700 | so in VGG
01:04:56.540 | there's
01:04:58.940 | This I think it's like 300 million
01:05:01.260 | Weights of which something like 250 million of them are in these fully connected layers
01:05:07.740 | So we'll learn later on in the course about how we can kind of avoid using these big fully connected layers and behind the scenes
01:05:15.860 | all the stuff that you've seen us using, like ResNet and ResNeXt, none of them use very large
01:05:21.620 | fully connected layers. You had a question?
01:05:24.580 | Could you tell us more about, for example, if we had like three channels of input, what would be the
01:05:35.740 | shape of these filters? Right, so that's a great question.
01:05:41.500 | So if we had three channels of input it would look exactly like conv1 right conv1 kind of has two channels
01:05:49.740 | Right and so you can see with conv1. We had two channels so therefore our filters
01:05:55.820 | had to have like two channels per filter and so you could like
01:06:00.460 | Imagine that this input didn't exist you know and actually this was the input right so when you have a multi-channel input
01:06:08.140 | it just means that your filters look like this. And so images are often full color;
01:06:14.480 | they have three channels: red, green, and blue. Sometimes they also have an alpha channel.
01:06:19.020 | So however many you have, that's how many inputs you need. And so something which I know
01:06:24.660 | Yannette was playing with recently was using a full color ImageNet model
01:06:30.540 | in medical imaging for something called bone age calculations,
01:06:34.860 | which has a single channel. And so what she did was basically take the input,
01:06:40.940 | The single channel input and make three copies of it
01:06:44.820 | So you end up with basically like one two three versions of the same thing which is like
01:06:51.460 | It's kind of it's not ideal like it's kind of redundant information that we don't quite want
01:06:58.260 | But it does mean that then if you had something that expected a three channel
01:07:04.180 | convolutional filter
01:07:05.460 | You can use it right and so at the moment. There's a Kaggle competition for iceberg detection
01:07:11.820 | using
01:07:13.820 | Some funky satellite specific data format that has two channels
01:07:17.980 | So here's how you could do that you could
01:07:21.220 | Either copy one of those two channels into the third channel
01:07:25.100 | Or I think what people on Kaggle are doing is to take the average of the two
01:07:30.420 | Again, it's not ideal, but it's a way that you can use pre-trained networks
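A rough sketch of the channel tricks just mentioned, using NumPy (an assumption; the lecture shows no code for this). The array names and the tiny 4 by 4 images are hypothetical. It shows a single-channel image copied three times, and a two-channel image padded to three channels either by copying a channel or by averaging the two.

```python
import numpy as np

# Hypothetical single-channel image (channels, height, width).
gray = np.arange(16, dtype=np.float32).reshape(1, 4, 4)

# Make three copies so a model expecting 3-channel input can consume it.
gray_as_rgb = np.repeat(gray, 3, axis=0)                 # shape (3, 4, 4)

# Hypothetical two-channel satellite image: channel 0 is all 1s, channel 1 all 3s.
two_channel = np.stack([np.ones((4, 4)), 3 * np.ones((4, 4))]).astype(np.float32)

# Option 1: copy one of the two channels into the third slot.
copied = np.concatenate([two_channel, two_channel[:1]], axis=0)

# Option 2: use the average of the two channels as the third.
average = two_channel.mean(axis=0, keepdims=True)
averaged = np.concatenate([two_channel, average], axis=0)

print(gray_as_rgb.shape, copied.shape, averaged.shape)
```

Either way the third channel carries no new information, which is the "not ideal, but it lets you use pre-trained networks" trade-off.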
01:07:34.740 | Yeah, I've done a lot of
01:07:38.700 | fiddling around like that. I've actually done things where I wanted to use a
01:07:44.340 | three channel ImageNet network on four channel data. I had satellite data where the fourth channel was near infrared,
01:07:51.460 | and so basically I added an extra
01:07:57.380 | kind of
01:07:58.780 | level to my convolutional kernels that was all zeros, and so it basically started off by ignoring the near infrared band.
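One way to picture that trick, as a sketch under assumptions: the "pretrained" kernels below are random stand-ins, and the shapes (8 filters, 3 by 3) are invented. Appending an all-zero slice for the fourth input channel leaves every filter's response unchanged at the start of training, so the network begins by ignoring the extra band.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained first-layer kernels: (filters, input channels, height, width).
pretrained = rng.normal(size=(8, 3, 3, 3))

# Add a fourth input-channel slice of zeros for the extra band.
extra = np.zeros((8, 1, 3, 3))
extended = np.concatenate([pretrained, extra], axis=1)     # shape (8, 4, 3, 3)

# One 4-channel 3x3 patch of input; the fourth channel is the extra band.
patch = rng.normal(size=(4, 3, 3))

# A filter's response at one location is the sum product over the patch.
before = np.sum(pretrained * patch[:3], axis=(1, 2, 3))    # uses only the first 3 channels
after = np.sum(extended * patch, axis=(1, 2, 3))           # uses all 4 channels

print(np.allclose(before, after))  # True: the zero weights ignore the new channel
```

Training can then move those zeros away from zero if the extra band turns out to be useful.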
01:08:06.860 | And so what happens it basically and you'll see this next week is
01:08:11.380 | That rather than having these like carefully trained filters when you're actually training something from scratch
01:08:18.820 | We're actually going to start with random numbers
01:08:21.420 | That's actually what we do we actually start with random numbers
01:08:24.300 | And then we use this thing called stochastic gradient descent which we've kind of seen
01:08:28.140 | Conceptually to slightly improve those random numbers to make them less random and we basically do that again and again and again
01:08:35.460 | Okay, great. Let's take a seven-minute break, and we'll come back at 7:50.
01:08:41.820 | All right, so what happens next so we've got as far as
01:08:55.260 | Doing a
01:08:57.100 | fully connected layer, right? So the results of our max pooling layer got fed to a fully connected layer.
01:09:03.420 | And you might notice those of you that remember your linear algebra the fully connected layer is actually doing a classic
01:09:10.860 | traditional matrix product
01:09:13.260 | Okay, so it's basically just going through each pair in turn multiplying them together and then adding them up to do a matrix product
01:09:25.100 | In practice if we want to calculate which one of the 10 digits we're looking at
01:09:36.900 | This single number we've calculated isn't enough
01:09:42.900 | We would actually calculate
01:09:46.100 | 10 numbers so what we would have is rather than just having
01:09:50.860 | one set of
01:09:52.860 | Fully connected weights like this and I say set because remember. There's like a whole
01:09:58.340 | 3d kind of tensor of them we would actually need
01:10:02.460 | 10 of those
01:10:05.300 | Right so you can see that these tensors start to get a little bit
01:10:08.520 | high dimensional, right? And so this is where my patience with doing it in Excel ran out.
01:10:15.220 | But imagine that I had done this 10 times; I could now have 10 different numbers all being calculated here
01:10:21.660 | using exactly the same process, right? It would just be 10 of these
01:10:25.220 | fully connected
01:10:28.620 | two by n by n
01:10:32.540 | Arrays basically
01:10:37.620 | So then we would have 10 numbers being spat out, so what happens next?
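The fully connected step can be sketched in NumPy as follows. This is an illustration under assumptions, not the spreadsheet itself: the 2 by 4 by 4 shape and the random values are made up. It shows one sum product giving one number, ten sets of weights giving ten numbers, and that this is the same as a matrix product on the flattened activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the max pooling step: 2 channels, each 4x4.
pooled = rng.normal(size=(2, 4, 4))

# One set of fully connected weights: one weight per activation.
weights = rng.normal(size=(2, 4, 4))

# The "sum product": multiply pairwise, then add everything up.
single_output = np.sum(pooled * weights)

# Ten sets of weights give ten outputs, one per class.
weights_10 = rng.normal(size=(10, 2, 4, 4))
outputs = np.tensordot(weights_10, pooled, axes=3)         # shape (10,)

# The same thing as a classic matrix product on flattened activations.
outputs_matmul = weights_10.reshape(10, -1) @ pooled.reshape(-1)

print(outputs.shape, np.allclose(outputs, outputs_matmul))
```

Note how the weight count grows with the input size: 32 weights per output here, against 9 for a single 3 by 3 kernel slice.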
01:10:43.860 | So next up
01:10:45.580 | We can open up a different Excel
01:10:47.580 | worksheet
01:10:49.940 | entropy_example.xls, that's got two
01:10:52.540 | different
01:10:55.020 | worksheets. One of them is called softmax.
01:10:57.340 | And what happens here, and I'm sorry, I've changed domains: rather than predicting whether it's a number from nought to nine,
01:11:05.620 | I'm going to predict whether something is a cat, a dog, a plane, a fish, or a building. Okay, so out of that fully connected layer
01:11:13.660 | we've got, in this case, we'd have five numbers. And notice at this point
01:11:18.340 | there's no ReLU, okay? In the last layer there's no ReLU, so I can have negatives.
01:11:24.980 | Okay, so I want to turn these five numbers
01:11:30.140 | each into a probability. I want to turn it into a probability, from nought to one, that it's a cat,
01:11:37.380 | that it's a dog, that it's a plane, that it's a fish, that it's a building, and
01:11:42.220 | I want those probabilities to have a couple of characteristics first is that each of them should be between zero and one and
01:11:47.860 | The second is that they together should add up to one right? It's definitely one of these five things
01:11:54.380 | Okay, so to do that we use a different kind of activation function
01:11:59.420 | What's an activation function an activation function is a function that is applied to activations?
01:12:07.380 | So for example, max(0,
01:12:11.740 | something) is a
01:12:13.740 | function that I applied to an activation.
01:12:16.940 | So an activation function always takes in
01:12:20.420 | one number and spits out one number. So max(0, x)
01:12:26.540 | takes in a number x and spits out a number that depends on the value of x.
01:12:30.900 | That's all an activation function is. And if you remember back to that PowerPoint we saw in
01:12:41.420 | Lesson one,
01:12:43.420 | Each of our layers
01:12:48.020 | Was just a linear
01:12:50.300 | Function and then after every layer
01:12:53.860 | We said we needed some non-linearity
01:12:56.860 | Right because if you stack a bunch of
01:12:59.980 | linear layers together
01:13:02.780 | Right then all you end up with is a linear layer
01:13:05.260 | right
01:13:07.260 | If somebody's talking, could you not? It's slightly distracting. Thank you.
01:13:11.300 | If you stack a number of linear
01:13:15.460 | Functions together you just end up with a linear function and nobody does any cool deep learning with just linear functions
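A small numeric check of that claim, as a sketch with hand-picked matrices (the values of `x`, `W1`, and `W2` are invented). Two linear layers applied one after the other equal a single linear layer whose matrix is their product; putting max(0, x) in between breaks that equivalence.

```python
import numpy as np

x = np.array([1.0, -1.0])
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])      # first linear layer
W2 = np.array([[1.0, 1.0]])      # second linear layer

# Two linear layers with nothing in between...
stacked = W2 @ (W1 @ x)
# ...are exactly one linear layer whose weight matrix is W2 @ W1.
collapsed = (W2 @ W1) @ x

# With max(0, x) applied between the layers, the result differs.
with_relu = W2 @ np.maximum(0, W1 @ x)

print(stacked, collapsed, with_relu)  # [0.] [0.] [1.]
```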
01:13:22.540 | All right, but remember we also learned
01:13:24.700 | that by stacking linear functions
01:13:28.300 | With in between each one a non-linearity we could create like arbitrarily complex shapes
01:13:35.060 | And so the non-linearity that we're using after every hidden layer is a ReLU, a rectified linear unit. A
01:13:42.100 | non-linearity is an activation function; an
01:13:46.180 | activation function is a non-linearity, within deep learning. Obviously, there's lots of other non-linearities in the world, but in deep learning
01:13:55.140 | This is what we mean
01:13:57.460 | So an activation function is any function that takes some activation in that's a single number and spits out some new activation
01:14:05.220 | like max of 0 comma
01:14:07.220 | So I'm now going to tell you about a different activation function. It's slightly more complicated than
01:14:12.940 | ReLU, but not too much. It's called softmax.
01:14:16.660 | Softmax only ever occurs in the final layer, at the very end, and the reason why is that softmax always spits out
01:14:25.140 | numbers between 0 and 1, and it always spits out a bunch of numbers
01:14:33.300 | that add to one.
01:14:34.540 | So a softmax gives us what we want, right?
01:14:37.500 | in theory
01:14:40.140 | This isn't strictly necessary right like we could ask our neural net to learn a set of
01:14:47.260 | kernels
01:14:49.220 | Which have you know, which which give probabilities that line up as closely as possible with what we want
01:14:54.980 | But in general with deep learning if you can construct your architecture so that the desired
01:15:01.300 | characteristics are as easy to express as possible
01:15:04.420 | You'll end up with better models like they'll learn more quickly with less parameters
01:15:09.460 | So in this case, we know that our probabilities should end up being between 0 and 1
01:15:14.940 | We know that they should end up adding to one
01:15:17.780 | So if we construct an activation function, which always has those features
01:15:22.140 | Then we're going to make our neural network do a better job. It's going to make it easier for it
01:15:27.820 | It doesn't have to learn to do those things, because it'll all happen automatically.
01:15:31.140 | Okay, so in order to make this work
01:15:35.580 | We first of all have to get rid of all of the negatives
01:15:39.340 | Right, like we can't have negative probabilities
01:15:42.700 | So to make things not be negative, one way we could do it is just go e to the power of.
01:15:47.940 | Right. So here you can see my first step is to go exp of
01:15:52.300 | the previous one, right? And I think I've mentioned this before, but
01:15:57.740 | of all the math that you just need to be super familiar with to do deep learning,
01:16:02.500 | the one you really need is logarithms and exps, right? In all of deep learning and all of machine learning
01:16:09.500 | They appear all the time, right? So
01:16:12.300 | For example
01:16:16.100 | You absolutely need to know that
01:16:20.500 | log of
01:16:23.580 | x times y
01:16:26.140 | equals log of x
01:16:28.360 | plus log of y
01:16:31.980 | Right and like not just know that that's a formula that exists but have a sense of like what does that mean?
01:16:38.180 | Why is that interesting? Oh, I can turn multiplications into additions. That could be really handy, right and therefore
01:16:46.140 | log of x over y
01:16:48.860 | equals log of x minus log of y
01:16:55.260 | Again, that's going to come in pretty handy, you know rather than dividing I can just subtract things, right?
01:17:00.220 | And also remember that if I've got log of x
01:17:04.140 | equals y,
01:17:06.580 | then that means e to the y
01:17:08.580 | equals x. In other words,
01:17:11.780 | log and e to the power of are the inverse of each other.
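The three identities can be checked numerically with Python's standard `math` module. This is only a quick sanity check, assuming the natural logarithm; the values 4.0 and 2.8 are arbitrary positive numbers.

```python
import math

x, y = 4.0, 2.8   # any positive numbers work

# log turns multiplication into addition, and division into subtraction.
product_rule = math.isclose(math.log(x * y), math.log(x) + math.log(y))
quotient_rule = math.isclose(math.log(x / y), math.log(x) - math.log(y))

# log and exp undo each other.
round_trip = math.isclose(math.exp(math.log(x)), x)

print(product_rule, quotient_rule, round_trip)  # True True True
```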
01:17:18.980 | Okay, again, you need to really, really understand these things. So if you haven't spent much time with logs
01:17:26.180 | and exps for a while,
01:17:28.020 | try plotting them in Excel or a little notebook; get a sense of what shape they are, how they combine together.
01:17:34.420 | Just make sure you're really comfortable with them. So
01:17:37.240 | We're using it here, right?
01:17:40.620 | We're using it here. So one of the things that we know is e to the power of something is positive.
01:17:47.580 | Okay, so that's great. The other thing you'll notice about e to the power of something is, because it's a power,
01:17:52.860 | numbers that are slightly bigger than other numbers, like 4 is a little bit bigger than 2.8,
01:17:59.260 | when you go e to the power of it, it really accentuates that difference.
01:18:03.080 | Okay, so we're going to take advantage of both of these features for the purpose of deep learning. Okay, so we take
01:18:09.180 | the results of this fully connected layer, we go e to the power of for each of them,
01:18:16.420 | and then we're going to
01:18:19.060 | And then we're going to add them up
01:18:25.260 | Okay, so here is the sum of e to the power of.
01:18:29.540 | So then here
01:18:32.820 | we're going to take
01:18:34.460 | e to the power of, divided by the sum of e to the power of. So if you take
01:18:40.140 | all of these things divided by their sum, then by definition all of those things must add up to 1, and
01:18:47.420 | furthermore, since we're dividing by their sum,
01:18:52.060 | they must always vary between 0 and 1, because they're always positive.
01:18:57.100 | Alright, and that's it. So that's what softmax is
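Those steps translate almost line for line into Python. This is a minimal sketch: the five output values are hypothetical, and library implementations typically subtract the maximum before exponentiating for numerical stability, which is omitted here to mirror the spreadsheet.

```python
import math

def softmax(values):
    """exp of each value, divided by the sum of the exps."""
    exps = [math.exp(v) for v in values]      # step 1: everything becomes positive
    total = sum(exps)                         # step 2: add them up
    return [e / total for e in exps]          # step 3: divide each by the sum


# Hypothetical fully connected outputs for cat, dog, plane, fish, building.
outputs = [-1.2, 4.0, 2.8, 0.3, -3.5]
probs = softmax(outputs)

print(probs)
print(sum(probs))
```

Because of the exponent, the gap between 4.0 and 2.8 becomes a ratio of exp(1.2), roughly 3.3, between the two probabilities.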
01:19:00.740 | Okay, so I've got this kind of
01:19:06.020 | Doing random numbers each time right and so you can see like as I look through
01:19:11.460 | My softmax generally has quite a few things that are so close to 0 that they round down to 0 and you know
01:19:19.140 | Maybe one thing that's nearly 1, right? And the reason for that is what we just talked about: with the exp,
01:19:25.300 | Just having one number a bit bigger than the others tends to like push it out further, right?
01:19:31.780 | So even though my inputs here are random numbers between negative 5 and 5
01:19:36.420 | Right my outputs from the softmax don't really look that random at all in the sense that
01:19:42.460 | They tend to have one big number and a bunch of small numbers
01:19:49.460 | Now that's what we want
01:19:51.460 | Right. We want to say like in terms of like is this a cat a dog a plane a fish or a building
01:19:55.860 | We really want it to say like it's it's that you know
01:19:59.260 | It's it's a dog or it's a plane not like I don't know
01:20:04.180 | Okay, so softmax has lots of these cool
01:20:07.900 | Properties right it's going to return a probability that adds up to one and it's going to tend to want to pick one thing
01:20:15.700 | particularly strongly
01:20:18.660 | Okay, so that's softmax your net. Could you pass actually bust me up?
01:20:26.420 | How would we do something where, let's say, you have an image and you want to categorize it as both a cat and
01:20:33.460 | a dog, or as multiple things?
01:20:35.460 | What kind of function would we try to use?
01:20:38.540 | As it happens, we're going to do that right now.
01:20:43.740 | So we have to think about why we might want to do that, and one reason we might want to do that is to do
01:20:50.060 | multi-label
01:20:51.460 | classification. So we're looking now at the lesson2-image_models notebook, and specifically we're going to take a look at the
01:20:57.780 | Planet competition, a satellite imaging competition.
01:21:01.260 | Now the satellite imaging competition has
01:21:05.620 | some similarities to stuff we've seen before, right? So before we've seen cats versus dogs, and those images are a cat or a dog.
01:21:16.340 | They're not neither, they're not both, right? But the satellite imaging competition
01:21:21.860 | has images that look like this, and in fact every single one of the images is classified by weather.
01:21:29.600 | There's four kinds of weather one of which is haze and another of which is clear
01:21:34.940 | In addition to which there is a list of features that may be present including agriculture
01:21:41.860 | which is like some cleared area used for agriculture,
01:21:45.980 | Primary which means primary rainforest and water which means a river or a creek so here is a clear day
01:21:53.700 | Satellite image showing some agriculture some primary rainforest and some water features
01:22:00.020 | And here's one which is in haze and is entirely primary rainforest
01:22:05.300 | So in this case we're going to want to be able to
01:22:11.380 | predict multiple things, and so softmax wouldn't be good, because softmax doesn't like
01:22:17.640 | predicting multiple things. And I would definitely recommend
01:22:22.340 | Anthropomorphizing your activation functions right they have personalities
01:22:26.860 | Okay, and the personality of the softmax is it wants to pick a thing
01:22:31.780 | Okay, and people forget this all the time. I've seen many people even well regarded researchers in famous academic papers
01:22:41.480 | Using like softmax for multi-label classification it happens all the time, right?
01:22:47.480 | And it's kind of ridiculous because they're not
01:22:50.840 | understanding the personality of their activation function, so
01:22:56.200 | For multi-label classification where each sample can belong to one or more classes. We have to change a few things
01:23:03.980 | But here's the good news in fastai. We don't have to change anything
01:23:09.840 | Right so fastai will look at the labels in the CSV and if there is more than one label ever
01:23:17.840 | for any
01:23:20.720 | Item it will automatically switch into like multi-label mode
01:23:24.680 | So I'm going to show you how it works behind the scenes, but the good news is you don't actually have to care
01:23:30.180 | It happens anyway
01:23:32.560 | so if
01:23:34.560 | You have multi-label
01:23:39.280 | Images multi-label objects you obviously can't use the classic Keras style approach where things are in folders
01:23:47.120 | Because something can't conveniently be in multiple folders at the same time
01:23:52.380 | Right, so that's why you basically have to use the from_csv
01:23:59.200 | approach, right? So if we look at
01:24:06.720 | an example,
01:24:08.720 | actually, I'll take you through it, right? So we can say, okay,
01:24:14.840 | this is the CSV file containing our labels.
01:24:16.980 | This looks exactly the same as it did before, but rather than side-on, it's top-down.
01:24:22.400 | And top down I've mentioned before that it can do
01:24:25.820 | Vertical flips it actually does more than that there's actually eight possible symmetries for a square
01:24:31.520 | Which is it can be rotated through 90 180 270 or 0 degrees?
01:24:36.280 | And for each of those it can be flipped and if you think about it for a while you'll realize that that's a complete
01:24:42.600 | enumeration of everything that you can do
01:24:45.560 | in terms of symmetries to a square, so it's called the dihedral group of eight.
01:24:52.360 | So if you see in the code there's actually a transform called dihedral, that's why it's called that.
01:24:57.960 | So these transforms will basically do the full set of eight symmetric
01:25:04.520 | dihedral
01:25:06.160 | rotations and flips
01:25:08.160 | Plus everything which we can do to dogs and cats you know small 10-degree rotations little bit of zooming
01:25:14.920 | a little bit of contrast and brightness adjustment
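The eight symmetries can be enumerated with NumPy as a sketch. This is not the library's own dihedral transform, just an illustration with a hypothetical helper `dihedral_variants`: four rotations, each with and without a flip.

```python
import numpy as np

def dihedral_variants(img):
    """Return the 8 symmetries of a square array: 4 rotations, each with and without a flip."""
    variants = []
    for k in range(4):                    # rotate by 0, 90, 180, 270 degrees
        rotated = np.rot90(img, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants


# A small square with all-distinct values, so every symmetry looks different.
square = np.arange(9).reshape(3, 3)
variants = dihedral_variants(square)
distinct = {v.tobytes() for v in variants}

print(len(variants), len(distinct))  # 8 8
```

A vertical flip is already in the set, since it equals a 180 degree rotation followed by a horizontal flip.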
01:25:17.680 | So these images are of size 256 by 256
01:25:21.880 | So I just created a little function here to let me quickly grab you know a
01:25:26.760 | Data loader of any size so here's a 256 by 256
01:25:31.880 | Once you've got a data object inside it
01:25:36.000 | we've already seen that there's things called val_ds, test_ds, trn_ds.
01:25:41.000 | They're things that you can just index into and grab a particular image, so you just use square brackets zero.
01:25:46.560 | You'll also see that all of those things have a dl; that's a data loader.
01:25:50.920 | So ds is dataset, dl is data loader. These are concepts from PyTorch,
01:25:55.680 | So if you google pytorch data set or pytorch data loader
01:25:59.600 | You can basically see what it means, but the basic idea is a data set gives you a single image or a single
01:26:06.880 | object back a data loader gives you back a mini-batch and
01:26:10.720 | Specifically it gives you back a transformed mini-batch, so that's why when we create our
01:26:17.320 | data object we can pass in num_workers and
01:26:21.560 | transforms: how many processes do you want to use, what transforms
01:26:26.080 | do you want. And so with a data loader you can't ask for an individual image;
01:26:31.320 | you can only get back a mini-batch, and you can't get back a particular mini-batch,
01:26:36.160 | you can only get back the next mini-batch. So something we can do is loop through,
01:26:41.560 | grabbing a mini-batch at a time, and so in Python
01:26:45.420 | the thing that does that is called a generator, right, or an iterator. They're slightly different versions
01:26:51.600 | of the same thing. So to turn a data loader into an iterator, you use the standard Python function called iter.
01:26:57.360 | That's a Python function, just a regular part of the Python
01:27:00.860 | basic language, that returns an iterator, and an iterator is something you can pass to the standard
01:27:08.920 | Python
01:27:11.080 | function or statement next, and that just says give me another batch from this iterator.
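The `iter` and `next` pattern works on any Python iterable, so it can be tried without PyTorch. In this sketch `TinyLoader` is a hypothetical stand-in for a data loader, not a real library class: it hands back mini-batches in order and offers no way to ask for a specific one.

```python
class TinyLoader:
    """Hypothetical stand-in for a data loader: yields (images, labels) mini-batches in order."""

    def __init__(self, images, labels, batch_size):
        self.images = images
        self.labels = labels
        self.batch_size = batch_size

    def __iter__(self):
        # A generator: each pass through the loop hands back the next mini-batch.
        for start in range(0, len(self.images), self.batch_size):
            stop = start + self.batch_size
            yield self.images[start:stop], self.labels[start:stop]


loader = TinyLoader(images=list(range(10)), labels=list("abcdefghij"), batch_size=4)

batches = iter(loader)        # turn the loader into an iterator
x, y = next(batches)          # first mini-batch, unpacked into images and labels
x2, y2 = next(batches)        # calling next again gives the following mini-batch

print(x, y)    # [0, 1, 2, 3] ['a', 'b', 'c', 'd']
print(x2, y2)  # [4, 5, 6, 7] ['e', 'f', 'g', 'h']
```

With a real data loader the same two calls apply; only the contents of each batch differ.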
01:27:19.280 | So this is one of the things I really like about PyTorch: it really leverages
01:27:24.160 | modern Python's
01:27:26.760 | kind of stuff. You know, in TensorFlow they invent their whole new world of ways of doing things,
01:27:33.560 | and so it's kind of more,
01:27:36.680 | in a sense it's more like cross-platform, but in another sense it's not a good fit to any platform.
01:27:42.880 | So it's nice: if you know Python well,
01:27:47.880 | PyTorch comes very naturally. If you don't know Python well, PyTorch is a good reason to learn Python well. A
01:27:54.800 | PyTorch nn.Module, a neural network module, is a standard Python class, for example.
01:28:02.240 | So any work you put into learning Python better will pay off with PyTorch. So here I am using standard
01:28:08.720 | Python
01:28:10.480 | iter and next to grab my next mini-batch
01:28:15.040 | from the validation set's data loader, and that's going to return two things.
01:28:18.720 | It's going to return the images in the mini-batch and the labels in the mini-batch so standard Python approach
01:28:24.500 | I can pull them apart like so and so here is
01:28:28.760 | one mini-batch of labels
01:28:31.520 | And so not surprisingly since I said that my batch size
01:28:42.480 | Actually, it's the batch size by default is 64 so I didn't pass in a batch size
01:28:48.240 | So just remember shift tab to see like what are the things you can pass and what are the defaults so by default?
01:28:54.920 | My batch size is 64, so I've got back something of size 64 by
01:28:59.720 | 17 so there are 17 of the possible
01:29:03.080 | classes right
01:29:05.960 | So let's take a look at the
01:29:09.560 | zeroth
01:29:11.880 | Set of labels so the zeroth images labels
01:29:14.480 | So I can zip, again a standard Python thing: it takes two lists and combines them, so you get the zeroth thing from the first
01:29:22.840 | list with the zeroth thing from the second list, and the first thing from the first list with the first thing from the second list, and so
01:29:29.200 | Forth so I can zip them together and that way I can find out
01:29:32.640 | For the zeroth image in the validation set it's agriculture
01:29:37.720 | It's clear
01:29:40.040 | It's primary rainforest. It's slash and burn. It's water
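Here is the zip idea in isolation, as a sketch with invented data: `classes` is a hypothetical five-name subset of the 17 classes and `labels_row` is a made-up row of 0/1 labels for one image.

```python
# Hypothetical subset of class names and one row of 0/1 labels for a single image.
classes = ["agriculture", "clear", "haze", "primary", "water"]
labels_row = [1, 1, 0, 1, 1]

# zip pairs the zeroth name with the zeroth label, the first with the first, and so on.
pairs = list(zip(classes, labels_row))

# Keep only the names whose label is switched on.
present = [name for name, flag in pairs if flag == 1]

print(pairs[0])  # ('agriculture', 1)
print(present)   # ['agriculture', 'clear', 'primary', 'water']
```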
01:29:44.560 | okay, so as you can see here, this is a
01:29:48.800 | multi label
01:29:51.320 | You see here's a way to do multi label classification
01:29:54.120 | So by the same token right if we go back to our single label classification
01:30:01.960 | is it a cat, dog, plane, fish, or building.
01:30:03.960 | Behind the scenes, we haven't actually looked at it, but behind the scenes
01:30:09.080 | fastai and PyTorch are turning our labels into something called one hot encoded
01:30:16.800 | Labels and so if it was actually a dog then the actual values
01:30:21.400 | Would be like that right so these are like the actuals
01:30:26.760 | Okay, so do you remember at the very end of Otavio's video,
01:30:31.800 | he showed how the template had to match to one of the like five A, B, C, D or E templates?
01:30:37.640 | And so what it's actually doing is it's comparing
01:30:41.440 | When I said it's basically doing a dot product. It's actually a fully connected layer at the end right that calculates an
01:30:48.520 | output activation that goes through a softmax and
01:30:53.360 | Then the softmax is compared to the one hot encoded label right so if it was a dog there would be a one here
01:31:02.800 | And then we take the difference between the actuals and the softmax
01:31:07.520 | activations, and add up those differences to say how much error there is, essentially.
01:31:13.280 | We're skipping over something called a loss function that we'll learn about next week, but essentially we're basically doing that.
01:31:19.260 | Now if it's one hot encoded, like if there's only one thing which has a one in it,
01:31:27.620 | then actually storing it as 0 1 0 0 0 is
01:31:32.800 | terribly inefficient
01:31:34.720 | Right, like we can basically say what is the index of each of these things,
01:31:38.860 | right? So we can say it's like 0, 1, 2, 3, 4, like so, right? And so rather than storing it as 0 1
01:31:47.400 | 0 0 0, we actually just store the index value.
01:31:52.160 | Right, so if you look at the y values for the cats and dogs competition or the dog breeds competition,
01:32:00.400 | you won't actually see a big list of ones and zeros like this. You'll see a single integer,
01:32:05.340 | right, which is like, what class index is it? Right, and
01:32:09.680 | internally,
01:32:12.160 | inside PyTorch, it will actually turn that into a one hot encoded vector, but like you will literally never see it.
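The two representations carry the same information, which a few lines of plain Python can show. This is a sketch: `one_hot` is a hypothetical helper written for illustration, not a library function.

```python
classes = ["cat", "dog", "plane", "fish", "building"]

def one_hot(index, num_classes):
    """Expand a class index into a list of zeros with a single one."""
    return [1 if i == index else 0 for i in range(num_classes)]


dog_index = classes.index("dog")                  # stored compactly as a single integer
dog_one_hot = one_hot(dog_index, len(classes))    # the expanded form
recovered = dog_one_hot.index(1)                  # and back to the index again

print(dog_index, dog_one_hot, recovered)  # 1 [0, 1, 0, 0, 0] 1
```

In the multi-label case more than one entry can be 1, so the single-integer shortcut no longer applies and the full row of ones and zeros has to be stored.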
01:32:19.320 | Okay, and PyTorch has different loss functions where you basically say
01:32:26.600 | this thing is one hot encoded or this thing is not, and it uses different loss functions.
01:32:31.400 | That's all hidden by the fastai library, right, so you don't have to worry about it.
01:32:37.260 | But the cool thing to realize is that this approach for multi-label encoding with these ones and zeros,
01:32:45.920 | behind the scenes the exact same thing happens for single-label classification.
01:32:54.760 | Does it make sense to change the pickiness of the softmax function by changing the base?
01:33:01.080 | No, because when you change the base,
01:33:05.880 | more math:
01:33:09.400 | log base a of b
01:33:14.200 | equals
01:33:17.040 | log b over
01:33:19.040 | log a,
01:33:22.080 | so changing the base is just a linear scaling, and
01:33:25.200 | linear scaling is something which the neural net can learn very easily.
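That answer can be checked numerically. In this sketch `softmax_base` is a hypothetical helper and the output values are invented: a softmax in base 10 gives the same probabilities as the usual base e softmax applied to inputs multiplied by ln(10), so a change of base is equivalent to rescaling the previous layer's outputs.

```python
import math

def softmax_base(values, base):
    """Softmax using base**v in place of e**v."""
    powers = [base ** v for v in values]
    total = sum(powers)
    return [p / total for p in powers]


outputs = [-1.2, 4.0, 2.8, 0.3, -3.5]

# Base 10 on the original outputs...
with_base_10 = softmax_base(outputs, 10)

# ...matches base e on the outputs multiplied by ln(10).
scaled = [v * math.log(10) for v in outputs]
with_base_e = softmax_base(scaled, math.e)

print(all(math.isclose(a, b) for a, b in zip(with_base_10, with_base_e)))  # True
```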
01:33:31.240 | Good question
01:33:37.960 | Okay, so here is that image right here is the image with slash and burn water etc etc
01:33:46.380 | One of the things to notice here is like when I first displayed this image it was
01:33:51.560 | So washed out I really couldn't see it right but remember images
01:33:58.680 | Now you know we know images are just
01:34:01.480 | Matrices of numbers and so you can see here. I just said times 1.4
01:34:06.280 | Just to make it more visible right so like now that you're kind of it's the kind of thing
01:34:12.480 | I want you to get familiar with is the idea that this stuff you're dealing with they're just matrices of numbers
01:34:17.400 | Then you can fiddle around with them, so if you're looking at something like oh, it's a bit washed out
01:34:21.480 | You can just multiply it by something to
01:34:23.480 | Brighten it up a bit okay, so here. We can see I guess this is the slash and burn
01:34:28.760 | Here's the river. That's the water. Here's the primary rainforest. Maybe that's the agriculture and so forth okay, so
01:34:36.640 | So you know with all that background how do we actually use this?
01:34:44.840 | Exactly the same way as everything we've done before, right? So, you know, size, and
01:34:49.760 | the interesting thing about playing around with this Planet competition is that these images are not at all like ImageNet, and I
01:34:58.600 | would guess that the vast majority of the stuff that the vast majority of you do
01:35:03.560 | involving convolutional neural nets
01:35:06.520 | won't actually be anything like ImageNet. You know, it'll be medical imaging,
01:35:13.400 | or it'll be like classifying different kinds of steel tube, or figuring out whether a weld,
01:35:19.520 | you know, is going to break or not, or looking at satellite images, or, you know, whatever, right? So
01:35:27.080 | It's it's good to experiment with stuff like this planet
01:35:32.640 | Competition to get a sense of kind of what you want to do and so you'll see here
01:35:37.480 | I start out by resizing my data to 64 by 64
01:35:42.880 | It starts out at 256 by 256 right now
01:35:46.320 | I wouldn't want to do this for the cats and dogs competition because in the cats and dogs competition
01:35:51.120 | We start with a pre-trained ImageNet network. It starts off nearly perfect
01:35:57.440 | Right so if we resized everything to 64 by 64 and then retrained the whole set
01:36:03.840 | We basically destroy the weights that are already pre trained to be very good
01:36:09.360 | Remember image net most image net models are trained at either 224 by 224 or
01:36:14.400 | 299 by 299 right so if we like retrain them at 64 by 64. We're going to we're going to kill it on the other hand
01:36:22.840 | There's nothing in image net that looks anything like this
01:36:26.560 | You know there's no satellite images
01:36:29.200 | So the only useful bits of the ImageNet network for us
01:36:35.600 | Are kind of layers like this one
01:36:38.800 | You know finding edges and gradients and this one you know finding kind of textures and repeating patterns
01:36:45.160 | And maybe these ones of kind of finding more complex textures, but that's probably about it right so
01:36:54.280 | so in other words
01:36:56.680 | You know starting out by training very small images
01:37:00.560 | Works pretty well when you're using stuff like satellites
01:37:04.160 | So in this case I started right back at 64 by 64
01:37:07.080 | grabbed some data
01:37:09.960 | Built my model found out what learning rate to use interestingly it turned out to be quite high
01:37:17.520 | It seems that because like it's so unlike image net I
01:37:23.960 | Needed to do quite a bit more fitting with just that last layer before it started to flatten out
01:37:30.840 | Then I unfroze it, and again, this is the difference to
01:37:34.380 | Image net like
01:37:37.760 | Data sets is my learning rate in the initial layer
01:37:41.760 | I set to divided by 9 the middle layers I set to divided by 3
01:37:45.640 | Whereas for stuff that's like ImageNet I had a multiple of 10 for each of those
01:37:51.160 | You know again the idea being that the earlier layers
01:37:55.000 | Probably are not as close to what they need to be compared to the image net
01:38:01.000 | like data sets
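As a rough sketch of the two learning rate schemes just described: one rate per layer group, earliest group first. The base rate of 0.2 and the helper name are assumptions for illustration, not values taken from the notebook.

```python
def layer_group_lrs(lr, divisors):
    """Return one learning rate per layer group, earliest group first.
    `divisors` says how much smaller each group's rate is than `lr`."""
    return [lr / d for d in divisors]

lr = 0.2  # assumed base learning rate, for illustration only

# ImageNet-like data: a multiple of 10 between groups.
lrs_imagenet_like = layer_group_lrs(lr, [100, 10, 1])

# Satellite data: early layers need to change more, so divide by 9 and 3.
lrs_satellite = layer_group_lrs(lr, [9, 3, 1])
```

The last entry is always the full rate, which applies to the newly added layers at the end.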
01:38:03.000 | So again unfreeze train for a while
01:38:06.160 | And you can kind of see here. You know there's cycle one. There's cycle two. There's cycle three
01:38:13.060 | And then I kind of doubled the size of my images
01:38:17.880 | Fit for a while
01:38:20.720 | Unfreeze fit for a while double the size of the images again fit for a while unfreeze fit for a while
01:38:26.640 | And then add TTA and so as I mentioned last time we looked at this this process ends up
01:38:31.920 | You know getting us about 30th place in this competition
01:38:35.180 | Which is really cool because people you know a lot of very very smart people
01:38:39.520 | Just a few months ago worked very very hard on this competition
01:38:43.000 | Couple of things people have asked about one is
01:38:49.160 | What does this data.resize do
01:38:57.120 | Couple of different pieces here the first is that when we say
01:39:00.680 | Back here
01:39:04.960 | What transforms do we apply and here's our transforms we actually pass in a size right?
01:39:10.840 | So one of the things that the data loader does is to resize the images like on demand every time
01:39:17.720 | It sees them
01:39:19.720 | This has got nothing to do with that dot resize method right so
01:39:24.900 | This is the thing that happens at the end, like whatever's passed in, before our data
01:39:30.580 | Loader spits it out, it's going to resize it to this size
01:39:33.380 | If the initial input is like a thousand by a thousand
01:39:39.100 | Reading that JPEG and resizing it to 64 by 64
01:39:44.560 | Turns out to actually take more time than training the convnet does for each batch
01:39:50.940 | Right so basically all resize does is it says hey
01:39:55.820 | I'm not going to be using any images bigger than size times 1.3
01:40:00.260 | So just go through once and create new JPEGs of this size
01:40:05.900 | Right and and they're rectangular right so new JPEGs where the smallest
01:40:11.100 | Edge is of this size and again. It's like you never have to do this
01:40:16.180 | There's no reason to ever use it if you don't want to it's just a speed up
01:40:20.860 | okay, but if you've got really big images coming in it saves you a lot of time and you'll often see on like Kaggle kernels or
01:40:27.580 | forum posts or whatever people will have like
01:40:30.900 | Bash scripts stuff like that to like loop through and resize images to save time you never have to do that right just you can
01:40:39.500 | Just say dot resize and it'll just
01:40:41.980 | Create you know once off it'll go through and create that if it's already there
01:40:47.180 | It'll use the resized ones for you. Okay, so it's just a it's just a
01:40:51.820 | Speed up convenience function no more
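To make the one-off resize idea concrete, here is a sketch of how the size of the cached copies could be worked out. The function name, the rounding, and the rule of leaving already-small images alone are all assumptions; only the "size times 1.3" factor and the "smallest edge" rule come from the description above.

```python
def presize_dims(width, height, sz, factor=1.3):
    """Dimensions for a cached copy whose smallest edge is int(sz * factor).
    The aspect ratio is kept, so the result can be rectangular.
    Images that are already small enough are left unchanged (an assumption)."""
    target = int(sz * factor)
    smallest = min(width, height)
    if smallest <= target:
        return width, height
    scale = target / smallest
    return round(width * scale), round(height * scale)

square = presize_dims(1000, 1000, 64)   # big square image
wide = presize_dims(2000, 1000, 64)     # big rectangular image
small = presize_dims(50, 40, 64)        # already small, untouched
```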
01:40:55.180 | Okay, so for those of you that are kind of past dog breeds
01:41:05.620 | Would be looking at planet
01:41:07.620 | Next you know like track like play around
01:41:13.460 | With trying to get a sense of like how can you get this as an accurate model?
01:41:17.380 | One thing to mention, and I'm not really going to go into it in detail
01:41:21.580 | It's nothing to do with deep learning particularly is that I'm using a different metric. I didn't use metrics equals accuracy
01:41:28.140 | But I said metrics equals f2
01:41:30.820 | Just remember from last week that confusion matrix that like two by two you know correct incorrect for each of dogs and cats
01:41:43.180 | There's a lot of different ways you could turn that confusion matrix into a score
01:41:49.100 | You know do you care more about false negatives, or do you care more about false positives, and how do you weight them?
01:41:54.780 | And how do you combine them together right?
01:41:56.780 | There's basically a function called f beta
01:42:01.300 | Where the beta says how much do you weight false negatives versus false positives and so f2
01:42:08.540 | Is f beta with beta equals 2 and it's basically a particular way of weighting false negatives and false positives
01:42:15.620 | And the reason we use it is because Kaggle told us that planet who were running this competition
01:42:20.300 | Wanted to use this particular
01:42:23.100 | f-beta metric
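For reference, the standard F-beta definition can be written directly from confusion-matrix counts. This is a minimal sketch, not the code used in the course; the counts below are invented, and it assumes at least one true positive so that nothing divides by zero.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta from true positives, false positives and false negatives.
    With beta > 1, recall (i.e. avoiding false negatives) counts for more."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented counts: 8 correct positives, 2 false alarms, 4 misses.
f1 = f_beta(8, 2, 4, beta=1)
f2 = f_beta(8, 2, 4, beta=2)
```

With these counts there are more misses than false alarms, so f2 comes out lower than f1.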
01:42:25.540 | The important thing for you to know is that you can create
01:42:30.060 | Custom metrics so in this case you can see here
01:42:32.820 | It says from planet import f2 and really I've got this here so that you can see how to do it
01:42:38.260 | Right so if you look inside
01:42:40.260 | courses/dl1
01:42:45.220 | You can see there's something called planet.py
01:42:49.180 | Right and so if I look at planet.py
01:42:52.820 | you'll see there's a
01:42:55.540 | function there called f2
01:42:57.980 | right and so f2
01:43:00.980 | simply calls fbeta_score from
01:43:04.920 | scikit
01:43:07.100 | Or scipy, I can't remember where it came from
01:43:09.220 | And does a couple of little tweaks that aren't particularly important
01:43:13.900 | But the important thing is like you can write any metric you like right as long as it takes in
01:43:21.620 | set of predictions and
01:43:24.380 | a set of targets
01:43:26.180 | They're both going to be numpy arrays one-dimensional numpy arrays, and then you return back a number
01:43:32.380 | Okay, and so as long as you create a function that takes two vectors and returns a number
01:43:37.940 | You can call it as a metric and so then when we said
01:43:42.220 | Learn metrics equals and then passed in that array which just contains a single function f2
01:43:55.980 | Then it's just going to be printed out
01:43:58.260 | After every epoch for you, okay, so in general like the the fast AI library
01:44:04.020 | Everything is customizable so kind of the idea is that everything is
01:44:09.540 | Everything is
01:44:13.940 | Kind of gives you what you might want by default, but also everything can be changed as well
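A custom metric in the sense described above is just a function of two arrays that returns a number. The sketch below is a made-up example of that shape (the function name and the 0.5 threshold are assumptions); per the description, a function like this would then be passed in the metrics list.

```python
import numpy as np

def accuracy_at_half(preds, targs):
    """Hypothetical metric: fraction of predictions that fall on the
    same side of 0.5 as their targets."""
    return float(((preds > 0.5) == (targs > 0.5)).mean())

preds = np.array([0.9, 0.2, 0.6, 0.4])   # made-up predictions
targs = np.array([1.0, 0.0, 0.0, 1.0])   # made-up targets
score = accuracy_at_half(preds, targs)
```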
01:44:21.260 | Yes, you know
01:44:24.900 | We have a little bit of confusion about the difference between
01:44:27.780 | multi label and
01:44:30.940 | Just single label. Uh-huh. Do you by any chance have an example in which you compute?
01:44:35.580 | similarly to the example of the
01:44:38.180 | They just show us. Oh, I didn't get to that activation function. Yeah, so
01:44:43.700 | So I'm so sorry. I said I'd do that and then I didn't so the activation the output activation function for a single label
01:44:53.100 | Classification is softmax for all the reasons that we talked about
01:44:56.380 | but if we were trying to predict something that was like
01:45:00.380 | 00110
01:45:03.700 | Then softmax would be a terrible choice because it's very hard to come up with something where both of these are high
01:45:09.860 | In fact, it's impossible because they have to add up to one. So the closest they could be would be 0.5
01:45:14.940 | so for multi label classification
01:45:18.920 | activation function is called
01:45:22.260 | Sigmoid okay, and again the fast AI library does this automatically for you if it notices you have a multi label
01:45:30.100 | Problem and it does that by checking your data set to see if anything has more than one label applied to it
01:45:36.700 | and so sigmoid is a function which is equal to
01:45:41.740 | It's basically the same thing
01:45:44.900 | Except rather than we never add up
01:45:48.660 | All of these x's but instead we just take this x and we say it's just equal to it
01:45:54.460 | divided by
01:45:57.540 | 1 plus
01:46:02.260 | And so the nice thing about that is that now like multiple things can be high at once
01:46:12.020 | Right and so generally then if something is less than zero its sigmoid is going to be less than 0.5
01:46:20.300 | If it's greater than 0 its sigmoid is going to be greater than 0.5
01:46:24.500 | And so the important thing to know about a sigmoid function is that its shape
01:46:36.420 | Something which asymptotes the top to one and asymptotes. Oh, I drew that
01:46:42.660 | Asymptotes at the bottom
01:46:48.300 | To zero and so therefore it's a good thing to model a probability with
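The difference between the two activation functions shows up clearly on a small example. This is a sketch using plain Python; the five input values are invented so that the third and fourth items should both be "on", matching the 0 0 1 1 0 target mentioned above.

```python
import math

def softmax(xs):
    """Each exp divided by the sum of all exps, so outputs add up to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Each exp divided by 1 plus itself, independent of the other items."""
    e = math.exp(x)
    return e / (1 + e)

logits = [-3.0, -2.0, 4.0, 4.0, -3.0]   # invented activations
soft = softmax(logits)
sig = [sigmoid(x) for x in logits]
```

Under softmax the two large items have to share the total, so neither can reach 0.5, whereas under sigmoid both can be close to 1 at once.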
01:46:54.100 | Anybody who has done any?
01:46:57.380 | logistic regression
01:47:00.660 | Will be familiar with this is what we do in logistic regression
01:47:04.420 | So it kind of appears everywhere in machine learning, and you'll see that kind of a sigmoid and a softmax. They're very close
01:47:10.820 | to each other
01:47:13.500 | Conceptually, but this is what we want as our activation function for multi label
01:47:18.420 | And this is what we want for single label and again fast AI does it all for you. There was a question over here. Yes
01:47:31.340 | have a question about
01:47:33.140 | The initial training that you do if I understand correctly you have we have frozen the
01:47:38.580 | The pre-trained model and you only did initially try to train the latest
01:47:45.700 | Layer, right? Right
01:47:48.860 | But from the other hand we said that only the initial layer
01:47:53.500 | So let's last probably the first layer is like important to us and the other two
01:47:59.340 | Are more like features that are ImageNet related and we didn't apply in this case. Well, it's that they
01:48:04.540 | The layers are very important
01:48:07.900 | But the pre-trained weights in them aren't so it's the later layers that we really want to train the most
01:48:15.500 | so earlier layers
01:48:17.700 | Likely to be like already
01:48:19.740 | Closer to what we want
01:48:22.620 | Okay, so you start with the latest one and then you go right so if you go back to our quick dogs and cats
01:48:28.260 | right
01:48:30.140 | when we create a model from pre trained from a pre trained model it returns something where all of the convolutional layers are frozen and
01:48:38.020 | some randomly set
01:48:40.900 | Fully connected layers we add to the end
01:48:43.780 | Unfrozen and so when we go fit
01:48:47.100 | But first it just trains
01:48:49.980 | The randomly set a randomly initialized fully connected layers, right?
01:48:56.620 | And if something is like really close to image net that's often all we need
01:49:02.220 | Because the earlier layers are already good at finding edges gradients repeating patterns for
01:49:10.020 | ears and dogs heads
01:49:12.820 | So then when we unfreeze
01:49:17.180 | We set the learning rates for the early layers to be really low
01:49:22.020 | Because we don't want to change them much whereas the later ones we set them to be higher
01:49:26.940 | Whereas for satellite data
01:49:29.740 | right
01:49:31.860 | This is no longer true. You know the early layers are still like
01:49:35.420 | Better than the later layers, but we still probably need to change them quite a bit
01:49:41.380 | So that's why this learning rate is nine times smaller than the final learning rate rather than a thousand times smaller
01:49:50.980 | than the final learning rate
01:49:52.980 | Okay, so you play with with the weights of the layers with the learning rates. Yeah, normally
01:49:58.780 | Most of the stuff you see online if they talk about this at all, they'll talk about unfreezing
01:50:05.000 | different subsets of layers
01:50:07.620 | And indeed we do unfreeze our randomly generated ones
01:50:11.780 | But what I found is although in the fast AI library you can type learn.freeze_to and just freeze a subset of layers
01:50:20.140 | this approach of using differential learning rates seems to be like
01:50:23.780 | More flexible to the point that I never find myself unfreezing subsets of layers
01:50:29.700 | So but but I don't understand is that I would expect you to start with that
01:50:33.540 | with a differential the different
01:50:36.620 | Learning rates rather than trying to learn the last layer. So the reason okay, so you could skip
01:50:47.180 | Training just the last layers and just go straight to differential learning rates
01:50:51.060 | But you probably don't want to the reason you probably don't want to is that there's a difference the convolutional layers all contain
01:50:58.980 | Pre trained weights, so they're like they're not random for things that are close to image net
01:51:05.260 | They're actually really good for things that are not close to image net. They're better than nothing
01:51:09.980 | All of our fully connected layers, however are totally random
01:51:16.260 | So therefore you would always want to make the fully connected weights better than random by training them a bit first
01:51:23.020 | Because otherwise if you go straight to unfreeze
01:51:26.460 | Then you're actually going to be like fiddling around with those early layer weights when the later ones are still random
01:51:35.060 | That's probably not what you want. I
01:51:37.060 | Think there's another question here
01:51:39.060 | So when we unfreeze
01:51:43.420 | What are the things we're trying to change there?
01:51:48.140 | will it change the
01:51:51.300 | kernels themselves
01:51:53.460 | That that's always what SGD does. Yeah, so the only thing
01:51:58.340 | what training means is
01:52:00.980 | setting these numbers
01:52:04.380 | right and
01:52:07.300 | These numbers and
01:52:10.700 | These numbers the weights
01:52:16.460 | so the weights are the weights of the fully connected layers and
01:52:20.820 | The weights in those kernels in the convolutions. So that's what training means
01:52:26.140 | It's and we'll learn about how to do it with SGD. But training literally is setting those numbers
01:52:32.500 | these numbers on the other hand
01:52:35.940 | Activations they're calculated. They're calculated from the weights and the previous layers
01:52:42.660 | activations or inputs
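The distinction between weights and activations can be sketched with a tiny fully connected layer. The numbers are invented, and the layer has no bias term to keep the example short; the point is only that the weights are the stored numbers that training sets, while the activations are recomputed from them every time.

```python
def relu(x):
    """Rectified linear unit: negatives become zero."""
    return max(0.0, x)

def dense_layer(inputs, weights):
    """Activations of a fully connected layer followed by a ReLU.
    `weights` has one row per output; each row is dotted with the inputs."""
    return [relu(sum(w * x for w, x in zip(row, inputs))) for row in weights]

inputs = [1.0, 2.0]                    # inputs, or the previous layer's activations
weights = [[0.5, -1.0], [1.0, 1.0]]    # the numbers that training would set
activations = dense_layer(inputs, weights)
```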
01:52:45.780 | I have a question. So can you lift that up higher and speak loudly? So in your example of training the satellite image
01:52:52.980 | Example so you start with very small size, 64
01:52:57.340 | Yeah, so does it literally mean that you know the model takes a small area from the entire image?
01:53:03.300 | That is 64 by 64
01:53:05.420 | So how do we get that 64 by 64 depends on?
01:53:09.700 | the transforms
01:53:12.260 | by default our transform takes the smallest edge and
01:53:18.340 | Resize zooms the whole thing out
01:53:21.860 | Resamples it so the smallest edge is the size 64 and then it takes a center crop
01:53:27.260 | of that, okay, although
01:53:32.020 | When we're using data augmentation it actually takes a randomly chosen
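The default behaviour described in the answer (resize so the smallest edge matches the size, then take a center crop) can be sketched as plain arithmetic. The function name and the rounding are assumptions; no image library is involved, the function only reports the resized dimensions and the crop box.

```python
def resize_then_center_crop(width, height, sz):
    """Scale so the smallest edge equals `sz`, then return the resized
    dimensions and the (left, top, right, bottom) box of a centered
    sz x sz crop."""
    scale = sz / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - sz) // 2
    top = (new_h - sz) // 2
    return (new_w, new_h), (left, top, left + sz, top + sz)

square_dims, square_box = resize_then_center_crop(256, 256, 64)
wide_dims, wide_box = resize_then_center_crop(512, 256, 64)
```

For the wide image, the 32 pixels on each side of the crop box are what a center crop throws away.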
01:53:39.460 | In the case where the image has multiple objects like in this case
01:53:43.580 | Like would it be possible like you would just lose the other things that they try to forget?
01:53:49.740 | Yeah, which is why data augmentation is important. So by and particularly their
01:53:54.620 | Test time augmentation is going to be particularly important because you would you wouldn't want to you know
01:54:00.620 | That there may be an artisanal mine out in the corner, which if you take a center crop you you don't see
01:54:07.220 | So data augmentation becomes very important. Yeah
01:54:14.820 | So when we talk about metrics that users are here see that lower or up to
01:54:18.820 | That's not really what the model tries to that's a great point. That's not the loss function
01:54:24.620 | Yeah, right. The loss function is something we'll be learning about next week
01:54:29.020 | And it uses a cross entropy or otherwise known as like negative log likelihood
01:54:34.500 | The metric is just the thing that's printed so we can see what's going on
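As a preview of the loss function mentioned here, this is a minimal sketch of cross entropy for a single-label prediction: the negative log of the probability assigned to the correct class. The probabilities are invented and the function name is an assumption.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log likelihood of the correct class."""
    return -math.log(probs[target_index])

probs = [0.7, 0.2, 0.1]                   # made-up predicted probabilities
loss_if_right = cross_entropy(probs, 0)   # correct class got 0.7
loss_if_wrong = cross_entropy(probs, 2)   # correct class got only 0.1
```

Unlike a metric such as accuracy, this number changes smoothly as the predicted probabilities change, which is what makes it usable for training.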
01:54:39.900 | Just next to that
01:54:43.460 | So in the context of multi-class
01:54:45.940 | Modeling, does the training data also have to be multi-class?
01:54:50.460 | So can I train on just like images of pure cats and pure dogs and expect it at prediction time to
01:54:56.260 | Predict if I give it a picture having both a cat and a dog
01:54:58.880 | I've never tried that and I've never seen an example of something that needed it. I
01:55:08.140 | Guess conceptually there's no reason it wouldn't work
01:55:12.300 | But it's kind of out there
01:55:15.740 | And you'd still use a sigmoid activation, you would have to make sure you're using a sigmoid loss function
01:55:20.340 | So in this case fast AI's default would not work because by default fast AI would say your training data
01:55:25.700 | Never has both a cat and a dog, so you would have to override the loss function
01:55:29.260 | When you use the differential learning rates
01:55:38.080 | Those three learning rates do they just kind of spread evenly across the layers?
01:55:43.420 | Yeah, we'll talk more about this later in the course, but in the fast AI library
01:55:49.540 | There's a concept of layer groups so in something like a resnet 50
01:55:54.580 | You know there's hundreds of layers, and I figured you don't want to write down hundreds of learning rates, so I've
01:56:00.940 | basically decided for you how to split them and
01:56:04.420 | The last one always refers just to the fully connected layers that we've randomly initialized and added to the end
01:56:12.780 | And then these ones are split generally about halfway through
01:56:18.260 | Basically, I've tried to make it so that
01:56:20.260 | These you know these ones are kind of the ones which you hardly want to change at all
01:56:24.500 | And these are the ones you might want to change a little bit, and I don't think we'll cover it in the course
01:56:29.420 | But if you're interested we can talk about it on the forum
01:56:31.260 | There are ways you can override this behavior to define your own layer groups if you want to
01:56:35.640 | And is there any way to visualize the model easily or like dump dump the layers of the model?
01:56:41.820 | Yeah, absolutely
01:56:43.900 | You can
01:56:45.900 | Make sure we've got one here
01:56:50.420 | So if you just type learn it doesn't tell you much at all, but what you can do is go
01:56:56.800 | learn.summary() and
01:56:59.980 | That spits out
01:57:03.900 | basically
01:57:05.580 | everything
01:57:07.020 | There's all the layers and so you can see in this case
01:57:09.980 | These are the names I mentioned how they all got names right so the first layer is called Conv2d-1
01:57:18.100 | And it's going to take as input
01:57:20.100 | This is useful to actually look at it's taking 64 by 64 images. Which is what we told it
01:57:27.060 | We're going to transform things to. This is three channels.
01:57:30.700 | Most things have channels at the end, so they would say 64 by 64 by 3. PyTorch moves it to the front
01:57:38.700 | So it's 3 by 64 by 64
01:57:41.300 | That's because it turns out that some of the GPU computations run faster when it's in that order
01:57:47.260 | Okay, but that happens all behind the scenes automatically so part of that transformation stuff
01:57:52.780 | That's kind of all done automatically is to do that
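The channel reordering described here is a transpose of the array axes. A small sketch with NumPy, using an all-zero placeholder image:

```python
import numpy as np

# Placeholder image in the common layout: height, width, channels.
img_hwc = np.zeros((64, 64, 3))

# Move the channel axis to the front: channels, height, width.
img_chw = np.transpose(img_hwc, (2, 0, 1))

# Add a leading mini-batch axis, giving a four-dimensional array.
batch = img_chw[None]
```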
01:57:58.580 | Means however big the batch size is
01:58:01.540 | In Keras they use a special number, None
01:58:07.100 | In PyTorch they use -1 so this is a four-dimensional mini batch
01:58:11.860 | the number of
01:58:14.380 | Images in the mini batch is dynamic, you can change that. The number of channels is 3
01:58:20.660 | The size of the images is 64 by 64. Okay, and so then you can basically see that this particular convolutional layer
01:58:28.740 | Apparently has 64 kernels in it
01:58:32.220 | And it's also halving. We haven't talked about this but convolutions can have something called a stride
01:58:37.100 | That, like max pooling, changes the size. So it's returning a 32 by 32 by 64 kernel
01:58:44.780 | Tensor and so on and so forth
01:58:48.140 | So that's summary and we'll learn all about what that's doing in detail in the second half of the course
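The halving from 64 to 32 follows from the usual output-size formula for a strided convolution. In the sketch below the stride of 2 matches the halving described above, while the 7 by 7 kernel and padding of 3 are assumptions (they are the usual first-layer settings for a ResNet, but the summary output is not quoted here).

```python
def conv_out_size(n, kernel, stride, padding):
    """Output width (or height) of a convolution over an input of size n."""
    return (n + 2 * padding - kernel) // stride + 1

# Assumed first-layer settings: 7x7 kernel, stride 2, padding 3.
out = conv_out_size(64, kernel=7, stride=2, padding=3)

# With 64 kernels in the layer, the output for one image has this shape.
out_shape = (64, out, out)
```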
01:58:56.020 | one more I
01:58:59.100 | Clicked in my own data set and I tried to use the and it's a really small data set these currencies from
01:59:04.740 | images and I tried to do a
01:59:07.780 | Learning rate find and then the plot and it just it gave me some numbers which I didn't understand on the learning rate finder
01:59:14.980 | Yeah, and then the plot was empty. So yeah, I mean let's let's talk about that on the forum
01:59:19.900 | but basically
01:59:21.020 | The learning rate finder is going to go through a mini batch at a time if you've got a tiny data set
01:59:26.460 | There's just not enough mini batches. So the trick is to make your batch size really small
01:59:31.740 | Like try making it like four or eight or something
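The arithmetic behind that advice is simple: the learning rate finder takes one step per mini batch, so a tiny data set with a normal batch size gives it almost nothing to plot. The data set size of 100 and batch size of 64 below are invented for illustration.

```python
import math

def num_batches(n_items, batch_size):
    """How many mini batches one pass over the data produces."""
    return math.ceil(n_items / batch_size)

few = num_batches(100, 64)   # a hypothetical 100-image data set
more = num_batches(100, 4)   # same data, much smaller batch size
```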
01:59:34.460 | Okay, they were great questions nothing online to add in it
01:59:41.900 | They were great questions we've got a little bit past where I hope to but let's let's quickly talk about
01:59:49.060 | Structured data so we can start thinking about it for next week
01:59:57.300 | This is really weird right to me. There's basically two types of data set we use in machine learning. There's a type of data
02:00:04.780 | like audio
02:00:07.340 | images
02:00:09.740 | natural language text
02:00:11.740 | where all of the all of the things inside an object like all of the pixels inside an image are
02:00:18.180 | All the same kind of thing. They're all pixels or they're all
02:00:23.780 | amplitudes of a waveform or
02:00:25.780 | They're all words
02:00:28.180 | I call this kind of data unstructured and then there's data sets like a
02:00:34.140 | profit-and-loss statement or
02:00:37.060 | the information about a Facebook user
02:00:39.660 | Where each column is like?
02:00:42.460 | Structurally quite different, you know one thing is representing like how many page views last month another one is their sex
02:00:49.620 | Another one is what zip code they're in and I call this structured data
02:00:53.860 | That particular terminology is not
02:00:57.180 | Unusual like lots of people use that terminology, but lots of people don't there's no
02:01:02.980 | Particularly agreed upon
02:01:05.980 | terminology so when I say structured data
02:01:09.020 | I'm referring to kind of columnar data as you might find in a database or a spreadsheet where different columns
02:01:16.700 | represent different kinds of things and each row represents an observation and
02:01:21.740 | So structured data is probably what most of you are
02:01:26.180 | Analyzing most of the time
02:01:30.180 | Funnily enough you know academics in the deep learning world don't really give a shit about structured data
02:01:38.340 | Because it's pretty hard to get published in fancy conference proceedings
02:01:43.060 | If you're like if you've got a better logistics model, you know, it's the thing that makes the world go round
02:01:48.620 | It's a thing that makes everybody you know money and efficiency and make stuff work
02:01:54.020 | But it's largely ignored sadly
02:01:57.600 | So we're not going to ignore it because we're practical deep learning
02:02:02.140 | And Kaggle doesn't ignore it either because people put prize money up on Kaggle to solve real-world problems
02:02:08.940 | So there are some great Kaggle competitions we can look at there's one running right now
02:02:13.400 | Which is the grocery sales forecasting competition for Ecuador's largest chain?
02:02:19.080 | It's always a little I've got to be a little careful about how much I show you about currently running competitions because I don't want
02:02:28.660 | To you know help you cheat, but it so happens. There was a competition a year or two ago
02:02:34.620 | For one of Germany's largest grocery chains, which is almost identical. So I'm going to show you how to do that
02:02:40.640 | So that was called the Rossmann stores data
02:02:48.740 | So I would suggest you know, first of all try practicing what we're learning on Rossmann, right?
02:02:54.860 | but then see if you can get it working on on grocery because currently
02:03:00.340 | On the leaderboard no one seems to basically know what they're doing in the groceries competition. If you look at the leaderboard
02:03:09.540 | See here
02:03:11.220 | These ones around five two nine, five three oh are people that are literally finding like group averages and submitting those
02:03:17.940 | I know because those are the kernels that they're using so, you know basically the people around 20th place
02:03:23.840 | Aren't actually doing any machine learning
02:03:28.500 | So yeah, let's see if we can improve things
02:03:30.500 | So you'll see there's a lesson three Rossmann
02:03:35.300 | Notebook, make sure you git pull. Okay, in fact, you know just a reminder, you know before you start working
02:03:41.220 | Git pull in your fast AI repo and from time to time
02:03:45.500 | Conda env update. For you guys doing the in-person course the conda env update
02:03:51.540 | You should do it more often because we're kind of changing things a little bit folks in the MOOC
02:03:57.100 | You know more like once a month should be fine
02:03:59.820 | So anyway, I just I just changed this a little bit so make sure you git pull to get lesson three Rossmann
02:04:06.500 | And there's a couple of new libraries here one is fastai.structured
02:04:12.500 | fastai.structured contains stuff, which is actually not at all PyTorch specific
02:04:18.940 | And we actually use that in the machine learning course as well for doing random forests with no PyTorch at all
02:04:24.620 | I mentioned that because you can use that particular library without any of the other parts of fast AI
02:04:31.500 | So that can be handy
02:04:34.300 | And then we're also going to use fastai.column_data
02:04:37.460 | Which is basically some stuff that allows us to do fast AI PyTorch stuff with
02:04:43.440 | columnar structured data
02:04:46.220 | For structured data we need to use pandas a lot
02:04:52.060 | Anybody who's used R data frames will be very familiar with pandas. Pandas is basically an attempt to kind of replicate
02:04:58.300 | data frames in Python
02:05:01.140 | You know and a bit more
02:05:04.080 | If you're not entirely familiar with pandas
02:05:09.100 | There's a great book
02:05:12.340 | Which I think I might have mentioned before
02:05:20.580 | Python for data analysis by Wes McKinney. There's a new edition that just came out a couple of weeks ago
02:05:26.180 | Obviously being by the pandas author its coverage of pandas is excellent, but it also covers
02:05:33.340 | numpy
02:05:35.460 | scipy
02:05:36.660 | matplotlib
02:05:37.740 | scikit-learn
02:05:39.460 | IPython and Jupyter really well, okay, and so I'm kind of going to assume
02:05:46.500 | That you know your way around these libraries to some extent
02:05:51.020 | Also, there was the workshop we did before this started and there's a video of that online where we kind of have a brief mention
02:05:58.100 | of all of those tools
02:06:00.340 | Structured data is generally shared as CSV files. It was no different in this competition
02:06:07.460 | As you'll see, there's a hyperlink to the Rossmann data set here
02:06:11.860 | All right now if you look at the bottom of my screen you'll see this goes to files.fast.ai
02:06:17.060 | Because this doesn't require any login or anything to grab this data set. It's as simple as right clicking
02:06:22.740 | copy link address
02:06:25.540 | Head over to wherever you want it and just type
02:06:29.680 | wget and
02:06:33.180 | The URL okay, so that's because you know, it's not behind a login or anything
02:06:42.060 | so you can grab it from there and
02:06:46.100 | You can always read a CSV file with just pandas.read_csv. Now in this particular case there's a lot of
02:06:53.460 | Pre-processing that we do and what I've actually done here is I've
02:06:59.300 | I've actually
02:07:02.180 | Stolen the entire pipeline from the third-place winner of Rossmann. Okay, so they made all their data
02:07:09.980 | They're really great. You know, they've had a github available with everything that we need and I've ported it all across and simplified it and
02:07:16.860 | Tried to make it pretty easy to understand
02:07:21.900 | This course is about deep learning not about data processing. So I'm not going to go through it
02:07:26.800 | But we will be going through it in the machine learning course in some detail because feature engineering is really important
02:07:33.820 | So if you're interested
02:07:35.820 | You know check out the machine learning course
02:07:38.980 | for that I
02:07:40.980 | will however show you
02:07:42.980 | Kind of what it looks like. So once we read the CSVs in
02:07:46.580 | You can see basically what's there so the key one is
02:07:51.500 | For a particular store
02:07:57.380 | We have the
02:08:02.900 | We have the date and we have the sales
02:08:09.620 | For that particular store. We know whether that
02:08:13.060 | Thing is on promo or not
02:08:16.100 | We know the number of customers that that particular store had
02:08:20.900 | We know whether that date was a school holiday
02:08:24.540 | We also know
02:08:34.260 | What kind of store it is so like this is pretty common right you'll often get
02:08:38.720 | Data sets where there's some column with like just some kind of code. We don't really know what the code means
02:08:44.460 | Most of the time I find it doesn't matter what it means
02:08:48.200 | Like normally you get given a data dictionary when you start on a project and obviously if you're working on an internal project
02:08:54.540 | You can ask the people at your company. What does this column mean? I?
02:08:57.780 | Kind of stay away from learning too much about it. I prefer to like see what the data says
02:09:04.020 | first
02:09:06.020 | There's something about what kind of product we're selling in this particular row.
02:09:10.940 | And then there's information about how far away the nearest competitor is, how long they have been open for,
02:09:21.140 | how long the promo has been on for.
02:09:30.500 | For each store we can find out what state it's in, and for each state we can find out the name of the state.
02:09:35.980 | This is in Germany, and
02:09:37.980 | interestingly they were allowed to download any external data
02:09:42.460 | they wanted in this competition.
02:09:43.740 | It's very common, as long as you share it with everybody else, and so some folks tried downloading data from
02:09:50.340 | Google Trends
02:09:53.180 | I'm not sure exactly what it was that they were checking the trend of but we have this information from Google Trends
02:09:59.940 | Somebody downloaded the weather for every day in Germany for every state
02:10:03.780 | And yeah, that's about it right so
02:10:12.580 | You can get a data frame summary with pandas, which kind of lets you see how many
02:10:22.520 | observations there are, and means and standard deviations.
02:10:25.180 | Again, I don't do a hell of a lot with that early on,
02:10:29.260 | but it's nice to know it's there.
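As a rough illustration of the kind of summary described here, the sketch below uses the built-in `DataFrame.describe()` on a tiny table. The column names and numbers are made-up toy values, not the competition data, and `describe()` is just one way to get counts, means and standard deviations; the notebook may use a different helper.

```python
import pandas as pd

# Toy stand-in for the training table (values are invented for illustration).
sales = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Sales": [5263, 6064, 8314, 13995],
    "Customers": [555, 625, 821, 1498],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = sales[["Sales", "Customers"]].describe()

# Keep just the rows mentioned in the lecture: observations, means, std devs.
print(summary.loc[["count", "mean", "std"]])
```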
02:10:31.260 | So, you know, this is called a relational data set. A relational data set is one where there's quite a few tables
02:10:38.300 | we have to join together. It's very easy to do that in pandas.
02:10:41.960 | There's a thing called merge, so I create a little function to do that,
02:10:45.020 | and so I just started joining everything together: join in the weather, the Google Trends,
02:10:48.300 | the stores.
02:10:51.500 | Yeah, that's about everything, I guess.
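A minimal sketch of what such a "little function" around `merge` might look like. The helper name `join_df`, its parameters, and the toy tables are assumptions for illustration; the only thing taken from the lecture is that pandas `merge` is used to join the tables.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix="_y"):
    """Left-join `right` onto `left`, keeping every row of `left`.

    Hypothetical helper: columns that clash get `suffix` on the right-hand side.
    """
    if right_on is None:
        right_on = left_on
    return left.merge(right, how="left", left_on=left_on,
                      right_on=right_on, suffixes=("", suffix))

# Toy tables standing in for the sales rows and the per-store information.
train = pd.DataFrame({"Store": [1, 2, 3], "Sales": [100, 200, 300]})
store = pd.DataFrame({"Store": [1, 2], "StoreType": ["a", "c"]})

joined = join_df(train, store, "Store")
print(joined)
```

A left join is used so that no sales rows are dropped; a store with no match (store 3 here) simply gets a missing `StoreType`, which makes gaps in the lookup tables easy to spot.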
02:10:58.060 | You'll see there's one thing that I'm using from the fastai library, which is called add_datepart.
02:11:03.740 | We talk about this a lot in the machine learning course,
02:11:06.340 | but basically this is going to take a date and pull out of it a bunch of columns: day of week,
02:11:11.580 | is it the start of a quarter, month of year, and so on and so forth, and add them all in to the data set.
02:11:17.820 | Okay, so this is all standard pre-processing.
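To make the idea concrete, here is a small sketch of pulling date parts out of a date column using the pandas `.dt` accessor. This is not the fastai function itself: the function name `add_date_columns` and the output column names are assumptions chosen for illustration.

```python
import pandas as pd

def add_date_columns(df, col):
    """Return a copy of `df` with a few columns derived from the datetime column `col`.

    Hypothetical stand-in for the idea behind add_datepart, not its implementation.
    """
    dates = df[col].dt
    out = df.copy()
    out["Year"] = dates.year
    out["Month"] = dates.month
    out["Day"] = dates.day
    out["Dayofweek"] = dates.dayofweek            # Monday is 0
    out["Is_quarter_start"] = dates.is_quarter_start
    return out

df = pd.DataFrame({"Date": pd.to_datetime(["2015-07-01", "2015-07-31"])})
df = add_date_columns(df, "Date")
print(df)
```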
02:11:23.380 | As we join everything together we fiddle around with some of the dates a little bit. Some of them are in month and year
02:11:28.700 | format; we turn them into date format.
02:11:30.700 | We spend a lot of time
02:11:32.980 | trying to
02:11:35.860 | take information about, for example, holidays and add a column for how long until the next holiday,
02:11:41.920 | how long has it been since the last holiday,
02:11:44.100 | ditto for promos,
02:11:46.580 | so on and so forth. Okay, so we do all that, and at the very end
02:11:51.900 | we basically save a big structured data file that contains all that stuff.
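The "how long since the last holiday" and "how long until the next holiday" columns mentioned above can be sketched with a forward fill and a backward fill of the event dates within each store. This is one possible approach under assumed column names (`Store`, `Date`, `SchoolHoliday`) and toy data, not necessarily how the ported pipeline does it.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 1],
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03",
                            "2015-01-04", "2015-01-05"]),
    "SchoolHoliday": [False, True, False, False, True],
})
df = df.sort_values(["Store", "Date"]).reset_index(drop=True)

# Keep the date only on rows where the event happens; other rows become NaT.
event_date = df["Date"].where(df["SchoolHoliday"])

# Within each store, carry the most recent event forward and the next event backward.
last_event = event_date.groupby(df["Store"]).ffill()
next_event = event_date.groupby(df["Store"]).bfill()

# Rows with no earlier (or later) event in the store end up as missing values.
df["DaysSinceHoliday"] = (df["Date"] - last_event).dt.days
df["DaysUntilHoliday"] = (next_event - df["Date"]).dt.days
print(df)
```

The same pattern would apply to promos by swapping the flag column.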
02:11:57.020 | Something that those of you that use pandas may not be aware of is that there's a very cool new format called feather,
02:12:03.420 | which you can save a pandas
02:12:06.100 | data frame into.
02:12:08.380 | It pretty much takes it as it sits in RAM and dumps it to the disk,
02:12:13.180 | and so it's really, really, really fast. The reason that you need to know this is because the
02:12:19.580 | Ecuadorian grocery competition that's on now has 350 million records,
02:12:24.120 | so you will care about how long things take. It took, I believe, about six seconds for me to save
02:12:30.820 | 350 million records to feather format, so it's pretty cool.
02:12:34.380 | So at the end of all that I save it in feather format, and for the rest of this discussion
02:12:39.740 | I'm just going to take it as given that we've got this nicely
02:12:43.700 | processed, feature-engineered file and I can just go read_feather. Okay, but for you to play along at home
02:12:49.780 | you will have to run those previous cells. Oh,
02:12:53.020 | except the
02:12:55.660 | see, these ones are commented out.
02:12:57.940 | You don't have to run those because the file that you download from files.fast.ai has already done that for you, okay?
02:13:04.700 | All right
02:13:07.820 | So we basically have
02:13:09.820 | all these columns
02:13:12.780 | So it basically is going to tell us,
02:13:15.460 | you know, how many of this thing was sold on
02:13:20.980 | this date at this store, and so the goal of this competition is to find out
02:13:28.020 | how many things will be sold for each store, for each type of thing, in the future.
02:13:34.460 | Okay, and so that's basically what we're going to be trying to do
02:13:39.860 | And so here's an example of what some of the data looks like
02:13:42.580 | And so
02:13:46.420 | Next week we're going to see how to go through these steps
02:13:50.380 | But basically what we're going to learn is to split the columns into two types.
02:13:56.500 | Some columns we're going to treat as
02:13:59.340 | categorical, which is to say
02:14:01.900 | store ID 1 and store ID 2 are not numerically related to each other; they're categories.
02:14:09.700 | Right, we're going to treat day of week like that too: Monday and Tuesday, day zero and day one, are not numerically related.
02:14:16.160 | Whereas distance in kilometers to the nearest competitor,
02:14:22.140 | that's a number that we're going to treat numerically.
02:14:25.020 | Right, so in other words, the categorical variables we basically are going to one-hot encode.
02:14:30.580 | You can think of it as one-hot encoding them, whereas the continuous variables we're going to be feeding into fully connected layers
02:14:38.700 | just as is.
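A small sketch of the categorical versus continuous split described above, using plain pandas. The variable names `cat_vars` and `contin_vars`, the column names, and the toy values are assumptions for illustration. The one-hot step is shown only because the lecture says you can think of the categorical treatment that way; the actual steps are covered next week.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 2, 1],
    "DayOfWeek": [0, 1, 0],
    "CompetitionDistance": [1270.0, 570.0, 1270.0],
})

# Hypothetical split: IDs and day of week are categories, distance is a number.
cat_vars = ["Store", "DayOfWeek"]
contin_vars = ["CompetitionDistance"]

for name in cat_vars:
    df[name] = df[name].astype("category")   # levels, not magnitudes
for name in contin_vars:
    df[name] = df[name].astype("float32")    # used numerically, as is

# One column per category level; each row has a single "hot" entry per variable.
one_hot = pd.get_dummies(df[cat_vars])
print(one_hot)
```

After this, store 1 and store 2 are just two separate indicator columns, so the model cannot read anything into "2 is bigger than 1".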
02:14:42.460 | So what we'll be doing is we'll be basically
02:14:44.780 | creating a
02:14:47.820 | validation set, and you'll see a lot of this will start to look familiar.
02:14:50.900 | This is the same function we used on planet and dog breeds to create a validation set.
02:14:55.180 | There's some stuff that you haven't seen before,
02:14:59.060 | where we're going to,
02:15:01.940 | basically, rather than saying image data dot from CSV, we're going to say columnar data
02:15:08.560 | from data frame. Right, so you can see the basic API concepts will be the same, but they're a little different, right?
02:15:15.680 | But just like before we're going to get a learner, and
02:15:19.600 | we're going to go lr_find
02:15:22.880 | to find our best learning rate, and
02:15:25.120 | then we're going to go dot fit with a metric,
02:15:28.600 | with a cycle length.
02:15:31.440 | Okay, so the basic sequence is going to end up looking hopefully very familiar. Okay, so we're out of time,
02:15:39.480 | so what I suggest you do this week is
02:15:42.720 | try to
02:15:45.360 | enter as many Kaggle image competitions as possible. Try to really get this feel for
02:15:51.640 | cycle lengths, learning rates,
02:15:54.560 | plotting things.
02:15:57.760 | You know, that
02:16:01.360 | post I showed you at the start of class today that kind of took you through lesson one,
02:16:07.560 | really go through that on as many image data sets as you can to just feel
02:16:12.620 | really comfortable with it, right?
02:16:15.360 | Because you want to get to the point where next week, when we start talking about structured data, this idea of how
02:16:21.960 | learners kind of work, and data works, and data loaders and data sets and looking at pictures, should be really, you know, intuitive.
02:16:30.320 | All right, good luck. See you next week.
02:16:32.320 | (audience applauding)