Lesson 2: Practical Deep Learning for Coders
So one of the things I wanted to talk about and it really came up when I was looking at 00:00:06.960 |
the survey responses is what is different about how we're trying to teach this course 00:00:13.240 |
and how will it impact you as participants in this course. 00:00:16.960 |
And really we're trying to teach this course in a very different way to the way most teaching 00:00:22.760 |
is done, or at least most teaching in the United States. 00:00:28.760 |
Rachel and I are both very keen fans of this guy called David Perkins who has this wonderful 00:00:34.480 |
book called Making Learning Whole, How Seven Principles of Teaching Can Transform Education. 00:00:40.520 |
We are trying to put these principles in practice in this course. 00:00:43.560 |
I'll give you a little anecdote to give you a sense of how this works. 00:00:51.200 |
If you were to learn baseball, if you were to learn baseball the way that math is taught, 00:00:57.880 |
you would first of all learn about the shape of a parabola, and then you would learn about 00:01:02.600 |
the material science design behind stitching baseballs and so forth. 00:01:07.000 |
And 20 years later, after you had completed your PhD and postdoc, you would be taken to 00:01:10.720 |
your first baseball game and you would finally be introduced to the rules of baseball. 00:01:19.400 |
The way that in practice baseball is taught is we take a kid down to the baseball diamond 00:01:24.880 |
and we say these people are playing baseball. 00:01:30.880 |
You say, "Okay, stand here, I'm going to throw this, hit it." 00:01:39.880 |
So that's why we started our first class with: here are 7 lines of code you can run to do deep learning. 00:01:46.240 |
Not just to do deep learning, but to do image classification on any data set, as long as it's organized into one directory per class. 00:01:55.400 |
So this means you will very often be in a situation, and we've heard a lot of your questions 00:01:59.800 |
about this during the week, of gosh there's a whole lot of details I don't understand. 00:02:05.080 |
Like this fine-tuning thing, what is fine-tuning? 00:02:11.920 |
It's a thing you do in order to do effective image classification with deep learning. 00:02:18.640 |
We're going to start at the top and gradually work our way down and down and down. 00:02:22.480 |
The reason that you are going to want to learn the additional levels of detail is so that 00:02:27.880 |
when you get to the point where you want to do something that no one's done before, you'll 00:02:33.560 |
know how to go into that detail and create something that does what you want. 00:02:38.720 |
So we're going to keep going down a level and down a level and down a level and down 00:02:42.600 |
a level, but through the hierarchy of software libraries, through the hierarchy of the way 00:02:48.280 |
computers work, through the hierarchy of the algorithms and the math. 00:02:53.440 |
But only at the speed that's necessary to get to the next level of let's make a better 00:02:58.900 |
model or let's make a model that can do something we couldn't do before. 00:03:05.880 |
So it's very different to, I don't know if anybody has been reading the Yoshua Bengio 00:03:10.280 |
and Ian Goodfellow deep learning book, which is a great mathematical deep learning book, 00:03:15.280 |
but it literally starts with 5 chapters of everything you need to know about probability, 00:03:19.480 |
everything you need to know about calculus, everything you need to know about linear algebra, 00:03:22.720 |
everything you need to know about optimization and so forth. 00:03:25.520 |
And in fact, I don't know that in the whole book there's ever actually a point where it 00:03:32.200 |
says here is how you do deep learning, even if you read the whole thing. 00:03:36.360 |
I've read 2/3 of it before, it's a really good math book. 00:03:42.560 |
And anybody who's interested in understanding the math of deep learning I would strongly 00:03:45.840 |
recommend but it's kind of the opposite of how we're teaching this course. 00:03:50.340 |
So if you often find yourself thinking, "I don't really know what's going on," that's fine. 00:03:56.800 |
But I also want you to always be thinking about, "Well, how can I figure out a bit more about this?" 00:04:06.400 |
So generally speaking, the assignments during the week are trying to give you enough room 00:04:12.160 |
to find a way to dig into what you've learned and learn a little bit more. 00:04:17.120 |
Make sure you can do what you've seen and also that you can learn a little bit more on top of that. 00:04:22.400 |
So you are all coders, and therefore you are all expected to look at that first notebook 00:04:27.920 |
and look at what are the inputs to every one of those cells? 00:04:30.800 |
What are the outputs from every one of those cells? 00:04:33.280 |
How is it that the output of this cell can be used as the input of that cell? 00:04:39.140 |
This is why we did not tell you how do you use Kaggle CLI? 00:04:43.420 |
How do you prepare a submission in the correct format? 00:04:46.880 |
Because we wanted you to see if you can figure it out and also to leverage the community 00:04:53.400 |
that we have to ask questions when you're stuck. 00:04:59.520 |
Being stuck and failing is terrific because it means you have found some limit of your knowledge. 00:05:07.320 |
You can then think really hard, read lots of documentation, and ask the rest of the community 00:05:14.480 |
until you are no longer stuck, at which point you now know something that you didn't know before. 00:05:22.440 |
Asking for help is a key part of this, and so there is a whole wiki page called How to Ask for Help. 00:05:27.960 |
It's really important, and so far I would say about half the times I've seen people ask 00:05:33.040 |
for help, there is not enough information for your colleagues to actually help you effectively. 00:05:38.640 |
So when people point you at this page, it's not because they're trying to be a pain, it's 00:05:42.880 |
because they're saying, "I want to help you, but you haven't given me enough information." 00:05:47.200 |
So in particular, what have you tried so far? 00:05:56.980 |
And tell us everything you can about your computer and your software. 00:06:13.120 |
Show us screenshots, error messages, show us your code. 00:06:16.360 |
So the better you get at asking for help, the more enjoyable experience you're going 00:06:23.560 |
to have, because you'll continually find that your problems are solved very quickly. 00:06:30.480 |
There was a terrific recommendation from the head of Google Brain, Vincent Vanhoucke, on 00:06:39.760 |
a Reddit AMA a few weeks ago where he said he tells everybody on his team, "If you're stuck, 00:06:49.540 |
you have to work at it yourself for half an hour. 00:06:51.600 |
If you're still stuck, you have to ask for help from somebody else." 00:06:55.040 |
The idea being that you are always making sure that you try everything you can, but 00:07:00.240 |
you're also never wasting your time when somebody else can help you. 00:07:05.600 |
So maybe you can think about this half an hour rule yourself. 00:07:10.400 |
I wanted to highlight a great example of a really successful how to ask for help. 00:07:20.600 |
What's your background before being here at this class? 00:07:26.000 |
You could introduce yourself real quick, please. 00:07:33.000 |
Hey, I actually graduated from USF two years ago with a Master's degree. 00:07:41.640 |
So that's how I ended up coming back for this class. 00:07:52.560 |
Well, hopefully you've heard some of these fantastic approaches to asking for help. 00:07:53.560 |
You can see here that he explained what he was trying to do, what happened last time, and what happened this time. 00:08:00.960 |
We've got a screenshot showing what he typed and what came back. 00:08:03.960 |
He showed us what resources he's currently used, what these resources say, and so forth. 00:08:18.880 |
I was so happy when I saw this question because it's just so clear. 00:08:19.880 |
I was like, this is easy to answer because it's a well-asked question. 00:08:31.240 |
So as you might have noticed, the wiki is rapidly filling out with lots of great information. 00:08:38.800 |
You'll see on the left-hand side there is a recent changes section. 00:08:42.800 |
You can see every day, lots of people have been contributing to lots of things, so it's a great resource. 00:08:54.520 |
If you are trying to diagnose something which is not covered and you solve it, please add what you learned to the wiki. 00:09:06.320 |
One of the things I love seeing today was Tom, where's Tom? 00:09:17.800 |
Actually I think he was remote, I think he joined remotely yesterday. 00:09:20.240 |
So he was asking a question about how fine-tuning works, and we talked a bit about the answers, 00:09:28.740 |
and then he went ahead and created a very small little wiki page. 00:09:32.800 |
There's not much information there, but there's more than there used to be. 00:09:37.800 |
And you can even see in the places where he wasn't quite sure, he put some question marks. 00:09:42.480 |
So now somebody else can go back, edit his wiki page, and Tom's going to come back tomorrow 00:09:47.080 |
and say "Oh, now I've got even more questions answered." 00:09:50.720 |
So this is the kind of approach where you're going to learn a lot. 00:09:55.760 |
We've already spoken to Melissa, so this is good. 00:09:59.520 |
This is another great example of something which I think is very helpful, which is Melissa, 00:10:04.120 |
who we heard from earlier, went ahead and told us all, "Here is my understanding of 00:10:08.760 |
the 17 steps necessary to complete the things that we were asked to do this week." 00:10:13.560 |
So this is great not only for Melissa to make sure she understands it correctly, but then 00:10:18.780 |
everybody else can say "Oh, that's a really handy resource that we can draw on as well." 00:10:27.760 |
There are 718 messages in Slack in a single channel. 00:10:32.280 |
That's way too much for you to expect to use this as a learning resource, so this is kind 00:10:39.080 |
of my suggestion as to where you might want to be careful of how you use Slack. 00:10:47.320 |
So I wanted to spend maybe quite a lot of time, as you can see, talking about the resources available to you. 00:10:53.480 |
I feel like if we get that sorted out now, then we're all going to speed along a lot faster. 00:10:59.440 |
Thanks for your patience as we talk about some non-deep learning stuff. 00:11:03.760 |
We expect the vast majority of learning to happen outside of class, and in fact if we 00:11:14.240 |
go back and finish off our survey, I know that one of the questions asked about that. 00:11:25.800 |
How much time are you prepared to commit most weeks to this class? 00:11:30.400 |
And the majority are 8-15, some are 15-30, and a small number are less than 8. 00:11:38.040 |
Now if you're in the less than 8 group, I understand that's probably not something you can change. 00:11:43.120 |
If you had more time, you'd put in more time. 00:11:46.280 |
So if you're in the less than 8 group, I guess just think about how you want to prioritize 00:11:52.400 |
what you're getting out of this course, and be aware it's not really designed that you're 00:11:56.040 |
going to be able to do everything in less than 8 hours a week. 00:12:01.360 |
So maybe make more use of the forums and the wiki and kind of focus your assignments during 00:12:08.600 |
the week on the stuff that you're most interested in. 00:12:11.120 |
And don't worry too much if you don't feel like you're getting everything, because you can always come back to it later. 00:12:16.800 |
For those of you in the 15-30 group, I really hope that you'll find that you're getting 00:12:21.920 |
a huge amount out of the time that you're putting in. 00:12:27.560 |
Something I'm really glad I asked, because I found this very helpful, was how much of Lesson 1 was new to you. 00:12:32.720 |
And for half of you, the answer is most of it. 00:12:37.020 |
And for well over half of you, most of it or nearly all of it from Lesson 1 is new. 00:12:43.120 |
So if you're one of the many people I've spoken to during the week who are saying "holy shit, 00:12:48.400 |
that was a fire hose of information, I feel kind of overwhelmed, but kind of excited," know that you're not alone. 00:12:58.600 |
Remember during the week, there are about 100 of you going through this same journey. 00:13:03.840 |
So if you want to catch up with some people during the week and have a coffee to talk 00:13:08.120 |
more about the class, or join a study group here at USF, or if you're from the South Bay, 00:13:13.400 |
find some people from the South Bay, I would strongly suggest doing that. 00:13:18.200 |
So for example, if you're in Menlo Park, you could create a Menlo Park Slack channel and 00:13:24.020 |
put out a message saying "Hey, anybody else in Menlo Park available on Wednesday night, 00:13:28.680 |
I'd love to get together and maybe do some pair programming." 00:13:35.440 |
For some of you, not very much of it was new. 00:13:39.240 |
And so for those of you, I do want to make sure that you feel comfortable pushing ahead and going further. 00:13:49.740 |
Basically in the last lesson, what we learned was a pretty standard data science computing stack. 00:13:56.960 |
So AWS, Jupyter Notebook, bit of NumPy, Bash, this is all stuff that regardless of what 00:14:07.840 |
kind of data science you do, you're going to be seeing a lot more of if you stick with this field. 00:14:13.840 |
They're all very, very useful things, and those of you who have maybe spent some time 00:14:18.960 |
in this field, you'll have seen most of it before. 00:14:36.920 |
So last week we were really looking at the basic foundations, computing foundations necessary 00:14:46.160 |
for data science more generally, and for deep learning more particularly. 00:14:52.640 |
This week we're going to do something very similar, but we're going to be looking at the algorithmic side of things. 00:14:58.120 |
So in particular, we're going to go back and say "Hey, what did we actually do last week?" 00:15:08.240 |
For those of you who don't have much algorithmic background around machine learning, this is 00:15:13.200 |
going to be the same fire hose of information as last week was for those of you who don't 00:15:18.080 |
have so much software and Bash and AWS background. 00:15:23.000 |
So again, if there's a lot of information, don't worry, this is being recorded. 00:15:31.520 |
And so the key thing is to come away with an understanding of what are the pieces being 00:15:39.400 |
What are they kind of doing, even if you don't understand the details? 00:15:42.960 |
So if at any point you're thinking, "Okay, Jeremy's talking about activation functions, 00:15:48.000 |
I have no idea what he just said about what an activation function is, or why I should 00:15:52.400 |
care," please go on to the in-class Slack channel and post something like "@Rachel, I don't know 00:16:01.840 |
what Jeremy's talking about at all," and then Rachel's got a microphone and she can let 00:16:05.400 |
me know, or else put up your hand and I will give you the microphone and you can ask. 00:16:10.560 |
So I do want to make sure you guys feel very comfortable asking questions. 00:16:14.640 |
I have done this class now once before because I did it for the Skype students last night. 00:16:20.640 |
So I've heard a few of the questions already, so hopefully I can cover some things that came up there as well. 00:16:28.680 |
Before we look at these kind of digging into what's going on, the first thing we're going 00:16:34.120 |
to do is see how do we do the basic homework assignment from last week. 00:16:39.960 |
So the basic homework assignment from last week was "Can you enter the Kaggle Dogs vs. Cats Redux competition?" 00:16:47.920 |
So how many of you managed to submit something to that competition and get some kind of result? 00:16:57.520 |
So for those of you who haven't yet, keep trying during this week and use all of those 00:17:02.280 |
resources I showed you to help you because now quite a few of your colleagues have done 00:17:06.300 |
it successfully and therefore we can all help you. 00:17:21.960 |
So the basic idea here is we had to download the data to a directory. 00:17:37.360 |
So to do that, I just typed "kg download" after using the "kg config" command. 00:17:45.640 |
Kg is part of the Kaggle CLI thing, and Kaggle CLI can be installed by typing "pip install kaggle-cli". 00:17:58.160 |
This works fine without any changes if you're using our AWS instances and setup scripts. 00:18:06.880 |
In fact it works fine if you're using Anaconda pretty much anywhere. 00:18:11.680 |
If you're not doing either of those two things, you may have found this step more challenging. 00:18:17.220 |
But once it's installed, it's as simple as saying "kg config" with your username, password, and the competition name. 00:18:24.420 |
When you put in the competition name, you can find that out by just going to the Kaggle 00:18:30.040 |
website and you'll see that when you go to the competition, the URL has the competition name right there. 00:18:39.120 |
Just copy and paste that; that's the competition name. 00:18:43.880 |
Kaggle CLI is a script that somebody created in their spare time and didn't spend a lot of time polishing. 00:18:51.040 |
There's no error handling, there's no checking, there's nothing. 00:18:53.720 |
So for example, if you haven't gone to Kaggle and accepted the competition rules, then attempting 00:18:59.260 |
to run Kg download will not give you an error. 00:19:02.680 |
It will create a zip file that actually contains the contents of the Kaggle webpage telling you to accept the rules. 00:19:09.080 |
So those of you that tried to unzip that and that said it's not a zip file, if you go ahead 00:19:13.440 |
and cat that, you'll see it's not a zip file, it's an HTML file. 00:19:18.360 |
This is pretty common with recent-ish data science tools and particularly with cutting-edge deep learning tools. 00:19:26.840 |
A lot of it's pretty new, it's pretty rough, and you really have to expect to do a lot of the debugging yourself. 00:19:33.680 |
It's very different to using Excel or Photoshop. 00:19:37.440 |
When I said Kg download, I created a test.zip and a train.zip, so I went ahead and I unzipped 00:19:43.320 |
both of those things, that created a test and a train, and they contained a whole bunch of JPEGs. 00:19:54.760 |
So the next thing I did to make my life easier was I made a list of what I believed I had to do. 00:20:06.280 |
I thought I need to create a validation set, I need to create a sample, I need to move 00:20:11.000 |
my cats into a cats directory and dogs into a dogs directory, and I then need to run the fine-tuning and fit the model. 00:20:17.800 |
So I just went ahead then and created markdown headings for each of those things and started working through them. 00:20:27.360 |
A very handy thing in Jupyter, Jupyter Notebook, is that you can create a cell that starts 00:20:32.760 |
with a % sign and that allows you to type what they call magic commands. 00:20:36.960 |
There are lots of magic commands that do all kinds of useful things, and they do include things like %cd for changing directory. 00:20:45.040 |
Another cool thing you can do is you can use an exclamation mark and then type any bash command. 00:20:52.000 |
So the nice thing about doing this stuff in the notebook rather than in bash is you've got a record of everything you did. 00:20:59.000 |
So if you need to go back and do it again, you can. 00:21:01.320 |
If you make a mistake, you can go back and figure it out. 00:21:03.840 |
So this kind of reproducible research, very highly recommended. 00:21:08.800 |
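To make that concrete, here is a minimal sketch of the kind of notebook cells being described; the paths are just placeholders rather than the exact ones from the lesson.

```python
# A line starting with % is a Jupyter "magic" command, e.g. changing directory:
%cd data/redux

# A line starting with ! runs any bash command, so all of your setup steps
# are recorded in the notebook and can be re-run later:
!mkdir -p valid sample/train sample/valid
!unzip -q train.zip
!ls train | head
```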
So I try to do everything in a single notebook so I can go back and fix the problems that come up. 00:21:14.600 |
So here you can see I've gone into the directory, I've created my validation set, I then used 00:21:21.520 |
three lines of Python to go ahead and grab all of the JPEG file names, create a random 00:21:31.240 |
permutation of them, and so then the first 2000 of that random permutation are 2000 random 00:21:37.160 |
files, and then I moved them into my validation directory; that gave me my valid set. 00:21:42.600 |
I did exactly the same thing for my sample, but rather than moving them, I copied them. 00:21:50.640 |
And then I did that for both my sample training and my sample validation, and that was enough to get started. 00:22:00.800 |
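Here is a rough sketch of those few lines of Python, assuming we're sitting in the competition's data directory; the 2,000 validation images match what was just described, while the sample size here is an arbitrary choice.

```python
import os
from glob import glob
from shutil import copyfile
import numpy as np

# grab all the training JPEGs and shuffle their order
files = glob('train/*.jpg')
shuf = np.random.permutation(files)

# move the first 2000 shuffled files into the validation directory
for f in shuf[:2000]:
    os.rename(f, 'valid/' + os.path.basename(f))

# for the sample, copy (rather than move) a small number of files
shuf = np.random.permutation(glob('train/*.jpg'))
for f in shuf[:200]:
    copyfile(f, 'sample/train/' + os.path.basename(f))
```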
The next thing I had to do was to move all my cats into a cats directory and dogs into 00:22:04.680 |
a dogs directory, which was as complex as typing "mv cat.* cats/" and "mv dog.* dogs/". 00:22:15.760 |
And so the cool thing is, now that I've done that, I can then just copy and paste the seven lines of code from lesson 1. 00:22:25.960 |
So these lines of code are totally unchanged. 00:22:29.480 |
I added one more line of code which was save weights. 00:22:33.520 |
Once you've trained something, it's a great idea to save the weights so you don't have 00:22:36.840 |
to train it again, you can always go back later and say load weights. 00:22:42.800 |
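For reference, a sketch of what those seven lines plus the extra save_weights line look like, assuming the course's vgg16.py module (with its Vgg16 class) is importable; the path and batch size are placeholders.

```python
from vgg16 import Vgg16

path = 'data/redux/'   # placeholder: wherever train/ and valid/ live
batch_size = 64

vgg = Vgg16()
batches = vgg.get_batches(path + 'train', batch_size=batch_size)
val_batches = vgg.get_batches(path + 'valid', batch_size=batch_size * 2)
vgg.finetune(batches)
vgg.fit(batches, val_batches, nb_epoch=1)

# the one extra line: save the weights so we never have to retrain from scratch
vgg.model.save_weights(path + 'results/ft1.h5')
```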
So I now had a model which predicted cats and dogs for my Redux competition. 00:22:52.740 |
So Kaggle tells us exactly what they expect, and the way they do that is by showing us a sample submission file. 00:23:02.880 |
And basically the sample shows us that they expect an ID column and a label column. 00:23:10.820 |
The ID is the file number, so if you have a look at the test set, you'll see every file is just named with a number. 00:23:25.280 |
So it's expecting to get the number of the file along with your probability. 00:23:38.280 |
So you have to figure out how to take your model and create something of that form. 00:23:46.280 |
This is clearly something that you're going to be doing a lot. 00:23:48.720 |
So once I figured out how to do it, I actually created a method to do it in one step. 00:23:53.080 |
So I'm going to go and show you the method that I wrote. 00:24:12.000 |
So I just added this utils module that I kind of chucked everything in. 00:24:15.480 |
Actually that's not true, I'll put it in my VGG module because I added it to the VGG class. 00:24:23.040 |
So there's a few ways you could possibly do this. 00:24:25.440 |
Basically you know that you've got a way of grabbing a mini-batch of data at a time. 00:24:31.760 |
So one thing you could do would be to grab your mini-batch size 64, you could grab your 00:24:36.360 |
64 predictions and just keep appending them 64 at a time to an array until eventually 00:24:43.240 |
you have your 12,500 test images all with a prediction in an array. 00:24:50.280 |
That is actually a perfectly valid way to do it. 00:24:52.240 |
How many people solved it using that kind of approach? 00:24:55.560 |
Not many of you, that's interesting, but it works perfectly well. 00:25:01.160 |
Those of you who didn't, I guess either asked on the forum or read the documentation and 00:25:05.840 |
discovered that there's a very handy thing in Keras called Predict Generator. 00:25:12.600 |
And what Predict Generator does is it lets you send it in a bunch of batches, so something 00:25:18.640 |
that we created with get_batches, and it will run the predictions on every one of those 00:25:22.840 |
batches and return them all in a single array. 00:25:29.640 |
If you read the Keras documentation, which you should do very often, you will find out 00:25:35.040 |
that Predict Generator generally will give you the labels. 00:25:41.120 |
So not the probabilities, but the labels: cat 1, dog 0, something like that. 00:25:46.000 |
In this case, for this competition, they told us they want probabilities, not labels. 00:25:53.320 |
So instead of calling the get_batches, which we wrote, here is the get_batches that we 00:25:59.680 |
wrote, you can see all it's doing is calling something else, which is flow from directory. 00:26:07.320 |
To get Predict Generator to give you probabilities instead of classes, you have to pass in an 00:26:17.920 |
extra argument, which is class_mode=, and rather than 'categorical', you have to say None. 00:26:24.320 |
So in my case, when I went ahead and actually modified get_batches to take an extra argument, 00:26:29.400 |
which was class_mode, and then in the test method I created, I added class_mode=None. 00:26:37.200 |
So then I could call model.predict_generator, passing in my batches, and that is going to give me my predictions. 00:26:51.880 |
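Here is a rough sketch of what such a test method can look like in Keras 1.x (the version used in this course); the function names and defaults are illustrative rather than the exact course code.

```python
from keras.preprocessing import image

def get_batches(dirname, shuffle=True, batch_size=64, class_mode='categorical'):
    # thin wrapper around flow_from_directory, now exposing class_mode
    gen = image.ImageDataGenerator()
    return gen.flow_from_directory(dirname, target_size=(224, 224),
                                   class_mode=class_mode,
                                   shuffle=shuffle, batch_size=batch_size)

def test(model, path, batch_size=8):
    # class_mode=None means the generator yields images only (no labels),
    # so predict_generator returns the raw model outputs (probabilities here)
    test_batches = get_batches(path, shuffle=False, batch_size=batch_size,
                               class_mode=None)
    preds = model.predict_generator(test_batches, test_batches.nb_sample)
    return test_batches, preds
```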
So once I do, I basically say vgg.test, this is the thing I created, pass in my test directory, 00:26:57.880 |
pass in my batch size, and that returns two things: it returns the predictions, and it returns the batches. 00:27:03.960 |
I can then use batches.filenames to grab the filenames, because I need the filenames in order to build the submission file. 00:27:13.000 |
And so that looks like this, let's take a look at them, so there are a few predictions and a few filenames. 00:27:32.240 |
Now one thing interesting is that at least for the first five, the probabilities are 00:27:37.000 |
all 1's and 0's, rather than 0.6, 0.8, and so forth. 00:27:40.600 |
We're going to talk about why that is in just a moment. 00:27:45.480 |
It's not doing anything wrong; it really is that confident about the answer. 00:27:50.000 |
So all we need to do is grab, because Kaggle wants something which is is_dog, we just need 00:27:55.600 |
to grab the second column of this (the dog probabilities), and the numbers from the filenames, and place them together as two columns. 00:28:02.840 |
So here is grabbing that column from the predictions, and I call it is_dog. 00:28:11.680 |
Here is grabbing from the 8th character until the dot in the filenames, and turning that into an integer to get the file ID. 00:28:20.760 |
NumPy has something called stack, which lets you put two columns next to each other. 00:28:29.280 |
And then NumPy lets you save that as a CSV file using save text. 00:28:35.400 |
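Putting those steps together looks roughly like this, assuming preds and filenames came back from the test call above and that the test filenames look like 'unknown/1234.jpg' (so the number starts at the 8th character, as mentioned); the output filename and formatting are just examples.

```python
import numpy as np

# the "is dog" probability is the dog column (index 1) of the predictions
isdog = preds[:, 1]

# slice the file number out of names like 'unknown/87.jpg' and make it an int
ids = np.array([int(f[8:f.find('.')]) for f in filenames])

# put the two columns side by side and write a CSV with the header Kaggle expects
subm = np.stack([ids, isdog], axis=1)
np.savetxt('submission1.csv', subm, fmt='%d,%.5f', header='id,label', comments='')
```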
You can now either SSH to your AWS instance and use kg submit, or my preferred technique 00:28:42.440 |
is to use a handy little IPython thing called FileLink. 00:28:46.880 |
If you type FileLink and then pass in a file that is on your server, it gives you a little 00:28:52.240 |
URL like this, which I can click on, and it downloads it to my computer. 00:28:58.440 |
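The FileLink trick is just this (FileLink comes from IPython.display; the filename is whatever you saved above):

```python
from IPython.display import FileLink

# renders a clickable link in the notebook so the file can be downloaded locally
FileLink('submission1.csv')
```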
And so now on my computer I can go to Kaggle and I can just submit it in the usual way. 00:29:03.400 |
I prefer that because it lets me find out exactly if there's any error messages or anything 00:29:07.480 |
going wrong on Kaggle, I can see what's happening. 00:29:11.160 |
So as you can see, rerunning what we learned last time to submit something to Kaggle really 00:29:19.280 |
just requires a little bit of coding to just create the submission file, a little bit of 00:29:25.160 |
bash scripting to move things into the right place, and then rerunning the 7 lines of code, 00:29:29.560 |
the actual deep learning itself is incredibly straightforward. 00:29:37.040 |
When I submitted my 1s and 0s to Kaggle, I was put in -- let's have a look at the leaderboard. 00:29:51.320 |
The first thing I did was I accidentally put in "is cat" rather than "is dog", and that made my score much worse. 00:30:02.240 |
Then when I was putting in 1s and 0s, I was in 110th place, which is still not that great. 00:30:08.520 |
Now the funny thing was I was pretty confident that my model was doing well because the validation 00:30:13.280 |
set for my model told me that my accuracy was 97.5%. 00:30:23.520 |
I'm pretty confident that people on Kaggle are not all of them doing better than that. 00:30:31.280 |
So that's a good time to figure out what does this number mean? 00:30:40.080 |
It says here that it is a log loss, so if we go to Evaluation, we can find out what log loss is. 00:30:53.480 |
Log loss is known in Keras as binary cross-entropy or categorical cross-entropy, and you will actually 00:31:00.860 |
find it very familiar because every single time we've been creating a model, we have 00:31:06.680 |
been using it. Let's go and look at where we compile the model. 00:31:21.760 |
When we compile a model, we've always been using categorical cross-entropy. 00:31:25.560 |
So it's probably a good time for us to find out what the hell this means. 00:31:29.480 |
So the short answer is it is this mathematical function. 00:31:36.240 |
But let's dig into this a little bit more and find out what's going on. 00:31:40.480 |
I would strongly recommend that when you want to understand how something works, you whip 00:31:46.920 |
Spreadsheets are like my favorite tool for doing small-scale data analysis. 00:31:53.000 |
They are perhaps the least well-utilized tools among professional data scientists, which 00:32:00.880 |
Because back when I was in consulting, everybody used them for everything, and they were the 00:32:06.720 |
So what I've done here is I've gone ahead and created a little column of is-cat labels. 00:32:13.440 |
So this is the correct answer, and I've created a little column of some possible predictions. 00:32:18.240 |
And then I've just gone in and I've typed in the formula from that Kaggle page. 00:32:24.320 |
Basically it's minus the truth label times the log of the prediction, minus 1 minus the truth label times the log of 1 minus the prediction. 00:32:38.200 |
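Written out as a formula, that is the standard log loss, where y is the true label, p is the predicted probability, and the result is averaged over the n images:

$$ \text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\,\Big] $$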
Now if you think about it, the truth label is always 1 or 0. 00:32:42.020 |
So this is actually probably more easily understood using an if function. 00:32:49.000 |
Rather than multiplying by 1 and 0, let's just use the if function. 00:32:51.840 |
Because if it's a cat, then take the log of the prediction, otherwise take the log of 1 minus the prediction. 00:33:02.280 |
If it's a cat and your prediction is really high, then we're taking the log of something close to 1, which is close to zero. 00:33:09.560 |
If it's not a cat and our prediction is really low, then we take the log of 1 minus the prediction, which is again close to zero. 00:33:16.620 |
And so you can get a sense of it by looking here, here's like a non-cat, which we thought 00:33:23.320 |
is a non-cat, and therefore we end up with log of 1 minus that, which is a low number. 00:33:31.640 |
Here's a cat, which we're pretty confident isn't a cat, so here is log of that. 00:33:37.280 |
Notice there's a negative sign at the front just to make it so that smaller numbers are better. 00:33:43.480 |
So this is log loss, also known as binary or categorical cross-entropy. 00:33:51.000 |
And this is where we find out what's going on. 00:33:53.200 |
Because I'm now going to go and try and say, well, what did I submit? 00:33:58.120 |
And I've submitted predictions that were all 1s and 0s. 00:34:14.320 |
If any of those completely confident predictions is wrong, we would be taking the log of zero, so actually Kaggle has been pretty nice not to return just an error. 00:34:19.640 |
And I actually know why this happens because I wrote this functionality on Kaggle. 00:34:23.960 |
Kaggle modifies it by a tiny 0.0001, just to make sure it doesn't die. 00:34:30.400 |
So if you say 1, it actually treats it as 0.9999, if you say 0 it treats it as 0.0001. 00:34:36.720 |
So our incredibly overconfident model is getting massively penalized for that overconfidence. 00:34:44.220 |
So what would be better to do would be instead of sending across 1s and 0s, why not send 00:34:48.960 |
across actual probabilities you think are reasonable? 00:34:53.320 |
So in my case, what I did was I added a line which was, I said numpy.clip, my first column 00:35:20.320 |
of my predictions, and clip it to 0.05 and 0.95. 00:35:23.600 |
So anything less than 0.05 becomes 0.05 and anything greater than 0.95 becomes 0.95. 00:35:31.080 |
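That clipping step is a one-liner with NumPy; the 0.05 and 0.95 bounds are the ones just mentioned.

```python
import numpy as np

# squash the overconfident 0/1 predictions into [0.05, 0.95] before submitting
isdog = np.clip(isdog, 0.05, 0.95)
```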
And that moved me from 110th place to 40th place. 00:35:37.400 |
So the goal of this week was really try and get in the top half of this competition. 00:35:42.200 |
And that's all you had to do, was run a single epoch, and then realize that with this evaluation 00:35:46.880 |
function, you need to be submitting things that aren't 1s and 0s. 00:35:51.360 |
Let's take that one offline and talk about it in the forum because I actually need to keep moving. 00:36:13.320 |
So probably I should have used, and I'll be interested in trying this tomorrow and maybe 00:36:23.680 |
in a resubmission, I probably should have done 0.025 and 0.975 because I actually know 00:36:29.760 |
that my accuracy on the validation set was 0.975. 00:36:34.600 |
So that's probably the probability that I should have used. 00:36:38.320 |
I would need to think about it more though to think like, because it's like a nonlinear 00:36:42.520 |
loss function, is it better to underestimate how confident you are or overestimate how 00:36:49.120 |
So I would need to think about it a little bit. 00:36:51.720 |
In the end, I said it's about 97.5, I have a feeling that being overconfident might be 00:36:57.080 |
a bad thing because of the shape of the function, so I'll just be a little bit on the tame side. 00:37:05.040 |
I then later on tried 0.02 and 0.98, and I did actually get a slightly better answer. 00:37:12.680 |
I actually got a little bit better than that. 00:37:15.520 |
I think in the end this afternoon I ran a couple more epochs just to see what would happen. 00:37:27.480 |
So I'll show you how you can get to 24th position, and it's incredibly simple. 00:37:33.120 |
You take these two lines here, fit and save weights, and copy and paste them a bunch of times. 00:37:41.720 |
You can see I saved the weights under a different file name each time just so that I can always 00:37:46.480 |
go back and use a model that I created earlier. 00:37:51.280 |
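The copy-and-paste pattern being described looks roughly like this, reusing the same vgg object as before; the filenames are placeholders, and the learning rate change (discussed next) is done here via the Keras backend, which is one safe way to change it on an already-compiled model.

```python
from keras import backend as K

# run one epoch at a time, saving the weights under a new name after each,
# so any intermediate model can be recovered later
vgg.fit(batches, val_batches, nb_epoch=1)
vgg.model.save_weights(path + 'results/ft1.h5')

vgg.fit(batches, val_batches, nb_epoch=1)
vgg.model.save_weights(path + 'results/ft2.h5')

# optionally drop the learning rate partway through
K.set_value(vgg.model.optimizer.lr, 0.01)

vgg.fit(batches, val_batches, nb_epoch=1)
vgg.model.save_weights(path + 'results/ft3.h5')
```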
Something we'll talk about more in the class later is this idea that halfway through after 00:37:55.080 |
two epochs I changed my learning rate from 0.1 to 0.01, just because I happen to know that that's often a good thing to do. 00:38:03.560 |
I haven't actually tried it without doing that. 00:38:05.760 |
I suspect it might be just as good or even better, but that was just something I tried. 00:38:10.580 |
So interestingly, by the time I run four epochs, my accuracy is 98.3%. 00:38:18.160 |
That would have been second place in the original Cats and Dogs competition. 00:38:22.480 |
So you can see it doesn't take much to get really good results. 00:38:28.280 |
And each one of these took, as you can see, 10 minutes to run on my AWS P2 instance. 00:38:51.120 |
The original Cats and Dogs used a different evaluation function, which was just accuracy. 00:38:56.480 |
So they changed it for the Redux one to use log loss, which makes it a bit more interesting. 00:39:13.240 |
The reason I didn't just say nb_epoch=4 is that I really wanted to save the result after 00:39:20.440 |
each epoch under a different weights file name just in case at some point it overfit. 00:39:25.280 |
I could always go back and use one that I got in the middle. 00:39:37.200 |
We're going to learn a lot about that in the next couple of weeks. 00:39:41.780 |
In this case, we have added a single linear layer to the end. 00:39:49.220 |
And so we actually are not training very many parameters. 00:39:52.280 |
So my guess would be that in this case, we could probably run as many epochs as we like 00:39:56.360 |
and it would probably keep getting better and better until it eventually levels off. 00:40:05.200 |
So I wanted to talk about what are these probabilities. 00:40:10.240 |
One way to do that, and also to talk about how can you make this model better, is any 00:40:15.680 |
time I build a model and I think about how to make it better, my first step is to draw pictures. 00:40:23.320 |
Let's take that one offline onto the forum because we don't need to cover it today. 00:40:39.920 |
Now when I say draw pictures, I mean everything from printing out the first five lines of 00:40:45.580 |
your array to see what it looks like to drawing complex plots. 00:40:50.680 |
For a computer vision, you can draw lots of pictures because we're classifying pictures. 00:40:55.680 |
I've given you some tips here about what I think are super useful things to visualize. 00:41:01.180 |
So when I wanted to find out how come my Kaggle submission is 110th place, I ran my kind of standard set of visualizations. 00:41:10.760 |
The standard five steps are let's look at a few examples of images we got right, let's 00:41:15.960 |
look at a few examples of images we got wrong. 00:41:18.940 |
Let's look at some of the cats that we felt were the most cat-like, some of the dogs that we felt were the most dog-like. 00:41:26.400 |
Some of the cats that we were the most wrong about, some of the dogs we were the most wrong 00:41:30.200 |
about, and then finally some of the cats and dogs that our model is the most unsure about. 00:41:37.960 |
This little bit of code I suggest you keep around somewhere because this is a super useful 00:41:42.320 |
thing to do anytime you do image recognition. 00:41:45.400 |
So the first thing I did was I loaded my weights back up just to make sure that they were there 00:41:49.360 |
and I took them from my very first epoch, and I used that vgg.test method that I just showed you. 00:41:56.460 |
This time I passed in the validation set, not the test set, because the validation set has the correct labels. 00:42:03.400 |
So then from the batches I could get the correct labels and I could get the file names. 00:42:08.760 |
I then grabbed the probabilities and the class predictions, and that then allowed me to do each of those five steps. 00:42:16.800 |
So here's number 1, a few correct labels at random. 00:42:20.120 |
So numpy.where, the prediction is equal to the label. 00:42:25.720 |
Let's then get a random permutation and grab the first 4 and plot them by index. 00:42:32.080 |
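As a sketch, the first of those plots (correct examples at random) can be done like this; preds_class, labels, probs, filenames and val_path are assumed to have come from the vgg.test call on the validation set, and the plotting here is plain matplotlib rather than the course's plots helper.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread

n_view = 4

# 1. indices where the predicted class matches the true label
correct = np.where(preds_class == labels)[0]

# pick a few of them at random
idx = np.random.permutation(correct)[:n_view]

# plot them side by side, titled with the predicted probability of "dog"
fig, axes = plt.subplots(1, n_view, figsize=(12, 3))
for ax, i in zip(axes, idx):
    ax.imshow(imread(val_path + filenames[i]))
    ax.set_title('p(dog) = %.4f' % probs[i])
    ax.axis('off')
plt.show()
```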
So here are 4 examples of things that we got right. 00:42:37.320 |
And not surprisingly, this cat looks like a cat and this dog looks like a dog. 00:42:46.600 |
And here are some examples of things we got wrong: you can kind of see here's a very black, underexposed thing on a bright background. 00:42:53.120 |
Here is something that is on a totally unusual angle. 00:42:56.720 |
And here is something that's so curled up you can't see its face. 00:43:01.920 |
So this gives me a sense of like, okay, the things that's getting wrong, it's reasonable 00:43:08.240 |
If you looked at this and they were really obvious, cats and dogs, you would think there's 00:43:13.120 |
But in this case, no, the things that it's finding hard are genuinely hard. 00:43:19.560 |
Here are some cats that we felt very sure were cats. 00:43:22.400 |
Here are some dogs we felt very sure were dogs. 00:43:26.520 |
So these weights, this one here results ft1.h5, this ft stands for fine-tune, and you can 00:43:42.120 |
see here I saved my weights after I did my fine-tuning. 00:43:47.680 |
So these I think are the most interesting, which is here are the images we were very 00:44:13.960 |
confident were cats, but they're actually dogs. 00:44:17.280 |
Here's one that is only 50x60 pixels, that's very difficult. 00:44:23.240 |
Here's one that's almost totally in front of a person and is also standing upright. 00:44:31.320 |
This one is very white and is totally from the front, that's quite difficult. 00:44:36.280 |
And this one I'm guessing the color of the floor and the color of the fur are nearly 00:44:41.880 |
So again, this makes sense, these do look genuinely difficult. 00:44:46.120 |
So if we want to do really well in this competition, we might start to think about should we start 00:44:50.600 |
building some models of very very small images, because we now know that sometimes Kaggle gives 00:44:55.680 |
us 50x50 images, which are going to be very difficult for us to deal with. 00:45:01.080 |
Here are some pictures that we were very confident are dogs, but they're actually cats. 00:45:06.160 |
Again, not being able to see the face seems like a common problem. 00:45:12.000 |
And then finally, here are some examples that we were most uncertain about. 00:45:17.020 |
Now notice that the most uncertain are still not very uncertain; they're still nearly 1 or nearly 0. 00:45:24.440 |
Well, we will learn in a moment about exactly what is going on from a mathematical point 00:45:29.120 |
of view when we calculate these things, but the short answer is the probabilities that 00:45:33.480 |
come out of a deep learning network are not probabilities in any statistical sense of 00:45:40.680 |
So this is not actually saying that there's only a 1 in 100,000 chance that this prediction is wrong. 00:45:48.040 |
It's only a probability from the mathematical point of view, and in math the probability 00:45:52.680 |
means it's between 0 and 1, and all of the possibilities add up to 1. 00:45:57.920 |
It's not a probability in the sense that this is actually something that tells you how often 00:46:02.000 |
this is going to be right versus this is going to be wrong. 00:46:06.800 |
When we talk about these probabilities that come out of neural network training, you can't interpret them that way. 00:46:16.580 |
We will learn about how to create better probabilities down the track. 00:46:22.760 |
Every time you do another epoch, your network is going to get more and more confident. 00:46:29.440 |
This is why when I loaded the weights, I loaded the weights from the very first epoch. 00:46:34.600 |
If I had loaded the weights from the last epoch, they all would have been 1 and 0. 00:46:44.380 |
So hopefully you can all go back and get great results on the Kaggle competition. 00:46:49.240 |
Even though I'm going to share all this, you will learn a lot more by trying to do it yourself, 00:46:56.400 |
and only referring to this when and if you're stuck. 00:46:59.640 |
And if you do get stuck, rather than copying and pasting my code, find out what I used 00:47:04.400 |
and then go to the Keras documentation and read about it and then try and write that 00:47:10.720 |
So the more you can do that, the more you'll think, "Okay, I can do this myself." 00:47:17.040 |
Just some suggestions, it's entirely up to you. 00:47:26.960 |
So now that we know how to do this, I wanted to show you one other thing, which is the 00:47:31.760 |
last part of the homework was redo this on a different dataset. 00:47:38.240 |
And so I decided to grab the State Farm Distracted Driver Competition. 00:47:44.440 |
The Kaggle State Farm Distracted Driver Competition has pictures of people in 10 different types 00:47:50.600 |
of distracted driving, ranging from drinking coffee to changing the radio station. 00:47:58.400 |
I wanted to show you how I entered this competition. 00:48:02.420 |
It took me a quarter of an hour to enter the competition, and all I did was I duplicated 00:48:09.880 |
my Cats and Dogs Redux notebook, and then I started basically rerunning everything. 00:48:18.880 |
But in this case, it was even easier because when you download the State Farm Competition 00:48:24.200 |
data, they had already put it into directories, one for each type of distracted driving. 00:48:31.080 |
So I was delighted to discover, let's go to it, so if I type "tree -d", that shows you 00:48:49.360 |
my directory structure, and you can see that "train" already had the 10 directories, one per class. 00:48:54.920 |
It just didn't have a "valid" directory. 00:49:01.020 |
So I only had to create the validation and sample set. 00:49:05.240 |
If all I wanted to do was enter the competition, I wouldn't even have had to have done that. 00:49:09.560 |
So I won't go through it, but it's basically exactly the same code as I had before to create the validation and sample sets. 00:49:16.880 |
I deleted all of the bits which moved things into separate subfolders, I then used exactly 00:49:22.440 |
the same 7 lines of code as before, and that was basically done. 00:49:27.800 |
I'm not getting good accuracy yet, I don't know why, so I'm going to have to figure out what's going on there. 00:49:33.920 |
But as you can see, this general approach works for any kind of image classification. 00:49:42.360 |
There's nothing specific about cats and dogs, so you now have a very general tool in your toolbox. 00:49:50.400 |
And all of the stuff I showed you about visualizing the errors and so forth, you can use all of that here as well. 00:49:55.040 |
So maybe when you're done, you could try this as well. 00:49:57.080 |
Yes, you know, can I grab one of these please? 00:50:26.720 |
So the question is, would this work for CT scans and cancer? 00:50:32.240 |
And I can tell you that the answer is yes, because I've done it. 00:50:34.960 |
So my previous company I created was something called Enlitic, which was the first company to apply deep learning to medicine. 00:50:43.520 |
And the first thing I did with four of my staff was we downloaded the National Lung 00:50:48.240 |
Screening Trial data, which is a thousand examples of people with cancer, it's a CT scan of their 00:50:53.680 |
lungs and 5,000 examples of people without cancer, CT scans of their lungs. 00:50:59.720 |
We took ImageNet, we fine-tuned ImageNet, but in this case instead of cats and dogs, 00:51:06.120 |
we had malignant tumor versus non-malignant tumor. 00:51:09.480 |
We then took the result of that and saw how accurate it was, and we discovered that it 00:51:13.400 |
was more accurate than a panel of four of the world's best radiologists. 00:51:17.840 |
And that ended up getting covered on TV on CNN. 00:51:22.000 |
So making major breakthroughs in domains is not necessarily technically that challenging. 00:51:32.000 |
The technical challenges in this case were really about dealing with the fact that CT 00:51:37.100 |
scans are pretty big, so we had to just think about some resource issues. 00:51:41.200 |
Also they're black and white, so we had to think about how do we change our ImageNet 00:51:44.760 |
pre-training to black and white, and stuff like that. 00:51:47.560 |
But the basic example was really not much different in code to what you see here. 00:52:03.000 |
The State Farm data is 4GB, and I only downloaded it like half an hour before class started. 00:52:11.760 |
So I only ran a small fraction of an epoch just to make sure that it works. 00:52:16.440 |
Running a whole epoch probably would have taken overnight. 00:52:24.840 |
So let's go back to lesson 1, and there was a little bit at the end that we didn't look at. 00:52:39.200 |
Actually before we do, now's a good time for a break. 00:52:42.040 |
So let's have a 12 minute break, let's come back at 8pm, and one thing that you may consider 00:52:50.280 |
doing during those 12 minutes if you haven't done it already is to fill out the survey. 00:52:54.680 |
I will place the survey URL back onto the in class page. 00:53:21.120 |
You need to, because as I've mentioned a couple of times in our emails, the last two thirds 00:53:26.160 |
of it was actually a surprise lesson 0 of this class, and it's where I teach about what convolutions are. 00:53:50.160 |
The first 20 minutes or so is more of a general background, but the rest is a discussion of convolutions. 00:53:56.680 |
For now, I'll try not to assume too much that you know what they are; the rest of it hopefully will make sense anyway. 00:54:06.240 |
But I want to talk about fine-tuning, and I want to talk about why we do fine-tuning. 00:54:24.040 |
Why do we start with an ImageNet network and then fine-tune it rather than just train our own model from scratch? 00:54:34.240 |
And the reason why is that an ImageNet network has learned a hell of a lot of stuff about what images look like. 00:54:43.720 |
A guy called Matt Zeiler wrote this fantastic paper a few years ago in which he showed us what the layers of these networks actually learn. 00:54:53.360 |
And in fact, the year after he wrote this paper, he went on to win ImageNet. 00:54:57.780 |
So this is a powerful example of why spending time thinking about visualizations is so helpful. 00:55:04.140 |
By spending time thinking about visualizing networks, he then realized what was wrong 00:55:07.960 |
with the networks at the time, made them better and won the next year's ImageNet. 00:55:14.120 |
We're not going to talk about that, we're going to talk about some of these pictures 00:55:19.040 |
Here are 9 examples of what the very first layer of an ImageNet convolutional neural 00:55:27.600 |
network looks like, what the filters look like. 00:55:33.360 |
And you can see here that, for example, here is a filter that learns to find a diagonal line. 00:55:52.440 |
So you can see it's saying look for something where there's no pixels and then there's bright 00:55:58.340 |
pixels and then there's no pixels, so that's finding a diagonal line. 00:56:02.200 |
Here's something that finds a diagonal line in the up direction. 00:56:04.880 |
Here's something that finds a horizontal gradient from orange to blue. 00:56:12.940 |
As I said, these are just 9 of these filters in layer 1 of this ImageNet trained network. 00:56:24.240 |
So what happens, those of you who have watched the video I just mentioned will be aware of 00:56:29.480 |
this, is that each of these filters gets placed pixel by pixel or group of pixels by group 00:56:36.400 |
of pixels over a photo, over an image, to find which parts of an image it matches. 00:56:45.640 |
And over here it shows 9 examples of little bits of actual ImageNet images which match 00:56:57.360 |
So here are, as you can see, they all are little diagonal lines. 00:57:02.000 |
So here are 9 examples which match the next filter, the diagonal lines in the opposite direction. 00:57:09.580 |
The filters in the very first layer of a deep learning network are very easy to visualize. 00:57:15.360 |
This has been done for a long time, and we've known for a long time that this is what the first layer looks like. 00:57:21.180 |
We also know, incidentally, that the human vision system is very similar. 00:57:26.560 |
The human vision system has filters that look much the same. 00:57:36.840 |
To really answer the question of what are we talking about here, I would say watch the 00:57:42.840 |
But the short answer is this is a 7x7 pixel patch which is slid over the image, one group 00:57:51.680 |
of 7 pixels at a time, to find which 7x7 patches look like that. 00:57:57.100 |
And here is one example of a 7x7 patch that looks like that. 00:58:04.320 |
So for example, this gradient, here are some examples of 7x7 patches that look like that. 00:58:13.980 |
So we know the human vision system actually looks for very similar kinds of things. 00:58:21.240 |
These kinds of things that they look for are called Gabor filters. 00:58:25.080 |
If you want to Google for Gabor filters, you can see some examples. 00:58:34.360 |
It's a little bit harder to visualize what the second layer of a neural net looks like. 00:58:41.960 |
In his paper, he shows us a number of examples of the second layer of his ImageNet trained network. 00:58:50.800 |
Since we can't directly visualize them, instead he shows examples of what the filters respond to most strongly. 00:58:57.600 |
So here is an example of a filter which clearly tends to pick up corners. 00:59:04.880 |
So in other words, it's taking the straight lines from the previous layer and combining them to find corners. 00:59:13.720 |
There's another one which is learning to find circles, and so forth. 00:59:20.800 |
So you can see here are 9 examples from actual pictures on ImageNet which actually did get heavily activated by this corner filter. 00:59:33.280 |
And here are some that got heavily activated by this circle filter. 00:59:39.040 |
The third layer then can take these filters and combine them, and remember this is just 00:59:44.120 |
16 out of 100 which are actually in the ImageNet architecture. 00:59:51.120 |
So in layer 3, we can combine all of those to create even more sophisticated filters. 00:59:56.200 |
In layer 3, there's a filter which can find repeating geometrical patterns. 01:00:02.520 |
Here's a filter, let's go look at the examples. 01:00:07.920 |
That's interesting, it's finding pieces of text. 01:00:12.320 |
And here's something which is finding edges of natural things like fur and plants. 01:00:20.520 |
Layer 4 is finding certain kinds of dog face. 01:00:28.440 |
Layer 5 is finding the eyeballs of birds and reptiles and so forth. 01:00:43.240 |
What we do when we fine-tune is we say let's keep all of these learnt filters and use them 01:00:53.400 |
and then just learn how to combine the most complex subtle nuanced filters to find cats 01:01:03.640 |
versus dogs rather than combine them to learn a thousand categories of ImageNet. 01:01:14.680 |
So coming back to Yannet's earlier question about whether this works for CT scans and lung cancer: 01:01:24.880 |
These kinds of filters that find dog faces are not very helpful for looking at a CT scan 01:01:31.120 |
and looking for cancer, but these earlier ones that can recognize repeating patterns or edges and gradients certainly are. 01:01:41.000 |
So really regardless of what computer vision work you're doing, starting with some kind 01:01:47.560 |
of pre-trained network is almost certainly a good idea because at some level that pre-trained 01:01:51.720 |
network has learnt to find some kinds of features that are going to be useful to you. 01:01:56.720 |
And so if you start from scratch you have to learn them from scratch. 01:02:01.240 |
In cats versus dogs we only had 25,000 pictures. 01:02:04.640 |
And so from 25,000 pictures to learn this whole hierarchy of geometric and semantic features would be asking a lot. 01:02:13.760 |
So let's not learn it, let's use one that's already been learned on ImageNet, which was trained on over a million images. 01:02:20.540 |
So that's the short answer to the question "Why do fine-tuning?" 01:02:25.520 |
To the longer answer really requires answering the question "What exactly is fine-tuning?" 01:02:31.620 |
And to answer the question "What exactly is fine-tuning?" we have to answer the question "What exactly is a neural network?" 01:02:39.160 |
So a neural network, we'll learn more about this shortly; as for which layer of it you should fine-tune from, the short answer is this. 01:03:10.840 |
Generally speaking, if you're doing something with natural images, the second to last layer 01:03:16.400 |
is very likely to be the best, but I just tend to try a few. 01:03:20.960 |
And we're going to see today or next week some ways that we can actually experiment with that. 01:03:29.920 |
So as per usual, in order to learn about something we will use Excel. 01:03:40.200 |
Rather than having a picture with lots of pixels, I just have three inputs, a single 01:03:46.360 |
row with three inputs which are x1, x2 and x3, and the numbers are 2, 3 and 1. 01:03:52.560 |
And rather than trying to pick out whether it's a dog or a cat, we're going to assume we're trying to predict two output values, which are 5 and 6. 01:03:58.520 |
So here's like a single row that we're feeding into a deep neural network. 01:04:07.480 |
A deep neural network basically is a bunch of matrix products. 01:04:14.280 |
So what I've done here is I've created a bunch of random numbers. 01:04:19.480 |
They are normally distributed random numbers, and this is the standard deviation that I'm 01:04:24.920 |
using for my normal distribution, and I'm using 0 as the mean. 01:04:33.240 |
What if I then take my input vector and matrix multiply them by my random weights? 01:04:48.720 |
So for example, 24.03 = 2 x 11.07 + 3 x -2.81 + 1 x 10.31 and so forth. 01:05:03.640 |
Any of you who are either not familiar with or are a little shaky on your matrix vector 01:05:11.560 |
products, tomorrow please go to the Khan Academy website, look for Linear Algebra, and watch the videos on matrix-vector products. 01:05:22.400 |
They are very, very, very simple, but you also need to understand them very, very, very intuitively, 01:05:29.400 |
comfortably, just like you understand plus and times in regular algebra. 01:05:35.440 |
I really want you to get to that level of comfort with linear algebra because this is 01:05:39.800 |
the basic operation we're doing again and again. 01:06:08.840 |
So if that is a single layer, how do we turn that into multi-layers? 01:06:13.440 |
Well, not surprisingly, we create another bunch of weights. 01:06:18.400 |
And now we take those weights, the new bunch of weights, times the previous activations 01:06:26.080 |
with our matrix multiply, and we get a new set of activations. 01:06:32.120 |
Let's create another bunch of weights and multiply them by our previous set of activations. 01:06:41.640 |
Note that the number of columns in each weight matrix is something you can make as big or as small 01:06:47.360 |
as you like, as long as the last one has the same number of columns as your output. 01:06:56.400 |
So our final weight matrix had to have 2 columns so that our final activations have 2 things. 01:07:04.880 |
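The same spreadsheet experiment in a few lines of NumPy, with the same shapes (3 inputs, hidden activations of 4 and 3, 2 outputs); the standard deviation of 0.7 is just an arbitrary stand-in for "a bunch of random numbers".

```python
import numpy as np

x = np.array([2., 3., 1.])     # the single input row
target = np.array([5., 6.])    # the outputs we would like to produce

std = 0.7                      # arbitrary scale for the random weights
w1 = np.random.normal(0, std, (3, 4))
w2 = np.random.normal(0, std, (4, 3))
w3 = np.random.normal(0, std, (3, 2))

# a "deep network" here is just repeated matrix products
a1 = x.dot(w1)
a2 = a1.dot(w2)
out = a2.dot(w3)

print(out, target)   # with random weights, out is nowhere near the target
```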
So with our random numbers, our activations are not very close to what we hoped they would be. 01:07:13.880 |
So the basic idea here is that we now have to use some kind of optimization algorithm 01:07:19.240 |
to repeatedly make the weights a little bit better and a little bit better, and we will learn how that works shortly. 01:07:24.840 |
But for now, hopefully you're all familiar with the idea that there is such a thing as an optimization algorithm. 01:07:31.320 |
An optimization algorithm is something that takes some kind of output to some kind of 01:07:35.760 |
mathematical function and finds the inputs to that function that makes the outputs as 01:07:41.760 |
And in this case, the thing we would want to make as low as possible would be something 01:07:45.920 |
like the sum of squared errors between the activations and the outputs. 01:07:54.520 |
I want to point out something here, which is that when we stuck in these random numbers, 01:07:58.680 |
the activations that came out, not only are they wrong, they're not even in the same general ballpark. 01:08:11.040 |
The reason it's a bad problem is because they're so much bigger than the scale that we were expecting. 01:08:17.240 |
As we change these weights just a little bit, it's going to change the activations by a lot. 01:08:25.960 |
In general, you want your neural network to start off even with random weights, to start 01:08:33.960 |
off with activations which are all of similar scale to each other, and with output activations on a similar scale to the targets. 01:08:43.480 |
For a very long time, nobody really knew how to do this. 01:08:47.600 |
And so for a very long time, people could not really train deep neural networks. 01:08:53.120 |
It turns out that it is incredibly easy to do. 01:08:56.720 |
And there is a whole body of work talking about neural network initializations. 01:09:03.120 |
It turns out that a really simple and really effective neural network initialization is 01:09:07.960 |
called Xavier initialization, named after its creator, Xavier Glorot. 01:09:19.160 |
Like many things in deep learning, you will find this complex-sounding thing, the Xavier 01:09:25.680 |
weight initialization scheme, and when you look into it, you will find it is something really simple. 01:09:32.160 |
This is about as complex as deep learning gets. 01:09:35.480 |
So I am now going to go ahead and implement Xavier deep learning weight initialization in Excel. 01:09:42.080 |
So I'm going to go up here and type =2/(3in + 4out), putting the denominator in brackets because 01:09:56.360 |
we're complex and sophisticated mathematicians, and press enter. 01:10:01.960 |
So now my first set of weights have that as its standard deviation. 01:10:05.780 |
My second set of weights I actually have pointing at the same cell, because they also have 4 in and 3 out, so the value is the same. 01:10:12.480 |
And then my third I need to have =2/(3in + 2out). 01:10:22.280 |
So I have now implemented it in Excel, and you can see that my activations are indeed at a much more sensible scale. 01:10:31.640 |
So generally speaking, you would normalize your inputs and outputs to be mean 0 and standard deviation 1. 01:10:38.720 |
And if you then use weights initialized like this, your activations come out at the same kind of scale. 01:10:54.480 |
Obviously they're not going to be exactly 5 and 6 because we haven't done any optimization 01:10:57.440 |
yet, but we don't want them to be like 100,000. 01:11:07.120 |
Eventually we want them to be close to 5 and 6. 01:11:15.680 |
And so if we start off with them really high or really low, then optimization is going to struggle. 01:11:24.920 |
And so for decades, when people tried to train deep neural networks, the training 01:11:29.760 |
either took forever or was so incredibly unresilient that it was useless, and this one thing, better 01:11:42.380 |
weight initialization, changed that. We're talking maybe 3 years ago that this was invented, so this is not like we're going back into ancient history here. 01:11:52.880 |
Now the good news is that Keras and pretty much any decent neural network library will do this for you. 01:12:02.480 |
Until very recently they pretty much all used this. 01:12:05.680 |
There are some even more recent slightly better approaches, but they'll give you a set of weights 01:12:11.680 |
where your outputs will generally have a reasonable scale. 01:12:14.760 |
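(If you'd rather see that scaling in code, here is a sketch. The square-root form below is what Keras's glorot_normal initializer uses, with stddev = sqrt(2 / (n_in + n_out)); the layer sizes are the spreadsheet's.)

```python
import numpy as np

def glorot_normal(n_in, n_out):
    # Glorot/Xavier-style scaling: stddev = sqrt(2 / (n_in + n_out))
    std = np.sqrt(2.0 / (n_in + n_out))
    return np.random.randn(n_in, n_out) * std

W1 = glorot_normal(3, 4)
W2 = glorot_normal(4, 3)
W3 = glorot_normal(3, 2)
# With weights scaled like this, the activations stay at a sensible scale
# as you multiply through the layers, even before any training happens.
```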
So what's not arbitrary is that you are given your input dimensionality. 01:12:37.360 |
So in our case, for example, it would be 224x224 pixels, in this case I'm saying it's 3 things. 01:12:46.220 |
So for example in our case, for cats and dogs it's 2, for this I'm saying it's 2. 01:12:54.520 |
The thing in the middle, how many columns each of your weight matrices has, is entirely up to you. 01:13:02.100 |
The more columns you add, the more complex your model, and we're going to learn a lot more about that. 01:13:07.780 |
As Rachel said, this is all about your choice of architecture. 01:13:10.460 |
So in my first one here I had 4 columns, and therefore I had 4 outputs. 01:13:15.600 |
In my next one I had 3 columns, and therefore I had 3 outputs. 01:13:19.080 |
In my final one I had 2 columns, and therefore I had 2 outputs, and that is the number of outputs I needed. 01:13:26.960 |
So this thing of like how many columns do you have in your weight matrix is where you 01:13:30.360 |
get to decide how complex your model is, so we're going to see that. 01:13:46.760 |
Alright, so we're going to learn how to create a linear model. 01:14:13.560 |
Let's first of all learn how to create a linear model from scratch, and this is something 01:14:20.320 |
which we did in that original USF Data Institute launch video, but I'll just remind you. 01:14:29.320 |
Without using Keras at all, I can define a line as being ax + b, and I can then create some synthetic data. 01:14:37.680 |
So let's say I'm going to assume a is 3 and b is 8, create some random x's, and my y will be ax + b. 01:14:45.880 |
So here are some x's and some y's that I've created; not surprisingly, the plot of them looks like a line. 01:14:54.480 |
The job of somebody creating a linear model is to say: I don't know what a and b are, how 01:15:01.560 |
can I calculate them? So let's forget that we know that they're 3 and 8, and say let's guess that they're -1 and 1. 01:15:11.400 |
And to make our guess better, we need a loss function. 01:15:14.480 |
So the loss function is something which is a mathematical function that will be high 01:15:18.800 |
if your guess is bad, and is low if it's good. 01:15:22.600 |
The loss function I'm using here is the sum of squared errors, which is just my actual minus my prediction, squared, added up. 01:15:32.320 |
So if I define my loss function like that, and then I say my guesses are -1 and 1, I can calculate my loss. 01:15:42.760 |
So my average loss with my random guesses is not very good. 01:15:47.240 |
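(Written out in NumPy, the setup looks roughly like this; the variable names are mine, not necessarily the notebook's.)

```python
import numpy as np

a_true, b_true = 3.0, 8.0
x = np.random.rand(30)                  # some random x's
y = a_true * x + b_true                 # the "real" line we pretend not to know

def sse(y_actual, y_pred):
    # sum of squared errors: high when the guess is bad, low when it's good
    return ((y_actual - y_pred) ** 2).sum()

a_guess, b_guess = -1.0, 1.0            # a wild guess
print(sse(y, a_guess * x + b_guess))    # a big number: the guess is bad
```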
In order to create an optimizer, I need something that can make my weights a little bit better. 01:15:53.640 |
If I have something that can make my weights a little bit better, I can just call it again and again. 01:16:01.360 |
If you know the derivative of your loss function with respect to your weights, then all you 01:16:08.120 |
need to do is update your weights by the opposite of that. 01:16:12.240 |
So remember, the derivative is the thing that says: as your weight changes, your output changes by this amount. 01:16:24.080 |
In this case, we have y = ax + b, and then we have our loss function, which is actual minus predicted, squared, summed up. 01:16:49.520 |
So we're now going to create a function called update, which is going to take our a guess 01:16:53.640 |
and our b guess and make them a little bit better. 01:16:56.920 |
And to make them a little bit better, we calculate the derivative of our loss function with respect 01:17:02.320 |
to b, and the derivative of our loss function with respect to a. 01:17:08.880 |
We go to Wolfram Alpha and we enter in d along with our formula, and the thing we want to 01:17:15.280 |
get the derivative of, and it tells us the answer. 01:17:18.440 |
So that's all I did: I went to Wolfram Alpha, found the correct derivatives, and pasted them in here. 01:17:26.040 |
And so what this means is that this formula here tells me as I increase b by 1, my sum 01:17:33.800 |
of squared errors will change by this amount. 01:17:38.360 |
And this says as I change a by 1, my sum of squared errors will change by this amount. 01:17:43.960 |
So if I know that my loss function gets higher by 3 if I increase a by 1, then clearly I need 01:17:57.400 |
to make a a little bit smaller, because if I make it a little bit smaller, my loss function will get a little bit lower. 01:18:04.680 |
So that's why our final step is to say: take our guess and subtract from it our derivative times a little bit. 01:18:15.900 |
LR stands for learning rate, and as you can see I'm setting it to 0.01. 01:18:21.920 |
How much is a little bit is something which people spend a lot of time thinking about 01:18:26.760 |
and studying, and we will spend time talking about it, but you can always find a workable value by trial and error. 01:18:34.480 |
When you use Keras, you will always need to tell it what learning rate you want to use, 01:18:38.440 |
and generally you want the highest number you can get away with. 01:18:45.240 |
But the important thing to realize here is that if we update our guess, minus equals 01:18:50.480 |
our derivative times a little bit, our guess is going to be a little bit better because 01:18:57.400 |
we know that going in the opposite direction makes the loss function a little bit lower. 01:19:03.020 |
So let's run those two things, where we've now got a function called update, which every 01:19:07.600 |
time we run it makes our predictions a little bit better. 01:19:11.160 |
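(And here is a sketch of that update step, repeating the data setup so it runs on its own. The derivative formulas are the standard ones for sum of squared errors.)

```python
import numpy as np

# same fake data as the sketch above
x = np.random.rand(30)
y = 3.0 * x + 8.0
a_guess, b_guess = -1.0, 1.0
lr = 0.01                                    # learning rate: "a little bit"

def update(a, b):
    y_pred = a * x + b
    dL_db = 2.0 * (y_pred - y).sum()         # d(SSE)/db
    dL_da = 2.0 * (x * (y_pred - y)).sum()   # d(SSE)/da
    # step in the opposite direction of the derivatives
    return a - lr * dL_da, b - lr * dL_db

for _ in range(100):                         # call it again and again
    a_guess, b_guess = update(a_guess, b_guess)
print(a_guess, b_guess)                      # heading towards 3 and 8
```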
So finally now, I'm basically doing a little animation here: every time it draws 01:19:17.360 |
a frame, it calls my animate function, which calls my update function, and it does that 10 times. 01:19:24.240 |
So let's see what happens when I animate that. 01:19:32.640 |
So it starts with a really bad line, which is my guess of -1 and 1, and it gets better and better. 01:19:40.920 |
So this is how stochastic gradient descent works. 01:19:46.400 |
Stochastic gradient descent is the most important algorithm in deep learning. 01:19:51.000 |
Stochastic gradient descent is the thing that starts with random weights like this and ends up with weights that do what you want. 01:20:03.600 |
So as you can see, stochastic gradient descent is incredibly simple and yet incredibly powerful 01:20:13.480 |
because it can take any function and find the set of parameters that does exactly what 01:20:21.200 |
And when that function is a deep learning neural network that becomes particularly powerful. 01:20:41.680 |
It has nothing to do with neural nets except - so just to remind ourselves about the setup 01:20:47.120 |
for this, we started out by saying this spreadsheet is showing us a deep neural network with a bunch of random parameters. 01:20:56.960 |
Can we come up with a way to replace the random parameters with parameters that actually give us the answers we want? 01:21:05.240 |
So we need to come up with a way to do mathematical optimization. 01:21:09.620 |
So rather than showing how to do that with a deep neural network, let's see how to do 01:21:17.560 |
So we started out by saying let's have a line ax + b where a is 3 and b is 8, and pretend we don't know that. 01:21:30.520 |
Make a wild guess as to what A and B might be, come up with an update function that every 01:21:36.160 |
time we call it makes A and B a little bit better, and then call that update function 01:21:41.220 |
lots of times and confirm that eventually our line fits our data. 01:21:48.460 |
Conceptually take that exact same idea and apply it to these weight matrices. 01:21:53.840 |
Question is, is there a problem here that as we run this update function, might we get 01:22:14.240 |
to a point where, let's say the function looks like this. 01:22:36.160 |
So currently we're trying to optimize the sum of squared errors, and the sum of squared errors 01:22:42.180 |
looks like this, which is fine; but let's say we had a more complex function that kind of goes up and down. 01:23:01.300 |
So if we started here and kind of gradually tried to make it better and better and better, 01:23:06.380 |
we might get to a point where the derivative is zero and we then can't get any better. 01:23:17.260 |
So the question was suggesting a particular approach to avoiding that. 01:23:20.780 |
Here's the good news: in deep learning you don't have local minima. 01:23:28.460 |
Well the reason is that in an actual deep learning neural network, you don't have one 01:23:32.900 |
or two parameters, you have hundreds of millions of parameters. 01:23:37.360 |
So rather than looking like this, or even like a 3D version where it's like something 01:23:42.860 |
like this, it's a 600 million dimensional space. 01:23:48.540 |
And so for something to be a local minimum, it means that the stochastic gradient descent 01:23:53.660 |
has wandered around and got to a point where, in every one of those 600 million directions, 01:24:01.380 |
it can't do any better. The probability of that happening is about 1 in 2 to the power of 600 million. 01:24:06.140 |
So for actual deep learning in practice, there's always enough parameters that it's basically 01:24:12.420 |
unheard of to get to a point where there's no direction you can go to get better. 01:24:17.580 |
So the answer is no, for deep learning, stochastic gradient descent is just as simple as this. 01:24:30.300 |
We will learn some tweaks to allow us to make it faster, but this basic approach works just fine. 01:24:39.700 |
[The question is, "If you had known the derivative of sum of squared errors, would you have been 01:24:45.540 |
able to define the same function in a different way?"] 01:24:53.740 |
And so for a long time, this was a royal goddamn pain in the ass. 01:24:58.060 |
Anybody who wanted to create stochastic gradient descent for their neural network had to go 01:25:02.380 |
through and calculate all of their derivatives. 01:25:05.260 |
And if you've got 600 million parameters, that's a lot of trips to Wolfram Alpha. 01:25:11.300 |
So nowadays, we don't have to worry about that, because all of the modern neural network 01:25:19.780 |
libraries do the differentiation for you. In other words, it's like they have their own little copy of Wolfram Alpha inside them. 01:25:25.700 |
So you won't ever be in a situation where you don't know the derivatives. 01:25:29.420 |
You just tell it your architecture and it will automatically calculate the derivatives. 01:25:36.340 |
Let's take this linear example and see what it looks like in Keras. 01:25:49.700 |
So let's start by creating some random numbers, but this time let's make it a bit more complex. 01:25:54.660 |
We're going to have a random matrix with two columns. 01:25:57.940 |
And so to calculate our y value, we'll do a little matrix multiply here with our x with 01:26:03.380 |
a vector of 2, 3 and then we'll add in a constant of 1. 01:26:15.540 |
So here's our x's, the first 5 out of 30 of them, and here's the first few y's. 01:26:22.120 |
So here 3.2 equals 0.56 times 2 plus 0.37 times 3 plus 1. 01:26:30.780 |
Hopefully this looks very familiar, because it's exactly what we did in Excel at the very 01:26:39.940 |
start. So how do we build this in Keras? The answer is that Keras calls a linear model "dense". 01:26:45.020 |
It's also known in other libraries as fully connected. 01:26:49.020 |
So when we go dense with an input of two columns and an output of one column, we have to find 01:26:57.660 |
a linear model that can go from this two column array to this one column output. 01:27:08.180 |
The second thing we have in Keras is we have some way to build multiple layer networks, 01:27:13.780 |
and Keras calls this sequential. Sequential takes an array that contains all of the layers 01:27:22.340 |
So for example in Excel here, I would have had 1, 2, 3 layers. 01:27:30.700 |
So to create a linear model in Keras, you say Sequential, passing in an array with a single dense layer in it. 01:27:42.660 |
We tell it that there are two inputs and one output. 01:27:47.780 |
And Keras will automatically initialize the weights in a sensible way. 01:27:54.220 |
It will automatically calculate the derivatives. 01:27:56.700 |
So all we have to tell it is how do we want to optimize the weights, and we will say please 01:28:00.880 |
use stochastic gradient descent with a learning rate of 0.1. 01:28:05.780 |
And we're attempting to minimize our loss of a mean squared error. 01:28:11.260 |
So if I do that, that does everything except the very last solving step that we saw in the from-scratch version. 01:28:26.820 |
And as you can see, when we fit, before we start, we can say evaluate to basically find 01:28:33.260 |
out our loss function with random weights, which is pretty crappy. 01:28:37.500 |
And then we run 5 epochs, and the loss function gets better and better and better using the 01:28:42.980 |
stochastic gradient descent update rule we just learned. 01:28:46.020 |
And so at the end, we can evaluate and it's better. 01:28:51.740 |
They should be equal to 2, 3, 1; they're actually 1.8, 2.7, 1.2. 01:29:04.460 |
The loss function keeps getting better; we evaluate it now, it's better, and the weights are now much closer to 2, 3, 1. 01:29:12.840 |
So we now know everything that Keras is doing behind the scenes. 01:29:18.500 |
I'm not hand-waving over details here, that is it. 01:29:27.260 |
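(For reference, the whole thing fits in a few lines. This sketch is written against tf.keras; the course's Keras 1 syntax differs slightly, for example lr= rather than learning_rate=.)

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# fake data: y = 2*x1 + 3*x2 + 1
x = np.random.rand(30, 2)
y = x @ np.array([2.0, 3.0]) + 1.0

lm = Sequential([Dense(1, input_shape=(2,))])          # "dense" = linear layer
lm.compile(optimizer=SGD(learning_rate=0.1), loss='mse')

print(lm.evaluate(x, y, verbose=0))    # loss with the random initial weights
lm.fit(x, y, epochs=5, verbose=0)      # stochastic gradient descent
print(lm.evaluate(x, y, verbose=0))    # noticeably better
print(lm.get_weights())                # gets closer to [2, 3] and 1 with more epochs
```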
If we now tell Keras not just to create a single layer, but to create multiple layers 01:29:32.540 |
by passing multiple layers to this Sequential, we can start to build and optimize deep neural networks. 01:29:40.220 |
But before we do that, we can actually use this to create a pretty decent entry in our cats and dogs competition. 01:29:52.660 |
So forget all the fine-tuning stuff, because I haven't told you how fine-tuning works yet. 01:29:57.780 |
How do we take the output of an ImageNet network and, as simply as possible, create an entry for the competition? 01:30:05.900 |
So the basic problem here is that our current ImageNet network returns a thousand probabilities 01:30:15.940 |
So it returns not just cat vs dog, but animals, domestic animals, and then ideally it would 01:30:39.260 |
just be cat and dog here, but it's not; it keeps going: Egyptian cat, Persian cat, and so forth. 01:30:44.900 |
So one thing we could do is we could write code to take this hierarchy and roll it up 01:30:54.460 |
So I've got a couple of ideas here for how we could do that. 01:30:58.780 |
For instance, we could find, out of the thousand, the largest probability that's either a cat or a dog, and use that. 01:31:05.500 |
Or we could average all of the cat categories, all of the dog categories, and use that. 01:31:10.020 |
But the downsides here are that would require manual coding for something that should be 01:31:14.820 |
learning from data, and more importantly it's ignoring information. 01:31:19.620 |
So let's say out of those thousand categories, the category for a bone was very high. 01:31:26.220 |
It's more likely a dog is with a bone than a cat is with a bone, so therefore it ought 01:31:30.780 |
to actually take advantage, it should learn to recognize environments that cats are in 01:31:35.420 |
vs environments that dogs are in, or even recognize things that look like cats from 01:31:41.820 |
So what we could do is learn a linear model that takes the output of the ImageNet model, 01:31:48.700 |
the thousand predictions, and that uses that as the input, and uses the dog cat label as 01:31:55.460 |
the target, and that linear model would solve our problem. 01:31:59.820 |
We have everything we need to know to create this model now. 01:32:10.500 |
Let's again import our VGG model, and we're going to try and do three things. 01:32:18.220 |
For every image we'll get the true labels, is it cat or is it dog. 01:32:23.140 |
We're going to get the 1000 ImageNet category predictions, so that will be 1000 floats for 01:32:28.540 |
every image, and then we're going to use the output of step 2 as the input to our linear model, 01:32:34.180 |
and we're going to use the output of step 1 as the target for our linear model, and create this model. 01:32:41.860 |
So as per usual, we start by creating our validation batches and our training batches, just like before. 01:32:52.940 |
Because one of the steps here, getting the 1000 ImageNet category predictions for every image, takes a while, we don't want to do it again and again. 01:33:02.540 |
Once we've done it once, let's save the result. 01:33:05.360 |
So I want to show you how you can save NumPy arrays. 01:33:08.460 |
Unfortunately, most of the stuff you'll find online about saving NumPy arrays takes a very, 01:33:13.740 |
very, very long time to run, and it takes a shitload of space. 01:33:18.140 |
There's a really cool library called bcolz that almost nobody knows about that can save 01:33:22.940 |
NumPy arrays very, very quickly and in very little space. 01:33:27.540 |
So I've created these two little things here called save_array and load_array, which you can use. 01:33:34.100 |
They're actually in the utils.py, so you can use them in the future. 01:33:37.580 |
And once you've grabbed the predictions, you can use these to just save the predictions 01:33:46.800 |
and load them back later, rather than recalculating them each time. 01:33:53.940 |
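(The helpers are tiny; roughly along these lines, though check utils.py for the exact definitions used in the course.)

```python
import bcolz

def save_array(fname, arr):
    # write the array to a compressed on-disk carray, then flush it to disk
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def load_array(fname):
    # open the on-disk carray and pull it back into memory as a NumPy array
    return bcolz.open(fname)[:]
```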
Before we even worry about calculating the predictions, we just need to load up the images. 01:33:59.940 |
When we load the images, there's a few things we have to do. 01:34:01.900 |
We have to decode the jpeg images, and we have to convert them into 224x224 pixel images, because that's what VGG expects. 01:34:15.500 |
So I've created this little function called getData, which basically grabs all of the 01:34:22.340 |
validation images and all of the training images and sticks them in a NumPy array. 01:34:33.260 |
If you put two question marks before something in Jupyter, it shows you the source code. 01:34:39.500 |
So if you want to know what getData is doing, go question mark, question mark, getData, and you can see. 01:34:45.020 |
It's just concatenating all of the different batches together. 01:34:52.340 |
Any time you're using one of my little convenience functions, I strongly suggest you look at 01:34:56.060 |
the source code and make sure you see what it's doing. 01:35:02.740 |
So I can grab the data for the validation data, I can grab it for the training data, 01:35:07.120 |
and then I just saved it so that in the future, I can load it rather than recomputing it. 01:35:26.660 |
So now rather than having to watch and wait for that to pre-process, I'll just go load 01:35:30.700 |
array and that goes ahead and loads it off disk. 01:35:35.020 |
It still takes a few seconds, but this will be way faster than having to calculate it 01:35:43.340 |
So what that does is it creates a NumPy array with my 23,000 images, each of which has three colour channels and is 224 by 224 pixels. 01:35:56.220 |
If you remember from lesson 1, the labels that Keras expects are in a very particular format. 01:36:07.140 |
Let's look at the format to see what they look like. 01:36:18.480 |
The format of the labels is each one has two things. 01:36:23.780 |
It has the probability that it's a cat and the probability that it's a dog, and they're always just 0 or 1. 01:36:30.540 |
So here is 0, 1 is a dog, 1, 0 is a cat, 1, 0 is a cat, 0, 1 is a dog. 01:36:36.740 |
This approach where you have a vector where every element of it is a zero except for a 01:36:41.380 |
single one, for the class that you want, is called one-hot encoding. 01:36:47.020 |
And this is used for nearly all deep learning. 01:36:51.740 |
So that's why I created a little function called one-hot that makes it very easy for 01:37:00.900 |
So for example, if your data was just like 0, 1, 2, 1, 0, one-hot encoding that would look like this. 01:37:12.380 |
So that would be the kind of raw form, and that is the one-hot encoded form. 01:37:26.660 |
The reason that we use one-hot encoding a lot is that if you take this and you do a matrix 01:37:32.820 |
multiply by a bunch of weights, W_1, W_2, W_3, the matrix multiply simply selects out the weight corresponding to the single 1. 01:37:46.820 |
So this is what lets you do deep learning really easily with categorical variables. 01:37:56.520 |
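(Here is a sketch of one-hot encoding that example, and of why the matrix multiply works out so neatly; keras.utils.to_categorical does the same encoding for you.)

```python
import numpy as np

labels = np.array([0, 1, 2, 1, 0])
onehot = np.zeros((len(labels), labels.max() + 1))
onehot[np.arange(len(labels)), labels] = 1
print(onehot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]

# Multiplying a one-hot row by a weight matrix just picks out the row of
# weights for that class, which is why it plays so nicely with the
# matrix multiplies above.
W = np.random.randn(3, 4)
print(onehot[1] @ W)       # identical to W[1]
```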
So the next thing I want to do is grab my labels and one-hot encode them. 01:38:19.700 |
So you can see here that the first few classes look like so, and the first few labels are the one-hot encoded versions of them. 01:38:35.780 |
So we're now at a point where we can finally do step number 1: get the 1000 ImageNet category predictions for every image. 01:38:51.980 |
We can just say model.predict and pass in our data. 01:38:59.480 |
So model.predict with train data is going to give us the 1000 predictions from image 01:39:04.460 |
net for our train data, and this will give it for our validation data. 01:39:08.980 |
And again, running this takes a few minutes, so I save it, and then instead of making 01:39:14.060 |
you wait, I will just load it, and so you can see that the 23,000 images 01:39:21.660 |
are now no longer 23,000 by 3 by 224 by 224; it's now 23,000 by 1,000, so for every image we have 1,000 predictions. 01:39:31.860 |
So let's look at one of them, train_features 0. 01:39:38.180 |
Not surprisingly, if we look at just one of these, nearly all of them are 0. 01:39:43.180 |
So for these 1000 categories, only one of these numbers should be big, it can't be lots 01:39:49.940 |
of different things, it's not a cat and a dog and a jet airplane. 01:39:53.860 |
So not surprisingly, nearly all of these things are very close to 0, and hopefully just one of them is large. 01:40:05.060 |
So now that we've got our 1000 features for each of our training images and for each of 01:40:11.180 |
our validation images, we can go ahead and create our linear model. 01:40:18.060 |
The input is 1000 columns, it's every one of those image net predictions. 01:40:23.460 |
The output is 2 columns, it's a dog or it's a cat. 01:40:28.740 |
We will optimize it with, I'm actually not going to use SGD, I'm going to use a slightly 01:40:35.060 |
better thing called rmsprop which I will teach you about next week, it's a very minor tweak 01:40:42.460 |
So I suggest in practice you use rmsprop, not SGD, but it's almost the same thing. 01:40:49.840 |
And now that we know how to fit the model, once it's defined, we can just go model.fit, 01:40:56.900 |
and it runs basically instantly because, if we have a look at our model, 01:41:08.760 |
we have just one layer with just 2000 weights, so running 3 epochs took 0 seconds. 01:41:17.560 |
And we've got an accuracy of 0.9734, let's run another 3 epochs, 0.9770, even better. 01:41:30.120 |
So you can see this is like the simplest possible model. 01:41:33.940 |
I haven't done any fine-tuning, all I've done is I've just taken the image net predictions 01:41:40.020 |
for every image and built a linear model that maps from those predictions to cat or dog. 01:41:47.380 |
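(As a sketch, the whole model is just this. The train_features and train_labels names stand in for the saved ImageNet predictions and one-hot labels built above; dummy shapes are used here so the snippet runs on its own, and the learning rate is a plausible choice rather than necessarily the notebook's.)

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop

# placeholders: really these are model.predict(...) outputs and one-hot labels
train_features = np.random.rand(64, 1000)
train_labels   = np.eye(2)[np.random.randint(0, 2, 64)]

lm = Sequential([Dense(2, activation='softmax', input_shape=(1000,))])
lm.compile(optimizer=RMSprop(learning_rate=0.1),
           loss='categorical_crossentropy', metrics=['accuracy'])
lm.fit(train_features, train_labels, epochs=3, batch_size=64)
```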
A lot of the amateur deep learning papers that you see, like I showed you a couple last 01:41:52.980 |
week, one was like classifying leaves by whether they're sick, one was like classifying skin 01:42:03.420 |
Often this is all people do: they take a pre-trained model, they grab the outputs, and they stick a simple linear model on the end. 01:42:13.140 |
And as you can see, it actually works pretty well. 01:42:17.860 |
So I just wanted to point out here that in getting this 0.9770 result, we have not used any fine-tuning at all. 01:42:30.060 |
All we've done is the following (there's more code than it looks like it needs, just because we've done some saving and loading along the way). 01:42:38.440 |
We grabbed our batches, just to grab the data. 01:42:50.620 |
We took the NumPy arrays and ran model.predict on them. 01:42:59.320 |
We grabbed our labels and we one-hot encoded them. 01:43:06.700 |
And then finally we took the one-hot encoded labels and the thousand probabilities and 01:43:12.580 |
we fed them to a linear model with 1000 inputs and 2 outputs. 01:43:23.140 |
And then we trained it and we ended up with a validation accuracy of 0.977. 01:43:29.460 |
So what we're really doing here is we're digging right deep into the details. 01:43:36.860 |
We know exactly how the layers are being calculated, and we know exactly what Keras is doing behind 01:43:44.620 |
So we started way up high with something that was totally obscure as to what was going on. 01:43:49.180 |
We were just using it like you might use Excel, and we've gone all the way down to see exactly 01:43:53.180 |
what's going on, and we've got a pretty good result. 01:44:02.100 |
The last thing we're going to do is take this and turn it into a fine-tuning model to get 01:44:10.620 |
In order to understand fine-tuning, we're going to have to understand one more piece 01:44:16.740 |
And this is activation functions, this is our last major piece. 01:44:24.300 |
In this view of a deep learning model, we went matrix-multiply, matrix-multiply, matrix-multiply. 01:44:36.580 |
Who wants to tell me how you can simplify a matrix-multiply on top of a matrix-multiply? 01:44:47.340 |
A linear model and a linear model and a linear model is itself a linear model. 01:44:53.100 |
So in fact, this whole thing could be turned into a single matrix-multiply because it's 01:44:58.860 |
just doing linear on top of linear on top of linear. 01:45:02.080 |
So this clearly cannot be what deep learning is really doing because deep learning is doing 01:45:12.380 |
What deep learning is actually doing is at every one of these points where it says activations, 01:45:18.020 |
with deep learning we do one more thing, which is we put each of these activations through a non-linear activation function. 01:45:27.580 |
There are various things that people use; sometimes people use tanh, sometimes people 01:45:32.380 |
use sigmoid, but most commonly nowadays people use max(0,x), which is called ReLU, or rectified linear. 01:45:43.940 |
When you see rectified linear activation function, people actually mean max(0,x). 01:45:52.780 |
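(A quick sketch of why that non-linearity matters: two stacked linear layers collapse into one linear layer, but putting max(0,x) between them does not.)

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)    # "rectified linear": max(0, x), element-wise

x  = np.random.randn(5)
W1 = np.random.randn(5, 4)
W2 = np.random.randn(4, 3)

linear_only = (x @ W1) @ W2          # same as x @ (W1 @ W2): still just linear
with_relu   = relu(x @ W1) @ W2      # a genuinely non-linear function of x
```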
So if we took this Excel spreadsheet and added =MAX(0, x) at each of the activation cells, 01:46:24.740 |
so we replace each activation with this and do that at each layer, we would now have a real neural network. 01:46:36.580 |
Interestingly, it turns out that this kind of neural network is capable of approximating 01:46:43.020 |
any given function, of arbitrary complexity. 01:46:48.260 |
In the lesson you'll see that there is a link to a fantastic tutorial by Michael Nielsen 01:47:00.460 |
And what he does is he shows you how with exactly this kind of approach where you put 01:47:05.060 |
functions on top of functions, you can actually drag them up and down to see how you can change the shape of the function. 01:47:15.660 |
And he gradually builds up so that once you have a function of a function of a function 01:47:20.540 |
of this type, he shows you how you can gradually create arbitrarily complex shapes. 01:47:27.820 |
So using this incredibly simple approach where you have a matrix multiplication followed 01:47:34.180 |
by a rectified linear, which is max(0,x) and stick that on top of each other, on top of 01:47:40.100 |
each other, that's actually what's going on in a deep learning neural net. 01:47:45.860 |
And so you will see that in all of the deep neural networks we have created so far, we 01:47:52.940 |
have always had this extra parameter activation equals something. 01:47:58.620 |
And generally you'll see activation='relu'. 01:48:03.620 |
It's saying after you do the matrix product, do a max(0,x). 01:48:10.500 |
So what we need to do is we need to take our final layer, which has both a matrix multiplication 01:48:18.220 |
and an activation function, and what we're going to do is we're going to remove it. 01:48:25.140 |
So I'll show you why, if we look at our model, our VGG model, let's take a look at it. 01:48:44.540 |
And let's see what does the end of it look like. 01:48:55.540 |
The very last layer is a dense layer, that is, a linear layer. 01:49:03.220 |
It seems weird therefore that in that previous section where we added an extra dense layer, 01:49:09.500 |
why would we add a dense layer on top of a dense layer given that this dense layer has 01:49:14.820 |
been tuned to find the 1000 image net categories? 01:49:19.580 |
Why would we want to take that and add on top of it something that's tuned to find cats and dogs? 01:49:24.140 |
How about we remove this layer and instead use the previous dense layer, with its 4096 activations, to find our cats and dogs? 01:49:39.020 |
So to do that, it's as simple as saying model.pop, that will remove the very last layer, and 01:49:47.220 |
then we can go model.add and add in our new linear layer with two outputs, cat and dog. 01:49:58.040 |
So when we said vgg.finetune earlier, this is what it was actually doing; we can have a look at the source of vgg.finetune. 01:50:14.460 |
Here is the source code: model.pop, then model.add of a dense layer with the correct number of classes, 01:50:24.980 |
and with the input argument set to... interesting, that's actually incorrect, I think. 01:50:33.900 |
I will fix that little part later. 01:50:37.060 |
So it's basically doing a model.pop and then model.add dense. 01:50:44.740 |
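(A sketch of that pop-and-replace step, roughly what vgg.finetune does; `model` here is assumed to be the Keras Sequential inside the Vgg16 object, and I believe the original helper also freezes the earlier layers so only the new one trains, which is included below.)

```python
from tensorflow.keras.layers import Dense

model.pop()                                   # drop the 1000-way ImageNet output layer
for layer in model.layers:
    layer.trainable = False                   # leave the pre-trained layers alone
model.add(Dense(2, activation='softmax'))     # new 2-way cats-vs-dogs output layer

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy', metrics=['accuracy'])
```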
So once we've done that, we will now have a new model which is designed to calculate 01:50:51.780 |
cats versus dogs, rather than designed to calculate ImageNet categories and then calculate cats versus dogs from those. 01:50:59.160 |
And so when we use that approach, everything else is exactly the same. 01:51:05.300 |
We then compile it, giving it an optimizer, and then we can call model.fit. 01:51:15.460 |
Anything where we want to use batches, by the way, we have to use in Keras something_generator. 01:51:21.020 |
This is fit_generator because we're passing in batches. 01:51:24.740 |
And if we run it for 2 epochs, you can see we get 97.35. 01:51:31.580 |
If we run it for a little bit longer, eventually we will get something quite a bit better than 01:51:35.940 |
our previous linear model on top of image net approach. 01:51:40.020 |
In fact we know we can, we got 98.3 when we looked at this fine-tuning earlier. 01:51:46.420 |
So that's the only difference between fine-tuning and adding an additional linear layer. 01:51:59.740 |
Of course, once I've calculated it, I would then go ahead and save the weights so that we can load them again later. 01:52:05.660 |
And so from here on in, you'll often find that after I create my fine-tuned model, I 01:52:09.860 |
will often go model.load_weights('fine_tune_1.h5'), because this is now something that we can use 01:52:16.620 |
as a pretty good starting point for all of our future dogs and cats models. 01:52:23.640 |
I think that's about everything that I wanted to show you for now. 01:52:28.580 |
Anybody who is interested in going further during the week, there is one more section 01:52:32.140 |
here in this lesson which shows you how you can train more than just the last layer. 01:52:39.420 |
So during this week, the assignment is really very similar to last week's assignment, but 01:52:46.620 |
now that you actually know what's going on with fine-tuning and linear layers, there are a few things to try. 01:52:53.900 |
One is, for those of you who haven't yet entered the cats and dogs competition, get your entry 01:52:59.700 |
And then have a think about everything you know about the evaluation function, the categorical 01:53:05.700 |
cross-entropy loss function, fine-tuning, and see if you can find ways to make your model 01:53:11.940 |
better and see how high up the leaderboard you can get using this information. 01:53:16.740 |
Maybe you can push yourself a little further, read some of the other forum threads on Kaggle 01:53:21.300 |
and on our forums and see if you can get the best result you can. 01:53:28.180 |
If you want to really push yourself then, see if you can do the same thing by writing 01:53:31.820 |
all of the code yourself, so don't use our fine-tune at all. 01:53:36.100 |
Don't use our notebooks at all; see if you can build it from scratch, just to really make sure you understand how it all works. 01:53:45.020 |
And then of course, if you want to go further, see if you can enter not just the dogs and 01:53:50.540 |
cats competition, but see if you can enter one of the other competitions that we talk 01:53:54.060 |
about on our website, such as Galaxy Zoo or the Plankton competition or the State Farm Distracted Driver competition. 01:54:07.220 |
Well, thanks everybody, I look forward to talking to you all during the week, and hopefully seeing you all next week.