
Lesson 2: Practical Deep Learning for Coders



00:00:00.000 | So one of the things I wanted to talk about and it really came up when I was looking at
00:00:06.960 | the survey responses is what is different about how we're trying to teach this course
00:00:13.240 | and how will it impact you as participants in this course.
00:00:16.960 | And really we're trying to teach this course in a very different way to the way most teaching
00:00:22.760 | is done, or at least most teaching in the United States.
00:00:28.760 | Rachel and I are both very keen fans of this guy called David Perkins who has this wonderful
00:00:34.480 | book called Making Learning Whole, How Seven Principles of Teaching Can Transform Education.
00:00:40.520 | We are trying to put these principles in practice in this course.
00:00:43.560 | I'll give you a little anecdote to give you a sense of how this works.
00:00:48.680 | It's an anecdote from the book.
00:00:51.200 | If you were to learn baseball, if you were to learn baseball the way that math is taught,
00:00:57.880 | you would first of all learn about the shape of a parabola, and then you would learn about
00:01:02.600 | the material science design behind stitching baseballs and so forth.
00:01:07.000 | And 20 years later after you had completed your PhD in postdoc, you would be taken to
00:01:10.720 | your first baseball game and you would be introduced to the rules of baseball, and then
00:01:15.080 | 10 years later you might get to play.
00:01:19.400 | The way that in practice baseball is taught is we take a kid down to the baseball diamond
00:01:24.880 | and we say these people are playing baseball.
00:01:27.560 | Would you like to play it?
00:01:29.440 | And they say, "Yeah, sure I would."
00:01:30.880 | You say, "Okay, stand here, I'm going to throw this, hit it."
00:01:35.520 | Okay, great.
00:01:36.880 | Now run.
00:01:37.880 | Good.
00:01:38.880 | You're playing baseball.
00:01:39.880 | So that's why we started our first class with here are 7 lines of code you can run to do
00:01:45.240 | deep learning.
00:01:46.240 | Not just to do deep learning, but to do image classification on any data set as long as
00:01:51.640 | you structure it in the right way.
00:01:55.400 | So this means you will very often be in a situation, and we've heard a lot of your questions
00:01:59.800 | about this during the week, of gosh there's a whole lot of details I don't understand.
00:02:05.080 | Like this fine-tuning thing, what is fine-tuning?
00:02:09.080 | And the answer is we haven't told you yet.
00:02:11.920 | It's a thing you do in order to do effective image classification with deep learning.
00:02:18.640 | We're going to start at the top and gradually work our way down and down and down.
00:02:22.480 | The reason that you are going to want to learn the additional levels of detail is so that
00:02:27.880 | when you get to the point where you want to do something that no one's done before, you'll
00:02:33.560 | know how to go into that detail and create something that does what you want.
00:02:38.720 | So we're going to keep going down a level and down a level and down a level and down
00:02:42.600 | a level, but through the hierarchy of software libraries, through the hierarchy of the way
00:02:48.280 | computers work, through the hierarchy of the algorithms and the math.
00:02:53.440 | But only at the speed that's necessary to get to the next level of let's make a better
00:02:58.900 | model or let's make a model that can do something we couldn't do before.
00:03:02.960 | Those will always be our goals.
00:03:05.880 | So it's very different to, I don't know if anybody has been reading the Yoshua Bengio
00:03:10.280 | and Ian Goodfellow deep learning book, which is a great mathematical deep learning book,
00:03:15.280 | but it literally starts with 5 chapters of everything you need to know about probability,
00:03:19.480 | everything you need to know about calculus, everything you need to know about linear algebra,
00:03:22.720 | everything you need to know about optimization and so forth.
00:03:25.520 | And in fact, I don't know that in the whole book there's ever actually a point where it
00:03:32.200 | says here is how you do deep learning, even if you read the whole thing.
00:03:36.360 | I've read 2/3 of it before, it's a really good math book.
00:03:42.560 | And anybody who's interested in understanding the math of deep learning I would strongly
00:03:45.840 | recommend but it's kind of the opposite of how we're teaching this course.
00:03:50.340 | So if you often find yourself thinking, "I don't really know what's going on," that's
00:03:55.800 | fine.
00:03:56.800 | But I also want you to always be thinking about, "Well how can I figure out a bit more
00:04:02.000 | about what's going on?"
00:04:03.760 | So we're trying to let you experiment.
00:04:06.400 | So generally speaking, the assignments during the week are trying to give you enough room
00:04:12.160 | to find a way to dig into what you've learned and learn a little bit more.
00:04:17.120 | Make sure you can do what you've seen and also that you can learn a little bit more
00:04:20.680 | about it.
00:04:22.400 | So you are all coders, and therefore you are all expected to look at that first notebook
00:04:27.920 | and look at what are the inputs to every one of those cells?
00:04:30.800 | What are the outputs from every one of those cells?
00:04:33.280 | How is it that the output of this cell can be used as the input of that cell?
00:04:36.480 | Why is this transformation going on?
00:04:39.140 | This is why we did not tell you how to use the Kaggle CLI,
00:04:43.420 | or how to prepare a submission in the correct format.
00:04:46.880 | Because we wanted you to see if you can figure it out and also to leverage the community
00:04:53.400 | that we have to ask questions when you're stuck.
00:04:59.520 | Being stuck and failing is terrific because it means you have found some limit of your
00:05:04.560 | knowledge or your current expertise.
00:05:07.320 | You can then think really hard, read lots of documentation, and ask the rest of the community
00:05:14.480 | until you are no longer stuck, at which point you now know something that you didn't know
00:05:19.440 | before.
00:05:20.440 | So that's the goal.
00:05:22.440 | Asking for help is a key part of this, and so there is a whole wiki page called How to
00:05:26.080 | Ask for Help.
00:05:27.960 | It's really important, and so far I would say about half the times I've seen people ask
00:05:33.040 | for help, there is not enough information for your colleagues to actually help you effectively.
00:05:38.640 | So when people point you at this page, it's not because they're trying to be a pain, it's
00:05:42.880 | because they're saying, "I want to help you, but you haven't given me enough information."
00:05:47.200 | So in particular, what have you tried so far?
00:05:50.440 | What did you expect to happen?
00:05:51.800 | What actually happened?
00:05:53.280 | What do you think might be going wrong?
00:05:54.720 | What have you tried to test this out?
00:05:56.980 | And tell us everything you can about your computer and your software.
00:06:00.360 | Yes, Rachel?
00:06:07.360 | Where you've looked so far?
00:06:13.120 | Show us screenshots, error messages, show us your code.
00:06:16.360 | So the better you get at asking for help, the more enjoyable experience you're going
00:06:23.560 | to have because continually you'll find your problems will be solved very quickly and you
00:06:28.360 | can move on.
00:06:30.480 | There was a terrific recommendation from the head of Google Brain, Vincent Vanhoucke, on
00:06:39.760 | a Reddit AMA a few weeks ago where he said he tells everybody in his team, "If you're
00:06:45.000 | stuck, work at it yourself for half an hour.
00:06:49.540 | You have to work at it yourself for half an hour.
00:06:51.600 | If you're still stuck, you have to ask for help from somebody else."
00:06:55.040 | The idea being that you are always making sure that you try everything you can, but
00:07:00.240 | you're also never wasting your time when somebody else can help you.
00:07:03.760 | I think that's a really good suggestion.
00:07:05.600 | So maybe you can think about this half an hour rule yourself.
00:07:10.400 | I wanted to highlight a great example of a really successful how to ask for help.
00:07:15.400 | Who asked this particular question?
00:07:17.400 | This is really well done.
00:07:18.560 | So that was really nice.
00:07:20.600 | What's your background before being here at this class?
00:07:26.000 | You could introduce yourself real quick, please.
00:07:33.000 | Hey, I actually graduated from USF two years ago with a Master's degree.
00:07:41.640 | So that's why I came back to this class.
00:07:52.560 | Well, hopefully you've heard some of these fantastic approaches to asking for help.
00:07:53.560 | You can see here that he explained what he was trying to do, what happened when he tried it, what
00:07:59.960 | error message he got.
00:08:00.960 | We've got a screenshot showing what he typed and what came back.
00:08:03.960 | He showed us what resources he's currently used, what these resources say, and so forth.
00:08:09.560 | Did you get your question answered?
00:08:13.880 | Okay, great.
00:08:14.880 | Thanks very much.
00:08:15.880 | Sorry.
00:08:16.880 | Good for you.
00:08:17.880 | Thank you for coming in.
00:08:18.880 | I was so happy when I saw this question because it's just so clear.
00:08:19.880 | I was like, this is easy to answer because it's a well-asked question.
00:08:31.240 | So as you might have noticed, the wiki is rapidly filling out with lots of great information.
00:08:35.440 | So please start exploring it.
00:08:38.800 | You'll see on the left-hand side there is a recent changes section.
00:08:42.800 | You can see every day, lots of people have been contributing to lots of things, so it's
00:08:47.560 | continually improving.
00:08:52.000 | There's some great diagnostic sections.
00:08:54.520 | If you are trying to diagnose something which is not covered and you solve it, please add
00:08:59.640 | your solution to these diagnostic sections.
00:09:06.320 | One of the things I loved seeing today was from Tom. Where's Tom?
00:09:16.800 | Maybe he's remote.
00:09:17.800 | Actually I think he was remote, I think he joined remotely yesterday.
00:09:20.240 | So he was asking a question about how fine-tuning works, and we talked a bit about the answers,
00:09:28.740 | and then he went ahead and created a very small little wiki page.
00:09:32.800 | There's not much information there, but there's more than there used to be.
00:09:35.720 | And this is exactly what we want.
00:09:37.800 | And you can even see in the places where he wasn't quite sure, he put some question marks.
00:09:42.480 | So now somebody else can go back, edit his wiki page, and Tom's going to come back tomorrow
00:09:47.080 | and say "Oh, now I've got even more questions answered."
00:09:50.720 | So this is the kind of approach where you're going to learn a lot.
00:09:55.760 | We've already spoken to Melissa, so this is good.
00:09:59.520 | This is another great example of something which I think is very helpful, which is Melissa,
00:10:04.120 | who we heard from earlier, went ahead and told us all, "Here are my understanding of
00:10:08.760 | the 17 steps necessary to complete the things that we were asked to do this week."
00:10:13.560 | So this is great not only for Melissa to make sure she understands it correctly, but then
00:10:18.780 | everybody else can say "Oh, that's a really handy resource that we can draw on as well."
00:10:27.760 | There are 718 messages in Slack in a single channel.
00:10:32.280 | That's way too much for you to expect to use this as a learning resource, so this is kind
00:10:39.080 | of my suggestion as to where you might want to be careful of how you use Slack.
00:10:47.320 | So I wanted to spend maybe quite a lot of time, as you can see, talking about the resources
00:10:52.480 | that are available.
00:10:53.480 | I feel like if we get that sorted out now, then we're all going to speed along a lot
00:10:58.440 | more quickly.
00:10:59.440 | Thanks for your patience as we talk about some non-deep learning stuff.
00:11:03.760 | We expect the vast majority of learning to happen outside of class, and in fact if we
00:11:14.240 | go back and finish off our survey, I know that one of the questions asked about that.
00:11:25.800 | How much time are you prepared to commit most weeks to this class?
00:11:30.400 | And the majority are 8-15, some are 15-30, and a small number are less than 8.
00:11:38.040 | Now if you're in the less than 8 group, I understand that's not something you can probably
00:11:42.120 | change.
00:11:43.120 | If you had more time, you'd put in more time.
00:11:46.280 | So if you're in the less than 8 group, I guess just think about how you want to prioritize
00:11:52.400 | what you're getting out of this course, and be aware it's not really designed that you're
00:11:56.040 | going to be able to do everything in less than 8 hours a week.
00:12:01.360 | So maybe make more use of the forums and the wiki and kind of focus your assignments during
00:12:08.600 | the week on the stuff that you're most interested in.
00:12:11.120 | And don't worry too much if you don't feel like you're getting everything, because you
00:12:15.800 | have less time available.
00:12:16.800 | For those of you in the 15-30 group, I really hope that you'll find that you're getting
00:12:21.920 | a huge amount of that time that you're putting in.
00:12:27.560 | Something I'm really glad I asked, because I found this very helpful, was how much was
00:12:31.560 | new to you?
00:12:32.720 | And for half of you, the answer is most of it.
00:12:37.020 | And for well over half of you, most of it or nearly all of it from Lesson 1 is new.
00:12:43.120 | So if you're one of the many people I've spoken to during the week who are saying "holy shit,
00:12:48.400 | that was a fire hose of information, I feel kind of overwhelmed, but kind of excited",
00:12:55.640 | you are amongst friends.
00:12:58.600 | Remember during the week, there are about 100 of you going through this same journey.
00:13:03.840 | So if you want to catch up with some people during the week and have a coffee to talk
00:13:08.120 | more about the class, or join a study group here at USF, or if you're from the South Bay,
00:13:13.400 | find some people from the South Bay, I would strongly suggest doing that.
00:13:18.200 | So for example, if you're in Menlo Park, you could create a Menlo Park Slack channel and
00:13:24.020 | put out a message saying "Hey, anybody else in Menlo Park available on Wednesday night,
00:13:28.680 | I'd love to get together and maybe do some pair programming."
00:13:35.440 | For some of you, not very much of it was new.
00:13:39.240 | And so for those of you, I do want to make sure that you feel comfortable pushing ahead,
00:13:46.280 | trying out your own projects and so forth.
00:13:49.740 | Basically in the last lesson, what we learned was a pretty standard data science computing
00:13:55.960 | stack.
00:13:56.960 | So AWS, Jupyter Notebook, bit of NumPy, Bash, this is all stuff that regardless of what
00:14:07.840 | kind of data science you do, you're going to be seeing a lot more of if you stick in
00:14:12.840 | this area.
00:14:13.840 | They're all very, very useful things, and those of you who have maybe spent some time
00:14:18.960 | in this field, you'll have seen most of it before.
00:14:22.520 | So that's to be expected.
00:14:28.320 | So hopefully that is some useful background.
00:14:36.920 | So last week we were really looking at the basic foundations, computing foundations necessary
00:14:46.160 | for data science more generally, and for deep learning more particularly.
00:14:52.640 | This week we're going to do something very similar, but we're going to be looking at
00:14:55.400 | the key algorithmic pieces.
00:14:58.120 | So in particular, we're going to go back and say "Hey, what did we actually do last week?
00:15:04.400 | And why did that work?
00:15:06.200 | And how did that work?"
00:15:08.240 | For those of you who don't have much algorithmic background around machine learning, this is
00:15:13.200 | going to be the same fire hose of information as last week was for those of you who don't
00:15:18.080 | have so much software and Bash and AWS background.
00:15:23.000 | So again, if there's a lot of information, don't worry, this is being recorded.
00:15:28.600 | There are all the resources during the week.
00:15:31.520 | And so the key thing is to come away with an understanding of what are the pieces being
00:15:35.840 | discussed.
00:15:37.240 | Why are those pieces important?
00:15:39.400 | What are they kind of doing, even if you don't understand the details?
00:15:42.960 | So if at any point you're thinking "Okay, Jeremy's talking about activation functions,
00:15:48.000 | I have no idea what he just said about what an activation function is, or why I should
00:15:52.400 | care, please go on to the in-class Slack channel and probably @ Rachel: "Rachel, I don't know
00:16:01.840 | what Jeremy's talking about at all", and then Rachel's got a microphone and she can let
00:16:05.400 | me know, or else put up your hand and I will give you the microphone and you can ask.
00:16:10.560 | So I do want to make sure you guys feel very comfortable asking questions.
00:16:14.640 | I have done this class now once before because I did it for the Skype students last night.
00:16:20.640 | So I've heard a few of the questions already, so hopefully I can cover some things that
00:16:25.040 | are likely to come up.
00:16:28.680 | Before we look at these kind of digging into what's going on, the first thing we're going
00:16:34.120 | to do is see how do we do the basic homework assignment from last week.
00:16:39.960 | So the basic homework assignment from last week was "Can you enter the Kaggle Dogs and
00:16:45.320 | Cats Redux Competition?"
00:16:47.920 | So how many of you managed to submit something to that competition and get some kind of result?
00:16:53.720 | Okay, that's not bad, so maybe a third.
00:16:57.520 | So for those of you who haven't yet, keep trying during this week and use all of those
00:17:02.280 | resources I showed you to help you because now quite a few of your colleagues have done
00:17:06.300 | it successfully and therefore we can all help you.
00:17:09.920 | And I will show you how I did it.
00:17:15.560 | Here is Redux.
00:17:21.960 | So the basic idea here is we had to download the data to a directory.
00:17:37.360 | So to do that, I just typed "kg download" after using the "kg config" command.
00:17:45.640 | Kg is part of the Kaggle CLI thing, and Kaggle CLI can be installed by typing "pip install
00:17:55.840 | kaggle-cli".
00:17:58.160 | This works fine without any changes if you're using our AWS instances and setup scripts.
00:18:06.880 | In fact it works fine if you're using Anaconda pretty much anywhere.
00:18:11.680 | If you're not doing either of those two things, you may have found this step more challenging.
00:18:17.220 | But once it's installed, it's as simple as saying "kg config" with your username, password
00:18:21.560 | and competition name.
00:18:24.420 | When you put in the competition name, you can find that out by just going to the Kaggle
00:18:30.040 | website and you'll see that when you go to the competition in the URL, it has here a
00:18:38.120 | name.
00:18:39.120 | Just copy and paste that; that's the competition name.
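As a rough sketch, the whole download step in a notebook cell looks something like this (the username, password and competition name are placeholders, and flag names may vary slightly between kaggle-cli versions):

    !pip install kaggle-cli
    !kg config -u your_username -p your_password -c dogs-vs-cats-redux-kernels-edition
    !kg download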
00:18:43.880 | Kaggle CLI is a script that somebody created in their spare time and didn't spend a lot
00:18:50.040 | of time on it.
00:18:51.040 | There's no error handling, there's no checking, there's nothing.
00:18:53.720 | So for example, if you haven't gone to Kaggle and accepted the competition rules, then attempting
00:18:59.260 | to run Kg download will not give you an error.
00:19:02.680 | It will create a zip file that actually contains the contents of the Kaggle webpage saying
00:19:06.840 | please accept the competition rules.
00:19:09.080 | So those of you that tried to unzip that and that said it's not a zip file, if you go ahead
00:19:13.440 | and cat that, you'll see it's not a zip file, it's an HTML file.
00:19:18.360 | This is pretty common with recent-ish data science tools and particularly with cutting-edge
00:19:25.320 | deep learning stuff.
00:19:26.840 | A lot of it's pretty new, it's pretty rough, and you really have to expect to do a lot
00:19:31.400 | of debugging.
00:19:33.680 | It's very different to using Excel or Photoshop.
00:19:37.440 | When I said Kg download, I created a test.zip and a train.zip, so I went ahead and I unzipped
00:19:43.320 | both of those things, that created a test and a train, and they contained a whole bunch
00:19:49.400 | of files called cat.one.jpg and so forth.
00:19:54.760 | So the next thing I did to make my life easier was I made a list of what I believed I had
00:20:01.840 | to do.
00:20:04.200 | I find life much easier with a to-do list.
00:20:06.280 | I thought I need to create a validation set, I need to create a sample, I need to move
00:20:11.000 | my cats into a cats directory and dogs into a dogs directory, I then need to run the fine
00:20:15.200 | tune and train, I then need to submit.
00:20:17.800 | So I just went ahead then and created markdown headings for each of those things and started
00:20:22.600 | filling them out.
00:20:25.440 | Create validation set and sample.
00:20:27.360 | A very handy thing in Jupyter, Jupyter Notebook, is that you can create a cell that starts
00:20:32.760 | with a % sign and that allows you to type what they call magic commands.
00:20:36.960 | There are lots of magic commands that do all kinds of useful things, but they do include
00:20:40.980 | things like cd and makedir and so forth.
00:20:45.040 | Another cool thing you can do is you can use an exclamation mark and then type any bash
00:20:50.000 | command.
00:20:52.000 | So the nice thing about doing this stuff in the notebook rather than in bash is you've
00:20:57.000 | got a record of everything you did.
00:20:59.000 | So if you need to go back and do it again, you can.
00:21:01.320 | If you make a mistake, you can go back and figure it out.
00:21:03.840 | So this kind of reproducible research, very highly recommended.
00:21:08.800 | So I try to do everything in a single notebook so I can go back and fix the problems that
00:21:13.480 | I always make.
00:21:14.600 | So here you can see I've gone into the directory, I've created my validation set, I then used
00:21:21.520 | three lines of Python to go ahead and grab all of the JPEG file names, create a random
00:21:31.240 | permutation of them, and so then the first 2000 of that random permutation are 2000 random
00:21:37.160 | files, and then I moved them into my validation directory; that gave me my valid set.
00:21:42.600 | I did exactly the same thing for my sample, but rather than moving them, I copied them.
00:21:50.640 | And then I did that for both my sample training and my sample validation, and that was enough
00:21:57.640 | to create my validation set and sample.
00:22:00.800 | The next thing I had to do was to move all my cats into a cats directory and dogs into
00:22:04.680 | a dogs directory, which was as complex as typing "mv cat.* cats" and "mv dog.* dogs".
00:22:15.760 | And so the cool thing is, now that I've done that, I can then just copy and paste the seven
00:22:21.980 | lines of code from our previous lesson.
00:22:25.960 | So these lines of code are totally unchanged.
00:22:29.480 | I added one more line of code which was save weights.
00:22:33.520 | Once you've trained something, it's a great idea to save the weights so you don't have
00:22:36.840 | to train it again, you can always go back later and say load weights.
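Roughly, that cell looks like this; the names follow the course's vgg16 helper module, and the paths, batch size and weights file name are just examples:

    from vgg16 import Vgg16

    batch_size = 64
    vgg = Vgg16()
    # grab batches of training and validation images from their directories
    batches = vgg.get_batches('train', batch_size=batch_size)
    val_batches = vgg.get_batches('valid', batch_size=batch_size * 2)
    # swap the 1000-way ImageNet output for a cats-vs-dogs output and train one epoch
    vgg.finetune(batches)
    vgg.fit(batches, val_batches, nb_epoch=1)
    # the one extra line: save the trained weights so we can reload them later
    vgg.model.save_weights('ft1.h5')
    # later on: vgg.model.load_weights('ft1.h5')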
00:22:42.800 | So I now had a model which predicted cats and dogs through my Redux competition.
00:22:49.040 | My final step was to submit it to Kaggle.
00:22:52.740 | So Kaggle tells us exactly what they expect, and the way they do that is by showing us
00:22:58.600 | a sample of the submission file.
00:23:02.880 | And basically the sample shows us that they expect an ID column and a label column.
00:23:10.820 | The ID is the file number, so if you have a look at the test set, you'll see everyone's
00:23:24.280 | got a number.
00:23:25.280 | So it's expecting to get the number of the file along with your probability.
00:23:38.280 | So you have to figure out how to take your model and create something of that form.
00:23:46.280 | This is clearly something that you're going to be doing a lot.
00:23:48.720 | So once I figured out how to do it, I actually created a method to do it in one step.
00:23:53.080 | So I'm going to go and show you the method that I wrote.
00:24:12.000 | So I just added this utils module that I kind of chucked everything in.
00:24:15.480 | Actually that's not true, I'll put it in my VGG module because I added it to the VGG class.
00:24:23.040 | So there's a few ways you could possibly do this.
00:24:25.440 | Basically you know that you've got a way of grabbing a mini-batch of data at a time, or
00:24:29.520 | a mini-batch of predictions at a time.
00:24:31.760 | So one thing you could do would be to grab your mini-batch size 64, you could grab your
00:24:36.360 | 64 predictions and just keep appending them 64 at a time to an array until eventually
00:24:43.240 | you have your 12,500 test images all with a prediction in an array.
00:24:50.280 | That is actually a perfectly valid way to do it.
00:24:52.240 | How many people solved it using that kind of approach?
00:24:55.560 | Not many of you, that's interesting, but it works perfectly well.
00:25:01.160 | Those of you who didn't, I guess either asked on the forum or read the documentation and
00:25:05.840 | discovered that there's a very handy thing in Keras called Predict Generator.
00:25:12.600 | And what Predict Generator does is it lets you send it in a bunch of batches, so something
00:25:18.640 | that we created with get_batches, and it will run the predictions on every one of those
00:25:22.840 | batches and return them all in a single array.
00:25:26.400 | So that's what we wanted to do.
00:25:29.640 | If you read the Keras documentation, which you should do very often, you will find out
00:25:35.040 | that Predict Generator generally will give you the labels.
00:25:41.120 | So not the probabilities, but the labels, so cat1, dog0, something like that.
00:25:46.000 | In this case, for this competition, they told us they want probabilities, not labels.
00:25:53.320 | So instead of calling the get_batches, which we wrote, here is the get_batches that we
00:25:59.680 | wrote, you can see all it's doing is calling something else, which is flow from directory.
00:26:07.320 | To get Predict Generator to give you probabilities instead of classes, you have to pass in an
00:26:17.920 | extra argument, which is plus mode equals, and rather than categorical, you have to say
00:26:22.600 | none.
00:26:24.320 | So in my case, when I went ahead and actually modified get_batches to take an extra argument,
00:26:29.400 | which was plus mode, and then in my test method I created, I then added plus mode equals none.
00:26:37.200 | So then I could call model.PredictGenerator, passing in my batches, and that is going to
00:26:46.440 | give me everything I need.
00:26:48.920 | So I will show you what that looks like.
00:26:51.880 | So once I do, I basically say vgg.test, this is the thing I created, pass in my test directory,
00:26:57.880 | pass in my batch size, that returns two things, it returns the predictions, and it returns
00:27:02.400 | the batches.
00:27:03.960 | I can then use batches.filenames to grab the filenames, because I need the filenames in
00:27:09.540 | order to grab the IDs.
00:27:13.000 | And so that looks like this, let's take a look at them, so there's a few predictions,
00:27:26.040 | and let's look at a few filenames.
00:27:32.240 | Now one thing interesting is that at least for the first five, the probabilities are
00:27:37.000 | all 1's and 0's, rather than 0.6, 0.8, and so forth.
00:27:40.600 | We're going to talk about why that is in just a moment.
00:27:43.480 | For now, it is what it is.
00:27:45.480 | It's not doing anything wrong, it really thinks that that is the answer.
00:27:50.000 | So all we need to do is grab, because Kaggle wants something which is is_dog, we just need
00:27:55.600 | to grab the second column of this, and the numbers from this, place them together as
00:28:00.080 | columns, and send them across.
00:28:02.840 | So here is grabbing the first column from the predictions, and I call it is_dog.
00:28:11.680 | Here is grabbing from the 8th character until the dot in filenames, turning that into an
00:28:18.000 | integer, get my IDs.
00:28:20.760 | NumPy has something called stack, which lets you put two columns next to each other, and
00:28:25.040 | so here is my IDs and my probabilities.
00:28:29.280 | And then NumPy lets you save that as a CSV file using savetxt.
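Putting those pieces together, the submission cell looks something like this (the column index, the character offset of 8 and the file names are the ones that happen to match this directory layout; check yours):

    import numpy as np

    batches, preds = vgg.test('test', batch_size=batch_size * 2)
    filenames = batches.filenames                 # e.g. 'unknown/1423.jpg'

    isdog = preds[:, 1]                           # the 'is dog' probability column
    ids = np.array([int(f[8:f.find('.')]) for f in filenames])

    subm = np.stack([ids, isdog], axis=1)         # two columns: id, label
    np.savetxt('submission.csv', subm, fmt='%d,%.5f',
               header='id,label', comments='')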
00:28:35.400 | You can now either SSH to your AWS instance and use kg submit, or my preferred technique
00:28:42.440 | is to use a handy little IPython thing called FileLink.
00:28:46.880 | If you type FileLink and then pass in a file that is on your server, it gives you a little
00:28:52.240 | URL like this, which I can click on, and it downloads it to my computer.
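FileLink lives in IPython.display, so the cell is just:

    from IPython.display import FileLink
    FileLink('submission.csv')   # renders a clickable download link in the notebook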
00:28:58.440 | And so now on my computer I can go to Kaggle and I can just submit it in the usual way.
00:29:03.400 | I prefer that because it lets me find out exactly if there's any error messages or anything
00:29:07.480 | going wrong on Kaggle, I can see what's happening.
00:29:11.160 | So as you can see, rerunning what we learned last time to submit something to Kaggle really
00:29:19.280 | just requires a little bit of coding to just create the submission file, a little bit of
00:29:25.160 | bash scripting to move things into the right place, and then rerunning the 7 lines of code,
00:29:29.560 | the actual deep learning itself is incredibly straightforward.
00:29:34.040 | Now here's where it gets interesting.
00:29:37.040 | When I submitted my 1s and 0s to Kaggle, I was put in -- let's have a look at the leaderboard.
00:29:51.320 | The first thing I did was I accidentally put in "iscat" rather than "isdog", and that made
00:29:57.240 | me last place.
00:29:59.080 | So I had 38 was my loss.
00:30:02.240 | Then when I was putting in 1s and 0s, I was in 110th place, which is still not that great.
00:30:08.520 | Now the funny thing was I was pretty confident that my model was doing well because the validation
00:30:13.280 | set for my model told me that my accuracy was 97.5%.
00:30:23.520 | I'm pretty confident that people on Kaggle are not all of them doing better than that.
00:30:29.040 | So I thought something weird is going on.
00:30:31.280 | So that's a good time to figure out what does this number mean?
00:30:34.880 | What is 12?
00:30:35.880 | What is 17?
00:30:38.320 | So let's go and find out.
00:30:40.080 | It says here that it is a log loss, so if we go to Evaluation, we can find out what
00:30:48.160 | log loss is.
00:30:49.860 | And here is the definition of log loss.
00:30:53.480 | Log loss is known in Keras as binary cross-entropy or categorical cross-entropy, and you will actually
00:31:00.860 | find it very familiar because every single time we've been creating a model, we have
00:31:06.680 | been using -- let's go and find out when we compile it.
00:31:21.760 | When we compile a model, we've always been using categorical cross-entropy.
00:31:25.560 | So it's probably a good time for us to find out what the hell this means.
00:31:29.480 | So the short answer is it is this mathematical function.
00:31:36.240 | But let's dig into this a little bit more and find out what's going on.
00:31:40.480 | I would strongly recommend that when you want to understand how something works, you whip
00:31:45.040 | out a spreadsheet.
00:31:46.920 | Spreadsheets are like my favorite tool for doing small-scale data analysis.
00:31:53.000 | They are perhaps the least well-utilized tools among professional data scientists, which
00:31:59.080 | I find really surprising.
00:32:00.880 | Because back when I was in consulting, everybody used them for everything, and they were the
00:32:04.080 | most overused tools.
00:32:06.720 | So what I've done here is I've gone ahead and created a little column of his cats and
00:32:12.440 | his dogs.
00:32:13.440 | So this is the correct answer, and I've created a little column of some possible predictions.
00:32:18.240 | And then I've just gone in and I've typed in the formula from that Kaggle page.
00:32:23.320 | And so here it is.
00:32:24.320 | Basically it's the truth label times the log of the prediction, plus 1 minus the truth label
00:32:33.720 | times the log of 1 minus the prediction.
00:32:38.200 | Now if you think about it, the truth label is always 1 or 0.
00:32:42.020 | So this is actually probably more easily understood using an if function.
00:32:48.000 | It's exactly the same thing.
00:32:49.000 | Rather than multiplying by 1 and 0, let's just use the if function.
00:32:51.840 | Because if it's a cat, then take log of the prediction, otherwise take log of 1 minus
00:32:58.480 | the prediction.
00:32:59.960 | Now this is hopefully pretty intuitive.
00:33:02.280 | If it's a cat and your prediction is really high, then we're taking the log of that and
00:33:08.040 | getting a small number.
00:33:09.560 | If it's not a cat and then our prediction is really low, then we want to take the log
00:33:14.940 | of 1 minus that.
00:33:16.620 | And so you can get a sense of it by looking here, here's like a non-cat, which we thought
00:33:23.320 | is a non-cat, and therefore we end up with log of 1 minus that, which is a low number.
00:33:31.640 | Here's a cat, which we're pretty confident isn't a cat, so here is log of that.
00:33:37.280 | Notice this all has a negative sign at the front, just to make it so that smaller
00:33:41.360 | numbers are better.
00:33:43.480 | So this is log loss, or binary, or categorical cross-entropy.
00:33:51.000 | And this is where we find out what's going on.
00:33:53.200 | Because I'm now going to go and try and say, well, what did I submit?
00:33:58.120 | And I've submitted predictions that were all 1s and 0s.
00:34:01.320 | So what if I submit 1s and 0s?
00:34:04.080 | Ouch.
00:34:05.080 | Okay, why is that happening?
00:34:08.760 | Because we're taking logs of 1s and 0s.
00:34:13.320 | That's no good.
00:34:14.320 | So actually, Kaggle has been pretty nice not to return just an error.
00:34:19.640 | And I actually know why this happens because I wrote this functionality on Kaggle.
00:34:23.960 | Kaggle modifies it by a tiny 0.0001, just to make sure it doesn't die.
00:34:30.400 | So if you say 1, it actually treats it as 0.9999, if you say 0 it treats it as 0.0001.
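As a tiny sketch, here is the same calculation in Python rather than a spreadsheet, with a clip like the one just described (an illustration, not Kaggle's actual code):

    import numpy as np

    def log_loss(y, p, eps=0.0001):
        # clamp predictions away from exactly 0 and 1, as Kaggle does
        p = np.clip(p, eps, 1 - eps)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    log_loss(np.array([1.0]), np.array([0.0]))    # ~9.2: a confident wrong answer hurts a lot
    log_loss(np.array([1.0]), np.array([0.95]))   # ~0.05: a hedged right answer is cheap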
00:34:36.720 | So our incredibly overconfident model is getting massively penalized for that overconfidence.
00:34:44.220 | So what would be better to do would be instead of sending across 1s and 0s, why not send
00:34:48.960 | across actual probabilities you think are reasonable?
00:34:53.320 | So in my case, what I did was I added a line which was, I said numpy.clip, my first column
00:35:20.320 | of my predictions, and clip it to 0.05 and 0.95.
00:35:23.600 | So anything less than 0.05 becomes 0.05 and anything greater than 0.95 becomes 0.95.
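In code that is just one extra line before building the submission:

    isdog = np.clip(isdog, 0.05, 0.95)   # tame the overconfident 1s and 0s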
00:35:29.320 | And then I tried submitting that.
00:35:31.080 | And that moved me from 110th place to 40th place.
00:35:34.680 | And suddenly, I was in the top half.
00:35:37.400 | So the goal of this week was really try and get in the top half of this competition.
00:35:42.200 | And that's all you had to do, was run a single epoch, and then realize that with this evaluation
00:35:46.880 | function, you need to be submitting things that aren't 1s and 0s.
00:35:51.360 | Let's take that one offline and talk about it in the forum because I actually need to
00:36:12.320 | think about that properly.
00:36:13.320 | So probably I should have used, and I'll be interested in trying this tomorrow and maybe
00:36:23.680 | in a resubmission, I probably should have done 0.025 and 0.975 because I actually know
00:36:29.760 | that my accuracy on the validation set was 0.975.
00:36:34.600 | So that's probably the probability that I should have used.
00:36:38.320 | I would need to think about it more though to think like, because it's like a nonlinear
00:36:42.520 | loss function, is it better to underestimate how confident you are or overestimate how
00:36:47.800 | confident you are?
00:36:49.120 | So I would need to think about it a little bit.
00:36:51.720 | In the end, I said it's about 97.5, I have a feeling that being overconfident might be
00:36:57.080 | a bad thing because of the shape of the function, so I'll just be a little bit on the tame side.
00:37:05.040 | I then later on tried 0.02 and 0.98, and I did actually get a slightly better answer.
00:37:12.680 | I actually got a little bit better than that.
00:37:15.520 | I think in the end this afternoon I ran a couple more epochs just to see what would
00:37:21.080 | happen, and that got me to 24th.
00:37:27.480 | So I'll show you how you can get to 24th position, and it's incredibly simple.
00:37:33.120 | You take these two lines here, fit and save weights, and copy and paste them a bunch of
00:37:39.880 | times.
00:37:41.720 | You can see I saved the weights under a different file name each time just so that I can always
00:37:46.480 | go back and use a model that I created earlier.
00:37:51.280 | Something we'll talk about more in the class later is this idea that halfway through after
00:37:55.080 | two epochs I changed my learning rate from 0.1 to 0.01 just because I happen to know
00:38:02.240 | this is often a good idea.
00:38:03.560 | I haven't actually tried it without doing that.
00:38:05.760 | I suspect it might be just as good or even better, but that was just something I tried.
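A sketch of that copy-and-paste pattern, reusing the same vgg object as before (how you set the learning rate depends on your Keras version; the backend helper shown here is one way):

    from keras import backend as K

    vgg.fit(batches, val_batches, nb_epoch=1)
    vgg.model.save_weights('ft2.h5')

    # after a couple of epochs, drop the learning rate from 0.1 to 0.01
    K.set_value(vgg.model.optimizer.lr, 0.01)

    vgg.fit(batches, val_batches, nb_epoch=1)
    vgg.model.save_weights('ft3.h5')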
00:38:10.580 | So interestingly, by the time I run four epochs, my accuracy is 98.3%.
00:38:18.160 | That would have been second place in the original Cats and Dogs competition.
00:38:22.480 | So you can see it doesn't take much to get really good results.
00:38:28.280 | And each one of these took, as you can see, 10 minutes to run on my AWS P2 instance.
00:38:51.120 | The original Cats and Dogs used a different evaluation function, which was just accuracy.
00:38:56.480 | So they changed it for the Redux one to use log loss, which makes it a bit more interesting.
00:39:13.240 | The reason I didn't just say nb_epoch=4 is that I really wanted to save the result after
00:39:20.440 | each epoch under a different weights file name just in case at some point it overfit.
00:39:25.280 | I could always go back and use one that I got in the middle.
00:39:37.200 | We're going to learn a lot about that in the next couple of weeks.
00:39:41.780 | In this case, we have added a single linear layer to the end.
00:39:47.080 | We're about to learn a lot about this.
00:39:49.220 | And so we actually are not training very many parameters.
00:39:52.280 | So my guess would be that in this case, we could probably run as many epochs as we like
00:39:56.360 | and it would probably keep getting better and better until it eventually levels off.
00:40:00.240 | That would be my guess.
00:40:05.200 | So I wanted to talk about what are these probabilities.
00:40:10.240 | One way to do that, and also to talk about how can you make this model better, is any
00:40:15.680 | time I build a model and I think about how to make it better, my first step is to draw
00:40:21.320 | a picture.
00:40:23.320 | Let's take that one offline onto the forum because we don't need to cover it today.
00:40:38.240 | Data scientists don't draw enough pictures.
00:40:39.920 | Now when I say draw pictures, I mean everything from printing out the first five lines of
00:40:45.580 | your array to see what it looks like to drawing complex plots.
00:40:50.680 | For a computer vision, you can draw lots of pictures because we're classifying pictures.
00:40:55.680 | I've given you some tips here about what I think are super useful things to visualize.
00:41:01.180 | So when I wanted to find out how come my Kaggle submission is 110th place, I ran my kind of
00:41:08.500 | standard five steps.
00:41:10.760 | The standard five steps are let's look at a few examples of images we got right, let's
00:41:15.960 | look at a few examples of images we got wrong.
00:41:18.940 | Let's look at some of the cats that we felt were the most cat-like, some of the dogs that
00:41:23.300 | we felt were the most dog-like, vice versa.
00:41:26.400 | Some of the cats that we were the most wrong about, some of the dogs we were the most wrong
00:41:30.200 | about, and then finally some of the cats and dogs that our model is the most unsure about.
00:41:37.960 | This little bit of code I suggest you keep around somewhere because this is a super useful
00:41:42.320 | thing to do anytime you do image recognition.
00:41:45.400 | So the first thing I did was I loaded my weights back up just to make sure that they were there
00:41:49.360 | and I took them from my very first epoch, and I used that vgg.test method that I just
00:41:55.040 | showed you that I created.
00:41:56.460 | This time I passed in the validation set, not the test set because the validation set
00:42:00.040 | I know the correct answer.
00:42:03.400 | So then from the batches I could get the correct labels and I could get the file names.
00:42:08.760 | I then grabbed the probabilities and the class predictions, and that then allowed me to do
00:42:14.280 | the 5 things I just mentioned.
00:42:16.800 | So here's number 1, a few correct labels at random.
00:42:20.120 | So numpy.where, the prediction is equal to the label.
00:42:25.720 | Let's then get a random permutation and grab the first 4 and plot them by index.
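Sketched out, that first step looks something like this (plots_idx is a small plotting helper from the course notebook, and the probability column index is an assumption):

    import numpy as np

    # run the test helper on the validation set, where we know the right answers
    val_batches, probs = vgg.test('valid', batch_size=batch_size)
    labels = val_batches.classes                  # the correct labels
    our_predictions = np.round(probs[:, 1])       # the class the model chose

    # 1. a few correct labels at random
    correct = np.where(our_predictions == labels)[0]
    idx = np.random.permutation(correct)[:4]
    plots_idx(idx, probs[idx, 1])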
00:42:32.080 | So here are 4 examples of things that we got right.
00:42:37.320 | And not surprisingly, this cat looks like a cat and this dog looks like a dog.
00:42:42.520 | Here are 4 things we got wrong.
00:42:45.600 | And so that's interesting.
00:42:46.600 | You can kind of see here's a very black underexposed thing on a bright background.
00:42:53.120 | Here is something that is on a totally unusual angle.
00:42:56.720 | And here is something that's so curled up you can't see its face.
00:43:00.240 | And this one you can't see its face either.
00:43:01.920 | So this gives me a sense of like, okay, the things that's getting wrong, it's reasonable
00:43:06.600 | to get those things wrong.
00:43:08.240 | If you looked at this and they were really obvious, cats and dogs, you would think there's
00:43:12.120 | something wrong with your model.
00:43:13.120 | But in this case, no, the things that it's finding hard are genuinely hard.
00:43:19.560 | Here are some cats that we felt very sure were cats.
00:43:22.400 | Here are some dogs we felt very sure were dogs.
00:43:26.520 | So these weights, this one here results ft1.h5, this ft stands for fine-tune, and you can
00:43:42.120 | see here I saved my weights after I did my fine-tuning.
00:43:46.040 | So these are the cats and dogs.
00:43:47.680 | So these I think are the most interesting, which is here are the images we were very
00:44:13.960 | confident were cats, but they're actually dogs.
00:44:17.280 | Here's one that is only 50x60 pixels, that's very difficult.
00:44:23.240 | Here's one that's almost totally in front of a person and is also standing upright.
00:44:28.360 | That's difficult because it's unusual.
00:44:31.320 | This one is very white and is totally from the front, that's quite difficult.
00:44:36.280 | And this one I'm guessing the color of the floor and the color of the fur are nearly
00:44:40.040 | identical.
00:44:41.880 | So again, this makes sense, these do look genuinely difficult.
00:44:46.120 | So if we want to do really well in this competition, we might start to think about should we start
00:44:50.600 | building some models of very very small images because we now know that sometimes Kaggle gives
00:44:55.680 | us 50x50 images, which are going to be very difficult for us to deal with.
00:45:01.080 | Here are some pictures that we were very confident are dogs, but they're actually cats.
00:45:06.160 | Again, not being able to see the face seems like a common problem.
00:45:12.000 | And then finally, here are some examples that we were most uncertain about.
00:45:17.020 | Now notice that the most uncertain are still not very uncertain, like they're still nearly
00:45:21.440 | one or nearly zero.
00:45:23.240 | So why is that?
00:45:24.440 | Well, we will learn in a moment about exactly what is going on from a mathematical point
00:45:29.120 | of view when we calculate these things, but the short answer is the probabilities that
00:45:33.480 | come out of a deep learning network are not probabilities in any statistical sense of
00:45:39.600 | the term.
00:45:40.680 | So this is not actually saying that there is a one in 100,000 chance that
00:45:47.040 | this is a dog.
00:45:48.040 | It's only a probability from the mathematical point of view, and in math the probability
00:45:52.680 | means it's between 0 and 1, and all of the possibilities add up to 1.
00:45:57.920 | It's not a probability in the sense that this is actually something that tells you how often
00:46:02.000 | this is going to be right versus this is going to be wrong.
00:46:04.880 | So for now, just be aware of that.
00:46:06.800 | When we talk about these probabilities that come out of neural network training, you can't
00:46:12.240 | interpret them in any kind of intuitive way.
00:46:16.580 | We will learn about how to create better probabilities down the track.
00:46:22.760 | Every time you do another epoch, your network is going to get more and more confident.
00:46:29.440 | This is why when I loaded the weights, I loaded the weights from the very first epoch.
00:46:34.600 | If I had loaded the weights from the last epoch, they all would have been 1 and 0.
00:46:38.840 | So this is just something to be aware of.
00:46:44.380 | So hopefully you can all go back and get great results on the Kaggle competition.
00:46:49.240 | Even though I'm going to share all this, you will learn a lot more by trying to do it yourself,
00:46:56.400 | and only referring to this when and if you're stuck.
00:46:59.640 | And if you do get stuck, rather than copying and pasting my code, find out what I used
00:47:04.400 | and then go to the Keras documentation and read about it and then try and write that
00:47:08.400 | line of code without looking at mine.
00:47:10.720 | So the more you can do that, the more you'll think, "Okay, I can do this.
00:47:14.400 | I understand how to do this myself."
00:47:17.040 | Just some suggestions, it's entirely up to you.
00:47:24.160 | So let's move on.
00:47:26.960 | So now that we know how to do this, I wanted to show you one other thing, which is the
00:47:31.760 | last part of the homework was redo this on a different dataset.
00:47:38.240 | And so I decided to grab the State Farm Distracted Driver Competition.
00:47:44.440 | The Kaggle State Farm Distracted Driver Competition has pictures of people in 10 different types
00:47:50.600 | of distracted driving, ranging from drinking coffee to changing the radio station.
00:47:58.400 | I wanted to show you how I entered this competition.
00:48:02.420 | It took me a quarter of an hour to enter the competition, and all I did was I duplicated
00:48:09.880 | my Cats and Dogs Redux notebook, and then I started basically rerunning everything.
00:48:18.880 | But in this case, it was even easier because when you download the State Farm Competition
00:48:24.200 | data, they had already put it into directories, one for each type of distracted driving.
00:48:31.080 | So I was delighted to discover, let's go to it, so if I type "tree -d", that shows you
00:48:49.360 | my directory structure, you can see in "train", it already had 10 directories, it actually
00:48:54.920 | didn't have valid, so in "train", it already had the 10 directories.
00:48:58.680 | So I could skip that whole section.
00:49:01.020 | So I only had to create the validation and sample set.
00:49:05.240 | If all I wanted to do was enter the competition, I wouldn't even have had to have done that.
00:49:09.560 | So I won't go through, but it's basically exactly the same code as I had before to create
00:49:14.480 | my validation set and sample.
00:49:16.880 | I deleted all of the bits which moved things into separate subfolders, I then used exactly
00:49:22.440 | the same 7 lines of code as before, and that was basically done.
00:49:27.800 | I'm not getting good accuracy yet, I don't know why, so I'm going to have to figure out
00:49:32.920 | what's going on with this.
00:49:33.920 | But as you can see, this general approach works for any kind of image classification.
00:49:42.360 | There's nothing specific about cats and dogs, so you now have a very general tool in your
00:49:47.880 | toolbox.
00:49:50.400 | And all of the stuff I showed you about visualizing the errors and stuff, you can use all that
00:49:54.040 | as well.
00:49:55.040 | So maybe when you're done, you could try this as well.
00:49:57.080 | Yes, Yannet, can I grab one of these please?
00:50:26.720 | So the question is, would this work for CT scans and cancer?
00:50:32.240 | And I can tell you that the answer is yes, because I've done it.
00:50:34.960 | So my previous company I created was something called Enlitic, which was the first deep learning
00:50:41.300 | for medical diagnostics company.
00:50:43.520 | And the first thing I did with four of my staff was we downloaded the National Lung
00:50:48.240 | Screening Trial data, which is a thousand examples of people with cancer, it's a CT scan of their
00:50:53.680 | lungs and 5,000 examples of people without cancer, CT scans of their lungs.
00:50:58.400 | We did the same thing.
00:50:59.720 | We took ImageNet, we fine-tuned ImageNet, but in this case instead of cats and dogs,
00:51:06.120 | we had malignant tumor versus non-malignant tumor.
00:51:09.480 | We then took the result of that and saw how accurate it was, and we discovered that it
00:51:13.400 | was more accurate than a panel of four of the world's best radiologists.
00:51:17.840 | And that ended up getting covered on TV on CNN.
00:51:22.000 | So making major breakthroughs in domains is not necessarily technically that challenging.
00:51:32.000 | The technical challenges in this case were really about dealing with the fact that CT
00:51:37.100 | scans are pretty big, so we had to just think about some resource issues.
00:51:41.200 | Also they're black and white, so we had to think about how do we change our ImageNet
00:51:44.760 | pre-training to black and white, and stuff like that.
00:51:47.560 | But the basic code was really not much different to what you see here.
00:52:03.000 | The State Farm data is 4GB, and I only downloaded it like half an hour before class started.
00:52:11.760 | So I only ran a small fraction of an epoch just to make sure that it works.
00:52:16.440 | Running a whole epoch probably would have taken overnight.
00:52:24.840 | So let's go back to lesson 1, and there was a little bit at the end that we didn't look at.
00:52:39.200 | Actually before we do, now's a good time for a break.
00:52:42.040 | So let's have a 12 minute break, let's come back at 8pm, and one thing that you may consider
00:52:50.280 | doing during those 12 minutes if you haven't done it already is to fill out the survey.
00:52:54.680 | I will place the survey URL back onto the in class page.
00:53:04.520 | See you in 12 minutes.
00:53:08.520 | Okay thanks everybody.
00:53:11.880 | How many of you have watched this video?
00:53:16.080 | Okay, some of you haven't.
00:53:21.120 | You need to, because as I've mentioned a couple of times in our emails, the last two thirds
00:53:26.160 | of it was actually a surprise lesson 0 of this class, and it's where I teach about what
00:53:32.400 | convolutions are.
00:53:34.400 | So if you haven't watched it, please do.
00:53:50.160 | The first 20 minutes or so is more of a general background, but the rest is a discussion of
00:53:54.640 | exactly what convolutions are.
00:53:56.680 | For now, I'll try not to assume too much that you know what they are, the rest of it hopefully
00:54:02.680 | will be stand-alone anyway.
00:54:06.240 | But I want to talk about fine-tuning, and I want to talk about why we do fine-tuning.
00:54:24.040 | Why do we start with an image network and then fine-tune it rather than just train our
00:54:32.400 | own network?
00:54:34.240 | And the reason why is that an image network has learned a hell of a lot of stuff about
00:54:40.600 | what the world looks like.
00:54:43.720 | A guy called Matt Zeiler wrote this fantastic paper a few years ago in which he showed us
00:54:51.120 | what these networks learn.
00:54:53.360 | And in fact, the year after he wrote this paper, he went on to win ImageNet.
00:54:57.780 | So this is a powerful example of why spending time thinking about visualizations is so helpful.
00:55:04.140 | By spending time thinking about visualizing networks, he then realized what was wrong
00:55:07.960 | with the networks at the time, made them better and won the next year's ImageNet.
00:55:14.120 | We're not going to talk about that, we're going to talk about some of these pictures
00:55:16.320 | here.
00:55:19.040 | Here are 9 examples of what the very first layer of an ImageNet convolutional neural
00:55:27.600 | network looks like, what the filters look like.
00:55:33.360 | And you can see here that, for example, here is a filter that learns to find a diagonal
00:55:43.360 | edge or a diagonal line.
00:55:52.440 | So you can see it's saying look for something where there's no pixels and then there's bright
00:55:58.340 | pixels and then there's no pixels, so that's finding a diagonal line.
00:56:02.200 | Here's something that finds a diagonal line in the up direction.
00:56:04.880 | Here's something that finds a gradient horizontal from orange to blue.
00:56:10.600 | Here's one diagonal from orange to blue.
00:56:12.940 | As I said, these are just 9 of these filters in layer 1 of this ImageNet trained network.
00:56:24.240 | So what happens, those of you who have watched the video I just mentioned will be aware of
00:56:29.480 | this, is that each of these filters gets placed pixel by pixel or group of pixels by group
00:56:36.400 | of pixels over a photo, over an image, to find which parts of an image it matches.
00:56:42.780 | So which parts have a diagonal line.
00:56:45.640 | And over here it shows 9 examples of little bits of actual ImageNet images which match
00:56:53.700 | this first filter.
00:56:57.360 | So here are, as you can see, they all are little diagonal lines.
00:57:02.000 | So here are 9 examples which match the next filter, the diagonal lines in the opposite
00:57:07.200 | direction and so forth.
00:57:09.580 | The filters in the very first layer of a deep learning network are very easy to visualize.
00:57:15.360 | This has happened for a long time, and we've always really known for a long time that this
00:57:19.700 | is what they look like.
00:57:21.180 | We also know, incidentally, that the human vision system is very similar.
00:57:26.560 | The human vision system has filters that look much the same.
00:57:36.840 | To really answer the question of what are we talking about here, I would say watch the
00:57:41.280 | video.
00:57:42.840 | But the short answer is this is a 7x7 pixel patch which is slid over the image, one group
00:57:51.680 | of 7 pixels at a time, to find which 7x7 patches look like that.
00:57:57.100 | And here is one example of a 7x7 patch that looks like that.
00:58:04.320 | So for example, this gradient, here are some examples of 7x7 patches that look like that.
00:58:13.980 | So we know the human vision system actually looks for very similar kinds of things.
00:58:21.240 | These kinds of things that they look for are called Gabor filters.
00:58:25.080 | If you want to Google for Gabor filters, you can see some examples.
00:58:34.360 | It's a little bit harder to visualize what the second layer of a neural net looks like,
00:58:39.120 | but Zeiler figured out a way to do it.
00:58:41.960 | In his paper, he shows us a number of examples of the second layer of his ImageNet trained
00:58:49.040 | neural network.
00:58:50.800 | Since we can't directly visualize them, instead we have to show examples of what the
00:58:55.640 | filter responds to.
00:58:57.600 | So here is an example of a filter which clearly tends to pick up corners.
00:59:04.880 | So in other words, it's taking the straight lines from the previous layer and combining
00:59:11.400 | them to find corners.
00:59:13.720 | There's another one which is learning to find circles, and another one which is learning
00:59:18.760 | to find curves.
00:59:20.800 | So you can see here are 9 examples from actual pictures on ImageNet, which actually did get
00:59:30.360 | heavily activated by this corner filter.
00:59:33.280 | And here are some that got heavily activated by this circle filter.
00:59:39.040 | The third layer then can take these filters and combine them, and remember this is just
00:59:44.120 | 16 out of 100 which are actually in the ImageNet architecture.
00:59:51.120 | So in layer 3, we can combine all of those to create even more sophisticated filters.
00:59:56.200 | In layer 3, there's a filter which can find repeating geometrical patterns.
01:00:02.520 | Here's a filter, let's go look at the examples.
01:00:07.920 | That's interesting, it's finding pieces of text.
01:00:12.320 | And here's something which is finding edges of natural things like fur and plants.
01:00:20.520 | Layer 4 is finding certain kinds of dog face.
01:00:28.440 | Layer 5 is finding the eyeballs of birds and reptiles and so forth.
01:00:34.240 | So there are 16 layers in our VGG network.
01:00:43.240 | What we do when we fine-tune is we say let's keep all of these learnt filters and use them
01:00:53.400 | and then just learn how to combine the most complex subtle nuanced filters to find cats
01:01:03.640 | versus dogs rather than combine them to learn a thousand categories of ImageNet.
01:01:10.400 | This is why we do fine-tuning.
01:01:14.680 | So Yannette's earlier question was about whether this works for CT scans and lung
01:01:20.360 | cancer, and the answer was yes.
01:01:24.880 | These kinds of filters that find dog faces are not very helpful for looking at a CT scan
01:01:31.120 | and looking for cancer, but these earlier ones that can recognize repeating images or
01:01:36.720 | corners or curves certainly are.
01:01:41.000 | So really regardless of what computer vision work you're doing, starting with some kind
01:01:47.560 | of pre-trained network is almost certainly a good idea because at some level that pre-trained
01:01:51.720 | network has learnt to find some kinds of features that are going to be useful to you.
01:01:56.720 | And so if you start from scratch you have to learn them from scratch.
01:02:01.240 | In cats versus dogs we only had 25,000 pictures.
01:02:04.640 | And so from 25,000 pictures to learn this whole hierarchy of geometric and semantic
01:02:10.960 | structures would have been very difficult.
01:02:13.760 | So let's not learn it, let's use one that's already been learned on ImageNet which is
01:02:17.960 | one and a half million pictures.
01:02:20.540 | So that's the short answer to the question "Why do fine-tuning?"
01:02:25.520 | The longer answer really requires answering the question "What exactly is fine-tuning?"
01:02:31.620 | And to answer the question "What exactly is fine-tuning?" we have to answer the question
01:02:35.600 | "What exactly is a neural network?"
01:02:39.160 | So a neural network - we'll learn more about this shortly - but the short answer [to the
01:03:07.680 | question of which layer's output to use] is: if you're not sure, try all of them.
01:03:10.840 | Generally speaking, if you're doing something with natural images, the second to last layer
01:03:16.400 | is very likely to be the best, but I just tend to try a few.
01:03:20.960 | And we're going to see today or next week some ways that we can actually experiment
01:03:26.240 | with that question.
01:03:29.920 | So as per usual, in order to learn about something we will use Excel.
01:03:34.920 | And here is a deep neural network in Excel.
01:03:40.200 | Rather than having a picture with lots of pixels, I just have three inputs, a single
01:03:46.360 | row with three inputs which are x1, x2 and x3, and the numbers are 2, 3 and 1.
01:03:52.560 | And rather than trying to pick out whether it's a dog or a cat, we're going to assume
01:03:56.080 | there are two outputs, 5 and 6.
01:03:58.520 | So here's like a single row that we're feeding into a deep neural network.
01:04:06.160 | So what is a deep neural network?
01:04:07.480 | A deep neural network basically is a bunch of matrix products.
01:04:14.280 | So what I've done here is I've created a bunch of random numbers.
01:04:19.480 | They are normally distributed random numbers, and this is the standard deviation that I'm
01:04:24.920 | using for my normal distribution, and I'm using 0 as the mean.
01:04:28.920 | So here's a bunch of random numbers.
01:04:33.240 | What if I then take my input vector and matrix multiply them by my random weights?
01:04:42.560 | And here it is.
01:04:43.560 | So here's matrix multiply, that by that.
01:04:47.280 | And here's the answer I get.
01:04:48.720 | So for example, 24.03 = 2 x 11.07 + 3 x -2.81 + 1 x 10.31 and so forth.
01:05:03.640 | Any of you who are either not familiar with or are a little shaky on your matrix vector
01:05:11.560 | products, tomorrow please go to the Khan Academy website and look for Linear Algebra and watch
01:05:19.920 | the videos about matrix vector products.
01:05:22.400 | They are very, very, very simple, but you also need to understand them very, very, very intuitively,
01:05:29.400 | comfortably, just like you understand plus and times in regular algebra.
01:05:35.440 | I really want you to get to that level of comfort with linear algebra because this is
01:05:39.800 | the basic operation we're doing again and again.
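For anyone who wants to check the spreadsheet against code, here is the same single matrix product as a NumPy sketch, with made-up random weights rather than the exact numbers on screen.

```python
import numpy as np

x = np.array([2., 3., 1.])            # the three inputs x1, x2, x3
w1 = np.random.randn(3, 4)            # a random 3x4 weight matrix
activations = x @ w1                  # one matrix-vector product:
                                      # activations[0] = 2*w1[0,0] + 3*w1[1,0] + 1*w1[2,0], etc.
```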
01:06:08.840 | So if that is a single layer, how do we turn that into multi-layers?
01:06:13.440 | Well, not surprisingly, we create another bunch of weights.
01:06:18.400 | And now we take those weights, the new bunch of weights, times the previous activations
01:06:26.080 | with our matrix multiply, and we get a new set of activations.
01:06:30.560 | And then we do it again.
01:06:32.120 | Let's create another bunch of weights and multiply them by our previous set of activations.
01:06:41.640 | Note that the number of columns in your weight matrix is, you can make it as big or as small
01:06:47.360 | as you like, as long as the last one has the same number of columns as your output.
01:06:53.080 | So we have 2 outputs, 5 and 6.
01:06:56.400 | So our final weight matrix had to have 2 columns so that our final activations has 2 things.
01:07:04.880 | So with our random numbers, our activations are not very close to what we hope they would
01:07:11.040 | be, not surprisingly.
01:07:13.880 | So the basic idea here is that we now have to use some kind of optimization algorithm
01:07:19.240 | to repeatably make the weights a little bit better and a little bit better, and we will
01:07:23.400 | see how to do that in a moment.
01:07:24.840 | But for now, hopefully you're all familiar with the idea that there is such a thing as
01:07:29.200 | an optimization algorithm.
01:07:31.320 | An optimization algorithm is something that takes the output of some kind of
01:07:35.760 | mathematical function and finds the inputs to that function that make the output as
01:07:40.760 | low as possible.
01:07:41.760 | And in this case, the thing we would want to make as low as possible would be something
01:07:45.920 | like the sum of squared errors between the activations and the outputs.
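Putting the whole toy network together as a sketch: random weights, with the same 3-input, 2-output shape as the spreadsheet and hidden sizes of 4 and 3, plus the sum of squared errors we would ask an optimizer to minimize.

```python
import numpy as np

x = np.array([2., 3., 1.])            # inputs
target = np.array([5., 6.])           # the outputs we want, 5 and 6

w1 = np.random.randn(3, 4)            # layer 1 weights
w2 = np.random.randn(4, 3)            # layer 2 weights
w3 = np.random.randn(3, 2)            # layer 3 weights: the last one must have 2 columns

a1 = x @ w1                           # first set of activations
a2 = a1 @ w2                          # second set of activations
out = a2 @ w3                         # final activations, one per output

loss = ((out - target) ** 2).sum()    # sum of squared errors to be minimized
```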
01:07:54.520 | I want to point out something here, which is that when we stuck in these random numbers,
01:07:58.680 | the activations that came out, not only are they wrong, they're not even in the same general
01:08:03.520 | scale as the activations that we wanted.
01:08:08.700 | So that's a bad problem.
01:08:11.040 | The reason it's a bad problem is because they're so much bigger than the scale that we were
01:08:15.920 | looking for.
01:08:17.240 | As we change these weights just a little bit, it's going to change the activations by a lot.
01:08:23.200 | And this makes it very hard to train.
01:08:25.960 | In general, you want your neural network to start off even with random weights, to start
01:08:33.960 | off with activations which are all of similar scale to each other, and the output activations
01:08:39.320 | to be of similar scale to the output.
01:08:43.480 | For a very long time, nobody really knew how to do this.
01:08:47.600 | And so for a very long time, people could not really train deep neural networks.
01:08:53.120 | It turns out that it is incredibly easy to do.
01:08:56.720 | And there is a whole body of work talking about neural network initializations.
01:09:03.120 | It turns out that a really simple and really effective neural network initialization is
01:09:07.960 | called Xavier initialization, named after its inventor, Xavier Glorot.
01:09:14.160 | And it is 2 divided by (n_in + n_out).
01:09:19.160 | Like many things in deep learning, you will find this complex-looking thing like Xavier
01:09:25.680 | weight initialization scheme, and when you look into it, you will find it is something
01:09:31.160 | about this easy.
01:09:32.160 | This is about as complex as deep learning gets.
01:09:35.480 | So I am now going to go ahead and implement Xavier deep learning weight initialization
01:09:40.000 | schemes in Excel.
01:09:42.080 | So I'm going to go up here and type =2 divided by 3in + 4out, and put that in brackets because
01:09:56.360 | we're complex and sophisticated mathematicians, and press enter.
01:10:00.960 | There we go.
01:10:01.960 | So now my first set of weights has that as its standard deviation.
01:10:05.780 | My second set of weights I actually have pointing at the same place, because they also have
01:10:09.320 | 4in and 3out.
01:10:12.480 | And then my third I need to have =2 divided by 3in + 2out.
01:10:21.280 | Done!
01:10:22.280 | So I have now implemented it in Excel, and you can see that my activations are indeed
01:10:28.640 | of the right general scale.
01:10:31.640 | So generally speaking, you would normalize your inputs and outputs to be mean 0 and standard
01:10:37.320 | deviation 1.
01:10:38.720 | And if you use these initializations, the activations come out at the same kind of scale as the outputs.
01:10:54.480 | Obviously they're not going to be 5 and 6 because we haven't done any optimization
01:10:57.440 | yet, but we don't want them to be like 100,000.
01:10:59.920 | We want them to be somewhere around 5 and 6.
01:11:07.120 | Eventually we want them to be close to 5 and 6.
01:11:15.680 | And so if we start off with them really high or really low, then optimization is going
01:11:22.160 | to be really finicky and really hard to do.
01:11:24.920 | And so for decades when people tried to train deep learning neural networks, the training
01:11:29.760 | either took forever or was so incredibly unresilient that it was useless, and this one thing, better
01:11:38.640 | weight initialization, was a huge step.
01:11:42.380 | We're talking maybe 3 years ago that this was invented, so this is not like we're going
01:11:47.600 | back a long time, this is relatively recent.
01:11:52.880 | Now the good news is that Keras and pretty much any decent neural network library will
01:11:59.480 | handle your weight initialization for you.
01:12:02.480 | Until very recently they pretty much all used this.
01:12:05.680 | There are some even more recent slightly better approaches, but they'll give you a set of weights
01:12:11.680 | where your outputs will generally have a reasonable scale.
01:12:14.760 | So what's not arbitrary is that you are given your input dimensionality.
01:12:37.360 | So in our case, for example, it would be 224x224 pixels, in this case I'm saying it's 3 things.
01:12:43.860 | You are given your output dimensionality.
01:12:46.220 | So for example in our case, for cats and dogs it's 2, for this I'm saying it's 2.
01:12:54.520 | The thing in the middle about how many columns does each of your weight matrices have is
01:12:59.800 | entirely up to you.
01:13:02.100 | The more columns you add, the more complex your model, and we're going to learn a lot
01:13:06.780 | about that.
01:13:07.780 | As Rachel said, this is all about your choice of architecture.
01:13:10.460 | So in my first one here I had 4 columns, and therefore I had 4 outputs.
01:13:15.600 | In my next one I had 3 columns, and therefore I had 3 outputs.
01:13:19.080 | In my final one I had 2 columns, and therefore I had 2 outputs, and that is the number of
01:13:24.400 | outputs that I wanted.
01:13:26.960 | So this thing of like how many columns do you have in your weight matrix is where you
01:13:30.360 | get to decide how complex your model is, so we're going to see that.
01:13:36.400 | So let's go ahead and create a linear model.
01:13:46.760 | Alright, so we're going to learn how to create a linear model.
01:14:13.560 | Let's first of all learn how to create a linear model from scratch, and this is something
01:14:20.320 | which we did in that original USF Data Institute launch video, but I'll just remind you.
01:14:29.320 | Without using Keras at all, I can define a line as being ax + b, I can then create some
01:14:36.680 | synthetic data.
01:14:37.680 | So let's say I'm going to assume a is 3 and b is 8, create some random x's, and my y will
01:14:43.840 | then be my ax + b.
01:14:45.880 | So here are some x's and some y's that I've created, not surprisingly, this kind of plot
01:14:51.360 | looks like so.
01:14:54.480 | The job of somebody creating a linear model is to say I don't know what a and b is, how
01:15:00.160 | can we calculate them?
01:15:01.560 | So let's forget that we know that they're 3 and 8, and say let's guess that they're
01:15:05.760 | -1 and 1, how can we make our guess better?
01:15:11.400 | And to make our guess better, we need a loss function.
01:15:14.480 | So the loss function is something which is a mathematical function that will be high
01:15:18.800 | if your guess is bad, and is low if it's good.
01:15:22.600 | The loss function I'm using here is sum of squared errors, which is just my actual minus
01:15:27.000 | my prediction squared, and add it up.
01:15:32.320 | So if I define my loss function like that, and then I say my guesses are -1 and 1, I can
01:15:39.040 | then calculate my average loss and it's 9.
01:15:42.760 | So my average loss with my random guesses is not very good.
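A minimal Python sketch of that setup (the names are mine, not necessarily the notebook's):

```python
import numpy as np

def lin(a, b, x):
    return a * x + b                     # a line: ax + b

a_true, b_true = 3., 8.
x = np.random.random(30)
y = lin(a_true, b_true, x)               # synthetic data from the "unknown" line

def avg_loss(y, y_pred):
    return ((y - y_pred) ** 2).mean()    # average of the squared errors

a_guess, b_guess = -1., 1.
avg_loss(y, lin(a_guess, b_guess, x))    # large, because the guess is bad
```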
01:15:47.240 | In order to create an optimizer, I need something that can make my weights a little bit better.
01:15:53.640 | If I have something that can make my weights a little bit better, I can just call it again
01:15:56.680 | and again and again.
01:15:59.720 | That's actually very easy to do.
01:16:01.360 | If you know the derivative of your loss function with respect to your weights, then all you
01:16:08.120 | need to do is update your weights by the opposite of that.
01:16:12.240 | So remember, the derivative is the thing that says, as your weight changes, your output
01:16:20.200 | changes by this amount.
01:16:22.920 | That's what the derivative is.
01:16:24.080 | In this case, we have y = ax + b, and then we have our loss function is actual minus
01:16:46.000 | predicted squared, then add it up.
01:16:49.520 | So we're now going to create a function called update, which is going to take our a guess
01:16:53.640 | and our b guess and make them a little bit better.
01:16:56.920 | And to make them a little bit better, we calculate the derivative of our loss function with respect
01:17:02.320 | to b, and the derivative of our loss function with respect to a.
01:17:06.720 | How do we calculate those?
01:17:08.880 | We go to Wolfram Alpha and we enter in d along with our formula, and the thing we want to
01:17:15.280 | get the derivative of, and it tells us the answer.
01:17:18.440 | So that's all I did, I went to Wolfram Alpha, found the correct derivative, pasted them
01:17:24.160 | in here.
01:17:26.040 | And so what this means is that this formula here tells me as I increase b by 1, my sum
01:17:33.800 | of squared errors will change by this amount.
01:17:38.360 | And this says as I change a by 1, my sum of squared errors will change by this amount.
01:17:43.960 | So if I know that my loss function gets higher by 3 if I increase a by 1, then clearly I need
01:17:57.400 | to make a a little bit smaller, because if I make it a little bit smaller, my loss function
01:18:02.440 | will go down.
01:18:04.680 | So that's why our final step is to say take our guess and subtract from it our derivative
01:18:14.320 | times a little bit.
01:18:15.900 | LR stands for learning rate, and as you can see I'm setting it to 0.01.
01:18:21.920 | How much is a little bit is something which people spend a lot of time thinking about
01:18:26.760 | and studying, and we will spend time talking about, but you can always trial and error
01:18:32.280 | to find a good learning rate.
01:18:34.480 | When you use Keras, you will always need to tell it what learning rate you want to use,
01:18:38.440 | and that's something that you want the highest number you can get away with.
01:18:42.280 | We'll see more of this next week.
01:18:45.240 | But the important thing to realize here is that if we update our guess, minus equals
01:18:50.480 | our derivative times a little bit, our guess is going to be a little bit better because
01:18:57.400 | we know that going in the opposite direction makes the loss function a little bit lower.
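Continuing the sketch above, here is what that update step looks like. For SSE = sum((y - (a*x + b))^2), the derivatives (the sort of thing Wolfram Alpha hands back) are d(SSE)/da = 2 * sum(x * ((a*x + b) - y)) and d(SSE)/db = 2 * sum((a*x + b) - y).

```python
lr = 0.01                                  # learning rate: how big "a little bit" is

def update(a, b):
    y_pred = lin(a, b, x)
    dlda = 2 * (x * (y_pred - y)).sum()    # d(SSE)/da
    dldb = 2 * (y_pred - y).sum()          # d(SSE)/db
    a -= lr * dlda                         # step in the direction that lowers the loss
    b -= lr * dldb
    return a, b

a_guess, b_guess = -1., 1.
for _ in range(100):
    a_guess, b_guess = update(a_guess, b_guess)
# a_guess and b_guess head towards 3 and 8
```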
01:19:03.020 | So let's run those two things, where we've now got a function called update, which every
01:19:07.600 | time we run it makes our predictions a little bit better.
01:19:11.160 | So finally now, I'm basically doing a little animation here that says: for every frame of
01:19:17.360 | the animation, call my animate function, which will call my update function, 10 times in all.
01:19:24.240 | So let's see what happens when I animate that.
01:19:31.520 | There it is.
01:19:32.640 | So it starts with a really bad line, which is my guess of -1 and 1, and it gets better and better.
01:19:40.920 | So this is how stochastic gradient descent works.
01:19:46.400 | Stochastic gradient descent is the most important algorithm in deep learning.
01:19:51.000 | Stochastic gradient descent is the thing that starts with random weights like this and ends
01:19:57.240 | with weights that do what you want to do.
01:20:03.600 | So as you can see, stochastic gradient descent is incredibly simple and yet incredibly powerful
01:20:13.480 | because it can take any function and find the set of parameters that does exactly what
01:20:18.720 | we want to do with that function.
01:20:21.200 | And when that function is a deep learning neural network that becomes particularly powerful.
01:20:41.680 | It has nothing to do with neural nets except - so just to remind ourselves about the setup
01:20:47.120 | for this, we started out by saying this spreadsheet is showing us a deep neural network with a
01:20:54.000 | bunch of random parameters.
01:20:56.960 | Can we come up with a way to replace the random parameters with parameters that actually give
01:21:03.160 | us the right answer?
01:21:05.240 | So we need to come up with a way to do mathematical optimization.
01:21:09.620 | So rather than showing how to do that with a deep neural network, let's see how to do
01:21:15.200 | it with a line.
01:21:17.560 | So we started out by saying let's have a line Ax + b where A is 3 and B is 8, and pretend
01:21:26.080 | we didn't know that A was 3 and B is 8.
01:21:30.520 | Make a wild guess as to what A and B might be, come up with an update function that every
01:21:36.160 | time we call it makes A and B a little bit better, and then call that update function
01:21:41.220 | lots of times and confirm that eventually our line fits our data.
01:21:48.460 | Conceptually take that exact same idea and apply it to these weight matrices.
01:21:53.840 | Question is, is there a problem here that as we run this update function, might we get
01:22:14.240 | to a point where, let's say the function looks like this.
01:22:36.160 | So currently we're trying to optimize sum of squared errors and the sum of squared errors
01:22:42.180 | looks like this, which is fine, but let's say we had a more complex function that kind of
01:22:56.860 | looked like this.
01:23:01.300 | So if we started here and kind of gradually tried to make it better and better and better,
01:23:06.380 | we might get to a point where the derivative is zero and we then can't get any better.
01:23:11.020 | This would be called a local minimum.
01:23:17.260 | So the question was suggesting a particular approach to avoiding that.
01:23:20.780 | Here's the good news, in deep learning you don't have local minimum.
01:23:27.460 | Why not?
01:23:28.460 | Well the reason is that in an actual deep learning neural network, you don't have one
01:23:32.900 | or two parameters, you have hundreds of millions of parameters.
01:23:37.360 | So rather than looking like this, or even like a 3D version where it's like something
01:23:42.860 | like this, it's a 600 million dimensional space.
01:23:48.540 | And so for something to be a local minimum, it means that the stochastic gradient descent
01:23:53.660 | has wandered around and got to a point where in every one of those 600 million directions,
01:23:59.580 | it can't do any better.
01:24:01.380 | The probability of that happening is something like 1 in 2 to the power of 600 million.
01:24:06.140 | So for actual deep learning in practice, there's always enough parameters that it's basically
01:24:12.420 | unheard of to get to a point where there's no direction you can go to get better.
01:24:17.580 | So the answer is no, for deep learning, stochastic gradient descent is just as simple as this.
01:24:30.300 | We will learn some tweaks to allow us to make it faster, but this basic approach works just
01:24:37.700 | fine.
01:24:39.700 | [The question is, "If you hadn't known the derivative of sum of squared errors, would you have been
01:24:45.540 | able to define the same function in a different way?"]
01:24:48.940 | That's a great question.
01:24:50.300 | So what if you don't know the derivative?
01:24:53.740 | And so for a long time, this was a royal goddamn pain in the ass.
01:24:58.060 | Anybody who wanted to create stochastic gradient descent for their neural network had to go
01:25:02.380 | through and calculate all of their derivatives.
01:25:05.260 | And if you've got 600 million parameters, that's a lot of trips to Wolfram Alpha.
01:25:11.300 | So nowadays, we don't have to worry about that because all of the modern neural network
01:25:16.820 | libraries do symbolic differentiation.
01:25:19.780 | In other words, it's like they have their own little copy of Wolfram Alpha inside them
01:25:23.260 | and they calculate the derivatives for you.
01:25:25.700 | So you don't ever be in a situation where you don't know the derivatives.
01:25:29.420 | You just tell it your architecture and it will automatically calculate the derivatives.
01:25:34.820 | So let's take a look.
01:25:36.340 | Let's take this linear example and see what it looks like in Keras.
01:25:47.420 | In Keras, we can do exactly the same thing.
01:25:49.700 | So let's start by creating some random numbers, but this time let's make it a bit more complex.
01:25:54.660 | We're going to have a random matrix with two columns.
01:25:57.940 | And so to calculate our y value, we'll do a little matrix multiply here with our x with
01:26:03.380 | a vector of 2, 3 and then we'll add in a constant of 1.
01:26:15.540 | So here's our x's, the first 5 out of 30 of them, and here's the first few y's.
01:26:22.120 | So here 3.2 equals 0.56 times 2 plus 0.37 times 3 plus 1.
01:26:30.780 | Hopefully this looks very familiar because it's exactly what we did in Excel in the very
01:26:34.980 | first level.
01:26:37.140 | How do we create a linear model in Keras?
01:26:39.940 | And the answer is Keras calls a linear model dense.
01:26:45.020 | It's also known in other libraries as fully connected.
01:26:49.020 | So when we go dense with an input of two columns and an output of one column, we have to find
01:26:57.660 | a linear model that can go from this two column array to this one column output.
01:27:08.180 | The second thing we have in Keras is we have some way to build multiple layer networks,
01:27:13.780 | and Keras calls this sequential. Sequential takes an array that contains all of the layers
01:27:20.140 | that you have in your neural network.
01:27:22.340 | So for example in Excel here, I would have had 1, 2, 3 layers.
01:27:27.860 | In a linear model, we just have one layer.
01:27:30.700 | So to create a linear model in Keras, you say Sequential, passing in an array with a single
01:27:37.460 | layer, and that is a dense layer.
01:27:39.540 | A dense layer is just a simple linear layer.
01:27:42.660 | We tell it that there are two inputs and one output.
01:27:47.780 | This will automatically initialize the weights in a sensible way.
01:27:54.220 | It will automatically calculate the derivatives.
01:27:56.700 | So all we have to tell it is how do we want to optimize the weights, and we will say please
01:28:00.880 | use stochastic gradient descent with a learning rate of 0.1.
01:28:05.780 | And we're attempting to minimize our loss of a mean squared error.
01:28:11.260 | So if I do that, that does everything except the very last solving step that we saw in
01:28:18.340 | the previous notebook.
01:28:19.840 | To do the solving, we just type fit.
01:28:26.820 | And as you can see, when we fit, before we start, we can say evaluate to basically find
01:28:33.260 | out our loss function with random weights, which is pretty crappy.
01:28:37.500 | And then we run 5 epochs, and the loss function gets better and better and better using the
01:28:42.980 | stochastic gradient descent update rule we just learned.
01:28:46.020 | And so at the end, we can evaluate and it's better.
01:28:50.060 | And then let's take a look at the weights.
01:28:51.740 | They should be equal to 2, 3, 1; they're actually 1.8, 2.7, 1.2.
01:28:58.380 | That's not bad.
01:28:59.500 | So why don't we run another 5 epochs.
01:29:04.460 | Loss function keeps getting better, we evaluate it now, it's better and the weights are now
01:29:10.200 | closer again to 2, 3, 1.
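For reference, roughly the same thing written out with the Keras 2 API; the course itself used an earlier Keras, where some argument names differ (for example nb_epoch rather than epochs, and Dense taking input_dim), and newer releases spell lr as learning_rate. Treat this as a sketch, not the notebook's exact code.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

x = np.random.random((30, 2))
y = x @ np.array([[2.], [3.]]) + 1.              # the "unknown" linear function

lm = Sequential([Dense(1, input_shape=(2,))])    # a single dense (fully connected) layer
lm.compile(optimizer=SGD(lr=0.1), loss='mse')    # stochastic gradient descent, mean squared error

lm.evaluate(x, y, verbose=0)                     # loss with the random initial weights
lm.fit(x, y, epochs=5, batch_size=1, verbose=0)
lm.get_weights()                                 # heading towards [[2], [3]] and [1]
```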
01:29:12.840 | So we now know everything that Keras is doing behind the scenes.
01:29:17.500 | Exactly.
01:29:18.500 | I'm not hand-waving over details here, that is it.
01:29:22.580 | So we now know what it's doing.
01:29:27.260 | If we now say that Keras don't just create a single layer, but create multiple layers
01:29:32.540 | by passing it multiple layers to this sequential, we can start to build and optimize deep neural
01:29:38.500 | networks.
01:29:40.220 | But before we do that, we can actually use this to create a pretty decent entry to our
01:29:48.540 | cats and dogs competition.
01:29:52.660 | So forget all the fine-tuning stuff, because I haven't told you how fine-tuning works yet.
01:29:57.780 | How do we take the output of an ImageNet network and as simply as possible create an entry
01:30:02.980 | to our cats and dogs competition?
01:30:05.900 | So the basic problem here is that our current ImageNet network returns a thousand probabilities
01:30:14.300 | in a lot of detail.
01:30:15.940 | So it returns not just cat vs dog, but animals, domestic animals, and then ideally it would
01:30:39.260 | just be cat and dog here, but it's not, it keeps going: Egyptian cat, Persian cat,
01:30:43.900 | and so forth.
01:30:44.900 | So one thing we could do is we could write code to take this hierarchy and roll it up
01:30:51.860 | into cats vs dogs.
01:30:54.460 | So I've got a couple of ideas here for how we could do that.
01:30:58.780 | For instance, we could find the largest probability that's either a cat or a dog out of the thousand,
01:31:04.500 | and use that.
01:31:05.500 | Or we could average all of the cat categories, all of the dog categories, and use that.
01:31:10.020 | But the downsides here are that would require manual coding for something that should be
01:31:14.820 | learning from data, and more importantly it's ignoring information.
01:31:19.620 | So let's say out of those thousand categories, the category for a bone was very high.
01:31:26.220 | It's more likely that a dog is with a bone than a cat is with a bone, so the model ought
01:31:30.780 | to take advantage of that: it should learn to recognize environments that cats are in
01:31:35.420 | vs environments that dogs are in, or even distinguish things that look like cats from
01:31:39.900 | things that look like dogs.
01:31:41.820 | So what we could do is learn a linear model that takes the output of the ImageNet model,
01:31:48.700 | the thousand predictions, and that uses that as the input, and uses the dog cat label as
01:31:55.460 | the target, and that linear model would solve our problem.
01:31:59.820 | We have everything we need to know to create this model now.
01:32:04.700 | So let me show you how that works.
01:32:10.500 | Let's again import our VGG model, and we're going to try and do three things.
01:32:18.220 | For every image we'll get the true labels, is it cat or is it dog.
01:32:23.140 | We're going to get the 1000 ImageNet category predictions, so that will be 1000 floats for
01:32:28.540 | every image, and then we're going to use the output of 2 as the input to our linear model,
01:32:34.180 | and we're going to use the output 1 as the target for our linear model, and create this
01:32:38.700 | linear model and build some predictions.
01:32:41.860 | So as per usual, we start by creating our validation batches and our batches, just like
01:32:49.140 | before.
01:32:50.140 | And I'll show you a trick.
01:32:52.940 | Because one of the steps here is to get the 1000 ImageNet category predictions for every image,
01:32:57.800 | that takes a few minutes.
01:33:00.260 | There's no need to do that again and again.
01:33:02.540 | Once we've done it once, let's save the result.
01:33:05.360 | So I want to show you how you can save NumPy arrays.
01:33:08.460 | Unfortunately, most of the stuff you'll find online about saving NumPy arrays takes a very,
01:33:13.740 | very, very long time to run, and it takes a shitload of space.
01:33:18.140 | There's a really cool library called bcolz that almost nobody knows about that can save
01:33:22.940 | NumPy arrays very, very quickly and in very little space.
01:33:27.540 | So I've created these two little things here called save array and load array, which you
01:33:31.860 | should definitely add to your toolbox.
01:33:34.100 | They're actually in the utils.py, so you can use them in the future.
01:33:37.580 | And once you've grabbed the predictions, you can use these to just save the predictions
01:33:46.800 | and load them back later, rather than recalculating them each time.
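The two helpers are roughly the following; this is a sketch of the usual way bcolz is used, so check utils.py for the actual definitions.

```python
import bcolz

def save_array(fname, arr):
    # write a compressed, chunked copy of the array to disk
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def load_array(fname):
    # read it back into an ordinary in-memory NumPy array
    return bcolz.open(fname)[:]
```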
01:33:50.900 | I'll show you something else we've got.
01:33:53.940 | Before we even worry about calculating the predictions, we just need to load up the images.
01:33:59.940 | When we load the images, there's a few things we have to do.
01:34:01.900 | We have to decode the jpeg images, and we have to convert them into 224x224 pixel images
01:34:07.900 | because that's what VGG expects.
01:34:10.060 | That's kind of slow, too.
01:34:12.220 | So let's also save the result of that.
01:34:15.500 | So I've created this little function called getData, which basically grabs all of the
01:34:22.340 | validation images and all of the training images and sticks them in a NumPy array.
01:34:28.940 | Here's a cool trick.
01:34:33.260 | If you put two question marks before something, it shows you the source code.
01:34:39.500 | So if you want to know what is getData doing, go question mark, question mark, getData,
01:34:43.820 | and you can see exactly what it's doing.
01:34:45.020 | It's just concatenating all of the different batches together.
01:34:52.340 | Any time you're using one of my little convenience functions, I strongly suggest you look at
01:34:56.060 | the source code and make sure you see what it's doing.
01:34:58.900 | Because they're all super, super small.
01:35:02.740 | So I can grab the data for the validation data, I can grab it for the training data,
01:35:07.120 | and then I just saved it so that in the future, I can just load it back in.
01:35:26.660 | So now rather than having to watch and wait for that to pre-process, I'll just go load
01:35:30.700 | array and that goes ahead and loads it off disk.
01:35:35.020 | It still takes a few seconds, but this will be way faster than having to calculate it
01:35:42.340 | directly.
01:35:43.340 | So what that does is it creates a NumPy array with my 23,000 images, each of which has three
01:35:49.380 | colors and is 224x224 in size.
01:35:56.220 | If you remember from lesson 1, the labels that Keras expects are in a very particular
01:36:06.140 | format.
01:36:07.140 | Let's look at the format to see what they look like.
01:36:18.480 | The format of the labels is each one has two things.
01:36:23.780 | It has the probability that it's a cat and the probability that it's a dog, and they're
01:36:28.940 | always just 0's and 1's.
01:36:30.540 | So here is 0, 1 is a dog, 1, 0 is a cat, 1, 0 is a cat, 0, 1 is a dog.
01:36:36.740 | This approach where you have a vector where every element of it is a zero except for a
01:36:41.380 | single one, for the class that you want, is called one-hot encoding.
01:36:47.020 | And this is used for nearly all deep learning.
01:36:51.740 | So that's why I created a little function called one-hot that makes it very easy for
01:36:57.980 | you to one-hot encode your data.
01:37:00.900 | So for example, if your data was just like 0, 1, 2, 1, 0, one-hot encoding that would
01:37:10.380 | look like this.
01:37:12.380 | So that would be the kind of raw form, and that is the one-hot encoded form.
01:37:26.660 | The reason that we use one-hot encoding a lot is that if you take this and you do a matrix
01:37:32.820 | multiply by a bunch of weights, W_1, W_2, W_3, you can calculate that matrix multiply, because
01:37:43.980 | these two shapes are compatible.
01:37:46.820 | So this is what lets you do deep learning really easily with categorical variables.
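A minimal NumPy sketch of one-hot encoding; the course's onehot helper in utils.py does the equivalent.

```python
import numpy as np

def onehot(labels):
    # labels: a 1-D array of integer class ids, e.g. [0, 1, 2, 1, 0]
    n_classes = labels.max() + 1
    return np.eye(n_classes)[labels]   # one row per label, with a single 1 in the class column

onehot(np.array([0, 1, 2, 1, 0]))
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.],
#        [1., 0., 0.]])
```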
01:37:56.520 | So the next thing I want to do is I want to grab my labels and I want to one-hot encode
01:38:01.260 | them by using this one-hot function.
01:38:06.060 | And so you can take a look at that.
01:38:19.700 | So you can see here that the first few classes look like so, but the first few labels are
01:38:31.820 | one-hot encoded like so.
01:38:35.780 | So we're now at a point where we can finally do step number 1, get the 1000 image net category
01:38:47.020 | predictions for every image.
01:38:49.260 | So Keras makes that really easy for us.
01:38:51.980 | We can just say model.predict and pass in our data.
01:38:59.480 | So model.predict with train data is going to give us the 1000 predictions from image
01:39:04.460 | net for our train data, and this will give it for our validation data.
01:39:08.980 | And again, running this takes a few minutes, so I save it, and then instead of making
01:39:14.060 | you wait, I will load it, and so you can see that the 23,000 images
01:39:21.660 | are now no longer 23,000 by 3 by 224 by 224, it's now 23,000 by 1,000, so for every image
01:39:29.900 | we have the 1,000 probabilities.
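In code, that step is one predict call per dataset; the variable names here (trn_data, val_data and the feature arrays) are assumptions on my part, with model being the full 1000-way ImageNet VGG model.

```python
# trn_data / val_data are the image arrays built by the getData helper above
trn_features = model.predict(trn_data, batch_size=64)    # shape (23000, 1000): one row of probabilities per image
val_features = model.predict(val_data, batch_size=64)

save_array('train_lastlayer_features.bc', trn_features)  # cache them with the bcolz helpers
save_array('valid_lastlayer_features.bc', val_features)
```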
01:39:31.860 | So let's look at one of them, train_features 0.
01:39:38.180 | Not surprisingly, if we look at just one of these, nearly all of them are 0.
01:39:43.180 | So for these 1000 categories, only one of these numbers should be big, it can't be lots
01:39:49.940 | of different things, it's not a cat and a dog and a jet airplane.
01:39:53.860 | So not surprisingly, nearly all of these things are very close to 0, and hopefully just one
01:39:59.040 | of them is very close to 1.
01:40:02.060 | So that's exactly what we'd expect.
01:40:05.060 | So now that we've got our 1000 features for each of our training images and for each of
01:40:11.180 | our validation images, we can go ahead and create our linear model.
01:40:15.500 | So here it is, here's our linear model.
01:40:18.060 | The input is 1000 columns, it's every one of those image net predictions.
01:40:23.460 | The output is 2 columns, it's a dog or it's a cat.
01:40:28.740 | We will optimize it with, I'm actually not going to use SGD, I'm going to use a slightly
01:40:35.060 | better thing called rmsprop which I will teach you about next week, it's a very minor tweak
01:40:39.780 | on SGD that tends to be a lot faster.
01:40:42.460 | So I suggest in practice you use rmsprop, not SGD, but it's almost the same thing.
01:40:49.840 | And now that we know how to fit the model, once it's defined, we can just go model.fit,
01:40:56.900 | and it runs basically instantly because all it has to do is, let's have a look at our
01:41:03.180 | model, lm.summary.
01:41:08.760 | We have just one layer with just 2000 weights, so running 3 epochs took 0 seconds.
01:41:17.560 | And we've got an accuracy of 0.9734, let's run another 3 epochs, 0.9770, even better.
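A sketch of that linear model in Keras; the softmax activation and categorical cross-entropy loss are assumptions on my part (they are the usual choices for a two-class one-hot target, and we will cover them properly later), as are the variable names for the features and one-hot labels.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

lm = Sequential([Dense(2, activation='softmax', input_shape=(1000,))])
lm.compile(optimizer=RMSprop(lr=0.1),
           loss='categorical_crossentropy', metrics=['accuracy'])

lm.fit(trn_features, trn_labels, epochs=3, batch_size=64,
       validation_data=(val_features, val_labels))
```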
01:41:30.120 | So you can see this is like the simplest possible model.
01:41:33.940 | I haven't done any fine-tuning, all I've done is I've just taken the image net predictions
01:41:40.020 | for every image and built a linear model that maps from those predictions to cat or dog.
01:41:47.380 | A lot of the amateur deep learning papers that you see, like I showed you a couple last
01:41:52.980 | week, one was like classifying leaves by whether they're sick, one was like classifying skin
01:41:59.300 | lesions by type of lesion.
01:42:03.420 | Often this is all people do, they take a pre-trained model, they grab the outputs and they stick
01:42:10.540 | it into a linear model and then they use it.
01:42:13.140 | And as you can see, it actually works pretty well.
01:42:17.860 | So I just wanted to point out here that in getting this 0.9770 result, we have not used
01:42:24.940 | any magic libraries at all.
01:42:30.060 | All we've done is the following (it looks like more code than it really is, just because we've done some saving
01:42:36.940 | and stuff as we go).
01:42:38.440 | We grabbed our batches, just to grab the data.
01:42:42.700 | We turned the images into a numpy array.
01:42:50.620 | We took the numpy array and ran model.predict on it.
01:42:59.320 | We grabbed our labels and we one-hot encoded them.
01:43:06.700 | And then finally we took the one-hot encoded labels and the thousand probabilities and
01:43:12.580 | we fed them to a linear model with 1000 inputs and 2 outputs.
01:43:23.140 | And then we trained it and we ended up with a validation accuracy of 0.977.
01:43:29.460 | So what we're really doing here is we're digging right deep into the details.
01:43:34.540 | We know exactly how SGD works.
01:43:36.860 | We know exactly how the layers are being calculated, and we know exactly what Keras is doing behind
01:43:43.620 | the scenes.
01:43:44.620 | So we started way up high with something that was totally obscure as to what was going on.
01:43:49.180 | We were just using it like you might use Excel, and we've gone all the way down to see exactly
01:43:53.180 | what's going on, and we've got a pretty good result.
01:44:02.100 | The last thing we're going to do is take this and turn it into a fine-tuning model to get
01:44:07.220 | a slightly better result.
01:44:08.560 | And so what is fine-tuning?
01:44:10.620 | In order to understand fine-tuning, we're going to have to understand one more piece
01:44:14.120 | of a deep learning model.
01:44:16.740 | And this is activation functions, this is our last major piece.
01:44:22.220 | I want to point something out to you.
01:44:24.300 | In this view of a deep learning model, we went matrix-multiply, matrix-multiply, matrix-multiply.
01:44:36.580 | Who wants to tell me how can you simplify a matrix-multiply on top of a matrix-multiply
01:44:42.140 | on top of a matrix-multiply?
01:44:44.420 | What's that actually doing?
01:44:47.340 | A linear model and a linear model and a linear model is itself a linear model.
01:44:53.100 | So in fact, this whole thing could be turned into a single matrix-multiply because it's
01:44:58.860 | just doing linear on top of linear on top of linear.
01:45:02.080 | So this clearly cannot be what deep learning is really doing because deep learning is doing
01:45:07.580 | something a lot more than a linear model.
01:45:10.080 | So what is deep learning actually doing?
01:45:12.380 | What deep learning is actually doing is at every one of these points where it says activations,
01:45:18.020 | with deep learning we do one more thing which is we put each of these activations through
01:45:23.420 | a non-linearity of some sort.
01:45:27.580 | There are various things that people use, sometimes people use tanh, sometimes people
01:45:32.380 | use sigmoid, but most commonly nowadays people use max(0,x) which is called ReLU, or rectified
01:45:42.940 | linear.
01:45:43.940 | When you see rectified linear activation function, people actually mean max(0,x).
01:45:52.780 | So if we took this Excel spreadsheet and added =MAX(0, x) to each of these activations
01:46:19.940 | and recalculated.
01:46:24.740 | So if we replace the activation with this and did that at each layer, we now have a
01:46:30.520 | genuine modern deep learning neural network.
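In code the non-linearity really is that simple. Continuing the earlier NumPy sketch, putting a ReLU after each matrix product turns the stack of linear layers into a genuine neural network:

```python
import numpy as np

def relu(x):
    return np.maximum(0., x)             # max(0, x), element-wise

x = np.array([2., 3., 1.])
w1, w2, w3 = np.random.randn(3, 4), np.random.randn(4, 3), np.random.randn(3, 2)

a1 = relu(x @ w1)                        # matrix product, then the non-linearity
a2 = relu(a1 @ w2)
out = a2 @ w3                            # the final layer is left linear here, as in the spreadsheet
```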
01:46:36.580 | Interestingly it turns out that this kind of neural network is capable of approximating
01:46:43.020 | any given function, of arbitrarily complexity.
01:46:48.260 | In the lesson you'll see that there is a link to a fantastic tutorial by Michael Nielsen
01:46:54.060 | on this topic, which is here.
01:47:00.460 | And what he does is he shows you how with exactly this kind of approach where you put
01:47:05.060 | functions on top of functions, you can actually drag them up and down to see how you can change
01:47:13.180 | the parameters and see what they do.
01:47:15.660 | And he gradually builds up so that once you have a function of a function of a function
01:47:20.540 | of this type, he shows you how you can gradually create arbitrarily complex shapes.
01:47:27.820 | So using this incredibly simple approach where you have a matrix multiplication followed
01:47:34.180 | by a rectified linear, which is max(0,x) and stick that on top of each other, on top of
01:47:40.100 | each other, that's actually what's going on in a deep learning neural net.
01:47:45.860 | And so you will see that in all of the deep neural networks we have created so far, we
01:47:52.940 | have always had this extra parameter activation equals something.
01:48:02.620 | And generally you'll see activation equals relu.
01:48:02.620 | And that's what it's doing.
01:48:03.620 | It's saying after you do the matrix product, do a max(0,x).
01:48:10.500 | So what we need to do is we need to take our final layer, which has both a matrix multiplication
01:48:18.220 | and an activation function, and what we're going to do is we're going to remove it.
01:48:25.140 | So I'll show you why, if we look at our model, our VGG model, let's take a look at it.
01:48:44.540 | And let's see what does the end of it look like.
01:49:03.220 | The very last layer is a dense layer, that is, a linear layer.
01:49:03.220 | It seems weird therefore that in that previous section where we added an extra dense layer,
01:49:09.500 | why would we add a dense layer on top of a dense layer given that this dense layer has
01:49:14.820 | been tuned to find the 1000 image net categories?
01:49:19.580 | Why would we want to take that and add on top of it something that's tuned to find cats
01:49:23.020 | and dogs?
01:49:24.140 | How about we remove this and instead use the previous dense layer with its 4096 activations
01:49:34.540 | and use that to find our cats and dogs?
01:49:39.020 | So to do that, it's as simple as saying model.pop, that will remove the very last layer, and
01:49:47.220 | then we can go model.add and add in our new linear layer with two outputs, cat and dog.
01:49:58.040 | So when we said vgg.finetune earlier, what was it actually doing? We can have a look at vgg.finetune.
01:50:14.460 | Here is the source code: model.pop, model.add, a dense layer with the correct number of classes,
01:50:24.980 | and an input parameter which, interestingly, is actually incorrect, I think it's being
01:50:32.900 | ignored.
01:50:33.900 | I will fix that later.
01:50:37.060 | So it's basically doing a model.pop and then model.add dense.
01:50:44.740 | So once we've done that, we will now have a new model which is designed to calculate
01:50:51.780 | cats versus dogs rather than designed to calculate image net categories and then calculate cats
01:50:57.980 | versus dogs.
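A sketch of that fine-tuning step on the underlying Keras Sequential model. Freezing the earlier layers so that only the new layer is trained is, I believe, what the course's finetune method does behind the scenes; treat the optimizer, learning rate and other details here as assumptions.

```python
from keras.layers import Dense
from keras.optimizers import RMSprop

model = vgg.model                          # the Keras Sequential model inside the course's Vgg16 wrapper
model.pop()                                # drop the final 1000-way ImageNet dense layer

for layer in model.layers:
    layer.trainable = False                # keep all the pre-trained filters fixed

model.add(Dense(2, activation='softmax'))  # new cat-vs-dog output layer
model.compile(optimizer=RMSprop(lr=0.01),
              loss='categorical_crossentropy', metrics=['accuracy'])
# then train with batches, e.g. model.fit_generator(batches, ..., validation_data=val_batches, ...)
```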
01:50:59.160 | And so when we use that approach, everything else is exactly the same.
01:51:05.300 | We then compile it, giving it an optimizer, and then we can call model.fit.
01:51:15.460 | Anything where we want to use batches, by the way, we have to use in Keras something_generator.
01:51:21.020 | This is fit_generator because we're passing in batches.
01:51:24.740 | And if we run it for 2 epochs, you can see we get 97.35.
01:51:31.580 | If we run it for a little bit longer, eventually we will get something quite a bit better than
01:51:35.940 | our previous linear model on top of image net approach.
01:51:40.020 | In fact we know we can, we got 98.3 when we looked at this fine-tuning earlier.
01:51:46.420 | So that's the only difference between fine-tuning and adding an additional linear layer.
01:51:52.940 | We just do a pop first before we add it.
01:51:59.740 | Of course once I calculate it, I would then go ahead and save the weights and then we
01:52:04.020 | can use that again in the future.
01:52:05.660 | And so from here on in, you'll often find that after I create my fine-tuned model, I
01:52:09.860 | will often go model.load_weights('fine_tune_1.h5') because this is now something that we can use
01:52:16.620 | as a pretty good starting point for all of our future dogs and cats models.
01:52:23.640 | I think that's about everything that I wanted to show you for now.
01:52:28.580 | Anybody who is interested in going further during the week, there is one more section
01:52:32.140 | here in this lesson which is showing you how you can train more than just the last layer,
01:52:36.580 | but we'll look at that next week as well.
01:52:39.420 | So during this week, the assignment is really very similar to last week's assignment, but
01:52:45.180 | it's just to take it further.
01:52:46.620 | Now that you actually know what's going on with fine-tuning and linear layers, there's
01:52:52.220 | a couple of things you could do.
01:52:53.900 | One is, for those of you who haven't yet entered the cats and dogs competition, get your entry in.
01:52:59.700 | And then have a think about everything you know about the evaluation function, the categorical
01:53:05.700 | cross-entropy loss function, fine-tuning, and see if you can find ways to make your model
01:53:11.940 | better and see how high up the leaderboard you can get using this information.
01:53:16.740 | Maybe you can push yourself a little further, read some of the other forum threads on Kaggle
01:53:21.300 | and on our forums and see if you can get the best result you can.
01:53:28.180 | If you want to really push yourself then, see if you can do the same thing by writing
01:53:31.820 | all of the code yourself, so don't use our fine-tune at all.
01:53:36.100 | Don't use our notebooks at all, see if you can build it from scratch just to really make
01:53:42.940 | sure you understand how it works.
01:53:45.020 | And then of course, if you want to go further, see if you can enter not just the dogs and
01:53:50.540 | cats competition, but see if you can enter one of the other competitions that we talk
01:53:54.060 | about on our website such as Galaxy Zoo or the Plankton competition or the State Farm
01:54:01.100 | Driver Distraction competition or so forth.
01:54:06.220 | Great!
01:54:07.220 | Well thanks everybody, I look forward to talking to you all during the week and hopefully see
01:54:11.340 | you here next Monday.
01:54:13.140 | Thanks very much.
01:54:14.140 | [Applause]