Lesson 3: Practical Deep Learning for Coders
00:00:00.000 |
Let's start on the Lesson 3 section of the wiki, because Rachel added 00:00:13.200 |
something this week which I think is super helpful. In the section 00:00:18.180 |
about the assignments, where it talks about going through the notebooks, there is 00:00:22.640 |
a section called "How to Use the Provided Notebooks". The feedback I get 00:00:28.320 |
is that each time I talk about the teaching approach in this class, people get a lot out of it. 00:00:34.680 |
So I thought I'd keep talking a little bit about that. 00:00:39.240 |
As we've discussed before, in the two hours that we spend together each week, that's not 00:00:43.960 |
nearly enough time for me to teach you Deep Learning. 00:00:46.800 |
I can show you what kinds of things you need to learn about and I can show you where to 00:00:51.080 |
look and try to give you a sense of some of the key topics. 00:00:55.000 |
But then the idea is that you're going to learn about deep learning during the week by experimenting. 00:01:00.600 |
And one of the places that you can do that experimenting is with the help of the notebooks. 00:01:08.420 |
Having said that, if you do that by loading up a notebook and hitting Shift + Enter a 00:01:12.480 |
bunch of times to go through each cell until you get an error message and then you go, 00:01:16.680 |
"Oh shit, I got an error message," you're not going to learn anything about deep learning. 00:01:23.700 |
I was almost tempted to not put the notebooks online until a week after each class because 00:01:30.960 |
it's just so much better when you can build it yourself. 00:01:35.520 |
But the notebooks are very useful if you use them really rigorously and thoughtfully which 00:01:40.120 |
is as Rachel described here, read through it and then put it aside, minimize it or close 00:01:47.400 |
it or whatever, and now try and replicate what you just read from scratch. 00:01:53.920 |
And anytime you get stuck, you can go back and open back up the notebook, find the solution 00:01:59.840 |
to your problem, but don't copy and paste it. 00:02:03.580 |
Put the notebook aside again, go and read the documentation about what it turns out 00:02:08.440 |
the solution was, try and understand why this is the solution, and type in that solution yourself. 00:02:16.360 |
And so if you can do that, it means you really understand now the solution to this thing 00:02:20.800 |
you're previously stuck on and you've now learned something you didn't know before. 00:02:27.680 |
So if you're still stuck, you can refer back to the notebook again. Still don't copy and 00:02:32.040 |
paste the code, but whilst having both open on the screen at the same time, type in the code yourself. 00:02:39.560 |
Why would you type in code you could copy and paste? Because the very kinesthetic process 00:02:43.800 |
of typing it in forces you to think about where the parentheses are, where the dots are, and so forth. 00:02:49.920 |
And then once you've done that, you can try changing the inputs to that function and see 00:02:55.040 |
what happens and see how it affects the outputs and really experiment. 00:02:59.320 |
So it's through this process that you learn: trying to come up with what step to take next 00:03:06.140 |
means that you're thinking about the concepts you've learned. 00:03:09.000 |
Then working out how to do that step means that you're having to recall how the actual libraries work. 00:03:15.560 |
And then most importantly, through experimenting with the inputs and outputs, you get a really strong intuitive understanding of what's going on. 00:03:22.400 |
So one of the questions I was thrilled to see over the weekend, which is exactly the 00:03:47.060 |
kind of experimentation I'm talking about, was about a 1D convolution. So: I sent it two vectors with two things each and was happy with the result. 00:03:53.080 |
And then I sent it two vectors with three things in, and I don't get it. 00:03:58.560 |
This is taking it right down to basics to make sure I really understand it. 00:04:02.720 |
And so I typed something in and the output was not what I expected, what's going on. 00:04:08.440 |
And so then I tried it by creating a little spreadsheet and showed here are the three 00:04:15.760 |
numbers and here's what happens to them. And then it's like, "Okay, I kind of get that, not fully," and then I finally described it. 00:04:23.240 |
So you now understand correlation and convolution. 00:04:25.960 |
You know you do because you put it in there, you figure out what the answer ought to be, 00:04:30.600 |
and eventually the answer is what you thought. 00:04:32.520 |
So this is exactly the kind of experimentation I find a lot of people try to jump straight 00:04:43.320 |
to full-scale image recognition before they've got to the kind of 1+1 stage. 00:04:51.000 |
And so you'll see I do a lot of stuff in Excel, and this is why. 00:04:55.480 |
In Excel or with simple little things in Python, I think, is where you get the most experimental insight. 00:05:02.640 |
So that's what we're talking about when we talk about experiments. 00:05:10.280 |
I want to show you something pretty interesting. 00:05:12.960 |
And remember last week we looked at this paper from Matt Zeiler where we saw what the different 00:05:23.040 |
layers of a convolutional neural network look like. 00:05:33.880 |
One of the steps in the how to use the provided notebooks is if you don't know why a step 00:05:57.280 |
is being done or how it works or what you observe, please ask. 00:06:06.440 |
Any time you're stuck for half an hour, please ask. 00:06:12.000 |
So far, I believe that there has been a 100% success rate in answering questions on the forum. 00:06:23.120 |
So part of the homework this week in the assignments is ask a question on the forum. 00:06:31.080 |
Questions about setting up AWS? Don't be embarrassed if you still have questions there. 00:06:43.240 |
I know a lot of people are still working through cats and dogs, or cats and dogs redux. 00:06:53.180 |
There are plenty of people here who have never used Python before. 00:06:59.260 |
The goal is that for those of you that don't know Python, that we give you the resources 00:07:04.180 |
to learn it and learn it well enough to be effective in doing deep learning in it. 00:07:09.920 |
But that does mean that you guys are going to have to ask more questions. 00:07:17.680 |
So if you see somebody asking on the forum about how do I analyze functional brain MRIs 00:07:24.840 |
with 3D convolutional neural networks, that's fine, that's where they are at. 00:07:29.120 |
That's okay if you then ask, What does this Python function do? 00:07:34.520 |
If you see somebody ask, What does this Python function do? 00:07:36.960 |
And you want to talk about 3D brain MRIs, do that too. 00:07:40.560 |
The nice thing about the forum is that as you can see, it really is buzzing now. 00:07:48.040 |
The nice thing is that the different threads allow people to dig into the stuff that interests them. 00:07:54.360 |
And I'll tell you from personal experience, the thing that I learn the most from is answering questions. 00:08:03.060 |
So actually answering that question about a 1D convolution, I found very interesting. 00:08:08.080 |
I actually didn't realize that the reflect parameter was the default parameter and I 00:08:12.560 |
didn't quite understand how it worked, so answering that question I found very interesting. 00:08:17.600 |
And even sometimes if you know the answer, figuring out how to express it teaches you something new. 00:08:23.280 |
So asking questions of any level is always helpful to you and to the rest of the community. 00:08:32.160 |
So please, if everybody only does one part of the assignments this week, do that one. 00:08:40.280 |
And here are some ideas about questions you could ask if you're not sure. 00:08:49.360 |
So I was saying last week we kind of looked later in the class at this amazing visualization 00:08:56.360 |
of what goes on in a convolutional neural network. 00:09:00.440 |
I want to show you something even cooler, which is the same thing in video. 00:09:07.800 |
This is by an amazing guy called Jason Yosinski, his supervisor Hod Lipson, and some other colleagues. 00:09:18.760 |
And so I'm going to show you what's going on here. 00:09:26.400 |
So if you go to Google and search for the Deep Visualization Toolbox, you can do this. 00:09:32.180 |
You can grab pictures, you can click on any one of the layers of a convolutional neural 00:09:37.640 |
network, and it will visualize every one of the outputs of the filters in that convolutional layer. 00:09:47.920 |
So you can see here with this dog, it looks like there's a filter here which is kind of finding edges. 00:09:55.000 |
So if you give it a video stream of your own webcam, you can see the video stream popping up, run through each of these filters. 00:10:04.200 |
And looking at this tool now, I hope, will give us a better intuition about what's going on. 00:10:12.680 |
As he slides a piece of paper over it, you get this very strong edge. 00:10:18.280 |
And clearly it's specifically a horizontal edge detector. 00:10:21.360 |
And here is actually a visualization of the pixels of the filter itself. 00:10:26.480 |
Remember from our initial lesson 0, an edge detector has black on one side and white on the other. 00:10:32.200 |
So you can scroll through all the different layers of this neural network. 00:10:40.280 |
And the deeper the layer, the larger the area of the original image it covers (and so the smaller its 00:10:47.280 |
output is), and the more complex the objects that it can recognize. 00:10:52.160 |
So here's an interesting example of a layer 5 thing which it looks like it's a face detector. 00:10:58.440 |
So you can see that as he moves his face around, this is moving around as well. 00:11:03.780 |
So one of the cool things you can do with this is you can say show me all the images 00:11:07.640 |
from ImageNet that match this filter as much as possible, and you can see that it's showing a bunch of faces. 00:11:13.760 |
This is a really cool way to understand what your neural network is doing, or what the ImageNet network has learned. 00:11:23.480 |
You can see other guys come along and here we are. 00:11:26.040 |
And so here you can see the actual result in real time of the filter deconvolution, 00:11:31.280 |
and here's the actual recognition that it's doing. 00:11:34.200 |
So clearly it's a face detector which also detects cat faces. 00:11:39.520 |
So the interesting thing about these types of neural net filters is that they're often quite semantic. 00:11:46.720 |
They're not looking for just some fixed set of pixels, but they really understand concepts. 00:11:55.400 |
Here's one of the filters in the 5th layer which seems to be like an armpit detector. 00:12:03.680 |
Well interestingly, what he shows here is that actually it's not an armpit detector. 00:12:09.240 |
If he smooths out his fabric, this disappears. 00:12:12.920 |
So what this actually is, is a texture detector. 00:12:16.200 |
It's something that detects some kind of regular texture. 00:12:23.520 |
Here's an interesting example of one which clearly is a text detector. 00:12:27.600 |
Now interestingly, ImageNet did not have a category called text, one of the thousand 00:12:32.880 |
categories is not text, but one of the thousand categories is bookshelf. 00:12:37.800 |
And so you can't find a bookshelf if you don't know how to find a book, and you can't find 00:12:41.840 |
a book if you don't know how to recognize its spine, and the way to recognize its spine is to recognize the text on it. 00:12:47.480 |
So this is the cool thing about these neural networks: you don't have to tell them what features to look for. 00:12:53.860 |
They decide what they want to find in order to solve your problem. 00:12:59.520 |
So I wanted to start at this end of "Oh my God, deep learning is really cool" and then 00:13:06.480 |
jump back to the other end of "Oh my God, deep learning is really simple." 00:13:11.960 |
So everything we just saw works because of the things that we've learned about so far, 00:13:18.080 |
and I've got a section here called CNN Review in lesson 3. 00:13:21.920 |
And Rachel and I have started to add some of our favorite readings about each of these 00:13:27.280 |
pieces, but everything you just saw in that video consists of the following pieces. 00:13:33.280 |
Matrix products, convolutions just like we saw in Excel and Python, activations such 00:13:42.260 |
as ReLUs and softmax, and stochastic gradient descent, which is based on backpropagation 00:13:50.120 |
- we'll learn more about that today - and that's basically it. 00:13:54.700 |
One of the, I think, challenging things is, even if you feel comfortable with each of 00:13:59.080 |
the five pieces that make up convolutional neural networks, really understanding how 00:14:06.960 |
those pieces fit together to actually do deep learning. 00:14:10.920 |
So we've got two really good resources here on putting it all together. 00:14:15.040 |
So I'm going to go through each of these pieces today as revision, but what I suggest 00:14:22.400 |
you do, if there's any piece where you feel like "I'm not quite confident that I really know 00:14:27.480 |
what a convolution is" or "I really know what an activation function is", is see if this information 00:14:32.440 |
is helpful, and maybe ask a question on the forum. 00:14:40.840 |
I think a particularly good place to start maybe is with convolutions. 00:14:47.120 |
And a good reason to start with convolutions is because we haven't really looked at them much since Lesson 0. 00:15:00.720 |
So in Lesson 0, we learned about what a convolution is and we learned about what a convolution 00:15:05.500 |
is by actually running a convolution against an image. 00:15:12.760 |
The MNIST dataset, remember, consists of 55,000 28x28 grayscale images of handwritten digits. 00:15:24.200 |
So each one of these has some known label, and so here's five examples with a known label. 00:15:33.620 |
So in order to understand what a convolution is, we tried creating a simple little 3x3 matrix. 00:15:41.380 |
And so the 3x3 matrix we started with had negative 1s at the top, 1s in the middle, and 0s at the bottom. 00:15:52.240 |
So what would happen if we took this 3x3 matrix and we slid it over every 3x3 part of this 00:16:01.420 |
image, and we multiplied negative 1 by the first pixel, negative 1 by the second pixel, negative 1 by the third, 00:16:07.560 |
and then moved to the next rows and multiplied by 1, 1, 1 and 0, 0, 0, and added them all together? 00:16:23.920 |
So you might remember from Lesson 0, we looked at a little area to actually see what this does. 00:16:31.120 |
So we could zoom in, so here's a little small little bit of the 7. 00:16:43.440 |
And so one thing I think is helpful is just to look at what is that little bit. 00:16:51.280 |
Let's make it a bit smaller so it fits on our screen. 00:16:57.280 |
So you can see that an image just is a bunch of numbers. 00:17:03.240 |
And the blacks are zeros, and the things in between are bigger and bigger numbers until eventually the whites are 1s. 00:17:11.100 |
So what would happen if we took this little 3x3 area? 00:17:13.960 |
0, 0, 0, 0, 0.35, 0.5, 0.9, 0.9, 0.9, and we multiplied each of those 9 things by each of the 9 things in our filter? 00:17:27.720 |
So clearly anywhere where the first row is zeros and the second row is ones, this is 00:17:38.000 |
going to be very high when we multiply it all together and add the 9 things up. 00:17:42.720 |
And so given that white means high, you can see then that when we do this convolution, 00:17:53.680 |
we end up with something where the top edges become bright, because we went -1, -1, -1 times 00:18:02.840 |
the dark row above the edge, plus 1, 1, 1 times the bright row, and added them all together. 00:18:07.040 |
So one of the things we looked at in lesson 0 and we have a link to here is this cool 00:18:11.840 |
little image kernel explained visually site where you can actually create any 3x3 matrix 00:18:21.600 |
yourself and go through any 3x3 part of this picture and see the actual arithmetic and see the result. 00:18:34.480 |
So if you're not comfortable with convolutions, this would be a great place to go next. 00:18:47.160 |
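As a concrete, entirely made-up miniature version of that arithmetic, here is a sketch in plain NumPy of sliding the hand-made top-edge filter over a tiny patch. The image values are invented and this is not the course notebook's code:

```python
# Minimal sketch of the lesson 0 convolution arithmetic (illustrative values only)
import numpy as np

# Tiny fake "image" patch: dark (0) on top, bright (0.9) below, like the top of a 7
img = np.array([[0. , 0. , 0. , 0. ],
                [0. , 0. , 0. , 0. ],
                [0.9, 0.9, 0.9, 0.9],
                [0.9, 0.9, 0.9, 0.9]])

# The hand-designed top-edge filter: -1s on top, 1s in the middle, 0s at the bottom
top = np.array([[-1., -1., -1.],
                [ 1.,  1.,  1.],
                [ 0.,  0.,  0.]])

# Slide the 3x3 filter over every 3x3 part of the image ("valid" positions only)
out = np.zeros((img.shape[0] - 2, img.shape[1] - 2))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = (img[i:i+3, j:j+3] * top).sum()   # multiply element-wise, add up

print(out)   # the row where dark meets bright lights up (2.7); everywhere else is 0
```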
How did you decide on the values of the top matrix? 00:18:52.240 |
So in order to demonstrate an edge filter, I picked values based on some well-known edge detection filters. 00:19:02.260 |
So you can see here's a bunch of different matrices that this guy has. 00:19:06.040 |
So for example, top_sobel, I could select, and you can see that does a top_edge filter. 00:19:12.180 |
Or I could say emboss, and you can see it creates this embossing effect. 00:19:19.760 |
Here's a better example because it's nice and big here. 00:19:25.520 |
So these types of filters have been created over many decades, and there's lots and lots 00:19:30.880 |
of filters designed to do interesting things. 00:19:34.800 |
So I just picked a simple filter which I knew from experience and from common sense would detect top edges. 00:19:53.320 |
And so by the same kind of idea, if I rotate that by 90 degrees, that's going to create a left-edge filter. 00:20:05.040 |
So if I create the four different types of filter here, and I could also create four 00:20:09.980 |
different diagonal filters like these, that would allow me to create top edge, left edge, 00:20:16.840 |
bottom edge, right edge, and then each diagonal edge filter here. 00:20:21.560 |
So I created these filters just by hand through a combination of common sense and having read 00:20:29.440 |
about filters because people spend time designing filters. 00:20:35.040 |
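For what it's worth, here is a small sketch (my own, not the notebook's code) of how the other straight-edge filters can be generated just by rotating the hand-made top-edge filter:

```python
# Sketch: rotating one hand-made edge filter to get the other three
import numpy as np

top = np.array([[-1., -1., -1.],
                [ 1.,  1.,  1.],
                [ 0.,  0.,  0.]])

left   = np.rot90(top, 1)   # 90 degrees  -> responds to left edges
bottom = np.rot90(top, 2)   # 180 degrees -> responds to bottom edges
right  = np.rot90(top, 3)   # 270 degrees -> responds to right edges

edge_filters = [top, left, bottom, right]   # diagonal filters would be built similarly
```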
The more interesting question then really is what would be the optimal way to design these filters. 00:20:42.460 |
Because it's definitely not the case that these eight filters are the best way of figuring 00:20:48.800 |
out what's a 7 and what's an 8 and what's a 1. 00:20:56.040 |
What deep learning does is it says let's start with random filters. 00:21:01.320 |
So let's not design them, but we'll start with totally random numbers for each of our 00:21:06.520 |
So we might start with eight random filters, each of 3x3. 00:21:12.200 |
And we then use stochastic gradient descent to find out what the optimal values of those filters are. 00:21:23.300 |
And that's what happened in order to create that cool video we just saw, and that cool paper we looked at last week. 00:21:29.800 |
That's how those different kinds of edge detectors and gradient detectors and so forth were created. 00:21:34.880 |
When you use stochastic gradient descent to optimize these kinds of values when they start 00:21:39.980 |
out random, it figures out that the best way to recognize images is by creating these kinds of filters. 00:21:52.600 |
Where it gets interesting is when you start building convolutions on top of convolutions. 00:21:59.640 |
So we saw last week (and if I skip over anything here, please remind me) 00:22:23.480 |
how if you've got three inputs, you can create a bunch of weight matrices. 00:22:35.960 |
So if we've got three inputs, we saw last week how you could create a random matrix 00:22:41.280 |
and then do a matrix multiply of the inputs times a random matrix. 00:22:49.180 |
We could then put it through an activation function such as max(0,x) and we could then 00:22:58.600 |
take that and multiply it by another weight matrix to create another output. 00:23:05.260 |
And then we could put that through max(0,x) and we can keep doing that to create arbitrarily 00:23:12.240 |
complex functions. And we looked at this really great Neural Networks and Deep Learning chapter 00:23:22.600 |
where we saw visually how that kind of bunch of matrix products followed by activation 00:23:28.960 |
functions can approximate any given function. 00:23:36.280 |
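To make that concrete, here is a tiny sketch (with made-up sizes and random weights, purely illustrative) of the "matrix product, then max(0, x), then another matrix product" stack being described:

```python
# Sketch: stacking matrix products with max(0, x) activations
import numpy as np

x  = np.random.rand(3)            # three inputs
W1 = np.random.randn(3, 10)       # a random weight matrix
W2 = np.random.randn(10, 10)      # another random weight matrix
W3 = np.random.randn(10, 2)       # final weights down to two outputs

a1 = np.maximum(0., x @ W1)       # matrix product, then the max(0, x) activation
a2 = np.maximum(0., a1 @ W2)      # ...and again
out = a2 @ W3                     # keep stacking to build arbitrarily complex functions
print(out.shape)                  # (2,)
```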
So where it gets interesting then is instead of just having a bunch of weight matrices and 00:23:42.880 |
matrix products, what if sometimes we had convolutions and activations? Because a convolution 00:23:51.120 |
is just a subset of a matrix product, so if you think about it, a matrix product says 00:23:56.200 |
here's 10 activations and then a weight matrix going down to 10 activations. The weight matrix 00:24:03.120 |
goes from every single element of the first layer to every single element of the next layer. 00:24:08.280 |
So if this goes from 10 to 10, there are 100 weights. Whereas a convolution uses just a small subset of those weights, reused across positions. 00:24:17.400 |
So I'll let you think about this during the week because it's a really interesting insight 00:24:20.480 |
to think about that a convolution is identical to a fully connected layer, but it's just 00:24:27.360 |
a subset of the weights. And so therefore everything we learned about stacking linear 00:24:35.680 |
and nonlinear layers together applies also to convolutions. 00:24:40.960 |
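Here is one way to see that insight in code, as a toy 1D example of my own (not from the notebook): the convolution's output is identical to a fully connected layer whose weight matrix just reuses the same three weights in each row and has zeros everywhere else.

```python
# Sketch: a small 1D convolution written out as a constrained matrix product
import numpy as np

x = np.array([1., 2., 3., 4., 5.])           # 5 input activations
k = np.array([-1., 1., 0.])                   # a 1D "filter" with 3 weights

# Direct (valid) sliding of the filter along x
conv = np.array([k @ x[i:i+3] for i in range(len(x) - 2)])

# The same thing as a fully connected layer with a mostly-zero weight matrix
W = np.array([[-1.,  1.,  0., 0., 0.],
              [ 0., -1.,  1., 0., 0.],
              [ 0.,  0., -1., 1., 0.]])
dense = W @ x

print(np.allclose(conv, dense))   # True: the convolution is a subset of the matmul's weights
```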
But we also know that convolutions are particularly well-suited to identifying interesting features 00:24:47.240 |
of images. So by using convolutions, it allows us to more conveniently and quickly find powerful features. 00:25:00.440 |
So the spreadsheet will be available for download tomorrow. We're trying to get to the point 00:25:18.200 |
that we can actually get the derivatives to work in the spreadsheet and we're still slightly 00:25:21.280 |
stuck with some of the details, but we'll make something available tomorrow. Are the 00:25:28.240 |
filters the layers? Yes they are. So this is something where spending a lot of time looking 00:25:37.880 |
at simple little convolution examples is really helpful. 00:25:44.240 |
Because for a fully connected layer, it's pretty easy. You can see if I have 3 inputs, 00:25:49.040 |
then my matrix product will have to have 3 rows, otherwise they won't match. And then 00:25:54.840 |
I can create as many columns as I like. And the number of columns I create tells me how 00:25:59.500 |
many activations I create because that's what matrix products do. 00:26:04.640 |
So it's very easy to see how with what Keras calls dense layers, I can decide how big I 00:26:11.040 |
want each activation layer to be. If you think about it, you can do exactly the same thing 00:26:18.560 |
with convolutions. You can decide how many sets of 3x3 matrices you want to create at 00:26:27.320 |
random, and each one will generate a different output when applied to the image. 00:26:34.160 |
So the way that VGG works, for example, so the VGG network, which we learned about in 00:26:48.840 |
Lesson 1, contains a bunch of layers. It contains a bunch of convolutional layers, followed 00:27:01.200 |
by a flatten. And all flatten does is just a Keras thing that says don't think of the 00:27:07.280 |
layers anymore as being x by y by channel matrices, think of them as being a single 00:27:15.480 |
vector. So it just concatenates all the dimensions together, and then it contains a bunch of 00:27:22.600 |
And so each of the convolutional blocks is -- you can kind of ignore the zero padding, 00:27:29.800 |
that just adds zeros around the outside so that your convolutions end up with the same 00:27:34.520 |
number of outputs as inputs. It contains 2D convolutions, followed by max pooling at the end of the block, and we'll review max pooling shortly. 00:27:44.840 |
You can see that it starts off with 2 convolutional layers with 64 filters, and then 2 convolutional 00:27:52.360 |
layers with 128 filters, and then 3 convolutional layers with 256 filters. And so you can see 00:27:56.960 |
what it's doing is it's gradually creating more and more filters in each layer. 00:28:08.040 |
These definitions of block are specific to VGG, so I just created -- this is just me 00:28:12.960 |
refactoring the model so there wasn't lots and lots of lines of code. So I just didn't 00:28:17.080 |
want to retype lots of code, so I kind of found that these lines of code were being repeated, and pulled them out into a function. 00:28:24.660 |
So why would we have the number of filters increasing? Well, the best way to understand 00:28:31.560 |
a model is to use the summary command. So let's go back to lesson 1. 00:28:51.000 |
So let's go right back to our first thing we learned, which was the 7 lines of code 00:28:56.560 |
that you can run in order to create and train a network. I won't wait for it to actually 00:29:07.520 |
finish training, but what I do want to do now is go vgg.model.summary. So anytime you're 00:29:18.560 |
creating models, it's a really good idea to use the summary command to look inside them 00:29:25.680 |
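In code, that inspection looks something like the following (assuming the course's vgg16.py from lesson 1 is importable; this is a sketch, not a new API):

```python
# Sketch: inspecting the VGG model's layers with Keras' summary()
from vgg16 import Vgg16   # the little wrapper class from lesson 1

vgg = Vgg16()
vgg.model.summary()       # prints every layer, its output shape, and its parameter count
```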
So here we can see that the input to our model has 3 channels, red, green and blue, and they 00:29:32.840 |
are 224x224 images. After I do my first 2D convolution, I now have 64 channels of 224x224. 00:29:47.320 |
So I've replaced my 3 channels with 64, just like here I've got 8 different filters, here 00:29:58.160 |
I've got 64 different filters because that's what I asked for. 00:30:07.360 |
So again we have a second convolutional layer with 64 filters, still 224x224, and then we do max pooling. 00:30:17.800 |
So max pooling, remember from lesson 0, was this thing where we simplified things. So 00:30:25.440 |
we started out with these 28x28 images and we said let's take each 7x7 block and replace 00:30:33.620 |
that entire 7x7 block with a single pixel which contains the maximum pixel value. So 00:30:40.880 |
here is this 7x7 block which is basically all gray, so we end up with a very low number 00:30:48.060 |
here. And so instead of being 28x28, it becomes 4x4 because we are replacing every 7x7 block 00:30:58.760 |
with a single pixel. That's all max pooling does. So the reason we have max pooling is 00:31:06.080 |
it allows us to gradually simplify our image, so that the later layers effectively look at larger and larger areas of the original image. 00:31:16.440 |
So if we look at VGG, after our max pooling layer, we no longer have 224x224, we now 00:31:22.880 |
have 112x112. Later on we do another max pooling, we end up with 56x56. Later on we do another 00:31:34.520 |
max pooling and we end up with 28x28. So each time we do a max pooling we're reducing the 00:31:44.880 |
resolution of our image. As we're reducing the resolution, we need to increase the number 00:31:56.240 |
of filters otherwise we're losing information. So that's really why each time we have a max 00:32:04.000 |
pooling, we then double the number of filters because that means that every layer we're 00:32:09.200 |
keeping the same amount of information content. 00:32:12.320 |
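As a tiny illustration of the pooling operation itself (my own sketch, using a 2x2 pool like VGG rather than the 7x7 example from lesson 0):

```python
# Sketch: max pooling in plain NumPy, replacing each 2x2 block with its maximum
import numpy as np

img = np.arange(16.).reshape(4, 4)                 # a fake 4x4 "image"
pooled = img.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max over each 2x2 block

print(img)
print(pooled)          # 2x2: half the resolution in each direction
```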
So this brings us to a very, very important insight, 00:32:41.760 |
which is that a convolution is position-invariant. So in other words, this thing we created which 00:32:47.960 |
is a top edge detector, we can apply that to any part of the image and get top edges from any part of it. 00:33:06.280 |
And earlier on when we looked at that Jason Yosinski video, it showed that there was a 00:33:10.600 |
face detector which could find a face in any part of the image. So this is fundamental to 00:33:16.680 |
how a convolution works. A convolution is position-invariant: it finds a pattern regardless of where in the image it appears. 00:33:25.280 |
Now that is a very powerful idea because when we want to say find a face, we want to be 00:33:32.120 |
able to find eyes. And we want to be able to find eyes regardless of where the face is in the photo. 00:33:40.600 |
So position invariance is important, but also we need to be able to identify position to 00:33:47.000 |
some extent because if there's four eyes in the picture, or if there's an eye in the top 00:33:52.680 |
corner and the bottom corner, then something weird is going on, or if the eyes and the nose are not arranged sensibly relative to each other. 00:33:59.800 |
So how does a convolutional neural network both have this location invariant filter but 00:34:07.960 |
also handle location? And the trick is that every one of the 3x3 filters cares deeply about where things are within the area it looks at. 00:34:20.280 |
And so as we go down through the layers of our model from 224 to 112 to 56 to 28 to 14 00:34:31.080 |
to 7, at each one of these stages (think about this stage which goes from 14x14 to 7x7), 00:34:40.320 |
these filters are now looking at large parts of the image. So it's now at a point where 00:34:44.280 |
it can actually say there needs to be an eye here and an eye here and a nose here. 00:34:52.160 |
So this is one of the cool things about convolutional neural networks. They can find features everywhere 00:34:57.640 |
but they can also build things which care about how features relate to each other positionally. 00:35:29.220 |
So do we need zero padding? Zero padding is literally something that sticks zeros around 00:35:35.120 |
the outside of an image. If you think about what a convolution does, it's taking a 3x3 00:35:41.640 |
and moving it over an image. If you do that, when you get to the edge, what do you do? 00:35:48.960 |
Because at the very edge, you can't move your 3x3 any further. Which means if you only do 00:35:55.140 |
what's called a valid convolution, which means you always make sure your 3x3 filter fits 00:35:59.960 |
entirely within your image, you end up losing 2 pixels from the sides and 2 pixels from the top and bottom. 00:36:09.320 |
There's actually nothing wrong with that, but it's a little inelegant. It's kind of 00:36:15.120 |
nice to be able to half the size each time and be able to see exactly what's going on. 00:36:19.520 |
So people tend to often like doing what's called same convolutions. So if you add a black border 00:36:25.620 |
around the outside, then the result of your convolution is exactly the same size as your 00:36:30.920 |
input. That is literally the only reason to do it. 00:36:34.740 |
In fact, this is a rather inelegant way of going zero padding and then convolution. In 00:36:39.520 |
fact, there's a parameter to nearly every library's convolution function where you can 00:36:44.200 |
say "I want valid", "half" (or "same"), or "full", which basically means do you add no black pixels, 00:36:52.560 |
one black pixel, or two black pixels around the border, assuming it's a 3x3 filter. 00:36:57.240 |
So I don't quite know why this one does it this way. It's really doing two functions 00:37:01.660 |
where one would have done, but it does the job. So there's no right answer to that question. 00:37:12.840 |
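If it helps, here is a rough Keras 1.x sketch of the two equivalent spellings being discussed (the course uses the Theano "channels first" dim ordering; in Keras 2 the argument is padding='same' on Conv2D rather than border_mode):

```python
# Sketch: explicit zero padding vs. a "same" convolution (Keras 1.x style)
from keras.models import Sequential
from keras.layers import ZeroPadding2D, Convolution2D

# The style used in the VGG definition: pad with zeros, then do a (valid) convolution
m1 = Sequential([
    ZeroPadding2D((1, 1), input_shape=(3, 224, 224)),
    Convolution2D(64, 3, 3, activation='relu'),
])

# The more compact alternative: let the convolution layer pad for you
m2 = Sequential([
    Convolution2D(64, 3, 3, activation='relu',
                  border_mode='same', input_shape=(3, 224, 224)),
])

print(m1.output_shape, m2.output_shape)   # both keep the 224x224 spatial size
```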
The question was, do neural networks work for cartoons? They work fine for cartoons. 00:37:19.080 |
However, fine-tuning, which has been fundamental to everything we've learned so far, it's going 00:37:27.240 |
to be difficult to fine-tune from an ImageNet model to a cartoon. Because an ImageNet model 00:37:33.800 |
was built on all those pictures of corn we looked at and all those pictures of dogs we 00:37:37.560 |
looked at. So an ImageNet model has learned to find the kinds of features that are in 00:37:43.480 |
photos of objects out there in the world. And those are very different kinds of photos to 00:37:51.400 |
So if you want to be able to build a cartoon neural network, you'll need to either find 00:37:58.100 |
somebody else who has already trained a neural network on cartoons and fine-tune that, or 00:38:03.600 |
you're going to have to create a really big corpus of cartoons and train your own ImageNet-style model on it. 00:38:24.400 |
So why doesn't an ImageNet network translate to cartoons given that an eye is a circle? 00:38:32.200 |
Because the nuance level of a CNN is very high. It doesn't think of an eye as being just 00:38:40.160 |
a circle. It knows that an eye very specifically has particular gradients and particular shapes 00:38:45.800 |
and particular ways that the light reflects off it and so forth. So when it sees a round 00:38:51.880 |
blob there, it has no ability to abstract that out and say I guess they mean an eye. One 00:39:01.440 |
of the big shortcomings of CNNs is that they can only learn to recognize things that you've shown them examples of. 00:39:10.280 |
If you feed a neural net with a wide range of photos and drawings, maybe it would learn 00:39:18.640 |
about that kind of abstraction. To my knowledge, that's never been done. It would be a very 00:39:23.680 |
interesting question. It must be possible. I'm just not sure how many examples you would 00:39:28.880 |
need and what kind of architecture you would need. 00:39:31.960 |
In this particular example, I used correlate, not convolution. One of the things we briefly 00:39:49.720 |
mentioned in lesson 1 is that convolve and correlate are exactly the same thing, except 00:39:59.680 |
convolve is equal to correlate of an image with a filter that has been rotated by 180 00:40:06.560 |
degrees. So you can see that convolving the image with the filter rotated by 180 degrees looks exactly the 00:40:14.200 |
same, and numpy.allclose is True. So convolve and correlate are identical except that correlate 00:40:24.760 |
is more intuitive: it steps along the rows and down the columns in the same direction as the image, 00:40:33.160 |
whereas convolve flips the filter, so it effectively runs backwards along both. So I tend to prefer to think 00:40:38.280 |
about correlate because it's just more intuitive. 00:40:42.680 |
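That check can be reproduced with something like the following (my own recreation using scipy; the notebook may have done it slightly differently):

```python
# Sketch: convolve == correlate with the filter rotated by 180 degrees
import numpy as np
from scipy.signal import convolve2d, correlate2d

img  = np.random.rand(8, 8)
filt = np.array([[-1., -1., -1.],
                 [ 1.,  1.,  1.],
                 [ 0.,  0.,  0.]])

a = correlate2d(img, filt, mode='valid')
b = convolve2d(img, np.rot90(filt, 2), mode='valid')   # filter rotated 180 degrees

print(np.allclose(a, b))   # True
```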
Convolve originally came really from physics, I think, and it's also a basic math operation. 00:40:49.720 |
There are various reasons that people sometimes find it more intuitive to think about convolution 00:40:54.600 |
but in terms of everything that they can do in a neural net, it doesn't matter which one 00:41:00.360 |
you're using. In fact, many libraries let you set a parameter to true or false to decide 00:41:06.120 |
whether or not internally it uses convolution or correlation, and of course the results are equivalent either way. 00:41:12.080 |
So let's go back to our CNN review. Our network architecture is a bunch of matrix products 00:41:27.720 |
or, more generally, linear layers (remember, a convolution is just a subset of a matrix 00:41:34.040 |
product, so it's also a linear layer): a bunch of matrix products or convolutions stacked 00:41:39.120 |
with alternating nonlinear activation functions. And specifically we looked at the activation 00:41:46.360 |
function which was the rectified linear unit, which is just max(0, x). So that's an incredibly 00:41:55.160 |
simple activation function, but it's by far the most common and it works really well for almost everything. 00:42:04.760 |
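Just to spell that out, the rectified linear unit really is a one-liner (sketch):

```python
# Sketch: the rectified linear unit is literally max(0, x), applied element-wise
import numpy as np

def relu(x):
    return np.maximum(0., x)

print(relu(np.array([-2., -0.5, 0., 1.5, 3.])))   # [0.  0.  0.  1.5 3. ]
```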
I want to introduce one more activation function today, and you can read more about it in Lesson 00:42:10.960 |
2. Let's go down here where it says About Activation Functions. And you can see I've 00:42:26.680 |
got all the details of these activation functions here. I want to talk about one in particular. 00:42:34.720 |
It's called the softmax function, and softmax is defined as follows: e^(x_i) divided by the sum of e^(x_j) over all the outputs. 00:42:46.460 |
What is this all about? Softmax is used not for the middle layers of a deep learning network, 00:42:52.920 |
but for the last layer. The last layer of a neural network, if you think about what it's 00:42:57.360 |
trying to do for classification, it's trying to match to a one-hot encoded output. Remember 00:43:03.760 |
a one-hot encoded output is a vector with all zeros and just a 1 in one spot. For cats 00:43:10.320 |
and dogs we had two spots: the first one was a 1 if it was a cat, the second a 1 if it was a dog. 00:43:19.440 |
So in general, if we're doing classification, we want our output to have one high number 00:43:26.600 |
and all the other ones be low. That's going to make it easier to match the one-hot encoded 00:43:31.680 |
output. Furthermore, we would like to be able to interpret these as probabilities, which means they need to add up to 1. 00:43:39.000 |
So we've got these two requirements here. Our final layer's activations should add to 00:43:42.480 |
one, and one of them should be higher than all the rest. This particular function does 00:43:50.120 |
exactly that, and we will look at that by looking at a spreadsheet. 00:43:55.980 |
So here is an example of what an output layer might contain. Here is e to the power of each of those 00:44:05.720 |
things to the left. Here is the sum of those exponentials. And then here is the thing to 00:44:15.080 |
the left divided by the sum of them, in other words, softmax. And you can see that we start 00:44:23.680 |
with a bunch of numbers that are all of a similar kind of scale. And we end up with 00:44:28.160 |
a bunch of numbers that sum to 1, and one of them is much higher than the others. 00:44:34.280 |
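The same calculation the spreadsheet does, written as a quick NumPy sketch (the output values here are invented):

```python
# Sketch: softmax = e^x_i divided by the sum of e^x_j
import numpy as np

def softmax(z):
    e = np.exp(z)          # the exponential makes the biggest number much bigger
    return e / e.sum()     # dividing by the sum makes the outputs add to 1

out = np.array([1.2, 0.3, 2.5, 0.1, 0.8])   # a made-up final-layer output
p = softmax(out)
print(p, p.sum())          # one value dominates, and they sum to 1
```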
So in general, when we design neural networks, we want to come up with architectures, by 00:44:43.480 |
which I mean convolutions, fully connected layers, activation functions, we want to come 00:44:51.480 |
up with architectures where replicating the outcome we want is as convenient as possible. 00:45:00.380 |
So in this case, our activation function for the last layer makes it pretty convenient, 00:45:05.720 |
pretty easy to come up with something that looks a lot like a one-hot encoded output. 00:45:11.280 |
So the easier it is for our neural net to create the thing we want, the faster it's 00:45:16.260 |
going to get there, and the more likely it is to get there in a way that's quite accurate. 00:45:21.200 |
So we've learned that any big enough, deep enough neural network, because of the Universal 00:45:29.480 |
Approximation Theorem, can approximate any function at all. And we know that Stochastic 00:45:35.600 |
Gradient Descent can find the parameters for any of these, which kind of leaves you thinking 00:45:39.920 |
why do we need 7 weeks of neural network training? Any architecture ought to work. And indeed 00:45:47.520 |
that's true. If you have long enough, any architecture will work. Any architecture can 00:45:53.040 |
translate Hungarian to English, any architecture can recognize cats versus dogs, any architecture 00:45:58.640 |
can analyze Hillary Clinton's emails, as long as it's big enough. However, some of them 00:46:04.480 |
do it much faster than others. They train much faster than others. A bad architecture 00:46:11.560 |
could take so long to train that it doesn't train in the amount of years you have left 00:46:16.040 |
in your lifetime. And that's why we care about things like convolutional neural networks 00:46:21.400 |
instead of just fully connected layers all the way through. That's why we care about 00:46:26.340 |
having a softmax at the last layer rather than just a linear last layer. So we try to 00:46:30.760 |
make it as convenient as possible for our network to create the thing that we want it to create. 00:46:57.800 |
So the first one was? Softmax, just like the other one, is about how Keras internally 00:47:09.800 |
handles these matrices of data. Any more information about that one? 00:47:15.800 |
Honestly, I don't do theoretical justifications, I do intuitive justifications. There is a 00:47:35.760 |
great book for theoretical justifications and it's available for free. If you just google 00:47:40.840 |
for Deep Learning Book, or indeed go to deeplearningbook.org, it actually does have a fantastic theoretical 00:47:48.000 |
justification of why we use softmax. The short version basically is as follows. Softmax contains 00:47:56.400 |
an eta in it, our log-loss layer contains a log in it, the two nicely mesh up against 00:48:05.800 |
each other and in fact the derivative of the two together is just a - b. So that's kind 00:48:12.560 |
of the short version, but I will refer you to the Deep Learning Book for more information 00:48:18.680 |
The intuitive justification is that because we have an exponential here, it makes a big number 00:48:24.320 |
really really big, and therefore once we divide each one by the sum of all of them, we end 00:48:29.960 |
up with one number that tends to be bigger than all the rest, and that is very close 00:48:33.840 |
to the one-hot encoded output that we're trying to match. 00:48:39.960 |
Could a network learn identical filters? A network absolutely could learn identical filters, 00:48:49.840 |
but it won't. The reason it won't is because it's not optimal to. Stochastic gradient descent 00:48:56.200 |
is an optimization procedure. It will come up with, if you train it for long enough, 00:49:01.640 |
with an appropriate learning rate, the optimal set of filters. Having the same filter twice 00:49:07.080 |
is never optimal, that's redundant. So as long as you start off with random weights, then 00:49:15.880 |
it can learn to find the optimal set of filters, which will not include duplicate filters. 00:49:34.200 |
In this review, we've done our different layers, and then these different layers get optimized 00:49:43.360 |
with SGD. Last week we learned about SGD by using this extremely simple example where 00:49:52.960 |
we said let's define a function which is a line, ax + b. Let's create some data that 00:50:01.960 |
matches a line, x's and y's. Let's define a loss function, which is the sum of squared 00:50:10.480 |
errors. We now no longer know what a and b are, so let's start with some guess. Obviously 00:50:18.960 |
the loss is pretty high, and let's now try and come up with a procedure where each step 00:50:26.760 |
makes the loss a little bit better by making a and b a little bit better. The way we did 00:50:32.520 |
that was very simple. We calculated the derivative of the loss with respect to each of a and 00:50:39.840 |
b, and that means that the derivative of the loss with respect to b is, if I increase b 00:50:46.480 |
by a bit, how does the loss change? And the derivative of the loss with respect to a means 00:50:52.440 |
as I change a a bit, how does the loss change? If I know those two things, then I know that 00:50:58.240 |
I should subtract the derivative times some learning rate, which is 0.01, and as long 00:51:06.440 |
as our learning rate is low enough, we know that this is going to make our a guess a little 00:51:11.600 |
bit better. And we do the same for our b guess, it gets a little bit better. 00:51:17.280 |
And so we learned that that is the entirety of SGD. We run that again and again and again, 00:51:23.160 |
and indeed we set up something that would run it again and again and again in an animation 00:51:27.600 |
loop and we saw that indeed it does optimize our line. 00:51:33.920 |
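For reference, here is a small recreation of that lesson 2 SGD-on-a-line experiment in plain NumPy (my own sketch; the notebook's exact code and constants may differ):

```python
# Sketch: fitting y = ax + b with gradient descent, as in the lesson 2 example
import numpy as np

a_true, b_true = 3.0, 8.0                    # the line we'll generate data from
x = np.random.rand(50)
y = a_true * x + b_true

a, b = -1.0, 1.0                             # start with (bad) guesses
lr = 0.01                                    # learning rate

for step in range(10000):
    y_pred = a * x + b
    # derivatives of the mean squared error loss with respect to a and b
    dloss_da = (2 * (y_pred - y) * x).mean()
    dloss_db = (2 * (y_pred - y)).mean()
    a -= lr * dloss_da                       # nudge each parameter a little bit
    b -= lr * dloss_db                       # in the direction that lowers the loss

print(a, b)                                  # approaches 3 and 8
```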
The tricky thing for me with deep learning is jumping from this kind of easy to visualize 00:51:40.480 |
intuition. If I run this little derivative on these two things a bunch of times, it optimizes 00:51:47.960 |
this line, I can then create a set of layers with hundreds of millions of parameters that 00:52:00.480 |
in theory can match any possible function and it's going to do exactly the same thing. 00:52:07.000 |
So this is where our intuition breaks down, which is that this incredibly simple thing 00:52:11.560 |
called SGD is capable of creating these incredibly sophisticated deep learning models. We really 00:52:19.680 |
have to just trust our understanding of the basics of what's going on. We know it's 00:52:24.800 |
going to work, and we can see that it does work. But even when you've trained dozens 00:52:31.520 |
of deep learning models, it's still surprising that it does work. It's always a bit shocking 00:52:36.800 |
when you start without any ability to analyze some problem. You start with some random weights, 00:52:43.440 |
you start with a general architecture, you throw some data in with SGD, and you end up 00:52:47.240 |
with something that works. Hopefully now it makes sense, you can see why that happens. 00:52:53.760 |
But it takes doing it a few times to really intuitively understand, okay, it really does work. 00:53:01.280 |
So one question about softmax: could you use it for multi-class classification? 00:53:10.840 |
And the answer is absolutely yes, you can use softmax for multi-class classification. In fact, 00:53:15.400 |
the example I showed here was such an example. So imagine that these outputs were for cat, 00:53:23.320 |
dog, plane, fish, and building. So these might be what these 5 things represent. So this 00:53:34.040 |
is exactly showing a Softmax for a multi-class output. 00:53:41.160 |
You just have to make sure that your neural net has as many outputs as you want. And to 00:53:47.520 |
do that, you just need to make sure that the last weight layer in your neural net has as 00:53:54.740 |
many columns as you want. The number of columns in your final weight matrix tells you how many outputs you get. 00:54:01.880 |
Okay, multi-label (finding more than one thing in the same image) is not multi-class classification. So if you want to create something that is 00:54:11.200 |
going to find more than one thing, then no, softmax would not be the best way to do that. 00:54:17.400 |
I'm not sure if we're going to cover that in this set of classes. If we don't, we'll come back to it in a later one. 00:54:25.720 |
Let's go back to the question about 3x3 filters, and more generally, how do we pick an architecture? 00:54:49.200 |
So the question was why the VGG authors used 3x3 filters. The 2012 ImageNet winners used a 00:55:00.440 |
combination of 7x7 and 11x11 filters. What has happened over the last few years since 00:55:09.000 |
then is that people have realized that 3x3 filters are just better. The original insight for 00:55:17.560 |
this was actually that Matt Zeiler visualization paper I showed you. It's really worth reading 00:55:23.200 |
that paper because he really shows that by looking at lots of pictures of all the stuff 00:55:27.600 |
going on inside of CNN, it clearly works better when you have smaller filters and more layers. 00:55:34.360 |
I'm not going to go into the theoretical justification as to why, for the sake of applying CNNs, 00:55:39.760 |
all you need to know is that there's really no reason to use anything but 3x3 filters. 00:55:45.880 |
So that's a nice simple rule of thumb which always works, 3x3 filters. 00:55:53.000 |
How many layers of 3x3 filters? This is where there is not any standard agreed-upon technique. 00:56:05.100 |
Reading lots of papers, looking at lots of Kaggle winners, you will over time get a sense 00:56:09.600 |
of for a problem of this level of complexity, you need this many filters. There have been 00:56:16.640 |
various people that have tried to simplify this, but we're really still at a point where 00:56:22.960 |
the answer is try a few different architectures and see what works. The same applies to this kind of question. 00:56:31.600 |
So in general, this idea of having 3x3 filters with max pooling and doubling the number of 00:56:37.800 |
filters each time you do max pooling is a pretty good rule of thumb. How many do you 00:56:43.080 |
start with? You've kind of got to experiment. Actually, we're going to see an example of that today. 00:56:52.440 |
If you had a much larger image, would you still want 3x3 filters? 00:56:59.960 |
If you had a much larger image, what would you do? For example, on Kaggle, there's a diabetic 00:57:04.800 |
retinopathy competition that has some pictures of eyeballs that are quite a high resolution. 00:57:09.320 |
I think they're a couple of thousand by a couple of thousand. The question of how to 00:57:14.160 |
deal with large images is as yet unsolved in the literature. So if you actually look 00:57:20.520 |
at the winners of that Kaggle competition, all of the winners resampled that image down 00:57:26.000 |
to 512x512. So I find that quite depressing. It's clearly not the right approach. I'm pretty 00:57:34.360 |
sure I know what the right approach is. I'm pretty sure the right approach is to do what the human eye does. 00:57:39.200 |
The eye does something called foveation, which means that when I look directly at something, 00:57:44.200 |
the thing in the middle is very high-res and very clear, and the stuff on the outside is much lower resolution. 00:57:52.320 |
I think a lot of people are generally in agreement with the idea that if we could come up with 00:57:56.240 |
an architecture which has this concept of foveation, and then secondly, we need something, 00:58:02.640 |
and there are some good techniques to this already called attentional models. An attentional 00:58:06.600 |
model is something that says, "Okay, the thing I'm looking for is not in the middle of my 00:58:10.440 |
view, but my low-res peripheral vision thinks it might be over there. Let's focus my attention on that area. 00:58:19.800 |
And we're going to start looking at recurrent neural networks next week, and we can use 00:58:25.200 |
recurrent neural networks to build attentional models that allow us to search through a big 00:58:29.720 |
image to find areas of interest. That is a very active area of research, but as yet is 00:58:36.760 |
not really finalized. By the time this turns into a MOOC and a video, I wouldn't be surprised 00:58:44.360 |
if that has been much better solved. It's moving very quickly. 00:58:48.000 |
The Matt Zeiler paper showed larger filters because he was showing what AlexNet, the 2012 00:59:01.360 |
winner, looked like. Later on in the paper, he said based on what it looks like, here 00:59:07.080 |
are some suggestions about how to build better models. 00:59:12.640 |
So let us now finalize our review by looking at fine-tuning. So we learned how to do fine-tuning 00:59:28.840 |
using the little Vgg16 class that I built, which is one line of code, vgg.finetune. We also 00:59:37.520 |
learned how to take 1000 predictions of all the 1000 ImageNet categories and turn them 00:59:45.680 |
into two predictions, which is just a cat or a dog, by building a simple linear model 00:59:53.080 |
that took the 1000 ImageNet category predictions as input, and took the true cat 01:00:14.240 |
and dog labels as output, and we just created a linear model of that. So here is that linear 01:00:30.880 |
model. It's got 1000 inputs and 2 outputs. So we trained that linear model, it took less 01:00:41.240 |
than a second to train, and we got 97.7% accuracy. 01:00:46.780 |
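For reference, that linear model is essentially a single dense layer on top of the 1000 ImageNet predictions. A Keras 1.x sketch (the names trn_preds, trn_labels, val_preds, val_labels are placeholders for the saved prediction arrays and one-hot labels, not actual variables from the notebook):

```python
# Sketch: a linear model from 1000 ImageNet predictions to 2 (cat/dog) outputs
from keras.models import Sequential
from keras.layers import Dense

lm = Sequential([Dense(2, activation='softmax', input_shape=(1000,))])
lm.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# trn_preds/val_preds: arrays of shape (n, 1000); trn_labels/val_labels: one-hot (n, 2)
lm.fit(trn_preds, trn_labels, validation_data=(val_preds, val_labels), nb_epoch=3)
```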
So this was actually pretty effective. So why was it pretty effective to take 1000 predictions 01:00:55.360 |
of is it a cat, is it a fish, is it a bird, is it a poodle, is it a pug, is it a plane, 01:01:01.800 |
and turn it into a cat or is it a dog. The reason that worked so well is because the 01:01:07.600 |
original architecture, the ImageNet architecture, was already trained to do something very similar 01:01:15.960 |
to what we wanted our model to do. We wanted our model to separate cats from dogs, and 01:01:21.080 |
the ImageNet model already separated lots of different cats from different dogs from 01:01:26.040 |
lots of other things as well. So the thing we were trying to do was really just a subset of what the ImageNet model already did. 01:01:33.920 |
So that was why starting with 1000 predictions and building the simple linear model worked 01:01:39.440 |
so well. This week, you're going to be looking at the State Farm competition. And in the 01:01:45.440 |
State Farm competition, you're going to be looking at pictures like this one, and this 01:01:54.840 |
one, and this one. And your job will not be to decide whether or not it's a person or 01:02:03.560 |
a dog or a cat. Your job will be to decide is this person driving in a distracted way 01:02:08.880 |
or not. That is not something that the original ImageNet categories included. And therefore 01:02:16.360 |
this same technique is not going to work this week. 01:02:21.280 |
So what do you do if you need to go further? What do you do if you need to predict something 01:02:27.520 |
which is very different to what the original model did? The answer is to throw away some 01:02:34.280 |
of the later layers in the model and retrain them from scratch. And that's called fine-tuning. 01:02:42.320 |
And so that is pretty simple to do. So if we just want to fine-tune the last layer, 01:02:50.000 |
we can just go model.pop, that removes the last layer. We can then say make all of the 01:02:56.040 |
other layers non-trainable, so that means it won't update those weights, and then add 01:03:01.760 |
a new fully connected layer, dense layer to the end with just our dog and cat, our two 01:03:07.480 |
activations, and then go ahead and fit that model. 01:03:14.320 |
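A rough Keras 1.x sketch of that simplest kind of fine-tuning (assuming model is the pretrained VGG Sequential model, e.g. vgg.model; this is a sketch of the idea, not the exact notebook code):

```python
# Sketch: remove the 1000-way output layer, freeze the rest, add a 2-way output
from keras.layers import Dense
from keras.optimizers import RMSprop

model.pop()                                   # drop the old 1000-category output layer
for layer in model.layers:
    layer.trainable = False                   # freeze everything that remains
model.add(Dense(2, activation='softmax'))     # new cat/dog output layer

model.compile(optimizer=RMSprop(lr=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])
# then fit on batches of the dogs-vs-cats images, e.g. with model.fit_generator(...)
```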
So that is the simplest kind of fine-tuning. Remove the last layer, but previously that 01:03:19.680 |
last layer was going to try and predict 1000 possible categories, and replace it with a 01:03:24.360 |
new last layer which we train. In this case, I've only run it for two epochs, so I'm 01:03:31.120 |
not getting a great result. But if we ran it for a few more, we would get a bit better 01:03:34.260 |
than the 97.7 we had last time. When we look at State Farm, it's going to be critical to use this kind of fine-tuning. 01:03:43.480 |
So how many layers would you remove? Because you don't just have to remove one. In fact, 01:03:48.320 |
if you go back through your lesson 2 notebook, you'll see after this, I've got a section 01:03:55.560 |
called Retraining More Layers. In it, we see that we can take any model and we can say 01:04:12.120 |
okay, let's grab all the layers up to the nth layer. So in this case, we take all the 01:04:18.840 |
layers after the first fully connected layer and set them all to trainable. 01:04:27.280 |
And then what would happen if we tried running that model? 01:04:31.120 |
So with Keras, we can tell Keras which layers we want to freeze and leave them at their 01:04:38.280 |
ImageNet-decided weights, and which layers do we want to retrain based on the data we're now looking at. 01:04:46.840 |
And so in general, the more different your problem is to the original ImageNet 1000 categories, 01:04:53.320 |
the more layers you're going to have to retrain. 01:04:56.560 |
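In code, retraining more layers looks roughly like this (a Keras 1.x sketch along the lines of the lesson 2 "Retraining More Layers" section; the choice of cut-off layer is illustrative):

```python
# Sketch: freeze the convolutional layers, retrain everything from the first dense layer on
from keras.layers import Dense

layers = model.layers
first_dense_idx = [i for i, l in enumerate(layers) if isinstance(l, Dense)][0]

for layer in layers[:first_dense_idx]:
    layer.trainable = False       # keep these at their ImageNet weights
for layer in layers[first_dense_idx:]:
    layer.trainable = True        # retrain all the dense layers

model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
```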
So how do you decide how far to go back in the layers? Two ways. Way number 1, intuition. 01:05:19.160 |
So have a look at something like those Matt Zeiler visualizations to get a sense of what semantic 01:05:24.840 |
level each of those layers is operating at. And go back to the point where you feel like 01:05:31.720 |
that level of meaning is going to be relevant to your model. 01:05:37.920 |
Method number 2, experiment. It doesn't take that long to train another model starting 01:05:43.880 |
at a different point. I generally do a bit of both. When I know dogs and cats are subsets 01:05:52.920 |
of the ImageNet categories, I'm not going to bother generally training more than one replacement layer. 01:06:01.080 |
For State Farm, I really had no idea. I was pretty sure I wouldn't have to retrain any 01:06:07.280 |
of the convolutional layers because the convolutional layers are all about spatial relationships. 01:06:14.160 |
And therefore a convolutional layer is all about recognizing how things in space relate 01:06:17.960 |
to each other. I was pretty confident that figuring out whether somebody is looking at 01:06:23.440 |
a mobile phone or playing with their radio is not going to use different spatial features. 01:06:29.740 |
So for State Farm, I've really only looked at retraining the dense layers. 01:06:35.280 |
And in VGG, there are actually only three dense layers: 01:06:44.900 |
the two intermediate layers and the output layer, so I just retrained all three. 01:06:53.840 |
Generally speaking, the answer to this is try a few things and see what works the best. 01:07:04.720 |
When we retrain the layers, we do not set the weights randomly. We start the weights 01:07:09.560 |
at their optimal ImageNet levels. That means that if you retrain more layers than you really 01:07:18.880 |
need to, it's not a big problem because the weights are already at the right point. 01:07:24.640 |
If you randomized the weights of the layers that you're retraining, that would actually 01:07:31.160 |
kill the earlier layers as well if you made them trainable. There's no point really setting 01:07:38.840 |
them to random most of the time. We'll be learning a bit more about that after the break. 01:07:45.360 |
So far, we have not reset the weights. When we say layer.trainable = true, we're just 01:07:52.880 |
telling Keras that when you say fit, I want you to actually use SGD to update the weights in that layer. 01:08:10.480 |
When we come back, we're going to be talking about how to go beyond these basic five pieces 01:08:17.960 |
to create models which are more accurate. Specifically, we're going to look at avoiding underfitting and overfitting. 01:08:28.480 |
Next week, we're going to be doing half a class on review of convolutional neural networks 01:08:35.080 |
and half a class of an introduction to recurrent neural networks which we'll be using for language. 01:08:40.680 |
So hopefully by the end of this class, you'll be feeling ready to really dig deep into CNNs 01:08:48.760 |
during the week. This is really the right time this week to make sure that you're asking 01:08:53.120 |
questions you have about CNNs because next week we'll be wrapping up this topic. 01:09:07.000 |
So we have a lot to cover in our next 55 minutes. I think this approach of doing the new material 01:09:14.440 |
quickly, so that you can review it in the lesson notebook and on the video by experimenting 01:09:19.920 |
during the week, and then review it the next week, is fine. I think that's a good approach. 01:09:24.760 |
But I just want to make you aware that the new material of the next 55 minutes will move pretty quickly. 01:09:32.200 |
So don't worry too much if not everything sinks in straight away. If you have any questions, 01:09:37.640 |
of course, please do ask. But also, recognize that it's really going to sink in as you study 01:09:45.400 |
it and play with it during the week, and then next week we're going to review all of this. 01:09:49.520 |
So if it's still not making sense, and of course you've asked your questions on the forum, 01:09:53.320 |
it's still not making sense, we'll be reviewing it next week. 01:09:56.640 |
So if you don't retrain a layer, does that mean the layer remembers what gets saved? 01:10:21.480 |
So yes, if you don't retrain a layer, then when you save the weights, it's going to contain 01:10:25.720 |
the weights that it originally had. That's a really important question. 01:10:31.720 |
Why would we want to start out by overfitting? We're going to talk about that next. 01:10:38.320 |
The last conv layer in VGG has a 7x7 output. There are 49 boxes and each one has 512 different 01:10:45.440 |
things. That's kind of right, but it's not that it recognizes 512 different things. When 01:10:51.160 |
you have a convolution on a convolution on a convolution on a convolution on a convolution, 01:10:55.280 |
you have a very rich function with hundreds of thousands of parameters. So it's not that 01:11:02.120 |
it's recognizing 512 things, it's that there are 512 rich complex functions. And so those 01:11:11.200 |
rich complex functions can recognize rich complex concepts. 01:11:18.100 |
So for example, we saw in the video that even in layer 6 there's a face detector which can 01:11:25.560 |
recognize cat faces as well as human faces. So the later on we get in these neural networks, 01:11:33.520 |
the harder it is to even say what it is that's being found because they get more and more 01:11:39.080 |
sophisticated and complex. So what those 512 things do in the last layer of VGG, I'm not 01:11:49.080 |
sure that anybody's really got to a point that they could tell you that. 01:11:53.480 |
I'm going to move on. The next section is all about making our model better. So at this 01:12:22.120 |
point, we have a model with an accuracy of 97.7%. So how do we make it better? 01:12:31.760 |
Now because we have started with an existing model, a VGG model, there are two reasons 01:12:46.160 |
that you could be less good than you want to be. Either you're underfitting or you're 01:12:51.960 |
overfitting. Underfitting means that, for example, you're using a linear model to try 01:12:59.060 |
to do image recognition. You're using a model that is not complex and powerful enough for 01:13:05.640 |
the thing you're doing or it doesn't have enough parameters for the thing you're doing. 01:13:10.240 |
That's what underfitting is. Overfitting means that you're using a model with too many parameters 01:13:17.760 |
that you've trained for too long without using any of the techniques or without correctly 01:13:22.440 |
using the techniques you're about to learn about, such that you've ended up learning what 01:13:26.800 |
your specific training pictures look like rather than what the general patterns in them 01:13:33.840 |
look like. You will recognize overfitting if your training set has a much higher accuracy 01:13:43.560 |
than your test set or your validation set. So that means you've learned how to recognize 01:13:49.520 |
the contents of your training set too well. And so then when you look at your validation 01:13:55.200 |
set you get a less good result. So that's overfitting. 01:14:00.120 |
I'm not going to go into detail on this because any of you who have done any machine learning 01:14:03.560 |
have seen this before, so any of you who haven't, please look up overfitting on the internet, 01:14:10.920 |
learn about it, ask questions about it. It is perhaps the most important single concept in machine learning. 01:14:17.960 |
So it's not that we're not covering it because it's not interesting, it's just that we're 01:14:20.800 |
not covering it because I know a lot of you are already familiar with it. Underfitting 01:14:26.840 |
we can see in the same way, but it's the opposite: if our training accuracy is much lower than our validation accuracy, we're underfitting. 01:14:41.240 |
So I'm going to look at this now, because in fact you might have noticed that in all of 01:14:45.240 |
our models so far, our training accuracy has been lower than our validation accuracy, which is surprising. 01:14:56.360 |
So how is this possible? And the answer to how this is possible is because the VGG network 01:15:03.240 |
includes something called dropout, and specifically dropout with a p of 0.5. What does dropout 01:15:10.880 |
mean with a p of 0.5? It means that at this layer, which happens at the end of every fully 01:15:16.700 |
connected block, it deletes 50% of the activations at random. It sets them to 01:15:27.640 |
0. That's what a dropout layer does: it sets half of the activations to 0 at random. 01:15:36.680 |
Why would it do that? Because when you randomly throw away bits of the network, it means that 01:15:43.120 |
the network can't learn to overfit. It can't learn to build a network that just learns 01:15:48.880 |
about your images, because as soon as it does, you throw away half of it and suddenly it's 01:15:53.640 |
not working anymore. So dropout is a fairly recent development, I think it's about three 01:15:58.960 |
years old, and it's perhaps the most important development of the last few years. Because 01:16:06.040 |
it's the thing that now means we can train big, complex models for long periods of time without overfitting. 01:16:15.920 |
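For reference, dropout is just a standard Keras layer. A minimal sketch of what it looks like inside a Sequential model definition (the 4096 is simply VGG's dense-layer size, used for illustration):

```python
from keras.layers.core import Dense, Dropout

model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))   # at training time, sets a random half of these activations to zero
```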
But in this case, it seems that we are using too much dropout. So the VGG network, which 01:16:24.200 |
used a dropout of 0.5, they decided they needed that much in order to avoid overfitting ImageNet. 01:16:30.840 |
But it seems for our cats and dogs, it's underfitting. So what do we do? The answer is, let's try 01:16:39.360 |
removing dropout. So how do we remove dropout? And this is where it gets fun. 01:16:46.960 |
We can start with our VGG fine-tuned model. And I've actually created a little function 01:16:52.400 |
called VGG fine-tuned, which creates a VGG fine-tuned model with two outputs. It looks 01:17:02.060 |
exactly like you would expect it to look. It creates a VGG model, it fine-tunes it, it 01:17:14.320 |
returns it. What does fine-tune do? It does exactly what we've learnt. It pops off the 01:17:24.480 |
last layer, sets all the rest of the layers to non-trainable, and adds a new dense layer. 01:17:30.600 |
So I just create a little thing that does all that. Every time I start writing the same 01:17:35.680 |
code more than once, I stick it into a function and use it again in the future. It's good practice. 01:17:42.360 |
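A sketch of what such a helper might look like. Vgg16 is the course's wrapper class around the VGG model; the function name, optimizer settings, and weight-file path here are assumptions for illustration, not necessarily what the lesson notebook uses.

```python
from keras.layers.core import Dense
from keras.optimizers import Adam

def vgg_ft(num_classes):
    vgg = Vgg16()                      # the course's wrapper around the VGG16 model
    model = vgg.model
    model.pop()                        # remove the old 1000-way ImageNet softmax
    for layer in model.layers:
        layer.trainable = False        # keep the pretrained weights fixed
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(optimizer=Adam(lr=0.001),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

model = vgg_ft(2)                                  # two outputs: cat and dog
model.load_weights(model_path + 'finetune1.h5')    # reload the weights saved earlier (path is illustrative)
```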
I then load the weights that I just saved in my last model, so I don't have to retrain 01:17:48.520 |
it. So saving and loading weights is a really helpful way of avoiding having to refit things. 01:17:54.180 |
So already I now have a model that fits cats and dogs with 97.7% accuracy and underfits. 01:18:05.460 |
We can grab all of the layers of the model and we can then enumerate through them and 01:18:11.920 |
find the last one which is a convolution. So let's remind ourselves, model.summary. So 01:18:25.360 |
that's going to enumerate through all the layers and find the last one that is a convolution. 01:18:33.480 |
So at this point, we now have the index of the last convolutional layer. It turns out 01:18:41.600 |
to be 30. So we can now grab that last convolutional layer. 01:18:46.960 |
And so what we want to try doing is removing dropout from all the rest of the layers. So 01:18:53.160 |
after the last convolutional layer, 01:19:05.400 |
everything that follows is a dense layer. 01:19:10.560 |
So this is a really important concept in the Keras library of playing around with layers. 01:19:25.560 |
And so spend some time looking at this code and really look at the inputs and the outputs 01:19:30.680 |
and get a sense of it. So you can see here: here are all the layers up to and including the last convolutional 01:19:38.240 |
layer, and here are all of the layers after the last convolutional layer, which are the fully 01:19:44.120 |
connected layers. I can create a whole new model that contains just the convolutional layers. 01:19:53.360 |
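A sketch of that splitting step (Keras 1-style). The index 30 mentioned a little further on is what this enumeration turns out to produce for this VGG model.

```python
from keras.models import Sequential
from keras.layers.convolutional import Convolution2D

layers = model.layers
# index of the last convolutional layer (it turns out to be 30 for this VGG model)
last_conv_idx = [i for i, layer in enumerate(layers) if type(layer) is Convolution2D][-1]

conv_layers = layers[:last_conv_idx + 1]   # everything up to and including the last conv layer
fc_layers   = layers[last_conv_idx + 1:]   # the pooling / flatten / dense / dropout layers after it

# a model made of just the convolutional part; we only ever call predict() on it, never fit()
conv_model = Sequential(conv_layers)
```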
Why would I do that? Because if I'm going to remove dropout, then clearly I'm going 01:20:00.180 |
to want to fine-tune all of the layers that involve dropout. That is, all of the dense 01:20:05.560 |
layers. I don't need to fine-tune any convolutional layers because none of the convolutional layers 01:20:11.180 |
have dropout. I'm going to save myself some time. I'm going to pre-calculate the output of the convolutional layers. 01:20:26.980 |
So you see this model I've built here, this model that contains all the convolutional 01:20:32.800 |
layers. If I pre-calculate the output of that, then that's the input to the dense layers 01:20:43.040 |
So you can see what I do here is I say conv_model.predict with my validation batches, and conv_model.predict 01:20:52.720 |
with my training batches, and that now gives me the output of the convolutional layers for my training set 01:20:59.800 |
and the output of them for my validation set. And because that's something I don't want to have 01:21:08.640 |
to recalculate every time, I save it to disk. So here I'm just going to go load_array, and that's going to load from the disk the output 01:21:17.640 |
of that. And so I'm going to say train_features.shape, and this is always the first thing that you 01:21:22.520 |
want to do when you've built something, is look at its shape. And indeed, it's what we 01:21:27.400 |
would expect. It is 23,000 images, each one 512 x 14 x 14, because I didn't include the final max pooling layer. 01:21:39.760 |
And so indeed, if we go model.summary, we should find that the last convolutional layer, here 01:21:50.580 |
it is, 512 filters, 14x14 dimension. So we have basically built a model that is just 01:22:00.600 |
a subset of VGG containing all of these earlier layers. We've run it through our test set 01:22:07.480 |
and our validation set, and we've got the outputs. So that's the stuff that we want 01:22:12.720 |
to fix, and so we don't want to recalculate that every time. 01:22:17.920 |
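A sketch of that pre-computation and caching. predict_generator here uses the Keras 1 signature; save_array and load_array are the course's bcolz-based helpers from utils.py, and the file names and model_path are illustrative. It assumes batches and val_batches were created with shuffle=False so the cached features stay lined up with their labels.

```python
val_features = conv_model.predict_generator(val_batches, val_batches.nb_sample)
trn_features = conv_model.predict_generator(batches, batches.nb_sample)

save_array(model_path + 'train_convlayer_features.bc', trn_features)
save_array(model_path + 'valid_convlayer_features.bc', val_features)

# later, reload instead of recomputing:
trn_features = load_array(model_path + 'train_convlayer_features.bc')
print(trn_features.shape)    # e.g. (23000, 512, 14, 14)
```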
So now we create a new model which is exactly the same as the dense part of VGG, but we 01:22:24.840 |
replace the dropout P with 0. So here's something pretty interesting, and I'm going to let you 01:22:32.160 |
guys think about this during the week. How do you take the previous weights from VGG 01:22:41.520 |
and put them into this model where dropout is 0? So if you think about it, before we 01:22:46.480 |
had dropout of 0.5, so half the activations were being deleted at random. So since half 01:22:52.440 |
the activations are being deleted at random, now that I've removed dropout, I effectively 01:22:57.400 |
have twice as many weights being active. Since I have twice as many weights being active, 01:23:02.840 |
I need to take my imageNet weights and divide them by 2. So by taking my imageNet weights 01:23:10.360 |
and copying them across, so I take my previous weights and copy them across to my new model, 01:23:15.360 |
each time divide them by 2, that means that this new model is going to be exactly as accurate 01:23:20.960 |
as my old model before I start training, but it has no dropout. 01:23:26.520 |
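A sketch of that weight-copying trick. get_fc_model is a hypothetical builder that recreates VGG's dense block (flatten, two 4096 dense layers, output layer) with a given dropout p; the course notebook has something along these lines, and the zip assumes the two models have the same layer structure.

```python
def proc_wgts(layer):
    # halve every weight array, to compensate for the activations dropout no longer removes
    return [w / 2 for w in layer.get_weights()]

fc_model = get_fc_model(p=0.0)                      # same dense block, but with dropout p set to 0
for new_layer, old_layer in zip(fc_model.layers, fc_layers):
    new_layer.set_weights(proc_wgts(old_layer))     # layers with no weights just copy an empty list
```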
Is it wasteful to have in the cats and dogs model filters that are being learnt to find 01:23:45.720 |
things like bookshelves? Potentially it is, but it's okay to be wasteful. The only place 01:23:54.040 |
that it's a problem is if we are overfitting. And if we're overfitting, then we can easily fix that. 01:24:03.440 |
So let's try this. We now have a model which takes the output of the convolutional layers 01:24:10.000 |
as input, gives us our cats vs. dogs as output, and has no dropout. So now we can just go 01:24:18.200 |
ahead and fit it. So notice that the input to this is my 512 x 14 x 14 inputs. My outputs 01:24:28.720 |
are my cats and dogs as usual, and train it for a few epochs. 01:24:33.760 |
And here's something really interesting. Dense layers take very little time to compute. A 01:24:41.800 |
convolutional layer takes a long time to compute. Think about it, you're computing 512 x 3 x 01:24:48.160 |
3 x 512 filters. For each of 14 x 14 spots, that is a lot of computation. 01:24:58.760 |
So in a deep learning network, your convolutional layers is where all of your computation is 01:25:04.120 |
being taken up. So look, when I train just my dense layers, it's only taking 17 seconds. 01:25:09.440 |
Super fast. On the other hand, the dense layers is where all of your memory is taken up. Because 01:25:15.160 |
between this 4096 layer and this 4096 layer, there are 4000 x 4000 = 16 million weights. 01:25:24.160 |
And between the previous layer, which was 512 x 7 x 7 after max pooling, that's 25,088. 01:25:31.880 |
There are 25,088 x 4096 weights. So this is a really important rule of thumb: your dense 01:25:39.600 |
layers are where your memory is taken up, and your convolutional layers are where your computation goes. 01:25:46.200 |
So it took me a minute or so to run 8 epochs. That's pretty fast. And holy shit, look at 01:25:52.560 |
that! 98.5%. So you can see now, I am overfitting. But even though I'm overfitting, I am doing 01:26:06.680 |
So overfitting is only bad if you're doing it so much that your accuracy is bad. So in 01:26:14.920 |
this case, it looks like actually this amount of overfitting is pretty good. So for cats 01:26:20.480 |
and dogs, this is about as good as I've gotten. And in fact, if I'd stopped it a little earlier, 01:26:26.960 |
you can see it was really good. In fact, the winner was 98.8, and here I've got 98.75. 01:26:34.680 |
And there are some tricks I'll show you later that reliably give you a bit of extra accuracy. 01:26:39.520 |
So this would definitely have won cats and dogs if we had used this model. 01:26:43.280 |
Question - Can you perform dropout on a convolutional layer? 01:26:48.040 |
You can absolutely perform dropout on a convolutional layer. And indeed, nowadays people normally 01:26:52.920 |
do. I don't quite remember the VGG days. I guess that was 2 years ago. Maybe people weren't doing it back then. 01:26:59.440 |
Nowadays, the general approach would be you would have dropout of 0.1 before your first 01:27:04.760 |
layer, dropout of 0.2 before this one, 0.3, 0.4, and then finally dropout of 0.5 before 01:27:10.040 |
your fully connected layers. It's kind of the standard. 01:27:13.800 |
If you then find that you're underfitting or overfitting, you can modify all of those 01:27:18.080 |
probabilities by the same amount. If you dropout in an early layer, you're losing that information 01:27:27.840 |
for all of the future layers, so you don't want to drop out too much in the early layers. 01:27:32.520 |
You can feel better dropping out more in the later layers. 01:27:37.240 |
This is how you manually tune your overfitting or underfitting. Another way to do it would 01:27:57.400 |
be to modify the architecture to have fewer or more filters. But that's actually pretty 01:28:03.480 |
difficult to do. Question: so is the point that we didn't need dropout anyway? Perhaps it was. 01:28:10.760 |
But VGG comes with dropout. So when you're fine-tuning, you start with what you start with. 01:28:19.920 |
We are overfitting here, so my hypothesis is that we maybe should try a little less 01:28:24.400 |
dropout. But before we do, I'm going to show you some better tricks. 01:28:30.280 |
The first trick I'm going to show you is a trick that lets you avoid overfitting without 01:28:35.240 |
deleting information. Dropout deletes information, so we don't want to do it unless we have to. 01:28:40.600 |
So instead of dropout, here is a list. You guys should refer to this every time you're 01:28:49.760 |
overfitting. 5 steps. Step 1, add more data. This is a Kaggle competition, so we can't do that. 01:28:57.720 |
Step 2, use data augmentation, which we're about to learn. 01:29:02.040 |
Step 3, use more generalizable architectures. We're going to learn that after this. 01:29:07.480 |
Step 4, add regularization. That generally means dropout. There's another type of regularization 01:29:13.560 |
which is where you basically add up all of your weights, the value of all of your weights, 01:29:22.360 |
and then multiply it by some small number, and you add that to the loss function. Basically 01:29:26.960 |
you say having higher weights is bad. That's called either L2 regularization, if you take 01:29:34.320 |
the square of your weights and add them up, or L1 regularization if you take the absolute 01:29:40.600 |
values. Keras supports that as well; it's also popular, and there's a small sketch of it after this list. I don't think anybody has a great sense of 01:29:49.760 |
when do you use L1 and L2 regularization and when do you use dropout. I use dropout pretty 01:29:56.040 |
much all the time, and I don't particularly see why you would need both, but I just wanted 01:30:00.460 |
to let you know that that other type of regularization exists. 01:30:05.520 |
And then lastly, if you really have to reduce architecture complexity, so remove some filters. 01:30:10.560 |
But that's pretty hard to do if you're fine-tuning, because how do you know which filters to remove? 01:30:16.520 |
So really, it's the first four. Now that we have dropout, the first four are what we do in practice. 01:30:23.440 |
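As promised above, a minimal sketch of weight regularization in Keras 1 (as used in the course); in Keras 2 the argument is kernel_regularizer rather than W_regularizer, and the coefficients here are illustrative.

```python
from keras.layers.core import Dense
from keras.regularizers import l2, l1

model.add(Dense(4096, activation='relu', W_regularizer=l2(1e-4)))   # sum of squared weights added to the loss
model.add(Dense(4096, activation='relu', W_regularizer=l1(1e-5)))   # sum of absolute weights added to the loss
```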
Like in Random Forests, where we randomly select subsets of variables at each point, 01:30:41.160 |
that's kind of what dropout is doing. Dropout is randomly throwing away half the activations, 01:30:46.120 |
so dropout and random forests both effectively create large ensembles. It's actually a fantastic analogy. 01:30:59.560 |
So just like when we went from decision trees to random forests, it was this huge step which 01:31:05.640 |
was basically create lots of decision trees with some random differences. Dropout is effectively 01:31:10.440 |
creating lots of, automatically, lots of neural networks with different subsets of features 01:31:19.840 |
Data augmentation is very simple. Data augmentation is something which takes a cat and turns it 01:31:27.160 |
into lots of cats. That's it. Actually, it does it for dogs as well. You can rotate, 01:31:38.160 |
you can flip, you can move up and down, left and right, zoom in and out. And in Keras, 01:31:44.920 |
you do it with the ImageDataGenerator: rather than the empty ImageDataGenerator() we've always used before, 01:31:52.360 |
now we say all these other things. Flip it horizontally at random, zoom in a bit at random, 01:31:58.600 |
shear at random, rotate at random, move it left and right at random, and move it up and 01:32:02.480 |
down at random. So once you've done that, then when you create your batches, rather than 01:32:12.640 |
doing it the way we did it before, you simply add that to your batches. So we said, Ok, this 01:32:22.160 |
is our data generator, and so when we create our batches, use that data generator, the augmenting one. 01:32:30.780 |
Very important to notice, the validation set does not include that. Because the validation 01:32:36.120 |
set is the validation set. That's the thing we want to check against, so we shouldn't 01:32:39.400 |
be fiddling with that at all. The validation set has no data augmentation and no shuffling. 01:32:44.400 |
It's constant and fixed. The training set, on the other hand, we want to move it around 01:32:49.460 |
as much as we can. So shuffle its order and add all these different types of augmentation. 01:32:55.680 |
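A sketch of augmented training batches versus plain validation batches (Keras 1-style flow_from_directory). The augmentation ranges and the path variable are illustrative, not a transcription of the notebook.

```python
from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator(rotation_range=10, width_shift_range=0.1, height_shift_range=0.1,
                         shear_range=0.05, zoom_range=0.1, horizontal_flip=True)
batches = gen.flow_from_directory(path + 'train', target_size=(224, 224),
                                  class_mode='categorical', batch_size=64, shuffle=True)

# the validation set gets no augmentation and no shuffling
val_batches = ImageDataGenerator().flow_from_directory(path + 'valid', target_size=(224, 224),
                                  class_mode='categorical', batch_size=64, shuffle=False)
```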
How much augmentation to use? This is one of the things that Rachel and I would love 01:32:59.880 |
to automate. For now, there are two methods. Method 1: use your intuition. The best way to use your intuition 01:33:06.540 |
is to take one of your images, add some augmentation, and check whether they still look like cats. 01:33:15.020 |
So if it's so warped that you're like, "Ok, nobody takes a photo of a cat like that," 01:33:19.860 |
you've done it wrong. So this is kind of like a small amount of data augmentation. 01:33:25.560 |
Method 2, experiment. Try a range of different augmentations and see which one gives you 01:33:29.600 |
the best results. If we add some augmentation, everything else is exactly the same, except 01:33:38.520 |
we can't pre-compute anything anymore. So earlier on, we pre-computed the output of 01:33:43.520 |
the last convolutional layer. We can't do that now, because every time this cat approaches 01:33:49.040 |
our neural network, it's a little bit different. It's rotated a bit, it's flipped, it's moved 01:33:53.920 |
around or it's zoomed in and out. So unfortunately, when we use data augmentation, we can't pre-compute 01:34:00.440 |
anything and so things take longer. Everything else is the same though. So we grab our fully 01:34:05.800 |
connected model, we add it to the end of our convolutional model, and this is the one with 01:34:10.760 |
our dropout, compile it, fit it, and now rather than taking 9 seconds per epoch, it takes 01:34:18.600 |
273 seconds per epoch, because it has to calculate through all the convolutional layers for every freshly augmented image. 01:34:29.680 |
So in terms of results here, we have not managed to get back up to that 98.7 accuracy. I probably 01:34:40.280 |
would have if I'd run a few more epochs, but if I keep running them, again, I start overfitting. So it's 01:34:51.160 |
a little hard to tell because my validation accuracy is moving around quite a lot because 01:34:55.920 |
my validation sets a little bit on the small side. It's a little bit hard to tell whether 01:35:00.520 |
this data augmentation is helping or hindering. I suspect what we're finding here is that 01:35:07.840 |
maybe we're doing too much data augmentation, so if I went back and reduced my different 01:35:17.440 |
ranges by say half, I might get a better result than this. But really, this is something to 01:35:24.720 |
experiment with and I had better things to do than experiment with this. But you get the idea. 01:35:30.640 |
Data augmentation is something you should always do. There's never a reason not to use 01:35:37.340 |
data augmentation. The question is just what kind and how much. So for example, what kind? 01:35:43.120 |
Should you flip x, y? So clearly, for dogs and cats, no. You pretty much never see a 01:35:50.180 |
picture of an upside down dog. So would you do vertical flipping in this particular problem? 01:35:57.520 |
No you wouldn't. Would you do rotations? Yeah, you very often see cats and dogs that are 01:36:03.040 |
kind of on their hind legs or the photos taken a little bit uneven or whatever. You certainly 01:36:06.820 |
would have zooming because sometimes you're close to the dog, sometimes further away. 01:36:10.260 |
So use your intuition to think about what kind of augmentation. 01:36:15.480 |
Question: what about data augmentation for color? That's an excellent point. So something 01:36:22.680 |
I didn't add to this, but I probably should have, is that there is a channel augmentation 01:36:30.000 |
parameter for the data generator in Keras. And that will slightly change the colors. 01:36:35.380 |
That's a great idea for natural images like these because you have different white balance, 01:36:40.420 |
you have different lighting and so forth. And indeed I think that would be a great idea. 01:36:45.440 |
So I hope during the week people will take this notebook and somebody will tell me what 01:36:51.460 |
is the best result they've got. And I bet that the winning data augmentation will include some channel shifting. 01:37:01.500 |
Question: if you changed all the images to monochrome images, would it still work? 01:37:16.380 |
So, changing them to black and white. No, that wouldn't work as well, because the Kaggle competition test 01:37:19.980 |
set is in color. So if you're throwing away color, you're throwing away information. And 01:37:25.260 |
figuring out whether something is -- but the Kaggle competition is saying is this a cat 01:37:32.260 |
or is this a dog? And part of seeing whether something is a cat or a dog is looking at 01:37:35.860 |
what color it is. So if you're throwing away the color, you're making that harder. So yeah, 01:37:39.780 |
you could run it on the test set and get answers, but they're going to be less accurate because you've thrown away information. 01:37:46.380 |
Question: how is it working, since you've removed the flatten 01:37:57.100 |
layer between the conv block and the dense layers? 01:37:58.100 |
Okay, so what happened to the flattened layer? And the answer is that it was there. Where 01:38:02.180 |
was it? Oh gosh. I forgot to add it back to this one. So I actually changed my mind about 01:38:11.900 |
whether to include the flattened layer and where to put it and where to put max pooling. 01:38:15.620 |
It will come back later. So this is a slightly old version. Thank you for picking it up. 01:38:19.780 |
Could you do a form of dropout on the raw images by randomly blanking out pieces of 01:38:27.260 |
Yeah, so can you do dropout on the raw images? The simple answer is yes, you could. There's 01:38:32.660 |
no reason I can't put a dropout layer right here. And that's going to drop out raw pixels. 01:38:38.340 |
It turns out that's not a good idea. Throwing away input information is very different to 01:38:44.540 |
throwing away modeled information. Throwing away modeled information is letting you effectively 01:38:49.780 |
avoid overfitting the model. But throwing away the input data loses information you can never get back, so you don't want to do dropout on the raw pixels. 01:39:08.140 |
To clarify, the augmentation is at random. I just showed you 8 examples of the augmentation. 01:39:14.300 |
So what the augmentation does is it says at random, rotate by up to 20 degrees, move by 01:39:20.740 |
up to 10% in each direction, shear by up to 5%, zoom by up to 10%, and flip at random 01:39:26.140 |
half the time. So then I just said, OK, here are 8 cats. But what happens is every single 01:39:31.660 |
time an image goes into the batch, it gets randomized. So effectively, it's an infinite stream of slightly different images. 01:39:40.940 |
That doesn't have anything to do with data augmentation, so maybe we'll discuss that later. 01:39:56.900 |
The final concept to learn about today is batch normalization. Batch normalization, 01:40:03.140 |
like data augmentation, is something you should always do. Why didn't VGG do it? Because it 01:40:09.020 |
didn't exist then. Batch norm is about a year old, maybe 18 months. Here's the basic idea. 01:40:16.620 |
Anybody who's done any machine learning probably knows that one of the first things 01:40:23.900 |
you want to do is take your input data, subtract its mean, and divide by its standard deviation. 01:40:31.380 |
Why is that? Imagine that we had inputs of 40, minus 30, and 1. You can see that the outputs are 01:40:44.980 |
all over the place. The intermediate values, some are really big, some are really small. 01:40:51.120 |
So if we change a weight which impacted x_1, it's going to change the loss function by 01:40:57.780 |
a lot, whereas if we change a weight which impacts x_3, it'll change the loss function hardly at all. 01:41:04.260 |
So the different weights have very different gradients, very different amounts that are 01:41:10.620 |
going to affect the outcome. Furthermore, as you go further down through the model, 01:41:16.660 |
that's going to multiply. Particularly when we're using something like softmax, which 01:41:19.980 |
has an 'e' to the power of in it, you end up with these crazy big numbers. 01:41:24.700 |
So when you have inputs that are of very different scales, it makes the whole model very fragile, 01:41:32.300 |
which means it is harder to learn the best set of weights and you have to use smaller 01:41:36.820 |
learning rates. This is not just true of deep learning, it's true of pretty much every 01:41:43.020 |
kind of machine learning model, which is why everybody who's been through the MSAM program 01:41:47.220 |
here has hopefully learned to normalize their inputs. 01:41:51.180 |
So if you haven't done any machine learning before, no problem, just take my word for 01:41:55.540 |
it, you always want to normalize your inputs. It's so common that pretty much all of the 01:42:02.900 |
deep learning libraries will normalize your inputs for you with a single parameter. And 01:42:08.660 |
indeed we're doing it in ours. Because pixel values only range from 0 to 255, 01:42:19.260 |
you don't generally worry about dividing by the standard deviation with images, but you 01:42:24.120 |
do generally worry about subtracting the mean. So you'll see that the first thing that our 01:42:30.700 |
model does is this thing called pre-process, which subtracts the mean. And the mean was 01:42:37.420 |
something which basically you can look it up on the internet and find out what the mean 01:42:41.260 |
of the ImageNet data is. So these three fixed values. 01:42:46.300 |
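This mirrors what the course's VGG model does in its first Lambda layer: subtract the per-channel ImageNet mean (the three fixed values just mentioned) and flip RGB to BGR, which is what the original Caffe-trained VGG weights expect. The sketch assumes Theano channels-first ordering and batched inputs.

```python
import numpy as np

vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3, 1, 1))

def vgg_preprocess(x):
    x = x - vgg_mean      # subtract the mean of each channel
    return x[:, ::-1]     # reverse the channel axis of the batch: RGB -> BGR
```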
Now what's that got to do with batch norm? Well, imagine that somewhere along the line 01:42:52.140 |
in our training, we ended up with one really big weight. Then suddenly one of our layers 01:42:59.700 |
is going to have one really big number. And now we're going to have exactly the same problem 01:43:03.920 |
as we had before, which is the whole model becomes very un-resilient, becomes very fragile, 01:43:10.460 |
becomes very hard to train, going to be all over the place. Some numbers could even get 01:43:23.860 |
So what do we do? Really what we want to do is to normalize not just our inputs but our 01:43:34.420 |
activations as well. So you may think, OK, no problem, let's just subtract the mean and 01:43:40.020 |
divide by the standard deviation for each of our activation layers. Unfortunately that 01:43:45.300 |
doesn't work. SGD is very bloody-minded. If it wants to increase one of the weights higher 01:43:51.420 |
and you try to undo it by subtracting the mean and dividing by the standard deviation, 01:43:55.380 |
the next iteration is going to try to make it higher again. So if SGD decides that it 01:44:00.860 |
wants to make your weights of very different scales, it will do so. So just normalizing the activations doesn't work on its own. 01:44:10.220 |
So batch norm is a really neat trick for avoiding that problem. Before I tell you the trick, 01:44:16.580 |
I will just tell you why you want to use it. Because A) it's about 10 times faster than 01:44:22.800 |
not using it, particularly because it often lets you use a 10 times higher learning rate, 01:44:28.420 |
and B) because it reduces overfitting without removing any information from the model. So 01:44:33.620 |
these are the two things you want, less overfitting and faster models. 01:44:39.860 |
I'm not going to go into detail on how it works. You can read about this during the week if 01:44:42.980 |
you're interested. But a brief outline. First step, it normalizes the intermediate layers 01:44:48.980 |
just the same way as input layers can be normalized. The thing I just told you wouldn't work, well 01:44:54.020 |
it does it, but it does something else critical, which is it adds two more trainable parameters. 01:45:00.340 |
One trainable parameter multiplies by all the activations, and the other one is added 01:45:04.580 |
to all the activations. So effectively that is able to undo that normalization. Both of 01:45:12.060 |
those two things are then incorporated into the calculation of the gradient. 01:45:16.800 |
So the model now knows that it can rescale all of the weights if it wants to without 01:45:24.880 |
moving one of the weights way off into the distance. And so it turns out that this does 01:45:30.720 |
actually effectively control the weights in a really effective way. 01:45:34.580 |
So that's what batch normalization is. The good news is, for you to use it, you just 01:45:38.820 |
type batch normalization. In fact, you can put it after dense layers, you can put it 01:45:47.860 |
after convolutional layers, you should put it after all of your layers. 01:45:52.300 |
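A minimal sketch of where the layer goes (Keras 1-style, channels-first ordering; the layer sizes are just for illustration):

```python
from keras.layers.core import Dense
from keras.layers.convolutional import Convolution2D
from keras.layers.normalization import BatchNormalization

model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(BatchNormalization(axis=1))    # after a convolutional layer
model.add(Dense(4096, activation='relu'))
model.add(BatchNormalization())          # after a dense layer
```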
Here's the bad news, VGG didn't train originally with batch normalization, and adding batch 01:45:59.300 |
normalization changes all of the weights. I think that there is a way to calculate a new 01:46:05.820 |
set of weights with batch normalization, I haven't gone through that process yet. 01:46:12.020 |
So what I did today was I actually grabbed the entirety of ImageNet and I trained this 01:46:20.900 |
model on all of ImageNet. And that then gave me a model which was basically VGG plus batch 01:46:28.940 |
normalization. And so that is the model here that I'm loading. So this is the ImageNet, 01:46:36.380 |
whatever it is, large visual recognition competition 2012 dataset. And so I trained this set of 01:46:42.140 |
weights on the entirety of ImageNet so that I created basically a VGG plus batch norm. 01:46:47.540 |
And so then I fine-tuned the VGG plus batch norm model by popping off the end and adding 01:46:56.960 |
a new dense layer. And then I trained it, and these only took 6 seconds because I pre-calculated 01:47:06.440 |
the inputs to this. Then I added data augmentation and I started training that. And then I ran 01:47:22.380 |
So I think this was on the right track. I think if I had another hour or so I would have finished it; you guys 01:47:29.100 |
can play with this during the week. Because this is now like all the pieces together. 01:47:34.780 |
It's batch norm and data augmentation and as much dropout as you want. So you'll see 01:47:43.300 |
what I've got here is I have dropout layers with an arbitrary amount of dropout. And so 01:47:50.820 |
in this, the way I set it up, you can go ahead and say create batch norm layers with whatever 01:47:56.140 |
amount of dropout you want. And then later on you can say I want you to change the weights 01:48:04.180 |
So this is kind of like the ultimate ImageNet fine-tuning experience. And I haven't seen 01:48:11.820 |
anybody create this before, so this is a useful tool that didn't exist until today. And hopefully 01:48:22.300 |
Interestingly, I found that when I went back to even 0.5 dropout, it was still massively 01:48:28.300 |
overfitting. So it seems that batch normalization allows the model to be so much better at finding 01:48:35.420 |
the optimum that I actually needed more dropout rather than less. 01:48:39.900 |
So anyway, as I said, this is all something I was doing today. So I haven't quite finalized 01:48:45.500 |
that. What I will show you though is something I did finalize, which I did on Sunday, which 01:48:51.180 |
is going through end-to-end an entire model-building process on MNIST. And so I want to show you 01:48:58.780 |
this entire process and then you guys can play with it. 01:49:03.780 |
MNIST is a great way to really experiment with and revise everything we know about CNNs 01:49:10.460 |
because it's very fast to train, because there are only 28x28 images, and there's also extensive 01:49:15.300 |
benchmarks on what are the best approaches to MNIST. 01:49:19.820 |
So it's very, very easy to get started with MNIST because Keras actually contains a copy 01:49:25.580 |
of MNIST. So we can just go from keras.datasets import mnist, then mnist.load_data(), and we're done. 01:49:34.460 |
Now MNIST are grayscale images, and everything in Keras in terms of the convolutional stuff 01:49:41.020 |
expects there to be a number of channels. So we have to use expand-dims to add this empty 01:49:49.260 |
dimension. So this is 60,000 images with one color, which are 28x28. So if you try to use 01:49:59.500 |
grayscale images and get weird errors, I'm pretty sure this is what you've forgotten 01:50:04.780 |
to do, just to add this kind of empty dimension, which is you actually have to tell it there 01:50:10.140 |
is one channel. Because otherwise it doesn't know how many channels there are. So there's just that one extra step. 01:50:15.780 |
The other thing I had to do was take the y-values, the labels, and one-hot encode them. Because 01:50:23.140 |
otherwise they were like this, they were actual numbers: 5, 0, 4, 1, 9. And we need to one-hot encode 01:50:30.380 |
them so that each becomes a vector of length 10 with a 1 in the right position. Remember, this is the thing that the softmax function is 01:50:40.180 |
trying to approximate. That's how the linear algebra works. So there are the two things 01:50:44.820 |
I had to do to preprocess this. Add the empty dimension and do my one-hot encoding. Then 01:50:51.220 |
I normalize the input by subtracting the mean and dividing by the standard deviation. And that's all the preprocessing. 01:51:01.340 |
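A sketch of that preprocessing (Keras 1, Theano channels-first ordering):

```python
import numpy as np
from keras.datasets import mnist
from keras.utils.np_utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = np.expand_dims(X_train, 1)     # (60000, 28, 28) -> (60000, 1, 28, 28)
X_test  = np.expand_dims(X_test, 1)
y_train, y_test = to_categorical(y_train), to_categorical(y_test)   # e.g. 5 -> [0 0 0 0 0 1 0 0 0 0]

mean_px = X_train.mean().astype(np.float32)
std_px  = X_train.std().astype(np.float32)
def norm_input(x): return (x - mean_px) / std_px    # used in a Lambda layer at the start of each model
```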
So I can't fine-tune from ImageNet now because ImageNet is 224x224 and this is 28x28. ImageNet 01:51:09.580 |
is full color and this is grayscale. So we're going to start from scratch. So all of these models start from random weights. 01:51:16.620 |
So a linear model needs to normalize the input and needs to flatten it because I'm not going 01:51:22.620 |
to treat it as an image, I'm going to treat it as a single vector. And then I create my 01:51:27.540 |
one dense layer with 10 outputs, compile it, grab my batches, and train my linear model. 01:51:37.500 |
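A sketch of that linear model: normalize, flatten, one dense softmax layer (norm_input is the function from the preprocessing sketch above).

```python
from keras.models import Sequential
from keras.layers.core import Lambda, Flatten, Dense

lm = Sequential([
    Lambda(norm_input, input_shape=(1, 28, 28)),   # normalize the pixels
    Flatten(),                                     # treat the image as a single vector
    Dense(10, activation='softmax')
])
lm.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```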
And so you can see, generally speaking, the best way to train a model is to start by doing 01:51:45.440 |
one epoch with a pretty low learning rate. So the default learning rate is 0.001, which 01:51:53.100 |
is actually a pretty good default. So you'll find nearly all of the time I just accept 01:51:56.740 |
the default learning rate and I do a single epoch. And that's enough to get it started. 01:52:02.500 |
Once you've got it started, you can set the learning rate really high. So 0.1 is about 01:52:06.780 |
as high as you ever want to go, and do another epoch. And that's going to move super fast. 01:52:12.420 |
And then gradually, you reduce the learning rate by order of magnitude at a time. So I 01:52:18.780 |
go to 0.01, do a few epochs, and basically keep going like that until you start overfitting. 01:52:25.820 |
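A sketch of that schedule (Keras 1-style fit; K.set_value updates the optimizer's learning-rate variable in place, and the epoch counts are illustrative):

```python
from keras import backend as K

lm.fit(X_train, y_train, batch_size=64, nb_epoch=1, validation_data=(X_test, y_test))
K.set_value(lm.optimizer.lr, 0.1)       # warmed up, so go fast
lm.fit(X_train, y_train, batch_size=64, nb_epoch=1, validation_data=(X_test, y_test))
K.set_value(lm.optimizer.lr, 0.01)      # then back off by an order of magnitude
lm.fit(X_train, y_train, batch_size=64, nb_epoch=4, validation_data=(X_test, y_test))
```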
So I got down to the point where I had a 92.7% accuracy on the training, 92.4% on the test, 01:52:32.820 |
and I was like, okay, that's about as far as I can go. So that's a linear model. Not 01:52:37.500 |
very interesting. So the next thing to do is to grab one extra dense layer in the middle, 01:52:42.660 |
so one hidden layer. This is what in the 80s and 90s people thought of as a neural network, 01:52:48.500 |
one hidden layer fully connected. And so that still takes 5 seconds to train. Again, we 01:52:55.460 |
do the same thing, one epoch with a low learning rate, then pop up the learning rate for as 01:52:59.740 |
long as we can, gradually decrease it, and we get 94% accuracy. 01:53:07.500 |
So you wouldn't expect a fully connected network to do that well. So let's create a CNN. So 01:53:12.900 |
this was actually the first architecture I tried. And basically I thought, okay, we know 01:53:17.180 |
VGG works pretty well, so how about I create an architecture that looks like VGG, but it's 01:53:23.180 |
much simpler because this is just 28x28. So I thought, okay, well VGG generally has a couple 01:53:28.380 |
of convolutional layers of 3x3, and then a max pooling layer, and then a couple more with 01:53:33.460 |
twice as many filters. So I just tried that. So this is kind of like my inspired by VGG 01:53:42.100 |
And I thought, okay, so after 2 lots of max pooling, it'll go from 28x28 to 14x14 to 7x7. 01:53:50.660 |
Okay, that's probably enough. So then I added my 2 dense layers again. So I didn't use any 01:53:57.140 |
science here, it's just kind of some intuition. And it actually worked pretty well. After 01:54:03.900 |
my learning rate of 0.1, I had an accuracy of 98.9%, validation accuracy of 99%. And then 01:54:12.300 |
after a few epochs at 0.01, I had a training accuracy of 99.75%. But look, my validation accuracy is now lower than that, so I'm overfitting. 01:54:23.280 |
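A sketch of that "inspired by VGG" architecture; the exact filter counts and dense size here are illustrative rather than a transcription of the notebook.

```python
from keras.models import Sequential
from keras.layers.core import Lambda, Flatten, Dense
from keras.layers.convolutional import Convolution2D, MaxPooling2D

def get_cnn():
    model = Sequential([
        Lambda(norm_input, input_shape=(1, 28, 28)),
        Convolution2D(32, 3, 3, activation='relu'),
        Convolution2D(32, 3, 3, activation='relu'),
        MaxPooling2D(),
        Convolution2D(64, 3, 3, activation='relu'),   # twice as many filters after pooling
        Convolution2D(64, 3, 3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        Dense(512, activation='relu'),
        Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```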
So this is the trick. Start by overfitting. Once you know you're overfitting, you know 01:54:28.500 |
that you have a model that is complex enough to handle your data. So at this point, I was 01:54:33.740 |
like, okay, this is a good architecture. It's capable of overfitting. So let's now try to 01:54:38.160 |
use the same architecture and reduce overfitting, but reduce the complexity of the model no more than we have to. 01:54:45.260 |
So the first applicable step of my 5-step list was data augmentation. So I added a bit of data augmentation, and 01:54:52.420 |
then I used exactly the same model as I had before. And trained it for a while. And I found 01:54:58.780 |
this time I could actually train it for even longer, as you can see. And I started to get 01:55:03.680 |
some pretty good results here, 99.3, 99.34. But by the end, you can see I'm massively 01:55:09.220 |
overfitting again. 99.6 training versus 91.1 test. 01:55:15.260 |
So data augmentation alone is not enough. And I said to you guys, we'll always use batch 01:55:20.820 |
norm anyway. So then I add batch norm. I use batch norm on every layer. Notice that when 01:55:29.420 |
you use batch norm on convolution layers, you have to add axis=1. I am not going to 01:55:36.220 |
tell you why. I want you guys to read the documentation about batch norm and try and 01:55:41.420 |
figure out why you need this. And then we'll have a discussion about it on the forum because 01:55:45.780 |
it's a really interesting analysis if you really want to understand batch norm and understand 01:55:54.060 |
If you don't care about the details, that's fine. Just know type axis=1 anytime you have 01:55:59.100 |
batch norm. And so this is like a pretty good quality modern network. You can see I've got 01:56:05.860 |
convolution layers, they're 3x3, and then I have batch norm, and then I have max pooling, 01:56:10.100 |
and then at the end I have some dense layers. This is actually a pretty decent looking model. 01:56:16.300 |
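A sketch of that batch-norm version of the network (dropout gets added at the very end in the next step); again, the exact sizes are illustrative.

```python
from keras.models import Sequential
from keras.layers.core import Lambda, Flatten, Dense
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization

def get_model_bn():
    model = Sequential([
        Lambda(norm_input, input_shape=(1, 28, 28)),
        Convolution2D(32, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(32, 3, 3, activation='relu'),
        MaxPooling2D(),
        BatchNormalization(axis=1),
        Convolution2D(64, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(64, 3, 3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        BatchNormalization(),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```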
Not surprisingly, it does pretty well. So I train it for a while at 0.1, I train it for 01:56:20.780 |
a while at 0.01, I train it for a while at 0.001, and you can see I get up to 99.5%. That's 01:56:30.380 |
not bad. But by the end, I'm starting to overfit. 01:56:35.380 |
So add a little bit of dropout. And remember what I said to you guys, nowadays the rule 01:56:41.980 |
for dropout is to gradually increase it. I only had time yesterday to just try adding 01:56:49.300 |
one layer of dropout right at the end, but as it happened, that seemed to be enough. 01:56:53.800 |
So when I just added one layer of dropout to the previous model, trained it for a while 01:56:57.820 |
at 0.1, 0.01, 0.001, and it's like, oh great, my accuracy and my validation accuracy are 01:57:08.100 |
pretty similar, and my validation accuracy is around 99.5 to 99.6 towards the end here. 01:57:18.900 |
So at 99.5 or 99.6% accuracy on handwriting recognition is pretty good, but there's one 01:57:26.700 |
more trick you can do which makes every model better, and it's called Ensembling. Ensembling 01:57:32.980 |
refers to building multiple versions of your model and combining them together. 01:57:37.780 |
So what I did was I took all of the code from that last section and put it into a single 01:57:43.980 |
function. So this is exactly the same model I had before, and these are the exact steps 01:57:49.780 |
I talked about to train it: learning rate of 0.1, then 0.01, then 0.001. So at the end of this, it returns the trained model. 01:57:58.420 |
And so then I said, okay, 6 times fit a model and return a list of the results. So models 01:58:07.320 |
at the end of this contain 6 trained models using my preferred network. 01:58:15.380 |
So then what I could do was to say, go through every one of those 6 models and predict the 01:58:24.920 |
output for everything in my test set. So now I have 10,000 test images by 10 outputs by 01:58:34.500 |
6 models. And so now I can take the average across the 6 models. And so now I'm basically 01:58:42.020 |
saying here are 6 models, they've all been trained in the same way but from different 01:58:46.460 |
random starting points. And so the idea is that they will be making errors in different places. 01:58:52.100 |
So let's take the average of them, and I get an accuracy of 99.7%. How good is that? It's 01:59:01.300 |
very good. It's so good that if we go to the academic list of the best MNIST results of 01:59:08.000 |
all time, and many of these were specifically designed for handwriting recognition, it comes in very near the top. 01:59:16.740 |
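A sketch of the ensembling step: train several copies of the same model from different random initializations and average their predicted probabilities. get_model_bn is the batch-norm model sketched earlier (the lesson's version also has the final dropout layer), and the training schedule inside fit_model is abbreviated here.

```python
import numpy as np

def fit_model():
    model = get_model_bn()
    model.fit(X_train, y_train, batch_size=64, nb_epoch=12,
              validation_data=(X_test, y_test))
    return model

models = [fit_model() for _ in range(6)]                                     # six independently trained copies
all_preds = np.stack([m.predict(X_test, batch_size=256) for m in models])   # shape (6, 10000, 10)
avg_preds = all_preds.mean(axis=0)                                           # average over the ensemble
accuracy = (avg_preds.argmax(axis=1) == y_test.argmax(axis=1)).mean()
```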
So one afternoon's work gets us in the list of the best results ever found on this dataset. 01:59:25.140 |
So as you can see, it's not rocket science; it's all stuff you've learned before or you've 01:59:30.700 |
learned now, and it's a process which is fairly repeatable and can get you right up to the state of the art. 01:59:39.580 |
So it was easier to do it on MNIST because I only had to wait a few seconds for each 01:59:45.060 |
of my trainings to finish. To get to this point on State Farm, it's going to be harder 01:59:51.460 |
because you're going to have to think about how do you do it in the time you have available 01:59:54.860 |
and how do you do it in the context of fine-tuning and stuff like that. But hopefully you can 02:00:00.140 |
see that you have all of the tools now at your disposal to create literally a state of the art result. 02:00:08.160 |
So I'm going to make all of these notebooks available. You can play with them. You can 02:00:12.780 |
try to get a better result from dogs and cats. As you can see, it's kind of like an incomplete 02:00:19.100 |
thing that I've done here. I haven't found the best data augmentation, I haven't found 02:00:22.340 |
the best dropout, I haven't trained it as long as I probably need to. So there's some room for improvement. 02:00:29.260 |
So here are your assignments for this week. This is all review now. I suggest you go back 02:00:36.580 |
and actually read. There's quite a bit of prose in every one of these notebooks. Hopefully 02:00:40.820 |
now you can go back and read that prose, and some of that prose at first was a bit mysterious, 02:00:46.340 |
now it's going to make sense. Oh, okay, I see what it's saying. And if you read something 02:00:50.500 |
and it doesn't make sense, ask on the forum. Or if you read something and you want to check, 02:00:55.900 |
oh, is this kind of another way of saying this other thing? Ask on the forum. 02:01:00.340 |
So these are all notebooks that we've looked at already and you should definitely review. 02:01:05.020 |
Ask us something on the forum. Make sure that you can replicate the steps shown in the lesson 02:01:09.620 |
notebooks we've seen so far using the technique in how to use the provided notebooks we looked 02:01:14.620 |
at the start of class. If you haven't yet got into the top 50% of dogs vs cats, hopefully now you can. 02:01:22.940 |
If you get stuck at any point, ask on the forum. And then this is your big challenge. Can you 02:01:28.100 |
get into the top 50% of State Farm? Now this is tough. The first step to doing well in 02:01:33.900 |
a Kaggle competition is to create a validation set that gives you accurate answers. So create 02:01:39.940 |
a validation set, and then make sure that the validation set accuracy is the same as 02:01:46.940 |
you get when you submit to Kaggle. If you don't, you don't have a good enough validation 02:01:51.340 |
set yet. Creating a validation set for State Farm is really your first challenge. It requires 02:01:57.100 |
thinking long and hard about the evaluation section on that page and what that means. 02:02:01.820 |
And then it's thinking about which layers of the pre-trained network should I be retraining. 02:02:09.460 |
I actually have read through the top 20 results from the competition close 3 months ago. I 02:02:16.420 |
actually think all of the top 20 result methods are pretty hacky. They're pretty ugly. I feel 02:02:24.860 |
like there's a better way to do this that's kind of in our grasp. So I'm hoping that somebody 02:02:31.060 |
is going to come up with a top 20 result for State Farm that is elegant. We'll see how 02:02:39.140 |
we go. If not this year, maybe next year. Honestly, nobody in Kaggle quite came up with 02:02:44.940 |
a really good way of tackling this. They've got some really good results, but with some fairly hacky approaches. 02:02:53.100 |
And then as you go through a review, please, any of these techniques that you're not clear 02:02:58.060 |
about, these 5 pieces, please go and have a look at this additional information and see if that helps. 02:03:05.260 |
Alright, that was a pretty quick run-through. I hope everything goes well and I will see you next week.