Lesson 3: Practical Deep Learning for Coders
00:00:00.000 |
Let's start on the Lesson 3 section of the wiki, because Rachel added 00:00:13.200 |
something this week which I think is super helpful. In the section 00:00:18.180 |
about the assignments, where it talks about going through the notebooks, there is 00:00:22.640 |
a section called "How to Use the Provided Notebooks". The feedback I get 00:00:28.320 |
is that each time I talk about the teaching approach in this class, people get a lot out of it. 00:00:34.680 |
So I thought I'd keep talking a little bit about that. 00:00:39.240 |
As we've discussed before, in the two hours that we spend together each week, that's not 00:00:43.960 |
nearly enough time for me to teach you Deep Learning. 00:00:46.800 |
I can show you what kinds of things you need to learn about and I can show you where to 00:00:51.080 |
look and try to give you a sense of some of the key topics. 00:00:55.000 |
But then the idea is that you're going to learn about deep learning during the week by experimenting. 00:01:00.600 |
And one of the places that you can do that experimenting is with the help of the notebooks. 00:01:08.420 |
Having said that, if you do that by loading up a notebook and hitting Shift + Enter a 00:01:12.480 |
bunch of times to go through each cell until you get an error message and then you go, 00:01:16.680 |
"Oh shit, I got an error message," you're not going to learn anything about deep learning. 00:01:23.700 |
I was almost tempted to not put the notebooks online until a week after each class because 00:01:30.960 |
it's just so much better when you can build it yourself. 00:01:35.520 |
But the notebooks are very useful if you use them really rigorously and thoughtfully which 00:01:40.120 |
is as Rachel described here, read through it and then put it aside, minimize it or close 00:01:47.400 |
it or whatever, and now try and replicate what you just read from scratch. 00:01:53.920 |
And anytime you get stuck, you can go back and open back up the notebook, find the solution 00:01:59.840 |
to your problem, but don't copy and paste it. 00:02:03.580 |
Put the notebook aside again, go and read the documentation about what it turns out 00:02:08.440 |
the solution was, try and understand why this is the solution, and type in that solution yourself. 00:02:16.360 |
And so if you can do that, it means you really understand now the solution to this thing 00:02:20.800 |
you're previously stuck on and you've now learned something you didn't know before. 00:02:27.680 |
So if you're still stuck, you can refer back to the notebook again. Still don't copy and 00:02:32.040 |
paste the code, but whilst having both open on the screen at the same time, type in the code yourself. 00:02:39.560 |
Why would you type in code you could copy and paste? Because the very kinesthetic process 00:02:43.800 |
of typing it in forces you to think about where the parentheses are, where the dots are, and so forth. 00:02:49.920 |
And then once you've done that, you can try changing the inputs to that function and see 00:02:55.040 |
what happens and see how it affects the outputs and really experiment. 00:02:59.320 |
So it's through this process that you learn: trying to come up with what step to take next 00:03:06.140 |
means that you're thinking about the concepts you've learned. 00:03:09.000 |
Then working out how to do that step means that you're having to recall how the actual libraries work. 00:03:15.560 |
And then most importantly, through experimenting with the inputs and outputs, you get a really strong intuitive understanding of what's going on. 00:03:22.400 |
So one of the questions I was thrilled to see over the weekend, which is exactly the 00:03:47.060 |
kind of experimentation I'm talking about, was about a 1D convolution. So: I sent it two vectors with two things each and was happy with the result. 00:03:53.080 |
And then I sent it two vectors with three things in, and I don't get it. 00:03:58.560 |
This is taking it right down to basics to make sure I really understand it. 00:04:02.720 |
And so I typed something in and the output was not what I expected, what's going on. 00:04:08.440 |
And so then I tried it by creating a little spreadsheet and showed here are the three 00:04:15.760 |
numbers and here's what happens to them. And then it's like, "Okay, I kind of get that, not fully," and then I finally described it. 00:04:23.240 |
So you now understand correlation and convolution. 00:04:25.960 |
You know you do because you put it in there, you figure out what the answer ought to be, 00:04:30.600 |
and eventually the answer is what you thought. 00:04:32.520 |
So this is exactly the kind of experimentation I find a lot of people try to jump straight 00:04:43.320 |
to full-scale image recognition before they've got to the kind of 1+1 stage. 00:04:51.000 |
And so you'll see I do a lot of stuff in Excel, and this is why. 00:04:55.480 |
In Excel or with simple little things in Python, I think, is where you get the most experimental insight. 00:05:02.640 |
So that's what we're talking about when we talk about experiments. 00:05:10.280 |
I want to show you something pretty interesting. 00:05:12.960 |
And remember last week we looked at this paper from Matt Zeiler where we saw what the different 00:05:23.040 |
layers of a convolutional neural network look like. 00:05:33.880 |
One of the steps in the how to use the provided notebooks is if you don't know why a step 00:05:57.280 |
is being done or how it works or what you observe, please ask. 00:06:06.440 |
Any time you're stuck for half an hour, please ask. 00:06:12.000 |
So far, I believe that there has been a 100% success rate in answering questions on the forum. 00:06:23.120 |
So part of the homework this week in the assignments is ask a question on the forum. 00:06:31.080 |
Questions about setting up AWS? Don't be embarrassed if you still have questions there. 00:06:43.240 |
I know a lot of people are still working through cats and dogs, or cats and dogs redux. 00:06:53.180 |
There are plenty of people here who have never used Python before. 00:06:59.260 |
The goal is that for those of you that don't know Python, that we give you the resources 00:07:04.180 |
to learn it and learn it well enough to be effective in doing deep learning in it. 00:07:09.920 |
But that does mean that you guys are going to have to ask more questions. 00:07:17.680 |
So if you see somebody asking on the forum about how do I analyze functional brain MRIs 00:07:24.840 |
with 3D convolutional neural networks, that's fine, that's where they are at. 00:07:29.120 |
That's okay if you then ask, What does this Python function do? 00:07:34.520 |
If you see somebody ask, What does this Python function do? 00:07:36.960 |
And you want to talk about 3D brain MRIs, do that too. 00:07:40.560 |
The nice thing about the forum is that as you can see, it really is buzzing now. 00:07:48.040 |
The nice thing is that the different threads allow people to dig into the stuff that interests them. 00:07:54.360 |
And I'll tell you from personal experience, the thing that I learn the most from is answering questions. 00:08:03.060 |
So actually answering that question about a 1D convolution, I found very interesting. 00:08:08.080 |
I actually didn't realize that the reflect parameter was the default parameter and I 00:08:12.560 |
didn't quite understand how it worked, so answering that question I found very interesting. 00:08:17.600 |
And even sometimes if you know the answer, figuring out how to express it teaches you something new. 00:08:23.280 |
So asking questions of any level is always helpful to you and to the rest of the community. 00:08:32.160 |
So please, if everybody only does one part of the assignments this week, do that one. 00:08:40.280 |
And here are some ideas about questions you could ask if you're not sure. 00:08:49.360 |
So I was saying last week we kind of looked later in the class at this amazing visualization 00:08:56.360 |
of what goes on in a convolutional neural network. 00:09:00.440 |
I want to show you something even cooler, which is the same thing in video. 00:09:07.800 |
This is by an amazing guy called Jason Yosinski, his supervisor Hod Lipson, and some other colleagues. 00:09:18.760 |
And so I'm going to show you what's going on here. 00:09:26.400 |
So if you go to Google and search for the Deep Visualization Toolbox, you can do this. 00:09:32.180 |
You can grab pictures, you can click on any one of the layers of a convolutional neural 00:09:37.640 |
network, and it will visualize every one of the outputs of the filters in that convolutional layer. 00:09:47.920 |
So you can see here with this dog, it looks like there's a filter here which is kind of finding edges. 00:09:55.000 |
So if you give it a video stream of your own webcam, you can see the video stream popping up, run through each of these filters. 00:10:04.200 |
And looking at this tool now, I hope, will give us a better intuition about what's going on. 00:10:12.680 |
As he slides a piece of paper over it, you get this very strong edge. 00:10:18.280 |
And clearly it's specifically a horizontal edge detector. 00:10:21.360 |
And here is actually a visualization of the pixels of the filter itself. 00:10:26.480 |
Remember from our initial lesson 0, an edge detector has black on one side and white on the other. 00:10:32.200 |
So you can scroll through all the different layers of this neural network. 00:10:40.280 |
And the deeper the layer, the larger the area of the original image it covers (and so the smaller its 00:10:47.280 |
output is), and the more complex the objects that it can recognize. 00:10:52.160 |
So here's an interesting example of a layer 5 thing which it looks like it's a face detector. 00:10:58.440 |
So you can see that as he moves his face around, this is moving around as well. 00:11:03.780 |
So one of the cool things you can do with this is you can say show me all the images 00:11:07.640 |
from ImageNet that match this filter as much as possible, and you can see that it's showing a bunch of faces. 00:11:13.760 |
This is a really cool way to understand what your neural network is doing, or what the ImageNet network has learned. 00:11:23.480 |
You can see other guys come along and here we are. 00:11:26.040 |
And so here you can see the actual result in real time of the filter deconvolution, 00:11:31.280 |
and here's the actual recognition that it's doing. 00:11:34.200 |
So clearly it's a face detector which also detects cat faces. 00:11:39.520 |
So the interesting thing about these types of neural net filters is that they're often quite semantic. 00:11:46.720 |
They're not looking for just some fixed set of pixels, but they really understand concepts. 00:11:55.400 |
Here's one of the filters in the 5th layer which seems to be like an armpit detector. 00:12:03.680 |
Well interestingly, what he shows here is that actually it's not an armpit detector. 00:12:09.240 |
If he smooths out his fabric, this disappears. 00:12:12.920 |
So what this actually is, is a texture detector. 00:12:16.200 |
It's something that detects some kind of regular texture. 00:12:23.520 |
Here's an interesting example of one which clearly is a text detector. 00:12:27.600 |
Now interestingly, ImageNet did not have a category called text, one of the thousand 00:12:32.880 |
categories is not text, but one of the thousand categories is bookshelf. 00:12:37.800 |
And so you can't find a bookshelf if you don't know how to find a book, and you can't find 00:12:41.840 |
a book if you don't know how to recognize its spine, and the way to recognize its spine is to recognize the text on it. 00:12:47.480 |
So this is the cool thing about these neural networks: you don't have to tell them what features to look for. 00:12:53.860 |
They decide what they want to find in order to solve your problem. 00:12:59.520 |
So I wanted to start at this end of "Oh my God, deep learning is really cool" and then 00:13:06.480 |
jump back to the other end of "Oh my God, deep learning is really simple." 00:13:11.960 |
So everything we just saw works because of the things that we've learned about so far, 00:13:18.080 |
and I've got a section here called CNN Review in lesson 3. 00:13:21.920 |
And Rachel and I have started to add some of our favorite readings about each of these 00:13:27.280 |
pieces, but everything you just saw in that video consists of the following pieces. 00:13:33.280 |
Matrix products, convolutions just like we saw in Excel and Python, activations such 00:13:42.260 |
as ReLUs and softmax, and stochastic gradient descent, which is based on backpropagation 00:13:50.120 |
- we'll learn more about that today - and that's basically it. 00:13:54.700 |
One of the, I think, challenging things is, even if you feel comfortable with each of 00:13:59.080 |
the five pieces that make up convolutional neural networks, really understanding how 00:14:06.960 |
those pieces fit together to actually do deep learning. 00:14:10.920 |
So we've got two really good resources here on putting it all together. 00:14:15.040 |
So I'm going to go through each of these pieces today as revision, but what I suggest 00:14:22.400 |
you do, if there's any piece where you feel like "I'm not quite confident that I really know 00:14:27.480 |
what a convolution is" or "I really know what an activation function is", is see if this information 00:14:32.440 |
is helpful, and maybe ask a question on the forum. 00:14:40.840 |
I think a particularly good place to start maybe is with convolutions. 00:14:47.120 |
And a good reason to start with convolutions is because we haven't really looked at them much since Lesson 0. 00:15:00.720 |
So in Lesson 0, we learned about what a convolution is and we learned about what a convolution 00:15:05.500 |
is by actually running a convolution against an image. 00:15:12.760 |
The MNIST dataset, remember, consists of 55,000 28x28 grayscale images of handwritten digits. 00:15:24.200 |
So each one of these has some known label, and so here's five examples with a known label. 00:15:33.620 |
So in order to understand what a convolution is, we tried creating a simple little 3x3 matrix. 00:15:41.380 |
And so the 3x3 matrix we started with had negative 1s at the top, 1s in the middle, and 0s at the bottom. 00:15:52.240 |
So what would happen if we took this 3x3 matrix and we slid it over every 3x3 part of this 00:16:01.420 |
image, and we multiplied negative 1 by the first pixel, negative 1 by the second pixel, negative 1 by the third, 00:16:07.560 |
and then moved to the next rows and multiplied by 1, 1, 1 and 0, 0, 0, and added them all together? 00:16:23.920 |
So you might remember from Lesson 0, we looked at a little area to actually see what this does. 00:16:31.120 |
So we could zoom in, so here's a little small little bit of the 7. 00:16:43.440 |
And so one thing I think is helpful is just to look at what is that little bit. 00:16:51.280 |
Let's make it a bit smaller so it fits on our screen. 00:16:57.280 |
So you can see that an image just is a bunch of numbers. 00:17:03.240 |
And the blacks are zeros, and the things in between are bigger and bigger numbers until eventually the whites are 1s. 00:17:11.100 |
So what would happen if we took this little 3x3 area? 00:17:13.960 |
0, 0, 0, 0, 0.35, 0.5, 0.9, 0.9, 0.9, and we multiplied each of those 9 things by each of the 9 things in our filter? 00:17:27.720 |
So clearly anywhere where the first row is zeros and the second row is ones, this is 00:17:38.000 |
going to be very high when we multiply it all together and add the 9 things up. 00:17:42.720 |
And so given that white means high, you can see then that when we do this convolution, 00:17:53.680 |
we end up with something where the top edges become bright, because we went -1, -1, -1 times 00:18:02.840 |
the dark row above the edge, plus 1, 1, 1 times the bright row, and added them all together. 00:18:07.040 |
So one of the things we looked at in lesson 0 and we have a link to here is this cool 00:18:11.840 |
little image kernel explained visually site where you can actually create any 3x3 matrix 00:18:21.600 |
yourself and go through any 3x3 part of this picture and see the actual arithmetic and see the result. 00:18:34.480 |
So if you're not comfortable with convolutions, this would be a great place to go next. 00:18:47.160 |
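As a concrete, entirely made-up miniature version of that arithmetic, here is a sketch in plain NumPy of sliding the hand-made top-edge filter over a tiny patch. The image values are invented and this is not the course notebook's code:

```python
# Minimal sketch of the lesson 0 convolution arithmetic (illustrative values only)
import numpy as np

# Tiny fake "image" patch: dark (0) on top, bright (0.9) below, like the top of a 7
img = np.array([[0. , 0. , 0. , 0. ],
                [0. , 0. , 0. , 0. ],
                [0.9, 0.9, 0.9, 0.9],
                [0.9, 0.9, 0.9, 0.9]])

# The hand-designed top-edge filter: -1s on top, 1s in the middle, 0s at the bottom
top = np.array([[-1., -1., -1.],
                [ 1.,  1.,  1.],
                [ 0.,  0.,  0.]])

# Slide the 3x3 filter over every 3x3 part of the image ("valid" positions only)
out = np.zeros((img.shape[0] - 2, img.shape[1] - 2))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = (img[i:i+3, j:j+3] * top).sum()   # multiply element-wise, add up

print(out)   # the row where dark meets bright lights up (2.7); everywhere else is 0
```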
How did you decide on the values of the top matrix? 00:18:52.240 |
So in order to demonstrate an edge filter, I picked values based on some well-known edge detection filters. 00:19:02.260 |
So you can see here's a bunch of different matrices that this guy has. 00:19:06.040 |
So for example, top_sobel, I could select, and you can see that does a top_edge filter. 00:19:12.180 |
Or I could say emboss, and you can see it creates this embossing effect. 00:19:19.760 |
Here's a better example because it's nice and big here. 00:19:25.520 |
So these types of filters have been created over many decades, and there's lots and lots 00:19:30.880 |
of filters designed to do interesting things. 00:19:34.800 |
So I just picked a simple filter which I knew from experience and from common sense would detect top edges. 00:19:53.320 |
And so by the same kind of idea, if I rotate that by 90 degrees, that's going to create a left-edge filter. 00:20:05.040 |
So if I create the four different types of filter here, and I could also create four 00:20:09.980 |
different diagonal filters like these, that would allow me to create top edge, left edge, 00:20:16.840 |
bottom edge, right edge, and then each diagonal edge filter here. 00:20:21.560 |
So I created these filters just by hand through a combination of common sense and having read 00:20:29.440 |
about filters because people spend time designing filters. 00:20:35.040 |
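For what it's worth, here is a small sketch (my own, not the notebook's code) of how the other straight-edge filters can be generated just by rotating the hand-made top-edge filter:

```python
# Sketch: rotating one hand-made edge filter to get the other three
import numpy as np

top = np.array([[-1., -1., -1.],
                [ 1.,  1.,  1.],
                [ 0.,  0.,  0.]])

left   = np.rot90(top, 1)   # 90 degrees  -> responds to left edges
bottom = np.rot90(top, 2)   # 180 degrees -> responds to bottom edges
right  = np.rot90(top, 3)   # 270 degrees -> responds to right edges

edge_filters = [top, left, bottom, right]   # diagonal filters would be built similarly
```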
The more interesting question then really is what would be the optimal way to design these filters. 00:20:42.460 |
Because it's definitely not the case that these eight filters are the best way of figuring 00:20:48.800 |
out what's a 7 and what's an 8 and what's a 1. 00:20:56.040 |
What deep learning does is it says let's start with random filters. 00:21:01.320 |
So let's not design them, but we'll start with totally random numbers for each of our 00:21:06.520 |
So we might start with eight random filters, each of 3x3. 00:21:12.200 |
And we then use stochastic gradient descent to find out what the optimal values of those filters are. 00:21:23.300 |
And that's what happened in order to create that cool video we just saw, and that cool paper we looked at last week. 00:21:29.800 |
That's how those different kinds of edge detectors and gradient detectors and so forth were created. 00:21:34.880 |
When you use stochastic gradient descent to optimize these kinds of values when they start 00:21:39.980 |
out random, it figures out that the best way to recognize images is by creating these kinds of filters. 00:21:52.600 |
Where it gets interesting is when you start building convolutions on top of convolutions. 00:21:59.640 |
So we saw last week (and if I skip over anything here, please remind me) 00:22:23.480 |
how if you've got three inputs, you can create a bunch of weight matrices. 00:22:35.960 |
So if we've got three inputs, we saw last week how you could create a random matrix 00:22:41.280 |
and then do a matrix multiply of the inputs times a random matrix. 00:22:49.180 |
We could then put it through an activation function such as max(0,x) and we could then 00:22:58.600 |
take that and multiply it by another weight matrix to create another output. 00:23:05.260 |
And then we could put that through max(0,x) and we can keep doing that to create arbitrarily 00:23:12.240 |
complex functions. And we looked at this really great Neural Networks and Deep Learning chapter 00:23:22.600 |
where we saw visually how that kind of bunch of matrix products followed by activation 00:23:28.960 |
functions can approximate any given function. 00:23:36.280 |
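To make that concrete, here is a tiny sketch (with made-up sizes and random weights, purely illustrative) of the "matrix product, then max(0, x), then another matrix product" stack being described:

```python
# Sketch: stacking matrix products with max(0, x) activations
import numpy as np

x  = np.random.rand(3)            # three inputs
W1 = np.random.randn(3, 10)       # a random weight matrix
W2 = np.random.randn(10, 10)      # another random weight matrix
W3 = np.random.randn(10, 2)       # final weights down to two outputs

a1 = np.maximum(0., x @ W1)       # matrix product, then the max(0, x) activation
a2 = np.maximum(0., a1 @ W2)      # ...and again
out = a2 @ W3                     # keep stacking to build arbitrarily complex functions
print(out.shape)                  # (2,)
```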
So where it gets interesting then is instead of just having a bunch of weight matrices and 00:23:42.880 |
matrix products, what if sometimes we had convolutions and activations? Because a convolution 00:23:51.120 |
is just a subset of a matrix product, so if you think about it, a matrix product says 00:23:56.200 |
here's 10 activations and then a weight matrix going down to 10 activations. The weight matrix 00:24:03.120 |
goes from every single element of the first layer to every single element of the next layer. 00:24:08.280 |
So if this goes from 10 to 10, there are 100 weights. Whereas a convolution uses just a small subset of those weights, reused across positions. 00:24:17.400 |
So I'll let you think about this during the week because it's a really interesting insight 00:24:20.480 |
to think about that a convolution is identical to a fully connected layer, but it's just 00:24:27.360 |
a subset of the weights. And so therefore everything we learned about stacking linear 00:24:35.680 |
and nonlinear layers together applies also to convolutions. 00:24:40.960 |
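Here is one way to see that insight in code, as a toy 1D example of my own (not from the notebook): the convolution's output is identical to a fully connected layer whose weight matrix just reuses the same three weights in each row and has zeros everywhere else.

```python
# Sketch: a small 1D convolution written out as a constrained matrix product
import numpy as np

x = np.array([1., 2., 3., 4., 5.])           # 5 input activations
k = np.array([-1., 1., 0.])                   # a 1D "filter" with 3 weights

# Direct (valid) sliding of the filter along x
conv = np.array([k @ x[i:i+3] for i in range(len(x) - 2)])

# The same thing as a fully connected layer with a mostly-zero weight matrix
W = np.array([[-1.,  1.,  0., 0., 0.],
              [ 0., -1.,  1., 0., 0.],
              [ 0.,  0., -1., 1., 0.]])
dense = W @ x

print(np.allclose(conv, dense))   # True: the convolution is a subset of the matmul's weights
```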
But we also know that convolutions are particularly well-suited to identifying interesting features 00:24:47.240 |
of images. So by using convolutions, it allows us to more conveniently and quickly find powerful features. 00:25:00.440 |
So the spreadsheet will be available for download tomorrow. We're trying to get to the point 00:25:18.200 |
that we can actually get the derivatives to work in the spreadsheet and we're still slightly 00:25:21.280 |
stuck with some of the details, but we'll make something available tomorrow. Are the 00:25:28.240 |
filters the layers? Yes they are. So this is something where spending a lot of time looking 00:25:37.880 |
at simple little convolution examples is really helpful. 00:25:44.240 |
Because for a fully connected layer, it's pretty easy. You can see if I have 3 inputs, 00:25:49.040 |
then my matrix product will have to have 3 rows, otherwise they won't match. And then 00:25:54.840 |
I can create as many columns as I like. And the number of columns I create tells me how 00:25:59.500 |
many activations I create because that's what matrix products do. 00:26:04.640 |
So it's very easy to see how with what Keras calls dense layers, I can decide how big I 00:26:11.040 |
want each activation layer to be. If you think about it, you can do exactly the same thing 00:26:18.560 |
with convolutions. You can decide how many sets of 3x3 matrices you want to create at 00:26:27.320 |
random, and each one will generate a different output when applied to the image. 00:26:34.160 |
So the way that VGG works, for example, so the VGG network, which we learned about in 00:26:48.840 |
Lesson 1, contains a bunch of layers. It contains a bunch of convolutional layers, followed 00:27:01.200 |
by a flatten. And all flatten does is just a Keras thing that says don't think of the 00:27:07.280 |
layers anymore as being x by y by channel matrices, think of them as being a single 00:27:15.480 |
vector. So it just concatenates all the dimensions together, and then it contains a bunch of 00:27:22.600 |
And so each of the convolutional blocks is -- you can kind of ignore the zero padding, 00:27:29.800 |
that just adds zeros around the outside so that your convolutions end up with the same 00:27:34.520 |
number of outputs as inputs. It contains 2D convolutions, followed by max pooling at the end of the block, and we'll review max pooling shortly. 00:27:44.840 |
You can see that it starts off with 2 convolutional layers with 64 filters, and then 2 convolutional 00:27:52.360 |
layers with 128 filters, and then 3 convolutional layers with 256 filters. And so you can see 00:27:56.960 |
what it's doing is it's gradually creating more and more filters in each layer. 00:28:08.040 |
These definitions of block are specific to VGG, so I just created -- this is just me 00:28:12.960 |
refactoring the model so there wasn't lots and lots of lines of code. So I just didn't 00:28:17.080 |
want to retype lots of code, so I kind of found that these lines of code were being repeated, and pulled them out into a function. 00:28:24.660 |
So why would we have the number of filters increasing? Well, the best way to understand 00:28:31.560 |
a model is to use the summary command. So let's go back to lesson 1. 00:28:51.000 |
So let's go right back to our first thing we learned, which was the 7 lines of code 00:28:56.560 |
that you can run in order to create and train a network. I won't wait for it to actually 00:29:07.520 |
finish training, but what I do want to do now is go vgg.model.summary. So anytime you're 00:29:18.560 |
creating models, it's a really good idea to use the summary command to look inside them 00:29:25.680 |
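In code, that inspection looks something like the following (assuming the course's vgg16.py from lesson 1 is importable; this is a sketch, not a new API):

```python
# Sketch: inspecting the VGG model's layers with Keras' summary()
from vgg16 import Vgg16   # the little wrapper class from lesson 1

vgg = Vgg16()
vgg.model.summary()       # prints every layer, its output shape, and its parameter count
```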
So here we can see that the input to our model has 3 channels, red, green and blue, and they 00:29:32.840 |
are 224x224 images. After I do my first 2D convolution, I now have 64 channels of 224x224. 00:29:47.320 |
So I've replaced my 3 channels with 64, just like here I've got 8 different filters, here 00:29:58.160 |
I've got 64 different filters because that's what I asked for. 00:30:07.360 |
So again we have a second convolutional layer with 64 filters, still 224x224, and then we do max pooling. 00:30:17.800 |
So max pooling, remember from lesson 0, was this thing where we simplified things. So 00:30:25.440 |
we started out with these 28x28 images and we said let's take each 7x7 block and replace 00:30:33.620 |
that entire 7x7 block with a single pixel which contains the maximum pixel value. So 00:30:40.880 |
here is this 7x7 block which is basically all gray, so we end up with a very low number 00:30:48.060 |
here. And so instead of being 28x28, it becomes 4x4 because we are replacing every 7x7 block 00:30:58.760 |
with a single pixel. That's all max pooling does. So the reason we have max pooling is 00:31:06.080 |
it allows us to gradually simplify our image, so that the later layers effectively look at larger and larger areas of the original image. 00:31:16.440 |
So if we look at VGG, after our max pooling layer, we no longer have 224x224, we now 00:31:22.880 |
have 112x112. Later on we do another max pooling, we end up with 56x56. Later on we do another 00:31:34.520 |
max pooling and we end up with 28x28. So each time we do a max pooling we're reducing the 00:31:44.880 |
resolution of our image. As we're reducing the resolution, we need to increase the number 00:31:56.240 |
of filters otherwise we're losing information. So that's really why each time we have a max 00:32:04.000 |
pooling, we then double the number of filters because that means that every layer we're 00:32:09.200 |
keeping the same amount of information content. 00:32:12.320 |
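As a tiny illustration of the pooling operation itself (my own sketch, using a 2x2 pool like VGG rather than the 7x7 example from lesson 0):

```python
# Sketch: max pooling in plain NumPy, replacing each 2x2 block with its maximum
import numpy as np

img = np.arange(16.).reshape(4, 4)                 # a fake 4x4 "image"
pooled = img.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max over each 2x2 block

print(img)
print(pooled)          # 2x2: half the resolution in each direction
```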
So this brings us to a very, very important insight, 00:32:41.760 |
which is that a convolution is position-invariant. So in other words, this thing we created which 00:32:47.960 |
is a top edge detector, we can apply that to any part of the image and get top edges from any part of it. 00:33:06.280 |
And earlier on when we looked at that Jason Yosinski video, it showed that there was a 00:33:10.600 |
face detector which could find a face in any part of the image. So this is fundamental to 00:33:16.680 |
how a convolution works. A convolution is position-invariant: it finds a pattern regardless of where in the image it appears. 00:33:25.280 |
Now that is a very powerful idea because when we want to say find a face, we want to be 00:33:32.120 |
able to find eyes. And we want to be able to find eyes regardless of where the face is in the photo. 00:33:40.600 |
So position invariance is important, but also we need to be able to identify position to 00:33:47.000 |
some extent because if there's four eyes in the picture, or if there's an eye in the top 00:33:52.680 |
corner and the bottom corner, then something weird is going on, or if the eyes and the nose are not arranged sensibly relative to each other. 00:33:59.800 |
So how does a convolutional neural network both have this location invariant filter but 00:34:07.960 |
also handle location? And the trick is that every one of the 3x3 filters cares deeply about where things are within the area it looks at. 00:34:20.280 |
And so as we go down through the layers of our model from 224 to 112 to 56 to 28 to 14 00:34:31.080 |
to 7, at each one of these stages (think about this stage which goes from 14x14 to 7x7), 00:34:40.320 |
these filters are now looking at large parts of the image. So it's now at a point where 00:34:44.280 |
it can actually say there needs to be an eye here and an eye here and a nose here. 00:34:52.160 |
So this is one of the cool things about convolutional neural networks. They can find features everywhere 00:34:57.640 |
but they can also build things which care about how features relate to each other positionally. 00:35:29.220 |
So do we need zero padding? Zero padding is literally something that sticks zeros around 00:35:35.120 |
the outside of an image. If you think about what a convolution does, it's taking a 3x3 00:35:41.640 |
and moving it over an image. If you do that, when you get to the edge, what do you do? 00:35:48.960 |
Because at the very edge, you can't move your 3x3 any further. Which means if you only do 00:35:55.140 |
what's called a valid convolution, which means you always make sure your 3x3 filter fits 00:35:59.960 |
entirely within your image, you end up losing 2 pixels from the sides and 2 pixels from the top and bottom. 00:36:09.320 |
There's actually nothing wrong with that, but it's a little inelegant. It's kind of 00:36:15.120 |
nice to be able to half the size each time and be able to see exactly what's going on. 00:36:19.520 |
So people tend to often like doing what's called same convolutions. So if you add a black border 00:36:25.620 |
around the outside, then the result of your convolution is exactly the same size as your 00:36:30.920 |
input. That is literally the only reason to do it. 00:36:34.740 |
In fact, this is a rather inelegant way of going zero padding and then convolution. In 00:36:39.520 |
fact, there's a parameter to nearly every library's convolution function where you can 00:36:44.200 |
say "I want valid", "half" (or "same"), or "full", which basically means do you add no black pixels, 00:36:52.560 |
one black pixel, or two black pixels around the border, assuming it's a 3x3 filter. 00:36:57.240 |
So I don't quite know why this one does it this way. It's really doing two functions 00:37:01.660 |
where one would have done, but it does the job. So there's no right answer to that question. 00:37:12.840 |
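If it helps, here is a rough Keras 1.x sketch of the two equivalent spellings being discussed (the course uses the Theano "channels first" dim ordering; in Keras 2 the argument is padding='same' on Conv2D rather than border_mode):

```python
# Sketch: explicit zero padding vs. a "same" convolution (Keras 1.x style)
from keras.models import Sequential
from keras.layers import ZeroPadding2D, Convolution2D

# The style used in the VGG definition: pad with zeros, then do a (valid) convolution
m1 = Sequential([
    ZeroPadding2D((1, 1), input_shape=(3, 224, 224)),
    Convolution2D(64, 3, 3, activation='relu'),
])

# The more compact alternative: let the convolution layer pad for you
m2 = Sequential([
    Convolution2D(64, 3, 3, activation='relu',
                  border_mode='same', input_shape=(3, 224, 224)),
])

print(m1.output_shape, m2.output_shape)   # both keep the 224x224 spatial size
```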
The question was, do neural networks work for cartoons? They work fine for cartoons. 00:37:19.080 |
However, fine-tuning, which has been fundamental to everything we've learned so far, it's going 00:37:27.240 |
to be difficult to fine-tune from an ImageNet model to a cartoon. Because an ImageNet model 00:37:33.800 |
was built on all those pictures of corn we looked at and all those pictures of dogs we 00:37:37.560 |
looked at. So an ImageNet model has learned to find the kinds of features that are in 00:37:43.480 |
photos of objects out there in the world. And those are very different kinds of photos to 00:37:51.400 |
So if you want to be able to build a cartoon neural network, you'll need to either find 00:37:58.100 |
somebody else who has already trained a neural network on cartoons and fine-tune that, or 00:38:03.600 |
you're going to have to create a really big corpus of cartoons and train your own ImageNet-style model on it. 00:38:24.400 |
So why doesn't an ImageNet network translate to cartoons given that an eye is a circle? 00:38:32.200 |
Because the nuance level of a CNN is very high. It doesn't think of an eye as being just 00:38:40.160 |
a circle. It knows that an eye very specifically has particular gradients and particular shapes 00:38:45.800 |
and particular ways that the light reflects off it and so forth. So when it sees a round 00:38:51.880 |
blob there, it has no ability to abstract that out and say I guess they mean an eye. One 00:39:01.440 |
of the big shortcomings of CNNs is that they can only learn to recognize things that you've shown them examples of. 00:39:10.280 |
If you feed a neural net with a wide range of photos and drawings, maybe it would learn 00:39:18.640 |
about that kind of abstraction. To my knowledge, that's never been done. It would be a very 00:39:23.680 |
interesting question. It must be possible. I'm just not sure how many examples you would 00:39:28.880 |
need and what kind of architecture you would need. 00:39:31.960 |
In this particular example, I used correlate, not convolution. One of the things we briefly 00:39:49.720 |
mentioned in lesson 1 is that convolve and correlate are exactly the same thing, except 00:39:59.680 |
convolve is equal to correlate of an image with a filter that has been rotated by 180 00:40:06.560 |
degrees. So you can see that convolving the image with the filter rotated by 180 degrees looks exactly the 00:40:14.200 |
same, and numpy.allclose is True. So convolve and correlate are identical except that correlate 00:40:24.760 |
is more intuitive: it steps along the rows and down the columns in the same direction as the image, 00:40:33.160 |
whereas convolve flips the filter, so it effectively runs backwards along both. So I tend to prefer to think 00:40:38.280 |
about correlate because it's just more intuitive. 00:40:42.680 |
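That check can be reproduced with something like the following (my own recreation using scipy; the notebook may have done it slightly differently):

```python
# Sketch: convolve == correlate with the filter rotated by 180 degrees
import numpy as np
from scipy.signal import convolve2d, correlate2d

img  = np.random.rand(8, 8)
filt = np.array([[-1., -1., -1.],
                 [ 1.,  1.,  1.],
                 [ 0.,  0.,  0.]])

a = correlate2d(img, filt, mode='valid')
b = convolve2d(img, np.rot90(filt, 2), mode='valid')   # filter rotated 180 degrees

print(np.allclose(a, b))   # True
```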
Convolve originally came really from physics, I think, and it's also a basic math operation. 00:40:49.720 |
There are various reasons that people sometimes find it more intuitive to think about convolution 00:40:54.600 |
but in terms of everything that they can do in a neural net, it doesn't matter which one 00:41:00.360 |
you're using. In fact, many libraries let you set a parameter to true or false to decide 00:41:06.120 |
whether or not internally it uses convolution or correlation, and of course the results are equivalent either way. 00:41:12.080 |
So let's go back to our CNN review. Our network architecture is a bunch of matrix products 00:41:27.720 |
or, more generally, linear layers (remember, a convolution is just a subset of a matrix 00:41:34.040 |
product, so it's also a linear layer): a bunch of matrix products or convolutions stacked 00:41:39.120 |
with alternating nonlinear activation functions. And specifically we looked at the activation 00:41:46.360 |
function which was the rectified linear unit, which is just max(0, x). So that's an incredibly 00:41:55.160 |
simple activation function, but it's by far the most common and it works really well for almost everything. 00:42:04.760 |
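Just to spell that out, the rectified linear unit really is a one-liner (sketch):

```python
# Sketch: the rectified linear unit is literally max(0, x), applied element-wise
import numpy as np

def relu(x):
    return np.maximum(0., x)

print(relu(np.array([-2., -0.5, 0., 1.5, 3.])))   # [0.  0.  0.  1.5 3. ]
```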
I want to introduce one more activation function today, and you can read more about it in Lesson 00:42:10.960 |
2. Let's go down here where it says About Activation Functions. And you can see I've 00:42:26.680 |
got all the details of these activation functions here. I want to talk about one in particular. 00:42:34.720 |
It's called the softmax function, and softmax is defined as follows: e^(x_i) divided by the sum of e^(x_j) over all the outputs. 00:42:46.460 |
What is this all about? Softmax is used not for the middle layers of a deep learning network, 00:42:52.920 |
but for the last layer. The last layer of a neural network, if you think about what it's 00:42:57.360 |
trying to do for classification, it's trying to match to a one-hot encoded output. Remember 00:43:03.760 |
a one-hot encoded output is a vector with all zeros and just a 1 in one spot. For cats 00:43:10.320 |
and dogs we had two spots: the first one was a 1 if it was a cat, the second a 1 if it was a dog. 00:43:19.440 |
So in general, if we're doing classification, we want our output to have one high number 00:43:26.600 |
and all the other ones be low. That's going to make it easier to match the one-hot encoded 00:43:31.680 |
output. Furthermore, we would like to be able to interpret these as probabilities, which means they need to add up to 1. 00:43:39.000 |
So we've got these two requirements here. Our final layer's activations should add to 00:43:42.480 |
one, and one of them should be higher than all the rest. This particular function does 00:43:50.120 |
exactly that, and we will look at that by looking at a spreadsheet. 00:43:55.980 |
So here is an example of what an output layer might contain. Here is e to the power of each of those 00:44:05.720 |
things to the left. Here is the sum of those exponentials. And then here is the thing to 00:44:15.080 |
the left divided by the sum of them, in other words, softmax. And you can see that we start 00:44:23.680 |
with a bunch of numbers that are all of a similar kind of scale. And we end up with 00:44:28.160 |
a bunch of numbers that sum to 1, and one of them is much higher than the others. 00:44:34.280 |
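The same calculation the spreadsheet does, written as a quick NumPy sketch (the output values here are invented):

```python
# Sketch: softmax = e^x_i divided by the sum of e^x_j
import numpy as np

def softmax(z):
    e = np.exp(z)          # the exponential makes the biggest number much bigger
    return e / e.sum()     # dividing by the sum makes the outputs add to 1

out = np.array([1.2, 0.3, 2.5, 0.1, 0.8])   # a made-up final-layer output
p = softmax(out)
print(p, p.sum())          # one value dominates, and they sum to 1
```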
So in general, when we design neural networks, we want to come up with architectures, by 00:44:43.480 |
which I mean convolutions, fully connected layers, activation functions, we want to come 00:44:51.480 |
up with architectures where replicating the outcome we want is as convenient as possible. 00:45:00.380 |
So in this case, our activation function for the last layer makes it pretty convenient, 00:45:05.720 |
pretty easy to come up with something that looks a lot like a one-hot encoded output. 00:45:11.280 |
So the easier it is for our neural net to create the thing we want, the faster it's 00:45:16.260 |
going to get there, and the more likely it is to get there in a way that's quite accurate. 00:45:21.200 |
So we've learned that any big enough, deep enough neural network, because of the Universal 00:45:29.480 |
Approximation Theorem, can approximate any function at all. And we know that Stochastic 00:45:35.600 |
Gradient Descent can find the parameters for any of these, which kind of leaves you thinking 00:45:39.920 |
why do we need 7 weeks of neural network training? Any architecture ought to work. And indeed 00:45:47.520 |
that's true. If you have long enough, any architecture will work. Any architecture can 00:45:53.040 |
translate Hungarian to English, any architecture can recognize cats versus dogs, any architecture 00:45:58.640 |
can analyze Hillary Clinton's emails, as long as it's big enough. However, some of them 00:46:04.480 |
do it much faster than others. They train much faster than others. A bad architecture 00:46:11.560 |
could take so long to train that it doesn't train in the amount of years you have left 00:46:16.040 |
in your lifetime. And that's why we care about things like convolutional neural networks 00:46:21.400 |
instead of just fully connected layers all the way through. That's why we care about 00:46:26.340 |
having a softmax at the last layer rather than just a linear last layer. So we try to 00:46:30.760 |
make it as convenient as possible for our network to create the thing that we want it to create. 00:46:57.800 |
So the first one was? Softmax, just like the other one, is about how Keras internally 00:47:09.800 |
handles these matrices of data. Any more information about that one? 00:47:15.800 |
Honestly, I don't do theoretical justifications, I do intuitive justifications. There is a 00:47:35.760 |
great book for theoretical justifications and it's available for free. If you just google 00:47:40.840 |
for Deep Learning Book, or indeed go to deeplearningbook.org, it actually does have a fantastic theoretical 00:47:48.000 |
justification of why we use softmax. The short version basically is as follows. Softmax contains 00:47:56.400 |
an eta in it, our log-loss layer contains a log in it, the two nicely mesh up against 00:48:05.800 |
each other and in fact the derivative of the two together is just a - b. So that's kind 00:48:12.560 |
of the short version, but I will refer you to the Deep Learning Book for more information 00:48:18.680 |
The intuitive justification is that because we have an exponential here, it makes a big number 00:48:24.320 |
really really big, and therefore once we divide each one by the sum of all of them, we end 00:48:29.960 |
up with one number that tends to be bigger than all the rest, and that is very close 00:48:33.840 |
to the one-hot encoded output that we're trying to match. 00:48:39.960 |
Could a network learn identical filters? A network absolutely could learn identical filters, 00:48:49.840 |
but it won't. The reason it won't is because it's not optimal to. Stochastic gradient descent 00:48:56.200 |
is an optimization procedure. It will come up with, if you train it for long enough, 00:49:01.640 |
with an appropriate learning rate, the optimal set of filters. Having the same filter twice 00:49:07.080 |
is never optimal, that's redundant. So as long as you start off with random weights, then 00:49:15.880 |
it can learn to find the optimal set of filters, which will not include duplicate filters. 00:49:34.200 |
In this review, we've done our different layers, and then these different layers get optimized 00:49:43.360 |
with SGD. Last week we learned about SGD by using this extremely simple example where 00:49:52.960 |
we said let's define a function which is a line, ax + b. Let's create some data that 00:50:01.960 |
matches a line, x's and y's. Let's define a loss function, which is the sum of squared 00:50:10.480 |
errors. We now no longer know what a and b are, so let's start with some guess. Obviously 00:50:18.960 |
the loss is pretty high, and let's now try and come up with a procedure where each step 00:50:26.760 |
makes the loss a little bit better by making a and b a little bit better. The way we did 00:50:32.520 |
that was very simple. We calculated the derivative of the loss with respect to each of a and 00:50:39.840 |
b, and that means that the derivative of the loss with respect to b is, if I increase b 00:50:46.480 |
by a bit, how does the loss change? And the derivative of the loss with respect to a means 00:50:52.440 |
as I change a a bit, how does the loss change? If I know those two things, then I know that 00:50:58.240 |
I should subtract the derivative times some learning rate, which is 0.01, and as long 00:51:06.440 |
as our learning rate is low enough, we know that this is going to make our a guess a little 00:51:11.600 |
bit better. And we do the same for our b guess, it gets a little bit better. 00:51:17.280 |
And so we learned that that is the entirety of SGD. We run that again and again and again, 00:51:23.160 |
and indeed we set up something that would run it again and again and again in an animation 00:51:27.600 |
loop and we saw that indeed it does optimize our line. 00:51:33.920 |
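For reference, here is a small recreation of that lesson 2 SGD-on-a-line experiment in plain NumPy (my own sketch; the notebook's exact code and constants may differ):

```python
# Sketch: fitting y = ax + b with gradient descent, as in the lesson 2 example
import numpy as np

a_true, b_true = 3.0, 8.0                    # the line we'll generate data from
x = np.random.rand(50)
y = a_true * x + b_true

a, b = -1.0, 1.0                             # start with (bad) guesses
lr = 0.01                                    # learning rate

for step in range(10000):
    y_pred = a * x + b
    # derivatives of the mean squared error loss with respect to a and b
    dloss_da = (2 * (y_pred - y) * x).mean()
    dloss_db = (2 * (y_pred - y)).mean()
    a -= lr * dloss_da                       # nudge each parameter a little bit
    b -= lr * dloss_db                       # in the direction that lowers the loss

print(a, b)                                  # approaches 3 and 8
```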
The tricky thing for me with deep learning is jumping from this kind of easy to visualize 00:51:40.480 |
intuition. If I run this little derivative on these two things a bunch of times, it optimizes 00:51:47.960 |
this line, I can then create a set of layers with hundreds of millions of parameters that 00:52:00.480 |
in theory can match any possible function and it's going to do exactly the same thing. 00:52:07.000 |
So this is where our intuition breaks down, which is that this incredibly simple thing 00:52:11.560 |
called SGD is capable of creating these incredibly sophisticated deep learning models. We really 00:52:19.680 |
have to just trust our understanding of the basics of what's going on. We know it's 00:52:24.800 |
going to work, and we can see that it does work. But even when you've trained dozens 00:52:31.520 |
of deep learning models, it's still surprising that it does work. It's always a bit shocking 00:52:36.800 |
when you start without any ability to analyze some problem. You start with some random weights, 00:52:43.440 |
you start with a general architecture, you throw some data in with SGD, and you end up 00:52:47.240 |
with something that works. Hopefully now it makes sense, you can see why that happens. 00:52:53.760 |
But it takes doing it a few times to really intuitively understand, okay, it really does work. 00:53:01.280 |
So one question about softmax: could you use it for multi-class classification? 00:53:10.840 |
And the answer is absolutely yes, you can use softmax for multi-class classification. In fact, 00:53:15.400 |
the example I showed here was such an example. So imagine that these outputs were for cat, 00:53:23.320 |
dog, plane, fish, and building. So these might be what these 5 things represent. So this 00:53:34.040 |
is exactly showing a Softmax for a multi-class output. 00:53:41.160 |
You just have to make sure that your neural net has as many outputs as you want. And to 00:53:47.520 |
do that, you just need to make sure that the last weight layer in your neural net has as 00:53:54.740 |
many columns as you want. The number of columns in your final weight matrix tells you how many outputs you get. 00:54:01.880 |
Okay, multi-label (finding more than one thing in the same image) is not multi-class classification. So if you want to create something that is 00:54:11.200 |
going to find more than one thing, then no, softmax would not be the best way to do that. 00:54:17.400 |
I'm not sure if we're going to cover that in this set of classes. If we don't, we'll come back to it in a later one. 00:54:25.720 |
Let's go back to the question about 3x3 filters, and more generally, how do we pick an architecture? 00:54:49.200 |
So the question was why the VGG authors used 3x3 filters. The 2012 ImageNet winners used a 00:55:00.440 |
combination of 7x7 and 11x11 filters. What has happened over the last few years since 00:55:09.000 |
then is that people have realized that 3x3 filters are just better. The original insight for 00:55:17.560 |
this was actually that Matt Zeiler visualization paper I showed you. It's really worth reading 00:55:23.200 |
that paper because he really shows that by looking at lots of pictures of all the stuff 00:55:27.600 |
going on inside of CNN, it clearly works better when you have smaller filters and more layers. 00:55:34.360 |
I'm not going to go into the theoretical justification as to why, for the sake of applying CNNs, 00:55:39.760 |
all you need to know is that there's really no reason to use anything but 3x3 filters. 00:55:45.880 |
So that's a nice simple rule of thumb which always works, 3x3 filters. 00:55:53.000 |
How many layers of 3x3 filters? This is where there is not any standard agreed-upon technique. 00:56:05.100 |
Reading lots of papers, looking at lots of Kaggle winners, you will over time get a sense 00:56:09.600 |
of for a problem of this level of complexity, you need this many filters. There have been 00:56:16.640 |
various people that have tried to simplify this, but we're really still at a point where 00:56:22.960 |
the answer is try a few different architectures and see what works. The same applies to this kind of question. 00:56:31.600 |
So in general, this idea of having 3x3 filters with max pooling and doubling the number of 00:56:37.800 |
filters each time you do max pooling is a pretty good rule of thumb. How many do you 00:56:43.080 |
start with? You've kind of got to experiment. Actually, we're going to see an example of that today. 00:56:52.440 |
If you had a much larger image, would you still want 3x3 filters? 00:56:59.960 |
If you had a much larger image, what would you do? For example, on Kaggle, there's a diabetic 00:57:04.800 |
retinopathy competition that has some pictures of eyeballs that are quite a high resolution. 00:57:09.320 |
I think they're a couple of thousand by a couple of thousand. The question of how to 00:57:14.160 |
deal with large images is as yet unsolved in the literature. So if you actually look 00:57:20.520 |
at the winners of that Kaggle competition, all of the winners resampled that image down 00:57:26.000 |
to 512x512. So I find that quite depressing. It's clearly not the right approach. I'm pretty 00:57:34.360 |
sure I know what the right approach is. I'm pretty sure the right approach is to do what the human eye does. 00:57:39.200 |
The eye does something called foveation, which means that when I look directly at something, 00:57:44.200 |
the thing in the middle is very high-res and very clear, and the stuff on the outside is much lower resolution. 00:57:52.320 |
I think a lot of people are generally in agreement with the idea that if we could come up with 00:57:56.240 |
an architecture which has this concept of foveation, and then secondly, we need something, 00:58:02.640 |
and there are some good techniques to this already called attentional models. An attentional 00:58:06.600 |
model is something that says, "Okay, the thing I'm looking for is not in the middle of my 00:58:10.440 |
view, but my low-res peripheral vision thinks it might be over there. Let's focus my attention on that area. 00:58:19.800 |
And we're going to start looking at recurrent neural networks next week, and we can use 00:58:25.200 |
recurrent neural networks to build attentional models that allow us to search through a big 00:58:29.720 |
image to find areas of interest. That is a very active area of research, but as yet is 00:58:36.760 |
not really finalized. By the time this turns into a MOOC and a video, I wouldn't be surprised 00:58:44.360 |
if that has been much better solved. It's moving very quickly. 00:58:48.000 |
The Matt Zeiler paper showed larger filters because he was showing what AlexNet, the 2012 00:59:01.360 |
winner, looked like. Later on in the paper, he said based on what it looks like, here 00:59:07.080 |
are some suggestions about how to build better models. 00:59:12.640 |
So let us now finalize our review by looking at fine-tuning. So we learned how to do fine-tuning 00:59:28.840 |
using the little Vgg16 class that I built, which is one line of code, vgg.finetune. We also 00:59:37.520 |
learned how to take 1000 predictions of all the 1000 ImageNet categories and turn them 00:59:45.680 |
into two predictions, which is just a cat or a dog, by building a simple linear model 00:59:53.080 |
that took the 1000 ImageNet category predictions as input, and took the true cat 01:00:14.240 |
and dog labels as output, and we just created a linear model of that. So here is that linear 01:00:30.880 |
model. It's got 1000 inputs and 2 outputs. So we trained that linear model, it took less 01:00:41.240 |
than a second to train, and we got 97.7% accuracy. 01:00:46.780 |
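For reference, that linear model is essentially a single dense layer on top of the 1000 ImageNet predictions. A Keras 1.x sketch (the names trn_preds, trn_labels, val_preds, val_labels are placeholders for the saved prediction arrays and one-hot labels, not actual variables from the notebook):

```python
# Sketch: a linear model from 1000 ImageNet predictions to 2 (cat/dog) outputs
from keras.models import Sequential
from keras.layers import Dense

lm = Sequential([Dense(2, activation='softmax', input_shape=(1000,))])
lm.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# trn_preds/val_preds: arrays of shape (n, 1000); trn_labels/val_labels: one-hot (n, 2)
lm.fit(trn_preds, trn_labels, validation_data=(val_preds, val_labels), nb_epoch=3)
```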
So this was actually pretty effective. So why was it pretty effective to take 1000 predictions 01:00:55.360 |
of is it a cat, is it a fish, is it a bird, is it a poodle, is it a pug, is it a plane, 01:01:01.800 |
and turn it into a cat or is it a dog. The reason that worked so well is because the 01:01:07.600 |
original architecture, the ImageNet architecture, was already trained to do something very similar 01:01:15.960 |
to what we wanted our model to do. We wanted our model to separate cats from dogs, and 01:01:21.080 |
the ImageNet model already separated lots of different cats from different dogs from 01:01:26.040 |
lots of other things as well. So the thing we were trying to do was really just a subset of what the ImageNet model already did. 01:01:33.920 |
So that was why starting with 1000 predictions and building the simple linear model worked 01:01:39.440 |
so well. This week, you're going to be looking at the State Farm competition. And in the 01:01:45.440 |
State Farm competition, you're going to be looking at pictures like this one, and this 01:01:54.840 |
one, and this one. And your job will not be to decide whether or not it's a person or 01:02:03.560 |
a dog or a cat. Your job will be to decide is this person driving in a distracted way 01:02:08.880 |
or not. That is not something that the original ImageNet categories included. And therefore 01:02:16.360 |
this same technique is not going to work this week. 01:02:21.280 |
So what do you do if you need to go further? What do you do if you need to predict something 01:02:27.520 |
which is very different to what the original model did? The answer is to throw away some 01:02:34.280 |
of the later layers in the model and retrain them from scratch. And that's called fine-tuning. 01:02:42.320 |
And so that is pretty simple to do. So if we just want to fine-tune the last layer, 01:02:50.000 |
we can just go model.pop, that removes the last layer. We can then say make all of the 01:02:56.040 |
other layers non-trainable, so that means it won't update those weights, and then add 01:03:01.760 |
a new fully connected layer, dense layer to the end with just our dog and cat, our two 01:03:07.480 |
activations, and then go ahead and fit that model. 01:03:14.320 |
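A rough Keras 1.x sketch of that simplest kind of fine-tuning (assuming model is the pretrained VGG Sequential model, e.g. vgg.model; this is a sketch of the idea, not the exact notebook code):

```python
# Sketch: remove the 1000-way output layer, freeze the rest, add a 2-way output
from keras.layers import Dense
from keras.optimizers import RMSprop

model.pop()                                   # drop the old 1000-category output layer
for layer in model.layers:
    layer.trainable = False                   # freeze everything that remains
model.add(Dense(2, activation='softmax'))     # new cat/dog output layer

model.compile(optimizer=RMSprop(lr=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])
# then fit on batches of the dogs-vs-cats images, e.g. with model.fit_generator(...)
```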
So that is the simplest kind of fine-tuning. Remove the last layer, but previously that 01:03:19.680 |
last layer was going to try and predict 1000 possible categories, and replace it with a 01:03:24.360 |
new last layer which we train. In this case, I've only run it for two epochs, so I'm 01:03:31.120 |
not getting a great result. But if we ran it for a few more, we would get a bit better 01:03:34.260 |
than the 97.7 we had last time. When we look at State Farm, it's going to be critical to use this kind of fine-tuning. 01:03:43.480 |
So how many layers would you remove? Because you don't just have to remove one. In fact, 01:03:48.320 |
if you go back through your lesson 2 notebook, you'll see after this, I've got a section 01:03:55.560 |
called Retraining More Layers. In it, we see that we can take any model and we can say 01:04:12.120 |
okay, let's grab all the layers up to the nth layer. So in this case, we take all the 01:04:18.840 |
layers after the first fully connected layer and set them all to trainable. 01:04:27.280 |
And then what would happen if we tried running that model? 01:04:31.120 |
So with Keras, we can tell Keras which layers we want to freeze and leave them at their 01:04:38.280 |
ImageNet-decided weights, and which layers do we want to retrain based on the data we're now looking at. 01:04:46.840 |
And so in general, the more different your problem is to the original ImageNet 1000 categories, 01:04:53.320 |
the more layers you're going to have to retrain. 01:04:56.560 |
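In code, retraining more layers looks roughly like this (a Keras 1.x sketch along the lines of the lesson 2 "Retraining More Layers" section; the choice of cut-off layer is illustrative):

```python
# Sketch: freeze the convolutional layers, retrain everything from the first dense layer on
from keras.layers import Dense

layers = model.layers
first_dense_idx = [i for i, l in enumerate(layers) if isinstance(l, Dense)][0]

for layer in layers[:first_dense_idx]:
    layer.trainable = False       # keep these at their ImageNet weights
for layer in layers[first_dense_idx:]:
    layer.trainable = True        # retrain all the dense layers

model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
```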
So how do you decide how far to go back in the layers? Two ways. Way number 1, intuition. 01:05:19.160 |
So have a look at something like those Matt Zeiler visualizations to get a sense of what semantic 01:05:24.840 |
level each of those layers is operating at. And go back to the point where you feel like 01:05:31.720 |
that level of meaning is going to be relevant to your model. 01:05:37.920 |
Method number 2, experiment. It doesn't take that long to train another model starting 01:05:43.880 |
at a different point. I generally do a bit of both. When I know dogs and cats are subsets 01:05:52.920 |
of the ImageNet categories, I'm not going to bother generally training more than one replacement layer. 01:06:01.080 |
For State Farm, I really had no idea. I was pretty sure I wouldn't have to retrain any 01:06:07.280 |
of the convolutional layers because the convolutional layers are all about spatial relationships. 01:06:14.160 |
And therefore a convolutional layer is all about recognizing how things in space relate 01:06:17.960 |
to each other. I was pretty confident that figuring out whether somebody is looking at 01:06:23.440 |
a mobile phone or playing with their radio is not going to use different spatial features. 01:06:29.740 |
So for State Farm, I've really only looked at retraining the dense layers. 01:06:35.280 |
And in VGG, there are actually only three dense layers: 01:06:44.900 |
the two intermediate layers and the output layer, so I just retrained all three. 01:06:53.840 |
Generally speaking, the answer to this is try a few things and see what works the best. 01:07:04.720 |
When we retrain the layers, we do not set the weights randomly. We start the weights 01:07:09.560 |
at their optimal ImageNet levels. That means that if you retrain more layers than you really 01:07:18.880 |
need to, it's not a big problem because the weights are already at the right point. 01:07:24.640 |
If you randomized the weights of the layers that you're retraining, that would actually 01:07:31.160 |
kill the earlier layers as well if you made them trainable. There's no point really setting 01:07:38.840 |
them to random most of the time. We'll be learning a bit more about that after the break. 01:07:45.360 |
So far, we have not reset the weights. When we say layer.trainable = true, we're just 01:07:52.880 |
telling Keras that when you say fit, I want you to actually use SGD to update the weights in that layer. 01:08:10.480 |
When we come back, we're going to be talking about how to go beyond these basic five pieces 01:08:17.960 |
to create models which are more accurate. Specifically, we're going to look at avoiding underfitting and overfitting. 01:08:28.480 |
Next week, we're going to be doing half a class on review of convolutional neural networks 01:08:35.080 |
and half a class of an introduction to recurrent neural networks which we'll be using for language. 01:08:40.680 |
So hopefully by the end of this class, you'll be feeling ready to really dig deep into CNNs 01:08:48.760 |
during the week. This is really the right time this week to make sure that you're asking 01:08:53.120 |
questions you have about CNNs because next week we'll be wrapping up this topic. 01:09:07.000 |
So we have a lot to cover in our next 55 minutes. I think this approach of doing the new material 01:09:14.440 |
quickly, so that you can review it in the lesson notebook and on the video by experimenting 01:09:19.920 |
during the week, and then review it the next week, is fine. I think that's a good approach. 01:09:24.760 |
But I just want to make you aware that the new material of the next 55 minutes will move pretty quickly. 01:09:32.200 |
So don't worry too much if not everything sinks in straight away. If you have any questions, 01:09:37.640 |
of course, please do ask. But also, recognize that it's really going to sink in as you study 01:09:45.400 |
it and play with it during the week, and then next week we're going to review all of this. 01:09:49.520 |
So if it's still not making sense, and of course you've asked your questions on the forum, 01:09:53.320 |
it's still not making sense, we'll be reviewing it next week. 01:09:56.640 |
So if you don't retrain a layer, does that mean the layer remembers what gets saved? 01:10:21.480 |
So yes, if you don't retrain a layer, then when you save the weights, it's going to contain 01:10:25.720 |
the weights that it originally had. That's a really important question. 01:10:31.720 |
Why would we want to start out by overfitting? We're going to talk about that next. 01:10:38.320 |
The last conv layer in VGG has a 7x7 output. There are 49 boxes and each one has 512 different 01:10:45.440 |
things. That's kind of right, but it's not that it recognizes 512 different things. When 01:10:51.160 |
you have a convolution on a convolution on a convolution on a convolution on a convolution, 01:10:55.280 |
you have a very rich function with hundreds of thousands of parameters. So it's not that 01:11:02.120 |
it's recognizing 512 things, it's that there are 512 rich complex functions. And so those 01:11:11.200 |
rich complex functions can recognize rich complex concepts. 01:11:18.100 |
So for example, we saw in the video that even in layer 6 there's a face detector which can 01:11:25.560 |
recognize cat faces as well as human faces. So the later on we get in these neural networks, 01:11:33.520 |
the harder it is to even say what it is that's being found because they get more and more 01:11:39.080 |
sophisticated and complex. So what those 512 things do in the last layer of VGG, I'm not 01:11:49.080 |
sure that anybody's really got to a point that they could tell you that. 01:11:53.480 |
I'm going to move on. The next section is all about making our model better. So at this 01:12:22.120 |
point, we have a model with an accuracy of 97.7%. So how do we make it better? 01:12:31.760 |
Now because we have started with an existing model, a VGG model, there are two reasons 01:12:46.160 |
that you could be less good than you want to be. Either you're underfitting or you're 01:12:51.960 |
overfitting. Underfitting means that, for example, you're using a linear model to try 01:12:59.060 |
to do image recognition. You're using a model that is not complex and powerful enough for 01:13:05.640 |
the thing you're doing or it doesn't have enough parameters for the thing you're doing. 01:13:10.240 |
That's what underfitting is. Overfitting means that you're using a model with too many parameters 01:13:17.760 |
that you've trained for too long without using any of the techniques or without correctly 01:13:22.440 |
using the techniques you're about to learn about, such that you've ended up learning what 01:13:26.800 |
your specific training pictures look like rather than what the general patterns in them 01:13:33.840 |
look like. You will recognize overfitting if your training set has a much higher accuracy 01:13:43.560 |
than your test set or your validation set. So that means you've learned how to recognize 01:13:49.520 |
the contents of your training set too well. And so then when you look at your validation 01:13:55.200 |
set you get a less good result. So that's overfitting. 01:14:00.120 |
I'm not going to go into detail on this because any of you who have done any machine learning 01:14:03.560 |
have seen this before, so any of you who haven't, please look up overfitting on the internet, 01:14:10.920 |
learn about it, ask questions about it. It is perhaps the most important single concept in machine learning. 01:14:17.960 |
So it's not that we're not covering it because it's not interesting, it's just that we're 01:14:20.800 |
not covering it because I know a lot of you are already familiar with it. Underfitting 01:14:26.840 |
we can see in the same way, but it's the opposite: if our training accuracy is much lower than our validation accuracy, we're underfitting. 01:14:41.240 |
So I'm going to look at this now, because in fact you might have noticed that in all of 01:14:45.240 |
our models so far, our training accuracy has been lower than our validation accuracy, which is surprising. 01:14:56.360 |
So how is this possible? And the answer to how this is possible is because the VGG network 01:15:03.240 |
includes something called dropout, and specifically dropout with a p of 0.5. What does dropout 01:15:10.880 |
mean with a p of 0.5? It means that at this layer, which happens at the end of every fully 01:15:16.700 |
connected block, it deletes 50% of the activations at random. It sets them to 01:15:27.640 |
0. That's what a dropout layer does: it sets half of the activations to 0 at random. 01:15:36.680 |
Why would it do that? Because when you randomly throw away bits of the network, it means that 01:15:43.120 |
the network can't learn to overfit. It can't learn to build a network that just learns 01:15:48.880 |
about your images, because as soon as it does, you throw away half of it and suddenly it's 01:15:53.640 |
not working anymore. So dropout is a fairly recent development, I think it's about three 01:15:58.960 |
years old, and it's perhaps the most important development of the last few years. Because 01:16:06.040 |
it's the thing that now means we can train big, complex models for long periods of time without overfitting. 01:16:15.920 |
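For reference, dropout is just a standard Keras layer. A minimal sketch of what it looks like inside a Sequential model definition (the 4096 is simply VGG's dense-layer size, used for illustration):

```python
from keras.layers.core import Dense, Dropout

model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))   # at training time, sets a random half of these activations to zero
```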
But in this case, it seems that we are using too much dropout. So the VGG network, which 01:16:24.200 |
used a dropout of 0.5, they decided they needed that much in order to avoid overfitting ImageNet. 01:16:30.840 |
But it seems for our cats and dogs, it's underfitting. So what do we do? The answer is, let's try 01:16:39.360 |
removing dropout. So how do we remove dropout? And this is where it gets fun. 01:16:46.960 |
We can start with our VGG fine-tuned model. And I've actually created a little function 01:16:52.400 |
called VGG fine-tuned, which creates a VGG fine-tuned model with two outputs. It looks 01:17:02.060 |
exactly like you would expect it to look. It creates a VGG model, it fine-tunes it, it 01:17:14.320 |
returns it. What does fine-tune do? It does exactly what we've learnt. It pops off the 01:17:24.480 |
last layer, sets all the rest of the layers to non-trainable, and adds a new dense layer. 01:17:30.600 |
So I just create a little thing that does all that. Every time I start writing the same 01:17:35.680 |
code more than once, I stick it into a function and use it again in the future. It's good practice. 01:17:42.360 |
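A sketch of what such a helper might look like. Vgg16 is the course's wrapper class around the VGG model; the function name, optimizer settings, and weight-file path here are assumptions for illustration, not necessarily what the lesson notebook uses.

```python
from keras.layers.core import Dense
from keras.optimizers import Adam

def vgg_ft(num_classes):
    vgg = Vgg16()                      # the course's wrapper around the VGG16 model
    model = vgg.model
    model.pop()                        # remove the old 1000-way ImageNet softmax
    for layer in model.layers:
        layer.trainable = False        # keep the pretrained weights fixed
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(optimizer=Adam(lr=0.001),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

model = vgg_ft(2)                                  # two outputs: cat and dog
model.load_weights(model_path + 'finetune1.h5')    # reload the weights saved earlier (path is illustrative)
```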
I then load the weights that I just saved in my last model, so I don't have to retrain 01:17:48.520 |
it. So saving and loading weights is a really helpful way of avoiding having to refit things. 01:17:54.180 |
So already I now have a model that fits cats and dogs with 97.7% accuracy and underfits. 01:18:05.460 |
We can grab all of the layers of the model and we can then enumerate through them and 01:18:11.920 |
find the last one which is a convolution. So let's remind ourselves, model.summary. So 01:18:25.360 |
that's going to enumerate through all the layers and find the last one that is a convolution. 01:18:33.480 |
So at this point, we now have the index of the last convolutional layer. It turns out 01:18:41.600 |
to be 30. So we can now grab that last convolutional layer. 01:18:46.960 |
And so what we want to try doing is removing dropout from all the rest of the layers. So 01:18:53.160 |
after the last convolutional layer, 01:19:05.400 |
everything that follows is a dense layer. 01:19:10.560 |
So this is a really important concept in the Keras library of playing around with layers. 01:19:25.560 |
And so spend some time looking at this code and really look at the inputs and the outputs 01:19:30.680 |
and get a sense of it. So you can see here: here are all the layers up to and including the last convolutional 01:19:38.240 |
layer, and here are all of the layers after the last convolutional layer, which are the fully 01:19:44.120 |
connected layers. I can create a whole new model that contains just the convolutional layers. 01:19:53.360 |
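A sketch of that splitting step (Keras 1-style). The index 30 mentioned a little further on is what this enumeration turns out to produce for this VGG model.

```python
from keras.models import Sequential
from keras.layers.convolutional import Convolution2D

layers = model.layers
# index of the last convolutional layer (it turns out to be 30 for this VGG model)
last_conv_idx = [i for i, layer in enumerate(layers) if type(layer) is Convolution2D][-1]

conv_layers = layers[:last_conv_idx + 1]   # everything up to and including the last conv layer
fc_layers   = layers[last_conv_idx + 1:]   # the pooling / flatten / dense / dropout layers after it

# a model made of just the convolutional part; we only ever call predict() on it, never fit()
conv_model = Sequential(conv_layers)
```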
Why would I do that? Because if I'm going to remove dropout, then clearly I'm going 01:20:00.180 |
to want to fine-tune all of the layers that involve dropout. That is, all of the dense 01:20:05.560 |
layers. I don't need to fine-tune any convolutional layers because none of the convolutional layers 01:20:11.180 |
have dropout. I'm going to save myself some time. I'm going to pre-calculate the output of the convolutional layers. 01:20:26.980 |
So you see this model I've built here, this model that contains all the convolutional 01:20:32.800 |
layers. If I pre-calculate the output of that, then that's the input to the dense layers 01:20:43.040 |
So you can see what I do here is I say conv_model.predict with my validation batches, and conv_model.predict 01:20:52.720 |
with my training batches, and that now gives me the output of the convolutional layers for my training set 01:20:59.800 |
and the output of them for my validation set. And because that's something I don't want to have 01:21:08.640 |
to recalculate every time, I save it to disk. So here I'm just going to go load_array, and that's going to load from the disk the output 01:21:17.640 |
of that. And so I'm going to say train_features.shape, and this is always the first thing that you 01:21:22.520 |
want to do when you've built something, is look at its shape. And indeed, it's what we 01:21:27.400 |
would expect. It is 23,000 images, each one 512 x 14 x 14, because I didn't include the final max pooling layer. 01:21:39.760 |
And so indeed, if we go model.summary, we should find that the last convolutional layer, here 01:21:50.580 |
it is, 512 filters, 14x14 dimension. So we have basically built a model that is just 01:22:00.600 |
a subset of VGG containing all of these earlier layers. We've run it through our test set 01:22:07.480 |
and our validation set, and we've got the outputs. So that's the stuff that we want 01:22:12.720 |
to fix, and so we don't want to recalculate that every time. 01:22:17.920 |
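A sketch of that pre-computation and caching. predict_generator here uses the Keras 1 signature; save_array and load_array are the course's bcolz-based helpers from utils.py, and the file names and model_path are illustrative. It assumes batches and val_batches were created with shuffle=False so the cached features stay lined up with their labels.

```python
val_features = conv_model.predict_generator(val_batches, val_batches.nb_sample)
trn_features = conv_model.predict_generator(batches, batches.nb_sample)

save_array(model_path + 'train_convlayer_features.bc', trn_features)
save_array(model_path + 'valid_convlayer_features.bc', val_features)

# later, reload instead of recomputing:
trn_features = load_array(model_path + 'train_convlayer_features.bc')
print(trn_features.shape)    # e.g. (23000, 512, 14, 14)
```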
So now we create a new model which is exactly the same as the dense part of VGG, but we 01:22:24.840 |
replace the dropout P with 0. So here's something pretty interesting, and I'm going to let you 01:22:32.160 |
guys think about this during the week. How do you take the previous weights from VGG 01:22:41.520 |
and put them into this model where dropout is 0? So if you think about it, before we 01:22:46.480 |
had dropout of 0.5, so half the activations were being deleted at random. So since half 01:22:52.440 |
the activations are being deleted at random, now that I've removed dropout, I effectively 01:22:57.400 |
have twice as many weights being active. Since I have twice as many weights being active, 01:23:02.840 |
I need to take my imageNet weights and divide them by 2. So by taking my imageNet weights 01:23:10.360 |
and copying them across, so I take my previous weights and copy them across to my new model, 01:23:15.360 |
each time divide them by 2, that means that this new model is going to be exactly as accurate 01:23:20.960 |
as my old model before I start training, but it has no dropout. 01:23:26.520 |
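A sketch of that weight-copying trick. get_fc_model is a hypothetical builder that recreates VGG's dense block (flatten, two 4096 dense layers, output layer) with a given dropout p; the course notebook has something along these lines, and the zip assumes the two models have the same layer structure.

```python
def proc_wgts(layer):
    # halve every weight array, to compensate for the activations dropout no longer removes
    return [w / 2 for w in layer.get_weights()]

fc_model = get_fc_model(p=0.0)                      # same dense block, but with dropout p set to 0
for new_layer, old_layer in zip(fc_model.layers, fc_layers):
    new_layer.set_weights(proc_wgts(old_layer))     # layers with no weights just copy an empty list
```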
Is it wasteful to have in the cats and dogs model filters that are being learnt to find 01:23:45.720 |
things like bookshelves? Potentially it is, but it's okay to be wasteful. The only place 01:23:54.040 |
that it's a problem is if we are overfitting. And if we're overfitting, then we can easily fix that. 01:24:03.440 |
So let's try this. We now have a model which takes the output of the convolutional layers 01:24:10.000 |
as input, gives us our cats vs. dogs as output, and has no dropout. So now we can just go 01:24:18.200 |
ahead and fit it. So notice that the input to this is my 512 x 14 x 14 inputs. My outputs 01:24:28.720 |
are my cats and dogs as usual, and train it for a few epochs. 01:24:33.760 |
And here's something really interesting. Dense layers take very little time to compute. A 01:24:41.800 |
convolutional layer takes a long time to compute. Think about it, you're computing 512 x 3 x 01:24:48.160 |
3 x 512 filters. For each of 14 x 14 spots, that is a lot of computation. 01:24:58.760 |
So in a deep learning network, your convolutional layers is where all of your computation is 01:25:04.120 |
being taken up. So look, when I train just my dense layers, it's only taking 17 seconds. 01:25:09.440 |
Super fast. On the other hand, the dense layers is where all of your memory is taken up. Because 01:25:15.160 |
between this 4096 layer and this 4096 layer, there are 4000 x 4000 = 16 million weights. 01:25:24.160 |
And between the previous layer, which was 512 x 7 x 7 after max pooling, that's 25,088. 01:25:31.880 |
There are 25,088 x 4096 weights. So this is a really important rule of thumb: your dense 01:25:39.600 |
layers are where your memory is taken up, and your convolutional layers are where your computation goes. 01:25:46.200 |
So it took me a minute or so to run 8 epochs. That's pretty fast. And holy shit, look at 01:25:52.560 |
that! 98.5%. So you can see now, I am overfitting. But even though I'm overfitting, I am doing 01:26:06.680 |
So overfitting is only bad if you're doing it so much that your accuracy is bad. So in 01:26:14.920 |
this case, it looks like actually this amount of overfitting is pretty good. So for cats 01:26:20.480 |
and dogs, this is about as good as I've gotten. And in fact, if I'd stopped it a little earlier, 01:26:26.960 |
you can see it was really good. In fact, the winner was 98.8, and here I've got 98.75. 01:26:34.680 |
And there are some tricks I'll show you later that reliably give you a bit of extra accuracy. 01:26:39.520 |
So this would definitely have won cats and dogs if we had used this model. 01:26:43.280 |
Question - Can you perform dropout on a convolutional layer? 01:26:48.040 |
You can absolutely perform dropout on a convolutional layer. And indeed, nowadays people normally 01:26:52.920 |
do. I don't quite remember the VGG days. I guess that was 2 years ago. Maybe people weren't doing it back then. 01:26:59.440 |
Nowadays, the general approach would be you would have dropout of 0.1 before your first 01:27:04.760 |
layer, dropout of 0.2 before this one, 0.3, 0.4, and then finally dropout of 0.5 before 01:27:10.040 |
your fully connected layers. It's kind of the standard. 01:27:13.800 |
If you then find that you're underfitting or overfitting, you can modify all of those 01:27:18.080 |
probabilities by the same amount. If you dropout in an early layer, you're losing that information 01:27:27.840 |
for all of the future layers, so you don't want to drop out too much in the early layers. 01:27:32.520 |
You can feel better dropping out more in the later layers. 01:27:37.240 |
This is how you manually tune your overfitting or underfitting. Another way to do it would 01:27:57.400 |
be to modify the architecture to have fewer or more filters. But that's actually pretty 01:28:03.480 |
difficult to do. Question: so is the point that we didn't need dropout anyway? Perhaps it was. 01:28:10.760 |
But VGG comes with dropout. So when you're fine-tuning, you start with what you start with. 01:28:19.920 |
We are overfitting here, so my hypothesis is that we maybe should try a little less 01:28:24.400 |
dropout. But before we do, I'm going to show you some better tricks. 01:28:30.280 |
The first trick I'm going to show you is a trick that lets you avoid overfitting without 01:28:35.240 |
deleting information. Dropout deletes information, so we don't want to do it unless we have to. 01:28:40.600 |
So instead of dropout, here is a list. You guys should refer to this every time you're 01:28:49.760 |
overfitting. 5 steps. Step 1, add more data. This is a Kaggle competition, so we can't do that. 01:28:57.720 |
Step 2, use data augmentation, which we're about to learn. 01:29:02.040 |
Step 3, use more generalizable architectures. We're going to learn that after this. 01:29:07.480 |
Step 4, add regularization. That generally means dropout. There's another type of regularization 01:29:13.560 |
which is where you basically add up all of your weights, the value of all of your weights, 01:29:22.360 |
and then multiply it by some small number, and you add that to the loss function. Basically 01:29:26.960 |
you say having higher weights is bad. That's called either L2 regularization, if you take 01:29:34.320 |
the square of your weights and add them up, or L1 regularization if you take the absolute 01:29:40.600 |
values. Keras supports that as well; it's also popular, and there's a small sketch of it after this list. I don't think anybody has a great sense of 01:29:49.760 |
when do you use L1 and L2 regularization and when do you use dropout. I use dropout pretty 01:29:56.040 |
much all the time, and I don't particularly see why you would need both, but I just wanted 01:30:00.460 |
to let you know that that other type of regularization exists. 01:30:05.520 |
And then lastly, if you really have to reduce architecture complexity, so remove some filters. 01:30:10.560 |
But that's pretty hard to do if you're fine-tuning, because how do you know which filters to remove? 01:30:16.520 |
So really, it's the first four. Now that we have dropout, the first four are what we do in practice. 01:30:23.440 |
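As promised above, a minimal sketch of weight regularization in Keras 1 (as used in the course); in Keras 2 the argument is kernel_regularizer rather than W_regularizer, and the coefficients here are illustrative.

```python
from keras.layers.core import Dense
from keras.regularizers import l2, l1

model.add(Dense(4096, activation='relu', W_regularizer=l2(1e-4)))   # sum of squared weights added to the loss
model.add(Dense(4096, activation='relu', W_regularizer=l1(1e-5)))   # sum of absolute weights added to the loss
```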
Like in Random Forests, where we randomly select subsets of variables at each point, 01:30:41.160 |
that's kind of what dropout is doing. Dropout is randomly throwing away half the activations, 01:30:46.120 |
so dropout and random forests both effectively create large ensembles. It's actually a fantastic analogy. 01:30:59.560 |
So just like when we went from decision trees to random forests, it was this huge step which 01:31:05.640 |
was basically create lots of decision trees with some random differences. Dropout is effectively 01:31:10.440 |
creating lots of, automatically, lots of neural networks with different subsets of features 01:31:19.840 |
Data augmentation is very simple. Data augmentation is something which takes a cat and turns it 01:31:27.160 |
into lots of cats. That's it. Actually, it does it for dogs as well. You can rotate, 01:31:38.160 |
you can flip, you can move up and down, left and right, zoom in and out. And in Keras, 01:31:44.920 |
you do it with the ImageDataGenerator: rather than the empty ImageDataGenerator() we've always used before, 01:31:52.360 |
now we say all these other things. Flip it horizontally at random, zoom in a bit at random, 01:31:58.600 |
shear at random, rotate at random, move it left and right at random, and move it up and 01:32:02.480 |
down at random. So once you've done that, then when you create your batches, rather than 01:32:12.640 |
doing it the way we did it before, you simply add that to your batches. So we said, Ok, this 01:32:22.160 |
is our data generator, and so when we create our batches, use that data generator, the augmenting one. 01:32:30.780 |
Very important to notice, the validation set does not include that. Because the validation 01:32:36.120 |
set is the validation set. That's the thing we want to check against, so we shouldn't 01:32:39.400 |
be fiddling with that at all. The validation set has no data augmentation and no shuffling. 01:32:44.400 |
It's constant and fixed. The training set, on the other hand, we want to move it around 01:32:49.460 |
as much as we can. So shuffle its order and add all these different types of augmentation. 01:32:55.680 |
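A sketch of augmented training batches versus plain validation batches (Keras 1-style flow_from_directory). The augmentation ranges and the path variable are illustrative, not a transcription of the notebook.

```python
from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator(rotation_range=10, width_shift_range=0.1, height_shift_range=0.1,
                         shear_range=0.05, zoom_range=0.1, horizontal_flip=True)
batches = gen.flow_from_directory(path + 'train', target_size=(224, 224),
                                  class_mode='categorical', batch_size=64, shuffle=True)

# the validation set gets no augmentation and no shuffling
val_batches = ImageDataGenerator().flow_from_directory(path + 'valid', target_size=(224, 224),
                                  class_mode='categorical', batch_size=64, shuffle=False)
```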
How much augmentation to use? This is one of the things that Rachel and I would love 01:32:59.880 |
to automate. For now, there are two methods. Method 1: use your intuition. The best way to use your intuition 01:33:06.540 |
is to take one of your images, add some augmentation, and check whether they still look like cats. 01:33:15.020 |
So if it's so warped that you're like, "Ok, nobody takes a photo of a cat like that," 01:33:19.860 |
you've done it wrong. So this is kind of like a small amount of data augmentation. 01:33:25.560 |
Method 2, experiment. Try a range of different augmentations and see which one gives you 01:33:29.600 |
the best results. If we add some augmentation, everything else is exactly the same, except 01:33:38.520 |
we can't pre-compute anything anymore. So earlier on, we pre-computed the output of 01:33:43.520 |
the last convolutional layer. We can't do that now, because every time this cat approaches 01:33:49.040 |
our neural network, it's a little bit different. It's rotated a bit, it's flipped, it's moved 01:33:53.920 |
around or it's zoomed in and out. So unfortunately, when we use data augmentation, we can't pre-compute 01:34:00.440 |
anything and so things take longer. Everything else is the same though. So we grab our fully 01:34:05.800 |
connected model, we add it to the end of our convolutional model, and this is the one with 01:34:10.760 |
our dropout, compile it, fit it, and now rather than taking 9 seconds per epoch, it takes 01:34:18.600 |
273 seconds per epoch, because it has to calculate through all the convolutional layers for every freshly augmented image. 01:34:29.680 |
So in terms of results here, we have not managed to get back up to that 98.7 accuracy. I probably 01:34:40.280 |
would have if I'd run a few more epochs, but if I keep running them, again, I start overfitting. So it's 01:34:51.160 |
a little hard to tell because my validation accuracy is moving around quite a lot because 01:34:55.920 |
my validation sets a little bit on the small side. It's a little bit hard to tell whether 01:35:00.520 |
this data augmentation is helping or hindering. I suspect what we're finding here is that 01:35:07.840 |
maybe we're doing too much data augmentation, so if I went back and reduced my different 01:35:17.440 |
ranges by say half, I might get a better result than this. But really, this is something to 01:35:24.720 |
experiment with and I had better things to do than experiment with this. But you get the idea. 01:35:30.640 |
Data augmentation is something you should always do. There's never a reason not to use 01:35:37.340 |
data augmentation. The question is just what kind and how much. So for example, what kind? 01:35:43.120 |
Should you flip x, y? So clearly, for dogs and cats, no. You pretty much never see a 01:35:50.180 |
picture of an upside down dog. So would you do vertical flipping in this particular problem? 01:35:57.520 |
No you wouldn't. Would you do rotations? Yeah, you very often see cats and dogs that are 01:36:03.040 |
kind of on their hind legs or the photos taken a little bit uneven or whatever. You certainly 01:36:06.820 |
would have zooming because sometimes you're close to the dog, sometimes further away. 01:36:10.260 |
So use your intuition to think about what kind of augmentation. 01:36:15.480 |
Question: what about data augmentation for color? That's an excellent point. So something 01:36:22.680 |
I didn't add to this, but I probably should have, is that there is a channel augmentation 01:36:30.000 |
parameter for the data generator in Keras. And that will slightly change the colors. 01:36:35.380 |
That's a great idea for natural images like these because you have different white balance, 01:36:40.420 |
you have different lighting and so forth. And indeed I think that would be a great idea. 01:36:45.440 |
So I hope during the week people will take this notebook and somebody will tell me what 01:36:51.460 |
is the best result they've got. And I bet that the winning data augmentation will include some channel shifting. 01:37:01.500 |
Question: if you changed all the images to monochrome images, would it still work? 01:37:16.380 |
So, changing them to black and white. No, that wouldn't work as well, because the Kaggle competition test 01:37:19.980 |
set is in color. So if you're throwing away color, you're throwing away information. And 01:37:25.260 |
figuring out whether something is -- but the Kaggle competition is saying is this a cat 01:37:32.260 |
or is this a dog? And part of seeing whether something is a cat or a dog is looking at 01:37:35.860 |
what color it is. So if you're throwing away the color, you're making that harder. So yeah, 01:37:39.780 |
you could run it on the test set and get answers, but they're going to be less accurate because you've thrown away information. 01:37:46.380 |
Question: how is it working, since you've removed the flatten 01:37:57.100 |
layer between the conv block and the dense layers? 01:37:58.100 |
Okay, so what happened to the flattened layer? And the answer is that it was there. Where 01:38:02.180 |
was it? Oh gosh. I forgot to add it back to this one. So I actually changed my mind about 01:38:11.900 |
whether to include the flattened layer and where to put it and where to put max pooling. 01:38:15.620 |
It will come back later. So this is a slightly old version. Thank you for picking it up. 01:38:19.780 |
Could you do a form of dropout on the raw images by randomly blanking out pieces of 01:38:27.260 |
Yeah, so can you do dropout on the raw images? The simple answer is yes, you could. There's 01:38:32.660 |
no reason I can't put a dropout layer right here. And that's going to drop out raw pixels. 01:38:38.340 |
It turns out that's not a good idea. Throwing away input information is very different to 01:38:44.540 |
throwing away modeled information. Throwing away modeled information is letting you effectively 01:38:49.780 |
avoid overfitting the model. But throwing away the input data loses information you can never get back, so you don't want to do dropout on the raw pixels. 01:39:08.140 |
To clarify, the augmentation is at random. I just showed you 8 examples of the augmentation. 01:39:14.300 |
So what the augmentation does is it says at random, rotate by up to 20 degrees, move by 01:39:20.740 |
up to 10% in each direction, shear by up to 5%, zoom by up to 10%, and flip at random 01:39:26.140 |
half the time. So then I just said, OK, here are 8 cats. But what happens is every single 01:39:31.660 |
time an image goes into the batch, it gets randomized. So effectively, it's an infinite stream of slightly different images. 01:39:40.940 |
That doesn't have anything to do with data augmentation, so maybe we'll discuss that later. 01:39:56.900 |
The final concept to learn about today is batch normalization. Batch normalization, 01:40:03.140 |
like data augmentation, is something you should always do. Why didn't VGG do it? Because it 01:40:09.020 |
didn't exist then. Batch norm is about a year old, maybe 18 months. Here's the basic idea. 01:40:16.620 |
Anybody who's done any machine learning probably knows that one of the first things 01:40:23.900 |
you want to do is take your input data, subtract its mean, and divide by its standard deviation. 01:40:31.380 |
Why is that? Imagine that we had inputs of 40, minus 30, and 1. You can see that the outputs are 01:40:44.980 |
all over the place. The intermediate values, some are really big, some are really small. 01:40:51.120 |
So if we change a weight which impacted x_1, it's going to change the loss function by 01:40:57.780 |
a lot, whereas if we change a weight which impacts x_3, it'll change the loss function hardly at all. 01:41:04.260 |
So the different weights have very different gradients, very different amounts that are 01:41:10.620 |
going to affect the outcome. Furthermore, as you go further down through the model, 01:41:16.660 |
that's going to multiply. Particularly when we're using something like softmax, which 01:41:19.980 |
has an 'e' to the power of in it, you end up with these crazy big numbers. 01:41:24.700 |
So when you have inputs that are of very different scales, it makes the whole model very fragile, 01:41:32.300 |
which means it is harder to learn the best set of weights and you have to use smaller 01:41:36.820 |
learning rates. This is not just true of deep learning, it's true of pretty much every 01:41:43.020 |
kind of machine learning model, which is why everybody who's been through the MSAM program 01:41:47.220 |
here has hopefully learned to normalize their inputs. 01:41:51.180 |
So if you haven't done any machine learning before, no problem, just take my word for 01:41:55.540 |
it, you always want to normalize your inputs. It's so common that pretty much all of the 01:42:02.900 |
deep learning libraries will normalize your inputs for you with a single parameter. And 01:42:08.660 |
indeed we're doing it in ours. Because pixel values only range from 0 to 255, 01:42:19.260 |
you don't generally worry about dividing by the standard deviation with images, but you 01:42:24.120 |
do generally worry about subtracting the mean. So you'll see that the first thing that our 01:42:30.700 |
model does is this thing called pre-process, which subtracts the mean. And the mean was 01:42:37.420 |
something which basically you can look it up on the internet and find out what the mean 01:42:41.260 |
of the ImageNet data is. So these three fixed values. 01:42:46.300 |
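This mirrors what the course's VGG model does in its first Lambda layer: subtract the per-channel ImageNet mean (the three fixed values just mentioned) and flip RGB to BGR, which is what the original Caffe-trained VGG weights expect. The sketch assumes Theano channels-first ordering and batched inputs.

```python
import numpy as np

vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3, 1, 1))

def vgg_preprocess(x):
    x = x - vgg_mean      # subtract the mean of each channel
    return x[:, ::-1]     # reverse the channel axis of the batch: RGB -> BGR
```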
Now what's that got to do with batch norm? Well, imagine that somewhere along the line 01:42:52.140 |
in our training, we ended up with one really big weight. Then suddenly one of our layers 01:42:59.700 |
is going to have one really big number. And now we're going to have exactly the same problem 01:43:03.920 |
as we had before, which is the whole model becomes very un-resilient, becomes very fragile, 01:43:10.460 |
becomes very hard to train, going to be all over the place. Some numbers could even get 01:43:23.860 |
So what do we do? Really what we want to do is to normalize not just our inputs but our 01:43:34.420 |
activations as well. So you may think, OK, no problem, let's just subtract the mean and 01:43:40.020 |
divide by the standard deviation for each of our activation layers. Unfortunately that 01:43:45.300 |
doesn't work. SGD is very bloody-minded. If it wants to increase one of the weights higher 01:43:51.420 |
and you try to undo it by subtracting the mean and dividing by the standard deviation, 01:43:55.380 |
the next iteration is going to try to make it higher again. So if SGD decides that it 01:44:00.860 |
wants to make your weights of very different scales, it will do so. So just normalizing the activations doesn't work on its own. 01:44:10.220 |
So batch norm is a really neat trick for avoiding that problem. Before I tell you the trick, 01:44:16.580 |
I will just tell you why you want to use it. Because A) it's about 10 times faster than 01:44:22.800 |
not using it, particularly because it often lets you use a 10 times higher learning rate, 01:44:28.420 |
and B) because it reduces overfitting without removing any information from the model. So 01:44:33.620 |
these are the two things you want, less overfitting and faster models. 01:44:39.860 |
I'm not going to go into detail on how it works. You can read about this during the week if 01:44:42.980 |
you're interested. But a brief outline. First step, it normalizes the intermediate layers 01:44:48.980 |
just the same way as input layers can be normalized. The thing I just told you wouldn't work, well 01:44:54.020 |
it does it, but it does something else critical, which is it adds two more trainable parameters. 01:45:00.340 |
One trainable parameter multiplies by all the activations, and the other one is added 01:45:04.580 |
to all the activations. So effectively that is able to undo that normalization. Both of 01:45:12.060 |
those two things are then incorporated into the calculation of the gradient. 01:45:16.800 |
So the model now knows that it can rescale all of the weights if it wants to without 01:45:24.880 |
moving one of the weights way off into the distance. And so it turns out that this does 01:45:30.720 |
actually effectively control the weights in a really effective way. 01:45:34.580 |
So that's what batch normalization is. The good news is, for you to use it, you just 01:45:38.820 |
type batch normalization. In fact, you can put it after dense layers, you can put it 01:45:47.860 |
after convolutional layers, you should put it after all of your layers. 01:45:52.300 |
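A minimal sketch of where the layer goes (Keras 1-style, channels-first ordering; the layer sizes are just for illustration):

```python
from keras.layers.core import Dense
from keras.layers.convolutional import Convolution2D
from keras.layers.normalization import BatchNormalization

model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(BatchNormalization(axis=1))    # after a convolutional layer
model.add(Dense(4096, activation='relu'))
model.add(BatchNormalization())          # after a dense layer
```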
Here's the bad news, VGG didn't train originally with batch normalization, and adding batch 01:45:59.300 |
normalization changes all of the weights. I think that there is a way to calculate a new 01:46:05.820 |
set of weights with batch normalization, I haven't gone through that process yet. 01:46:12.020 |
So what I did today was I actually grabbed the entirety of ImageNet and I trained this 01:46:20.900 |
model on all of ImageNet. And that then gave me a model which was basically VGG plus batch 01:46:28.940 |
normalization. And so that is the model here that I'm loading. So this is the ImageNet, 01:46:36.380 |
whatever it is, large visual recognition competition 2012 dataset. And so I trained this set of 01:46:42.140 |
weights on the entirety of ImageNet so that I created basically a VGG plus batch norm. 01:46:47.540 |
And so then I fine-tuned the VGG plus batch norm model by popping off the end and adding 01:46:56.960 |
a new dense layer. And then I trained it, and these only took 6 seconds because I pre-calculated 01:47:06.440 |
the inputs to this. Then I added data augmentation and I started training that. And then I ran 01:47:22.380 |
So I think this was on the right track. I think if I had another hour or so I would have finished it; you guys 01:47:29.100 |
can play with this during the week. Because this is now like all the pieces together. 01:47:34.780 |
It's batch norm and data augmentation and as much dropout as you want. So you'll see 01:47:43.300 |
what I've got here is I have dropout layers with an arbitrary amount of dropout. And so 01:47:50.820 |
in this, the way I set it up, you can go ahead and say create batch norm layers with whatever 01:47:56.140 |
amount of dropout you want. And then later on you can say I want you to change the weights 01:48:04.180 |
So this is kind of like the ultimate ImageNet fine-tuning experience. And I haven't seen 01:48:11.820 |
anybody create this before, so this is a useful tool that didn't exist until today. And hopefully 01:48:22.300 |
Interestingly, I found that when I went back to even 0.5 dropout, it was still massively 01:48:28.300 |
overfitting. So it seems that batch normalization allows the model to be so much better at finding 01:48:35.420 |
the optimum that I actually needed more dropout rather than less. 01:48:39.900 |
So anyway, as I said, this is all something I was doing today. So I haven't quite finalized 01:48:45.500 |
that. What I will show you though is something I did finalize, which I did on Sunday, which 01:48:51.180 |
is going through end-to-end an entire model-building process on MNIST. And so I want to show you 01:48:58.780 |
this entire process and then you guys can play with it. 01:49:03.780 |
MNIST is a great way to really experiment with and revise everything we know about CNNs 01:49:10.460 |
because it's very fast to train, because there are only 28x28 images, and there's also extensive 01:49:15.300 |
benchmarks on what are the best approaches to MNIST. 01:49:19.820 |
So it's very, very easy to get started with MNIST because Keras actually contains a copy 01:49:25.580 |
of MNIST. So we can just go from keras.datasets import mnist, then mnist.load_data(), and we're done. 01:49:34.460 |
Now MNIST are grayscale images, and everything in Keras in terms of the convolutional stuff 01:49:41.020 |
expects there to be a number of channels. So we have to use expand-dims to add this empty 01:49:49.260 |
dimension. So this is 60,000 images with one color, which are 28x28. So if you try to use 01:49:59.500 |
grayscale images and get weird errors, I'm pretty sure this is what you've forgotten 01:50:04.780 |
to do, just to add this kind of empty dimension, which is you actually have to tell it there 01:50:10.140 |
is one channel. Because otherwise it doesn't know how many channels there are. So there's just that one extra step. 01:50:15.780 |
The other thing I had to do was take the y-values, the labels, and one-hot encode them. Because 01:50:23.140 |
otherwise they were like this, they were actual numbers: 5, 0, 4, 1, 9. And we need to one-hot encode 01:50:30.380 |
them so that each becomes a vector of length 10 with a 1 in the right position. Remember, this is the thing that the softmax function is 01:50:40.180 |
trying to approximate. That's how the linear algebra works. So there are the two things 01:50:44.820 |
I had to do to preprocess this. Add the empty dimension and do my one-hot encoding. Then 01:50:51.220 |
I normalize the input by subtracting the mean and dividing by the standard deviation. And that's all the preprocessing. 01:51:01.340 |
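A sketch of that preprocessing (Keras 1, Theano channels-first ordering):

```python
import numpy as np
from keras.datasets import mnist
from keras.utils.np_utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = np.expand_dims(X_train, 1)     # (60000, 28, 28) -> (60000, 1, 28, 28)
X_test  = np.expand_dims(X_test, 1)
y_train, y_test = to_categorical(y_train), to_categorical(y_test)   # e.g. 5 -> [0 0 0 0 0 1 0 0 0 0]

mean_px = X_train.mean().astype(np.float32)
std_px  = X_train.std().astype(np.float32)
def norm_input(x): return (x - mean_px) / std_px    # used in a Lambda layer at the start of each model
```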
So I can't fine-tune from ImageNet now because ImageNet is 224x224 and this is 28x28. ImageNet 01:51:09.580 |
is full color and this is grayscale. So we're going to start from scratch. So all of these models start from random weights. 01:51:16.620 |
So a linear model needs to normalize the input and needs to flatten it because I'm not going 01:51:22.620 |
to treat it as an image, I'm going to treat it as a single vector. And then I create my 01:51:27.540 |
one dense layer with 10 outputs, compile it, grab my batches, and train my linear model. 01:51:37.500 |
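A sketch of that linear model: normalize, flatten, one dense softmax layer (norm_input is the function from the preprocessing sketch above).

```python
from keras.models import Sequential
from keras.layers.core import Lambda, Flatten, Dense

lm = Sequential([
    Lambda(norm_input, input_shape=(1, 28, 28)),   # normalize the pixels
    Flatten(),                                     # treat the image as a single vector
    Dense(10, activation='softmax')
])
lm.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```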
And so you can see, generally speaking, the best way to train a model is to start by doing 01:51:45.440 |
one epoch with a pretty low learning rate. So the default learning rate is 0.001, which 01:51:53.100 |
is actually a pretty good default. So you'll find nearly all of the time I just accept 01:51:56.740 |
the default learning rate and I do a single epoch. And that's enough to get it started. 01:52:02.500 |
Once you've got it started, you can set the learning rate really high. So 0.1 is about 01:52:06.780 |
as high as you ever want to go, and do another epoch. And that's going to move super fast. 01:52:12.420 |
And then gradually, you reduce the learning rate by order of magnitude at a time. So I 01:52:18.780 |
go to 0.01, do a few epochs, and basically keep going like that until you start overfitting. 01:52:25.820 |
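A sketch of that schedule (Keras 1-style fit; K.set_value updates the optimizer's learning-rate variable in place, and the epoch counts are illustrative):

```python
from keras import backend as K

lm.fit(X_train, y_train, batch_size=64, nb_epoch=1, validation_data=(X_test, y_test))
K.set_value(lm.optimizer.lr, 0.1)       # warmed up, so go fast
lm.fit(X_train, y_train, batch_size=64, nb_epoch=1, validation_data=(X_test, y_test))
K.set_value(lm.optimizer.lr, 0.01)      # then back off by an order of magnitude
lm.fit(X_train, y_train, batch_size=64, nb_epoch=4, validation_data=(X_test, y_test))
```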
So I got down to the point where I had a 92.7% accuracy on the training, 92.4% on the test, 01:52:32.820 |
and I was like, okay, that's about as far as I can go. So that's a linear model. Not 01:52:37.500 |
very interesting. So the next thing to do is to grab one extra dense layer in the middle, 01:52:42.660 |
so one hidden layer. This is what in the 80s and 90s people thought of as a neural network, 01:52:48.500 |
one hidden layer fully connected. And so that still takes 5 seconds to train. Again, we 01:52:55.460 |
do the same thing, one epoch with a low learning rate, then pop up the learning rate for as 01:52:59.740 |
long as we can, gradually decrease it, and we get 94% accuracy. 01:53:07.500 |
So you wouldn't expect a fully connected network to do that well. So let's create a CNN. So 01:53:12.900 |
this was actually the first architecture I tried. And basically I thought, okay, we know 01:53:17.180 |
VGG works pretty well, so how about I create an architecture that looks like VGG, but it's 01:53:23.180 |
much simpler because this is just 28x28. So I thought, okay, well VGG generally has a couple 01:53:28.380 |
of convolutional layers of 3x3, and then a max pooling layer, and then a couple more with 01:53:33.460 |
twice as many filters. So I just tried that. So this is kind of like my inspired by VGG 01:53:42.100 |
And I thought, okay, so after 2 lots of max pooling, it'll go from 28x28 to 14x14 to 7x7. 01:53:50.660 |
Okay, that's probably enough. So then I added my 2 dense layers again. So I didn't use any 01:53:57.140 |
science here, it's just kind of some intuition. And it actually worked pretty well. After 01:54:03.900 |
my learning rate of 0.1, I had an accuracy of 98.9%, validation accuracy of 99%. And then 01:54:12.300 |
after a few epochs at 0.01, I had a training accuracy of 99.75%. But look, my validation accuracy is now lower than that, so I'm overfitting. 01:54:23.280 |
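A sketch of that "inspired by VGG" architecture; the exact filter counts and dense size here are illustrative rather than a transcription of the notebook.

```python
from keras.models import Sequential
from keras.layers.core import Lambda, Flatten, Dense
from keras.layers.convolutional import Convolution2D, MaxPooling2D

def get_cnn():
    model = Sequential([
        Lambda(norm_input, input_shape=(1, 28, 28)),
        Convolution2D(32, 3, 3, activation='relu'),
        Convolution2D(32, 3, 3, activation='relu'),
        MaxPooling2D(),
        Convolution2D(64, 3, 3, activation='relu'),   # twice as many filters after pooling
        Convolution2D(64, 3, 3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        Dense(512, activation='relu'),
        Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```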
So this is the trick. Start by overfitting. Once you know you're overfitting, you know 01:54:28.500 |
that you have a model that is complex enough to handle your data. So at this point, I was 01:54:33.740 |
like, okay, this is a good architecture. It's capable of overfitting. So let's now try to 01:54:38.160 |
use the same architecture and reduce overfitting, but reduce the complexity of the model no more than we have to. 01:54:45.260 |
So the first applicable step of my 5-step list was data augmentation. So I added a bit of data augmentation, and 01:54:52.420 |
then I used exactly the same model as I had before. And trained it for a while. And I found 01:54:58.780 |
this time I could actually train it for even longer, as you can see. And I started to get 01:55:03.680 |
some pretty good results here, 99.3, 99.34. But by the end, you can see I'm massively 01:55:09.220 |
overfitting again. 99.6 training versus 91.1 test. 01:55:15.260 |
So data augmentation alone is not enough. And I said to you guys, we'll always use batch 01:55:20.820 |
norm anyway. So then I add batch norm. I use batch norm on every layer. Notice that when 01:55:29.420 |
you use batch norm on convolution layers, you have to add axis=1. I am not going to 01:55:36.220 |
tell you why. I want you guys to read the documentation about batch norm and try and 01:55:41.420 |
figure out why you need this. And then we'll have a discussion about it on the forum because 01:55:45.780 |
it's a really interesting analysis if you really want to understand batch norm and understand 01:55:54.060 |
If you don't care about the details, that's fine. Just know type axis=1 anytime you have 01:55:59.100 |
batch norm. And so this is like a pretty good quality modern network. You can see I've got 01:56:05.860 |
convolution layers, they're 3x3, and then I have batch norm, and then I have max pooling, 01:56:10.100 |
and then at the end I have some dense layers. This is actually a pretty decent looking model. 01:56:16.300 |
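A sketch of that batch-norm version of the network (dropout gets added at the very end in the next step); again, the exact sizes are illustrative.

```python
from keras.models import Sequential
from keras.layers.core import Lambda, Flatten, Dense
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization

def get_model_bn():
    model = Sequential([
        Lambda(norm_input, input_shape=(1, 28, 28)),
        Convolution2D(32, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(32, 3, 3, activation='relu'),
        MaxPooling2D(),
        BatchNormalization(axis=1),
        Convolution2D(64, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(64, 3, 3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        BatchNormalization(),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```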
Not surprisingly, it does pretty well. So I train it for a while at 0.1, I train it for 01:56:20.780 |
a while at 0.01, I train it for a while at 0.001, and you can see I get up to 99.5%. That's 01:56:30.380 |
not bad. But by the end, I'm starting to overfit. 01:56:35.380 |
So add a little bit of dropout. And remember what I said to you guys, nowadays the rule 01:56:41.980 |
for dropout is to gradually increase it. I only had time yesterday to just try adding 01:56:49.300 |
one layer of dropout right at the end, but as it happened, that seemed to be enough. 01:56:53.800 |
So when I just added one layer of dropout to the previous model, trained it for a while 01:56:57.820 |
at 0.1, 0.01, 0.001, and it's like, oh great, my accuracy and my validation accuracy are 01:57:08.100 |
pretty similar, and my validation accuracy is around 99.5 to 99.6 towards the end here. 01:57:18.900 |
So at 99.5 or 99.6% accuracy on handwriting recognition is pretty good, but there's one 01:57:26.700 |
more trick you can do which makes every model better, and it's called Ensembling. Ensembling 01:57:32.980 |
refers to building multiple versions of your model and combining them together. 01:57:37.780 |
So what I did was I took all of the code from that last section and put it into a single 01:57:43.980 |
function. So this is exactly the same model I had before, and these are the exact steps 01:57:49.780 |
I talked about to train it: learning rate of 0.1, then 0.01, then 0.001. So at the end of this, it returns the trained model. 01:57:58.420 |
And so then I said, okay, 6 times fit a model and return a list of the results. So models 01:58:07.320 |
at the end of this contain 6 trained models using my preferred network. 01:58:15.380 |
So then what I could do was to say, go through every one of those 6 models and predict the 01:58:24.920 |
output for everything in my test set. So now I have 10,000 test images by 10 outputs by 01:58:34.500 |
6 models. And so now I can take the average across the 6 models. And so now I'm basically 01:58:42.020 |
saying here are 6 models, they've all been trained in the same way but from different 01:58:46.460 |
random starting points. And so the idea is that they will be making errors in different places. 01:58:52.100 |
So let's take the average of them, and I get an accuracy of 99.7%. How good is that? It's 01:59:01.300 |
very good. It's so good that if we go to the academic list of the best MNIST results of 01:59:08.000 |
all time, and many of these were specifically designed for handwriting recognition, it comes in very near the top. 01:59:16.740 |
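A sketch of the ensembling step: train several copies of the same model from different random initializations and average their predicted probabilities. get_model_bn is the batch-norm model sketched earlier (the lesson's version also has the final dropout layer), and the training schedule inside fit_model is abbreviated here.

```python
import numpy as np

def fit_model():
    model = get_model_bn()
    model.fit(X_train, y_train, batch_size=64, nb_epoch=12,
              validation_data=(X_test, y_test))
    return model

models = [fit_model() for _ in range(6)]                                     # six independently trained copies
all_preds = np.stack([m.predict(X_test, batch_size=256) for m in models])   # shape (6, 10000, 10)
avg_preds = all_preds.mean(axis=0)                                           # average over the ensemble
accuracy = (avg_preds.argmax(axis=1) == y_test.argmax(axis=1)).mean()
```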
So one afternoon's work gets us in the list of the best results ever found on this dataset. 01:59:25.140 |
So as you can see, it's not rocket science; it's all stuff you've learned before or you've 01:59:30.700 |
learned now, and it's a process which is fairly repeatable and can get you right up to the state of the art. 01:59:39.580 |
So it was easier to do it on MNIST because I only had to wait a few seconds for each 01:59:45.060 |
of my trainings to finish. To get to this point on State Farm, it's going to be harder 01:59:51.460 |
because you're going to have to think about how do you do it in the time you have available 01:59:54.860 |
and how do you do it in the context of fine-tuning and stuff like that. But hopefully you can 02:00:00.140 |
see that you have all of the tools now at your disposal to create literally a state of the art result. 02:00:08.160 |
So I'm going to make all of these notebooks available. You can play with them. You can 02:00:12.780 |
try to get a better result from dogs and cats. As you can see, it's kind of like an incomplete 02:00:19.100 |
thing that I've done here. I haven't found the best data augmentation, I haven't found 02:00:22.340 |
the best dropout, I haven't trained it as long as I probably need to. So there's some room for improvement. 02:00:29.260 |
So here are your assignments for this week. This is all review now. I suggest you go back 02:00:36.580 |
and actually read. There's quite a bit of prose in every one of these notebooks. Hopefully 02:00:40.820 |
now you can go back and read that prose, and some of that prose at first was a bit mysterious, 02:00:46.340 |
now it's going to make sense. Oh, okay, I see what it's saying. And if you read something 02:00:50.500 |
and it doesn't make sense, ask on the forum. Or if you read something and you want to check, 02:00:55.900 |
oh, is this kind of another way of saying this other thing? Ask on the forum. 02:01:00.340 |
So these are all notebooks that we've looked at already and you should definitely review. 02:01:05.020 |
Ask us something on the forum. Make sure that you can replicate the steps shown in the lesson 02:01:09.620 |
notebooks we've seen so far using the technique in how to use the provided notebooks we looked 02:01:14.620 |
at the start of class. If you haven't yet got into the top 50% of dogs vs cats, hopefully now you can. 02:01:22.940 |
If you get stuck at any point, ask on the forum. And then this is your big challenge. Can you 02:01:28.100 |
get into the top 50% of State Farm? Now this is tough. The first step to doing well in 02:01:33.900 |
a Kaggle competition is to create a validation set that gives you accurate answers. So create 02:01:39.940 |
a validation set, and then make sure that the validation set accuracy is the same as 02:01:46.940 |
you get when you submit to Kaggle. If you don't, you don't have a good enough validation 02:01:51.340 |
set yet. Creating a validation set for State Farm is really your first challenge. It requires 02:01:57.100 |
thinking long and hard about the evaluation section on that page and what that means. 02:02:01.820 |
And then it's thinking about which layers of the pre-trained network should I be retraining. 02:02:09.460 |
I actually have read through the top 20 results from the competition close 3 months ago. I 02:02:16.420 |
actually think all of the top 20 result methods are pretty hacky. They're pretty ugly. I feel 02:02:24.860 |
like there's a better way to do this that's kind of in our grasp. So I'm hoping that somebody 02:02:31.060 |
is going to come up with a top 20 result for State Farm that is elegant. We'll see how 02:02:39.140 |
we go. If not this year, maybe next year. Honestly, nobody in Kaggle quite came up with 02:02:44.940 |
a really good way of tackling this. They've got some really good results, but with some fairly hacky approaches. 02:02:53.100 |
And then as you go through a review, please, any of these techniques that you're not clear 02:02:58.060 |
about, these 5 pieces, please go and have a look at this additional information and see if that helps. 02:03:05.260 |
Alright, that was a pretty quick run-through. I hope everything goes well and I will see you next week.