Back to Index

Lesson 3: Practical Deep Learning for Coders


Transcript

Let's start on the Wiki, on the Lesson 3 section, because Rachel added something to it this week which I think is super helpful. In the section about the assignments, where it talks about going through the notebooks, there's a section called "How to Use the Provided Notebooks". The feedback I get is that each time I talk about the teaching approach in this class, people get a lot out of it.

So I thought I wanted to keep talking a little bit about that. As we've discussed before, in the two hours that we spend together each week, that's not nearly enough time for me to teach you Deep Learning. I can show you what kinds of things you need to learn about and I can show you where to look and try to give you a sense of some of the key topics.

But then the idea is that you're going to learn about deep learning during the week by doing a whole lot of experimenting. And one of the places that you can do that experimenting is with the help of the notebooks that we provide. Having said that, if you do that by loading up a notebook and hitting Shift + Enter a bunch of times to go through each cell until you get an error message and then you go, "Oh shit, I got an error message," you're not going to learn anything about deep learning.

I was almost tempted to not put the notebooks online until a week after each class because it's just so much better when you can build it yourself. But the notebooks are very useful if you use them really rigorously and thoughtfully which is as Rachel described here, read through it and then put it aside, minimize it or close it or whatever, and now try and replicate what you just read from scratch.

And anytime you get stuck, you can go back and open back up the notebook, find the solution to your problem, but don't copy and paste it. Put the notebook aside again, go and read the documentation about what it turns out the solution was, try and understand why is this a solution, and type in that solution yourself from scratch.

And so if you can do that, it means you really understand now the solution to this thing you're previously stuck on and you've now learned something you didn't know before. You might still be stuck, and that's fine. So if you're still stuck, you can refer back to the notebook again, still don't copy and paste the code, but whilst having both open on the screen at the same time, type in the code.

Now that might seem pretty weird. Why would you type in code you can copy and paste, but just the very kinesthetic process of typing it in forces you to think about where are the parentheses, where are the dots, and what's going on. And then once you've done that, you can try changing the inputs to that function and see what happens and see how it affects the outputs and really experiment.

So it's through this process of trying to come up and think about what step do I take next. That means that you're thinking about the concepts you've learned. And then how do you do that step means that you're having to recall how the actual libraries are working. And then most importantly, through experimenting with the inputs and outputs, you get this really intuitive understanding of what's going on.

So one of the questions I was thrilled to see over the weekend, which is exactly the kind of thing I think is super helpful. How do I pronounce your name? I'm trying to understand correlate. So I sent it two vectors with two things each and was happy with the result.

And then I sent it two vectors with three things in, and I don't get it. And so that's great. This is like taking it down to make sure I really understand this. And so I typed something in and the output was not what I expected, what's going on. And so then I tried it by creating a little spreadsheet and showed here are the three numbers and here's how it was calculated.

And then it's like, "Okay, I kind of get that, not fully," and then I finally described it. Did that make sense in the end? So you now understand correlation and convolution. You know you do because you put it in there, you figure out what the answer ought to be, and eventually the answer is what you thought.

So this is exactly the kind of experimentation I find a lot of people try to jump straight to full-scale image recognition before they've got to the kind of 1+1 stage. And so you'll see I do a lot of stuff in Excel, and this is why. In Excel or with simple little things in Python, I think that's where you get the most experimental benefit.

So that's what we're talking about when we talk about experiments. I want to show you something pretty interesting. And remember last week we looked at this paper from Matt Zeiler where we saw what the different layers of a convolutional neural network look like. One of the steps in the how to use the provided notebooks is if you don't know why a step is being done or how it works or what you observe, please ask.

Any time you're stuck for half an hour, please ask. So far, I believe that there has been a 100% success rate in answering questions on the forums. So when people ask, they get an answer. So part of the homework this week in the assignments is ask a question on the forum.

Question about setting up AWS, don't be embarrassed if you still have questions there. No, absolutely. I know a lot of people are still working through cats and dogs, or cats and dogs redux. And that makes perfect sense. The people here have different backgrounds. There are plenty of people here who have never used Python before.

Python was not a prerequisite. The goal is that for those of you that don't know Python, that we give you the resources to learn it and learn it well enough to be effective in doing deep learning in it. But that does mean that you guys are going to have to ask more questions.

There are no dumb questions. So if you see somebody asking on the forum about how do I analyze functional brain MRIs with 3D convolutional neural networks, that's fine, that's where they are at. That's okay if you then ask, What does this Python function do? Or vice versa. If you see somebody ask, What does this Python function do?

And you want to talk about 3D brain MRIs, do that too. The nice thing about the forum is that as you can see, it really is buzzing now. The nice thing is that the different threads allow people to dig into the stuff that interests them. And I'll tell you from personal experience, the thing that I learn the most from is answering the simplest questions.

So actually answering that question about a 1D convolution, I found very interesting. I actually didn't realize that the reflect parameter was the default parameter and I didn't quite understand how it worked, so answering that question I found very interesting. And even sometimes if you know the answer, figuring out how to express it teaches you a lot.

So asking questions of any level is always helpful to you and to the rest of the community. So please, if everybody only does one part of the assignments this week, do that one. Which is to ask a question. And here are some ideas about questions you could ask if you're not sure.

Thank you Rachel. So I was saying last week we kind of looked later in the class at this amazing visualization of what goes on in a convolutional neural network. I want to show you something even cooler, which is the same thing in video. This is by an amazing guy called Jason Yosinski, his supervisor Hod Lipson, and some other guys.

And it's doing the same thing but in video. And so I'm going to show you what's going on here. And you can download this. It's called the Deep Visualization Toolbox. So if you go to Google and search for the Deep Visualization Toolbox, you can do this. You can grab pictures, you can click on any one of the layers of a convolutional neural network, and it will visualize every one of the outputs of the filters in that convolutional layer.

So you can see here with this dog, it looks like there's a filter here which is kind of finding edges. And you can even give it a video stream. So if you give it a video stream of your own webcam, you can see the video stream popping up here.

So this is a great tool. And looking at this tool now, I hope it will give us a better intuition about what's going on in a convolutional neural network. Look at this one here he selected. There's clearly an edge detector. As he slides a piece of paper over it, you get this very strong edge.

And clearly it's specifically a horizontal edge detector. And here is actually a visualization of the pixels of the filter itself. And it's exactly what you would expect. Remember from our initial lesson 0, an edge detector has black on one side and white on the other. So you can scroll through all the different layers of this neural network.

And different layers do different things. And the deeper the layer, the larger the area it covers, and therefore the smaller the actual filter is, and the more complex the objects that it can recognize. So here's an interesting example of a layer 5 thing which it looks like it's a face detector.

So you can see that as he moves his face around, this is moving around as well. So one of the cool things you can do with this is you can say show me all the images from ImageNet that match this filter as much as possible, and you can see that it's showing us faces.

This is a really cool way to understand what your neural network is doing, or what ImageNet is doing. You can see other guys come along and here we are. And so here you can see the actual result in real time of the filter deconvolution, and here's the actual recognition that it's doing.

So clearly it's a face detector which also detects cat faces. So the interesting thing about these types of neural net filters is that they're often pretty subtle as to how they work. They're not looking for just some fixed set of pixels, but they really understand concepts. So here's a really interesting example.

Here's one of the filters in the 5th layer which seems to be like an armpit detector. So why would you have an armpit detector? Well interestingly, what he shows here is that actually it's not an armpit detector. Because look what happens. If he smooths out his fabric, this disappears.

So what this actually is, is a texture detector. It's something that detects some kind of regular texture. Here's an interesting example of one which clearly is a text detector. Now interestingly, ImageNet did not have a category called text, one of the thousand categories is not text, but one of the thousand categories is bookshelf.

And so you can't find a bookshelf if you don't know how to find a book, and you can't find a book if you don't know how to recognize its spine, and the way to recognize its spine is by finding text. So this is the cool thing about these neural networks is that you don't have to tell them what to find.

They decide what they want to find in order to solve your problem. So I wanted to start at this end of "Oh my God, deep learning is really cool" and then jump back to the other end of "Oh my God, deep learning is really simple." So everything we just saw works because of the things that we've learned about so far, and I've got a section here called CNN Review in lesson 3.

And Rachel and I have started to add some of our favorite readings about each of these pieces, but everything you just saw in that video consists of the following pieces: matrix products; convolutions, just like we saw in Excel and Python; activations such as ReLU and Softmax; Stochastic Gradient Descent, which is based on backpropagation - we'll learn more about that today - and that's basically it.

One of the, I think, challenging things, even if you feel comfortable with each of these pieces that make up convolutional neural networks, is really understanding how those pieces fit together to actually do deep learning. So we've got two really good resources here on putting it all together.

So I'm going to go through each of these things today as revision, but what I suggest you do, if there's any piece where you feel like, "I'm not quite confident - do I really know what a convolution is, or what an activation function is?", is see if this information is helpful and maybe ask a question on the forum.

So let's go through each of these. I think a particularly good place to start maybe is with convolutions. And a good reason to start with convolutions is because we haven't really looked at them since Lesson 0. And that was quite a while ago. So let's remind ourselves about Lesson 0.

So in Lesson 0, we learned about what a convolution is and we learned about what a convolution is by actually running a convolution against an image. So we used the MNIST dataset. The MNIST dataset, remember, consists of 55,000 28x28 grayscale images of handwritten digits. So each one of these has some known label, and so here's five examples with a known label.

So in order to understand what a convolution is, we tried creating a simple little 3x3 matrix. And so the 3x3 matrix we started with had negative 1s at the top, 1s in the middle, and 0s at the bottom. So we could kind of visualize that. So what would happen if we took this 3x3 matrix and we slid it over every 3x3 part of this image, and we multiplied negative 1 by the first pixel, negative 1 by the second, negative 1 by the third, then moved to the next rows and multiplied by 1, 1, 1 and 0, 0, 0, and added them all together?

And so we could do that for every 3x3 area. That's what a convolution does. So you might remember from Lesson 0, we looked at a little area to actually see what this looks like. So we could zoom in, so here's a little small little bit of the 7. And so one thing I think is helpful is just to look at what is that little bit.

Let's make it a bit smaller so it fits on our screen. So you can see that an image just is a bunch of numbers. And the blacks are zeros and the things in between bigger and bigger numbers until eventually the whites are very close to 1. So what would happen if we took this little 3x3 area?

0, 0, 0, 0, 0.35, 0.5, 0.9, 0.9, 0.9, and we multiplied each of those 9 things by each of these 9 things. So clearly anywhere where the first row is zeros and the second row is ones, this is going to be very high when we multiply it all together and add the 9 things up.
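To make that arithmetic concrete, here is a tiny sketch in Python. It multiplies the 3x3 patch just quoted by the top-edge filter, then slides the same filter over a whole image using scipy's correlate (which is what the lesson 0 notebook uses); the 28x28 random array is just a stand-in for a real MNIST digit.

```python
import numpy as np
from scipy.ndimage import correlate

# The top-edge filter: -1s at the top, 1s in the middle, 0s at the bottom.
top_edge = np.array([[-1., -1., -1.],
                     [ 1.,  1.,  1.],
                     [ 0.,  0.,  0.]])

# The little 3x3 patch of the 7 quoted above.
patch = np.array([[0.0, 0.0,  0.0],
                  [0.0, 0.35, 0.5],
                  [0.9, 0.9,  0.9]])

# The convolution output at the centre of this patch is just the element-wise
# product of the 9 filter values and the 9 pixel values, all added together.
print((top_edge * patch).sum())      # 0.85 - a reasonably high value

# Sliding the filter over every 3x3 area of an image is what correlate does.
img = np.random.rand(28, 28)         # stand-in for a 28x28 MNIST digit
result = correlate(img, top_edge)    # same shape as img; mode='reflect' is the default
```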

And so given that white means high, you can see then that when we do this convolution, we end up with something where the top edges become bright, because that's where multiplying the rows by -1, -1, -1, then 1, 1, 1, then 0, 0, 0 and adding them all together gives a high value. So one of the things we looked at in lesson 0, and we have a link to here, is this cool little image kernel explained visually site, where you can actually create any 3x3 matrix yourself, go through any 3x3 part of this picture, see the actual arithmetic, and see the result.

So if you're not comfortable with convolutions, this would be a great place to go next. That's an excellent question. How did you decide on the values of the top matrix? So in order to demonstrate an edge filter, I picked values based on some well-known edge filter matrices. So you can see here's a bunch of different matrices that this guy has.

So for example, top_sobel, I could select, and you can see that does a top_edge filter. Or I could say emboss, and you can see it creates this embossing effect. Here's a better example because it's nice and big here. So these types of filters have been created over many decades, and there's lots and lots of filters designed to do interesting things.

So I just picked a simple filter which I knew from experience and from common sense would create a top edge filter. And so by the same kind of idea, if I rotate that by 90 degrees, that's going to create a left-hand edge filter. So if I create the four different types of filter here, and I could also create four different diagonal filters like these, that would allow me to create top edge, left edge, bottom edge, right edge, and then each diagonal edge filter here.

So I created these filters just by hand through a combination of common sense and having read about filters because people spend time designing filters. The more interesting question then really is what would be the optimal way to design filters? Because it's definitely not the case that these eight filters are the best way of figuring out what's a 7 and what's an 8 and what's a 1.

So this is what deep learning does. What deep learning does is it says let's start with random filters. So let's not design them, but we'll start with totally random numbers for each of our filters. So we might start with eight random filters, each of 3x3. And we then use stochastic gradient descent to find out what are the optimal values of each of those sets of 9 numbers.

And that's what happens in order to create that cool video we just saw, and that cool paper that we saw. That's how those different kinds of edge detectors and gradient detectors and so forth were created. When you use stochastic gradient descent to optimize these kinds of values when they start out random, it figures out that the best way to recognize images is by creating these kinds of different detectors, different filters.

Where it gets interesting is when you start building convolutions on top of convolutions. So we saw last week that we can create a bunch of inputs - if I get any of this wrong, please remind me. We saw last week how, if you've got three inputs, you can create a bunch of weight matrices, so we can create one weight matrix.

So if we've got three inputs, we saw last week how you could create a random matrix and then do a matrix multiply of the inputs times a random matrix. We could then put it through an activation function such as max(0,x) and we could then take that and multiply it by another weight matrix to create another output.

And then we could put that through max(0,x) and we can keep doing that to create arbitrarily complex functions. And we looked at this really great neural networks and deep learning chapter where we saw visually how that kind of bunch of matrix products followed by activation functions can approximate any given function.
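Here's a minimal numpy sketch of that idea - the sizes and random weights are arbitrary, purely to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0, z)   # the max(0, x) activation

x = rng.normal(size=(1, 3))         # three inputs

W1 = rng.normal(size=(3, 10))       # first random weight matrix
W2 = rng.normal(size=(10, 10))      # second random weight matrix
W3 = rng.normal(size=(10, 2))       # final weight matrix: two outputs

# Matrix product, activation, matrix product, activation, matrix product.
out = relu(x @ W1)
out = relu(out @ W2)
out = out @ W3
print(out.shape)                    # (1, 2)
```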

So where it gets interesting then is instead of just having a bunch of weight matrices and matrix products, what if sometimes we had convolutions and activations? Because a convolution is just a subset of a matrix product, so if you think about it, a matrix product says here's 10 activations and then a weight matrix going down to 10 activations.

The weight matrix goes from every single element of the first layer to every single element of the next layer. So if this goes from 10 to 10, there are 100 weights. Whereas a convolution is just creating a subset of those weights. So I'll let you think about this during the week because it's a really interesting insight to think about that a convolution is identical to a fully connected layer, but it's just a subset of the weights.
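Here's a small numpy demonstration of that insight, using a 1D convolution for simplicity: the convolution gives exactly the same answer as multiplying by a fully connected weight matrix in which most of the weights are zero and the rest are shared copies of the filter.

```python
import numpy as np

x = np.arange(10, dtype=float)    # ten input activations
f = np.array([1., 2., 3.])        # a filter of size 3

# A 'valid' 1D convolution (written as correlation): 8 outputs from 10 inputs.
conv_out = np.array([(x[i:i+3] * f).sum() for i in range(8)])

# The same thing as an 8x10 weight matrix: each row holds the filter in a
# different position, and zeros everywhere else.
W = np.zeros((8, 10))
for i in range(8):
    W[i, i:i+3] = f

print(np.allclose(conv_out, W @ x))   # True
```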

And so therefore everything we learned about stacking linear and nonlinear layers together applies also to convolutions. But we also know that convolutions are particularly well-suited to identifying interesting features of images. So by using convolutions, it allows us to more conveniently and quickly find powerful deep learning networks. So the spreadsheet will be available for download tomorrow.

We're trying to get to the point that we can actually get the derivatives to work in the spreadsheet and we're still slightly stuck with some of the details, but we'll make something available tomorrow. Are the filters the layers? Yes they are. So this is something where spending a lot of time looking at simple little convolution examples is really helpful.

Because for a fully connected layer, it's pretty easy. You can see if I have 3 inputs, then my matrix product will have to have 3 rows, otherwise they won't match. And then I can create as many columns as I like. And the number of columns I create tells me how many activations I create because that's what matrix products do.

So it's very easy to see how with what Keras calls dense layers, I can decide how big I want each activation layer to be. If you think about it, you can do exactly the same thing with convolutions. You can decide how many sets of 3x3 matrices you want to create at random, and each one will generate a different output when applied to the image.

So the way that VGG works, for example - the VGG network, which we learned about in Lesson 1, contains a bunch of layers. It contains a bunch of convolutional layers, followed by a Flatten. And Flatten is just a Keras thing that says: don't think of the activations anymore as being x by y by channel matrices, think of them as being a single vector.

So it just concatenates all the dimensions together, and then it contains a bunch of fully connected blocks. And so each of the convolutional blocks is -- you can kind of ignore the zero padding, that just adds zeros around the outside so that your convolutions end up with the same number of outputs as inputs.

It contains a 2D convolution, followed by, and we'll review this in a moment, a max pooling layer. You can see that it starts off with 2 convolutional layers with 64 filters, and then 2 convolutional layers with 128 filters, and then 3 convolutional layers with 256 filters. And so you can see what it's doing is it's gradually creating more and more filters in each layer.

These definitions of block are specific to VGG, so I just created -- this is just me refactoring the model so there wasn't lots and lots of lines of code. So I just didn't want to retype lots of code, so I kind of found that these lines of code were being repeated so I turned it into a function.
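For reference, that kind of refactoring looks roughly like this in the Keras 1 style used in the course. This is a sketch with my own helper names, not the exact code from the VGG definition:

```python
from keras.layers import ZeroPadding2D, Convolution2D, MaxPooling2D, Dense, Dropout

def conv_block(model, n_layers, n_filters):
    """Add n_layers zero-padded 3x3 convolutions, then a max pooling layer."""
    for _ in range(n_layers):
        model.add(ZeroPadding2D((1, 1)))
        model.add(Convolution2D(n_filters, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

def fc_block(model):
    """Add a fully connected layer followed by dropout."""
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))

# Usage, mirroring the structure described above:
#   conv_block(model, 2, 64); conv_block(model, 2, 128); conv_block(model, 3, 256); ...
#   then a Flatten, two fc_blocks, and the final Dense output layer.
```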

So why would we be having the number of filters being increasing? Well, the best way to understand a model is to use the summary command. So let's go back to lesson 1. So let's go right back to our first thing we learned, which was the 7 lines of code that you can run in order to create and train a network.

I won't wait for it to actually finish training, but what I do want to do now is go vgg.model.summary. So anytime you're creating models, it's a really good idea to use the summary command to look inside them and it tells you all about it. So here we can see that the input to our model has 3 channels, red, green and blue, and they are 224x224 images.

After I do my first 2D convolution, I now have 64 channels of 224x224. So I've replaced my 3 channels with 64, just like here I've got 8 different filters, here I've got 64 different filters because that's what I asked for. So again we have a second convolution set with 224x224 of 64, and then we do max pooling.

So max pooling, remember from lesson 0, was this thing where we simplified things. So we started out with these 28x28 images and we said let's take each 7x7 block and replace that entire 7x7 block with a single pixel which contains the maximum pixel value. So here is this 7x7 block which is basically all gray, so we end up with a very low number here.

And so instead of being 28x28, it becomes 4x4 because we are replacing every 7x7 block with a single pixel. That's all max pooling does. So the reason we have max pooling is it allows us to gradually simplify our image so that we get larger and larger areas and smaller and smaller images.
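Here's that pooling operation as a couple of lines of numpy, using a random 28x28 array as a stand-in for a digit:

```python
import numpy as np

img = np.random.rand(28, 28)                 # stand-in for one 28x28 digit

# Replace every non-overlapping 7x7 block with its maximum value.
pooled = img.reshape(4, 7, 4, 7).max(axis=(1, 3))
print(pooled.shape)                          # (4, 4)
```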

So if we look at VGG, after our max pooling layer, we no longer have 224x224, we now have 112x112. Later on we do another max pooling, we end up with 56x56. Later on we do another max pooling and we end up with 28x28. So each time we do a max pooling we're reducing the resolution of our image.

As we're reducing the resolution, we need to increase the number of filters, otherwise we're losing information. So that's really why each time we have a max pooling, we then double the number of filters, because that means that at every layer we're keeping the same amount of information content. This leads to a very important insight, which is that a convolution is position invariant.

So in other words, this thing we created which is a top edge detector, we can apply that to any part of the image and get top edges from every part of the image. And earlier on when we looked at that Jason Yosinski video, it showed that there was a face detector which could find a face in any part of the image.

So this is fundamental to how a convolution works. A convolution is position invariant. It finds a pattern regardless of whereabouts in the image it is. Now that is a very powerful idea, because when we want to, say, find a face, we want to be able to find eyes. And we want to be able to find eyes regardless of whether the face is in the top left or the bottom right.

So position invariance is important, but also we need to be able to identify position to some extent because if there's four eyes in the picture, or if there's an eye in the top corner and the bottom corner, then something weird is going on, or if the eyes and the nose aren't in the right positions.

So how does a convolutional neural network both have this location invariant filter but also handle location? And the trick is that every one of the 3x3 filters cares deeply about where each of these 3x3 things is. And so as we go down through the layers of our model from 224 to 112 to 56 to 28 to 14 to 7, at each one of these stages (think about this stage which goes from 14x14 to 7x7), these filters are now looking at large parts of the image.

So it's now at a point where it can actually say there needs to be an eye here and an eye here and a nose here. So this is one of the cool things about convolutional neural networks. They can find features everywhere but they can also build things which care about how features relate to each other positionally.

So you get to do both. So do we need zero padding? Zero padding is literally something that sticks zeros around the outside of an image. If you think about what a convolution does, it's taking a 3x3 and moving it over an image. If you do that, when you get to the edge, what do you do?

Because at the very edge, you can't move your 3x3 any further. Which means if you only do what's called a valid convolution, which means you always make sure your 3x3 filter fits entirely within your image, you end up losing 2 pixels from the sides and 2 pixels from the top each time.

There's actually nothing wrong with that, but it's a little inelegant. It's kind of nice to be able to halve the size each time and be able to see exactly what's going on. So people tend to often like doing what's called same convolutions. So if you add a black border around the outside, then the result of your convolution is exactly the same size as your input.

That is literally the only reason to do it. In fact, this is a rather inelegant way of doing it - zero padding and then convolution. There's a parameter to nearly every library's convolution function where you can say "I want valid" or "full" or "half", which basically means: do you add no black pixels, one black pixel, or two black pixels, assuming it's 3x3.
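In Keras 1 terms (and assuming the Theano-style channel ordering used in the course), the two equivalent ways of getting a 'same' convolution look roughly like this sketch:

```python
from keras.models import Sequential
from keras.layers import ZeroPadding2D, Convolution2D

# Option 1: explicit zero padding, then a 'valid' convolution - the two-step
# approach the VGG definition above uses.
model1 = Sequential([
    ZeroPadding2D((1, 1), input_shape=(3, 224, 224)),
    Convolution2D(64, 3, 3, activation='relu'),    # output stays 224x224
])

# Option 2: let the convolution pad for you via its border mode parameter.
model2 = Sequential([
    Convolution2D(64, 3, 3, activation='relu',
                  border_mode='same', input_shape=(3, 224, 224)),
])
```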

So I don't quite know why this one does it this way. It's really doing two functions where one would have done, but it does the job. So there's no right answer to that question. The next question was: do neural networks work for cartoons? Neural networks work fine for cartoons. However, fine-tuning, which has been fundamental to everything we've learned so far, is going to be difficult from an ImageNet model to a cartoon.

Because an ImageNet model was built on all those pictures of corn we looked at and all those pictures of dogs we looked at. So an ImageNet model has learned to find the kinds of features that are in photos of objects out there in the world. And those are very different kinds of photos to what you see in a cartoon.

So if you want to be able to build a cartoon neural network, you'll need to either find somebody else who has already trained a neural network on cartoons and fine-tune that, or you're going to have to create a really big corpus of cartoons and create your own ImageNet equivalent.

So why doesn't an ImageNet network translate to cartoons given that an eye is a circle? Because the nuance level of a CNN is very high. It doesn't think of an eye as being just a circle. It knows that an eye very specifically has particular gradients and particular shapes and particular ways that the light reflects off it and so forth.

So when it sees a round blob there, it has no ability to abstract that out and say I guess they mean an eye. One of the big shortcomings of CNNs is that they can only learn to recognize things that you specifically give them to recognize. If you feed a neural net with a wide range of photos and drawings, maybe it would learn about that kind of abstraction.

To my knowledge, that's never been done. It would be a very interesting question. It must be possible. I'm just not sure how many examples you would need and what kind of architecture you would need. In this particular example, I used correlate, not convolution. One of the things we briefly mentioned in lesson 1 is that convolve and correlate are exactly the same thing, except convolve is equal to correlate of an image with a filter that has been rotated by 180 degrees.

So you can see that convolving the images with the filter rotated by 180 degrees looks exactly the same, and numpy.allclose is true. So convolve and correlate are identical, except that correlate is more intuitive: it applies the filter exactly as you wrote it, going across the rows and down the columns, whereas convolve flips the filter first.
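You can check that equivalence yourself with scipy - rotating the filter by 180 degrees is just flipping it along both axes:

```python
import numpy as np
from scipy.ndimage import convolve, correlate

img = np.random.rand(28, 28)
filt = np.array([[-1., -1., -1.],
                 [ 1.,  1.,  1.],
                 [ 0.,  0.,  0.]])

rotated = filt[::-1, ::-1]    # the filter rotated by 180 degrees

print(np.allclose(convolve(img, filt), correlate(img, rotated)))   # True
```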

So I tend to prefer to think about correlate because it's just more intuitive. Convolve originally came really from physics, I think, and it's also a basic math operation. There are various reasons that people sometimes find it more intuitive to think about convolution but in terms of everything that they can do in a neural net, it doesn't matter which one you're using.

In fact, many libraries let you set a parameter to true or false to decide whether or not internally it uses convolution or correlation. And of course the results are going to be identical. So let's go back to our CNN review. Our network architecture is a bunch of matrix products or, more generally, linear layers - and remember a convolution is just a subset of a matrix product, so it's also a linear layer - stacked with alternating nonlinear activation functions.

And specifically we looked at the activation function which was the rectified linear unit, which is just max(0, x). So that's an incredibly simple activation function, but it's by far the most common and it works really well for the internal parts of a neural network. I want to introduce one more activation function today, and you can read more about it in Lesson 2.

Let's go down here where it says About Activation Functions. And you can see I've got all the details of these activation functions here. I want to talk about one more. It's called the Softmax function, and Softmax is defined as follows: e^(x_i) divided by the sum of the e^(x_j). What is this all about?

Softmax is used not for the middle layers of a deep learning network, but for the last layer. The last layer of a neural network, if you think about what it's trying to do for classification, it's trying to match to a one-hot encoded output. Remember a one-hot encoded output is a vector with all zeros and just a 1 in one spot.

The spot is like we had for cats and dogs two spots, the first one was a 1 if it was a cat, the second one was a 1 if it was a dog. So in general, if we're doing classification, we want our output to have one high number and all the other ones be low.

That's going to be easier to create this one-hot encoded output. Furthermore, we would like to be able to interpret these as probabilities, which means all of the outputs have to add to 1. So we've got these two requirements here. Our final layer's activations should add to one, and one of them should be higher than all the rest.

This particular function does exactly that, and we will look at that by looking at a spreadsheet. So here is an example of what an output layer might contain. Here is e to the power of each of those things to the left. Here is the sum of those. And then here is the thing to the left divided by that sum - in other words, softmax.
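The same calculation as in that spreadsheet, written as a few lines of numpy (the input numbers are made up):

```python
import numpy as np

outputs = np.array([1.4, 2.7, 0.3, 2.1, 1.0])   # made-up final-layer activations

exps = np.exp(outputs)          # e to the power of each output
softmax = exps / exps.sum()     # each one divided by the sum of all of them

print(softmax)                  # all between 0 and 1, one clearly the biggest
print(softmax.sum())            # 1.0
```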

And you can see that we start with a bunch of numbers that are all of a similar kind of scale. And we end up with a bunch of numbers that sum to 1, and one of them is much higher than the others. So in general, when we design neural networks, we want to come up with architectures, by which I mean convolutions, fully connected layers, activation functions, we want to come up with architectures where replicating the outcome we want is as convenient as possible.

So in this case, our activation function for the last layer makes it pretty convenient, pretty easy, to come up with something that looks a lot like a one-hot encoded output. So the easier it is for our neural net to create the thing we want, the faster it's going to get there, and the more likely it is to get there in a way that's quite accurate.

So we've learned that any big enough, deep enough neural network, because of the Universal Approximation Theorem, can approximate any function at all. And we know that Stochastic Gradient Descent can find the parameters for any of these, which kind of leaves you thinking why do we need 7 weeks of neural network training?

Any architecture ought to work. And indeed that's true. If you have long enough, any architecture will work. Any architecture can translate Hungarian to English, any architecture can recognize cats versus dogs, any architecture can analyze Hillary Clinton's emails, as long as it's big enough. However, some of them do it much faster than others.

They train much faster than others. A bad architecture could take so long to train that it doesn't train in the amount of years you have left in your lifetime. And that's why we care about things like convolutional neural networks instead of just fully connected layers all the way through.

That's why we care about having a softmax at the last layer rather than just a linear last layer. So we try to make it as convenient as possible for our network to create the thing that we want to create. Yes, Rachel? So the first one was? Softmax, just like the other one, is about how Keras internally handles these matrices of data.

Any more information about that one? Honestly, I don't do theoretical justifications, I do intuitive justifications. There is a great book for theoretical justifications and it's available for free. If you just google for Deep Learning Book, or indeed go to deeplearningbook.org, it actually does have a fantastic theoretical justification of why we use softmax.

The short version basically is as follows. Softmax contains an exponential (e^x) in it, our log-loss layer contains a log in it, the two nicely mesh up against each other, and in fact the derivative of the two together is just a - b. So that's kind of the short version, but I will refer you to the Deep Learning Book for more information about the theoretical justification.

The intuitive justification is that because we have an exponential here, it makes a big number really, really big, and therefore once we divide each one by the sum of them all, we end up with one number that tends to be bigger than all the rest, and that is very close to the one-hot encoded output that we're trying to match.

Could a network learn identical filters? A network absolutely could learn identical filters, but it won't. The reason it won't is because it's not optimal to. Stochastic gradient descent is an optimization procedure. It will come up with, if you train it for long enough, with an appropriate learning rate, the optimal set of filters.

Having the same filter twice is never optimal, that's redundant. So as long as you start off with random weights, then it can learn to find the optimal set of filters, which will not include duplicate filters. These are all fantastic questions. In this review, we've done our different layers, and then these different layers get optimized with SGD.

Last week we learned about SGD by using this extremely simple example where we said let's define a function which is a line, ax + b. Let's create some data that matches a line, x's and y's. Let's define a loss function, which is the sum of squared errors. We now no longer know what a and b are, so let's start with some guess.

Obviously the loss is pretty high, and let's now try and come up with a procedure where each step makes the loss a little bit better by making a and b a little bit better. The way we did that was very simple. We calculated the derivative of the loss with respect to each of a and b, and that means that the derivative of the loss with respect to b is, if I increase b by a bit, how does the loss change?

And the derivative of the loss with respect to a means as I change a a bit, how does the loss change? If I know those two things, then I know that I should subtract the derivative times some learning rate, which is 0.01, and as long as our learning rate is low enough, we know that this is going to make our a guess a little bit better.

And we do the same for our b guess, it gets a little bit better. And so we learned that that is the entirety of SGD. We run that again and again and again, and indeed we set up something that would run it again and again and again in an animation loop and we saw that indeed it does optimize our line.
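Here is roughly what that looks like as a compact numpy sketch. The data, starting guesses and learning rate are illustrative; the lesson 2 notebook has the full version with the animation:

```python
import numpy as np

# Data that really does lie on a line (a=3, b=8), which we pretend not to know.
x = np.random.rand(50)
y = 3 * x + 8

a_guess, b_guess = -1.0, 1.0     # start with a bad guess
lr = 0.01                        # learning rate

def sse(y_true, y_pred):
    return ((y_true - y_pred) ** 2).sum()   # sum of squared errors

for _ in range(10000):
    y_pred = a_guess * x + b_guess
    # Derivatives of the loss with respect to a and b.
    dloss_da = (-2 * x * (y - y_pred)).sum()
    dloss_db = (-2 * (y - y_pred)).sum()
    # Nudge each guess a little bit in the direction that lowers the loss.
    a_guess -= lr * dloss_da
    b_guess -= lr * dloss_db

print(a_guess, b_guess)               # close to 3 and 8
print(sse(y, a_guess * x + b_guess))  # close to 0
```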

The tricky thing for me with deep learning is jumping from this kind of easy to visualize intuition. If I run this little derivative on these two things a bunch of times, it optimizes this line, I can then create a set of layers with hundreds of millions of parameters that in theory can match any possible function and it's going to do exactly the same thing.

So this is where our intuition breaks down, which is that this incredibly simple thing called SGD is capable of creating these incredibly sophisticated deep learning models. We really have to just respect our understanding of the basics of what's going on. We know it's going to work, and we can see that it does work.

But even when you've trained dozens of deep learning models, it's still surprising that it does work. It's always a bit shocking when you start without any ability to analyze some problem. You start with some random weights, you start with a general architecture, you throw some data in with SGD, and you end up with something that works.

Hopefully now it makes sense, you can see why that happens. But it takes doing it a few times to really intuitively understand, okay, it really does work. So one question about Softmax: could you use it for multi-class classification, and what about multi-label classification, where there are multiple correct answers? For multi-class classification, the answer is absolutely yes.

In fact, the example I showed here was such an example. So imagine that these outputs were for cat, dog, plane, fish, and building. So these might be what these 5 things represent. So this is exactly showing a Softmax for a multi-class output. You just have to make sure that your neural net has as many outputs as you want.

And to do that, you just need to make sure that the last weight layer in your neural net has as many columns as you want. The number of columns in your final weight matrix tells you how many outputs. Okay - multi-label, with multiple correct answers, is not the same as multi-class classification. So if you want to create something that is going to find more than one thing, then no.

Softmax would not be the best way to do that. I'm not sure if we're going to cover that in this set of classes. If we don't, we'll be doing it next year. Let's go back to the question about 3x3 filters, and more generally, how do we pick an architecture?

So the question was: the VGG authors used 3x3 filters, whereas the 2012 ImageNet winners used a combination of 7x7 and 11x11 filters. What has happened over the last few years since then is that people have realized that 3x3 filters are just better. The original insight for this was actually that Matt Zeiler visualization paper I showed you.

It's really worth reading that paper, because he really shows, by looking at lots of pictures of all the stuff going on inside a CNN, that it clearly works better when you have smaller filters and more layers. I'm not going to go into the theoretical justification as to why; for the sake of applying CNNs, all you need to know is that there's really no reason to use anything but 3x3 filters.

So that's a nice simple rule of thumb which always works: 3x3 filters. How many layers of 3x3 filters? This is where there is not any standard agreed-upon technique. Reading lots of papers, looking at lots of Kaggle winners, you will over time get a sense of, for a problem of this level of complexity, you need this many filters.

There have been various people that have tried to simplify this, but we're really still at a point where the answer is try a few different architectures and see what works. The same applies to this question of how many filters per layer. So in general, this idea of having 3x3 filters with max pooling and doubling the number of filters each time you do max pooling is a pretty good rule of thumb.

How many do you start with? You've kind of got to experiment. Actually, we're going to see today an example of how that works. If you had a much larger image, would you still want 3x3 filters? If you had a much larger image, what would you do? For example, on Kaggle, there's a diabetic retinopathy competition that has some pictures of eyeballs that are quite a high resolution.

I think they're a couple of thousand by a couple of thousand. The question of how to deal with large images is as yet unsolved in the literature. So if you actually look at the winners of that Kaggle competition, all of the winners resampled that image down to 512x512. So I find that quite depressing.

It's clearly not the right approach. I'm pretty sure I know what the right approach is. I'm pretty sure the right approach is to do what the eye does. The eye does something called foveation, which means that when I look directly at something, the thing in the middle is very high-res and very clear, and the stuff on the outside is not.

I think a lot of people are generally in agreement with the idea that we could come up with an architecture which has this concept of foveation, and then secondly we need something for which there are already some good techniques, called attentional models. An attentional model is something that says, "Okay, the thing I'm looking for is not in the middle of my view, but my low-res peripheral vision thinks it might be over there.

Let's focus my attention over there." And we're going to start looking at recurrent neural networks next week, and we can use recurrent neural networks to build attentional models that allow us to search through a big image to find areas of interest. That is a very active area of research, but as yet is not really finalized.

By the time this turns into a MOOC and a video, I wouldn't be surprised if that has been much better solved. It's moving very quickly. The Matt Zeiler paper showed larger filters because he was showing what AlexNet, the 2012 winner, looked like. Later on in the paper, he said: based on what it looks like, here are some suggestions about how to build better models.

So let us now finalize our review by looking at fine-tuning. So we learned how to do fine-tuning using the little VGG class that I built, which is one line of code, vgg.finetune. We also learned how to take the predictions of all the 1000 ImageNet categories and turn them into two predictions - just a cat or a dog - by building a simple linear model that took the 1000 ImageNet category predictions as input and the true cat and dog labels as output.

So here is that linear model. It's got 1000 inputs and 2 outputs. So we trained that linear model, it took less than a second to train, and we got 97.7% accuracy. So this was actually pretty effective. So why was it pretty effective to take 1000 predictions of is it a cat, is it a fish, is it a bird, is it a poodle, is it a pug, is it a plane, and turn it into a cat or is it a dog.

The reason that worked so well is because the original architecture, the ImageNet architecture, was already trained to do something very similar to what we wanted our model to do. We wanted our model to separate cats from dogs, and the ImageNet model already separated lots of different cats from different dogs from lots of other things as well.

So the thing we were trying to do was really just a subset of what ImageNet already does. So that was why starting with 1000 predictions and building the simple linear model worked so well. This week, you're going to be looking at the State Farm competition. And in the State Farm competition, you're going to be looking at pictures like this one, and this one, and this one.

And your job will not be to decide whether or not it's a person or a dog or a cat. Your job will be to decide is this person driving in a distracted way or not. That is not something that the original ImageNet categories included. And therefore this same technique is not going to work this week.

So what do you do if you need to go further? What do you do if you need to predict something which is very different to what the original model did? The answer is to throw away some of the later layers in the model and retrain them from scratch. And that's called fine-tuning.

And so that is pretty simple to do. So if we just want to fine-tune the last layer, we can just go model.pop, that removes the last layer. We can then say make all of the other layers non-trainable, so that means it won't update those weights, and then add a new fully connected layer, dense layer to the end with just our dog and cat, our two activations, and then go ahead and fit that model.
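Put together, that simplest kind of fine-tuning looks roughly like this - a sketch in the Keras 1 style used in the course, assuming model is the VGG Sequential model and batches/val_batches are the usual image generators from the notebooks; the optimizer and learning rate here are just illustrative choices:

```python
from keras.layers import Dense
from keras.optimizers import Adam

model.pop()                                 # remove the old 1000-way output layer
for layer in model.layers:
    layer.trainable = False                 # freeze everything that's left
model.add(Dense(2, activation='softmax'))   # new output layer: cat vs dog

model.compile(optimizer=Adam(lr=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=2,
                    validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
```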

So that is the simplest kind of fine-tuning. Remove the last layer, which previously was trying to predict 1000 possible categories, and replace it with a new last layer which we train. In this case, I've only run it for two epochs, so I'm not getting a great result.

But if we ran it for a few more, we would get a bit better than the 97.7 we had last time. When we look at State Farm, it's going to be critical to do something like this. So how many layers would you remove? Because you don't just have to remove one.

In fact, if you go back through your lesson 2 notebook, you'll see after this, I've got a section called Retraining More Layers. In it, we see that we can take any model and we can say okay, let's grab all the layers up to the nth layer. So in this case, we grab all the layers after the first fully connected layer and set them all to trainable.

And then what would happen if we tried running that model? So with Keras, we can tell Keras which layers we want to freeze and leave them at their ImageNet-decided weights, and which layers do we want to retrain based on the things that we're interested in. And so in general, the more different your problem is to the original ImageNet 1000 categories, the more layers you're going to have to retrain.
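And retraining more layers looks roughly like this, continuing the same sketch: find the first dense layer, freeze everything before it, and make everything from there on trainable (again, the learning rate is illustrative):

```python
from keras.layers import Dense
from keras.optimizers import Adam

layers = model.layers
# Index of the first dense (fully connected) layer.
first_dense_idx = [i for i, layer in enumerate(layers) if type(layer) is Dense][0]

for layer in layers[:first_dense_idx]:
    layer.trainable = False       # keep the ImageNet-decided weights
for layer in layers[first_dense_idx:]:
    layer.trainable = True        # these will be updated by SGD when we fit

# After changing trainable flags, recompile before fitting again - with a low
# learning rate, since these weights are already close to where we want them.
model.compile(optimizer=Adam(lr=0.0001),
              loss='categorical_crossentropy', metrics=['accuracy'])
```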

So how do you decide how far to go back in the layers? Two ways. Way number 1: intuition. So have a look at something like those Matt Zeiler visualizations to get a sense of what semantic level each of those layers is operating at. And go back to the point where you feel like that level of meaning is going to be relevant to your model.

Method number 2, experiment. It doesn't take that long to train another model starting at a different point. I generally do a bit of both. When I know dogs and cats are subsets of the ImageNet categories, I'm not going to bother generally training more than one replacement layer. For State Farm, I really had no idea.

I was pretty sure I wouldn't have to retrain any of the convolutional layers because the convolutional layers are all about spatial relationships. And therefore a convolutional layer is all about recognizing how things in space relate to each other. I was pretty confident that figuring out whether somebody is looking at a mobile phone or playing with their radio is not going to use different spatial features.

So for State Farm, I've really only looked at retraining the dense layers. And in VGG, there are actually only three dense layers - the two intermediate layers and the output layer - so I just trained all three. Generally speaking, the answer to this is try a few things and see what works the best.

When we retrain the layers, we do not set the weights randomly. We start the weights at their optimal ImageNet levels. That means that if you retrain more layers than you really need to, it's not a big problem because the weights are already at the right point. If you randomized the weights of the layers that you're retraining, that would actually kill the earlier layers as well if you made them trainable.

There's no point really setting them to random most of the time. We'll be learning a bit more about that after the break. So far, we have not reset the weights. When we say layer.trainable = True, we're just telling Keras that when you say fit, I want you to actually use SGD to update the weights in that layer.

When we come back, we're going to be talking about how to go beyond these basic five pieces to create models which are more accurate. Specifically, we're going to look at avoiding underfitting and avoiding overfitting. Next week, we're going to be doing half a class on review of convolutional neural networks and half a class of an introduction to recurrent neural networks which we'll be using for language.

So hopefully by the end of this class, you'll be feeling ready to really dig deep into CNNs during the week. This is really the right time this week to make sure that you're asking questions you have about CNNs because next week we'll be wrapping up this topic. Let's come back at 5 past 8.

So we have a lot to cover in our next 55 minutes. I think this approach - covering the new material quickly, you reviewing it in the lesson notebook and on the video by experimenting during the week, and then us reviewing it together the next week - is fine. I think that's a good approach.

But I just want to make you aware that the new material of the next 55 minutes will move pretty quickly. So don't worry too much if not everything sinks in straight away. If you have any questions, of course, please do ask. But also, recognize that it's really going to sink in as you study it and play with it during the week, and then next week we're going to review all of this.

So if it's still not making sense, and of course you've asked your questions on the forum, it's still not making sense, we'll be reviewing it next week. So if you don't retrain a layer, does that mean the layer remembers what gets saved? So yes, if you don't retrain a layer, then when you save the weights, it's going to contain the weights that it originally had.

That's a really important question. Why would we want to start out by overfitting? We're going to talk about that next. The last conv layer in VGG is a 7x7 output. There are 49 boxes and each one has 512 different things. That's kind of right, but it's not that it recognizes 512 different things.

When you have a convolution on a convolution on a convolution on a convolution on a convolution, you have a very rich function with hundreds of thousands of parameters. So it's not that it's recognizing 512 things, it's that there are 512 rich complex functions. And so those rich complex functions can recognize rich complex concepts.

So for example, we saw in the video that even in layer 5 there's a face detector which can recognize cat faces as well as human faces. So the later on we get in these neural networks, the harder it is to even say what it is that's being found, because they get more and more sophisticated and complex.

So what those 512 things do in the last layer of VGG, I'm not sure that anybody's really got to a point that they could tell you that. I'm going to move on. The next section is all about making our model better. So at this point, we have a model with an accuracy of 97.7%.

So how do we make it better? Now because we have started with an existing model, a VGG model, there are two reasons that you could be less good than you want to be. Either you're underfitting or you're overfitting. Underfitting means that, for example, you're using a linear model to try to do image recognition.

You're using a model that is not complex and powerful enough for the thing you're doing or it doesn't have enough parameters for the thing you're doing. That's what underfitting is. Overfitting means that you're using a model with too many parameters that you've trained for too long without using any of the techniques or without correctly using the techniques you're about to learn about, such that you've ended up learning what your specific training pictures look like rather than what the general patterns in them look like.

You will recognize overfitting if your training set has a much higher accuracy than your test set or your validation set. So that means you've learned how to recognize the contents of your training set too well. And so then when you look at your validation set you get a less good result.

So that's overfitting. I'm not going to go into detail on this because any of you who have done any machine learning have seen this before, so any of you who haven't, please look up overfitting on the internet, learn about it, ask questions about it. It is perhaps the most important single concept in machine learning.

So it's not that we're not covering it because it's not interesting, it's just that we're not covering it because I know a lot of you are already familiar with it. Underfitting we can see in the same way, but it's the opposite. If our training error is much lower than our validation error, then we're underfitting.

So I'm going to look at this now, because in fact you might have noticed that in all of our models so far, our training accuracy has been lower than our validation accuracy, which means we are underfitting. So how is this possible? And the answer to how this is possible is because the VGG network includes something called dropout, and specifically dropout with a p of 0.5.

What does dropout mean with a p of 0.5? It means that at this layer, which happens at the end of every fully connected block, it deletes 50% of the activations at random - it sets them to 0. That's what a dropout layer does: it sets half of the activations to 0 at random.

Why would it do that? Because when you randomly throw away bits of the network, it means that the network can't learn to overfit. It can't learn to build a network that just learns about your images, because as soon as it does, you throw away half of it and suddenly it's not working anymore.

So dropout is a fairly recent development, I think it's about three years old, and it's perhaps the most important development of the last few years. Because it's the thing that now means we can train big complex models for long periods of time without overfitting. Incredibly important. But in this case, it seems that we are using too much dropout.

So the VGG network, which used a dropout of 0.5, they decided they needed that much in order to avoid overfitting ImageNet. But it seems for our cats and dogs, it's underfitting. So what do we do? The answer is, let's try removing dropout. So how do we remove dropout? And this is where it gets fun.

We can start with our VGG fine-tuned model. And I've actually created a little function called VGG fine-tuned, which creates a VGG fine-tuned model with two outputs. It looks exactly like you would expect it to look. It creates a VGG model, it fine-tunes it, it returns it. What does fine-tune do?

It does exactly what we've learnt. It pops off the last layer, sets all the rest of the layers to non-trainable, and adds a new dense layer. So I just create a little thing that does all that. Every time I start writing the same code more than once, I stick it into a function and use it again in the future.

It's good practice. I then load the weights that I just saved in my last model, so I don't have to retrain it. So saving and loading weights is a really helpful way of avoiding refitting things. So already I now have a model that fits cats and dogs with 97.7% accuracy and underfits.

We can grab all of the layers of the model, enumerate through them, and find the last one which is a convolution (you can remind yourself of the structure with model.summary()). So at this point, we now have the index of the last convolutional layer.

It turns out to be 30. So we can now grab that last convolutional layer, and what we want to try is removing dropout from all of the layers after it. After the last convolutional layer come the dense layers.
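
Here is a sketch of that layer-splitting step (Keras 1.x; `model` is assumed to be the fine-tuned VGG model from above):

```python
from keras.models import Sequential
from keras.layers import Convolution2D

layers = model.layers
last_conv_idx = [i for i, layer in enumerate(layers)
                 if type(layer) is Convolution2D][-1]   # turns out to be 30
conv_layers = layers[:last_conv_idx + 1]   # everything up to and including it
fc_layers = layers[last_conv_idx + 1:]     # the dense (fully connected) block
conv_model = Sequential(conv_layers)       # a model of just the convolutional layers
```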

So this is a really important concept in the Keras library of playing around with layers. And so spend some time looking at this code and really look at the inputs and the outputs and get a sense of it. So you can see here, here are all the layers up to the last convolutional layer.

Here are all of the layers from the last convolutional layer. So all the fully connected layers and all the convolutional layers. I can create a whole new model that contains just the convolutional layers. Why would I do that? Because if I'm going to remove dropout, then clearly I'm going to want to fine-tune all of the layers that involve dropout.

That is, all of the dense layers. I don't need to fine-tune any convolutional layers because none of the convolutional layers have dropout. I'm going to save myself some time. I'm going to pre-calculate the output of the last convolutional layer. So you see this model I've built here, this model that contains all the convolutional layers.

If I pre-calculate the output of that, then that's the input to the dense layers that I want to train. So you can see what I do here is I say conv_model.predict with my validation batches, conv_model.predict with my batches, and that now gives me the output of the convolutional layer for my training and the output of it for my validation.

And because that's something I don't want to have to do again and again, I save it. So here I'm just going to go load_array and that's going to load the output of that from disk. And so I'm going to say train_features.shape, and this is always the first thing that you want to do when you've built something, is look at its shape.
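
A sketch of that precompute-and-save step (Keras 1.x; save_array and load_array are the bcolz-based helpers from the course utils, model_path is wherever you keep saved arrays, and the training batches need to be created with shuffle=False so the features stay aligned with the labels):

```python
val_features = conv_model.predict_generator(val_batches, val_batches.nb_sample)
trn_features = conv_model.predict_generator(batches, batches.nb_sample)

save_array(model_path + 'train_convlayer_features.bc', trn_features)
save_array(model_path + 'valid_convlayer_features.bc', val_features)

trn_features = load_array(model_path + 'train_convlayer_features.bc')
val_features = load_array(model_path + 'valid_convlayer_features.bc')
print(trn_features.shape)   # always check the shape of what you just built
```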

And indeed, it's what we would expect. It is 23,000 images, each one is 14x14, because I didn't include the final max pooling layer, with 512 filters. And so indeed, if we go model.summary, we should find that the last convolutional layer, here it is, 512 filters, 14x14 dimension. So we have basically built a model that is just a subset of VGG containing all of these earlier layers.

We've run it through our training set and our validation set, and we've got the outputs. So that's the stuff that we want to keep fixed, and so we don't want to recalculate it every time. So now we create a new model which is exactly the same as the dense part of VGG, but we replace the dropout p with 0.

So here's something pretty interesting, and I'm going to let you guys think about this during the week. How do you take the previous weights from VGG and put them into this model where dropout is 0? So if you think about it, before we had dropout of 0.5, so half the activations were being deleted at random.

So since half the activations are being deleted at random, now that I've removed dropout, I effectively have twice as many weights being active. Since I have twice as many weights being active, I need to take my ImageNet weights and divide them by 2. So I take my previous weights, copy them across to my new model, and divide each of them by 2. That means this new model is going to be exactly as accurate as my old model before I start training, but it has no dropout.
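
Here is a hedged sketch of that idea (Keras 1.x; trn_labels and val_labels are assumed to be the one-hot labels that go with the precomputed features, and the dense block mirrors VGG's):

```python
from keras.models import Sequential
from keras.layers import MaxPooling2D, Flatten, Dense, Dropout
from keras.optimizers import RMSprop

def proc_wgts(layer):
    # With p=0.5 dropout removed, twice as many activations stay active,
    # so halve the weights to keep the expected outputs the same.
    return [w / 2 for w in layer.get_weights()]

fc_model = Sequential([
    MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
    Flatten(),
    Dense(4096, activation='relu'),
    Dropout(0.),
    Dense(4096, activation='relu'),
    Dropout(0.),
    Dense(2, activation='softmax'),
])

# Copy across the existing weights, halving them as we go.
for new_layer, old_layer in zip(fc_model.layers, fc_layers):
    new_layer.set_weights(proc_wgts(old_layer))

fc_model.compile(optimizer=RMSprop(lr=1e-5),
                 loss='categorical_crossentropy', metrics=['accuracy'])
fc_model.fit(trn_features, trn_labels, nb_epoch=8, batch_size=64,
             validation_data=(val_features, val_labels))
```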

Is it wasteful to have in the cats and dogs model filters that are being learnt to find things like bookshelves? Potentially it is, but it's okay to be wasteful. The only place that it's a problem is if we are overfitting. And if we're overfitting, then we can easily fix that by adding more dropout.

So let's try this. We now have a model which takes the output of the convolutional layers as input, gives us our cats vs. dogs as output, and has no dropout. So now we can just go ahead and fit it. So notice that the input to this is my 512 x 14 x 14 inputs.

My outputs are my cats and dogs as usual, and train it for a few epochs. And here's something really interesting. Dense layers take very little time to compute. A convolutional layer takes a long time to compute. Think about it, you're computing 512 x 3 x 3 x 512 filters.

For each of 14 x 14 spots, that is a lot of computation. So in a deep learning network, your convolutional layers are where all of your computation is being taken up. So look, when I train just my dense layers, it's only taking 17 seconds. Super fast. On the other hand, the dense layers are where all of your memory is taken up.

Because between this 4096 layer and this 4096 layer, there are roughly 4000 x 4000 = 16 million weights. And the previous layer, which was 512 x 7 x 7 after max pooling, has 25,088 activations, so there are 25,088 x 4096 weights. So this is a really important rule of thumb. Your dense layers are where your memory is taken up.

Your convolutional layers are where your computation time is taken up. So it took me a minute or so to run 8 epochs. That's pretty fast. And holy shit, look at that! 98.5%. So you can see now, I am overfitting. But even though I'm overfitting, I am doing pretty damn well.

So overfitting is only bad if you're doing it so much that your accuracy is bad. So in this case, it looks like actually this amount of overfitting is pretty good. So for cats and dogs, this is about as good as I've gotten. And in fact, if I'd stopped it a little earlier, you can see it was really good.

In fact, the winner was 98.8, and here I've got 98.75. And there are some tricks I'll show you later that reliably squeeze out a bit more accuracy. So this would definitely have won cats and dogs if we had used this model. Question - Can you perform dropout on a convolutional layer?

You can absolutely perform dropout on a convolutional layer. And indeed, nowadays people normally do. I don't quite remember the VGG days. I guess that was 2 years ago. Maybe people in those days didn't. Nowadays, the general approach would be you would have dropout of 0.1 before your first layer, dropout of 0.2 before this one, 0.3, 0.4, and then finally dropout of 0.5 before your fully connected layers.

It's kind of the standard. If you then find that you're underfitting or overfitting, you can modify all of those probabilities by the same amount. If you dropout in an early layer, you're losing that information for all of the future layers, so you don't want to drop out too much in the early layers.

You can feel better dropping out more in the later layers. This is how you manually tune your overfitting or underfitting. Another way to do it would be to modify the architecture to have fewer or more filters, but that's actually pretty difficult to do. Question - So is the point that we didn't need dropout anyway?

Perhaps it was. But VGG comes with dropout. So when you're fine-tuning, you start with what you start with. We are overfitting here, so my hypothesis is that we maybe should try a little less dropout. But before we do, I'm going to show you some better tricks. The first trick I'm going to show you is a trick that lets you avoid overfitting without deleting information.

Dropout deletes information, so we don't want to do it unless we have to. So instead of dropout, here is a list. You guys should refer to this every time you're building a model that is overfitting. 5 steps. Step 1, add more data. This is a Kaggle competition, so we can't do that.

Step 2, use data augmentation, which we're about to learn. Step 3, use more generalizable architectures. We're going to learn that after this. Step 4, add regularization. That generally means dropout. There's another type of regularization which is where you basically add up all of your weights, the value of all of your weights, and then multiply it by some small number, and you add that to the loss function.

Basically you say having higher weights is bad. That's called either L2 regularization, if you take the square of your weights and add them up, or L1 regularization if you take the absolute value of your weights and add them up. Keras supports that as well. Also popular. I don't think anybody has a great sense of when you use L1 and L2 regularization and when you use dropout.
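
For reference, here is roughly what L2 (or L1) weight regularization looks like on a dense layer in Keras 1.x; the 0.01 penalty is just an illustrative value:

```python
from keras.regularizers import l2, l1
from keras.layers import Dense

Dense(4096, activation='relu', W_regularizer=l2(0.01))   # sum of squared weights added to the loss
Dense(4096, activation='relu', W_regularizer=l1(0.01))   # sum of absolute weights added to the loss
```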

I use dropout pretty much all the time, and I don't particularly see why you would need both, but I just wanted to let you know that that other type of regularization exists. And then lastly, if you really have to reduce architecture complexity, so remove some filters. But that's pretty hard to do if you're fine-tuning, because how do you know which filters to remove?

So really, now that we have dropout, the first four are what we do in practice. Like in random forests, where we randomly select subsets of variables at each point, that's kind of what dropout is doing. Dropout is randomly throwing away half the activations, so dropout and random forests both effectively create large ensembles.

It's actually a fantastic analogy between dropout and random forests. Just like going from decision trees to random forests was this huge step, which was basically to create lots of decision trees with some random differences, dropout is effectively and automatically creating lots of neural networks with different randomly selected subsets of features.

Data augmentation is very simple. Data augmentation is something which takes a cat and turns it into lots of cats. That's it. Actually, it does it for dogs as well. You can rotate, you can flip, you can move up and down, left and right, zoom in and out. And in Keras, you do it by passing arguments to the ImageDataGenerator, which up until now we've always created with empty parentheses.

Now we say all these other things. Flip it horizontally at random, zoom in a bit at random, shear at random, rotate at random, move it left and right at random, and move it up and down at random. So once you've done that, then when you create your batches, rather than doing it the way we did it before, you simply add that to your batches.

So we said, Ok, this is our data generator, and so when we create our batches, use that data generator, the augmenting data generator. Very important to notice, the validation set does not include that. Because the validation set is the validation set. That's the thing we want to check against, so we shouldn't be fiddling with that at all.
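
A sketch of what that looks like in Keras 1.x; the ranges here are illustrative rather than the notebook's exact values, and `path` is wherever your data lives:

```python
from keras.preprocessing import image

gen = image.ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                               height_shift_range=0.1, shear_range=0.05,
                               zoom_range=0.1, horizontal_flip=True)
batches = gen.flow_from_directory(path + 'train', target_size=(224, 224),
                                  batch_size=64, shuffle=True)

# The validation set gets a plain generator: no augmentation, no shuffling.
val_batches = image.ImageDataGenerator().flow_from_directory(
    path + 'valid', target_size=(224, 224), batch_size=64, shuffle=False)
```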

The validation set has no data augmentation and no shuffling. It's constant and fixed. The training set, on the other hand, we want to move it around as much as we can. So shuffle its order and add all these different types of augmentation. How much augmentation to use? This is one of the things that Rachel and I would love to automate.

For now, there are two methods. Method 1, use your intuition. The best way to use your intuition is to take one of your images, add some augmentation, and check whether they still look like cats. So if it's so warped that you're like, "Ok, nobody takes a photo of a cat like that," you've done it wrong.

So this is kind of like a small amount of data augmentation. Method 2, experiment. Try a range of different augmentations and see which one gives you the best results. If we add some augmentation, everything else is exactly the same, except we can't pre-compute anything anymore. So earlier on, we pre-computed the output of the last convolutional layer.

We can't do that now, because every time this cat approaches our neural network, it's a little bit different. It's rotated a bit, it's flipped, it's moved around or it's zoomed in and out. So unfortunately, when we use data augmentation, we can't pre-compute anything and so things take longer. Everything else is the same though.

So we grab our fully connected model, we add it to the end of our convolutional model, and this is the one with our dropout, compile it, fit it, and now rather than taking 9 seconds per epoch, it takes 273 seconds per epoch because it has to calculate through all the convolutional layers because of the data augmentation.
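
A sketch of that step (Keras 1.x; fc_model stands for whichever dense block, with dropout, you want on top, and batches/val_batches are the augmented and plain generators from above):

```python
from keras.optimizers import RMSprop

for layer in conv_model.layers:
    layer.trainable = False          # keep the convolutional weights fixed
for layer in fc_model.layers:
    conv_model.add(layer)            # bolt the dense block onto the conv model

conv_model.compile(optimizer=RMSprop(lr=1e-5),
                   loss='categorical_crossentropy', metrics=['accuracy'])
conv_model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=8,
                         validation_data=val_batches,
                         nb_val_samples=val_batches.nb_sample)
```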

So in terms of results here, we have not managed to get back up to that 98.7 accuracy. I might have if I'd run a few more epochs, but if I keep running them, I start overfitting again. So it's a little hard to tell, because my validation accuracy is moving around quite a lot because my validation set is a little bit on the small side.

It's a little bit hard to tell whether this data augmentation is helping or hindering. I suspect what we're finding here is that maybe we're doing too much data augmentation, so if I went back and reduced my different ranges by say half, I might get a better result than this.

But really, this is something to experiment with and I had better things to do than experiment with this. But you get the idea. Data augmentation is something you should always do. There's never a reason not to use data augmentation. The question is just what kind and how much. So for example, what kind?

Should you flip vertically? Clearly, for dogs and cats, no. You pretty much never see a picture of an upside down dog, so you wouldn't do vertical flipping in this particular problem. Would you do rotations? Yeah, you very often see cats and dogs that are kind of on their hind legs, or the photo's taken a little bit unevenly, or whatever.

You certainly would have zooming because sometimes you're close to the dog, sometimes further away. So use your intuition to think about what kind of augmentation. Yes? Question - What about data augmentation for color? That's an excellent point. So something I didn't add to this, but I probably should have, is that there is a channel shift parameter for the data generator in Keras.

And that will slightly change the colors. That's a great idea for natural images like these because you have different white balance, you have different lighting and so forth. And indeed I think that would be a great idea. So I hope during the week people will take this notebook and somebody will tell me what is the best result they've got.

And I bet that data augmentation will include some fiddling around with the colors. Question - What if you changed all the images to monochrome, that is, converted them to black and white? No, that wouldn't work as well, because the Kaggle competition test set is in color.

So if you're throwing away color, you're throwing away information. The Kaggle competition is asking: is this a cat or is this a dog? And part of telling whether something is a cat or a dog is looking at what color it is.

So if you're throwing away the color, you're making that harder. So yeah, you could run it on the test set and get answers, but they're going to be less accurate because you've thrown away information. Question - How is it working, given that you've removed the flatten layer between the conv block and the dense layers?

Okay, so what happened to the flattened layer? And the answer is that it was there. Where was it? Oh gosh. I forgot to add it back to this one. So I actually changed my mind about whether to include the flattened layer and where to put it and where to put max pooling.

It will come back later. So this is a slightly old version. Thank you for picking it up. Question - Could you do a form of dropout on the raw images by randomly blanking out pieces of the images? Yeah, so can you do dropout on the raw images? The simple answer is yes, you could.

There's no reason I can't put a dropout layer right here. And that's going to drop out raw pixels. It turns out that's not a good idea. Throwing away input information is very different to throwing away modeled information. Throwing away modeled information is letting you effectively avoid overfitting the model.

But you don't want to avoid overfitting the data. So you probably don't want to do that. Question - To clarify, the augmentation is at random? Yes. I just showed you 8 examples of the augmentation. So what the augmentation does is it says at random, rotate by up to 20 degrees, move by up to 10% in each direction, shear by up to 5%, zoom by up to 10%, and flip at random half the time.

So then I just said, OK, here are 8 cats. But what happens is every single time an image goes into the batch, it gets randomized. So effectively, it's an infinite number of augmented images. That doesn't have anything to do with data augmentation, so maybe we'll discuss that on a forum.

The final concept to learn about today is batch normalization. Batch normalization, like data augmentation, is something you should always do. Why didn't VGG do it? Because it didn't exist then. Batch norm is about a year old, maybe 18 months. Here's the basic idea. Anybody who's done any machine learning probably knows that one of the first things you want to do is take your input data, subtract its mean, and divide by its standard deviation.

Why is that? Imagine that our inputs were 40, minus 30, and 1, values at very different scales. You can see that the outputs end up all over the place: some of the intermediate values are really big, some are really small. So if we change a weight which impacts x_1, it's going to change the loss function by a lot, whereas if we change a weight which impacts x_3, it'll change the loss function by very little.

So the different weights have very different gradients, very different amounts that are going to affect the outcome. Furthermore, as you go further down through the model, that's going to multiply. Particularly when we're using something like softmax, which has an 'e' to the power of in it, you end up with these crazy big numbers.

So when you have inputs that are of very different scales, it makes the whole model very fragile, which means it is harder to learn the best set of weights and you have to use smaller learning rates. This is not just true of deep learning, it's true of pretty much every kind of machine learning model, which is why everybody who's been through the MSAN program here has hopefully learned to normalize their inputs.

So if you haven't done any machine learning before, no problem, just take my word for it: you always want to normalize your inputs. It's so common that pretty much all of the deep learning libraries will normalize your inputs for you with a single parameter. And indeed we're doing it in ours. Because pixel values only range from 0 to 255, you don't generally worry about dividing by the standard deviation with images, but you do generally worry about subtracting the mean.

So you'll see that the first thing that our model does is this thing called pre-process, which subtracts the mean. And the mean was something which basically you can look it up on the internet and find out what the mean of the ImageNet data is. So these three fixed values.
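
For reference, the kind of preprocessing function being described looks roughly like this; the three values are the commonly published per-channel ImageNet means, and the channel reversal matches the BGR ordering the original Caffe VGG weights expect:

```python
import numpy as np

vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3, 1, 1))

def vgg_preprocess(x):
    x = x - vgg_mean      # subtract the ImageNet mean from each channel
    return x[:, ::-1]     # reorder channels from RGB to BGR
```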

Now what's that got to do with batch norm? Well, imagine that somewhere along the line in our training, we ended up with one really big weight. Then suddenly one of our layers is going to have one really big number. And now we're going to have exactly the same problem as we had before, which is the whole model becomes very un-resilient, becomes very fragile, becomes very hard to train, going to be all over the place.

Some numbers could even get slightly out of control. So what do we do? Really what we want to do is to normalize not just our inputs but our activations as well. So you may think, OK, no problem, let's just subtract the mean and divide by the standard deviation for each of our activation layers.

Unfortunately that doesn't work. SGD is very bloody-minded. If it wants to increase one of the weights higher and you try to undo it by subtracting the mean and dividing by the standard deviation, the next iteration is going to try to make it higher again. So if SGD decides that it wants to make your weights of very different scales, it will do so.

So just normalizing the activation layers doesn't work. So batch norm is a really neat trick for avoiding that problem. Before I tell you the trick, I will just tell you why you want to use it. Because A) it's about 10 times faster than not using it, particularly because it often lets you use a 10 times higher learning rate, and B) because it reduces overfitting without removing any information from the model.

So these are the two things you want, less overfitting and faster models. I'm not going to go into detail on how it works. You can read about this during the week if you're interested. But a brief outline. First step, it normalizes the intermediate layers just the same way as input layers can be normalized.

It does the thing I just told you wouldn't work on its own, but it also does something else critical, which is it adds two more trainable parameters. One trainable parameter multiplies all the activations, and the other one is added to all the activations. So effectively that is able to undo that normalization.

Both of those two things are then incorporated into the calculation of the gradient. So the model now knows that it can rescale all of the weights if it wants to without moving one of the weights way off into the distance. And so it turns out that this does actually effectively control the weights in a really effective way.
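
Here is a rough numpy sketch of what a batch-norm layer computes for a mini-batch of activations, with the two trainable parameters (usually called gamma and beta) doing the rescaling and shifting just described:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                       # per-activation mean over the mini-batch
    var = x.var(axis=0)                       # per-activation variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize the activations
    return gamma * x_hat + beta               # trainable rescale and shift
```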

So that's what batch normalization is. The good news is, for you to use it, you just type batch normalization. In fact, you can put it after dense layers, you can put it after convolutional layers, you should put it after all of your layers. Here's the bad news, VGG didn't train originally with batch normalization, and adding batch normalization changes all of the weights.

I think that there is a way to calculate a new set of weights with batch normalization, I haven't gone through that process yet. So what I did today was I actually grabbed the entirety of ImageNet and I trained this model on all of ImageNet. And that then gave me a model which was basically VGG plus batch normalization.

And so that is the model here that I'm loading. So this is the ImageNet, whatever it is, large visual recognition competition 2012 dataset. And so I trained this set of weights on the entirety of ImageNet so that I created basically a VGG plus batch norm. And so then I fine-tuned the VGG plus batch norm model by popping off the end and adding a new dense layer.

And then I trained it, and these only took 6 seconds because I pre-calculated the inputs to this. Then I added data augmentation and I started training that. And then I ran out of time because it was class. So I think this was on the right track. I think if I had another hour or so, you guys can play with this during the week.

Because this is now like all the pieces together. It's batch norm and data augmentation and as much dropout as you want. So you'll see what I've got here is I have dropout layers with an arbitrary amount of dropout. And so in this, the way I set it up, you can go ahead and say create batch norm layers with whatever amount of dropout you want.

And then later on you can say I want you to change the weights to use this new amount of dropout. So this is kind of like the ultimate ImageNet fine-tuning experience. And I haven't seen anybody create this before, so this is a useful tool that didn't exist until today.

And hopefully during the week, we'll keep improving it. Interestingly, I found that when I went back to even 0.5 dropout, it was still massively overfitting. So it seems that batch normalization allows the model to be so much better at finding the optimum that I actually needed more dropout rather than less.

So anyway, as I said, this is all something I was doing today. So I haven't quite finalized that. What I will show you though is something I did finalize, which I did on Sunday, which is going through end-to-end an entire model-building process on MNIST. And so I want to show you this entire process and then you guys can play with it.

MNIST is a great way to really experiment with and revise everything we know about CNNs, because it's very fast to train, because they are only 28x28 images, and there are also extensive benchmarks on what are the best approaches to MNIST. So it's very, very easy to get started with MNIST because Keras actually contains a copy of MNIST.

So we can just go from keras.datasets import mnist, then mnist.load_data(), and we're done. Now MNIST images are grayscale, and everything in Keras in terms of the convolutional stuff expects there to be a number of channels. So we have to use expand_dims to add this empty dimension. So this is 60,000 images with one color channel, which are 28x28.

So if you try to use grayscale images and get weird errors, I'm pretty sure this is what you've forgotten to do, just to add this kind of empty dimension, which is you actually have to tell it there is one channel. Because otherwise it doesn't know how many channels are there.

So there is one channel. The other thing I had to do was take the y-values, the labels, and one-hot encode them. Because otherwise they were actual numbers: 5, 0, 4, 1, 9 and so on. We need to one-hot encode them so that each label becomes a vector with a 1 in the position of that digit. Remember, this is the thing that that softmax function is trying to approximate.

That's how the linear algebra works. So there are the two things I had to do to preprocess this. Add the empty dimension and do my one-hot encoding. Then I normalize the input by subtracting the mean and dividing by the standard deviation. And then I tried to build a linear model.
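
Before getting to the linear model, here is a sketch of that preprocessing (Keras 1.x, Theano-style channels-first image ordering):

```python
import numpy as np
from keras.datasets import mnist
from keras.utils.np_utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = np.expand_dims(X_train, 1)    # (60000, 1, 28, 28): add the single channel
X_test = np.expand_dims(X_test, 1)
y_train = to_categorical(y_train)       # one-hot encode the labels
y_test = to_categorical(y_test)

mean_px = X_train.mean().astype(np.float32)
std_px = X_train.std().astype(np.float32)
def norm_input(x): return (x - mean_px) / std_px   # normalize the inputs
```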

So I can't fine-tune from ImageNet now because ImageNet is 224x224 and this is 28x28. ImageNet is full color and this is grayscale. So we're going to start from scratch. So all of these are going to start from random. So a linear model needs to normalize the input and needs to flatten it because I'm not going to treat it as an image, I'm going to treat it as a single vector.

And then I create my one dense layer with 10 outputs, compile it, grab my batches, and train my linear model. And so you can see, generally speaking, the best way to train a model is to start by doing one epoch with a pretty low learning rate. So the default learning rate is 0.001, which is actually a pretty good default.

So you'll find nearly all of the time I just accept the default learning rate and I do a single epoch. And that's enough to get it started. Once you've got it started, you can set the learning rate really high. So 0.1 is about as high as you ever want to go, and do another epoch.

And that's going to move super fast. And then gradually, you reduce the learning rate by order of magnitude at a time. So I go to 0.01, do a few epochs, and basically keep going like that until you start overfitting. So I got down to the point where I had a 92.7% accuracy on the training, 92.4% on the test, and I was like, okay, that's about as far as I can go.
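
Putting that together, here is a sketch of the linear model and the learning-rate schedule just described (Keras 1.x, reusing the arrays and norm_input from the preprocessing sketch above; the exact epoch counts are illustrative):

```python
from keras.models import Sequential
from keras.layers import Lambda, Flatten, Dense
from keras.optimizers import Adam
from keras import backend as K

def get_lin_model():
    model = Sequential([
        Lambda(norm_input, input_shape=(1, 28, 28)),  # normalize the input
        Flatten(),                                    # treat each image as one long vector
        Dense(10, activation='softmax'),              # a single dense layer: a linear model
    ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

lm = get_lin_model()
lm.fit(X_train, y_train, nb_epoch=1, validation_data=(X_test, y_test))  # default lr of 0.001
K.set_value(lm.optimizer.lr, 0.1)    # crank the learning rate up once training has started
lm.fit(X_train, y_train, nb_epoch=1, validation_data=(X_test, y_test))
K.set_value(lm.optimizer.lr, 0.01)   # then come back down an order of magnitude at a time
lm.fit(X_train, y_train, nb_epoch=4, validation_data=(X_test, y_test))
```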

So that's a linear model. Not very interesting. So the next thing to do is to grab one extra dense layer in the middle, so one hidden layer. This is what in the 80s and 90s people thought of as a neural network, one hidden layer fully connected. And so that still takes 5 seconds to train.

Again, we do the same thing, one epoch with a low learning rate, then pop up the learning rate for as long as we can, gradually decrease it, and we get 94% accuracy. So you wouldn't expect a fully connected network to do that well. So let's create a CNN. So this was actually the first architecture I tried.

And basically I thought, okay, we know VGG works pretty well, so how about I create an architecture that looks like VGG, but it's much simpler because this is just 28x28. So I thought, okay, well VGG generally has a couple of convolutional layers of 3x3, and then a max pooling layer, and then a couple more with twice as many filters.

So I just tried that. So this is kind of like my inspired-by-VGG model. And I thought, okay, so after 2 lots of max pooling, it'll go from 28x28, to 14x14, to 7x7. Okay, that's probably enough. So then I added my 2 dense layers again. So I didn't use any science here, it's just kind of some intuition.

And it actually worked pretty well. After my learning rate of 0.1, I had an accuracy of 98.9%, validation accuracy of 99%. And then after a few epochs at 0.01, I had an accuracy of 99.75%. But look, my validation accuracy is only 99.2%. So look, I'm overfitting. So this is the trick.

Start by overfitting. Once you know you're overfitting, you know that you have a model that is complex enough to handle your data. So at this point, I was like, okay, this is a good architecture. It's capable of overfitting. So let's now try to use the same architecture and reduce overfitting, but reduce the complexity of the model no more than necessary.

So the first step from my 5-step list that I could actually use was data augmentation. So I added a bit of data augmentation, and then I used exactly the same model as I had before. And trained it for a while. And I found this time I could actually train it for even longer, as you can see.

And I started to get some pretty good results here, 99.3, 99.34. But by the end, you can see I'm massively overfitting again. 99.6 training versus 91.1 test. So data augmentation alone is not enough. And I said to you guys, we'll always use batch norm anyway. So then I add batch norm.

I use batch norm on every layer. Notice that when you use batch norm on convolution layers, you have to add axis=1. I am not going to tell you why. I want you guys to read the documentation about batch norm and try and figure out why you need this. And then we'll have a discussion about it on the forum because it's a really interesting analysis if you really want to understand batch norm and understand why you need this here.

If you don't care about the details, that's fine. Just know type axis=1 anytime you have batch norm. And so this is like a pretty good quality modern network. You can see I've got convolution layers, they're 3x3, and then I have batch norm, and then I have max pooling, and then at the end I have some dense layers.
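
A hedged sketch of that kind of network (Keras 1.x, Theano channels-first ordering, which is why the channel dimension is axis=1; the filter counts and dense size are plausible choices rather than the notebook's exact ones):

```python
from keras.models import Sequential
from keras.layers import Lambda, Flatten, Dense, Convolution2D, MaxPooling2D, BatchNormalization
from keras.optimizers import Adam

def get_bn_model():
    model = Sequential([
        Lambda(norm_input, input_shape=(1, 28, 28)),
        Convolution2D(32, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(32, 3, 3, activation='relu'),
        MaxPooling2D(),
        BatchNormalization(axis=1),
        Convolution2D(64, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(64, 3, 3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        BatchNormalization(),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dense(10, activation='softmax'),
    ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```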

This is actually a pretty decent looking model. Not surprisingly, it does pretty well. So I train it for a while at 0.1, I train it for a while at 0.01, I train it for a while at 0.001, and you can see I get up to 99.5%. That's not bad.

But by the end, I'm starting to overfit. So add a little bit of dropout. And remember what I said to you guys, nowadays the rule for dropout is to gradually increase it. I only had time yesterday to just try adding one layer of dropout right at the end, but as it happened, that seemed to be enough.

So when I just added one layer of dropout to the previous model, trained it for a while at 0.1, 0.01, 0.001, and it's like, oh great, my accuracy and my validation accuracy are pretty similar, and my validation accuracy is around 99.5 to 99.6 towards the end here. So I thought, okay, that sounds pretty good.

So 99.5 or 99.6% accuracy on handwriting recognition is pretty good, but there's one more trick you can do which makes every model better, and it's called ensembling. Ensembling refers to building multiple versions of your model and combining them together. So what I did was I took all of the code from that last section and put it into a single function.

So this is exactly the same model I had before, and these are the exact steps I talked through to train it, my learning rates of 0.1, 0.01, 0.001. So at the end of this, it returns a trained model. And so then I said, okay, 6 times fit a model and return a list of the results.

So models at the end of this contain 6 trained models using my preferred network. So then what I could do was to say, go through every one of those 6 models and predict the output for everything in my test set. So now I have 10,000 test images by 10 outputs by 6 models.

And so now I can take the average across the 6 models. And so now I'm basically saying here are 6 models, they've all been trained in the same way but from different random starting points. And so the idea is that they will be having errors in different places. So let's take the average of them, and I get an accuracy of 99.7%.
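
A sketch of that ensembling step; fit_model here stands for the function just described that builds and trains one copy of the network, and X_test/y_test come from the earlier preprocessing sketch:

```python
import numpy as np

models = [fit_model() for _ in range(6)]    # 6 copies, each from a different random start

all_preds = np.stack([m.predict(X_test, batch_size=256) for m in models])
avg_preds = all_preds.mean(axis=0)          # average the 6 sets of predictions

accuracy = (avg_preds.argmax(axis=1) == y_test.argmax(axis=1)).mean()
print(accuracy)
```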

How good is that? It's very good. It's so good that if we go to the academic list of the best MNIST results of all time, and many of these were specifically designed for handwriting recognition, it comes here. So one afternoon's work gets us in the list of the best results ever found on this dataset.

So as you can see, it's not rocket science, it's all stuff you've learned before, you've learned now, and it's a process which is fairly repeatable, can get you right up to the state of the art. So it was easier to do it on MNIST because I only had to wait a few seconds for each of my trainings to finish.

To get to this point on State Farm, it's going to be harder because you're going to have to think about how do you do it in the time you have available and how do you do it in the context of fine-tuning and stuff like that. But hopefully you can see that you have all of the tools now at your disposal to create literally a state of the art model.

So I'm going to make all of these notebooks available. You can play with them. You can try to get a better result from dogs and cats. As you can see, it's kind of like an incomplete thing that I've done here. I haven't found the best data augmentation, I haven't found the best dropout, I haven't trained it as long as I probably need to.

So there's some work for you to do. So here are your assignments for this week. This is all review now. I suggest you go back and actually read. There's quite a bit of prose in every one of these notebooks. Hopefully now you can go back and read that prose, and some of that prose at first was a bit mysterious, now it's going to make sense.

Oh, okay, I see what it's saying. And if you read something and it doesn't make sense, ask on the forum. Or if you read something and you want to check, oh, is this kind of another way of saying this other thing? Ask on the forum. So these are all notebooks that we've looked at already and you should definitely review.

Ask us something on the forum. Make sure that you can replicate the steps shown in the lesson notebooks we've seen so far using the technique in how to use the provided notebooks we looked at the start of class. If you haven't yet got into the top 50% of dogs vs cats, hopefully you've now got the tools to do so.

If you get stuck at any point, ask on the forum. And then this is your big challenge. Can you get into the top 50% of State Farm? Now this is tough. The first step to doing well in a Kaggle competition is to create a validation set that gives you accurate answers.

So create a validation set, and then make sure that the validation set accuracy is the same as you get when you submit to Kaggle. If you don't, you don't have a good enough validation set yet. Creating a validation set for State Farm is really your first challenge. It requires thinking long and hard about the evaluation section on that page and what that means.

And then it's thinking about which layers of the pre-trained network should I be retraining. I actually have read through the top 20 results from the competition, which closed 3 months ago. I actually think all of the top 20 methods are pretty hacky. They're pretty ugly. I feel like there's a better way to do this that's kind of in our grasp.

So I'm hoping that somebody is going to come up with a top 20 result for State Farm that is elegant. We'll see how we go. If not this year, maybe next year. Honestly, nobody in Kaggle quite came up with a really good way of tackling this. They've got some really good results, but with some really convoluted methods.

And then as you go through a review, please, any of these techniques that you're not clear about, these 5 pieces, please go and have a look at this additional information and see if that helps. Alright, that was a pretty quick run-through. I hope everything goes well and I will see you next week.