Back to Index

Lesson 4: Practical Deep Learning for Coders


Transcript

I guess I noticed during the week from some of the questions I've been seeing that the idea of what a convolution is, is still a little counter-intuitive or surprising to some people. I feel like the only way I know to teach things effectively is by creating a spreadsheet, so here we are.

This is the famous number 7 from lesson 0, and I just copied and pasted the numbers into a spreadsheet. They're not exactly 0, they're actually floats, just rounded off. And as you can see, I'm just using conditional coloring, you can see the shape of our little number 7 here.

So I wanted to show you exactly what a convolution does, and specifically what a convolution does in a deep learning neural network. So we are generally using modern convolutions, and that means a 3x3 convolution. So here is a 3x3 convolution, and I have just randomly generated 9 random numbers.

So that is a filter, there's one filter. Here is my second filter, it is 9 more random numbers. So this is what we do in Keras when we ask for a convolutional layer. We tell it, the first thing we pass it is how many filters do we want, and that's how many of these random matrices do we want it to build for us.

So in this case, it's as if I passed Convolution2D a first parameter of 2, and a second parameter of (3, 3), because it's a 3x3. And what happens to this little random matrix? In order to calculate the very first item, it takes the sum of the blue stuff, those 9, times the red stuff, those 9, all added together.

So let's go down here into where it gets a bit darker, how does this get calculated? This is equal to these 9 times these 9, when I say times, I mean element-wise times, so the top left by the top left, the middle by the middle, and so forth, and add them all together.

That's all a convolution is. So it's just as you go through, we take the corresponding 3x3 area in the image, and we multiply each of those 9 things by each of these 9 things, and then we add those 9 products together. That's it, that's a convolution. So there's really nothing particularly weird or confusing about it, and I'll make this available in class so you can have a look.
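To make that concrete, here is a minimal NumPy sketch of the same operation; the image and filter values are just made-up random numbers standing in for the spreadsheet's.

```python
import numpy as np

def conv2d_valid(img, filt):
    """Slide a filter over a 2D image: at each position, multiply the overlapping
    patch element-wise by the filter and add the products together."""
    fh, fw = filt.shape
    out_h, out_w = img.shape[0] - fh + 1, img.shape[1] - fw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * filt)
    return out

img = np.random.rand(28, 28)    # stand-in for the digit image in the spreadsheet
filt = np.random.randn(3, 3)    # one filter: 9 random numbers
print(conv2d_valid(img, filt).shape)   # (26, 26): one pixel lost on each edge
```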

You can see that when I get to the top left corner, I can't move further left and up because I've reached the edge, and this is why when you do a 3x3 convolution without zero padding, you lose one pixel on each edge because you can't push this 3x3 any further.

So if we go down to the bottom left, you can see again the same thing, it kind of gets stuck in the corner. So that's why you can see that my result is one row less than my starting point. So I did this for two different filters, so here's my second filter, and you can see when I calculate this one, it's exactly the same thing, it's these 9 times each of these 9 added together.

These are just 9 other random numbers. So that's how we start: in this case I've created two convolutional filters, and this is the output of those two convolutional filters; they're just random at this point. For my second layer, it's no longer enough just to have a 3x3 matrix; I need a 3x3x2 tensor, because to calculate the top left of my second convolutional layer, I need these 9 by these 9 added together, plus these 9 by these 9 added together.

Because at this point, my previous layer is no longer just one thing, but two things. And indeed, if our original picture was a 3-channel color picture, our very first convolutional filters would have had to be 3x3x3 tensors. So all of the convolutional layers from now on are going to be 3 x 3 x (number of filters in the previous layer) tensors.

So here is my first, I've just drawn it like this, 3x3x2 tensor, and you can see it's taking 9 from here, 9 from here and adding those two together. And so then for my second filter in my second layer, it's exactly the same thing. I've created two more random matrices, or one more random 3x3x2 tensor, and here again I have those 9 by these 9 sum plus those 9 by those 9 sum, and that gives me that one.

So that gives me my first two layers of my convolutional neural network. Then I do max pooling. Max pooling is slightly more awkward to do in Excel, but that's fine, we can still handle it. So here's max pooling. So max pooling, because I'm going to do 2x2 max pooling, it's going to decrease the resolution of my image by 2 on each axis.

So how do we calculate that number? That number is simply the maximum of those 4. And then that number is the maximum of those 4, and so forth. So with max pooling, we had two filters in the previous layer, so we still have two filters, but now our filters have half the resolution in each of the x and y axes.
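Here is the same 2x2 max pooling as a small NumPy sketch, on a made-up activation grid:

```python
import numpy as np

def maxpool2x2(act):
    """2x2 max pooling: each output cell is the max of a non-overlapping 2x2 block,
    so the grid is halved along each axis."""
    h, w = act.shape[0] // 2 * 2, act.shape[1] // 2 * 2   # trim an odd row/column if needed
    a = act[:h, :w]
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

act = np.random.rand(26, 26)      # stand-in for one filter's output
print(maxpool2x2(act).shape)      # (13, 13)
```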

And so then a question came up: we've done two convolutional layers, but how did you go from one matrix to two matrices in the second layer? How did I go from just this one thing to these two things? The answer is that I just created two random 3x3 filters.

This is my first random 3x3 filter, this is my second random 3x3 filter. So each output then was simply equal to each corresponding 9-element section, multiplied by each other, and added together. So because I had two random 3x3 matrices, I ended up with two outputs. Two filters means two sets of outputs.

Alright, so now that we've got our max pooling layer, let's use a dense layer to turn it into our output. So a dense layer means that every single one of our activations from our max pooling layer needs a random weight. So these are a whole bunch of random numbers.

So what I do is I take every one of those random numbers and multiply each one by a corresponding input and add them all together. So I've got the sum product of this and this. In MNIST we would have 10 activations because we need an activation for 0, 1, 2, 3, so forth up to 9.

So for MNIST we would need 10 sets of these dense weight matrices so that we could calculate the 10 outputs. If we were only calculating one output, this would be a perfectly reasonable way to do it. So for one output, it's just the sum product of everything from our final layer with a weight for each thing in that final layer, all added together.
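As a sketch, assuming two pooled 13x13 filter outputs as the input, a dense layer really is just a sum product per output:

```python
import numpy as np

flattened = np.random.rand(2 * 13 * 13)    # two pooled 13x13 filter outputs, flattened
weights = np.random.randn(flattened.size)  # one random weight per input activation
one_output = np.dot(flattened, weights)    # the "sum product": multiply and add

# For MNIST we would need 10 of these weight vectors, one per digit, i.e. a matrix.
weight_matrix = np.random.randn(flattened.size, 10)
ten_outputs = flattened @ weight_matrix    # 10 activations, one per class
print(one_output, ten_outputs.shape)
```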

So that's all a dense layer is. Both dense layers and convolutional layers couldn't be easier mathematically. I think the surprising thing is when you say rather than using random weights, let's calculate the derivative of what happens if we were to change that weight up by a bit or down by a bit, and how would it impact our loss.

In this case, I haven't actually got as far as calculating a loss function, but we could add over here a sigmoid loss, for example. And so we can calculate the derivative of the loss with respect to every single weight in the dense layer, and every single weight in all of our filters in that layer, and every single weight in all of our filters in this layer.

And then with all of those derivatives, we can calculate how to optimize all of these weights. And the surprising thing is that when we optimize all of these weights, we end up with these incredibly powerful models, like those visualizations that we saw. So I'm not quite sure where the disconnect between the incredibly simple math and the outcome is.

I think it might be that it's so easy, it's hard to believe that's all it is, but I'm not skipping over anything. That really is it. And so to help you really understand this, I'm going to talk more about SGD. Why would you use a sigmoid function here? Well, the output activation we generally use is softmax, which is e^(x_i) divided by the sum of e^(x_j) over all the outputs.

If it's just binary, that's the equivalent of having 1/(1 + e^(-x)). So softmax in the binary case simplifies into a sigmoid function. Thank you for clarifying that question. So I think this is super fun. We're going to talk about not just SGD, but every variant of SGD, including one invented just a week ago.
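As a quick numerical check of that softmax-to-sigmoid simplification, here is a small sketch: for two classes, softmax depends only on the difference between the two inputs and collapses to a sigmoid of that difference.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())     # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.7, -0.3])       # two arbitrary output activations
print(softmax(x)[0])            # probability of the first class
print(sigmoid(x[0] - x[1]))     # the same number
```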

So we've already talked about SGD. SGD happens for all layers at once; yes, we calculate the derivative of the loss with respect to all the weights. And when to have a max pool after a convolution versus when not to? Who knows.

This is a very controversial question, and indeed some people now say never use max pool. Instead of using max pool when you're doing the convolutions, don't do a convolution over every set of 9 pixels, but instead skip a pixel each time. And so that's another way of downsampling. Geoffrey Hinton, who is kind of the father of deep learning, has gone as far as saying that the extremely great success of max pooling has been the greatest problem deep learning has faced.

Because to him, it really stops us from going further. I don't know if that's true or not; I assume it is because he's Geoffrey Hinton and I'm not. For now, we use max pooling every time we're doing fine-tuning because we need to make sure that our architecture is identical to the original VGG authors' architecture, and so we have to put max pooling wherever they do.

Why do we want max pooling or downsampling or anything like that? Are we just trying to look at bigger features at the input? Why use max pooling at all? There's a couple of reasons. The first is that max pooling helps with translation invariance. So it basically says if this feature is here, or here, or here, or here, I don't care.

It's kind of roughly in the right spot. And so that seems to work well. And the second is exactly what you said. Every time we max pool, we end up with a smaller grid, which means that our 3x3 convolutions are effectively covering a larger part of the original image, which means that our convolutions can find larger and more complex features.

I think those would be the two main reasons. Is Geoffrey Hinton cool with the idea of skipping a pixel each time? You can learn all about the things that he thinks we ought to have but don't yet have. He did point out that one of the key pieces of deep learning that he invented, I can't remember what it was, took like 17 years from conception to working, so he is somebody who sticks with these things and makes them work.

Is max pooling unique to image processing? Max pooling is not unique to image processing. It's likely to be useful for any kind of convolutional neural network, and a convolutional neural network can be used for any kind of data that has some kind of consistent ordering. So things like speech, or any kind of audio, or some kind of consistent time series, all of these things have some kind of ordering to them, and therefore you can use a CNN, and therefore you can use max pooling.

And as we look at NLP, we will be looking more at convolutional neural networks for other data types. And interestingly, the author of Keras last week, or maybe the week before, made the contention that perhaps it will turn out that CNNs are the architecture that will be used for every type of ordered data.

And this was just after one of the leading NLP researchers released a paper basically showing a state-of-the-art result in NLP using convolutional neural networks. So although we'll start learning about recurrent neural networks next week, I have to be open to the possibility that they'll become redundant by the end of the year, but they're still interesting.

So SGD. So we looked at the SGD intro notebook, but I think things are a little more clear sometimes when you can see it all in front of you. So here is basically the identical thing that we saw in the SGD notebook in Excel. So we are going to start by creating a line.

We create 29 random numbers, and then we say okay, let's create something that is equal to 2 times x plus 30. And so here is 2 times x plus 30. So that's my input data. So I am trying to create something that can find the parameters of a line.

Now the important thing, and this is the leap, which requires not thinking too hard lest you realize how surprising and amazing this is. Everything we learn about how to fit a line is identical to how to fit filters and weights in a convolutional neural network. And so everything we learn about calculating the slope and the intercept, we will then use to let computers see.

And so the answer to any question of the form "why would this work?" is basically "why not?". This is a function that takes some inputs and calculates an output, that is a function that takes some inputs and calculates an output, so why not. The only reason it wouldn't work would be because it was too slow, for example.

And we know it's not too slow because we tried it and it works pretty well. So everything we're about to learn works for any kind of function which kind of has the appropriate types of gradients, and we can talk more about that later. But neural nets have the appropriate kinds of gradients.

So SGD, we start with a guess. What do we think the parameters of our function are, in this case the intercept and the slope? With Keras, they will be randomized using the Glorot initialization procedure, which draws random numbers scaled by the square root of 6 divided by (n_in + n_out). And I'm just going to say let's assume they're both 1.

We are going to use very, very small mini-batches here. Mini-batches are going to be of size 1, because it's easier to do in Excel and it's easier to see. But everything we're going to see would work equally well for a mini-batch of size 4 or 64 or 128 or whatever.

So here's our first row, our first mini-batch. Our input is 14 and our desired output is 58. And so our guesses for our parameters are 1 and 1. And therefore our predicted y value is equal to 1 plus 1 times 14, which is 15. Therefore, if we're doing root mean squared error, our error squared is prediction minus actual, squared.

So the next thing we do is we want to calculate the derivative of the loss with respect to each of our two parameters. One really easy way to do that is to add a tiny amount to each of the two parameters and see how the loss varies. So let's start by doing that.

So let's add 0.01 to our intercept and calculate the line and then calculate the loss squared. So this is the error if b is increased by 0.01. And then let's calculate the difference between that error and the actual error and then divide that by our change, which is 0.01.

And that gives us our estimated gradient. I'm calling it dE dB, for the error with respect to b; it probably should have been dL dB, for the loss. The change in loss with respect to b is -85.99. We can do the same thing for a. So we can add 0.01 to a, and then calculate our line, subtract our actual, take the square, and so there is our estimate of dL/dA: the difference from the actual loss, divided by 0.01.

And so there are two estimates of the derivative. This approach to estimating the derivative is called finite differencing. And any time you calculate a derivative by hand, you should always use finite differencing to make sure your calculation is correct. You're not very likely to ever have to do that, however, because all of the libraries do derivatives for you.

They do them analytically, not using finite differences. And so here are the derivatives calculated analytically, which you can do by going to Wolfram Alpha and typing in your formula and getting the derivative back. So this is the analytical derivative of the loss with respect to b, and the analytical derivative of the loss with respect to a.

And so you can see that our analytical and our finite difference are very similar for b and they are very similar for a. So that makes me feel comfortable that we got the calculation correct. So all SGD does is it says, okay, this tells us if we change our weights by a little bit, this is the change in our loss function.
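Here is the same check in a short NumPy sketch, using the mini-batch from the spreadsheet (x=14, y=58, a=b=1); both estimates of dL/dB come out around -86.

```python
x, y = 14.0, 58.0       # the first mini-batch from the spreadsheet
a, b = 1.0, 1.0         # initial guesses for slope and intercept

def loss(a, b):
    pred = a * x + b
    return (pred - y) ** 2

# Finite differencing: nudge each parameter by 0.01 and see how the loss moves.
eps = 0.01
dLdb_fd = (loss(a, b + eps) - loss(a, b)) / eps
dLda_fd = (loss(a + eps, b) - loss(a, b)) / eps

# Analytical derivatives of (a*x + b - y)^2.
pred = a * x + b
dLdb = 2 * (pred - y)
dLda = 2 * x * (pred - y)

print(dLdb_fd, dLdb)    # about -85.99 and -86: very similar
print(dLda_fd, dLda)    # about -1202 and -1204: very similar
```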

We know that increasing our value of b by a bit will decrease the loss function, and we know that increasing our value of a by a little bit will decrease the loss function. So therefore let's decrease both of them by a little bit. And the way we do that is to multiply the derivative times a learning rate, that's the value of a little bit, and subtract that from our previous guess.

So we do that for a, and we do that for b, and here are our new guesses. Now we're at 1.12 and 1.01, and so let's copy them over here, 1.12 and 1.01. And then we do the same thing, and that gives us a new a and a b.

And we keep doing that again and again and again until we've gone through the whole dataset, at the end of which we have a guess for a of 2.61 and a guess for b of 1.07. So that's one epoch. Now in real life, we would have shuffle=True, which means that these would be randomized.

So this isn't quite perfect, but apart from that, this is SGD with a mini-batch size of 1. So at the end of the epoch, we say this is our new slope, so let's copy 2.61 over here, and this is our new intercept. So let's copy 1.06 over here, and so now it starts again.

So we can keep doing that again and again and again. Copy the stuff from the bottom, stick it back at the top, and each one of these is going to be an epoch. So I recorded a macro with me copying this to the bottom and pasting it at the top, and added something that says for i = 1 to 5 around it.

And so now if I click Run, it will copy and paste it 5 times. And so you can see it's gradually getting closer. And we know that our goal is that it should be a = 2 and b = 30. So we've got as far as a = 2.5 and b = 1.3, so they're better than our starting point.
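The whole spreadsheet loop fits in a few lines of NumPy, as a sketch; the x values here are made up, since all that matters is that y = 2x + 30.

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(29) * 10       # 29 random inputs, as in the spreadsheet
y = 2 * x + 30                    # the line whose parameters we want to recover

a, b = 1.0, 1.0                   # initial guesses
lr = 0.001                        # learning rate

for epoch in range(5):            # plays the role of the "for i = 1 to 5" macro
    for xi, yi in zip(x, y):      # mini-batch size 1, no shuffling
        pred = a * xi + b
        dLda = 2 * xi * (pred - yi)    # analytical derivatives of the squared error
        dLdb = 2 * (pred - yi)
        a -= lr * dLda                 # step against the gradient
        b -= lr * dLdb
    print(epoch, a, b)            # a heads towards 2 quickly; b crawls towards 30
```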

And you can see our gradually improving loss function. But it's going to take a long time. Yes, Rachel? Question - Can we still do analytic derivatives when we are using nonlinear activation functions? Answer - Yes, we can use analytical derivatives as long as we're using a function that has an analytical derivative, which is pretty much every useful function you can think of. The exception would be something with an if-then statement in it, because it jumps from here to here, but even those you can approximate.

So a good example would be ReLU, which is max(0, x). Strictly speaking it doesn't really have a derivative at every point, or at least not a well-defined one, because this is what ReLU looks like. And so its derivative here is 0, and its derivative here is 1. What is its derivative exactly here?

Who knows? But the thing is, mathematicians care about that kind of thing; we don't. In real life, this is a computer, and computers are never exactly anything. We can either assume that it's an infinitesimal amount to this side, or an infinitesimal amount to that side, and who cares?
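In code, the practical answer is just to pick one side, as in this sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # 0 on the left of zero, 1 on the right; at exactly zero we just pick a side,
    # and in practice it never matters.
    return (x > 0).astype(float)

print(relu_grad(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 1.]
```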

So as long as it has a derivative that you can calculate in a meaningful way in practice on a computer, then it'll be fine. So one thing you might have noticed about this is that it's going to take an awfully long time to get anywhere. And so you might think, okay, let's increase the learning rate.

Fine, let's increase the learning rate. So let's get rid of one of these zeroes, oh dear, something went crazy. What went crazy? I'll tell you what went crazy, our a's and b's started to go out into like 11 million, which is not the correct answer. So how did it go ahead and do that?

Well here's the problem. Let's say this was the shape of our loss function, and this was our initial guess. And we figured out the derivative is going this way, actually the derivative is positive so we want to go the opposite direction, and so we step a little bit over here.

And then that leads us to here, and we step a little bit further, and this looks good. But then we increase the learning rate. So rather than stepping a little bit, we stepped a long way, and that put us here. And then we stepped a long way again, and that put us here.

If your learning rate is too high, you're going to get worse and worse. And that's what happened. So getting your learning rate right is critical to getting your thing to train at all. Exploding gradients, yeah, or you can even have gradients that do the opposite. Exploding gradients are something a little bit different, but it's a similar idea.

So it looks like 0.001 is the best we can do, and that's a bit sad because this is really slow. So let's try and improve it. So one thing we could do is say, well, given that every time we've been -- actually let me do this in a few more dimensions.

So let's say we had a 3-dimensional set of axes now, and we kind of had a loss function that looks like this kind of valley. And let's say our initial guess was somewhere over here. So over here, the gradient is pointing in this direction. So we might make a step and end up there.

And then we might make another step which would put us there, and another step that would put us there. And this is actually the most common thing that happens in neural networks. Something that's kind of flat in one dimension like this is called a saddle point. And it's actually been proved that the vast majority of the space of a loss function in a neural network is pretty much all saddle points.

So when you look at this, it's pretty obvious what should be done, which is if we go to here and then we go to here, we can say on average, we're kind of obviously heading in this direction. Especially when we do it again, we're obviously heading in this direction.

So let's take the average of how we've been going so far and do a bit of that. And that's exactly what momentum does. If ReLU isn't the cost function, why are we concerned with its differentiability? Because we care about the derivative of the loss with respect to the things that feed into it, and those include the filters; remember, the loss function consists of a function of a function of a function of a function.

So it is categorical cross-entropy loss applied to softmax, applied to ReLU, applied to dense layer, applied to max pooling, applied to ReLU, applied to convolutions, etc. So in other words, to calculate the derivative of the loss with respect to the inputs, you have to calculate the derivative through that whole function.

And this is what's called backpropagation. Backpropagation makes it easy to calculate that derivative, because we know from the chain rule that the derivative of a function of a function is simply equal to the product of the derivatives of those functions. So in practice, all we do is we calculate the derivative of every layer with respect to its inputs, and then we just multiply them all together.

And so that's why we need to know the derivative of the activation layers as well as the loss layer and everything else. So here's the trick. What we're going to do is we're going to say, every time we take a step, we're going to also calculate the average of the last few steps.

So after these two steps, the average is this direction. So the next step, we're going to take our gradient step as usual, and we're going to add on our average of the last few steps. And that means that we end up actually going to here. And then we do the same thing again.

So we find the average of the last few steps, and it's now even further in this direction, and so this is the surface of the loss function with respect to some of the parameters, in this case just a couple of parameters, it's just an example of what a loss function might look like.

So this is the loss, and this is some weight number 1, and this is some weight number 2. So we're trying to get our little, if you can imagine this is like gravity, we're trying to get this little ball to travel down this valley as far down to the bottom as possible.

And so the trick is that we're going to keep taking a step, not just the gradient step, but also the average of the last few steps. And so in practice, this is going to end up kind of going "donk, donk, donk, donk, donk." That's the idea. So to do that in Excel is pretty straightforward.

To make things simpler, I have removed the finite-differencing-based derivatives here, so we just have the analytical derivatives. But other than that, this is identical to the previous spreadsheet. Same data, same predictions, same derivatives, except we've done one extra thing, which is that when we calculate our new b, we say it's our previous b minus our learning rate times, not our gradient, but this cell.

And what is that cell? That cell is equal to our gradient times 0.1 plus the thing just above it times 0.9, and the thing just above it is equal to its gradient times 0.1 plus the thing just above it times 0.9, and so forth. So in other words, this column is keeping track of an average derivative of the last few steps that we've taken, which is exactly what we want.
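That column translates into one extra line in the toy line-fitting loop, as in this sketch (same made-up data as before):

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(29) * 10        # made-up data for the same toy line-fitting problem
y = 2 * x + 30
a, b = 1.0, 1.0
lr, beta = 0.001, 0.9              # beta is the momentum factor
v_a, v_b = 0.0, 0.0                # running averages of the gradients

for xi, yi in zip(x, y):
    pred = a * xi + b
    dLda, dLdb = 2 * xi * (pred - yi), 2 * (pred - yi)
    v_a = (1 - beta) * dLda + beta * v_a    # the spreadsheet's "gradient times 0.1
    v_b = (1 - beta) * dLdb + beta * v_b    # plus the cell above times 0.9"
    a -= lr * v_a                           # step along the averaged gradient
    b -= lr * v_b
```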

And we do that for both of our two parameters. So this 0.9 is our momentum parameter. So in Keras, when you use momentum, you can say momentum = and you say how much momentum you want. Where did that beta come from? You just pick it. You just pick what you want that parameter to be.

Just like your learning rate, you pick it, your momentum factor, you pick it. It's something you get to choose. And you choose it by trying a few and find out what works best. So let's try running this, and you can see it is still not exactly zipping along. Why is it not exactly zipping along?

Well the reason when we look at it is that we know that the constant term needs to get all the way up to 30, and it's still way down at 1.5. It's not moving fast enough, whereas the slope term moved very quickly to where we want it to be.

So what we really want is we need different learning rates for different parameters. And doing this is called dynamic learning rates. And the first really effective dynamic learning rate approaches have just appeared in the last 3 years or so. And one very popular one is called Adagrad, and it's very simple.

All of these dynamic learning rate approaches share the same insight, which is this. If the derivative of the parameter I'm changing is consistently of a very low magnitude, and the derivative on this mini-batch is higher than that, then what I really care about is the relative difference between how much this parameter tends to change and how much it's going to change this time around.

So in other words, we don't just care about what's the gradient, but is the magnitude of the gradient a lot more or a lot less than it has tended to be recently? So the easy way to calculate the overall amount of change of the gradient recently is to keep track of the square of the gradient.

So what we do with Adagrad is, you can see at the bottom of my epoch here, I have got a sum of the squares of all of my gradients. And then I have taken the square root, so I've got the root of the sum of the squares, and then I've just divided it by the count to get the average.

So this is roughly the root mean square of my gradients. So this number here will be high if the magnitude of my gradients is high. And because they're squared, it will be particularly high if sometimes they're really high. So why is it okay to just use a mini-batch, since the surface is going to depend on what points are in your mini-batch?

It's not ideal to just use a mini-batch, and we will learn about a better approach to this in a moment. But for now, let's look at this. In fact, there are two related approaches, Adagrad and Adadelta; one of them actually does this for all of the gradients so far, and one of them uses a slightly more sophisticated approach.

This approach of doing it on a mini-batch-by-mini-batch basis is slightly different to either, but it's similar enough to explain the concept. Does this mean for a CNN, would dynamic learning rates mean that each filter would have its own learning rate? It would mean that every parameter has its own learning rate.

So this is one parameter, that's a parameter, that's a parameter, that's a parameter. And then in our dense layer, that's a parameter, that's a parameter, that's a parameter. So when you go model.summary in Keras, it shows you for every layer how many parameters there are. So anytime you're unclear on how many parameters there are, you can go back and have a look at these spreadsheets, and you can also look at the Keras model.summary and make sure you understand how they turn out.

So for the first layer, it's going to be the size of your filter times the number of your filters, if it's just a grayscale. And then after that, the number of parameters will be equal to the size of your filter times the number of filters coming in times the number of filters coming out.

And then of course your dense layers will be every input goes to every output, so the number of inputs times the number of outputs, each one a parameter of the function that is calculating whether it's a cat or a dog. So what we do now is we say this number here, 1857, is saying that the derivative of the loss with respect to the slope varies a lot, whereas the derivative of the loss with respect to the intercept doesn't vary much at all.

So at the end of every epoch, I copy that up to here. And then I take my learning rate and I divide it by that. And so now for each of my parameters, I now have this adjusted learning rate, which is the learning rate divided by the recent sum of squares average gradient.
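Here is a sketch of that epoch-by-epoch scheme on the toy problem. Note this is the spreadsheet's variant of the idea; the published Adagrad accumulates the squared gradients over all of training rather than averaging one epoch at a time.

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(29) * 10       # made-up data for the toy line-fitting problem
y = 2 * x + 30
a, b = 1.0, 1.0
lr = 0.05                         # base learning rate

# Initialise the per-parameter rates from the gradients at the starting point,
# playing the role of the previous epoch's numbers copied up to the top.
g_a, g_b = 2 * x * (a * x + b - y), 2 * (a * x + b - y)
lr_a, lr_b = lr / np.sqrt(np.mean(g_a ** 2)), lr / np.sqrt(np.mean(g_b ** 2))

for epoch in range(5):
    sq_a, sq_b = [], []
    for xi, yi in zip(x, y):
        pred = a * xi + b
        dLda, dLdb = 2 * xi * (pred - yi), 2 * (pred - yi)
        a -= lr_a * dLda
        b -= lr_b * dLdb
        sq_a.append(dLda ** 2)
        sq_b.append(dLdb ** 2)
    # The slope's gradients are much larger than the intercept's, so the slope
    # ends up with a much smaller learning rate and the intercept a much larger one.
    lr_a, lr_b = lr / np.sqrt(np.mean(sq_a)), lr / np.sqrt(np.mean(sq_b))
    print(epoch, a, b, lr_a, lr_b)
```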

And so you can see that now one of my learning rates is 100 times faster than the other one. And so let's see what happens when I run this. Question - Is there a relationship with normalizing the input data? Answer - Not really. Normalizing the input data can help, but if your inputs are on very different scales, it's still a lot more work for the optimizer to do.

So yes it helps, but it doesn't help so much that it makes it useless, and in fact it turns out that even with dynamic learning rates, not just normalized inputs, but batch normalized activations is extremely helpful. And so the thing about when you're using Adagrad or any kind of dynamic learning rates is generally you'll set the learning rate quite a lot higher, because remember you're dividing it by this recent average.

So if I set it to 0.1, oh, too far, so that's no good. So let's try 0.05, run that. So you can see after just 5 steps, I'm already halfway there. Another 5 steps, getting very close, and another 5 steps, and it's exploded. Now why did that happen? Because as we get closer and closer to where we want to be, you can see that you need to take smaller and smaller steps.

And by keeping the learning rates the same, it meant that eventually we went too far. So this is still something you have to be very careful of. A more elegant, in my opinion, approach to the same thing that Adagrad is doing is something called RMSprop. And RMSprop was first introduced in Geoffrey Hinton's Coursera course.

So if you go to the Coursera course, in one of those classes he introduces RMSprop. It's quite funny nowadays, because this comes up in academic papers a lot, and when people cite it, they have to cite the Coursera course, chapter 6, at minute 14 and 30 seconds. But Hinton has asked that this be the official way to cite it, so there you go.

You see how cool he is. So here's what RMSprop does. What RMSprop does is exactly the same thing as momentum, but instead of keeping track of the weighted running average of the gradients, we keep track of the weighted running average of the square of the gradients. So here it is.

Everything here is the same as momentum so far, except that I take my gradient squared, multiply it by 0.1, and add it to my previous cell times 0.9. So this is keeping track of the recent running average of the squares of the gradients. And when I have that, I do exactly the same thing with it that I did in Adagrad, which is to divide the learning rate by it.

So I take my previous guess as to b and then I subtract from it my derivative times the learning rate divided by the square root of the recent weighted average of the square gradients. So it's doing basically the same thing as Adagrad, but in a way that's doing it kind of continuously.
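As a sketch on the same toy problem, RMSprop is the momentum bookkeeping applied to the squared gradients, with the step divided by its square root:

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(29) * 10       # same made-up line-fitting data
y = 2 * x + 30
a, b = 1.0, 1.0
lr, beta, eps = 0.1, 0.9, 1e-8
s_a, s_b = 0.0, 0.0               # running averages of the squared gradients

for epoch in range(10):
    for xi, yi in zip(x, y):
        pred = a * xi + b
        dLda, dLdb = 2 * xi * (pred - yi), 2 * (pred - yi)
        s_a = beta * s_a + (1 - beta) * dLda ** 2
        s_b = beta * s_b + (1 - beta) * dLdb ** 2
        a -= lr * dLda / (np.sqrt(s_a) + eps)   # divide the step by the recent RMS gradient
        b -= lr * dLdb / (np.sqrt(s_b) + eps)
    print(epoch, a, b)   # heads towards a=2, b=30, then hovers around them rather than exploding
```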

So these are all different types of learning rate optimization? These last two are different types of dynamic learning rate approaches. So let's try this one. If we run it for a few steps, and again you have to guess what learning rate to start with, say 0.1. So as you can see, this is going pretty well.

And I'll show you something really nice about RMSprop, which is what happens as we get very close. We know the right answer is 2 and 30. Is it about to explode? No, it doesn't explode. And the reason it doesn't explode is because it's recalculating that running average every single minibatch.

And so rather than waiting until the end of the epoch by which stage it's gone so far that it can't come back again, it just jumps a little bit too far and then it recalculates the dynamic learning rates and tries again. So what happens with RMSprop is if your learning rate is too high, then it doesn't explode, it just ends up going around the right answer.

And so when you use RMSprop, as soon as you see your validation scores flatten out, you know this is what's going on, and so therefore you should probably divide your learning rate by 10. And you see me doing this all the time. When I'm running Keras stuff, you'll keep seeing me run a few steps, divide the learning rate by 10, run a few steps, and you don't see that my loss function explodes, you just see that it flattens out.

So do you want your learning rate to get smaller and smaller? Yeah, you do. Your very first learning rate often has to start small, and we'll talk about that in a moment, but once you've kind of got started, you generally have to gradually decrease the learning rate. That's called learning rate annealing.

And can you repeat what you said earlier, that something does the same thing as Adagrad, but... So RMSprop, which we're looking at now, does exactly the same thing as Adagrad, which is divide the learning rate by the root sum of squares of the gradients, but rather than doing it over everything since the beginning of time, or over a whole mini-batch or epoch, RMSprop does it continuously, using the same technique that we learned from momentum: take the square of this gradient, multiply it by 0.1, and add it to 0.9 times the last calculation.

That's called a moving average. It's a weighted moving average, where we're weighting it such that the more recent squared gradients are weighted higher. I think it's actually an exponentially weighted moving average, to be more precise. So there's something pretty obvious we could do here, which is momentum seems like a good idea, RMSprop seems like a good idea, why not do both?

And that is called Adam. And so Adam was invented last year, 18 months ago, and hopefully one of the things you see from these spreadsheets is that these recently invented things are still at the ridiculously extremely simple end of the spectrum. So the stuff that people are discovering in deep learning is a long, long, long way away from being incredibly complex or sophisticated.

And so hopefully you'll find this very encouraging, which is if you want to play at the state-of-the-art of deep learning, that's not at all hard to do. So let's look at Adam, which I remember it coming out 12-18 months ago, and everybody was so excited because suddenly it became so much easier and faster to train neural nets.

But once I actually tried to create an Excel spreadsheet out of it, I realized, oh my god, it's just RMSprop plus momentum. And so literally all I did was I copied my momentum page and then I copied across my RMSprop columns and combined them. So you can see here I have my exponentially weighted moving average of the gradients, that's what these two columns are.

Here is my exponentially weighted moving average of the squares of the gradients. And so then when I calculate my new parameters, I take my old parameter and I subtract not my derivative times the learning rate, but my momentum factor. So in other words, the recent weighted moving average of the gradients multiplied by the learning rate divided by the recent moving average of the squares of the derivatives, or the root of them anyway.
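Combining the two columns gives this sketch of the Adam update on the toy problem (the published Adam also bias-corrects the two moving averages, which is left out here for brevity):

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(29) * 10       # same made-up line-fitting data
y = 2 * x + 30
a, b = 1.0, 1.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.9, 1e-8
m_a = m_b = 0.0                   # moving averages of the gradients (momentum)
s_a = s_b = 0.0                   # moving averages of the squared gradients (RMSprop)

for epoch in range(10):
    for xi, yi in zip(x, y):
        pred = a * xi + b
        dLda, dLdb = 2 * xi * (pred - yi), 2 * (pred - yi)
        m_a = beta1 * m_a + (1 - beta1) * dLda
        m_b = beta1 * m_b + (1 - beta1) * dLdb
        s_a = beta2 * s_a + (1 - beta2) * dLda ** 2
        s_b = beta2 * s_b + (1 - beta2) * dLdb ** 2
        a -= lr * m_a / (np.sqrt(s_a) + eps)    # momentum step divided by the RMS gradient
        b -= lr * m_b / (np.sqrt(s_b) + eps)
    print(epoch, a, b)
```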

So it's literally just combining momentum plus RMSprop. And so let's see how that goes. Let's run 5 epochs, and we can use a pretty high learning rate now because it's really handling a lot of stuff for us. And 5 epochs, we're almost perfect. And so another 5 epochs does exactly the same thing that RMSprop does, which is it goes too far and tries to come back.

So we need to do the same thing when we use Adam, and Adam is what I use all the time now. I just divide by 10 every time I see it flatten out. So a week ago, somebody came out with something that they called not Adam, but Eve. And Eve is an addition to Adam which attempts to deal with this learning rate annealing automatically.

And so all of this is exactly the same as my Adam page. But at the bottom, I've added some extra stuff. I have kept track of the root mean squared error, this is just my loss function, and then I copy across my loss function from my previous epoch and from the epoch before that.

And what Eve does is it says how much has the loss function changed. And so it's got this ratio between the previous loss function and the loss function before that. So you can see it's the absolute value of the last one minus the one before divided by whichever one is smaller.

And what it says is, let's then adjust the learning rate such that instead of just using the learning rate that we're given, let's adjust the learning rate that we're given. We take the exponentially weighted moving average of these ratios, so you can see another of these betas appearing here, so this thing here is equal to our last ratio times 0.9 plus our new ratio times 0.1.

And so then for our learning rate, we divide the learning rate from Adam by this. So what that says is if the loss function is moving around a lot, if it's very bumpy, we should probably decrease the learning rate because it's going all over the place. Remember how we saw before, if we've kind of gone past where we want to get to, it just jumps up and down.

On the other hand, if the loss function is staying pretty constant, then we probably want to increase the learning rate. So that all seems like a good idea, and so again let's try it. Not bad; after 5 epochs it's gone a little bit too far. I used this on State Farm a lot during the week; I grabbed a Keras implementation which somebody wrote a day after the paper came out.
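Before getting to the problem I ran into, here is the adjustment just described as a rough sketch. It follows the description above rather than the exact formulation in the Eve paper, and `epoch_losses` is just an assumed list of the loss at the end of each epoch.

```python
def eve_adjust(base_lr, d, epoch_losses, beta=0.9):
    """Scale the Adam learning rate by how bumpy the loss has been recently."""
    prev, last = epoch_losses[-2], epoch_losses[-1]
    ratio = abs(last - prev) / min(last, prev)   # relative change in the loss
    d = beta * d + (1 - beta) * ratio            # exponentially weighted moving average
    return base_lr / d, d                        # bumpy loss: smaller steps; flat loss: bigger steps

# usage after each epoch, starting with d = 1.0:
# effective_lr, d = eve_adjust(base_lr, d, epoch_losses)
```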

The problem is that because it can both decrease and increase the learning rate, sometimes as it gets down to the flat bottom point where it's pretty much optimal, it will often be the case that the loss gets pretty constant at that point. And so therefore, Eve will try to increase the learning rate.

And so what I tended to find was that it would very quickly get pretty close to the answer, and then suddenly it would jump to somewhere really awful. And then it would start to get to the answer again and jump somewhere really awful. Have we used some kind of stopping criterion? We have not done any such thing, no.

We have always run for a specific number of epochs. We have not defined any kind of stopping criterion. It is possible to define such a stopping criterion, but nobody's really come up with one that's remotely reliable. And the reason why is that when you look at the graph of loss over time, it doesn't tend to look like that, but it tends to look like this.

And so in practice, it's very hard to know when to stop. It's kind of still a human judgment thing. Oh yeah, that's definitely true. And particularly with a type of architecture called ResNet that we'll look at next week, the authors showed that it tends to go like this. So in practice, you have to run your training for as long as you have patience for, at whatever the best learning rate you can come up with is.

So something I actually came up with 6 or 12 months ago, and which I got re-interested in after I read this Adam paper, is something which dynamically updates learning rates in such a way that they only go down. And rather than using the loss function, which as I just said is incredibly bumpy, it uses something else which is less bumpy, which is the average sum of squared gradients.

So I actually created a little spreadsheet of my idea, and I hope to prototype it in Python maybe this week or the week after; we'll see how it goes. And the idea is basically this: keep track of the sum of the squares of the derivatives, compare the sum of the squares of the derivatives from the last epoch to the sum of the squares of the derivatives of this epoch, and look at the ratio of the two.

The derivatives should keep going down. If they ever go up by too much, that would strongly suggest that you've kind of jumped out of the good part of the function. So anytime they go up too much, you should decrease the learning rate. So I literally added two lines of code to my incredibly simple VBA, Adam with annealing here.

If the gradient ratio is greater than 2, so if it doubles, divide the learning rate by 4. Here's what happens when I run that. That's 5 steps, another 5 steps. You can see it's automatically changing it, so I don't have to do anything, I just keep running it. So I'm pretty interested in this idea, I think it's going to work super well because it allows me to focus on just running stuff without ever worrying about setting learning rates.
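In Python the same rule might look like this sketch, where `run_one_epoch` is a hypothetical helper that trains for one epoch at the given learning rate and returns that epoch's sum of squared gradients:

```python
lr = 0.1
prev_ssg = None

for epoch in range(20):
    ssg = run_one_epoch(lr)     # hypothetical: train one epoch, return the sum of squared gradients
    if prev_ssg is not None and ssg > 2 * prev_ssg:
        lr /= 4                 # gradients jumped back up: anneal the learning rate
    prev_ssg = ssg
```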

So I'm hopeful that this approach to automatic learning rate annealing is something that we can have in our toolbox by the end of this course. One thing that happened to me today is I tried a lot of different learning rates and I didn't get anywhere, but I was working with the whole dataset. If I try with the sample and I find something, would that apply to the whole dataset, and how do I go about investigating this?

I'll hold that thought for 5 seconds. Was there another question at the back before we answer that one? So here is the answer to that question. The question was, "It takes a long time to figure out the optimal learning rate. Can we calculate it using just a sample?" And to answer that question, I'm going to show you how I entered State Farm.

Indeed, when I started entering State Farm, I started by using a sample. And so step 1 was to think, "What insights can we gain from using a sample which will still apply when we move to the whole dataset?" Because running stuff on a sample took 10 or 20 seconds, and running stuff on the full dataset took 2 to 10 minutes per epoch.

So after I created my sample, which I just created randomly, I first of all wanted to find out what does it take to create a better-than-random model here. So I always start with the simplest possible model. And so the simplest possible model has a single dense layer. Now here's a handy trick.

Rather than worrying about calculating the average and the standard deviation of the input and subtracting it all out in order to normalize your input layer, you can just start with a batch-norm layer. And so if you start with a batch-norm layer, it's going to do that for you. So anytime you create a Keras model from scratch, I would recommend making your first layer a batch-norm layer.

So this is going to normalize the data for me. So that's a cool little trick which I haven't actually seen anybody use elsewhere, but I think it's a good default starting point all the time. If I'm going to use a dense layer, then obviously I have to flatten everything into a single vector first.
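In code, the simplest possible model looks roughly like this; it's a sketch in the style of the Keras 1.x API used in the course, so exact import paths and argument names may differ with your Keras version.

```python
from keras.models import Sequential
from keras.layers.core import Flatten, Dense
from keras.layers.normalization import BatchNormalization

model = Sequential([
    # BatchNormalization as the very first layer normalizes the inputs for us
    BatchNormalization(axis=1, input_shape=(3, 224, 224)),  # channels-first images
    Flatten(),                                              # one long vector per image
    Dense(10, activation='softmax'),                        # 10 State Farm classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```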

So this is really a most minimal model. So I tried it: compiled it, fit it, and nothing happened. Not only did nothing happen to my validation, but really nothing happened with my training either. It's only taking 7 seconds per epoch to find this out, so that's okay. So what might be going on?

So I look at model.summary, and I see that there's 1.5 million parameters. And that makes me think, okay, it's probably not underfitting. It's pretty unlikely that with 1.5 million parameters there's really nothing useful it can do whatsoever. It's only a linear model, true, but I still think it should be able to do something.

So that makes me think that what must be going on is it must be doing that thing where it jumps too far. And it's particularly easy to jump too far at the very start of training, and let me explain why. It turns out that there are often reasonably good answers that are way too easy to find.

So one reasonably good answer would be always predict 0. Because there are 10 output classes in the State Farm competition, there's one of 10 different types of distracted driving, and you are scored based on the cross-entropy loss. And what that's looking at is how accurate each of your 10 predictions is.

So rather than trying to predict something well, what if we just always predict 0.01? Nine times out of 10, you're going to be right. Because 9 out of the 10 categories, it's not that. It's only one of the 10 categories. So actually always predicting 0.01 would be pretty good.

Now it turns out it's not possible to do that, because we have a softmax layer. And a softmax layer, remember, is e^(x_i) divided by the sum of e^(x_j). And so in a softmax layer, everything has to add to 1. So therefore, if it makes one of the classes really high and all of the other ones really low, then 9 times out of 10 it is going to be right.

So in other words, it's a pretty good answer for it to always predict some random class, class 8, with close to 100% certainty. And that's what happened. So anybody who tried this, and I saw a lot of people on the forums this week saying, "I tried to train it and nothing happened": the folks who got the interesting insight were the ones who then went on to say, "And then I looked at my predictions and it kept predicting the same class with great confidence again and again and again." Okay, so that's why it did that.

Our next step then is to try decreasing the learning rate. So here is exactly the same model, but I'm now using a much lower learning rate. And when I run that, it's actually moving. So it's only 12 seconds of compute time to figure out that I'm going to have to start with a low learning rate.

Once we've got to a point where the accuracy is reasonably better than random, we're well away from that part of the loss function that says always predict everything as the same class, and therefore we can now increase the learning rate back up again. So generally speaking, for these harder problems, you'll need to start with an epoch or two at a low learning rate, and then you can increase it back up again.

So you can see now I can put it back up to 0.01 and very quickly increase my accuracy. So you can see here my accuracy on my validation set is 0.5 using a linear model, and this is a good starting point because it says to me anytime that my validation accuracy is worse than about 0.5, this is really no better than even a linear model, so this is not worth spending more time on.

One obvious question would be, how do you decide how big a sample to use? And what I did was I tried a few different sizes of sample for my validation set, and I then said, okay, evaluate the model, in other words, calculate the loss function, on the validation set, but for a whole bunch of randomly sampled batches, so do it 10 times.

And so then I looked and I saw how the accuracy changed. With the validation set at 1000 images, my accuracy changed from 0.48 or 0.47 to 0.51, so it's not changing too much. It's small enough that I think I can make useful insights using a sample size of this size.

So what else can we learn from a sample? One is, are there other architectures that work well? So the obvious thing to do with a computer vision problem is to try a convolutional neural network. And here's one of the most simple convolutional neural networks, two convolutional layers, each one with a max pooling layer.

And then one dense layer followed by my dense output layer. So again I tried that and found that it very quickly got to an accuracy of 100% on the training set, but only 24% on the validation set. And that's because I was very careful to make sure my validation set included different drivers to my training set, because on Kaggle it told us that the test set has different drivers.

So it's much harder to recognize what a driver is doing if we've never seen that driver before. So I could see that convolutional neural networks clearly are a great way to model this kind of data, but I've got to have to think very carefully about overfitting. So step 1 to avoiding overfitting is data augmentation, as we learned in our data augmentation class.

So here's the exact same model. And I tried every type of data augmentation. So I tried shifting it left and right a bit. I tried shifting it up and down a bit. I tried shearing it a bit. I tried rotating it a bit. I tried shifting the channels, so the colors a bit.

And for each of those, I tried four different levels. And I found in each case what was the best. And then I combined them all together. So here are my best data augmentation amounts. So on 1560 images, so a very small set, this is just my sample, I then ran my very simple two convolutional layer model with this data augmentation at these optimized parameters.
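The search itself can be scripted roughly like this sketch with Keras's ImageDataGenerator; the directory name and the specific levels shown are placeholders, not the values found in the lecture.

```python
from keras.preprocessing.image import ImageDataGenerator

# One augmentation type at a time, a few levels each: train the small convnet on the
# sample with each generator and keep the level with the best validation accuracy.
for shift in [0.0, 0.05, 0.1, 0.2]:
    gen = ImageDataGenerator(width_shift_range=shift)
    batches = gen.flow_from_directory('sample/train', target_size=(224, 224), batch_size=64)
    # model.fit_generator(batches, ...)   # compare validation accuracy across the levels

# Then combine the winning level of each type into a single generator.
gen = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.05,
                         shear_range=0.1, rotation_range=15,
                         channel_shift_range=20.0)
```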

And it didn't look very good. After 5 epochs, I only had 0.1 accuracy on my validation set. But I can see that my training set is continuing to improve. And so that makes me think, okay, don't give up yet: try reducing the learning rate and do a few more.

And lo and behold, it started improving. So this is where you've got to be careful not to jump to conclusions too soon. So I ran a few more, and it's improving well. So I ran a few more, another 25. And look at what happened. It kept getting better and better and better until we were getting 67% accuracy.

So this 1.15 validation loss is well within the top 50% in this competition. So using an incredibly simple model, on just a sample, we can get in the top half of this Kaggle competition simply by using the right kind of data augmentation. So I think this is a really interesting insight about the power of this incredibly useful tool.

Okay, let's have a five minute break, and we'll do your question first. It's unlikely that there's going to be a class imbalance in my sample unless there was an equivalent class imbalance in the real data, because I've got a thousand examples. And so statistically speaking, that's unlikely. If there was a class imbalance in my original data, then I want my sample to have that class imbalance too.

So at this point, I felt pretty good that I knew that we should be using a convolutional neural network, which is obviously a very strong hypothesis to start with anyway. And I also felt pretty confident that I knew what kind of learning rate to start with, and then how to change it, and also what data augmentation to do.

The next thing I wanted to wonder about was how else do I handle overfitting, because although I'm getting some pretty good results, I'm still overfitting hugely, 0.6 versus 0.9. So the next thing in our list of ways to avoid overfitting, and I hope you guys all remember that we have that list in lesson 3.

The five steps, let's go and have a look at it now to remind ourselves. Approaches to reducing overfitting, these are the five steps. We can't add more data, we've tried using data augmentation, we're already using batch norm and convnets, so the next step is to add regularization. And dropout is our favored regularization technique.

So I was thinking, okay, before we do that, I'll just mention one more thing about this data augmentation approach. I have literally never seen anybody write down a process as to how to figure out what kind of data augmentation to use and the amount. The only posts I've seen on it always rely on intuition, which is basically like, look at the images and think about how much they seem like they should be able to move around or rotate.

I really tried this week to come up with a rigorous, repeatable process that you could use. And that process is go through each data augmentation type one at a time, try 3 or 4 different levels of it on a sample with a big enough validation set that it's pretty stable to find the best value of each of the data augmentation parameters, and then try combining them all together.

So I hope you kind of come away with this as a practical message which probably your colleagues, even if some of them claim to be deep learning experts, I doubt that they're doing this. So this is something you can hopefully get people into the practice of doing. Regularization however, we cannot do on a sample.

And the reason why is that step 1, add more data, well that step is very correlated with add regularization. As we add more data, we need less regularization. So as we move from a sample to the full dataset, we're going to need less regularization. So to figure out how much regularization to use, we have to use the whole dataset.

So at this point I changed it to use the whole dataset, not the sample, and I started using Dropout. So you can see that I started with my data augmentation amounts that you've already seen, and I started adding in some Dropout. And ran it for a few epochs to see what would happen.

And you can see it's worked pretty well. So we're getting up into the 75% now, and before we were in the 64%. So once we add clipping, which is very important for getting the best cross-entropy loss function, I haven't checked where that would get us on the Kaggle leaderboard, but I'm pretty sure it would be at least in the top third based on this accuracy.

So I ran a few more epochs with an even lower learning rate and got 0.78, 0.79. So this is going to be well up into the top third of the leaderboard. So I got to this point by just trying out a couple of different levels of Dropout, and I just put them in my dense layers.

There's no rule of thumb here. A lot of people put small amounts of Dropout in their convolutional layers as well. All I can say is to try things. But what VGG does is to put 50% Dropout after each of its dense layers, and that doesn't seem like a bad rule of thumb, so that's what I was doing here.

And then trying around a few different sizes of dense layers to try and find something reasonable. I didn't spend a heap of time on this, so there's probably better architectures, but as you can see this is still a pretty good one. So that was my step 2. Now so far we have not used a pre-trained network at all.

So this is getting into the top third of the leaderboard without even using any ImageNet features. So that's pretty damn cool. But we're pretty sure that ImageNet features would be helpful. So that was the next step, was to use ImageNet features, so VGG features. Specifically, I was reasonably confident that all of the convolutional layers of VGG are probably pretty much good enough.

I didn't expect I would have to fine-tune them much, if at all, because the convolutional layers are the things which really look at the shape and structure of things rather than how they fit together. And these are photos of the real world, just like ImageNet are photos of the real world.

So I really felt like most of the time, if not all of it, was likely to be spent on the dense layers. So therefore, because calculating the convolutional layers takes nearly all the time, because that's where all the computation is, I pre-computed the output of the convolutional layers. And we've done this before, you might remember.

When we looked at dropout, we did exactly this. We figured out what was the last convolutional layer's ID. We grabbed all of the layers up to that ID, we built a model out of them, and then we calculated the output of that model. And that told us the value of those features, those activations from VGG's last convolutional layer.

So I did exactly the same thing. I basically copied and pasted that code. So I said okay, grab VGG 16, find the last convolutional layer, build a model that contains everything up to and including that layer, predict the output of that model. So predicting the output means calculate the activations of that last convolutional layer.

And since that takes some time, then save that so I never have to do it again. So then in the future I can just load that array. So this array, I'm not going to calculate those, I'm simply going to load them. And so have a think about what would you expect the shape of this to be.

And you can figure out what you would expect the shape to be by looking at model.summary and finding the last convolutional layer. Here it is. And we can see it is 512 filters by 14x14. So let's have a look, just one moment. We'll find our conv_val_feat.shape: 512x14x14, as expected.
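The precomputation looks roughly like this; it is a sketch in the spirit of the course's notebooks, where `vgg`, `batches`, `val_batches` and `save_array` are assumed to come from the earlier lessons' code, and the Keras 1.x signatures may differ from yours.

```python
from keras.models import Sequential
from keras.layers.convolutional import Convolution2D

# Find the last convolutional layer of VGG16 and build a model up to and including it.
layers = vgg.model.layers
last_conv_idx = max(i for i, layer in enumerate(layers) if type(layer) is Convolution2D)
conv_model = Sequential(layers[:last_conv_idx + 1])

# Precompute the activations of that layer for the training and validation sets,
# then save them so we never have to do it again.
conv_feat = conv_model.predict_generator(batches, batches.nb_sample)
conv_val_feat = conv_model.predict_generator(val_batches, val_batches.nb_sample)
save_array('conv_val_feat.dat', conv_val_feat)
# conv_val_feat.shape -> (number of validation images, 512, 14, 14)
```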

Is there a reason you chose to leave out the max_pooling and flatten layers? So why did I leave out the max_pooling and flatten layers? Probably because they take zero time to calculate, and the max_pooling layer loses information. Given that I might want to play around with other types of pooling or other types of convolutions or whatever, I thought I'd pre-calculate up to this layer, since it's the last one that takes a lot of computation time.

Having said that, the first thing I did with it in my new model was to max-pool it and flatten it. So now that I have the output of VGG's last conv layer, I can build a model that has dense layers on top of that. And so the input to this model will be the output of those conv layers.

And the nice thing is it won't take long to run this, even on the whole dataset, because the dense layers don't take much computation time. So here's my model, and by making p a parameter, I could try a wide range of dropout amounts, and I fit it, and one epoch takes 5 seconds on the entire dataset.
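As a sketch, that dense-layers-on-conv-features model might look roughly like this. The layer sizes and the particular value of p are assumptions; trn_labels and val_labels are assumed to be one-hot labels in the same (unshuffled) order as the pre-computed features.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import MaxPooling2D, Flatten, Dense, Dropout
    from keras.layers.normalization import BatchNormalization

    conv_feat = np.load('conv_feat.npy')
    conv_val_feat = np.load('conv_val_feat.npy')

    def get_bn_model(p):
        return Sequential([
            MaxPooling2D(input_shape=conv_val_feat.shape[1:]),  # (512, 14, 14) -> (512, 7, 7)
            Flatten(),
            Dense(200, activation='relu'),
            BatchNormalization(),
            Dropout(p),
            Dense(200, activation='relu'),
            BatchNormalization(),
            Dropout(p),
            Dense(10, activation='softmax')
        ])

    p = 0.8   # dropout amount to experiment with
    bn_model = get_bn_model(p)
    bn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    bn_model.fit(conv_feat, trn_labels, batch_size=64, nb_epoch=3,
                 validation_data=(conv_val_feat, val_labels))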

So this is a super good way to play around. And you can see 1 epoch gets me 0.65, 3 epochs get me 0.75. So this is pretty cool. I have something that in 15 seconds can get me 0.75 accuracy. And notice here, I'm not using any data augmentation. Why aren't I using data augmentation?

Because you can't pre-compute the output of convolutional layers if you're using data augmentation. Because with data augmentation, your convolutional layers give you a different output every time. So that's just a bit of a bummer. You can't use data augmentation if you are pre-computing the output of a layer. Because think about it, every time it sees the same cat photo, it's rotating it by a different amount, or moving it by a different amount.

So it gives a different output of the convolutional layer, so you can't pre-compute it. There is something you can do, which I've played with a little bit, which is you could pre-compute something that's 10 times bigger than your dataset, consisting of 10 different data-augmented versions of it, which is why I actually had this -- where is it?

Which is what I was doing here when I brought in this data generator with augmentations, and I created something called data-augmented convolutional features, in which I predicted 5 times the amount of data, or calculated 5 times the amount of data. And so that basically gave me a dataset 5 times bigger, and that actually worked pretty well.
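Roughly, that "pre-compute several augmented copies" compromise looks something like the sketch below. The generator parameters and paths are assumptions, and conv_model / conv_feat / trn_labels follow the earlier sketches.

    # an augmenting generator; shuffle=False so each pass comes out in the same order as the labels
    gen = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                             height_shift_range=0.1, shear_range=0.1, zoom_range=0.1)
    da_batches = gen.flow_from_directory('train', target_size=(224, 224),
                                         batch_size=64, shuffle=False)

    # 5 passes over the training set, each augmented differently
    da_conv_feat = conv_model.predict_generator(da_batches, da_batches.nb_sample * 5)
    da_trn_labels = np.concatenate([trn_labels] * 5)

    # combine with the un-augmented features and train the dense model on the lot
    all_conv_feat = np.concatenate([da_conv_feat, conv_feat])
    all_labels = np.concatenate([da_trn_labels, trn_labels])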

It's not as good as having a whole new sample every time, but it's kind of a compromise. So once I played around with these dense layers, I then did some more fine-tuning and found out that -- so if I went basically here, I then tried saying, okay, let's go through all of my layers in my model from 16 onwards and set them to trainable and see what happens.

So I tried retraining, fine-tuning some of the convolutional layers as well. It basically didn't help. So I experimented with my hypothesis, and I found it was correct, which is it seems that for this particular model, coming up with the right set of dense layers is what it's all about.
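The fine-tuning experiment itself is only a couple of lines; a sketch, where full_model stands for a model containing the VGG conv layers plus the new dense layers (an assumption about the setup), and the learning rate is illustrative:

    from keras.optimizers import Adam

    for layer in full_model.layers[16:]:
        layer.trainable = True
    full_model.compile(optimizer=Adam(1e-5), loss='categorical_crossentropy',
                       metrics=['accuracy'])
    # ...then fit as before; in this case it basically didn't help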

Yes, Rachel? Question. If we want rotational invariance, should we keep the max pooling, or can another layer do it as well? Max pooling doesn't really have anything to do with rotational invariance. Max pooling does translation invariance. So I'm going to show you one more cool trick. I'm going to show you a little bit of State Farm every week from now on because there's so many cool things to try, and I want to keep reviewing CNNs because convolutional neural nets really are becoming what deep learning is all about.

I'm going to show you one really cool trick. It's actually a combination of two tricks. The two tricks are called pseudo-labeling and knowledge distillation. So if you Google for pseudo-labeling semi-supervised learning, you can see the original paper that came out with pseudo-labeling. I guess that's 2013. And then knowledge distillation.

This is a Geoffrey Hinton paper, Distilling the Knowledge in a Neural Network. This is from 2015. So these are a couple of really cool techniques, from Hinton and Jeff Dean no less, and we're going to combine them together. And they're kind of crazy. What we're going to do is we are going to use the test set to give us more information.

Because in State Farm, the test set has 80,000 images in it, and the training set has 20,000 images in it. What could we do with those 80,000 images which we don't have labels for? It seems a shame to waste them. It seems like we should be able to do something with them, and there's a great little picture here.

Imagine we only had two points, and we knew their labels, white and black. And then somebody said, "How would you label this?" And then they told you that there's a whole lot of other unlabeled data. Notice this is all gray, it's not labeled. But it's helped us, hasn't it?

It's helped us because it's told us how the data is structured. This is what semi-supervised learning is all about. It's all about using the unlabeled data to try and understand something about the structure of it and use that to help you, just like in this picture. Pseudo-labeling and knowledge distillation are a way to do this.

And what we do is -- and I'm not going to do it on the test set, I'm going to do it on the validation set because it's a little bit easier to see the impact of it, and maybe next week we'll look at the test set to see, because that's going to be much cooler when you do it on the test set.

It's this simple. What we do is we take our model, some model we've already built, and we predict the outputs from that model for our unlabeled set. In this case, I'm using the validation set, as if it was unlabeled. So I'm ignoring labels. And those things we call the pseudo-labels.

So now that we have predictions for the test set or the validation set, it's not that they're true, but we can pretend they're true. We can say there's some label, they're not correct labels, but they're labels nonetheless. So what we then do is we take our training labels and we concatenate them with our validation or test set pseudo-labels.

And so we now have a bunch of labels for all of our data. And so we can now also concatenate our convolutional features with the convolutional features of the validation set or test set. And we now use these to train a model. So the model we use is exactly the same model we had before, and we train it in exactly the same way as before.
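In code, the pseudo-labelling step is only a few lines. A sketch, reusing the names from the earlier sketches (bn_model, conv_feat, conv_val_feat, trn_labels, val_labels):

    # predict "labels" for the set we're treating as unlabelled (here, the validation set)
    val_pseudo = bn_model.predict(conv_val_feat, batch_size=64)

    # concatenate features and labels, then retrain the same model the same way
    comb_feat = np.concatenate([conv_feat, conv_val_feat])
    comb_labels = np.concatenate([trn_labels, val_pseudo])

    bn_model.fit(comb_feat, comb_labels, batch_size=64, nb_epoch=3,
                 validation_data=(conv_val_feat, val_labels))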

And our accuracy goes up from 0.75 to 0.82. So our error has dropped by something like 25%. And the reason why is just that we used this additional unlabeled data to figure out its structure. Question about model choice. How do you learn how to design a model and when to stop messing with them?

It seems like you've taken a few initial ideas, tweaked them to get higher accuracy, but unless your initial guesses are amazing, there should be plenty of architectures that would also work. So if and when you figure out how to find an architecture and stop messing with it, please tell me, because I don't sleep.

We all want to know this. I look back at these models I'm showing you and I'm thinking, I bet there's something twice as good. I don't know what it is. There are all kinds of ways of optimizing other hyperparameters of deep learning. For example, there's something called spearmint, which is a Bayesian optimization hyperparameter tuning thing.

In fact, just last week a new paper came out for hyperparameter tuning, but this is all about tuning things like the learning rate and so on. As for coming up with architectures, some people have tried to come up with more general architectures, and we're going to look at one next week called ResNets, which seem pretty encouraging in that direction. But even then, ResNet is an architecture which won ImageNet in 2015.

The author of ResNet, Kaiming He from Microsoft, said, "The reason ResNet is so great is it lets us build very, very, very deep networks." Indeed he showed a network with over a thousand layers, and it was totally state-of-the-art. Somebody else came along a few months ago and built wide ResNets with like 50 layers, and easily beat Kaiming He's best results.

So the very author of the ImageNet winner completely got wrong the reason why his invention was good. The idea that any of us have any idea how to create optimal architectures is totally, totally wrong. We don't. So that's why I'm trying to show you what we know so far, which is like the processes you can use to build them without waiting forever.

So in this case, doing your data augmentation on the small sample in a rigorous way, figuring out that probably the dense layers are where the action is, and pre-computing the input to them. These are the kinds of things that can keep you sane. I'm showing you the outcome of my last week or so of playing with this.

I can tell you that during this time I continually fell into the trap of running stuff on the whole network and all the way through and fiddling around with hyperparameters. And I have to stop myself and have a cup of tea and say, "Okay, is this really a good idea?

Is this really a good use of time?" So we all do it, but not you anymore, because you've been to this class. Green box, back there. Can you run us through this one more time? I'm just a little confused, because it feels like maybe we're using our validation set as part of our training process, and I'm confused about how that's not the case.

But look, we're not using the validation labels, nowhere here does it say "val_labels". So yeah, we are absolutely using our validation set but we're using the validation set's inputs. And for our test set we have the inputs. So next week I will show you this page again, and this time I'm going to use the test set.

I just didn't have enough time to do it this time around. And hopefully we're going to see some great results, and when we do it on the test set then you'll be really convinced that it's not using the labels because we don't have any labels. But you can see here, all it's doing is it's creating pseudo-labels by calculating what it thinks it ought to be based on the model that we just built with that 75% accuracy.

And so then it's able to use the input data for the validation set in an intelligent way and therefore improve the accuracy. What do you mean the same? Yeah, it's using bn_model, and bn_model is the thing that we just fitted. By using the training labels, so this is bn_model, the thing with this 0.755 accuracy.

So if we were to look at - I know we haven't gone through this - can you move a bit closer to the mic? Sure. And this is supervised and unsupervised learning? And in this case semi-supervised learning. Semi-supervised learning. Right, and semi-supervised works because you're giving it a model which already knows about a bunch of labels but unsupervised wouldn't know.

Unsupervised has nothing, that's right. I wasn't particularly thinking about doing this, but unsupervised learning is where you're trying to build a model when you have no labels at all. How many people here would be interested in hearing about unsupervised learning during this class? Okay, enough people, I should do that, I will add it.

During the week, perhaps we can create a forum thread about unsupervised learning and I can learn about what you're interested in doing with it because many things that people think of as unsupervised problems actually aren't. Okay, so pseudo-labeling is insane and awesome, and we need the green box back.

Okay, and there are a number of questions. Earlier you talked about learning about the structure of the data that you can learn from the validation set, can you say more about that? I don't know, not really. Other than that picture I showed you before with the two little spirally things.

And that picture was kind of showing how they clustered, in a way that you couldn't see when you just had the two labeled points. So think about that Matt Zeiler paper we saw, or the Jason Yosinski visualization toolbox we saw. The layers learn shapes and textures and concepts.

In those 80,000 test images of people driving in different distracted ways, there are lots of concepts to learn about the ways in which people drive distracted, even though they're not labeled. So what we're doing is trying to learn better convolutional or dense features; that's what I mean by learning more.

So the structure of the data here is basically like what do these pictures tend to look like. More importantly, in what ways do they differ? Because it's the ways that they differ that therefore must be related to how they're labeled. Can you use your updated model to make new labels for the validation?

Yes, you can absolutely do pseudo-labeling on pseudo-labeling, and you should. And if I don't get sick of running this code, I will try it next week. Could that introduce bias towards your validation set? No because we don't have any validation labels. One of the tricky parameters in pseudo-labeling is in each batch, how much do I make it a mix of training versus pseudo?

One of the big things that stopped me from getting the test set in this week is that Keras doesn't have a way of creating batches which have like 80% of this set and 20% of that set, which is really what I want -- because if I just pseudo-labeled the whole test set and then concatenated it, then 80% of my batches are going to be pseudo-labels.

And generally speaking, the rule of thumb I've read is that somewhere around a quarter to a third of your mini-batches should be pseudo-labels. So I need to write some code basically to get Keras to generate batches which are a mix from two different places before I can do this properly.
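One way such a mixed-batch generator might look, as a hypothetical sketch (Keras itself doesn't provide this; the function name and ratio are made up):

    import numpy as np

    def mix_iter(x1, y1, x2, y2, batch_size=64, frac2=0.25):
        """Yield batches where roughly `frac2` of each batch comes from (x2, y2),
        e.g. pseudo-labelled test data, and the rest from (x1, y1)."""
        n2 = int(batch_size * frac2)
        n1 = batch_size - n2
        while True:
            i1 = np.random.randint(0, len(x1), n1)
            i2 = np.random.randint(0, len(x2), n2)
            yield (np.concatenate([x1[i1], x2[i2]]),
                   np.concatenate([y1[i1], y2[i2]]))

    # usage with the earlier sketches, roughly:
    # bn_model.fit_generator(mix_iter(conv_feat, trn_labels, conv_test_feat, test_pseudo),
    #                        samples_per_epoch=len(conv_feat), nb_epoch=3)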

There are two questions and I think you're asking the same thing. Are your pseudo-labels only as good as the initial model you're beginning from, so do you need to have kind of a particular accuracy in your model? Yeah, your pseudo-labels are indeed as good as your model you're starting from.

People have not studied this enough to know how sensitive it is to those initial labels. No, this is too new, you know, and just try it. My guess is that pseudo-labels will be useful regardless of what accuracy level you're at because it will make it better. As long as you are in a semi-supervised learning context, i.e.

you have a lot of unlabeled data that you want to take advantage of. I really want to move on because I told you I wanted to get us down the path to NLP this week. And it turns out that the path to NLP, strange as it sounds, starts with collaborative filtering.

You will learn why next week. This week we are going to learn about collaborative filtering. And so collaborative filtering is a way of doing recommender systems. And I sent you guys an email today with a link to more information about collaborative filtering and recommender systems, so please read those links if you haven't already just to get a sense of what the problem we're solving here is.

In short, what we're trying to do is learn to predict who is going to like what, and how much. For example, the $1 million Netflix prize: what rating will this person give this movie? If you're writing Amazon's recommender system to figure out what to show you on their homepage, which products is this person likely to rate highly?

If you're trying to figure out what to show in a news feed, which articles is this person likely to enjoy reading? There are a lot of different ways of doing this, but broadly speaking there are two main classifications of recommender system. One is based on metadata, which is, for example, this guy filled out a survey in which he said he liked action movies and sci-fi.

And we also have taken all of our movies and put them into genres, and here are all of our action sci-fi movies, so we'll use them. Broadly speaking, that would be a metadata-based approach. A collaborative filtering-based approach is very different. It says, "Let's find other people like you and find out what they liked and assume that you will like the same stuff." And specifically when we say people like you, we mean people who rated the same movies you've watched in a similar way, and that's called collaborative filtering.

It turns out that in a large enough dataset, collaborative filtering is so much better than the metadata-based approaches that adding metadata doesn't even improve it at all. So when people in the Netflix prize actually went out to IMDB and sucked in additional data and tried to use that to make it better, at a certain point it didn't help.

Once their collaborative filtering models were good enough, it didn't help. And that's because it's something I learned about 20 years ago when I used to do a lot of surveys in consulting. It turns out that asking people about their behavior is crap compared to actually looking at people's behavior.

So let me show you what collaborative filtering looks like. What we're going to do is use a dataset called MovieLens. So you guys hopefully will be able to play around with this this week. Unfortunately Rachel and I could not find any Kaggle competitions that were about recommender systems and where the competitions were still open for entries.

However, there is something called MovieLens, which is a widely studied dataset in academia. Perhaps surprisingly, approaching or beating an academic state of the art is way easier than winning a Kaggle competition, because in Kaggle competitions lots and lots of people look at that data, try lots and lots of things, and use a really pragmatic approach, whereas academic state-of-the-art results are just done by academics.

So with that said, the MovieLens benchmarks are going to be much easier to beat than any Kaggle competition, but it's still interesting. You can download the MovieLens dataset from the MovieLens website, and you'll see that there's one there recommended for new research with 20 million ratings in it. Also conveniently, they have a small one with only 100,000 ratings.

So you don't have to build a sample, they have already built a sample for you. So I am of course going to use a sample. So what I do is I read in ratings.csv. And as you'll see here, I've started using pandas; pd is short for pandas. How many people here have tried pandas?

Awesome. So those of you that don't, hopefully the peer group pressure is kicking in. So pandas is a great way of dealing with structured data and you should use it. Reading a CSV file is this easy, showing the first few items is this easy, finding out how big it is, finding out how many users and movies there are, are all this easy.

I wanted to play with this in Excel, because that's the only way I know how to teach. What I did was I grouped the ratings by user ID, grabbed the 15 busiest movie-watching users, then grabbed the 15 most-watched movies, and then I created a cross-tab of the two.
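In pandas, that kind of cross-tab is only a few lines. A sketch, assuming the standard ratings.csv column names (userId, movieId, rating) and the small MovieLens download path:

    import numpy as np
    import pandas as pd

    ratings = pd.read_csv('ml-latest-small/ratings.csv')
    ratings.head()

    n_users = ratings.userId.nunique()
    n_movies = ratings.movieId.nunique()

    # 15 busiest users and 15 most-watched movies, then a cross-tab of their ratings
    top_users = ratings.groupby('userId').rating.count().nlargest(15).index
    top_movies = ratings.groupby('movieId').rating.count().nlargest(15).index
    subset = ratings[ratings.userId.isin(top_users) & ratings.movieId.isin(top_movies)]
    pd.crosstab(subset.userId, subset.movieId, subset.rating, aggfunc=np.mean)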

And then I copied that into Excel. Here is the table I downloaded from MovieLens for the 15 busiest movie-watching users and the 15 most widely watched movies. And here are the ratings. Here's the rating of user 14 for movie 27. Look at these guys. These three users have watched every single one of these movies.

I'm probably one of them, I love movies. And these have been watched by every single one of these users. So user 14 kind of liked movie 27, loved movie 49, hated movie 51. So let's have a look, is there anybody else here? So this guy really liked movie 49, didn't much like movie 57, so they may feel the same way about movie 27 as that user.

That's the basic essence of collaborative filtering. We're going to try and automate it a little bit. And the way we're going to automate it is we're going to say let's pretend for each movie we had like five characteristics, which is like is it sci-fi, is it action, is it dialogue-heavy, is it new, and does it have Bruce Willis.

And then we could have those five things for every user as well: is this user somebody who likes sci-fi, action, dialogue, new movies, and Bruce Willis. And so what we could then do is take the matrix product, or dot product, of that set of user features with that set of movie features.

If this person likes sci-fi and it's sci-fi and they like action and it is action and so forth, then a high number will appear in here for this matrix product of these two vectors, this dot product of these two vectors. And so this would be a cool way to build up a collaborative filtering system if only we could create these five items for every movie and for every user.

Now because we don't actually know what five things are most important for users and what five things are most important for movies, we're instead going to learn them. And the way we learn them is the way we learn everything, which is we start by randomizing them and then we use gradient descent.

So here are five random numbers for every movie, and here are five random numbers for every user, and in the middle is the dot product of that movie with that user. Once we have a good set of movie factors and user factors for each one, then each of these ratings will be similar to each of the observed ratings, and therefore this sum of squared errors will be low.
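Outside Excel, the same idea is a few lines of numpy: random factors, dot products, squared error, gradient steps. This is a toy sketch with arbitrary sizes and learning rate; in the real spreadsheet, missing ratings are simply skipped, whereas here every cell is filled in to keep things tiny.

    import numpy as np

    n_users, n_movies, n_factors = 15, 15, 5
    np.random.seed(0)
    ratings = np.random.randint(1, 6, (n_users, n_movies)).astype(float)  # stand-in for the observed ratings
    U = np.random.randn(n_users, n_factors) * 0.1    # user latent factors
    M = np.random.randn(n_factors, n_movies) * 0.1   # movie latent factors

    lr = 0.01
    for step in range(1000):
        pred = U.dot(M)           # every user/movie dot product at once
        err = pred - ratings      # difference from the observed ratings
        gU = err.dot(M.T)         # gradient of the squared error w.r.t. U (up to a factor of 2)
        gM = U.T.dot(err)         # gradient w.r.t. M
        U -= lr * gU
        M -= lr * gM
        if step % 200 == 0:
            print(step, (err ** 2).sum())   # sum of squared errors keeps dropping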

Currently it is high. So we start with our random numbers, we start with a loss function of 40. So we now want to use gradient descent, and it turns out that every copy of Excel has a gradient descent solver in it. So we're going to go ahead and use it, it's called solver.

And so we have to tell it what thing to minimize, so it's saying minimize this, and which things do we want to change, which is all of our factors, and then we set it to a minimum and we say solve. And then we can see in the bottom left, it is trying to make this better and better and better using gradient descent.

Notice I'm not saying stochastic gradient descent. Stochastic gradient descent means it's doing it one mini-batch at a time. Gradient descent means it's doing the whole data set each time. Excel uses gradient descent, not stochastic gradient descent. They give the same answer. You might also wonder, why is it so slow?

It's so slow because it doesn't know how to create analytical derivatives, so it's having to calculate the derivatives with finite difference, which is slow. So here we've got a solution, it's got it down to 5. That's pretty good. So we can see here that it predicted 5.14 and it was actually 5.

It predicted 3.05 and it was actually 3. So it's done a really, really good job. It's a little bit too easy, because there are 5 factors for each of the 15 users and 5 for each of the 15 movies. We've got nearly as many factors as we have things to calculate, so it's kind of over-specified.

But the idea is there. There's one piece missing. The piece we're missing is that some users probably just like movies more than others, and some movies are probably just more liked than others. And this dot product does not allow us in any way to say this is an enthusiastic user or this is a popular movie.

To do that, we have to add bias terms. So here is exactly the same spreadsheet, but I've added one more row to the movies part and one more column to the users part for our biases. And I've updated the formula so that as well as the matrix multiplication, it also is adding the user bias and the movie bias.

So this is saying this is a very popular movie, and here we are, this is a very enthusiastic user for example. And so now that we have a collaborative filtering plus bias, we can do gradient descent on that. So previously our gradient descent loss function was 5.6. We would expect it to be better with bias because we can really better specify what's going on.

Let's try it. So again we run solver, solve, and we let that zip along, and we see what happens. So these things we're calculating are called latent factors. A latent factor is some factor that is influencing the outcome, but we don't quite know what it is. We're just assuming it's there.

And in fact what happens is when people do collaborative filtering, they then go back and they draw graphs where they say here are the movies that are scored highly on this latent factor and low on this latent factor, and so they'll discover the Bruce Willis factor and the sci-fi factor and so forth.

And so if you look at the Netflix prize visualizations, you'll see these graphs people do. And the way they do them is they literally do this. Not in Excel, because they're not that cool, but they calculate these latent factors and then they draw pictures of them and then they actually write the name of the movie on the graph.

So 4.6, even better. So you can see that, oh, that's interesting. In fact I also have an error here, because any time that my rating is empty, I really want to be setting this to empty as well, which means my parenthesis was in the wrong place. So I'm going to recalculate this with my error fixed up and see if we get a better answer.

They're randomly generated and then optimized with gradient descent. For some reason, this seems crazier than what we were doing with CNNs, because movies I understand, more than features of images that I just don't intuitively understand. So we can look at some pictures next week, but during the week, Google for Netflix prize visualizations and you will see these pictures.

It really does work the way I described. It figures out what are the most interesting dimensions on which we can rate a movie. Things like level of action and sci-fi and dialogue driven are very important features, it turns out. But rather than pre-specifying those features, we have definitely learned from this class that calculating features using gradient descent is going to give us better features than trying to engineer them by hand.

Interesting that it feels crazy. Tell me next week if you find some particularly interesting things, or if it still seems crazy, and we can try to de-crazify it a little bit. So let's do this in Keras. Now there's really only one main new concept we have to learn, which is that we started out with data not in a crosstab form, but in this form.

We have user ID, movie ID, rating triplets, and I crosstab them. So the rows and the columns above the random numbers, are they the variations and the features in the movies and the variations and features in the users? Each of these rows is one feature of a movie, and each of these columns is one feature of a user.

And so one of these sets of 5 is one set of features for a user. This is this user's latent factors. I think it's interesting and crazy because you're basically taking random data and you can generate those features out of people that you don't know in movies that you're not looking at.

Yeah, this is the thing I just did at the start of class, which is there's nothing mathematically complicated about gradient descent. The hard part is unlearning the idea that this should be hard, you know, gradient descent just figures it out. Did you have a question? I have one question behind you.

I just wanted to point out that you can think of this as a smaller, more concise way to represent the movies and the users. In math, there's a concept of matrix factorization, an SVD for example, which is where you basically take a big matrix and turn it into a tall narrow one and a short wide one and multiply the two together.

This is exactly what we're doing. Instead of having how user 14 rated every single movie, we just have 5 numbers that represent it, which is pretty cool. So earlier, did you say that both the user features were random as well as the movie features? Yes. I guess I'm having trouble relating to this: usually we run something like gradient descent on something that has inputs that you know, and here, what do you know?

What we know are the resulting ratings. So could you perhaps come up with the wrong thing, like you flip a feature between a movie and a user, because if you're doing a multiplication, how do you know which value goes where? If one of the numbers was in the wrong spot, our loss function would be less good, and therefore there would be a gradient on that weight saying you should make this weight a little higher or a little lower.

So all the gradient descent is doing is saying okay, for every weight, if we make it a little higher, does it get better or if we make it a little bit lower, does it get better? And then we keep making them a little bit higher and lower until we can't go any better.

And we had to decide how to combine the weights. So this was our architecture, our architecture was let's take a dot product of some assumed user feature and some assumed movie feature and let's add in the second case some assumed bias term. So we had to build an architecture and we built the architecture using common sense, which is to say this seems like a reasonable way of thinking about this.

I'm going to show you a better architecture in a moment. In fact, we're running out of time, so let me jump into the better architecture. I wanted to point out that there is something new we're going to have to learn here, which is: how do you start with a numeric user_id and look up what their 5-element latent factor vector is?

Now remember, when we have user_id's like 1, 2 and 3, one way to specify them is using one hot encoding. So one way to handle this situation would be if this was our user matrix, it was one hot encoded, and then we had a factor matrix containing a whole bunch of random numbers -- one way to do it would be to take a dot product or a matrix product of this and this.

And what that would do would be for this one here, it would basically say let's multiply that by this, it would grab the first column of the matrix. And this here would grab the second column of the matrix. And this here would grab the third column of the matrix.

So one way to do this in Keras would be to represent our user_id's as one hot encodings, and to create a user factor matrix just as a regular matrix like this and then take a matrix product. That's horribly slow because if we have 10,000 users, then this thing is 10,000 wide and that's a really big matrix multiplication when all we're actually doing is saying for user_id number 1, take the first column.

For user_id number 2, take the second column, for user_id number 3, take the third column. And so Keras has something which does this for us, and it's called an embedding layer. An embedding is literally something which takes an integer as an input and looks up and grabs the corresponding column as output.
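A tiny numpy illustration of why the lookup and the one-hot matrix product are the same thing (the sizes here are made up):

    import numpy as np

    n_users, n_factors = 4, 5
    factors = np.random.randn(n_users, n_factors)   # one 5-element factor vector per user

    user_id = 2
    one_hot = np.zeros(n_users)
    one_hot[user_id] = 1

    via_matmul = one_hot.dot(factors)   # the slow way: a big dot product
    via_lookup = factors[user_id]       # what an embedding layer effectively does
    assert np.allclose(via_matmul, via_lookup)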

So it's doing exactly what we're seeing in this spreadsheet. Question 2 - How do you deal with missing values, so if a user has not rated a particular movie? That's no problem, so missing values are just ignored, so if it's missing, I just set the loss to 0. Question 3 - How do you break up the training and test set?

I broke up the training and test set randomly by grabbing random numbers and saying are they greater or less than 0.8 and then split my ratings into two groups based on that. Question 4 - And you're choosing those from the ratings so that you have some ratings from all users and you have some ratings for all movies?

I just grabbed them at random. So here it is, here's our dot product. In Keras, there's one other thing: I'm going to stop using the sequential model in Keras and start using the functional model. I'll talk more about this next week, but you can read about it during the week.

There are two ways of creating models in Keras, the sequential and the functional. They do similar things, but the functional is much more flexible and it's going to be what we're going to need to use now. So this is going to look slightly unfamiliar, but the ideas are the same.

So we create an input layer for a user, and then we say now create an embedding layer for n users, which is 671, and we want to create how many latent factors? I decided not to create 5, but to create 50. And then I create a movie input, and then I create a movie embedding with 50 factors, and then I say take the dot product of those, and that's our model.

So now please compile the model, and now train it, taking the userID and movieID as input, the rating as the target, and run it for 6 epochs, and I get a 1.27 loss. This is with an RMSE loss. Notice that I'm not doing anything else clever, it's just that simple dot product.
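Putting that together, the dot-product model might look roughly like the sketch below, in the Keras 1.x functional style used in this course (newer Keras versions spell these layers differently). It assumes userId and movieId get re-mapped to contiguous 0-based integers, and includes the random 80/20 split mentioned above.

    import numpy as np
    import pandas as pd
    from keras.layers import Input, Embedding, Flatten, merge
    from keras.models import Model
    from keras.optimizers import Adam

    ratings = pd.read_csv('ml-latest-small/ratings.csv')
    # re-map ids to contiguous 0-based integers so they can index an embedding
    ratings.userId = ratings.userId.astype('category').cat.codes
    ratings.movieId = ratings.movieId.astype('category').cat.codes
    n_users = ratings.userId.nunique()
    n_movies = ratings.movieId.nunique()

    # random 80/20 train/validation split
    msk = np.random.rand(len(ratings)) < 0.8
    trn, val = ratings[msk], ratings[~msk]

    n_factors = 50
    user_in = Input(shape=(1,), dtype='int64', name='user_in')
    u = Embedding(n_users, n_factors, input_length=1)(user_in)
    movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
    m = Embedding(n_movies, n_factors, input_length=1)(movie_in)

    x = merge([u, m], mode='dot')   # the simple dot product of the two embeddings
    x = Flatten()(x)
    model = Model([user_in, movie_in], x)
    model.compile(optimizer=Adam(0.001), loss='mse')

    model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=6,
              validation_data=([val.userId, val.movieId], val.rating))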

That gets me to 1.27. Here's how I add the bias, I use exactly the same kind of embedding inputs as before, and I've encapsulated them in a function. So my user and movie embeddings are the same. And then I create bias by simply creating an embedding with just a single output.

And so then my new model is do a dot product, and then add the user bias, and add the movie bias, and try fitting that. And it takes me to a validation loss of 1.1. How is that going? Well, there are lots of sites on the internet where you can find out benchmarks for movie lens, and on the 100,000 dataset, we're generally looking for RMSE of about 0.89.

There are some more here: the best one here is 0.9, here we are, 0.89, and this one is on the 1,000,000 dataset, so let's go to the 100,000 dataset: RMSE, 0.89. So kind of high 0.89s, low 0.9s would be state-of-the-art according to these benchmarks. So we're on the right track, but we're not there yet.
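For reference, the bias version just described might be sketched like this; the helper names (embedding_input, create_bias) are assumptions about how the notebook is organised, and the rest reuses the names from the previous sketch.

    def embedding_input(name, n_in, n_out):
        inp = Input(shape=(1,), dtype='int64', name=name)
        return inp, Embedding(n_in, n_out, input_length=1)(inp)

    def create_bias(inp, n_in):
        # an embedding with a single output: one bias number per user or per movie
        return Flatten()(Embedding(n_in, 1, input_length=1)(inp))

    user_in, u = embedding_input('user_in', n_users, n_factors)
    movie_in, m = embedding_input('movie_in', n_movies, n_factors)
    ub = create_bias(user_in, n_users)
    mb = create_bias(movie_in, n_movies)

    x = merge([u, m], mode='dot')
    x = Flatten()(x)
    x = merge([x, ub], mode='sum')   # add the user bias
    x = merge([x, mb], mode='sum')   # add the movie bias
    model = Model([user_in, movie_in], x)
    model.compile(optimizer=Adam(0.001), loss='mse')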

So let's try something better: let's create a neural net. And a neural net does the same thing. We create a movie embedding and a user embedding, again with 50 factors, and this time we don't take a dot product, we just concatenate the two vectors together, stick one on the end of the other.

And because we now have one big vector, we can create a neural net: create a dense layer, add dropout, create an activation, compile it, and fit it. And after 5 epochs, we get something way better than state-of-the-art; we couldn't find anything better than about 0.89 in the benchmarks. This whole notebook took me like half an hour to write, so I don't claim to be a collaborative filtering expert, but I think it's pretty cool, because those benchmark pages basically come from people who write collaborative filtering software for a living, places that use LensKit.
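And a sketch of that neural-net version, reusing embedding_input from the previous sketch; the dense layer size and the dropout amounts here are assumptions.

    from keras.layers import Dense, Dropout

    user_in, u = embedding_input('user_in', n_users, n_factors)
    movie_in, m = embedding_input('movie_in', n_movies, n_factors)

    x = merge([u, m], mode='concat')   # concatenate instead of dot product
    x = Flatten()(x)
    x = Dropout(0.3)(x)
    x = Dense(70, activation='relu')(x)
    x = Dropout(0.7)(x)
    x = Dense(1)(x)                    # predict the rating directly
    nn = Model([user_in, movie_in], x)
    nn.compile(optimizer=Adam(0.001), loss='mse')
    nn.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=5,
           validation_data=([val.userId, val.movieId], val.rating))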

So LensKit is a piece of software for recommender systems. We have just killed their benchmark, and it took us 10 seconds to train. So I think that's pretty neat. And we're right on time, so we're going to take one last question. So in the neural net, why is it that there are a number of factors so low?

Oh, actually I thought it was an equal, not a comma, never mind, we're good. Alright, so now you can go home. So that was a very, very quick introduction to embeddings, like as per usual in this class, I kind of stick the new stuff in at the end and say go study it.

So your job this week is to keep improving state farm, hopefully win the new fisheries competition. By the way, in the last half hour, I just created this little notebook in which I basically copied the Dogs and Cats Redux competition into something which does the same thing with the fish data, and I quickly submitted a result.

So we currently have one of us in 18th place, yay. So hopefully you can beat that tomorrow. But most importantly, download the MovieLens data and have a play with that, and we'll talk more about embeddings next week. Thank you. (audience applauds)