Lesson 4: Practical Deep Learning for Coders
00:00:00.000 |
I guess I noticed during the week from some of the questions I've been seeing that the 00:00:06.460 |
idea of what a convolution is, is still a little counter-intuitive or surprising to some of you. 00:00:15.860 |
I feel like the only way I know to teach things effectively is by creating a spreadsheet, 00:00:23.420 |
This is the famous number 7 from lesson 0, and I just copied and pasted the numbers into Excel. 00:00:31.700 |
They're not exactly 0, they're actually floats, just rounded off. 00:00:39.060 |
And as you can see, I'm just using conditional coloring, you can see the shape of our little 00:00:47.140 |
So I wanted to show you exactly what a convolution does, and specifically what a convolution 00:00:57.660 |
So we are generally using modern convolutions, and that means a 3x3 convolution. 00:01:04.700 |
So here is a 3x3 convolution, and I have just randomly generated 9 random numbers. 00:01:17.540 |
Here is my second filter, it is 9 more random numbers. 00:01:23.500 |
So this is what we do in Keras when we ask for a convolutional layer. 00:01:30.020 |
We tell it, the first thing we pass it is how many filters do we want, and that's how 00:01:35.540 |
many of these random matrices do we want it to build for us. 00:01:40.880 |
So in this case, it's as if I passed Convolution2D, the first parameter would be 2, and the second would be the 3x3 filter size. 00:01:50.900 |
And what happens to this little random matrix? 00:01:55.140 |
In order to calculate the very first item, it takes the sum of the blue stuff, those 00:02:06.380 |
9, times the red stuff, those 9, all added together. 00:02:13.460 |
So let's go down here into where it gets a bit darker, how does this get calculated? 00:02:17.980 |
This is equal to these 9 times these 9, when I say times, I mean element-wise times, so 00:02:25.100 |
the top left by the top left, the middle by the middle, and so forth, and add them all 00:02:33.940 |
So it's just as you go through, we take the corresponding 3x3 area in the image, and we 00:02:42.420 |
multiply each of those 9 things by each of these 9 things, and then we add those 9 products 00:02:52.900 |
So there's really nothing particularly weird or confusing about it, and I'll make this 00:03:01.820 |
You can see that when I get to the top left corner, I can't move further left and up because 00:03:09.940 |
I've reached the edge, and this is why when you do a 3x3 convolution without zero padding, 00:03:17.020 |
you lose one pixel on each edge because you can't push this 3x3 any further. 00:03:24.580 |
So if we go down to the bottom left, you can see again the same thing, it kind of gets 00:03:31.380 |
So that's why you can see that my result is one row less than my starting point. 00:03:39.220 |
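To make the arithmetic concrete, here is a minimal NumPy sketch (not the lesson's spreadsheet; the array sizes are just illustrative) of what one 3x3 filter does, and why one pixel is lost on each edge when there is no zero padding:

```python
import numpy as np

def conv3x3_valid(img, filt):
    """Slide a 3x3 filter over a 2D image with no zero padding: each output
    cell is the element-wise product of the filter with the corresponding
    3x3 patch, all summed together."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i + 3, j:j + 3] * filt).sum()
    return out

img = np.random.rand(28, 28)   # e.g. a 28x28 MNIST digit
filt = np.random.rand(3, 3)    # 9 random numbers, like the spreadsheet's filter
print(conv3x3_valid(img, filt).shape)   # (26, 26): one pixel lost on each edge
```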
So I did this for two different filters, so here's my second filter, and you can see when 00:03:44.580 |
I calculate this one, it's exactly the same thing, it's these 9 times each of these 9 00:03:59.220 |
So that's how we start with our first, in this case I've created two convolutional filters, 00:04:06.340 |
and this is the output of those two convolutional filters, they're just random at this point. 00:04:11.300 |
So my second layer, now my second layer is no longer enough just to have a 3x3 matrix, 00:04:17.260 |
and I need a 3x3x2 tensor because to calculate my top left of my second convolutional layer, 00:04:29.440 |
I need these 9 by these 9 added together, plus these 9 by these 9 added together. 00:04:41.300 |
Because at this point, my previous layer is no longer just one thing, but it's two things. 00:04:47.380 |
Now indeed, if our original picture was a 3-channel color picture, our very first convolutional 00:04:54.900 |
layer would have had to have been 3x3x3 tensors. 00:05:01.020 |
So all of the convolutional layers from now on are going to be 3x3 by the number of filters in the previous layer. 00:05:13.900 |
So here is my first, I've just drawn it like this, 3x3x2 tensor, and you can see it's taking 00:05:21.580 |
9 from here, 9 from here and adding those two together. 00:05:26.860 |
And so then for my second filter in my second layer, it's exactly the same thing. 00:05:33.140 |
I've created two more random matrices, or one more random 3x3x2 tensor, and here again 00:05:40.360 |
I have those 9 by these 9 sum plus those 9 by those 9 sum, and that gives me that one. 00:05:53.980 |
So that gives me my first two layers of my convolutional neural network. 00:06:02.980 |
Max pooling is slightly more awkward to do in Excel, but that's fine, we can still handle 00:06:11.140 |
So max pooling, because I'm going to do 2x2 max pooling, it's going to halve the resolution in each axis. 00:06:24.120 |
That number is simply the maximum of those 4. 00:06:29.260 |
And then that number is the maximum of those 4, and so forth. 00:06:34.860 |
So with max pooling, we had two filters in the previous layer, so we still have two filters, 00:06:43.060 |
but now our filters have half the resolution in each of the x and y axes. 00:06:51.820 |
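Here is a quick NumPy sketch of that 2x2 max pooling on one filter's output (the array size is illustrative):

```python
import numpy as np

def maxpool2x2(act):
    """2x2 max pooling: each output cell is the max of a non-overlapping
    2x2 block, so the resolution halves in both x and y."""
    h, w = act.shape
    return act[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

act = np.random.rand(26, 26)   # one filter's output from the conv layer
print(maxpool2x2(act).shape)   # (13, 13)
```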
And so then I thought, okay, we've done two convolutional layers, how did you go from 00:06:59.860 |
one matrix to two matrices in the second layer? 00:07:05.300 |
How did I go from one matrix to two matrices, as in how did I go from just this one thing 00:07:14.260 |
So the answer to that is I just created two random 3x3 filters. 00:07:19.180 |
This is my first random 3x3 filter, this is my second random 3x3 filter. 00:07:24.980 |
So each output then was simply equal to each corresponding 9-element section, multiplied 00:07:35.640 |
So because I had two random 3x3 matrices, I ended up with two outputs. 00:07:48.780 |
Alright, so now that we've got our max pooling layer, let's use a dense layer to turn it 00:08:03.140 |
So a dense layer means that every single one of our activations from our max pooling layer 00:08:13.460 |
So these are a whole bunch of random numbers. 00:08:17.540 |
So what I do is I take every one of those random numbers and multiply each one by a 00:08:24.140 |
corresponding input and add them all together. 00:08:33.780 |
So I've got the sum product of this and this. 00:08:36.860 |
In MNIST we would have 10 activations because we need an activation for 0, 1, 2, 3, so forth 00:08:44.620 |
So for MNIST we would need 10 sets of these dense weight matrices so that we could calculate 00:08:56.600 |
If we were only calculating one output, this would be a perfectly reasonable way to do 00:09:03.540 |
So for one output, it's just the sum product of everything from our final layer with a 00:09:10.900 |
weight for everything in that final layer, add it together. 00:09:24.260 |
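A minimal sketch of the dense layer as a plain sum-product, with the 10 weight sets we'd need for MNIST (the sizes here are illustrative, assuming two 13x13 max-pooled filters flattened into one vector):

```python
import numpy as np

flattened = np.random.rand(338)        # 2 filters x 13 x 13, flattened
weights = np.random.rand(338, 10)      # one column of weights per MNIST class
outputs = flattened @ weights          # each output = sum(activation_i * weight_i)
print(outputs.shape)                   # (10,)
```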
Both dense layers and convolutional layers couldn't be easier mathematically. 00:09:33.140 |
I think the surprising thing is when you say rather than using random weights, let's calculate 00:09:41.980 |
the derivative of what happens if we were to change that weight up by a bit or down 00:09:51.100 |
In this case, I haven't actually got as far as calculating a loss function, but we could 00:10:01.980 |
And so we can calculate the derivative of the loss with respect to every single weight 00:10:05.940 |
in the dense layer, and every single weight in all of our filters in that layer, and every 00:10:13.500 |
single weight in all of our filters in this layer. 00:10:18.100 |
And then with all of those derivatives, we can calculate how to optimize all of these 00:10:22.860 |
And the surprising thing is that when we optimize all of these weights, we end up with these 00:10:27.720 |
incredibly powerful models, like those visualizations that we saw. 00:10:32.540 |
So I'm not quite sure where the disconnect between the incredibly simple math and the 00:10:40.260 |
I think it might be that it's so easy, it's hard to believe that's all it is, but I'm 00:10:49.540 |
And so to help you really understand this, I'm going to talk more about SGD. 00:11:00.900 |
So the output activation we generally use is the softmax, so e^(x_i) divided by the sum over j of e^(x_j). 00:11:08.220 |
If it's just binary, that's just the equivalent of having 1/(1 + e^(-x)). 00:11:16.020 |
So softmax in the binary case simplifies into a sigmoid function. 00:11:31.300 |
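A tiny sketch checking that claim numerically (the value of z is arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# With two classes (logits z and 0), the softmax probability of the first
# class is e^z / (e^z + 1) = 1 / (1 + e^(-z)), i.e. a sigmoid.
z = 1.7
print(softmax(np.array([z, 0.0]))[0], sigmoid(z))   # the two values match
```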
We're going to talk about not just SGD, but every variant of SGD, including one invented 00:11:46.900 |
SGD happens for all layers at once, yes we calculate the derivative of all the weights 00:11:55.180 |
And when to have a max pool after convolution versus when not to? 00:11:59.860 |
When to have a max pool after a convolution, who knows. 00:12:06.540 |
This is a very controversial question, and indeed some people now say never use max pool. 00:12:13.180 |
Instead of using max pool when you're doing the convolutions, don't do a convolution over 00:12:20.540 |
every set of 9 pixels, but instead skip a pixel each time. 00:12:33.100 |
Geoffrey Hinton, who is kind of the father of deep learning, has gone as far as saying 00:12:37.980 |
that the extremely great success of max pooling has been the greatest problem deep learning has faced. 00:12:48.780 |
Because to him, it really stops us from going further. 00:12:56.060 |
I don't know if that's true or not, I assume it is because he's Geoffrey Hinton and I'm not. 00:13:05.220 |
For now, we use max pooling every time we're doing fine-tuning because we need to make 00:13:11.500 |
sure that our architecture is identical to the original VGG's authors' architecture and 00:13:16.140 |
so we have to put max pooling wherever they do. 00:13:20.420 |
Why do we want max pooling or downsampling or anything like that? 00:13:23.580 |
Are we just trying to look at bigger features at the input? 00:13:32.660 |
The first is that max pooling helps with translation invariance. 00:13:37.780 |
So it basically says if this feature is here, or here, or here, or here, I don't care. 00:13:47.860 |
Every time we max pool, we end up with a smaller grid, which means that our 3x3 convolutions 00:13:53.560 |
are effectively covering a larger part of the original image, which means that our convolutions 00:14:08.260 |
Is Geoffrey Hinton cool with the idea of skipping a pixel each time? 00:14:23.980 |
You can learn all about the things that he thinks we ought to have but don't yet have. 00:14:45.780 |
He did point out that -- I can't remember what it was, but one of the key pieces of 00:14:52.420 |
deep learning that he invented took like 17 years from conception to working, so he is 00:14:58.660 |
somebody who sticks with these things and makes it work. 00:15:04.900 |
Max pooling is not unique to image processing. 00:15:08.020 |
It's likely to be useful for any kind of convolutional neural network, and a convolutional neural 00:15:12.580 |
network can be used for any kind of data that has some kind of consistent ordering. 00:15:17.580 |
So things like speech, or any kind of audio, or some kind of consistent time series, all 00:15:24.860 |
of these things have some kind of ordering to them and therefore you can use CNN and 00:15:31.760 |
And as we look at NLP, we will be looking more at convolutional neural networks for 00:15:38.740 |
And interestingly, the author of Keras last week, or maybe the week before, made the contention 00:15:46.620 |
that perhaps it will turn out that CNNs are the architecture that will be used for every 00:15:56.060 |
And this was just after one of the leading NLP researchers released a paper basically 00:16:01.020 |
showing a state-of-the-art result in NLP using convolutional neural networks. 00:16:07.540 |
So although we'll start learning about recurrent neural networks next week, I have to be open 00:16:13.580 |
to the possibilities that they'll become redundant by the end of the year, but they're still 00:16:24.220 |
So we looked at the SGD intro notebook, but I think things are a little more clear sometimes in a spreadsheet. 00:16:30.660 |
So here is basically the identical thing that we saw in the SGD notebook in Excel. 00:16:41.980 |
We create 29 random numbers, and then we say okay, let's create something that is equal to 2 times that plus 30. 00:17:07.560 |
So I am trying to create something that can find the parameters of a line. 00:17:13.060 |
Now the important thing, and this is the leap, which requires not thinking too hard lest 00:17:22.060 |
you realize how surprising and amazing this is. 00:17:25.440 |
Everything we learn about how to fit a line is identical to how to fit filters and weights 00:17:34.180 |
And so everything we learn about calculating the slope and the intercept, we will then 00:17:43.220 |
And so the answer to any question which is basically "why?" is "why not?". 00:17:50.020 |
This is a function that takes some inputs and calculates an output, this is a function 00:17:53.780 |
that takes some inputs and calculates an output, so why not. 00:17:58.900 |
The only reason it wouldn't work would be because it was too slow, for example. 00:18:02.740 |
And we know it's not too slow because we tried it and it works pretty well. 00:18:06.500 |
So everything we're about to learn works for any kind of function which kind of has the 00:18:15.620 |
appropriate types of gradients, and we can talk more about that later. 00:18:21.100 |
But neural nets have the appropriate kinds of gradients. 00:18:27.460 |
What do we think the parameters of our function are, in this case the intercept and the slope. 00:18:31.340 |
And with Keras, they will be randomized using the Glorot initialization procedure, which 00:18:37.540 |
draws random numbers scaled by the square root of 6 divided by (n_in plus n_out). 00:18:41.940 |
And I'm just going to say let's assume they're both 1. 00:18:47.540 |
We are going to use very, very small mini-batches here. 00:18:51.000 |
Mini-batches are going to be of size 1, because it's easier to do in Excel and it's easier 00:18:57.660 |
But everything we're going to see would work equally well for a mini-batch of size 4 or 00:19:04.460 |
So here's our first row, our first mini-batch. 00:19:07.500 |
Our input is 14 and our desired output is 58. 00:19:11.500 |
And so our guesses to our parameters are 1 and 1. 00:19:14.900 |
And therefore our predicted y value is equal to 1 plus 1 times 14, which is of course 15. 00:19:27.260 |
Therefore if we're doing root mean squared error, our error squared is prediction minus 00:19:35.140 |
So the next thing we do is we want to calculate the derivative with respect to each of our 00:19:42.140 |
One really easy way to do that is to add a tiny amount to each of the two inputs and 00:19:51.860 |
So let's add 0.01 to our intercept and calculate the line and then calculate the loss squared. 00:20:04.420 |
So this is the error if b is increased by 0.01. 00:20:10.180 |
And then let's calculate the difference between that error and the actual error and then divide 00:20:21.140 |
I'm using dE for the error, dB, I should have probably been dL for the loss, dB. 00:20:26.820 |
The change in loss with respect to b is -85.99. 00:20:33.980 |
So we can add 0.01 to a, and then calculate our line, subtract our actual, take the square, 00:20:43.860 |
and so there is our value of estimated dL/dA, subtract it from the actual loss divided by 00:20:54.140 |
And so there are two estimates of the derivative. 00:20:56.420 |
This approach to estimating the derivative is called finite differencing. 00:20:59.900 |
And any time you calculate a derivative by hand, you should always use finite differencing 00:21:07.620 |
You're not very likely to ever have to do that, however, because all of the libraries 00:21:13.900 |
They do them analytically, not using finite differences. 00:21:17.700 |
And so here are the derivatives calculated analytically, which you can do by going to 00:21:23.420 |
Wolfram Alpha and typing in your formula and getting the derivative back. 00:21:27.180 |
So this is the analytical derivative of the loss with respect to b, and the analytical 00:21:34.340 |
And so you can see that our analytical and our finite difference are very similar for 00:21:43.580 |
So that makes me feel comfortable that we got the calculation correct. 00:21:47.100 |
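Here is a minimal sketch of that finite-differencing check for the line-fitting example above: one data point x=14, y=58, guesses a=b=1, and loss = (a*x + b - y)^2.

```python
def loss(a, b, x, y):
    return (a * x + b - y) ** 2

a, b, x, y, eps = 1.0, 1.0, 14.0, 58.0, 0.01

# finite-difference estimates: nudge each parameter by 0.01 and divide
dLdb_fd = (loss(a, b + eps, x, y) - loss(a, b, x, y)) / eps
dLda_fd = (loss(a + eps, b, x, y) - loss(a, b, x, y)) / eps

# analytical derivatives of the same loss
err = a * x + b - y
dLdb, dLda = 2 * err, 2 * err * x

print(dLdb_fd, dLdb)   # both close to -86
print(dLda_fd, dLda)   # both close to -1204
```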
So all SGD does is it says, okay, this tells us if we change our weights by a little bit, 00:21:58.680 |
We know that increasing our value of b by a bit will decrease the loss function, and 00:22:03.780 |
we know that increasing our value of a by a little bit will decrease the loss function. 00:22:09.540 |
So therefore let's decrease both of them by a little bit. 00:22:12.700 |
And the way we do that is to multiply the derivative times a learning rate, that's the 00:22:17.580 |
value of a little bit, and subtract that from our previous guess. 00:22:23.020 |
So we do that for a, and we do that for b, and here are our new guesses. 00:22:27.900 |
Now we're at 1.12 and 1.01, and so let's copy them over here, 1.12 and 1.01. 00:22:38.460 |
And then we do the same thing, and that gives us a new a and a b. 00:22:44.020 |
And we keep doing that again and again and again until we've gone through the whole dataset, 00:22:49.700 |
at the end of which we have a guess of a of 2.61 and a guess of b of 1.07. 00:22:58.340 |
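A minimal sketch of that whole loop in Python rather than Excel, assuming the same setup: mini-batches of size 1, a "true" line y = 2x + 30, and starting guesses of 1 and 1 (the random x values are made up).

```python
import numpy as np

np.random.seed(1)
x = np.random.rand(29) * 20          # 29 random inputs, as in the spreadsheet
y = 2 * x + 30                       # the "true" line: slope 2, intercept 30

a, b, lr = 1.0, 1.0, 0.001           # initial guesses and the learning rate

for epoch in range(5):
    for xi, yi in zip(x, y):         # each row is a mini-batch of size 1
        err = a * xi + b - yi
        a -= lr * (2 * err * xi)     # analytic dL/da
        b -= lr * (2 * err)          # analytic dL/db
    print(epoch, a, b)               # b crawls toward 30 very slowly
```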
Now in real life, we would be having shuffle=true, which means that these would be randomized. 00:23:05.060 |
So this isn't quite perfect, but apart from that, this is SGD with a mini-batch size of 1. 00:23:12.660 |
So at the end of the epoch, we say this is our new slope, so let's copy 2.61 over here, 00:23:27.940 |
So let's copy 1.06 over here, and so now it starts again. 00:23:37.700 |
So we can keep doing that again and again and again. 00:23:40.700 |
Copy the stuff from the bottom, stick it back at the top, and each one of these is going 00:23:45.360 |
So I recorded a macro with me copying this to the bottom and pasting it at the top, and 00:23:50.820 |
added something that says for i = 1 to 5 around it. 00:23:54.540 |
And so now if I click Run, it will copy and paste it 5 times. 00:24:01.860 |
And so you can see it's gradually getting closer. 00:24:04.060 |
And we know that our goal is that it should be a = 2 and b = 30. 00:24:11.860 |
So we've got as far as a = 2.5 and b = 1.3, so they're better than our starting point. 00:24:20.140 |
And you can see our gradually improving loss function. 00:24:28.260 |
Question - Can we still do analytic derivatives when we are using nonlinear activation functions? 00:24:34.300 |
Answer - Yes, we can use analytical derivatives as long as we're using a function that has 00:24:40.300 |
an analytical derivative, which is pretty much every useful function you can think of, 00:24:47.220 |
except ones that have something like an if-then statement in them, because 00:24:51.420 |
it jumps from here to here, but even those you can approximate. 00:24:55.140 |
So a good example would be ReLU, which is max of (0, x) strictly speaking doesn't really 00:25:06.500 |
have a derivative at every point, or at least not a well-defined one, because this is what 00:25:23.100 |
And so its derivative here is 0, and its derivative here is 1. 00:25:36.860 |
But the thing is, mathematicians care about that kind of thing, we don't. 00:25:41.620 |
Like in real life, this is a computer, and computers are never exactly anything. 00:25:47.020 |
We can either assume that it's like an infinitesimal amount to this side, or an infinitesimal amount to the other side. 00:25:52.820 |
So as long as it has a derivative that you can calculate in a meaningful way in practice 00:26:06.660 |
So one thing you might have noticed about this is that it's going to take an awfully long time. 00:26:12.060 |
And so you might think, okay, let's increase the learning rate. 00:26:17.780 |
So let's get rid of one of these zeroes, oh dear, something went crazy. 00:26:24.940 |
I'll tell you what went crazy, our a's and b's started to go out into like 11 million, 00:26:36.260 |
Let's say this was the shape of our loss function, and this was our initial guess. 00:26:43.220 |
And we figured out the derivative is going this way, actually the derivative is positive 00:26:48.220 |
so we want to go the opposite direction, and so we step a little bit over here. 00:26:54.180 |
And then that leads us to here, and we step a little bit further, and this looks good. 00:27:03.420 |
So rather than stepping a little bit, we stepped a long way, and that put us here. 00:27:10.280 |
And then we stepped a long way again, and that put us here. 00:27:15.140 |
If your learning rate is too high, you're going to get worse and worse. 00:27:21.620 |
So getting your learning rate right is critical to getting your thing to train at all. 00:27:30.260 |
Exploding gradients, yeah, or you can even have gradients that do the opposite. 00:27:35.540 |
Exploding gradients are something a little bit different, but it's a similar idea. 00:27:40.320 |
So it looks like 0.001 is the best we can do, and that's a bit sad because this is really slow. 00:27:49.580 |
So one thing we could do is say, well, given that every time we've been -- actually let 00:28:01.080 |
So let's say we had a 3-dimensional set of axes now, and we kind of had a loss function 00:28:12.140 |
And let's say our initial guess was somewhere over here. 00:28:15.580 |
So over here, the gradient is pointing in this direction. 00:28:25.260 |
And then we might make another step which would put us there, and another step that 00:28:30.420 |
And this is actually the most common thing that happens in neural networks. 00:28:35.180 |
Something that's kind of flat in one dimension like this is called a saddle point. 00:28:41.100 |
And it's actually been proved that the vast majority of the space of a loss function in 00:28:46.140 |
a neural network is pretty much all saddle points. 00:28:49.580 |
So when you look at this, it's pretty obvious what should be done, which is if we go to 00:28:59.420 |
here and then we go to here, we can say on average, we're kind of obviously heading in 00:29:06.580 |
Especially when we do it again, we're obviously heading in this direction. 00:29:09.620 |
So let's take the average of how we've been going so far and do a bit of that. 00:29:18.260 |
If ReLU isn't the cost function, why are we concerned with its differentiability? 00:29:27.900 |
We care about the derivative of the output with respect to the inputs. 00:29:34.100 |
The inputs are the filters, and remember the loss function consists of a function of a function of a function. 00:29:41.940 |
So it is categorical cross-entropy loss applied to softmax, applied to ReLU, applied to dense 00:29:53.420 |
layer, applied to max pooling, applied to ReLU, applied to convolutions, etc. 00:29:59.160 |
So in other words, to calculate the derivative of the loss with respect to the inputs, you 00:30:03.660 |
have to calculate the derivative through that whole function. 00:30:09.560 |
Backpropagation is easy to calculate that derivative because we know that from the chain 00:30:14.740 |
rule, the derivative of a function of a function is simply equal to the product of the derivatives 00:30:22.440 |
So in practice, all we do is we calculate the derivative of every layer with respect 00:30:27.060 |
to its inputs, and then we just multiply them all together. 00:30:30.660 |
And so that's why we need to know the derivative of the activation layers as well as the loss 00:30:45.820 |
What we're going to do is we're going to say, every time we take a step, we're going to 00:30:56.900 |
also calculate the average of the last few steps. 00:31:00.920 |
So after these two steps, the average is this direction. 00:31:04.700 |
So the next step, we're going to take our gradient step as usual, and we're going to 00:31:14.700 |
And that means that we end up actually going to here. 00:31:19.300 |
So we find the average of the last few steps, and it's now even further in this direction, 00:31:24.420 |
and so this is the surface of the loss function with respect to some of the parameters, in 00:31:33.980 |
this case just a couple of parameters, it's just an example of what a loss function might 00:31:38.980 |
So this is the loss, and this is some weight number 1, and this is some weight number 2. 00:31:51.660 |
So we're trying to get our little, if you can imagine this is like gravity, we're trying 00:31:55.820 |
to get this little ball to travel down this valley as far down to the bottom as possible. 00:32:00.980 |
And so the trick is that we're going to keep taking a step, not just the gradient step, 00:32:11.660 |
And so in practice, this is going to end up kind of going "donk, donk, donk, donk, donk." 00:32:21.900 |
So to do that in Excel is pretty straightforward. 00:32:26.600 |
To make things simpler, I have removed the finite-differencing-based derivatives here, 00:32:32.580 |
But other than that, this is identical to the previous spreadsheet. 00:32:37.460 |
Same data, same predictions, same derivatives, except we've done one extra thing, which is 00:32:43.900 |
that when we calculate our new B, we say it's our previous B minus our learning rate times, 00:32:52.860 |
and we're not going times our gradient, but times this cell. 00:32:59.220 |
That cell is equal to our gradient times 0.1 plus the thing just above it times 0.9, and 00:33:11.980 |
the thing just above it is equal to its gradient times 0.1 plus the thing just above it times 00:33:20.740 |
So in other words, this column is keeping track of an average derivative of the last 00:33:27.820 |
few steps that we've taken, which is exactly what we want. 00:33:31.620 |
And we do that for both of our two parameters. 00:33:40.540 |
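A minimal sketch of that momentum column in Python: keep a running average equal to 0.1 times this gradient plus 0.9 times the previous average, and step along that average instead of the raw gradient (the data rows are made-up points on y = 2x + 30).

```python
a, b, lr = 1.0, 1.0, 0.001
avg_da = avg_db = 0.0

for xi, yi in [(14.0, 58.0), (10.0, 50.0), (6.0, 42.0)]:   # made-up rows of y = 2x + 30
    err = a * xi + b - yi
    avg_da = 0.1 * (2 * err * xi) + 0.9 * avg_da   # running average of dL/da
    avg_db = 0.1 * (2 * err) + 0.9 * avg_db        # running average of dL/db
    a -= lr * avg_da
    b -= lr * avg_db
```

In Keras this corresponds to something like SGD(lr=0.001, momentum=0.9).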
So in Keras, when you use momentum, you can say momentum = and you say how much momentum 00:33:51.120 |
So you just pick that parameter - whatever you want. 00:33:53.020 |
Just like your learning rate, you pick it, your momentum factor, you pick it. 00:33:59.100 |
And you choose it by trying a few and find out what works best. 00:34:03.360 |
So let's try running this, and you can see it is still not exactly zipping along. 00:34:17.340 |
Well the reason when we look at it is that we know that the constant term needs to get 00:34:22.180 |
all the way up to 30, and it's still way down at 1.5. 00:34:28.800 |
It's not moving fast enough, whereas the slope term moved very quickly to where we want it 00:34:36.740 |
So what we really want is we need different learning rates for different parameters. 00:34:42.620 |
And doing this is called dynamic learning rates. 00:34:45.580 |
And the first really effective dynamic learning rate approaches have just appeared in the 00:34:55.100 |
And one very popular one is called Adagrad, and it's very simple. 00:34:59.860 |
All of these dynamic learning rate approaches have the same insight, which is this. 00:35:05.340 |
If the parameter that I'm changing, if the derivative of that parameter is consistently 00:35:12.660 |
of a very low magnitude, then if the derivative of this mini-batch is higher than that, then 00:35:20.460 |
what I really care about is the relative difference between how much this variable tends to change 00:35:26.520 |
and how much it's going to change this time around. 00:35:30.100 |
So in other words, we don't just care about what's the gradient, but is the magnitude 00:35:35.980 |
of the gradient a lot more or a lot less than it has tended to be recently? 00:35:41.300 |
So the easy way to calculate the overall amount of change of the gradient recently is to keep 00:35:51.380 |
So what we do with Adagrad is you can see at the bottom of my epoch here, I have got 00:36:02.980 |
And then I have taken the square root, so I've got the root of the sum of the squares, and then 00:36:06.880 |
I've just divided it by the count to get the average. 00:36:09.260 |
So this is, roughly, the root mean square of my gradients. 00:36:13.120 |
So this number here will be high if the magnitude of my gradients is high. 00:36:18.260 |
And because it's squared, it will be particularly high if sometimes they're really high. 00:36:24.620 |
So why is it okay to just use a mini-batch since the surface is going to depend on what 00:36:32.860 |
It's not ideal to just use a mini-batch, and we will learn about a better approach to this 00:36:38.120 |
But for now, let's look at this, and in fact, there are two approaches related to Adagrad 00:36:43.680 |
and Adadelta, and one of them actually does this for all of the gradients so far, and 00:36:51.280 |
one of them uses a slightly more sophisticated approach. 00:36:54.640 |
This approach of doing it on a mini-batch-by-mini-batch basis is slightly different again, but it's the same basic idea. 00:37:02.280 |
Does this mean for a CNN, would dynamic learning rates mean that each filter would have its own learning rate? 00:37:16.560 |
It would mean that every parameter has its own learning rate. 00:37:20.320 |
So this is one parameter, that's a parameter, that's a parameter, that's a parameter. 00:37:24.900 |
And then in our dense layer, that's a parameter, that's a parameter, that's a parameter. 00:37:33.900 |
So when you go model.summary in Keras, it shows you for every layer how many parameters there are. 00:37:41.080 |
So anytime you're unclear on how many parameters there are, you can go back and have a look 00:37:44.900 |
at these spreadsheets, and you can also look at the Keras model.summary and make sure you 00:37:53.280 |
So for the first layer, it's going to be the size of your filter times the number of your 00:38:02.840 |
And then after that, the number of parameters will be equal to the size of your filter times 00:38:08.440 |
the number of filters coming in times the number of filters coming out. 00:38:14.700 |
And then of course your dense layers will be every input goes to every output, so number 00:38:19.000 |
of inputs times the number of outputs, a parameter to the function that is calculating whether 00:38:31.480 |
So what we do now is we say this number here, 1857, this is saying that the derivative of 00:38:40.440 |
the loss with respect to the slope varies a lot, whereas the derivative of the loss 00:38:47.040 |
with respect to the intercept doesn't vary much at all. 00:38:50.660 |
So at the end of every epoch, I copy that up to here. 00:38:56.720 |
And then I take my learning rate and I divide it by that. 00:39:01.820 |
And so now for each of my parameters, I now have this adjusted learning rate, which is 00:39:08.240 |
the learning rate divided by the recent root mean square of the gradients. 00:39:15.100 |
And so you can see that now one of my learning rates is 100 times faster than the other one. 00:39:20.860 |
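A hedged sketch of that Adagrad-style adjustment, with made-up per-row gradients standing in for the spreadsheet's columns:

```python
import numpy as np

# Hypothetical per-row gradients from one epoch: a's gradients are much
# bigger in magnitude than b's, as in the spreadsheet.
grads_a = np.array([-1204.0, -900.0, -1050.0])
grads_b = np.array([-86.0, -70.0, -75.0])

rms_a = np.sqrt((grads_a ** 2).mean())   # root of the average squared gradient
rms_b = np.sqrt((grads_b ** 2).mean())

lr = 0.1
lr_a, lr_b = lr / rms_a, lr / rms_b      # per-parameter adjusted learning rates
print(lr_a, lr_b)                        # b gets a much bigger effective rate than a
```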
And so let's see what happens when I run this. 00:39:23.820 |
Question - Is there a relationship with normalizing the input data? 00:39:29.760 |
Answer - No, there's not really a relationship with normalizing the input data because it 00:39:38.480 |
can help, but still if your inputs are very different scales, it's still a lot more work 00:39:50.840 |
So yes it helps, but it doesn't help so much that it makes it useless, and in fact it turns 00:39:55.380 |
out that even with dynamic learning rates, not just normalized inputs, but batch normalized 00:40:07.320 |
And so the thing about when you're using Adagrad or any kind of dynamic learning rates is generally 00:40:12.120 |
you'll set the learning rate quite a lot higher, because remember you're dividing it by this 00:40:16.760 |
So if I set it to 0.1, oh, too far, so that's no good. 00:40:33.080 |
So you can see after just 5 steps, I'm already halfway there. 00:40:37.560 |
Another 5 steps, getting very close, and another 5 steps, and it's exploded. 00:40:49.680 |
Because as we get closer and closer to where we want to be, you can see that you need to 00:41:00.760 |
And by keeping the learning rates the same, it meant that eventually we went too far. 00:41:07.920 |
So this is still something you have to be very careful of. 00:41:14.000 |
A more elegant, in my opinion, approach to the same thing that Adagrad is doing is something called RMSprop. 00:41:21.960 |
And RMSprop was first introduced in Geoffrey Hinton's Coursera course. 00:41:26.840 |
So if you go to the Coursera course in one of those classes he introduces RMSprop. 00:41:39.680 |
So it's quite funny nowadays because this comes up in academic papers a lot. 00:41:43.560 |
When people cite it, they have to cite Coursera course, chapter 6, at minute 14 and 30 seconds. 00:41:50.920 |
But Hinton has said that this is the official way he wants it cited, so there you go. 00:42:00.920 |
What RMSprop does is exactly the same thing as momentum, but instead of keeping track 00:42:06.920 |
of the weighted running average of the gradients, we keep track of the weighted running average of the squares of the gradients. 00:42:17.920 |
Everything here is the same as momentum so far, except that I take my gradient squared, 00:42:27.160 |
multiply it by 0.1, and add it to my previous cell times 0.9. 00:42:35.600 |
So this is keeping track of the recent running average of the squares of the gradients. 00:42:41.920 |
And when I have that, I do exactly the same thing with it that I did in Adagrad, which is divide the learning rate by its square root. 00:42:48.180 |
So I take my previous guess as to b and then I subtract from it my derivative times the 00:42:56.760 |
learning rate divided by the square root of the recent weighted average of the square gradients. 00:43:05.120 |
So it's doing basically the same thing as Adagrad, but in a way that's doing it kind of continuously, as it goes. 00:43:10.400 |
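A minimal sketch of that RMSprop update, reusing the same made-up rows of y = 2x + 30:

```python
import math

a, b, lr = 1.0, 1.0, 0.1
avg_sq_a = avg_sq_b = 1.0                 # running averages of the squared gradients

for xi, yi in [(14.0, 58.0), (10.0, 50.0), (6.0, 42.0)]:   # made-up rows of y = 2x + 30
    err = a * xi + b - yi
    da, db = 2 * err * xi, 2 * err
    avg_sq_a = 0.9 * avg_sq_a + 0.1 * da ** 2   # square of the gradient, not the gradient
    avg_sq_b = 0.9 * avg_sq_b + 0.1 * db ** 2
    a -= lr * da / math.sqrt(avg_sq_a)          # divide the step by the root of that average
    b -= lr * db / math.sqrt(avg_sq_b)
```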
So these are all different types of learning rate optimization? 00:43:16.080 |
These last two are different types of dynamic learning rate approaches. 00:43:24.540 |
If we run it for a few steps, and again you have to guess what learning rate to start with. 00:43:49.720 |
So as you can see, this is going pretty well. 00:43:51.820 |
And I'll show you something really nice about RMSprop, which is what happens as we get very 00:44:03.840 |
And the reason it doesn't explode is because it's recalculating that running average every 00:44:10.880 |
And so rather than waiting until the end of the epoch by which stage it's gone so far 00:44:14.720 |
that it can't come back again, it just jumps a little bit too far and then it recalculates 00:44:23.040 |
So what happens with RMSprop is if your learning rate is too high, then it doesn't explode, 00:44:27.600 |
it just ends up going around the right answer. 00:44:31.360 |
And so when you use RMSprop, as soon as you see your validation scores flatten out, you 00:44:38.320 |
know this is what's going on, and so therefore you should probably divide your learning rate by 10. 00:44:44.640 |
When I'm running Keras stuff, you'll keep seeing me run a few steps, divide the learning 00:44:49.000 |
rate by 10, run a few steps, and you don't see that my loss function explodes, you just 00:44:54.720 |
So do you want your learning rate to get smaller and smaller? 00:45:00.880 |
Your very first learning rate often has to start small, and we'll talk about that in 00:45:05.200 |
a moment, but once you've kind of got started, you generally have to gradually decrease the learning rate. 00:45:12.760 |
And can you repeat what you said earlier that something does the same thing as Adagrad, 00:45:19.120 |
So RMSprop, which we're looking at now, does exactly the same thing as Adagrad, which is 00:45:25.000 |
divide the learning rate by the root sum of squares of the gradients, but rather than doing it 00:45:33.200 |
since the beginning of time, or every minibatch, or epoch, RMSprop does it continuously using 00:45:41.440 |
the same technique that we learned from momentum, which is take the squared of this gradient, 00:45:48.640 |
multiply it by 0.1, and add it to 0.9 times the last calculation. 00:46:00.560 |
It's a weighted moving average, where we're weighting it such that the more recent squared 00:46:09.160 |
I think it's actually an exponentially weighted moving average, to be more precise. 00:46:15.360 |
So there's something pretty obvious we could do here, which is momentum seems like a good 00:46:18.840 |
idea, RMSprop seems like a good idea, why not do both? 00:46:28.140 |
And so Adam was invented last year, 18 months ago, and hopefully one of the things you see 00:46:35.000 |
from these spreadsheets is that these recently invented things are still at the ridiculously 00:46:44.080 |
So the stuff that people are discovering in deep learning is a long, long, long way away 00:46:49.480 |
from being incredibly complex or sophisticated. 00:46:53.180 |
And so hopefully you'll find this very encouraging, which is if you want to play at the state-of-the-art 00:46:58.040 |
of deep learning, that's not at all hard to do. 00:47:02.840 |
So let's look at Adam, which I remember coming out 12-18 months ago, and everybody 00:47:09.080 |
was so excited because suddenly it became so much easier and faster to train neural 00:47:15.440 |
But once I actually tried to create an Excel spreadsheet out of it, I realized, oh my god, 00:47:22.600 |
And so literally all I did was I copied my momentum page and then I copied across my 00:47:30.920 |
So you can see here I have my exponentially weighted moving average of the gradients, 00:47:42.240 |
Here is my exponentially weighted moving average of the squares of the gradients. 00:47:48.040 |
And so then when I calculate my new parameters, I take my old parameter and I subtract not 00:47:58.240 |
my derivative times the learning rate, but my momentum factor. 00:48:03.480 |
So in other words, the recent weighted moving average of the gradients multiplied by the 00:48:10.560 |
learning rate divided by the recent moving average of the squares of the derivatives, 00:48:18.760 |
So it's literally just combining momentum plus RMSprop. 00:48:28.200 |
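A hedged sketch of Adam as exactly that combination: a moving average of the gradients divided by the square root of a moving average of the squared gradients (the published Adam also bias-corrects these averages; that is omitted here to stay close to the spreadsheet, and the data rows are made up).

```python
import math

a, b = 1.0, 1.0
lr, beta, eps = 0.1, 0.9, 1e-8
m_a = m_b = v_a = v_b = 0.0

for xi, yi in [(14.0, 58.0), (10.0, 50.0), (6.0, 42.0)]:   # made-up rows of y = 2x + 30
    err = a * xi + b - yi
    da, db = 2 * err * xi, 2 * err
    m_a = beta * m_a + (1 - beta) * da          # momentum: moving average of gradients
    m_b = beta * m_b + (1 - beta) * db
    v_a = beta * v_a + (1 - beta) * da ** 2     # RMSprop: moving average of squared gradients
    v_b = beta * v_b + (1 - beta) * db ** 2
    a -= lr * m_a / (math.sqrt(v_a) + eps)
    b -= lr * m_b / (math.sqrt(v_b) + eps)
```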
Let's run 5 epochs, and we can use a pretty high learning rate now because it's really 00:48:40.600 |
And so another 5 epochs does exactly the same thing that RMSprop does, which is it goes 00:48:49.240 |
So we need to do the same thing when we use Adam, and Adam is what I use all the time 00:48:55.040 |
I just divide by 10 every time I see it flatten out. 00:49:01.240 |
So a week ago, somebody came out with something that they called not Adam, but Eve. 00:49:08.800 |
And Eve is an addition to Adam which attempts to deal with this learning rate annealing automatically. 00:49:19.280 |
And so all of this is exactly the same as my Adam page. 00:49:25.200 |
But at the bottom, I've added some extra stuff. 00:49:27.680 |
I have kept track of the root mean squared error, this is just my loss function, and then I 00:49:34.800 |
copy across my loss function from my previous epoch and from the epoch before that. 00:49:42.600 |
And what Eve does is it says how much has the loss function changed. 00:49:48.120 |
And so it's got this ratio between the previous loss function and the loss function before 00:49:55.640 |
So you can see it's the absolute value of the last one minus the one before divided 00:50:02.880 |
And what it says is, let's then adjust the learning rate such that instead of just using 00:50:10.760 |
the learning rate that we're given, let's adjust the learning rate that we're given. 00:50:27.800 |
We take the exponentially weighted moving average of these ratios, so you can see another 00:50:34.400 |
of these betas appearing here, so this thing here is equal to our last ratio times 0.9 00:50:50.160 |
And so then for our learning rate, we divide the learning rate from Adam by this. 00:51:00.560 |
So what that says is if the loss function is moving around a lot, if it's very bumpy, 00:51:11.880 |
we should probably decrease the learning rate because it's going all over the place. 00:51:16.800 |
Remember how we saw before, if we've kind of gone past where we want to get to, it just 00:51:22.920 |
On the other hand, if the loss function is staying pretty constant, then we probably 00:51:30.640 |
So that all seems like a good idea, and so again let's try it. 00:51:41.040 |
Not bad, so after 5 epochs it's gone a little bit too far. 00:51:46.280 |
After a week of playing with it, I used this on State Farm a lot during the week, I grabbed 00:51:50.200 |
a Keras implementation which somebody wrote a day after the paper came out. 00:51:56.600 |
The problem is that because it can both decrease and increase the learning rate, sometimes 00:52:04.840 |
as it gets down to the flat bottom point where it's pretty much optimal, it will often be 00:52:11.800 |
the case that the loss gets pretty constant at that point. 00:52:18.740 |
And so therefore, Eve will try to increase the learning rate. 00:52:22.200 |
And so what I tend to find happens that it would very quickly get pretty close to the 00:52:26.640 |
answer, and then suddenly it would jump to somewhere really awful. 00:52:29.480 |
And then it would start to get to the answer again and jump somewhere really awful. 00:52:49.440 |
We have always run for a specific number of epochs. 00:52:53.680 |
We have not defined any kind of stopping criterion. 00:52:59.500 |
It is possible to define such a stopping criterion, but nobody's really come up with one that's 00:53:06.240 |
And the reason why is that when you look at the graph of loss over time, it doesn't tend 00:53:14.120 |
to look like that, but it tends to look like this. 00:53:19.880 |
And so in practice, it's very hard to know when to stop. 00:53:31.080 |
And particularly with a type of architecture called ResNet that we'll look at next week, 00:53:36.520 |
the authors showed that it tends to go like this. 00:53:45.200 |
So in practice, you have to run your training for as long as you have patience for, at whatever 00:53:49.840 |
the best learning rate you can come up with is. 00:53:53.400 |
So something I actually came up with 6 or 12 months ago, but which I got re-interested in 00:54:01.000 |
after I read this Adam paper, is something which dynamically updates learning rates. 00:54:12.380 |
And rather than using the loss function, which as I just said is incredibly bumpy, there's 00:54:17.080 |
something else which is less bumpy, which is the average sum of squared gradients. 00:54:24.400 |
So I actually created a little spreadsheet of my idea, and I hope to prototype it in 00:54:28.520 |
Python maybe this week or the week after, we'll see how it goes. 00:54:32.240 |
And the idea is basically this, keep track of the sum of the squares of the derivatives 00:54:40.840 |
and compare the sum of the squares of the derivatives from the last epoch to the sum 00:54:45.000 |
of the squares of the derivatives of this epoch and look at the ratio of the two. 00:54:53.600 |
If they ever go up by too much, that would strongly suggest that you've kind of jumped 00:55:01.720 |
So anytime they go up too much, you should decrease the learning rate. 00:55:07.320 |
So I literally added two lines of code to my incredibly simple VBA Adam-with-annealing macro. 00:55:15.680 |
If the gradient ratio is greater than 2, so if it doubles, divide the learning rate by 00:55:39.240 |
You can see it's automatically changing it, so I don't have to do anything, I just keep 00:55:47.440 |
So I'm pretty interested in this idea, I think it's going to work super well because it allows 00:55:52.160 |
me to focus on just running stuff without ever worrying about setting learning rates. 00:55:57.780 |
So I'm hopeful that this approach to automatic learning rate annealing is something that 00:56:02.560 |
we can have in our toolbox by the end of this course. 00:56:08.240 |
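A heavily hedged sketch of the idea just described (this is not a published algorithm): compare each epoch's sum of squared gradients to the previous epoch's and shrink the learning rate when it grows too much. The factor of 2 comes from the description above; dividing by 4 is a made-up placeholder, since the exact divisor isn't given in the transcript.

```python
lr = 0.1
prev_sum_sq = None

def end_of_epoch(sum_sq_grads):
    """Call once per epoch with that epoch's sum of squared gradients."""
    global lr, prev_sum_sq
    if prev_sum_sq is not None and sum_sq_grads > 2 * prev_sum_sq:
        lr /= 4        # hypothetical divisor; we've probably jumped too far
    prev_sum_sq = sum_sq_grads
    return lr
```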
One thing that happened to me today is I tried a lot of different learning rates, I didn't 00:56:22.400 |
But I was working with the whole dataset, trying to understand if I tried with the sample 00:56:30.000 |
and I find something, would that apply to the whole dataset or how do I go about investigating 00:56:38.680 |
Was there another question at the back before we answered that one? 00:56:48.720 |
The question was, "It takes a long time to figure out the optimal learning rate. 00:56:58.760 |
And to answer that question, I'm going to show you how I entered State Farm. 00:57:02.720 |
Indeed, when I started entering State Farm, I started by using a sample. 00:57:11.880 |
And so step 1 was to think, "What insights can we gain from using a sample which can 00:57:19.920 |
still apply when we move to the whole dataset?" 00:57:23.040 |
Because running stuff in a sample took 10 or 20 seconds, and running stuff in the full 00:57:33.560 |
So after I created my sample, which I just created randomly, I first of all wanted to 00:57:43.280 |
find out what does it take to create a better-than-random model here. 00:57:51.880 |
So I always start with the simplest possible model. 00:57:55.600 |
And so the simplest possible model has a single dense layer. 00:58:03.960 |
Rather than worrying about calculating the average and the standard deviation of the 00:58:07.080 |
input and subtracting it all out in order to normalize your input layer, you can just use a batch-norm layer as your first layer. 00:58:15.640 |
And so if you start with a batch-norm layer, it's going to do that for you. 00:58:18.700 |
So anytime you create a Keras model from scratch, I would recommend making your first layer a batch-norm layer. 00:58:24.980 |
So this is going to normalize the data for me. 00:58:29.300 |
So that's a cool little trick which I haven't actually seen anybody use elsewhere, but I 00:58:32.800 |
think it's a good default starting point all the time. 00:58:38.440 |
If I'm going to use a dense layer, then obviously I have to flatten everything into a single vector. 00:58:53.800 |
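A minimal sketch of that simplest-possible starting model, assuming Keras 1-style imports and Theano "channels first" image ordering as used in the course (the image size and optimizer are assumptions):

```python
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.layers.normalization import BatchNormalization

model = Sequential([
    BatchNormalization(axis=1, input_shape=(3, 224, 224)),  # normalizes the raw input for us
    Flatten(),                                              # dense layers need a flat vector
    Dense(10, activation='softmax')                         # one output per State Farm class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```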
So I tried fitting it, compiled it, fit it, and nothing happened. 00:59:00.940 |
Not only did nothing happen to my validation, but really nothing happened by training. 00:59:05.960 |
It's only taking 7 seconds per epoch to find this out, so that's okay. 00:59:13.760 |
So I look at model.summary, and I see that there's 1.5 million parameters. 00:59:18.720 |
And that makes me think, okay, it's probably not underfitting. 00:59:21.620 |
It's probably unlikely that with 1.5 million parameters, there's really nothing useful 00:59:27.240 |
It's only a linear model, true, but I still think it should be able to do something. 00:59:31.800 |
So that makes me think that what must be going on is it must be doing that thing where it 00:59:38.800 |
And it's particularly easy to jump too far at the very start of training, and let me 00:59:49.440 |
It turns out that there are often reasonably good answers that are way too easy to find. 00:59:59.920 |
So one reasonably good answer would be always predict 0. 01:00:06.560 |
Because there are 10 output classes in the State Farm competition, there's one of 10 different 01:00:13.000 |
types of distracted driving, and you are scored based on the cross-entropy loss. 01:00:21.600 |
And what that's looking at is how accurate are each of your 10 predictions. 01:00:25.960 |
So rather than trying to predict something well, what if we just always predict 0.01? 01:00:35.120 |
Nine times out of 10, you're going to be right. 01:00:37.440 |
Because 9 out of the 10 categories, it's not that. 01:00:43.160 |
So actually always predicting 0.01 would be pretty good. 01:00:47.500 |
Now it turns out it's not possible to do that because we have a softmax layer. 01:00:51.440 |
And a softmax layer, remember, is e^(x_i) divided by the sum over j of e^(x_j). 01:00:56.660 |
And so in a softmax layer, everything has to add to 1. 01:01:02.240 |
So therefore if it makes one of the classes really high, and all of the other ones really 01:01:08.560 |
low, then 9 times out of 10 it is going to be right, 9 times out of 10. 01:01:14.960 |
So in other words, it's a pretty good answer for it to always predict some random class, 01:01:27.400 |
So anybody who tried this, and I saw a lot of people on the forums this week saying, 01:01:35.360 |
And the folks who got the interesting insight were the ones who then went on to say, "And 01:01:39.520 |
then I looked at my predictions and it kept predicting the same class with great confidence 01:01:49.760 |
Our next step then is to try decreasing the learning rate. 01:02:04.220 |
So here is exactly the same model, but I'm now using a much lower learning rate. 01:02:18.360 |
So it's only 12 seconds of compute time to figure out that I'm going to have to start 01:02:24.600 |
Once we've got to a point where the accuracy is reasonably better than random, we're well 01:02:32.800 |
away from that part of the loss function now that says always predict everything as the 01:02:37.880 |
same class, and therefore we can now increase the learning rate back up again. 01:02:43.100 |
So generally speaking, for these harder problems, you'll need to start at an epoch or two at 01:02:47.480 |
a low learning rate, and then you can increase it back up again. 01:02:53.160 |
So you can see now I can put it back up to 0.01 and very quickly increase my accuracy. 01:03:00.920 |
So you can see here my accuracy on my validation set is 0.5 using a linear model, and this 01:03:07.800 |
is a good starting point because it says to me anytime that my validation accuracy is 01:03:13.200 |
worse than about 0.5, this is really no better than even a linear model, so this is not worth 01:03:21.240 |
One obvious question would be, how do you decide how big a sample to use? 01:03:25.920 |
And what I did was I tried a few different sizes of sample for my validation set, and 01:03:31.040 |
I then said, okay, evaluate the model, in other words, calculate the loss function, on the 01:03:37.680 |
validation set, but for a whole bunch of randomly sampled batches, so do it 10 times. 01:03:46.320 |
And so then I looked and I saw how the accuracy changed. 01:03:49.760 |
With the validation set at 1000 images, my accuracy changed from 0.48 or 0.47 to 0.51, 01:04:01.280 |
It's small enough that I think I can make useful insights using a sample size of this 01:04:18.800 |
One is, are there other architectures that work well? 01:04:22.200 |
So the obvious thing to do with a computer vision problem is to try a convolutional neural 01:04:28.880 |
And here's one of the most simple convolutional neural networks, two convolutional layers, 01:04:37.600 |
And then one dense layer followed by my dense output layer. 01:04:43.640 |
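A hedged sketch of that kind of simple convnet, in Keras 1-style code; the layer sizes here are illustrative rather than the lesson's exact ones:

```python
from keras.models import Sequential
from keras.layers import Flatten, Dense, Convolution2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization

model = Sequential([
    BatchNormalization(axis=1, input_shape=(3, 224, 224)),
    Convolution2D(32, 3, 3, activation='relu'),   # first convolutional layer
    MaxPooling2D(),
    Convolution2D(64, 3, 3, activation='relu'),   # second convolutional layer
    MaxPooling2D(),
    Flatten(),
    Dense(200, activation='relu'),                # the single dense layer
    Dense(10, activation='softmax')               # the dense output layer
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```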
So again I tried that and found that it very quickly got to an accuracy of 100% on the 01:04:51.920 |
training set, but only 24% on the validation set. 01:04:56.240 |
And that's because I was very careful to make sure my validation set included different 01:05:00.760 |
drivers to my training set, because on Kaggle it told us that the test set has different drivers. 01:05:07.600 |
So it's much harder to recognize what a driver is doing if we've never seen that driver before. 01:05:13.360 |
So I could see that convolutional neural networks clearly are a great way to model this kind 01:05:19.600 |
of data, but I'm going to have to think very carefully about overfitting. 01:05:24.200 |
So step 1 to avoiding overfitting is data augmentation, as we learned in our data augmentation class. 01:05:50.880 |
I tried shifting the channels, so the colors a bit. 01:05:55.600 |
And for each of those, I tried four different levels. 01:06:06.200 |
So here are my best data augmentation amounts. 01:06:11.200 |
So on 1560 images, so a very small set, this is just my sample, I then ran my very simple 01:06:18.160 |
two convolutional layer model with this data augmentation at these optimized parameters. 01:06:27.280 |
After 5 epochs, I only had 0.1 accuracy on my validation set. 01:06:32.160 |
But I can see that my training set is continuing to improve. 01:06:36.200 |
And so that makes me think, okay, don't give up yet, try reducing the learning rate and running a few more epochs. 01:06:43.240 |
So this is where you've got to be careful not to jump to conclusions too soon. 01:06:49.400 |
So I ran a few more, and it's improving well. 01:06:57.240 |
It kept getting better and better and better until we were getting 67% accuracy. 01:07:10.880 |
So this 1.15 validation loss is well within the top 50% in this competition. 01:07:19.620 |
So using an incredibly simple model, on just a sample, we can get in the top half of this 01:07:24.840 |
Kaggle competition simply by using the right kind of data augmentation. 01:07:30.360 |
So I think this is a really interesting insight about the power of this incredibly useful 01:07:36.560 |
Okay, let's have a five minute break, and we'll do your question first. 01:07:53.040 |
It's unlikely that there's going to be a class imbalance in my sample unless there was an 01:07:58.800 |
equivalent class imbalance in the real data, because I've got a thousand examples. 01:08:05.000 |
And so statistically speaking, that's unlikely. 01:08:07.800 |
If there was a class imbalance in my original data, then I want my sample to have that class 01:08:14.960 |
So at this point, I felt pretty good that I knew that we should be using a convolutional 01:08:24.160 |
neural network, which is obviously a very strong hypothesis to start with anyway. 01:08:29.920 |
And also I felt pretty confident when you knew what kind of learning rate to start with, 01:08:36.680 |
and then how to change it, and also what data augmentation to do. 01:08:44.240 |
The next thing I wanted to wonder about was how else do I handle overfitting, because 01:08:49.480 |
although I'm getting some pretty good results, I'm still overfitting hugely, 0.6 versus 0.9. 01:08:57.560 |
So the next thing in our list of ways to avoid overfitting, and I hope you guys all remember 01:09:06.680 |
The five steps, let's go and have a look at it now to remind ourselves. 01:09:14.000 |
Approaches to reducing overfitting, these are the five steps. 01:09:18.160 |
We can't add more data, we've tried using data augmentation, we're already using batch 01:09:23.640 |
norm and convnets, so the next step is to add regularization. 01:09:28.600 |
And dropout is our favored regularization technique. 01:09:32.180 |
So I was thinking, okay, before we do that, I'll just mention one more thing about this 01:09:42.200 |
I have literally never seen anybody write down a process as to how to figure out what 01:09:50.640 |
kind of data augmentation to use and the amount. 01:09:53.900 |
The only posts I've seen on it always rely on intuition, which is basically like, look 01:10:00.680 |
at the images and think about how much they seem like they should be able to move around 01:10:06.240 |
I really tried this week to come up with a rigorous, repeatable process that you could 01:10:15.560 |
And that process is go through each data augmentation type one at a time, try 3 or 4 different levels 01:10:21.420 |
of it on a sample with a big enough validation set that it's pretty stable to find the best 01:10:29.560 |
value of each of the data augmentation parameters, and then try combining them all together. 01:10:40.640 |
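A hedged sketch of that process in Keras; the directory path and the specific candidate values are placeholders, not the lesson's tuned amounts:

```python
from keras.preprocessing.image import ImageDataGenerator

# One augmentation type at a time, a few levels each.
for width_shift in [0.0, 0.05, 0.1, 0.2]:
    gen = ImageDataGenerator(width_shift_range=width_shift)
    batches = gen.flow_from_directory('sample/train', target_size=(224, 224),
                                      batch_size=64)
    # ...fit the simple convnet on `batches` for a few epochs, record the
    # validation accuracy, and keep the best width_shift...

# Then repeat for rotation_range, height_shift_range, shear_range, zoom_range
# and channel_shift_range, and finally combine the winning values together.
```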
So I hope you kind of come away with this as a practical message which probably your 01:10:49.920 |
colleagues, even if some of them claim to be deep learning experts, I doubt that they're 01:10:54.800 |
So this is something you can hopefully get people into the practice of doing. 01:11:00.800 |
Regularization however, we cannot do on a sample. 01:11:04.600 |
And the reason why is that step 1, add more data, well that step is very correlated with 01:11:16.120 |
As we add more data, we need less regularization. 01:11:20.000 |
So as we move from a sample to the full dataset, we're going to need less regularization. 01:11:25.920 |
So to figure out how much regularization to use, we have to use the whole dataset. 01:11:30.360 |
So at this point I changed it to use the whole dataset, not the sample, and I started using 01:11:39.360 |
So you can see that I started with my data augmentation amounts that you've already 01:11:49.200 |
And ran it for a few epochs to see what would happen. 01:11:57.220 |
So we're getting up into the 75% now, and before we were in the 64%. 01:12:02.360 |
So once we add clipping, which is very important for getting the best cross-entropy loss function, 01:12:10.800 |
I haven't checked where that would get us on the Kaggle leaderboard, but I'm pretty 01:12:16.560 |
sure it would be at least in the top third based on this accuracy. 01:12:21.640 |
So I ran a few more epochs with an even lower learning rate and got 0.78, 0.79. 01:12:35.000 |
So this is going to be well up into the top third, maybe even the top third of the leaderboard. 01:12:43.280 |
So I got to this point by just trying out a couple of different levels of Dropout, and 01:12:54.200 |
A lot of people put small amounts of Dropout in their convolutional layers as well. 01:13:03.160 |
But what VGG does is to put 50% Dropout after each of its dense layers, and that doesn't 01:13:09.680 |
seem like a bad rule of thumb, so that's what I was doing here. 01:13:12.440 |
And then trying around a few different sizes of dense layers to try and find something 01:13:17.320 |
I didn't spend a heap of time on this, so there's probably better architectures, but 01:13:21.160 |
as you can see this is still a pretty good one. 01:13:26.040 |
Now so far we have not used a pre-trained network at all. 01:13:35.080 |
So this is getting into the top third of the leaderboard without even using any ImageNet features. 01:13:44.120 |
But we're pretty sure that ImageNet features would be helpful. 01:13:47.640 |
So that was the next step, was to use ImageNet features, so VGG features. 01:13:52.560 |
Specifically, I was reasonably confident that all of the convolutional layers of VGG are probably fine as they are. 01:14:01.520 |
I didn't expect I would have to fine-tune them much, if at all, because the convolutional 01:14:06.120 |
layers are the things which really look at the shape and structure of things, rather than how those shapes get combined for our particular labels. 01:14:12.960 |
And these are photos of the real world, just like ImageNet are photos of the real world. 01:14:18.580 |
So I really felt like most of the time, if not all of it, was likely to be spent on the dense layers. 01:14:25.200 |
So therefore, because calculating the convolutional layers takes nearly all the time, because 01:14:30.560 |
that's where all the computation is, I pre-computed the output of the convolutional layers. 01:14:37.240 |
And we've done this before, you might remember. 01:14:41.020 |
When we looked at dropout, we did exactly this. 01:14:50.840 |
We figured out what was the last convolutional layer's ID. 01:14:55.240 |
We grabbed all of the layers up to that ID, we built a model out of them, and then we predicted the output of that model on our data. 01:15:05.640 |
And that told us the value of those features, those activations, from VGG's last convolutional layer. 01:15:18.540 |
So I said okay, grab VGG 16, find the last convolutional layer, build a model that contains 01:15:24.000 |
everything up to and including that layer, predict the output of that model. 01:15:34.320 |
So predicting the output means calculate the activations of that last convolutional layer. 01:15:42.080 |
And since that takes some time, then save that so I never have to do it again. 01:15:49.640 |
So then in the future I can just load that array. 01:15:54.920 |
So this array, I'm not going to calculate those, I'm simply going to load them. 01:16:07.840 |
And so have a think about what would you expect the shape of this to be. 01:16:13.560 |
And you can figure out what you would expect the shape to be by looking at model.summary(). 01:16:36.440 |
We find that conv_val_feat.shape is 512x14x14, as expected. 01:16:54.080 |
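Here is a minimal sketch of that pre-computation, written against the modern tf.keras API rather than the older Keras version used in the lecture, so the layer name and the channels-last shape differ from the 512x14x14 shown in the notebook; trn_data and val_data are assumed to be the pre-loaded image arrays.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

vgg = VGG16(include_top=False, input_shape=(224, 224, 3))
# The last conv layer, before the final max-pooling, is 'block5_conv3' here.
conv_model = Model(vgg.input, vgg.get_layer('block5_conv3').output)

conv_feat     = conv_model.predict(trn_data, batch_size=64)   # training images
conv_val_feat = conv_model.predict(val_data, batch_size=64)   # validation images

np.save('conv_feat.npy', conv_feat)          # cache so we never compute it again
np.save('conv_val_feat.npy', conv_val_feat)

conv_val_feat = np.load('conv_val_feat.npy')
conv_val_feat.shape   # (n_val, 14, 14, 512) channels-last; the lecture's Theano backend shows 512x14x14
```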
Is there a reason you chose to leave out the max_pooling and flatten layers? 01:17:00.560 |
So why did I leave out the max_pooling and flatten layers? 01:17:05.440 |
Probably because it takes zero time to calculate them and the max_pooling layer loses information. 01:17:13.560 |
So I thought, given that I might want to play around with other types of pooling or other 01:17:19.360 |
types of convolutions or whatever, pre-calculating up to this layer was the safest choice. 01:17:28.360 |
Having said that, the first thing I did with it in my new model was to max-pool it and flatten it. 01:17:40.640 |
So now that I have the output of VGG for the last conv layer, I can now build a model that starts from there. 01:17:51.500 |
And so the input to this model will be the output of those conv layers. 01:17:54.740 |
And the nice thing is it won't take long to run this, even on the whole dataset, because 01:17:58.240 |
the dense layers don't take much computation time. 01:18:01.560 |
So here's my model, and by making p a parameter, I could try a wide range of dropout amounts, 01:18:11.160 |
and I fit it, and one epoch takes 5 seconds on the entire dataset. 01:18:20.360 |
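Here is a sketch of what that dense model could look like; the layer sizes are illustrative rather than the exact ones from the notebook, the dropout rate p is the parameter being varied, and conv_feat, conv_val_feat, trn_labels and val_labels are the cached features and one-hot labels from above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (MaxPooling2D, Flatten, Dense, Dropout,
                                     BatchNormalization)

def get_bn_model(p):
    return Sequential([
        MaxPooling2D(input_shape=conv_feat.shape[1:]),  # pool the cached conv output
        Flatten(),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(10, activation='softmax'),                # 10 State Farm classes
    ])

bn_model = get_bn_model(p=0.6)
bn_model.compile(optimizer='adam', loss='categorical_crossentropy',
                 metrics=['accuracy'])
bn_model.fit(conv_feat, trn_labels, epochs=3, batch_size=64,
             validation_data=(conv_val_feat, val_labels))
```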
And you can see 1 epoch gets me 0.65, 3 epochs get me 0.75. 01:18:36.280 |
I have something that in 15 seconds can get me 0.75 accuracy. 01:18:39.960 |
And notice here, I'm not using any data augmentation. 01:18:47.140 |
Because you can't pre-compute the output of convolutional layers if you're using data 01:18:52.520 |
Because with data augmentation, your convolutional layers give you a different output every time. 01:19:05.040 |
You can't use data augmentation if you are pre-computing the output of a layer. 01:19:10.840 |
Because think about it, every time it sees the same cat photo, it's rotating it by a 01:19:17.520 |
different amount, or moving it by a different amount. 01:19:22.240 |
So it gives a different output of the convolutional layer, so you can't pre-compute it. 01:19:31.520 |
There is something you can do, which I've played with a little bit, which is you could 01:19:35.680 |
pre-compute something that's 10 times bigger than your dataset, consisting of 10 different 01:19:43.720 |
data-augmented versions of it, which is why I actually had this -- where is it? 01:19:53.400 |
Which is what I was doing here when I brought in this data generator with augmentations, 01:19:57.520 |
and I created something called data-augmented convolutional features, in which I predicted 01:20:03.960 |
5 times the amount of data, or calculated 5 times the amount of data. 01:20:09.800 |
And so that basically gave me a dataset 5 times bigger, and that actually worked pretty well. 01:20:16.260 |
It's not as good as having a whole new sample every time, but it's kind of a compromise. 01:20:22.200 |
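A sketch of that compromise, with illustrative generator settings and placeholder paths: run the augmenting generator over the training set a few times and cache the conv features from each augmented pass, using the conv_model defined earlier.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

aug_gen = ImageDataGenerator(rotation_range=10, width_shift_range=0.05,
                             height_shift_range=0.05, shear_range=0.1,
                             zoom_range=0.05)
batches = aug_gen.flow_from_directory('data/statefarm/train',
                                      target_size=(224, 224), batch_size=64,
                                      shuffle=False, class_mode='categorical')

da_feats, da_labels = [], []
for _ in range(5):                          # 5 augmented copies of the training set
    for _ in range(len(batches)):           # one full pass over the data
        x, y = next(batches)
        da_feats.append(conv_model.predict(x, verbose=0))
        da_labels.append(y)

da_conv_feat = np.concatenate(da_feats)     # 5x as many cached feature maps
da_labels    = np.concatenate(da_labels)
```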
So once I played around with these dense layers, I then did some more fine-tuning. 01:20:30.420 |
Basically, at this point I tried saying, okay, let's go through 01:20:36.400 |
all of my layers in my model from 16 onwards and set them to trainable and see what happens. 01:20:44.120 |
So I tried retraining, fine-tuning some of the convolutional layers as well. 01:20:50.400 |
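That experiment looked something like this; a sketch where 'model' is assumed to be the full VGG-plus-dense-layers model, and batches and val_batches are the usual image generators.

```python
from tensorflow.keras.optimizers import Adam

for layer in model.layers[16:]:      # everything from layer 16 onwards
    layer.trainable = True

model.compile(optimizer=Adam(learning_rate=1e-5),   # very low learning rate for fine-tuning
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(batches, validation_data=val_batches, epochs=2)
```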
So I experimented with my hypothesis, and I found it was correct, which is it seems 01:20:54.960 |
that for this particular model, coming up with the right set of dense layers is what matters most. 01:21:04.400 |
If we want rotational invariance, should we keep the max pooling, or can another layer 01:21:15.120 |
Max pooling doesn't really have anything to do with rotational invariance. 01:21:27.280 |
So I'm going to show you one more cool trick. 01:21:29.200 |
I'm going to show you a little bit of State Farm every week from now on because there's 01:21:33.200 |
so many cool things to try, and I want to keep reviewing CNNs because convolutional 01:21:38.560 |
neural nets really are becoming what deep learning is all about. 01:21:47.640 |
The two tricks are called pseudo-labeling and knowledge distillation. 01:21:53.680 |
So if you Google for pseudo-labeling semi-supervised learning, you can see the original paper that introduced it. 01:22:13.160 |
This is a Geoffrey Hinton paper, Distilling the Knowledge in a Neural Network. 01:22:20.040 |
So these are a couple of really cool techniques which come from Hinton and Jeff Dean, which is not a bad pedigree. 01:22:34.680 |
What we're going to do is we are going to use the test set to give us more information. 01:22:40.400 |
Because in State Farm, the test set has 80,000 images in it, and the training set has 20,000 01:22:54.920 |
What could we do with those 80,000 images which we don't have labels for? 01:23:03.880 |
It seems like we should be able to do something with them, and there's a great little picture that shows the idea. 01:23:07.880 |
Imagine we only had two points, and we knew their labels, white and black. 01:23:14.400 |
And then somebody said, "How would you label this?" 01:23:17.960 |
And then they told you that there's a whole lot of other unlabeled data, and suddenly you'd answer differently. 01:23:29.720 |
It's helped us because it's told us how the data is structured. 01:23:34.840 |
This is what semi-supervised learning is all about. 01:23:36.920 |
It's all about using the unlabeled data to try and understand something about the structure 01:23:41.120 |
of it and use that to help you, just like in this picture. 01:23:47.680 |
Pseudo-labeling and knowledge distillation are a way to do this. 01:23:51.920 |
And what we do is -- and I'm not going to do it on the test set, I'm going to do it 01:23:55.880 |
on the validation set because it's a little bit easier to see the impact of it, and maybe 01:24:00.920 |
next week we'll look at the test set, because that's going to be much cooler when we have all 80,000 unlabeled images to use. 01:24:07.720 |
What we do is we take our model, some model we've already built, and we predict the outputs for the unlabeled data. 01:24:17.600 |
In this case, I'm using the validation set, as if it was unlabeled. 01:24:26.440 |
So now that we have predictions for the test set or the validation set, we can treat them 01:24:35.520 |
as labels; they're not correct labels, but they're labels nonetheless. 01:24:40.080 |
So what we then do is we take our training labels and we concatenate them with our validation pseudo-labels. 01:24:49.800 |
And so we now have a bunch of labels for all of our data. 01:24:53.840 |
And so we can now also concatenate our convolutional features with the convolutional features of the validation set. 01:25:09.420 |
So the model we use is exactly the same model we had before, and we train it in exactly the same way, and we get a better result. 01:25:27.240 |
And the reason why is just because we used this additional unlabeled data to try to figure out the structure of the data. 01:25:39.840 |
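In code, the whole pseudo-labeling step is only a few lines; this sketch reuses the bn_model and cached feature arrays from the earlier sketches.

```python
import numpy as np

# Predict 'labels' for the validation set, treating it as if it were unlabeled.
val_pseudo = bn_model.predict(conv_val_feat, batch_size=64)

# Concatenate the real training labels with the pseudo-labels, and the training
# conv features with the validation conv features.
comb_feat   = np.concatenate([conv_feat, conv_val_feat])
comb_labels = np.concatenate([trn_labels, val_pseudo])

# Exactly the same model, trained in exactly the same way.
bn_model.fit(comb_feat, comb_labels, epochs=2, batch_size=64,
             validation_data=(conv_val_feat, val_labels))
```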
How do you learn how to design a model and when to stop messing with them? 01:25:43.240 |
It seems like you've taken a few initial ideas, tweaked them to get higher accuracy, but unless 01:25:48.120 |
your initial guesses are amazing, there should be plenty of architectures that would also work well. 01:25:53.880 |
So if and when you figure out how to find an architecture and stop messing with it, please let us know. 01:26:08.680 |
I look back at these models I'm showing you and I'm thinking, I bet there's something better. 01:26:19.360 |
There are all kinds of ways of optimizing other hyperparameters of deep learning. 01:26:26.020 |
For example, there's something called Spearmint, which is a Bayesian optimization hyperparameter tuning package. 01:26:39.360 |
In fact, just last week a new paper came out for hyperparameter tuning, but this is all 01:26:43.840 |
about tuning things like the learning rate and stuff like that. 01:26:50.280 |
Coming up with architectures, there are some people who have tried to come up with some 01:27:03.840 |
kind of more general architectures, and we're going to look at one next week called ResNets, 01:27:09.840 |
which seem to be pretty encouraging in that direction, but even then, ResNet, which we're 01:27:19.360 |
going to learn about next week, is an architecture which won ImageNet in 2015. 01:27:27.400 |
The author of ResNet, Kaiming He from Microsoft, said, "The reason ResNet is so great is it 01:27:35.960 |
lets us build very, very, very deep networks." 01:27:40.080 |
Indeed he showed a network with over a thousand layers, and it was totally state-of-the-art. 01:27:45.680 |
Somebody else came along a few months ago and built wide ResNets with like 50 layers, and they worked even better. 01:27:57.880 |
So the very author of the ImageNet winner completely got wrong the reason why his invention works so well. 01:28:06.000 |
The idea that any of us have any idea how to create optimal architectures is just totally wrong. 01:28:12.960 |
So that's why I'm trying to show you what we know so far, which is like the processes 01:28:18.000 |
you can use to build them without waiting forever. 01:28:21.320 |
So in this case, doing your data augmentation on the small sample in a rigorous way, figuring 01:28:27.000 |
out that probably the dense layers are where the action is at, and pre-computing the input to them. 01:28:32.560 |
These are the kinds of things that can keep you sane. 01:28:36.240 |
I'm showing you the outcome of my last week's playing around with this. 01:28:41.500 |
I can tell you that during this time I continually fell into the trap of running stuff on the 01:28:47.520 |
whole network and all the way through and fiddling around with hyperparameters. 01:28:53.240 |
And I have to stop myself and have a cup of tea and say, "Okay, is this really a good use of my time?" 01:28:59.080 |
So we all do it, but not you anymore because you've been to this class. 01:29:13.080 |
I'm just a little confused because it feels like maybe we're using our validation set 01:29:18.440 |
as part of our training program and I'm confused how it's not true. 01:29:22.000 |
But look, we're not using the validation labels, nowhere here does it say "val_labels". 01:29:30.160 |
So yeah, we are absolutely using our validation set but we're using the validation set's inputs. 01:29:41.320 |
So next week I will show you this page again, and this time I'm going to use the test set. 01:29:46.680 |
I just didn't have enough time to do it this time around. 01:29:49.560 |
And hopefully we're going to see some great results, and when we do it on the test set 01:29:52.840 |
then you'll be really convinced that it's not using the labels, because we don't have the test labels at all. 01:29:56.720 |
But you can see here, all it's doing is it's creating pseudo-labels by calculating what 01:30:02.280 |
it thinks it ought to be based on the model that we just built with that 75% accuracy. 01:30:10.520 |
And so then it's able to use the input data for the validation set in an intelligent way 01:30:31.480 |
Yeah, it's using bn_model, and bn_model is the thing that we just fitted. 01:30:46.120 |
By using the training labels, so this is bn_model, the thing with this 0.755 accuracy. 01:30:52.520 |
So if we were to look at - I know we haven't gone through this - can you move a bit closer to the microphone? 01:30:59.280 |
And this is supervised and unsupervised learning? 01:31:04.340 |
Right, and semi-supervised works because you're giving it a model which already knows about 01:31:09.360 |
a bunch of labels but unsupervised wouldn't know. 01:31:17.520 |
I wasn't particularly thinking about doing this, but unsupervised learning is where you're 01:31:23.040 |
trying to build a model when you have no labels at all. 01:31:27.320 |
How many people here would be interested in hearing about unsupervised learning during 01:31:32.240 |
Okay, enough people, I should do that, I will add it. 01:31:42.240 |
During the week, perhaps we can create a forum thread about unsupervised learning and I can 01:31:46.520 |
learn about what you're interested in doing with it because many things that people think 01:31:55.120 |
Okay, so pseudo-labeling is insane and awesome, and we need the green box back. 01:32:11.640 |
Earlier you talked about learning about the structure of the data that you can learn from 01:32:14.800 |
the validation set, can you say more about that? 01:32:20.640 |
Other than that picture I showed you before with the two little spirally things. 01:32:25.520 |
And that picture was kind of showing how the points clustered in a way that you couldn't 01:32:29.120 |
see from the two labeled points alone. 01:32:31.420 |
So think about that Matt Zeiler paper we saw, or the Jason Yosinski visualization tool we looked at. 01:32:38.500 |
The layers learn shapes and textures and concepts. 01:32:46.300 |
In those 80,000 test images of people driving in different distracted ways, there are lots 01:32:52.960 |
of concepts to learn about the ways in which people drive while distracted, even though we don't have labels for them. 01:33:00.080 |
So what we're doing is we're trying to learn better convolutional or dense features, and that's what the unlabeled data helps with. 01:33:10.660 |
So the structure of the data here is basically like what do these pictures tend to look like. 01:33:16.400 |
More importantly, in what ways do they differ? 01:33:19.360 |
Because it's the ways that they differ that therefore must be related to how they're labeled. 01:33:25.920 |
Can you use your updated model to make new labels for the validation set? 01:33:32.520 |
Yes, you can absolutely do pseudo-labeling on pseudo-labeling, and you should. 01:33:38.200 |
And if I don't get sick of running this code, I will try it next week. 01:33:44.600 |
Could that introduce bias towards your validation set? 01:33:49.480 |
No because we don't have any validation labels. 01:33:53.080 |
One of the tricky parameters in pseudo-labeling is, in each batch, how much of it do I make pseudo-labeled versus real labeled data. 01:34:05.120 |
One of the big things that stopped me from getting the test set in this week is that 01:34:10.440 |
Keras doesn't have a way of creating batches which have like 80% of this set and 20% of 01:34:19.040 |
that set, which is really what I want -- because if I just pseudo-labeled the whole test set 01:34:24.440 |
and then concatenated it, then 80% of my batches are going to be pseudo-labels. 01:34:31.080 |
And generally speaking, the rule of thumb I've read is that somewhere around a quarter 01:34:35.240 |
to a third of your mini-batches should be pseudo-labels. 01:34:39.040 |
So I need to write some code basically to get Keras to generate batches which are a 01:34:45.640 |
mix from two different places before I can do this properly. 01:34:49.440 |
There are two questions and I think you're asking the same thing. 01:34:53.240 |
Are your pseudo-labels only as good as the initial model you're beginning from, so do 01:34:57.320 |
you need to have kind of a particular accuracy in your model? 01:35:00.880 |
Yeah, your pseudo-labels are indeed as good as your model you're starting from. 01:35:05.880 |
People have not studied this enough to know how sensitive it is to those initial labels. 01:35:13.080 |
No, this is too new, you know, and just try it. 01:35:25.360 |
My guess is that pseudo-labels will be useful regardless of what accuracy level you're at 01:35:32.280 |
As long as you are in a semi-supervised learning context, i.e. you have a lot of unlabeled data. 01:35:40.560 |
I really want to move on because I told you I wanted to get us down the path to NLP this week. 01:35:49.040 |
And it turns out that the path to NLP, strange as it sounds, starts with collaborative filtering. 01:35:58.380 |
This week we are going to learn about collaborative filtering. 01:36:01.740 |
And so collaborative filtering is a way of doing recommender systems. 01:36:06.560 |
And I sent you guys an email today with a link to more information about collaborative 01:36:11.300 |
filtering and recommender systems, so please read those links if you haven't already just 01:36:17.720 |
to get a sense of what the problem we're solving here is. 01:36:22.520 |
In short, what we're trying to do is to learn to predict who is going to like what, and how much. 01:36:36.200 |
For example, in the $1 million Netflix prize: what rating will this person give this movie? 01:36:46.880 |
If you're writing Amazon's recommender system to figure out what to show you on their homepage, 01:36:52.080 |
which products is this person likely to rate highly? 01:36:57.960 |
If you're trying to figure out what to show in a news feed, which articles is this person most likely to be interested in? 01:37:06.840 |
There's a lot of different ways of doing this, but broadly speaking there are two main classes of approach. 01:37:13.000 |
One is based on metadata, which is, for example, this person filled out a survey in which they said which genres they like. 01:37:23.200 |
And we have also taken all of our movies and put them into genres, so here are all the ones in the genres this person likes. 01:37:31.880 |
Broadly speaking, that would be a metadata-based approach. 01:37:36.000 |
A collaborative filtering-based approach is very different. 01:37:39.040 |
It says, "Let's find other people like you and find out what they liked and assume that 01:37:49.520 |
And specifically when we say people like you, we mean people who rated the same movies you've 01:37:54.840 |
watched in a similar way, and that's called collaborative filtering. 01:38:00.060 |
It turns out that in a large enough dataset, collaborative filtering is so much better 01:38:05.680 |
than the metadata-based approaches that adding metadata doesn't even improve it at all. 01:38:11.280 |
So when people in the Netflix prize actually went out to IMDB and sucked in additional 01:38:17.280 |
data and tried to use that to make it better, at a certain point it didn't help. 01:38:23.860 |
Once their collaborative filtering models were good enough, it didn't help. 01:38:26.440 |
And that's because of something I learned about 20 years ago, when I used to do a lot of this kind of work. 01:38:31.720 |
It turns out that asking people about their behavior is crap compared to actually looking 01:38:38.960 |
So let me show you what collaborative filtering looks like. 01:38:42.040 |
What we're going to do is use a dataset called MovieLens. 01:38:45.560 |
So you guys hopefully will be able to play around with this this week. 01:38:50.440 |
Unfortunately Rachel and I could not find any Kaggle competitions that were about recommender 01:38:56.280 |
systems and where the competitions were still open for entries. 01:39:00.120 |
However, there is something called MovieLens which is a widely studied dataset in academia. 01:39:14.520 |
Perhaps surprisingly, approaching or beating an academic state of the art is way easier 01:39:20.960 |
than winning a Kaggle competition, because in Kaggle competitions lots and lots and lots 01:39:25.120 |
of people look at that data and they try lots and lots and lots of things and they use a 01:39:28.840 |
really pragmatic approach, whereas academics state of the arts are done by academics. 01:39:35.440 |
So with that said, the MovieLens benchmarks are going to be much easier to beat than any 01:39:42.060 |
Kaggle competition, but it's still interesting. 01:39:46.520 |
So you can download the MovieLens dataset from the MovieLens website, and you'll 01:39:52.120 |
see that there's one there recommended for new research with 20 million ratings in it. 01:39:57.920 |
Also conveniently, they have a small one with only 100,000 ratings. 01:40:01.480 |
So you don't have to build a sample, they have already built a sample for you. 01:40:14.280 |
And as you'll see here, I've started using pandas; pd is the standard abbreviation for pandas. 01:40:23.520 |
So for those of you that don't use pandas yet, hopefully the peer group pressure is kicking in. 01:40:27.440 |
So pandas is a great way of dealing with structured data and you should use it. 01:40:32.040 |
Reading a CSV file is this easy, showing the first few items is this easy, finding out 01:40:37.640 |
how big it is, finding out how many users and movies there are, are all this easy. 01:40:46.360 |
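For reference, those pandas steps look something like this; the path is a placeholder for wherever you unzip the small MovieLens download.

```python
import pandas as pd

ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
ratings.head()                                         # first few rows: userId, movieId, rating, timestamp
len(ratings)                                           # how many ratings
ratings.userId.nunique(), ratings.movieId.nunique()    # how many users and movies
```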
I wanted to play with this in Excel, because that's the only way I know how to teach. 01:40:52.160 |
What I did was I took the ratings, grabbed the 15 busiest movie-watching 01:41:03.120 |
users, and then I grabbed the 15 most watched movies, and then I created a cross-tab of their ratings. 01:41:16.320 |
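A sketch of how you could build that same cross-tab in pandas before pasting it into Excel, reusing the ratings DataFrame from above:

```python
top_users  = ratings.userId.value_counts().index[:15]    # 15 busiest raters
top_movies = ratings.movieId.value_counts().index[:15]   # 15 most-rated movies

subset = ratings[ratings.userId.isin(top_users) & ratings.movieId.isin(top_movies)]
crosstab = pd.crosstab(subset.userId, subset.movieId,
                       values=subset.rating, aggfunc='mean')
```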
Here is the resulting table, built from the MovieLens data, for the 15 busiest movie-watching users and the 15 most watched movies. 01:41:36.440 |
These three users have watched every single one of these movies. 01:41:45.640 |
And these have been watched by every single one of these users. 01:41:49.660 |
So user 14 kind of liked movie 27, loved movie 49, hated movie 51. 01:41:58.080 |
So let's have a look, is there anybody else here with similar tastes? 01:42:05.200 |
So this guy really liked movie 49 and didn't much like movie 57, so they may feel the same way about other movies as user 14 does. 01:42:13.800 |
That's the basic essence of collaborative filtering. 01:42:15.680 |
We're going to try and automate it a little bit. 01:42:18.560 |
And the way we're going to automate it is we're going to say let's pretend for each 01:42:21.800 |
movie we had like five characteristics, which is like is it sci-fi, is it action, is it 01:42:29.480 |
dialogue-heavy, is it new, and does it have Bruce Willis. 01:42:38.360 |
And then we could have those five things for every user as well, which is this user somebody 01:42:52.160 |
who likes sci-fi, action, dialogue, new movies, and Bruce Willis. 01:42:56.960 |
And so what we could then do is take the matrix product, or dot product, of that set of 01:43:07.280 |
user features with that set of movie features. 01:43:11.400 |
If this person likes sci-fi and it's sci-fi and they like action and it is action and 01:43:15.160 |
so forth, then a high number will appear in here for this matrix product of these two 01:43:20.680 |
vectors, this dot product of these two vectors. 01:43:25.640 |
And so this would be a cool way to build up a collaborative filtering system if only we 01:43:33.040 |
could create these five items for every movie and for every user. 01:43:40.940 |
Now because we don't actually know what five things are most important for users and what 01:43:45.680 |
five things are most important for movies, we're instead going to learn them. 01:43:50.560 |
And the way we learn them is the way we learn everything, which is we start by randomizing them. 01:43:59.380 |
So here are five random numbers for every movie, and here are five random numbers for 01:44:06.360 |
every user, and in the middle is the dot product of that movie with that user. 01:44:14.800 |
Once we have a good set of movie factors and user factors for each one, then each of these 01:44:22.120 |
ratings will be similar to each of the observed ratings, and therefore this sum of squared errors will be low. 01:44:36.840 |
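Here is a tiny numpy version of what the spreadsheet is doing; the names are illustrative, and 'crosstab' is assumed to be the 15x15 grid of observed ratings built above, with NaN where a user hasn't rated a movie.

```python
import numpy as np

n_users, n_movies, n_factors = 15, 15, 5
user_factors  = np.random.rand(n_users,  n_factors)    # random numbers to start with
movie_factors = np.random.rand(n_movies, n_factors)

pred   = user_factors @ movie_factors.T                # (15, 15) predicted ratings grid
actual = crosstab.values                               # observed ratings, NaN if unrated

mask = ~np.isnan(actual)                               # only score the observed cells
loss = np.sum((pred[mask] - actual[mask]) ** 2)        # the sum of squared errors
```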
So we start with our random numbers, we start with a loss function of 40. 01:44:43.640 |
So we now want to use gradient descent, and it turns out that every copy of Excel has a gradient descent optimizer built in. 01:44:51.680 |
So we're going to go ahead and use it, it's called solver. 01:44:55.380 |
And so we have to tell it what thing to minimize, so it's saying minimize this, and which things 01:45:01.720 |
do we want to change, which is all of our factors, and then we set it to minimize and click Solve. 01:45:09.500 |
And then we can see in the bottom left, it is trying to make this better and better. 01:45:16.800 |
Notice I'm not saying stochastic gradient descent. 01:45:20.300 |
Stochastic gradient descent means it's doing it mini-batch at a mini-batch time. 01:45:24.620 |
Gradient descent means it's doing the whole data set each time. 01:45:28.040 |
Excel uses gradient descent, not stochastic gradient descent. 01:45:37.000 |
It's so slow because it doesn't know how to create analytical derivatives, so it's having 01:45:41.080 |
to calculate the derivatives with finite difference, which is slow. 01:45:45.400 |
So here we've got a solution, it's got it down to 5. 01:45:51.600 |
So we can see here that it predicted 5.14 and it was actually 5. 01:46:05.980 |
It's a little bit too easy because there are 5 factors for every user and 5 for every movie. 01:46:17.520 |
We've got nearly as many factors as we have things to calculate, so it's kind of over-specified. 01:46:28.280 |
The piece we're missing is that some users probably just like movies more than others, 01:46:35.460 |
and some movies are probably just more liked than others. 01:46:39.440 |
And this dot product does not allow us in any way to say this is an enthusiastic user, or this is a generally well-liked movie. 01:46:52.560 |
So here is exactly the same spreadsheet, but I've added one more row to the movies part 01:47:02.280 |
and one more column to the users part for our biases. 01:47:07.280 |
And I've updated the formula so that as well as the matrix multiplication, it also adds in the user bias and the movie bias. 01:47:18.080 |
So this is saying this is a very popular movie, and here we are, this is a very enthusiastic user. 01:47:30.760 |
And so now that we have a collaborative filtering plus bias, we can do gradient descent on that. 01:47:39.120 |
So previously our gradient descent loss function was 5.6. 01:47:45.540 |
We would expect it to be better with bias, because we can better specify what's going on. 01:47:52.120 |
So again we run solver, solve, and we let that zip along, and we see what happens. 01:47:59.760 |
So these things we're calculating are called latent factors. 01:48:06.440 |
A latent factor is some factor that is influencing outcome, but we don't quite know what it is. 01:48:15.680 |
And in fact what happens is when people do collaborative filtering, they then go back 01:48:19.400 |
and they draw graphs where they say here are the movies that are scored highly on this 01:48:24.860 |
latent factor and low on this latent factor, and so they'll discover the Bruce Willis factor 01:48:34.400 |
And so if you look at the Netflix prize visualizations, you'll see these graphs people do. 01:48:39.980 |
And the way they do them is they literally do this. 01:48:42.440 |
Not in Excel, because they're not that cool, but they calculate these latent factors and 01:48:48.480 |
then they draw pictures of them, and then they actually write the name of each movie at its point on the chart. 01:49:05.400 |
In fact I also have an error here, because any time that my rating is empty, I really 01:49:13.800 |
want to be setting this prediction to empty as well, which means my parenthesis was in the wrong place. 01:49:21.280 |
So I'm going to recalculate this with my error fixed up and see if we get a better answer. 01:49:30.480 |
They're randomly generated and then optimized with gradient descent. 01:49:59.480 |
For some reason, this seems crazier than what we were doing at CNN's, because movies I understand 01:50:11.200 |
more than features of images that I just don't intuitively understand. 01:50:17.080 |
So we can look at some pictures next week, but during the week, Google for Netflix prize 01:50:25.040 |
visualizations and you will see these pictures. 01:50:32.840 |
It figures out what are the most interesting dimensions on which we can rate a movie. 01:50:41.000 |
Things like level of action and sci-fi and dialogue driven are very important features, 01:50:49.960 |
But rather than pre-specifying those features, we have definitely learned from this class 01:50:56.160 |
that calculating features using gradient descent is going to give us better features than trying to specify them by hand. 01:51:11.040 |
Tell me next week if you find some particularly interesting things, or if it still seems crazy to you. 01:51:24.760 |
Now there's really only one main new concept we have to learn, which is we started out 01:51:31.000 |
with data not in a crosstab form, but in this form. 01:51:36.440 |
We have user ID, movie ID, rating triplets, and I crosstab them. 01:51:47.920 |
So the rows and the columns above the random numbers, are they the variations and the features 01:51:55.700 |
in the movies and the variations and features in the users? 01:51:59.520 |
Each of these rows is one feature of a movie, and each of these columns is one feature of 01:52:06.360 |
And so one of these sets of 5 is one set of features for a user. 01:52:14.240 |
I think it's interesting and crazy because you're basically taking random data and you 01:52:19.520 |
can generate those features for people that you don't know and movies that you've never seen. 01:52:26.360 |
Yeah, this is the thing I just did at the start of class, which is there's nothing mathematically complicated going on here. 01:52:38.500 |
The hard part is unlearning the idea that this should be hard; gradient descent can just figure these things out for us. 01:52:50.840 |
I just wanted to point out that you can think of this as a smaller, more concise way to represent that big cross-tab. 01:53:03.740 |
In math, there's a concept of a matrix factorization, an SVD for example, which is where you basically 01:53:10.360 |
take a big matrix and turn it into a tall narrow one and a short wide one, and multiply them together. 01:53:17.920 |
Instead of having how user 14 rated every single movie, we just have 5 numbers that summarize user 14. 01:53:28.760 |
So earlier, did you say that both the user features were random as well as the movie features? 01:53:36.200 |
I guess I'm having trouble relating to this. I thought, you know, usually we run something like gradient 01:53:44.760 |
descent on something that has inputs that you know, and here, what do you actually know? 01:53:53.360 |
What we know is the resulting ratings; that's what we know. 01:53:57.160 |
So can you perhaps come up with the wrong answer, like you flip the feature for a movie and 01:54:06.800 |
a user, because if you're doing a multiplication, how do you know which value goes where? 01:54:16.240 |
If one of the numbers was in the wrong spot, our loss function would be less good and therefore 01:54:23.480 |
there would be a gradient from that weight to say you should make this weight a little 01:54:31.680 |
So all the gradient descent is doing is saying okay, for every weight, if we make it a little 01:54:36.160 |
higher, does it get better or if we make it a little bit lower, does it get better? 01:54:39.760 |
And then we keep making them a little bit higher and lower until we can't go any better. 01:54:47.520 |
And we had to decide how to combine the weights. 01:54:50.720 |
So this was our architecture, our architecture was let's take a dot product of some assumed 01:54:58.200 |
user feature and some assumed movie feature, and, in the second case, let's add in some assumed biases. 01:55:06.320 |
So we had to build an architecture and we built the architecture using common sense, 01:55:11.000 |
which is to say this seems like a reasonable way of thinking about this. 01:55:13.600 |
I'm going to show you a better architecture in a moment. 01:55:15.920 |
In fact, we're running out of time, so let me jump into the better architecture. 01:55:21.600 |
So I wanted to point out that there is something new we're going to have to learn here, which 01:55:25.560 |
is how do you start with a numeric user_id and look up what their 5-element vector of factors is. 01:55:35.960 |
Now remember, when we have user_id's like 1, 2 and 3, one way to specify them is using one-hot encoding. 01:55:52.600 |
So one way to handle this situation would be if this was our user matrix, it was one 01:56:01.720 |
hot encoded, and then we had a factor matrix containing a whole bunch of random numbers 01:56:12.360 |
-- one way to do it would be to take a dot product or a matrix product of this and this. 01:56:29.200 |
And what that would do would be for this one here, it would basically say let's multiply 01:56:34.480 |
that by this, it would grab the first column of the matrix. 01:56:42.360 |
And this here would grab the second column of the matrix. 01:56:46.040 |
And this here would grab the third column of the matrix. 01:56:49.080 |
So one way to do this in Keras would be to represent our user_id's as one hot encodings, 01:56:56.720 |
and to create a user factor matrix just as a regular matrix like this, and then take a matrix product of the two. 01:57:07.800 |
That's horribly slow because if we have 10,000 users, then this thing is 10,000 wide and that's 01:57:16.040 |
a really big matrix multiplication when all we're actually doing is saying, for user_id number 1, take the first column. 01:57:22.520 |
For user_id number 2, take the second column, for user_id number 3, take the third column. 01:57:27.120 |
And so Keras has something which does this for us and it's called an embedding layer. 01:57:32.280 |
And embedding is literally something which takes an integer as an input and looks up 01:57:36.600 |
and grabs the corresponding column as output. 01:57:39.920 |
So it's doing exactly what we're seeing in this spreadsheet. 01:57:43.160 |
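Here is a tiny demonstration that the one-hot matrix product and a simple lookup give exactly the same answer, which is all an embedding layer is doing (numpy indexes by row rather than column, but the idea is identical):

```python
import numpy as np

n_users, n_factors = 4, 5
factors = np.random.rand(n_users, n_factors)   # one row of factors per user

user_ids = np.array([0, 2, 2, 1])
one_hot  = np.eye(n_users)[user_ids]           # (4, 4) one-hot encodings

via_matmul = one_hot @ factors                 # the slow way: a full matrix product
via_lookup = factors[user_ids]                 # the fast way: just grab the rows

assert np.allclose(via_matmul, via_lookup)
```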
How do you deal with missing values, so if a user has not rated a particular movie? 01:57:52.400 |
That's no problem, missing values are just ignored; if a rating is missing, I just set that cell to empty so it doesn't contribute to the loss. 01:57:57.520 |
How do you break up the training and test set? 01:58:02.000 |
I broke up the training and test set randomly by grabbing random numbers and saying are 01:58:09.280 |
they greater or less than 0.8 and then split my ratings into two groups based on that. 01:58:14.160 |
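A sketch of that split, plus the id remapping you would need so the raw userId and movieId values can index an embedding directly (the remapping isn't mentioned in the lecture, but the later sketches assume it):

```python
import numpy as np

# Remap the raw ids to contiguous integers 0..n-1.
ratings['userId']  = ratings.userId.astype('category').cat.codes
ratings['movieId'] = ratings.movieId.astype('category').cat.codes
n_users, n_movies = ratings.userId.nunique(), ratings.movieId.nunique()

msk = np.random.rand(len(ratings)) < 0.8       # roughly an 80/20 split
trn, val = ratings[msk], ratings[~msk]
```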
And you're choosing those from the ratings so that you have some ratings 01:58:18.640 |
from all users and you have some ratings for all movies? 01:58:30.480 |
In Keras, there's one other thing, I'm going to stop using the sequential model in Keras 01:58:38.200 |
and start using the functional model in Keras. 01:58:40.760 |
I'll talk more about this next week, but you can read about it during the week. 01:58:44.080 |
There are two ways of creating models in Keras, the sequential and the functional. 01:58:48.320 |
They do similar things, but the functional one is much more flexible, and it's going to be what we use here. 01:58:54.880 |
So this is going to look slightly unfamiliar, but the ideas are the same. 01:58:59.460 |
So we create an input layer for a user, and then we say now create an embedding layer 01:59:07.480 |
for n users, which is 671, and we want to create how many latent factors? 50 in this case. 01:59:19.360 |
And then I create a movie input, and then I create a movie embedding with 50 factors, 01:59:26.840 |
and then I say take the dot product of those, and that's our model. 01:59:34.540 |
So now please compile the model, and now train it, taking the userID and movieID as input, 01:59:42.120 |
the rating as the target, and run it for 6 epochs, and I get a 1.27 loss. 01:59:53.720 |
Notice that I'm not doing anything else clever, it's just that simple dot product. 01:59:59.840 |
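A sketch of that model in the functional API, written against the modern tf.keras names rather than the older Keras version used in the notebook, and reusing trn, val, n_users and n_movies from the split sketch above:

```python
from tensorflow.keras.layers import Input, Embedding, Dot, Flatten
from tensorflow.keras.models import Model

n_factors = 50

user_in  = Input(shape=(1,), name='user_in')
movie_in = Input(shape=(1,), name='movie_in')
u = Embedding(n_users,  n_factors)(user_in)    # (batch, 1, 50) user factors
m = Embedding(n_movies, n_factors)(movie_in)   # (batch, 1, 50) movie factors

x = Flatten()(Dot(axes=2)([u, m]))             # dot product of the two factor vectors

model = Model([user_in, movie_in], x)
model.compile(optimizer='adam', loss='mse')
model.fit([trn.userId, trn.movieId], trn.rating, epochs=6, batch_size=64,
          validation_data=([val.userId, val.movieId], val.rating))
```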
Here's how I add the bias: I use exactly the same kind of embedding inputs as before. 02:00:07.760 |
So my user and movie embeddings are the same. 02:00:11.120 |
And then I create bias by simply creating an embedding with just a single output. 02:00:18.400 |
And so then my new model is: do a dot product, and then add the user bias, and add the movie bias. 02:00:35.200 |
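The bias version is just two more single-output embeddings added on; again a modern-API sketch, reusing the inputs and factor embeddings from the previous sketch.

```python
from tensorflow.keras.layers import Add

ub = Flatten()(Embedding(n_users,  1)(user_in))   # one bias per user
mb = Flatten()(Embedding(n_movies, 1)(movie_in))  # one bias per movie

x = Flatten()(Dot(axes=2)([u, m]))
x = Add()([x, ub, mb])                            # dot product + user bias + movie bias

bias_model = Model([user_in, movie_in], x)
bias_model.compile(optimizer='adam', loss='mse')
bias_model.fit([trn.userId, trn.movieId], trn.rating, epochs=6, batch_size=64,
               validation_data=([val.userId, val.movieId], val.rating))
```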
Well, there are lots of sites on the internet where you can find benchmarks for MovieLens, 02:00:41.080 |
and on the 100,000 dataset, we're generally looking for an RMSE of about 0.89. 02:00:47.120 |
There are some more here; the best one here is about 0.9, here we are, 0.89, and this one is 02:01:00.560 |
on the 1,000,000 dataset, so let's go to the 100,000 one, where the RMSE is around 0.89. 02:01:10.360 |
So kind of high 0.8s, low 0.9s would be state-of-the-art according to these benchmarks. 02:01:16.600 |
So, we're on the right track, but we're not there yet. 02:01:21.000 |
So let's try something better, let's create a neural net. 02:01:25.960 |
We create a movie embedding and a user embedding, again with 50 factors, and this time we don't 02:01:31.600 |
take a dot product, we just concatenate the two vectors together, stick one on the end 02:01:37.820 |
And because we now have one big vector, we can create a neural net, create a dense layer, 02:01:43.480 |
add dropout, create an activation, compile it, and fit it. 02:01:51.120 |
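A sketch of that neural-net version, with illustrative layer sizes and dropout rates, again reusing the inputs and embeddings defined above: concatenate the two embeddings and put a small dense network on top.

```python
from tensorflow.keras.layers import Concatenate, Dense, Dropout

x = Concatenate()([Flatten()(u), Flatten()(m)])   # one 100-element vector per rating
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.7)(x)
x = Dense(1)(x)                                   # the predicted rating

nn = Model([user_in, movie_in], x)
nn.compile(optimizer='adam', loss='mse')
nn.fit([trn.userId, trn.movieId], trn.rating, epochs=5, batch_size=64,
       validation_data=([val.userId, val.movieId], val.rating))
```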
And after 5 epochs, we get something way better than state-of-the-art. 02:01:57.280 |
So we couldn't find anything better than about 0.89. 02:02:00.680 |
And so this whole notebook took me like half an hour to write, and so I don't claim to 02:02:06.080 |
be a collaborative filtering expert, but I think it's pretty cool that these things that 02:02:10.360 |
were written by people that write collaborative filtering software for a living, that's what 02:02:15.680 |
these websites basically are coming from, places that use LensKit. 02:02:21.100 |
So LensKit is a piece of software for recommender systems. 02:02:26.000 |
We have just killed their benchmark, and it took us 10 seconds to train. 02:02:33.720 |
And we're right on time, so we're going to take one last question. 02:02:36.480 |
So in the neural net, why is it that there are a number of factors so low? 02:02:46.960 |
Oh, actually I thought it was an equal, not a comma, never mind, we're good. 02:02:53.240 |
So that was a very, very quick introduction to embeddings, like as per usual in this class, 02:02:59.200 |
I kind of stick the new stuff in at the end and say go study it. 02:03:04.160 |
So your job this week is to keep improving State Farm, and hopefully have a go at the new fisheries competition. 02:03:10.960 |
By the way, in the last half hour, I just created this little notebook in which I basically 02:03:15.880 |
copied the Dogs and Cats Redux competition into something which does the same thing with 02:03:22.960 |
the fish data, and I quickly submitted a result. 02:03:27.880 |
So we currently have one of us in 18th place, yay. 02:03:34.760 |
But most importantly, download the movie lens data and have a play with that and we'll talk