Lesson 4: Practical Deep Learning for Coders
00:00:00.000 |
I guess I noticed during the week from some of the questions I've been seeing that the 00:00:06.460 |
idea of what a convolution is, is still a little counter-intuitive or surprising to some of you. 00:00:15.860 |
I feel like the only way I know to teach things effectively is by creating a spreadsheet, 00:00:23.420 |
This is the famous number 7 from lesson 0, and I just copied and pasted the numbers into Excel. 00:00:31.700 |
They're not exactly 0, they're actually floats, just rounded off. 00:00:39.060 |
And as you can see, I'm just using conditional coloring, you can see the shape of our little 00:00:47.140 |
So I wanted to show you exactly what a convolution does, and specifically what a convolution 00:00:57.660 |
So we are generally using modern convolutions, and that means a 3x3 convolution. 00:01:04.700 |
So here is a 3x3 convolution, and I have just randomly generated 9 random numbers. 00:01:17.540 |
Here is my second filter, it is 9 more random numbers. 00:01:23.500 |
So this is what we do in Keras when we ask for a convolutional layer. 00:01:30.020 |
We tell it, the first thing we pass it is how many filters do we want, and that's how 00:01:35.540 |
many of these random matrices do we want it to build for us. 00:01:40.880 |
So in this case, it's as if I passed Convolution2D, the first parameter would be 2, and the second would be the 3x3 filter size. 00:01:50.900 |
And what happens to this little random matrix? 00:01:55.140 |
In order to calculate the very first item, it takes the sum of the blue stuff, those 00:02:06.380 |
9, times the red stuff, those 9, all added together. 00:02:13.460 |
So let's go down here into where it gets a bit darker, how does this get calculated? 00:02:17.980 |
This is equal to these 9 times these 9, when I say times, I mean element-wise times, so 00:02:25.100 |
the top left by the top left, the middle by the middle, and so forth, and add them all 00:02:33.940 |
So it's just as you go through, we take the corresponding 3x3 area in the image, and we 00:02:42.420 |
multiply each of those 9 things by each of these 9 things, and then we add those 9 products 00:02:52.900 |
So there's really nothing particularly weird or confusing about it, and I'll make this 00:03:01.820 |
You can see that when I get to the top left corner, I can't move further left and up because 00:03:09.940 |
I've reached the edge, and this is why when you do a 3x3 convolution without zero padding, 00:03:17.020 |
you lose one pixel on each edge because you can't push this 3x3 any further. 00:03:24.580 |
So if we go down to the bottom left, you can see again the same thing, it kind of gets 00:03:31.380 |
So that's why you can see that my result is one row less than my starting point. 00:03:39.220 |
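To make the arithmetic concrete, here is a minimal NumPy sketch (not the lesson's spreadsheet; the array sizes are just illustrative) of what one 3x3 filter does, and why one pixel is lost on each edge when there is no zero padding:

```python
import numpy as np

def conv3x3_valid(img, filt):
    """Slide a 3x3 filter over a 2D image with no zero padding: each output
    cell is the element-wise product of the filter with the corresponding
    3x3 patch, all summed together."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i + 3, j:j + 3] * filt).sum()
    return out

img = np.random.rand(28, 28)   # e.g. a 28x28 MNIST digit
filt = np.random.rand(3, 3)    # 9 random numbers, like the spreadsheet's filter
print(conv3x3_valid(img, filt).shape)   # (26, 26): one pixel lost on each edge
```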
So I did this for two different filters, so here's my second filter, and you can see when 00:03:44.580 |
I calculate this one, it's exactly the same thing, it's these 9 times each of these 9 00:03:59.220 |
So that's how we start with our first, in this case I've created two convolutional filters, 00:04:06.340 |
and this is the output of those two convolutional filters, they're just random at this point. 00:04:11.300 |
So my second layer, now my second layer is no longer enough just to have a 3x3 matrix, 00:04:17.260 |
and I need a 3x3x2 tensor because to calculate my top left of my second convolutional layer, 00:04:29.440 |
I need these 9 by these 9 added together, plus these 9 by these 9 added together. 00:04:41.300 |
Because at this point, my previous layer is no longer just one thing, but it's two things. 00:04:47.380 |
Now indeed, if our original picture was a 3-channel color picture, our very first convolutional 00:04:54.900 |
layer would have had to have been 3x3x3 tensors. 00:05:01.020 |
So all of the convolutional layers from now on are going to be 3x3 by the number of filters in the previous layer. 00:05:13.900 |
So here is my first, I've just drawn it like this, 3x3x2 tensor, and you can see it's taking 00:05:21.580 |
9 from here, 9 from here and adding those two together. 00:05:26.860 |
And so then for my second filter in my second layer, it's exactly the same thing. 00:05:33.140 |
I've created two more random matrices, or one more random 3x3x2 tensor, and here again 00:05:40.360 |
I have those 9 by these 9 sum plus those 9 by those 9 sum, and that gives me that one. 00:05:53.980 |
So that gives me my first two layers of my convolutional neural network. 00:06:02.980 |
Max pooling is slightly more awkward to do in Excel, but that's fine, we can still handle 00:06:11.140 |
So max pooling, because I'm going to do 2x2 max pooling, it's going to halve the resolution in each axis. 00:06:24.120 |
That number is simply the maximum of those 4. 00:06:29.260 |
And then that number is the maximum of those 4, and so forth. 00:06:34.860 |
So with max pooling, we had two filters in the previous layer, so we still have two filters, 00:06:43.060 |
but now our filters have half the resolution in each of the x and y axes. 00:06:51.820 |
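Here is a quick NumPy sketch of that 2x2 max pooling on one filter's output (the array size is illustrative):

```python
import numpy as np

def maxpool2x2(act):
    """2x2 max pooling: each output cell is the max of a non-overlapping
    2x2 block, so the resolution halves in both x and y."""
    h, w = act.shape
    return act[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

act = np.random.rand(26, 26)   # one filter's output from the conv layer
print(maxpool2x2(act).shape)   # (13, 13)
```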
And so then I thought, okay, we've done two convolutional layers, how did you go from 00:06:59.860 |
one matrix to two matrices in the second layer? 00:07:05.300 |
How did I go from one matrix to two matrices, as in how did I go from just this one thing 00:07:14.260 |
So the answer to that is I just created two random 3x3 filters. 00:07:19.180 |
This is my first random 3x3 filter, this is my second random 3x3 filter. 00:07:24.980 |
So each output then was simply equal to each corresponding 9-element section, multiplied 00:07:35.640 |
So because I had two random 3x3 matrices, I ended up with two outputs. 00:07:48.780 |
Alright, so now that we've got our max pooling layer, let's use a dense layer to turn it 00:08:03.140 |
So a dense layer means that every single one of our activations from our max pooling layer 00:08:13.460 |
So these are a whole bunch of random numbers. 00:08:17.540 |
So what I do is I take every one of those random numbers and multiply each one by a 00:08:24.140 |
corresponding input and add them all together. 00:08:33.780 |
So I've got the sum product of this and this. 00:08:36.860 |
In MNIST we would have 10 activations because we need an activation for 0, 1, 2, 3, so forth 00:08:44.620 |
So for MNIST we would need 10 sets of these dense weight matrices so that we could calculate 00:08:56.600 |
If we were only calculating one output, this would be a perfectly reasonable way to do 00:09:03.540 |
So for one output, it's just the sum product of everything from our final layer with a 00:09:10.900 |
weight for everything in that final layer, add it together. 00:09:24.260 |
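A minimal sketch of the dense layer as a plain sum-product, with the 10 weight sets we'd need for MNIST (the sizes here are illustrative, assuming two 13x13 max-pooled filters flattened into one vector):

```python
import numpy as np

flattened = np.random.rand(338)        # 2 filters x 13 x 13, flattened
weights = np.random.rand(338, 10)      # one column of weights per MNIST class
outputs = flattened @ weights          # each output = sum(activation_i * weight_i)
print(outputs.shape)                   # (10,)
```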
Both dense layers and convolutional layers couldn't be easier mathematically. 00:09:33.140 |
I think the surprising thing is when you say rather than using random weights, let's calculate 00:09:41.980 |
the derivative of what happens if we were to change that weight up by a bit or down 00:09:51.100 |
In this case, I haven't actually got as far as calculating a loss function, but we could 00:10:01.980 |
And so we can calculate the derivative of the loss with respect to every single weight 00:10:05.940 |
in the dense layer, and every single weight in all of our filters in that layer, and every 00:10:13.500 |
single weight in all of our filters in this layer. 00:10:18.100 |
And then with all of those derivatives, we can calculate how to optimize all of these 00:10:22.860 |
And the surprising thing is that when we optimize all of these weights, we end up with these 00:10:27.720 |
incredibly powerful models, like those visualizations that we saw. 00:10:32.540 |
So I'm not quite sure where the disconnect between the incredibly simple math and the 00:10:40.260 |
I think it might be that it's so easy, it's hard to believe that's all it is, but I'm 00:10:49.540 |
And so to help you really understand this, I'm going to talk more about SGD. 00:11:00.900 |
So the output activation we generally use is the softmax, so e^(x_i) divided by the sum over j of e^(x_j). 00:11:08.220 |
If it's just binary, that's just the equivalent of having 1/(1 + e^(-x)). 00:11:16.020 |
So softmax in the binary case simplifies into a sigmoid function. 00:11:31.300 |
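A tiny sketch checking that claim numerically (the value of z is arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# With two classes (logits z and 0), the softmax probability of the first
# class is e^z / (e^z + 1) = 1 / (1 + e^(-z)), i.e. a sigmoid.
z = 1.7
print(softmax(np.array([z, 0.0]))[0], sigmoid(z))   # the two values match
```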
We're going to talk about not just SGD, but every variant of SGD, including one invented 00:11:46.900 |
SGD happens for all layers at once, yes we calculate the derivative of all the weights 00:11:55.180 |
And when to have a max pool after convolution versus when not to? 00:11:59.860 |
When to have a max pool after a convolution, who knows. 00:12:06.540 |
This is a very controversial question, and indeed some people now say never use max pool. 00:12:13.180 |
Instead of using max pool when you're doing the convolutions, don't do a convolution over 00:12:20.540 |
every set of 9 pixels, but instead skip a pixel each time. 00:12:33.100 |
Geoffrey Hinton, who is kind of the father of deep learning, has gone as far as saying 00:12:37.980 |
that the extremely great success of max pooling has been the greatest problem deep learning has faced. 00:12:48.780 |
Because to him, it really stops us from going further. 00:12:56.060 |
I don't know if that's true or not, I assume it is because he's Geoffrey Hinton and I'm not. 00:13:05.220 |
For now, we use max pooling every time we're doing fine-tuning because we need to make 00:13:11.500 |
sure that our architecture is identical to the original VGG's authors' architecture and 00:13:16.140 |
so we have to put max pooling wherever they do. 00:13:20.420 |
Why do we want max pooling or downsampling or anything like that? 00:13:23.580 |
Are we just trying to look at bigger features at the input? 00:13:32.660 |
The first is that max pooling helps with translation invariance. 00:13:37.780 |
So it basically says if this feature is here, or here, or here, or here, I don't care. 00:13:47.860 |
Every time we max pool, we end up with a smaller grid, which means that our 3x3 convolutions 00:13:53.560 |
are effectively covering a larger part of the original image, which means that our convolutions 00:14:08.260 |
Is Geoffrey Hinton cool with the idea of skipping a pixel each time? 00:14:23.980 |
You can learn all about the things that he thinks we ought to have but don't yet have. 00:14:45.780 |
He did point out that -- I can't remember what it was, but one of the key pieces of 00:14:52.420 |
deep learning that he invented took like 17 years from conception to working, so he is 00:14:58.660 |
somebody who sticks with these things and makes it work. 00:15:04.900 |
Max pooling is not unique to image processing. 00:15:08.020 |
It's likely to be useful for any kind of convolutional neural network, and a convolutional neural 00:15:12.580 |
network can be used for any kind of data that has some kind of consistent ordering. 00:15:17.580 |
So things like speech, or any kind of audio, or some kind of consistent time series, all 00:15:24.860 |
of these things have some kind of ordering to them and therefore you can use CNN and 00:15:31.760 |
And as we look at NLP, we will be looking more at convolutional neural networks for 00:15:38.740 |
And interestingly, the author of Keras last week, or maybe the week before, made the contention 00:15:46.620 |
that perhaps it will turn out that CNNs are the architecture that will be used for every 00:15:56.060 |
And this was just after one of the leading NLP researchers released a paper basically 00:16:01.020 |
showing a state-of-the-art result in NLP using convolutional neural networks. 00:16:07.540 |
So although we'll start learning about recurrent neural networks next week, I have to be open 00:16:13.580 |
to the possibilities that they'll become redundant by the end of the year, but they're still 00:16:24.220 |
So we looked at the SGD intro notebook, but I think things are a little more clear sometimes in a spreadsheet. 00:16:30.660 |
So here is basically the identical thing that we saw in the SGD notebook in Excel. 00:16:41.980 |
We create 29 random numbers, and then we say okay, let's create something that is equal to 2 times that plus 30. 00:17:07.560 |
So I am trying to create something that can find the parameters of a line. 00:17:13.060 |
Now the important thing, and this is the leap, which requires not thinking too hard lest 00:17:22.060 |
you realize how surprising and amazing this is. 00:17:25.440 |
Everything we learn about how to fit a line is identical to how to fit filters and weights 00:17:34.180 |
And so everything we learn about calculating the slope and the intercept, we will then 00:17:43.220 |
And so the answer to any question which is basically "why?" is "why not?". 00:17:50.020 |
This is a function that takes some inputs and calculates an output, this is a function 00:17:53.780 |
that takes some inputs and calculates an output, so why not. 00:17:58.900 |
The only reason it wouldn't work would be because it was too slow, for example. 00:18:02.740 |
And we know it's not too slow because we tried it and it works pretty well. 00:18:06.500 |
So everything we're about to learn works for any kind of function which kind of has the 00:18:15.620 |
appropriate types of gradients, and we can talk more about that later. 00:18:21.100 |
But neural nets have the appropriate kinds of gradients. 00:18:27.460 |
What do we think the parameters of our function are, in this case the intercept and the slope. 00:18:31.340 |
And with Keras, they will be randomized using the Glorot initialization procedure, which 00:18:37.540 |
draws random numbers scaled by the square root of 6 divided by (n_in plus n_out). 00:18:41.940 |
And I'm just going to say let's assume they're both 1. 00:18:47.540 |
We are going to use very, very small mini-batches here. 00:18:51.000 |
Mini-batches are going to be of size 1, because it's easier to do in Excel and it's easier 00:18:57.660 |
But everything we're going to see would work equally well for a mini-batch of size 4 or 00:19:04.460 |
So here's our first row, our first mini-batch. 00:19:07.500 |
Our input is 14 and our desired output is 58. 00:19:11.500 |
And so our guesses to our parameters are 1 and 1. 00:19:14.900 |
And therefore our predicted y value is equal to 1 plus 1 times 14, which is of course 15. 00:19:27.260 |
Therefore if we're doing root mean squared error, our error squared is prediction minus 00:19:35.140 |
So the next thing we do is we want to calculate the derivative with respect to each of our 00:19:42.140 |
One really easy way to do that is to add a tiny amount to each of the two inputs and 00:19:51.860 |
So let's add 0.01 to our intercept and calculate the line and then calculate the loss squared. 00:20:04.420 |
So this is the error if b is increased by 0.01. 00:20:10.180 |
And then let's calculate the difference between that error and the actual error and then divide 00:20:21.140 |
I'm using dE for the error, dB, I should have probably been dL for the loss, dB. 00:20:26.820 |
The change in loss with respect to b is -85.99. 00:20:33.980 |
So we can add 0.01 to a, and then calculate our line, subtract our actual, take the square, 00:20:43.860 |
and so there is our value of estimated dL/dA, subtract it from the actual loss divided by 00:20:54.140 |
And so there are two estimates of the derivative. 00:20:56.420 |
This approach to estimating the derivative is called finite differencing. 00:20:59.900 |
And any time you calculate a derivative by hand, you should always use finite differencing 00:21:07.620 |
You're not very likely to ever have to do that, however, because all of the libraries 00:21:13.900 |
They do them analytically, not using finite differences. 00:21:17.700 |
And so here are the derivatives calculated analytically, which you can do by going to 00:21:23.420 |
Wolfram Alpha and typing in your formula and getting the derivative back. 00:21:27.180 |
So this is the analytical derivative of the loss with respect to b, and the analytical 00:21:34.340 |
And so you can see that our analytical and our finite difference are very similar for 00:21:43.580 |
So that makes me feel comfortable that we got the calculation correct. 00:21:47.100 |
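Here is a minimal sketch of that finite-differencing check for the line-fitting example above: one data point x=14, y=58, guesses a=b=1, and loss = (a*x + b - y)^2.

```python
def loss(a, b, x, y):
    return (a * x + b - y) ** 2

a, b, x, y, eps = 1.0, 1.0, 14.0, 58.0, 0.01

# finite-difference estimates: nudge each parameter by 0.01 and divide
dLdb_fd = (loss(a, b + eps, x, y) - loss(a, b, x, y)) / eps
dLda_fd = (loss(a + eps, b, x, y) - loss(a, b, x, y)) / eps

# analytical derivatives of the same loss
err = a * x + b - y
dLdb, dLda = 2 * err, 2 * err * x

print(dLdb_fd, dLdb)   # both close to -86
print(dLda_fd, dLda)   # both close to -1204
```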
So all SGD does is it says, okay, this tells us if we change our weights by a little bit, 00:21:58.680 |
We know that increasing our value of b by a bit will decrease the loss function, and 00:22:03.780 |
we know that increasing our value of a by a little bit will decrease the loss function. 00:22:09.540 |
So therefore let's decrease both of them by a little bit. 00:22:12.700 |
And the way we do that is to multiply the derivative times a learning rate, that's the 00:22:17.580 |
value of a little bit, and subtract that from our previous guess. 00:22:23.020 |
So we do that for a, and we do that for b, and here are our new guesses. 00:22:27.900 |
Now we're at 1.12 and 1.01, and so let's copy them over here, 1.12 and 1.01. 00:22:38.460 |
And then we do the same thing, and that gives us a new a and a b. 00:22:44.020 |
And we keep doing that again and again and again until we've gone through the whole dataset, 00:22:49.700 |
at the end of which we have a guess of a of 2.61 and a guess of b of 1.07. 00:22:58.340 |
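A minimal sketch of that whole loop in Python rather than Excel, assuming the same setup: mini-batches of size 1, a "true" line y = 2x + 30, and starting guesses of 1 and 1 (the random x values are made up).

```python
import numpy as np

np.random.seed(1)
x = np.random.rand(29) * 20          # 29 random inputs, as in the spreadsheet
y = 2 * x + 30                       # the "true" line: slope 2, intercept 30

a, b, lr = 1.0, 1.0, 0.001           # initial guesses and the learning rate

for epoch in range(5):
    for xi, yi in zip(x, y):         # each row is a mini-batch of size 1
        err = a * xi + b - yi
        a -= lr * (2 * err * xi)     # analytic dL/da
        b -= lr * (2 * err)          # analytic dL/db
    print(epoch, a, b)               # b crawls toward 30 very slowly
```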
Now in real life, we would be having shuffle=true, which means that these would be randomized. 00:23:05.060 |
So this isn't quite perfect, but apart from that, this is SGD with a mini-batch size of 1. 00:23:12.660 |
So at the end of the epoch, we say this is our new slope, so let's copy 2.61 over here, 00:23:27.940 |
So let's copy 1.06 over here, and so now it starts again. 00:23:37.700 |
So we can keep doing that again and again and again. 00:23:40.700 |
Copy the stuff from the bottom, stick it back at the top, and each one of these is going 00:23:45.360 |
So I recorded a macro with me copying this to the bottom and pasting it at the top, and 00:23:50.820 |
added something that says for i = 1 to 5 around it. 00:23:54.540 |
And so now if I click Run, it will copy and paste it 5 times. 00:24:01.860 |
And so you can see it's gradually getting closer. 00:24:04.060 |
And we know that our goal is that it should be a = 2 and b = 30. 00:24:11.860 |
So we've got as far as a = 2.5 and b = 1.3, so they're better than our starting point. 00:24:20.140 |
And you can see our gradually improving loss function. 00:24:28.260 |
Question - Can we still do analytic derivatives when we are using nonlinear activation functions? 00:24:34.300 |
Answer - Yes, we can use analytical derivatives as long as we're using a function that has 00:24:40.300 |
an analytical derivative, which is pretty much every useful function you can think of, 00:24:47.220 |
except ones that have something like an if-then statement in them, because 00:24:51.420 |
it jumps from here to here, but even those you can approximate. 00:24:55.140 |
So a good example would be ReLU, which is max of (0, x) strictly speaking doesn't really 00:25:06.500 |
have a derivative at every point, or at least not a well-defined one, because this is what 00:25:23.100 |
And so its derivative here is 0, and its derivative here is 1. 00:25:36.860 |
But the thing is, mathematicians care about that kind of thing, we don't. 00:25:41.620 |
Like in real life, this is a computer, and computers are never exactly anything. 00:25:47.020 |
We can either assume that it's like an infinitesimal amount to this side, or an infinitesimal amount to the other side. 00:25:52.820 |
So as long as it has a derivative that you can calculate in a meaningful way in practice 00:26:06.660 |
So one thing you might have noticed about this is that it's going to take an awfully long time. 00:26:12.060 |
And so you might think, okay, let's increase the learning rate. 00:26:17.780 |
So let's get rid of one of these zeroes, oh dear, something went crazy. 00:26:24.940 |
I'll tell you what went crazy, our a's and b's started to go out into like 11 million, 00:26:36.260 |
Let's say this was the shape of our loss function, and this was our initial guess. 00:26:43.220 |
And we figured out the derivative is going this way, actually the derivative is positive 00:26:48.220 |
so we want to go the opposite direction, and so we step a little bit over here. 00:26:54.180 |
And then that leads us to here, and we step a little bit further, and this looks good. 00:27:03.420 |
So rather than stepping a little bit, we stepped a long way, and that put us here. 00:27:10.280 |
And then we stepped a long way again, and that put us here. 00:27:15.140 |
If your learning rate is too high, you're going to get worse and worse. 00:27:21.620 |
So getting your learning rate right is critical to getting your thing to train at all. 00:27:30.260 |
Exploding gradients, yeah, or you can even have gradients that do the opposite. 00:27:35.540 |
Exploding gradients are something a little bit different, but it's a similar idea. 00:27:40.320 |
So it looks like 0.001 is the best we can do, and that's a bit sad because this is really slow. 00:27:49.580 |
So one thing we could do is say, well, given that every time we've been -- actually let 00:28:01.080 |
So let's say we had a 3-dimensional set of axes now, and we kind of had a loss function 00:28:12.140 |
And let's say our initial guess was somewhere over here. 00:28:15.580 |
So over here, the gradient is pointing in this direction. 00:28:25.260 |
And then we might make another step which would put us there, and another step that 00:28:30.420 |
And this is actually the most common thing that happens in neural networks. 00:28:35.180 |
Something that's kind of flat in one dimension like this is called a saddle point. 00:28:41.100 |
And it's actually been proved that the vast majority of the space of a loss function in 00:28:46.140 |
a neural network is pretty much all saddle points. 00:28:49.580 |
So when you look at this, it's pretty obvious what should be done, which is if we go to 00:28:59.420 |
here and then we go to here, we can say on average, we're kind of obviously heading in 00:29:06.580 |
Especially when we do it again, we're obviously heading in this direction. 00:29:09.620 |
So let's take the average of how we've been going so far and do a bit of that. 00:29:18.260 |
If ReLU isn't the cost function, why are we concerned with its differentiability? 00:29:27.900 |
We care about the derivative of the output with respect to the inputs. 00:29:34.100 |
The inputs are the filters, and remember the loss function consists of a function of a function of a function. 00:29:41.940 |
So it is categorical cross-entropy loss applied to softmax, applied to ReLU, applied to dense 00:29:53.420 |
layer, applied to max pooling, applied to ReLU, applied to convolutions, etc. 00:29:59.160 |
So in other words, to calculate the derivative of the loss with respect to the inputs, you 00:30:03.660 |
have to calculate the derivative through that whole function. 00:30:09.560 |
Backpropagation is easy to calculate that derivative because we know that from the chain 00:30:14.740 |
rule, the derivative of a function of a function is simply equal to the product of the derivatives 00:30:22.440 |
So in practice, all we do is we calculate the derivative of every layer with respect 00:30:27.060 |
to its inputs, and then we just multiply them all together. 00:30:30.660 |
And so that's why we need to know the derivative of the activation layers as well as the loss 00:30:45.820 |
What we're going to do is we're going to say, every time we take a step, we're going to 00:30:56.900 |
also calculate the average of the last few steps. 00:31:00.920 |
So after these two steps, the average is this direction. 00:31:04.700 |
So the next step, we're going to take our gradient step as usual, and we're going to 00:31:14.700 |
And that means that we end up actually going to here. 00:31:19.300 |
So we find the average of the last few steps, and it's now even further in this direction, 00:31:24.420 |
and so this is the surface of the loss function with respect to some of the parameters, in 00:31:33.980 |
this case just a couple of parameters, it's just an example of what a loss function might 00:31:38.980 |
So this is the loss, and this is some weight number 1, and this is some weight number 2. 00:31:51.660 |
So we're trying to get our little, if you can imagine this is like gravity, we're trying 00:31:55.820 |
to get this little ball to travel down this valley as far down to the bottom as possible. 00:32:00.980 |
And so the trick is that we're going to keep taking a step, not just the gradient step, 00:32:11.660 |
And so in practice, this is going to end up kind of going "donk, donk, donk, donk, donk." 00:32:21.900 |
So to do that in Excel is pretty straightforward. 00:32:26.600 |
To make things simpler, I have removed the finite-differencing-based derivatives here, 00:32:32.580 |
But other than that, this is identical to the previous spreadsheet. 00:32:37.460 |
Same data, same predictions, same derivatives, except we've done one extra thing, which is 00:32:43.900 |
that when we calculate our new B, we say it's our previous B minus our learning rate times, 00:32:52.860 |
and we're not going times our gradient, but times this cell. 00:32:59.220 |
That cell is equal to our gradient times 0.1 plus the thing just above it times 0.9, and 00:33:11.980 |
the thing just above it is equal to its gradient times 0.1 plus the thing just above it times 00:33:20.740 |
So in other words, this column is keeping track of an average derivative of the last 00:33:27.820 |
few steps that we've taken, which is exactly what we want. 00:33:31.620 |
And we do that for both of our two parameters. 00:33:40.540 |
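A minimal sketch of that momentum column in Python: keep a running average equal to 0.1 times this gradient plus 0.9 times the previous average, and step along that average instead of the raw gradient (the data rows are made-up points on y = 2x + 30).

```python
a, b, lr = 1.0, 1.0, 0.001
avg_da = avg_db = 0.0

for xi, yi in [(14.0, 58.0), (10.0, 50.0), (6.0, 42.0)]:   # made-up rows of y = 2x + 30
    err = a * xi + b - yi
    avg_da = 0.1 * (2 * err * xi) + 0.9 * avg_da   # running average of dL/da
    avg_db = 0.1 * (2 * err) + 0.9 * avg_db        # running average of dL/db
    a -= lr * avg_da
    b -= lr * avg_db
```

In Keras this corresponds to something like SGD(lr=0.001, momentum=0.9).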
So in Keras, when you use momentum, you can say momentum = and you say how much momentum 00:33:51.120 |
So you just pick that parameter - whatever you want. 00:33:53.020 |
Just like your learning rate, you pick it, your momentum factor, you pick it. 00:33:59.100 |
And you choose it by trying a few and find out what works best. 00:34:03.360 |
So let's try running this, and you can see it is still not exactly zipping along. 00:34:17.340 |
Well the reason when we look at it is that we know that the constant term needs to get 00:34:22.180 |
all the way up to 30, and it's still way down at 1.5. 00:34:28.800 |
It's not moving fast enough, whereas the slope term moved very quickly to where we want it 00:34:36.740 |
So what we really want is we need different learning rates for different parameters. 00:34:42.620 |
And doing this is called dynamic learning rates. 00:34:45.580 |
And the first really effective dynamic learning rate approaches have just appeared in the 00:34:55.100 |
And one very popular one is called Adagrad, and it's very simple. 00:34:59.860 |
All of these dynamic learning rate approaches have the same insight, which is this. 00:35:05.340 |
If the parameter that I'm changing, if the derivative of that parameter is consistently 00:35:12.660 |
of a very low magnitude, then if the derivative of this mini-batch is higher than that, then 00:35:20.460 |
what I really care about is the relative difference between how much this variable tends to change 00:35:26.520 |
and how much it's going to change this time around. 00:35:30.100 |
So in other words, we don't just care about what's the gradient, but is the magnitude 00:35:35.980 |
of the gradient a lot more or a lot less than it has tended to be recently? 00:35:41.300 |
So the easy way to calculate the overall amount of change of the gradient recently is to keep 00:35:51.380 |
So what we do with Adagrad is you can see at the bottom of my epoch here, I have got 00:36:02.980 |
And then I have taken the square root, so I've got the root of the sum of the squares, and then 00:36:06.880 |
I've just divided it by the count to get the average. 00:36:09.260 |
So this is, roughly, the root mean square of my gradients. 00:36:13.120 |
So this number here will be high if the magnitude of my gradients is high. 00:36:18.260 |
And because it's squared, it will be particularly high if sometimes they're really high. 00:36:24.620 |
So why is it okay to just use a mini-batch since the surface is going to depend on what 00:36:32.860 |
It's not ideal to just use a mini-batch, and we will learn about a better approach to this 00:36:38.120 |
But for now, let's look at this, and in fact, there are two approaches related to Adagrad 00:36:43.680 |
and Adadelta, and one of them actually does this for all of the gradients so far, and 00:36:51.280 |
one of them uses a slightly more sophisticated approach. 00:36:54.640 |
This approach of doing it on a mini-batch-by-mini-batch basis is slightly different again, but it's the same basic idea. 00:37:02.280 |
Does this mean for a CNN, would dynamic learning rates mean that each filter would have its own learning rate? 00:37:16.560 |
It would mean that every parameter has its own learning rate. 00:37:20.320 |
So this is one parameter, that's a parameter, that's a parameter, that's a parameter. 00:37:24.900 |
And then in our dense layer, that's a parameter, that's a parameter, that's a parameter. 00:37:33.900 |
So when you go model.summary in Keras, it shows you for every layer how many parameters there are. 00:37:41.080 |
So anytime you're unclear on how many parameters there are, you can go back and have a look 00:37:44.900 |
at these spreadsheets, and you can also look at the Keras model.summary and make sure you 00:37:53.280 |
So for the first layer, it's going to be the size of your filter times the number of your 00:38:02.840 |
And then after that, the number of parameters will be equal to the size of your filter times 00:38:08.440 |
the number of filters coming in times the number of filters coming out. 00:38:14.700 |
And then of course your dense layers will be every input goes to every output, so number 00:38:19.000 |
of inputs times the number of outputs, a parameter to the function that is calculating whether 00:38:31.480 |
So what we do now is we say this number here, 1857, this is saying that the derivative of 00:38:40.440 |
the loss with respect to the slope varies a lot, whereas the derivative of the loss 00:38:47.040 |
with respect to the intercept doesn't vary much at all. 00:38:50.660 |
So at the end of every epoch, I copy that up to here. 00:38:56.720 |
And then I take my learning rate and I divide it by that. 00:39:01.820 |
And so now for each of my parameters, I now have this adjusted learning rate, which is 00:39:08.240 |
the learning rate divided by the recent root mean square of the gradients. 00:39:15.100 |
And so you can see that now one of my learning rates is 100 times faster than the other one. 00:39:20.860 |
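A hedged sketch of that Adagrad-style adjustment, with made-up per-row gradients standing in for the spreadsheet's columns:

```python
import numpy as np

# Hypothetical per-row gradients from one epoch: a's gradients are much
# bigger in magnitude than b's, as in the spreadsheet.
grads_a = np.array([-1204.0, -900.0, -1050.0])
grads_b = np.array([-86.0, -70.0, -75.0])

rms_a = np.sqrt((grads_a ** 2).mean())   # root of the average squared gradient
rms_b = np.sqrt((grads_b ** 2).mean())

lr = 0.1
lr_a, lr_b = lr / rms_a, lr / rms_b      # per-parameter adjusted learning rates
print(lr_a, lr_b)                        # b gets a much bigger effective rate than a
```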
And so let's see what happens when I run this. 00:39:23.820 |
Question - Is there a relationship with normalizing the input data? 00:39:29.760 |
Answer - No, there's not really a relationship with normalizing the input data because it 00:39:38.480 |
can help, but still if your inputs are very different scales, it's still a lot more work 00:39:50.840 |
So yes it helps, but it doesn't help so much that it makes it useless, and in fact it turns 00:39:55.380 |
out that even with dynamic learning rates, not just normalized inputs, but batch normalized 00:40:07.320 |
And so the thing about when you're using Adagrad or any kind of dynamic learning rates is generally 00:40:12.120 |
you'll set the learning rate quite a lot higher, because remember you're dividing it by this 00:40:16.760 |
So if I set it to 0.1, oh, too far, so that's no good. 00:40:33.080 |
So you can see after just 5 steps, I'm already halfway there. 00:40:37.560 |
Another 5 steps, getting very close, and another 5 steps, and it's exploded. 00:40:49.680 |
Because as we get closer and closer to where we want to be, you can see that you need to 00:41:00.760 |
And by keeping the learning rates the same, it meant that eventually we went too far. 00:41:07.920 |
So this is still something you have to be very careful of. 00:41:14.000 |
A more elegant, in my opinion, approach to the same thing that Adagrad is doing is something called RMSprop. 00:41:21.960 |
And RMSprop was first introduced in Geoffrey Hinton's Coursera course. 00:41:26.840 |
So if you go to the Coursera course in one of those classes he introduces RMSprop. 00:41:39.680 |
So it's quite funny nowadays because this comes up in academic papers a lot. 00:41:43.560 |
When people cite it, they have to cite Coursera course, chapter 6, at minute 14 and 30 seconds. 00:41:50.920 |
But Hinton has said that this is the official way he wants it cited, so there you go. 00:42:00.920 |
What RMSprop does is exactly the same thing as momentum, but instead of keeping track 00:42:06.920 |
of the weighted running average of the gradients, we keep track of the weighted running average of the squares of the gradients. 00:42:17.920 |
Everything here is the same as momentum so far, except that I take my gradient squared, 00:42:27.160 |
multiply it by 0.1, and add it to my previous cell times 0.9. 00:42:35.600 |
So this is keeping track of the recent running average of the squares of the gradients. 00:42:41.920 |
And when I have that, I do exactly the same thing with it that I did in Adagrad, which is divide the learning rate by its square root. 00:42:48.180 |
So I take my previous guess as to b and then I subtract from it my derivative times the 00:42:56.760 |
learning rate divided by the square root of the recent weighted average of the square gradients. 00:43:05.120 |
So it's doing basically the same thing as Adagrad, but in a way that's doing it kind of continuously, as it goes. 00:43:10.400 |
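A minimal sketch of that RMSprop update, reusing the same made-up rows of y = 2x + 30:

```python
import math

a, b, lr = 1.0, 1.0, 0.1
avg_sq_a = avg_sq_b = 1.0                 # running averages of the squared gradients

for xi, yi in [(14.0, 58.0), (10.0, 50.0), (6.0, 42.0)]:   # made-up rows of y = 2x + 30
    err = a * xi + b - yi
    da, db = 2 * err * xi, 2 * err
    avg_sq_a = 0.9 * avg_sq_a + 0.1 * da ** 2   # square of the gradient, not the gradient
    avg_sq_b = 0.9 * avg_sq_b + 0.1 * db ** 2
    a -= lr * da / math.sqrt(avg_sq_a)          # divide the step by the root of that average
    b -= lr * db / math.sqrt(avg_sq_b)
```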
So these are all different types of learning rate optimization? 00:43:16.080 |
These last two are different types of dynamic learning rate approaches. 00:43:24.540 |
If we run it for a few steps, and again you have to guess what learning rate to start with. 00:43:49.720 |
So as you can see, this is going pretty well. 00:43:51.820 |
And I'll show you something really nice about RMSprop, which is what happens as we get very 00:44:03.840 |
And the reason it doesn't explode is because it's recalculating that running average every 00:44:10.880 |
And so rather than waiting until the end of the epoch by which stage it's gone so far 00:44:14.720 |
that it can't come back again, it just jumps a little bit too far and then it recalculates 00:44:23.040 |
So what happens with RMSprop is if your learning rate is too high, then it doesn't explode, 00:44:27.600 |
it just ends up going around the right answer. 00:44:31.360 |
And so when you use RMSprop, as soon as you see your validation scores flatten out, you 00:44:38.320 |
know this is what's going on, and so therefore you should probably divide your learning rate by 10. 00:44:44.640 |
When I'm running Keras stuff, you'll keep seeing me run a few steps, divide the learning 00:44:49.000 |
rate by 10, run a few steps, and you don't see that my loss function explodes, you just 00:44:54.720 |
So do you want your learning rate to get smaller and smaller? 00:45:00.880 |
Your very first learning rate often has to start small, and we'll talk about that in 00:45:05.200 |
a moment, but once you've kind of got started, you generally have to gradually decrease the learning rate. 00:45:12.760 |
And can you repeat what you said earlier that something does the same thing as Adagrad, 00:45:19.120 |
So RMSprop, which we're looking at now, does exactly the same thing as Adagrad, which is 00:45:25.000 |
divide the learning rate by the root sum of squares of the gradients, but rather than doing it 00:45:33.200 |
since the beginning of time, or every minibatch, or epoch, RMSprop does it continuously using 00:45:41.440 |
the same technique that we learned from momentum, which is take the squared of this gradient, 00:45:48.640 |
multiply it by 0.1, and add it to 0.9 times the last calculation. 00:46:00.560 |
It's a weighted moving average, where we're weighting it such that the more recent squared 00:46:09.160 |
I think it's actually an exponentially weighted moving average, to be more precise. 00:46:15.360 |
So there's something pretty obvious we could do here, which is momentum seems like a good 00:46:18.840 |
idea, RMSprop seems like a good idea, why not do both? 00:46:28.140 |
And so Adam was invented last year, 18 months ago, and hopefully one of the things you see 00:46:35.000 |
from these spreadsheets is that these recently invented things are still at the ridiculously 00:46:44.080 |
So the stuff that people are discovering in deep learning is a long, long, long way away 00:46:49.480 |
from being incredibly complex or sophisticated. 00:46:53.180 |
And so hopefully you'll find this very encouraging, which is if you want to play at the state-of-the-art 00:46:58.040 |
of deep learning, that's not at all hard to do. 00:47:02.840 |
So let's look at Adam, which I remember coming out 12-18 months ago, and everybody 00:47:09.080 |
was so excited because suddenly it became so much easier and faster to train neural 00:47:15.440 |
But once I actually tried to create an Excel spreadsheet out of it, I realized, oh my god, 00:47:22.600 |
And so literally all I did was I copied my momentum page and then I copied across my 00:47:30.920 |
So you can see here I have my exponentially weighted moving average of the gradients, 00:47:42.240 |
Here is my exponentially weighted moving average of the squares of the gradients. 00:47:48.040 |
And so then when I calculate my new parameters, I take my old parameter and I subtract not 00:47:58.240 |
my derivative times the learning rate, but my momentum factor. 00:48:03.480 |
So in other words, the recent weighted moving average of the gradients multiplied by the 00:48:10.560 |
learning rate divided by the recent moving average of the squares of the derivatives, 00:48:18.760 |
So it's literally just combining momentum plus RMSprop. 00:48:28.200 |
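A hedged sketch of Adam as exactly that combination: a moving average of the gradients divided by the square root of a moving average of the squared gradients (the published Adam also bias-corrects these averages; that is omitted here to stay close to the spreadsheet, and the data rows are made up).

```python
import math

a, b = 1.0, 1.0
lr, beta, eps = 0.1, 0.9, 1e-8
m_a = m_b = v_a = v_b = 0.0

for xi, yi in [(14.0, 58.0), (10.0, 50.0), (6.0, 42.0)]:   # made-up rows of y = 2x + 30
    err = a * xi + b - yi
    da, db = 2 * err * xi, 2 * err
    m_a = beta * m_a + (1 - beta) * da          # momentum: moving average of gradients
    m_b = beta * m_b + (1 - beta) * db
    v_a = beta * v_a + (1 - beta) * da ** 2     # RMSprop: moving average of squared gradients
    v_b = beta * v_b + (1 - beta) * db ** 2
    a -= lr * m_a / (math.sqrt(v_a) + eps)
    b -= lr * m_b / (math.sqrt(v_b) + eps)
```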
Let's run 5 epochs, and we can use a pretty high learning rate now because it's really 00:48:40.600 |
And so another 5 epochs does exactly the same thing that RMSprop does, which is it goes 00:48:49.240 |
So we need to do the same thing when we use Adam, and Adam is what I use all the time 00:48:55.040 |
I just divide by 10 every time I see it flatten out. 00:49:01.240 |
So a week ago, somebody came out with something that they called not Adam, but Eve. 00:49:08.800 |
And Eve is an addition to Adam which attempts to deal with this learning rate annealing automatically. 00:49:19.280 |
And so all of this is exactly the same as my Adam page. 00:49:25.200 |
But at the bottom, I've added some extra stuff. 00:49:27.680 |
I have kept track of the root mean squared error, this is just my loss function, and then I 00:49:34.800 |
copy across my loss function from my previous epoch and from the epoch before that. 00:49:42.600 |
And what Eve does is it says how much has the loss function changed. 00:49:48.120 |
And so it's got this ratio between the previous loss function and the loss function before 00:49:55.640 |
So you can see it's the absolute value of the last one minus the one before divided 00:50:02.880 |
And what it says is, let's then adjust the learning rate such that instead of just using 00:50:10.760 |
the learning rate that we're given, let's adjust the learning rate that we're given. 00:50:27.800 |
We take the exponentially weighted moving average of these ratios, so you can see another 00:50:34.400 |
of these betas appearing here, so this thing here is equal to our last ratio times 0.9 00:50:50.160 |
And so then for our learning rate, we divide the learning rate from Adam by this. 00:51:00.560 |
So what that says is if the loss function is moving around a lot, if it's very bumpy, 00:51:11.880 |
we should probably decrease the learning rate because it's going all over the place. 00:51:16.800 |
Remember how we saw before, if we've kind of gone past where we want to get to, it just 00:51:22.920 |
On the other hand, if the loss function is staying pretty constant, then we probably 00:51:30.640 |
So that all seems like a good idea, and so again let's try it. 00:51:41.040 |
Not bad, so after 5 epochs it's gone a little bit too far. 00:51:46.280 |
After a week of playing with it, I used this on State Farm a lot during the week, I grabbed 00:51:50.200 |
a Keras implementation which somebody wrote a day after the paper came out. 00:51:56.600 |
The problem is that because it can both decrease and increase the learning rate, sometimes 00:52:04.840 |
as it gets down to the flat bottom point where it's pretty much optimal, it will often be 00:52:11.800 |
the case that the loss gets pretty constant at that point. 00:52:18.740 |
And so therefore, Eve will try to increase the learning rate. 00:52:22.200 |
And so what I tend to find happens that it would very quickly get pretty close to the 00:52:26.640 |
answer, and then suddenly it would jump to somewhere really awful. 00:52:29.480 |
And then it would start to get to the answer again and jump somewhere really awful. 00:52:49.440 |
We have always run for a specific number of epochs. 00:52:53.680 |
We have not defined any kind of stopping criterion. 00:52:59.500 |
It is possible to define such a stopping criterion, but nobody's really come up with one that's 00:53:06.240 |
And the reason why is that when you look at the graph of loss over time, it doesn't tend 00:53:14.120 |
to look like that, but it tends to look like this. 00:53:19.880 |
And so in practice, it's very hard to know when to stop. 00:53:31.080 |
And particularly with a type of architecture called ResNet that we'll look at next week, 00:53:36.520 |
the authors showed that it tends to go like this. 00:53:45.200 |
So in practice, you have to run your training for as long as you have patience for, at whatever 00:53:49.840 |
the best learning rate you can come up with is. 00:53:53.400 |
So something I actually came up with 6 or 12 months ago, but which I got re-interested in 00:54:01.000 |
after I read this Adam paper, is something which dynamically updates learning rates. 00:54:12.380 |
And rather than using the loss function, which as I just said is incredibly bumpy, there's 00:54:17.080 |
something else which is less bumpy, which is the average sum of squared gradients. 00:54:24.400 |
So I actually created a little spreadsheet of my idea, and I hope to prototype it in 00:54:28.520 |
Python maybe this week or the week after, we'll see how it goes. 00:54:32.240 |
And the idea is basically this, keep track of the sum of the squares of the derivatives 00:54:40.840 |
and compare the sum of the squares of the derivatives from the last epoch to the sum 00:54:45.000 |
of the squares of the derivatives of this epoch and look at the ratio of the two. 00:54:53.600 |
If they ever go up by too much, that would strongly suggest that you've kind of jumped 00:55:01.720 |
So anytime they go up too much, you should decrease the learning rate. 00:55:07.320 |
So I literally added two lines of code to my incredibly simple VBA Adam-with-annealing macro. 00:55:15.680 |
If the gradient ratio is greater than 2, so if it doubles, divide the learning rate by 00:55:39.240 |
You can see it's automatically changing it, so I don't have to do anything, I just keep 00:55:47.440 |
So I'm pretty interested in this idea, I think it's going to work super well because it allows 00:55:52.160 |
me to focus on just running stuff without ever worrying about setting learning rates. 00:55:57.780 |
So I'm hopeful that this approach to automatic learning rate annealing is something that 00:56:02.560 |
we can have in our toolbox by the end of this course. 00:56:08.240 |
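A heavily hedged sketch of the idea just described (this is not a published algorithm): compare each epoch's sum of squared gradients to the previous epoch's and shrink the learning rate when it grows too much. The factor of 2 comes from the description above; dividing by 4 is a made-up placeholder, since the exact divisor isn't given in the transcript.

```python
lr = 0.1
prev_sum_sq = None

def end_of_epoch(sum_sq_grads):
    """Call once per epoch with that epoch's sum of squared gradients."""
    global lr, prev_sum_sq
    if prev_sum_sq is not None and sum_sq_grads > 2 * prev_sum_sq:
        lr /= 4        # hypothetical divisor; we've probably jumped too far
    prev_sum_sq = sum_sq_grads
    return lr
```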
One thing that happened to me today is I tried a lot of different learning rates, I didn't 00:56:22.400 |
But I was working with the whole dataset, trying to understand if I tried with the sample 00:56:30.000 |
and I find something, would that apply to the whole dataset or how do I go about investigating 00:56:38.680 |
Was there another question at the back before we answered that one? 00:56:48.720 |
The question was, "It takes a long time to figure out the optimal learning rate. 00:56:58.760 |
And to answer that question, I'm going to show you how I entered State Farm. 00:57:02.720 |
Indeed, when I started entering State Farm, I started by using a sample. 00:57:11.880 |
And so step 1 was to think, "What insights can we gain from using a sample which can 00:57:19.920 |
still apply when we move to the whole dataset?" 00:57:23.040 |
Because running stuff in a sample took 10 or 20 seconds, and running stuff in the full 00:57:33.560 |
So after I created my sample, which I just created randomly, I first of all wanted to 00:57:43.280 |
find out what does it take to create a better-than-random model here. 00:57:51.880 |
So I always start with the simplest possible model. 00:57:55.600 |
And so the simplest possible model has a single dense layer. 00:58:03.960 |
Rather than worrying about calculating the average and the standard deviation of the 00:58:07.080 |
input and subtracting it all out in order to normalize your input layer, you can just use a batch-norm layer as your first layer. 00:58:15.640 |
And so if you start with a batch-norm layer, it's going to do that for you. 00:58:18.700 |
So anytime you create a Keras model from scratch, I would recommend making your first layer a batch-norm layer. 00:58:24.980 |
So this is going to normalize the data for me. 00:58:29.300 |
So that's a cool little trick which I haven't actually seen anybody use elsewhere, but I 00:58:32.800 |
think it's a good default starting point all the time. 00:58:38.440 |
If I'm going to use a dense layer, then obviously I have to flatten everything into a single vector. 00:58:53.800 |
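A minimal sketch of that simplest-possible starting model, assuming Keras 1-style imports and Theano "channels first" image ordering as used in the course (the image size and optimizer are assumptions):

```python
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.layers.normalization import BatchNormalization

model = Sequential([
    BatchNormalization(axis=1, input_shape=(3, 224, 224)),  # normalizes the raw input for us
    Flatten(),                                              # dense layers need a flat vector
    Dense(10, activation='softmax')                         # one output per State Farm class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```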
So I tried fitting it, compiled it, fit it, and nothing happened. 00:59:00.940 |
Not only did nothing happen to my validation, but really nothing happened by training. 00:59:05.960 |
It's only taking 7 seconds per epoch to find this out, so that's okay. 00:59:13.760 |
So I look at model.summary, and I see that there's 1.5 million parameters. 00:59:18.720 |
And that makes me think, okay, it's probably not underfitting. 00:59:21.620 |
It's probably unlikely that with 1.5 million parameters, there's really nothing useful 00:59:27.240 |
It's only a linear model, true, but I still think it should be able to do something. 00:59:31.800 |
So that makes me think that what must be going on is it must be doing that thing where it 00:59:38.800 |
And it's particularly easy to jump too far at the very start of training, and let me 00:59:49.440 |
It turns out that there are often reasonably good answers that are way too easy to find. 00:59:59.920 |
So one reasonably good answer would be always predict 0. 01:00:06.560 |
Because there are 10 output classes in the State Farm competition, there's one of 10 different 01:00:13.000 |
types of distracted driving, and you are scored based on the cross-entropy loss. 01:00:21.600 |
And what that's looking at is how accurate are each of your 10 predictions. 01:00:25.960 |
So rather than trying to predict something well, what if we just always predict 0.01? 01:00:35.120 |
Nine times out of 10, you're going to be right. 01:00:37.440 |
Because 9 out of the 10 categories, it's not that. 01:00:43.160 |
So actually always predicting 0.01 would be pretty good. 01:00:47.500 |
Now it turns out it's not possible to do that because we have a softmax layer. 01:00:51.440 |
And a softmax layer, remember, is e^(x_i) divided by the sum over j of e^(x_j). 01:00:56.660 |
And so in a softmax layer, everything has to add to 1. 01:01:02.240 |
So therefore if it makes one of the classes really high, and all of the other ones really 01:01:08.560 |
low, then 9 times out of 10 it is going to be right, 9 times out of 10. 01:01:14.960 |
So in other words, it's a pretty good answer for it to always predict some random class, 01:01:27.400 |
So anybody who tried this, and I saw a lot of people on the forums this week saying, 01:01:35.360 |
And the folks who got the interesting insight were the ones who then went on to say, "And 01:01:39.520 |
then I looked at my predictions and it kept predicting the same class with great confidence 01:01:49.760 |
Our next step then is to try decreasing the learning rate. 01:02:04.220 |
So here is exactly the same model, but I'm now using a much lower learning rate. 01:02:18.360 |
So it's only 12 seconds of compute time to figure out that I'm going to have to start 01:02:24.600 |
Once we've got to a point where the accuracy is reasonably better than random, we're well 01:02:32.800 |
away from that part of the loss function now that says always predict everything as the 01:02:37.880 |
same class, and therefore we can now increase the learning rate back up again. 01:02:43.100 |
So generally speaking, for these harder problems, you'll need to start at an epoch or two at 01:02:47.480 |
a low learning rate, and then you can increase it back up again. 01:02:53.160 |
So you can see now I can put it back up to 0.01 and very quickly increase my accuracy. 01:03:00.920 |
So you can see here my accuracy on my validation set is 0.5 using a linear model, and this 01:03:07.800 |
is a good starting point because it says to me anytime that my validation accuracy is 01:03:13.200 |
worse than about 0.5, this is really no better than even a linear model, so this is not worth 01:03:21.240 |
One obvious question would be, how do you decide how big a sample to use? 01:03:25.920 |
And what I did was I tried a few different sizes of sample for my validation set, and 01:03:31.040 |
I then said, okay, evaluate the model, in other words, calculate the loss function, on the 01:03:37.680 |
validation set, but for a whole bunch of randomly sampled batches, so do it 10 times. 01:03:46.320 |
And so then I looked and I saw how the accuracy changed. 01:03:49.760 |
With the validation set at 1000 images, my accuracy changed from 0.48 or 0.47 to 0.51, 01:04:01.280 |
It's small enough that I think I can make useful insights using a sample size of this 01:04:18.800 |
One is, are there other architectures that work well? 01:04:22.200 |
So the obvious thing to do with a computer vision problem is to try a convolutional neural 01:04:28.880 |
And here's one of the most simple convolutional neural networks, two convolutional layers, 01:04:37.600 |
And then one dense layer followed by my dense output layer. 01:04:43.640 |
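A hedged sketch of that kind of simple convnet, in Keras 1-style code; the layer sizes here are illustrative rather than the lesson's exact ones:

```python
from keras.models import Sequential
from keras.layers import Flatten, Dense, Convolution2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization

model = Sequential([
    BatchNormalization(axis=1, input_shape=(3, 224, 224)),
    Convolution2D(32, 3, 3, activation='relu'),   # first convolutional layer
    MaxPooling2D(),
    Convolution2D(64, 3, 3, activation='relu'),   # second convolutional layer
    MaxPooling2D(),
    Flatten(),
    Dense(200, activation='relu'),                # the single dense layer
    Dense(10, activation='softmax')               # the dense output layer
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```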
So again I tried that and found that it very quickly got to an accuracy of 100% on the 01:04:51.920 |
training set, but only 24% on the validation set. 01:04:56.240 |
And that's because I was very careful to make sure my validation set included different 01:05:00.760 |
drivers to my training set, because on Kaggle it told us that the test set has different drivers. 01:05:07.600 |
So it's much harder to recognize what a driver is doing if we've never seen that driver before. 01:05:13.360 |
So I could see that convolutional neural networks clearly are a great way to model this kind 01:05:19.600 |
of data, but I'm going to have to think very carefully about overfitting. 01:05:24.200 |
So step 1 to avoiding overfitting is data augmentation, as we learned in our data augmentation class. 01:05:50.880 |
I tried shifting the channels, so the colors a bit. 01:05:55.600 |
And for each of those, I tried four different levels. 01:06:06.200 |
So here are my best data augmentation amounts. 01:06:11.200 |
So on 1560 images, so a very small set, this is just my sample, I then ran my very simple 01:06:18.160 |
two convolutional layer model with this data augmentation at these optimized parameters. 01:06:27.280 |
After 5 epochs, I only had 0.1 accuracy on my validation set. 01:06:32.160 |
But I can see that my training set is continuing to improve. 01:06:36.200 |
And so that makes me think, okay, don't give up yet, try reducing the learning rate and running a few more epochs. 01:06:43.240 |
So this is where you've got to be careful not to jump to conclusions too soon. 01:06:49.400 |
So I ran a few more, and it's improving well. 01:06:57.240 |
It kept getting better and better and better until we were getting 67% accuracy. 01:07:10.880 |
So this 1.15 validation loss is well within the top 50% in this competition. 01:07:19.620 |
So using an incredibly simple model, on just a sample, we can get in the top half of this 01:07:24.840 |
Kaggle competition simply by using the right kind of data augmentation. 01:07:30.360 |
So I think this is a really interesting insight about the power of this incredibly useful 01:07:36.560 |
Okay, let's have a five minute break, and we'll do your question first. 01:07:53.040 |
It's unlikely that there's going to be a class imbalance in my sample unless there was an 01:07:58.800 |
equivalent class imbalance in the real data, because I've got a thousand examples. 01:08:05.000 |
And so statistically speaking, that's unlikely. 01:08:07.800 |
If there was a class imbalance in my original data, then I want my sample to have that class 01:08:14.960 |
So at this point, I felt pretty good that I knew that we should be using a convolutional 01:08:24.160 |
neural network, which is obviously a very strong hypothesis to start with anyway. 01:08:29.920 |
And also I felt pretty confident when you knew what kind of learning rate to start with, 01:08:36.680 |
and then how to change it, and also what data augmentation to do. 01:08:44.240 |
The next thing I wanted to wonder about was how else do I handle overfitting, because 01:08:49.480 |
although I'm getting some pretty good results, I'm still overfitting hugely, 0.6 versus 0.9. 01:08:57.560 |
So the next thing in our list of ways to avoid overfitting, and I hope you guys all remember 01:09:06.680 |
The five steps, let's go and have a look at it now to remind ourselves. 01:09:14.000 |
Approaches to reducing overfitting, these are the five steps. 01:09:18.160 |
We can't add more data, we've tried using data augmentation, we're already using batch 01:09:23.640 |
norm and convnets, so the next step is to add regularization. 01:09:28.600 |
And dropout is our favored regularization technique. 01:09:32.180 |
So I was thinking, okay, before we do that, I'll just mention one more thing about this 01:09:42.200 |
I have literally never seen anybody write down a process as to how to figure out what 01:09:50.640 |
kind of data augmentation to use and the amount. 01:09:53.900 |
The only posts I've seen on it always rely on intuition, which is basically like, look 01:10:00.680 |
at the images and think about how much they seem like they should be able to move around 01:10:06.240 |
I really tried this week to come up with a rigorous, repeatable process that you could 01:10:15.560 |
And that process is go through each data augmentation type one at a time, try 3 or 4 different levels 01:10:21.420 |
of it on a sample with a big enough validation set that it's pretty stable to find the best 01:10:29.560 |
value of each of the data augmentation parameters, and then try combining them all together. 01:10:40.640 |
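A hedged sketch of that process in Keras; the directory path and the specific candidate values are placeholders, not the lesson's tuned amounts:

```python
from keras.preprocessing.image import ImageDataGenerator

# One augmentation type at a time, a few levels each.
for width_shift in [0.0, 0.05, 0.1, 0.2]:
    gen = ImageDataGenerator(width_shift_range=width_shift)
    batches = gen.flow_from_directory('sample/train', target_size=(224, 224),
                                      batch_size=64)
    # ...fit the simple convnet on `batches` for a few epochs, record the
    # validation accuracy, and keep the best width_shift...

# Then repeat for rotation_range, height_shift_range, shear_range, zoom_range
# and channel_shift_range, and finally combine the winning values together.
```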
So I hope you kind of come away with this as a practical message which probably your 01:10:49.920 |
colleagues, even if some of them claim to be deep learning experts, I doubt that they're 01:10:54.800 |
So this is something you can hopefully get people into the practice of doing. 01:11:00.800 |
Regularization however, we cannot do on a sample. 01:11:04.600 |
And the reason why is that step 1, add more data, well that step is very correlated with 01:11:16.120 |
As we add more data, we need less regularization. 01:11:20.000 |
So as we move from a sample to the full dataset, we're going to need less regularization. 01:11:25.920 |
So to figure out how much regularization to use, we have to use the whole dataset. 01:11:30.360 |
So at this point I changed it to use the whole dataset, not the sample, and I started using 01:11:39.360 |
So you can see that I started with my data augmentation amounts that you've already 01:11:49.200 |
And ran it for a few epochs to see what would happen. 01:11:57.220 |
So we're getting up into the 75% now, and before we were in the 64%. 01:12:02.360 |
So once we add clipping, which is very important for getting the best cross-entropy loss function, 01:12:10.800 |
I haven't checked where that would get us on the Kaggle leaderboard, but I'm pretty 01:12:16.560 |
sure it would be at least in the top third based on this accuracy. 01:12:21.640 |
So I ran a few more epochs with an even lower learning rate and got 0.78, 0.79. 01:12:35.000 |
So this is going to be well up into the top third, maybe even the top third of the leaderboard. 01:12:43.280 |
So I got to this point by just trying out a couple of different levels of Dropout, and 01:12:54.200 |
A lot of people put small amounts of Dropout in their convolutional layers as well. 01:13:03.160 |
But what VGG does is to put 50% Dropout after each of its dense layers, and that doesn't 01:13:09.680 |
seem like a bad rule of thumb, so that's what I was doing here. 01:13:12.440 |
And then trying around a few different sizes of dense layers to try and find something 01:13:17.320 |
I didn't spend a heap of time on this, so there's probably better architectures, but 01:13:21.160 |
as you can see this is still a pretty good one. 01:13:26.040 |
Now so far we have not used a pre-trained network at all. 01:13:35.080 |
So this is getting into the top third of the leaderboard without even using any ImageNet features. 01:13:44.120 |
But we're pretty sure that ImageNet features would be helpful. 01:13:47.640 |
So that was the next step, was to use ImageNet features, so VGG features. 01:13:52.560 |
Specifically, I was reasonably confident that all of the convolutional layers of VGG are probably fine as they are. 01:14:01.520 |
I didn't expect I would have to fine-tune them much, if at all, because the convolutional 01:14:06.120 |
layers are the things which really look at the shape and structure of things, rather than how those shapes get combined for our particular labels. 01:14:12.960 |
And these are photos of the real world, just like ImageNet are photos of the real world. 01:14:18.580 |
So I really felt like most of the time, if not all of it, was likely to be spent on the dense layers. 01:14:25.200 |
So therefore, because calculating the convolutional layers takes nearly all the time, because 01:14:30.560 |
that's where all the computation is, I pre-computed the output of the convolutional layers. 01:14:37.240 |
And we've done this before, you might remember. 01:14:41.020 |
When we looked at dropout, we did exactly this. 01:14:50.840 |
We figured out what was the last convolutional layer's ID. 01:14:55.240 |
We grabbed all of the layers up to that ID, we built a model out of them, and then we predicted the output of that model on our data. 01:15:05.640 |
And that told us the value of those features, those activations, from VGG's last convolutional layer. 01:15:18.540 |
So I said okay, grab VGG 16, find the last convolutional layer, build a model that contains 01:15:24.000 |
everything up to and including that layer, predict the output of that model. 01:15:34.320 |
So predicting the output means calculate the activations of that last convolutional layer. 01:15:42.080 |
And since that takes some time, then save that so I never have to do it again. 01:15:49.640 |
So then in the future I can just load that array. 01:15:54.920 |
So this array, I'm not going to calculate those, I'm simply going to load them. 01:16:07.840 |
And so have a think about what would you expect the shape of this to be. 01:16:13.560 |
And you can figure out what you would expect the shape to be by looking at model.summary(). 01:16:36.440 |
We find that conv_val_feat.shape is 512x14x14, as expected. 01:16:54.080 |
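Here is a minimal sketch of that pre-computation, written against the modern tf.keras API rather than the older Keras version used in the lecture, so the layer name and the channels-last shape differ from the 512x14x14 shown in the notebook; trn_data and val_data are assumed to be the pre-loaded image arrays.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

vgg = VGG16(include_top=False, input_shape=(224, 224, 3))
# The last conv layer, before the final max-pooling, is 'block5_conv3' here.
conv_model = Model(vgg.input, vgg.get_layer('block5_conv3').output)

conv_feat     = conv_model.predict(trn_data, batch_size=64)   # training images
conv_val_feat = conv_model.predict(val_data, batch_size=64)   # validation images

np.save('conv_feat.npy', conv_feat)          # cache so we never compute it again
np.save('conv_val_feat.npy', conv_val_feat)

conv_val_feat = np.load('conv_val_feat.npy')
conv_val_feat.shape   # (n_val, 14, 14, 512) channels-last; the lecture's Theano backend shows 512x14x14
```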
Is there a reason you chose to leave out the max_pooling and flatten layers? 01:17:00.560 |
So why did I leave out the max_pooling and flatten layers? 01:17:05.440 |
Probably because it takes zero time to calculate them and the max_pooling layer loses information. 01:17:13.560 |
So I thought, given that I might want to play around with other types of pooling or other 01:17:19.360 |
types of convolutions or whatever, pre-calculating up to this layer was the safest choice. 01:17:28.360 |
Having said that, the first thing I did with it in my new model was to max-pool it and flatten it. 01:17:40.640 |
So now that I have the output of VGG for the last conv layer, I can now build a model that starts from there. 01:17:51.500 |
And so the input to this model will be the output of those conv layers. 01:17:54.740 |
And the nice thing is it won't take long to run this, even on the whole dataset, because 01:17:58.240 |
the dense layers don't take much computation time. 01:18:01.560 |
So here's my model, and by making p a parameter, I could try a wide range of dropout amounts, 01:18:11.160 |
and I fit it, and one epoch takes 5 seconds on the entire dataset. 01:18:20.360 |
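Here is a sketch of what that dense model could look like; the layer sizes are illustrative rather than the exact ones from the notebook, the dropout rate p is the parameter being varied, and conv_feat, conv_val_feat, trn_labels and val_labels are the cached features and one-hot labels from above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (MaxPooling2D, Flatten, Dense, Dropout,
                                     BatchNormalization)

def get_bn_model(p):
    return Sequential([
        MaxPooling2D(input_shape=conv_feat.shape[1:]),  # pool the cached conv output
        Flatten(),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(10, activation='softmax'),                # 10 State Farm classes
    ])

bn_model = get_bn_model(p=0.6)
bn_model.compile(optimizer='adam', loss='categorical_crossentropy',
                 metrics=['accuracy'])
bn_model.fit(conv_feat, trn_labels, epochs=3, batch_size=64,
             validation_data=(conv_val_feat, val_labels))
```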
And you can see 1 epoch gets me 0.65, 3 epochs get me 0.75. 01:18:36.280 |
I have something that in 15 seconds can get me 0.75 accuracy. 01:18:39.960 |
And notice here, I'm not using any data augmentation. 01:18:47.140 |
Because you can't pre-compute the output of convolutional layers if you're using data 01:18:52.520 |
Because with data augmentation, your convolutional layers give you a different output every time. 01:19:05.040 |
You can't use data augmentation if you are pre-computing the output of a layer. 01:19:10.840 |
Because think about it, every time it sees the same cat photo, it's rotating it by a 01:19:17.520 |
different amount, or moving it by a different amount. 01:19:22.240 |
So it gives a different output of the convolutional layer, so you can't pre-compute it. 01:19:31.520 |
There is something you can do, which I've played with a little bit, which is you could 01:19:35.680 |
pre-compute something that's 10 times bigger than your dataset, consisting of 10 different 01:19:43.720 |
data-augmented versions of it, which is why I actually had this -- where is it? 01:19:53.400 |
Which is what I was doing here when I brought in this data generator with augmentations, 01:19:57.520 |
and I created something called data-augmented convolutional features, in which I predicted 01:20:03.960 |
5 times the amount of data, or calculated 5 times the amount of data. 01:20:09.800 |
And so that basically gave me a dataset 5 times bigger, and that actually worked pretty well. 01:20:16.260 |
It's not as good as having a whole new sample every time, but it's kind of a compromise. 01:20:22.200 |
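A sketch of that compromise, with illustrative generator settings and placeholder paths: run the augmenting generator over the training set a few times and cache the conv features from each augmented pass, using the conv_model defined earlier.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

aug_gen = ImageDataGenerator(rotation_range=10, width_shift_range=0.05,
                             height_shift_range=0.05, shear_range=0.1,
                             zoom_range=0.05)
batches = aug_gen.flow_from_directory('data/statefarm/train',
                                      target_size=(224, 224), batch_size=64,
                                      shuffle=False, class_mode='categorical')

da_feats, da_labels = [], []
for _ in range(5):                          # 5 augmented copies of the training set
    for _ in range(len(batches)):           # one full pass over the data
        x, y = next(batches)
        da_feats.append(conv_model.predict(x, verbose=0))
        da_labels.append(y)

da_conv_feat = np.concatenate(da_feats)     # 5x as many cached feature maps
da_labels    = np.concatenate(da_labels)
```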
So once I played around with these dense layers, I then did some more fine-tuning. 01:20:30.420 |
Basically, at this point I tried saying, okay, let's go through 01:20:36.400 |
all of my layers in my model from 16 onwards and set them to trainable and see what happens. 01:20:44.120 |
So I tried retraining, fine-tuning some of the convolutional layers as well. 01:20:50.400 |
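That experiment looked something like this; a sketch where 'model' is assumed to be the full VGG-plus-dense-layers model, and batches and val_batches are the usual image generators.

```python
from tensorflow.keras.optimizers import Adam

for layer in model.layers[16:]:      # everything from layer 16 onwards
    layer.trainable = True

model.compile(optimizer=Adam(learning_rate=1e-5),   # very low learning rate for fine-tuning
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(batches, validation_data=val_batches, epochs=2)
```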
So I experimented with my hypothesis, and I found it was correct, which is it seems 01:20:54.960 |
that for this particular model, coming up with the right set of dense layers is what matters most. 01:21:04.400 |
If we want rotational invariance, should we keep the max pooling, or can another layer 01:21:15.120 |
Max pooling doesn't really have anything to do with rotational invariance. 01:21:27.280 |
So I'm going to show you one more cool trick. 01:21:29.200 |
I'm going to show you a little bit of State Farm every week from now on because there's 01:21:33.200 |
so many cool things to try, and I want to keep reviewing CNNs because convolutional 01:21:38.560 |
neural nets really are becoming what deep learning is all about. 01:21:47.640 |
The two tricks are called pseudo-labeling and knowledge distillation. 01:21:53.680 |
So if you Google for pseudo-labeling semi-supervised learning, you can see the original paper that introduced it. 01:22:13.160 |
This is a Geoffrey Hinton paper, Distilling the Knowledge in a Neural Network. 01:22:20.040 |
So these are a couple of really cool techniques which come from Hinton and Jeff Dean, which is not a bad pedigree. 01:22:34.680 |
What we're going to do is we are going to use the test set to give us more information. 01:22:40.400 |
Because in State Farm, the test set has 80,000 images in it, and the training set has 20,000 01:22:54.920 |
What could we do with those 80,000 images which we don't have labels for? 01:23:03.880 |
It seems like we should be able to do something with them, and there's a great little picture that shows the idea. 01:23:07.880 |
Imagine we only had two points, and we knew their labels, white and black. 01:23:14.400 |
And then somebody said, "How would you label this?" 01:23:17.960 |
And then they told you that there's a whole lot of other unlabeled data, and suddenly you'd answer differently. 01:23:29.720 |
It's helped us because it's told us how the data is structured. 01:23:34.840 |
This is what semi-supervised learning is all about. 01:23:36.920 |
It's all about using the unlabeled data to try and understand something about the structure 01:23:41.120 |
of it and use that to help you, just like in this picture. 01:23:47.680 |
Pseudo-labeling and knowledge distillation are a way to do this. 01:23:51.920 |
And what we do is -- and I'm not going to do it on the test set, I'm going to do it 01:23:55.880 |
on the validation set because it's a little bit easier to see the impact of it, and maybe 01:24:00.920 |
next week we'll look at the test set, because that's going to be much cooler when we have all 80,000 unlabeled images to use. 01:24:07.720 |
What we do is we take our model, some model we've already built, and we predict the outputs for the unlabeled data. 01:24:17.600 |
In this case, I'm using the validation set, as if it was unlabeled. 01:24:26.440 |
So now that we have predictions for the test set or the validation set, we can treat them 01:24:35.520 |
as labels; they're not correct labels, but they're labels nonetheless. 01:24:40.080 |
So what we then do is we take our training labels and we concatenate them with our validation pseudo-labels. 01:24:49.800 |
And so we now have a bunch of labels for all of our data. 01:24:53.840 |
And so we can now also concatenate our convolutional features with the convolutional features of the validation set. 01:25:09.420 |
So the model we use is exactly the same model we had before, and we train it in exactly the same way, and we get a better result. 01:25:27.240 |
And the reason why is just because we used this additional unlabeled data to try to figure out the structure of the data. 01:25:39.840 |
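In code, the whole pseudo-labeling step is only a few lines; this sketch reuses the bn_model and cached feature arrays from the earlier sketches.

```python
import numpy as np

# Predict 'labels' for the validation set, treating it as if it were unlabeled.
val_pseudo = bn_model.predict(conv_val_feat, batch_size=64)

# Concatenate the real training labels with the pseudo-labels, and the training
# conv features with the validation conv features.
comb_feat   = np.concatenate([conv_feat, conv_val_feat])
comb_labels = np.concatenate([trn_labels, val_pseudo])

# Exactly the same model, trained in exactly the same way.
bn_model.fit(comb_feat, comb_labels, epochs=2, batch_size=64,
             validation_data=(conv_val_feat, val_labels))
```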
How do you learn how to design a model and when to stop messing with them? 01:25:43.240 |
It seems like you've taken a few initial ideas, tweaked them to get higher accuracy, but unless 01:25:48.120 |
your initial guesses are amazing, there should be plenty of architectures that would also work well. 01:25:53.880 |
So if and when you figure out how to find an architecture and stop messing with it, please let us know. 01:26:08.680 |
I look back at these models I'm showing you and I'm thinking, I bet there's something better. 01:26:19.360 |
There are all kinds of ways of optimizing other hyperparameters of deep learning. 01:26:26.020 |
For example, there's something called Spearmint, which is a Bayesian optimization hyperparameter tuning package. 01:26:39.360 |
In fact, just last week a new paper came out for hyperparameter tuning, but this is all 01:26:43.840 |
about tuning things like the learning rate and stuff like that. 01:26:50.280 |
Coming up with architectures, there are some people who have tried to come up with some 01:27:03.840 |
kind of more general architectures, and we're going to look at one next week called ResNets, 01:27:09.840 |
which seem to be pretty encouraging in that direction, but even then, ResNet, which we're 01:27:19.360 |
going to learn about next week, is an architecture which won ImageNet in 2015. 01:27:27.400 |
The author of ResNet, Kaiming He from Microsoft, said, "The reason ResNet is so great is it 01:27:35.960 |
lets us build very, very, very deep networks." 01:27:40.080 |
Indeed he showed a network with over a thousand layers, and it was totally state-of-the-art. 01:27:45.680 |
Somebody else came along a few months ago and built wide ResNets with like 50 layers, and they worked even better. 01:27:57.880 |
So the very author of the ImageNet winner completely got wrong the reason why his invention works so well. 01:28:06.000 |
The idea that any of us have any idea how to create optimal architectures is just totally wrong. 01:28:12.960 |
So that's why I'm trying to show you what we know so far, which is like the processes 01:28:18.000 |
you can use to build them without waiting forever. 01:28:21.320 |
So in this case, doing your data augmentation on the small sample in a rigorous way, figuring 01:28:27.000 |
out that probably the dense layers are where the action is at, and pre-computing the input to them. 01:28:32.560 |
These are the kinds of things that can keep you sane. 01:28:36.240 |
I'm showing you the outcome of my last week's playing around with this. 01:28:41.500 |
I can tell you that during this time I continually fell into the trap of running stuff on the 01:28:47.520 |
whole network and all the way through and fiddling around with hyperparameters. 01:28:53.240 |
And I have to stop myself and have a cup of tea and say, "Okay, is this really a good use of my time?" 01:28:59.080 |
So we all do it, but not you anymore because you've been to this class. 01:29:13.080 |
I'm just a little confused because it feels like maybe we're using our validation set 01:29:18.440 |
as part of our training program and I'm confused how it's not true. 01:29:22.000 |
But look, we're not using the validation labels, nowhere here does it say "val_labels". 01:29:30.160 |
So yeah, we are absolutely using our validation set but we're using the validation set's inputs. 01:29:41.320 |
So next week I will show you this page again, and this time I'm going to use the test set. 01:29:46.680 |
I just didn't have enough time to do it this time around. 01:29:49.560 |
And hopefully we're going to see some great results, and when we do it on the test set 01:29:52.840 |
then you'll be really convinced that it's not using the labels, because we don't have the test labels at all. 01:29:56.720 |
But you can see here, all it's doing is it's creating pseudo-labels by calculating what 01:30:02.280 |
it thinks it ought to be based on the model that we just built with that 75% accuracy. 01:30:10.520 |
And so then it's able to use the input data for the validation set in an intelligent way 01:30:31.480 |
Yeah, it's using bn_model, and bn_model is the thing that we just fitted. 01:30:46.120 |
By using the training labels, so this is bn_model, the thing with this 0.755 accuracy. 01:30:52.520 |
So if we were to look at - I know we haven't gone through this - can you move a bit closer to the microphone? 01:30:59.280 |
And this is supervised and unsupervised learning? 01:31:04.340 |
Right, and semi-supervised works because you're giving it a model which already knows about 01:31:09.360 |
a bunch of labels but unsupervised wouldn't know. 01:31:17.520 |
I wasn't particularly thinking about doing this, but unsupervised learning is where you're 01:31:23.040 |
trying to build a model when you have no labels at all. 01:31:27.320 |
How many people here would be interested in hearing about unsupervised learning during 01:31:32.240 |
Okay, enough people, I should do that, I will add it. 01:31:42.240 |
During the week, perhaps we can create a forum thread about unsupervised learning and I can 01:31:46.520 |
learn about what you're interested in doing with it because many things that people think 01:31:55.120 |
Okay, so pseudo-labeling is insane and awesome, and we need the green box back. 01:32:11.640 |
Earlier you talked about learning about the structure of the data that you can learn from 01:32:14.800 |
the validation set, can you say more about that? 01:32:20.640 |
Other than that picture I showed you before with the two little spirally things. 01:32:25.520 |
And that picture was kind of showing how the points clustered in a way that you couldn't 01:32:29.120 |
see from the two labeled points alone. 01:32:31.420 |
So think about that Matt Zeiler paper we saw, or the Jason Yosinski visualization tool we looked at. 01:32:38.500 |
The layers learn shapes and textures and concepts. 01:32:46.300 |
In those 80,000 test images of people driving in different distracted ways, there are lots 01:32:52.960 |
of concepts to learn about the ways in which people drive while distracted, even though we don't have labels for them. 01:33:00.080 |
So what we're doing is we're trying to learn better convolutional or dense features, and that's what the unlabeled data helps with. 01:33:10.660 |
So the structure of the data here is basically like what do these pictures tend to look like. 01:33:16.400 |
More importantly, in what ways do they differ? 01:33:19.360 |
Because it's the ways that they differ that therefore must be related to how they're labeled. 01:33:25.920 |
Can you use your updated model to make new labels for the validation set? 01:33:32.520 |
Yes, you can absolutely do pseudo-labeling on pseudo-labeling, and you should. 01:33:38.200 |
And if I don't get sick of running this code, I will try it next week. 01:33:44.600 |
Could that introduce bias towards your validation set? 01:33:49.480 |
No because we don't have any validation labels. 01:33:53.080 |
One of the tricky parameters in pseudo-labeling is, in each batch, how much of it do I make pseudo-labeled versus real labeled data. 01:34:05.120 |
One of the big things that stopped me from getting the test set in this week is that 01:34:10.440 |
Keras doesn't have a way of creating batches which have like 80% of this set and 20% of 01:34:19.040 |
that set, which is really what I want -- because if I just pseudo-labeled the whole test set 01:34:24.440 |
and then concatenated it, then 80% of my batches are going to be pseudo-labels. 01:34:31.080 |
And generally speaking, the rule of thumb I've read is that somewhere around a quarter 01:34:35.240 |
to a third of your mini-batches should be pseudo-labels. 01:34:39.040 |
So I need to write some code basically to get Keras to generate batches which are a 01:34:45.640 |
mix from two different places before I can do this properly. 01:34:49.440 |
There are two questions and I think you're asking the same thing. 01:34:53.240 |
Are your pseudo-labels only as good as the initial model you're beginning from, so do 01:34:57.320 |
you need to have kind of a particular accuracy in your model? 01:35:00.880 |
Yeah, your pseudo-labels are indeed as good as your model you're starting from. 01:35:05.880 |
People have not studied this enough to know how sensitive it is to those initial labels. 01:35:13.080 |
No, this is too new, you know, and just try it. 01:35:25.360 |
My guess is that pseudo-labels will be useful regardless of what accuracy level you're at 01:35:32.280 |
As long as you are in a semi-supervised learning context, i.e. you have a lot of unlabeled data. 01:35:40.560 |
I really want to move on because I told you I wanted to get us down the path to NLP this week. 01:35:49.040 |
And it turns out that the path to NLP, strange as it sounds, starts with collaborative filtering. 01:35:58.380 |
This week we are going to learn about collaborative filtering. 01:36:01.740 |
And so collaborative filtering is a way of doing recommender systems. 01:36:06.560 |
And I sent you guys an email today with a link to more information about collaborative 01:36:11.300 |
filtering and recommender systems, so please read those links if you haven't already just 01:36:17.720 |
to get a sense of what the problem we're solving here is. 01:36:22.520 |
In short, what we're trying to do is to learn to predict who is going to like what, and how much. 01:36:36.200 |
For example, in the $1 million Netflix prize: what rating will this person give this movie? 01:36:46.880 |
If you're writing Amazon's recommender system to figure out what to show you on their homepage, 01:36:52.080 |
which products is this person likely to rate highly? 01:36:57.960 |
If you're trying to figure out what to show in a news feed, which articles is this person most likely to be interested in? 01:37:06.840 |
There's a lot of different ways of doing this, but broadly speaking there are two main classes of approach. 01:37:13.000 |
One is based on metadata, which is, for example, this person filled out a survey in which they said which genres they like. 01:37:23.200 |
And we have also taken all of our movies and put them into genres, so here are all the ones in the genres this person likes. 01:37:31.880 |
Broadly speaking, that would be a metadata-based approach. 01:37:36.000 |
A collaborative filtering-based approach is very different. 01:37:39.040 |
It says, "Let's find other people like you and find out what they liked and assume that 01:37:49.520 |
And specifically when we say people like you, we mean people who rated the same movies you've 01:37:54.840 |
watched in a similar way, and that's called collaborative filtering. 01:38:00.060 |
It turns out that in a large enough dataset, collaborative filtering is so much better 01:38:05.680 |
than the metadata-based approaches that adding metadata doesn't even improve it at all. 01:38:11.280 |
So when people in the Netflix prize actually went out to IMDB and sucked in additional 01:38:17.280 |
data and tried to use that to make it better, at a certain point it didn't help. 01:38:23.860 |
Once their collaborative filtering models were good enough, it didn't help. 01:38:26.440 |
And that's because of something I learned about 20 years ago, when I used to do a lot of this kind of work. 01:38:31.720 |
It turns out that asking people about their behavior is crap compared to actually looking 01:38:38.960 |
So let me show you what collaborative filtering looks like. 01:38:42.040 |
What we're going to do is use a dataset called MovieLens. 01:38:45.560 |
So you guys hopefully will be able to play around with this this week. 01:38:50.440 |
Unfortunately Rachel and I could not find any Kaggle competitions that were about recommender 01:38:56.280 |
systems and where the competitions were still open for entries. 01:39:00.120 |
However, there is something called MovieLens which is a widely studied dataset in academia. 01:39:14.520 |
Perhaps surprisingly, approaching or beating an academic state of the art is way easier 01:39:20.960 |
than winning a Kaggle competition, because in Kaggle competitions lots and lots and lots 01:39:25.120 |
of people look at that data and they try lots and lots and lots of things and they use a 01:39:28.840 |
really pragmatic approach, whereas academics state of the arts are done by academics. 01:39:35.440 |
So with that said, the MovieLens benchmarks are going to be much easier to beat than any 01:39:42.060 |
Kaggle competition, but it's still interesting. 01:39:46.520 |
So you can download the MovieLens dataset from the MovieLens website, and you'll 01:39:52.120 |
see that there's one there recommended for new research with 20 million ratings in it. 01:39:57.920 |
Also conveniently, they have a small one with only 100,000 ratings. 01:40:01.480 |
So you don't have to build a sample, they have already built a sample for you. 01:40:14.280 |
And as you'll see here, I've started using pandas; pd is the standard abbreviation for pandas. 01:40:23.520 |
So for those of you that don't use pandas yet, hopefully the peer group pressure is kicking in. 01:40:27.440 |
So pandas is a great way of dealing with structured data and you should use it. 01:40:32.040 |
Reading a CSV file is this easy, showing the first few items is this easy, finding out 01:40:37.640 |
how big it is, finding out how many users and movies there are, are all this easy. 01:40:46.360 |
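For reference, those pandas steps look something like this; the path is a placeholder for wherever you unzip the small MovieLens download.

```python
import pandas as pd

ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
ratings.head()                                         # first few rows: userId, movieId, rating, timestamp
len(ratings)                                           # how many ratings
ratings.userId.nunique(), ratings.movieId.nunique()    # how many users and movies
```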
I wanted to play with this in Excel, because that's the only way I know how to teach. 01:40:52.160 |
What I did was I took the ratings, grabbed the 15 busiest movie-watching 01:41:03.120 |
users, and then I grabbed the 15 most watched movies, and then I created a cross-tab of their ratings. 01:41:16.320 |
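A sketch of how you could build that same cross-tab in pandas before pasting it into Excel, reusing the ratings DataFrame from above:

```python
top_users  = ratings.userId.value_counts().index[:15]    # 15 busiest raters
top_movies = ratings.movieId.value_counts().index[:15]   # 15 most-rated movies

subset = ratings[ratings.userId.isin(top_users) & ratings.movieId.isin(top_movies)]
crosstab = pd.crosstab(subset.userId, subset.movieId,
                       values=subset.rating, aggfunc='mean')
```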
Here is the resulting table, built from the MovieLens data, for the 15 busiest movie-watching users and the 15 most watched movies. 01:41:36.440 |
These three users have watched every single one of these movies. 01:41:45.640 |
And these have been watched by every single one of these users. 01:41:49.660 |
So user 14 kind of liked movie 27, loved movie 49, hated movie 51. 01:41:58.080 |
So let's have a look, is there anybody else here with similar tastes? 01:42:05.200 |
So this guy really liked movie 49 and didn't much like movie 57, so they may feel the same way about other movies as user 14 does. 01:42:13.800 |
That's the basic essence of collaborative filtering. 01:42:15.680 |
We're going to try and automate it a little bit. 01:42:18.560 |
And the way we're going to automate it is we're going to say let's pretend for each 01:42:21.800 |
movie we had like five characteristics, which is like is it sci-fi, is it action, is it 01:42:29.480 |
dialogue-heavy, is it new, and does it have Bruce Willis. 01:42:38.360 |
And then we could have those five things for every user as well, which is this user somebody 01:42:52.160 |
who likes sci-fi, action, dialogue, new movies, and Bruce Willis. 01:42:56.960 |
And so what we could then do is take the matrix product, or dot product, of that set of 01:43:07.280 |
user features with that set of movie features. 01:43:11.400 |
If this person likes sci-fi and it's sci-fi and they like action and it is action and 01:43:15.160 |
so forth, then a high number will appear in here for this matrix product of these two 01:43:20.680 |
vectors, this dot product of these two vectors. 01:43:25.640 |
And so this would be a cool way to build up a collaborative filtering system if only we 01:43:33.040 |
could create these five items for every movie and for every user. 01:43:40.940 |
Now because we don't actually know what five things are most important for users and what 01:43:45.680 |
five things are most important for movies, we're instead going to learn them. 01:43:50.560 |
And the way we learn them is the way we learn everything, which is we start by randomizing them. 01:43:59.380 |
So here are five random numbers for every movie, and here are five random numbers for 01:44:06.360 |
every user, and in the middle is the dot product of that movie with that user. 01:44:14.800 |
Once we have a good set of movie factors and user factors for each one, then each of these 01:44:22.120 |
ratings will be similar to each of the observed ratings, and therefore this sum of squared errors will be low. 01:44:36.840 |
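Here is a tiny numpy version of what the spreadsheet is doing; the names are illustrative, and 'crosstab' is assumed to be the 15x15 grid of observed ratings built above, with NaN where a user hasn't rated a movie.

```python
import numpy as np

n_users, n_movies, n_factors = 15, 15, 5
user_factors  = np.random.rand(n_users,  n_factors)    # random numbers to start with
movie_factors = np.random.rand(n_movies, n_factors)

pred   = user_factors @ movie_factors.T                # (15, 15) predicted ratings grid
actual = crosstab.values                               # observed ratings, NaN if unrated

mask = ~np.isnan(actual)                               # only score the observed cells
loss = np.sum((pred[mask] - actual[mask]) ** 2)        # the sum of squared errors
```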
So we start with our random numbers, we start with a loss function of 40. 01:44:43.640 |
So we now want to use gradient descent, and it turns out that every copy of Excel has a gradient descent optimizer built in. 01:44:51.680 |
So we're going to go ahead and use it, it's called solver. 01:44:55.380 |
And so we have to tell it what thing to minimize, so it's saying minimize this, and which things 01:45:01.720 |
do we want to change, which is all of our factors, and then we set it to minimize and click Solve. 01:45:09.500 |
And then we can see in the bottom left, it is trying to make this better and better. 01:45:16.800 |
Notice I'm not saying stochastic gradient descent. 01:45:20.300 |
Stochastic gradient descent means it's doing it mini-batch at a mini-batch time. 01:45:24.620 |
Gradient descent means it's doing the whole data set each time. 01:45:28.040 |
Excel uses gradient descent, not stochastic gradient descent. 01:45:37.000 |
It's so slow because it doesn't know how to create analytical derivatives, so it's having 01:45:41.080 |
to calculate the derivatives with finite difference, which is slow. 01:45:45.400 |
So here we've got a solution, it's got it down to 5. 01:45:51.600 |
So we can see here that it predicted 5.14 and it was actually 5. 01:46:05.980 |
It's a little bit too easy because there are 5 factors for every user and 5 for every movie. 01:46:17.520 |
We've got nearly as many factors as we have things to calculate, so it's kind of over-specified. 01:46:28.280 |
The piece we're missing is that some users probably just like movies more than others, 01:46:35.460 |
and some movies are probably just more liked than others. 01:46:39.440 |
And this dot product does not allow us in any way to say this is an enthusiastic user, or this is a generally well-liked movie. 01:46:52.560 |
So here is exactly the same spreadsheet, but I've added one more row to the movies part 01:47:02.280 |
and one more column to the users part for our biases. 01:47:07.280 |
And I've updated the formula so that as well as the matrix multiplication, it also adds in the user bias and the movie bias. 01:47:18.080 |
So this is saying this is a very popular movie, and here we are, this is a very enthusiastic user. 01:47:30.760 |
And so now that we have a collaborative filtering plus bias, we can do gradient descent on that. 01:47:39.120 |
So previously our gradient descent loss function was 5.6. 01:47:45.540 |
We would expect it to be better with bias, because we can better specify what's going on. 01:47:52.120 |
So again we run solver, solve, and we let that zip along, and we see what happens. 01:47:59.760 |
So these things we're calculating are called latent factors. 01:48:06.440 |
A latent factor is some factor that is influencing outcome, but we don't quite know what it is. 01:48:15.680 |
And in fact what happens is when people do collaborative filtering, they then go back 01:48:19.400 |
and they draw graphs where they say here are the movies that are scored highly on this 01:48:24.860 |
latent factor and low on this latent factor, and so they'll discover the Bruce Willis factor 01:48:34.400 |
And so if you look at the Netflix prize visualizations, you'll see these graphs people do. 01:48:39.980 |
And the way they do them is they literally do this. 01:48:42.440 |
Not in Excel, because they're not that cool, but they calculate these latent factors and 01:48:48.480 |
then they draw pictures of them, and then they actually write the name of each movie at its point on the chart. 01:49:05.400 |
In fact I also have an error here, because any time that my rating is empty, I really 01:49:13.800 |
want to be setting this prediction to empty as well, which means my parenthesis was in the wrong place. 01:49:21.280 |
So I'm going to recalculate this with my error fixed up and see if we get a better answer. 01:49:30.480 |
They're randomly generated and then optimized with gradient descent. 01:49:59.480 |
For some reason, this seems crazier than what we were doing at CNN's, because movies I understand 01:50:11.200 |
more than features of images that I just don't intuitively understand. 01:50:17.080 |
So we can look at some pictures next week, but during the week, Google for Netflix prize 01:50:25.040 |
visualizations and you will see these pictures. 01:50:32.840 |
It figures out what are the most interesting dimensions on which we can rate a movie. 01:50:41.000 |
Things like level of action and sci-fi and dialogue driven are very important features, 01:50:49.960 |
But rather than pre-specifying those features, we have definitely learned from this class 01:50:56.160 |
that calculating features using gradient descent is going to give us better features than trying to specify them by hand. 01:51:11.040 |
Tell me next week if you find some particularly interesting things, or if it still seems crazy to you. 01:51:24.760 |
Now there's really only one main new concept we have to learn, which is we started out 01:51:31.000 |
with data not in a crosstab form, but in this form. 01:51:36.440 |
We have user ID, movie ID, rating triplets, and I crosstab them. 01:51:47.920 |
So the rows and the columns above the random numbers, are they the variations and the features 01:51:55.700 |
in the movies and the variations and features in the users? 01:51:59.520 |
Each of these rows is one feature of a movie, and each of these columns is one feature of 01:52:06.360 |
And so one of these sets of 5 is one set of features for a user. 01:52:14.240 |
I think it's interesting and crazy because you're basically taking random data and you 01:52:19.520 |
can generate those features for people that you don't know and movies that you've never seen. 01:52:26.360 |
Yeah, this is the thing I just did at the start of class, which is there's nothing mathematically complicated going on here. 01:52:38.500 |
The hard part is unlearning the idea that this should be hard; gradient descent can just figure these things out for us. 01:52:50.840 |
I just wanted to point out that you can think of this as a smaller, more concise way to represent that big cross-tab. 01:53:03.740 |
In math, there's a concept of a matrix factorization, an SVD for example, which is where you basically 01:53:10.360 |
take a big matrix and turn it into a tall narrow one and a short wide one, and multiply them together. 01:53:17.920 |
Instead of having how user 14 rated every single movie, we just have 5 numbers that summarize user 14. 01:53:28.760 |
So earlier, did you say that both the user features were random as well as the movie features? 01:53:36.200 |
I guess I'm having trouble relating to this. I thought, you know, usually we run something like gradient 01:53:44.760 |
descent on something that has inputs that you know, and here, what do you actually know? 01:53:53.360 |
What we know is the resulting ratings; that's what we know. 01:53:57.160 |
So can you perhaps come up with the wrong answer, like you flip the feature for a movie and 01:54:06.800 |
a user, because if you're doing a multiplication, how do you know which value goes where? 01:54:16.240 |
If one of the numbers was in the wrong spot, our loss function would be less good and therefore 01:54:23.480 |
there would be a gradient from that weight to say you should make this weight a little 01:54:31.680 |
So all the gradient descent is doing is saying okay, for every weight, if we make it a little 01:54:36.160 |
higher, does it get better or if we make it a little bit lower, does it get better? 01:54:39.760 |
And then we keep making them a little bit higher and lower until we can't go any better. 01:54:47.520 |
And we had to decide how to combine the weights. 01:54:50.720 |
So this was our architecture, our architecture was let's take a dot product of some assumed 01:54:58.200 |
user feature and some assumed movie feature, and, in the second case, let's add in some assumed biases. 01:55:06.320 |
So we had to build an architecture and we built the architecture using common sense, 01:55:11.000 |
which is to say this seems like a reasonable way of thinking about this. 01:55:13.600 |
I'm going to show you a better architecture in a moment. 01:55:15.920 |
In fact, we're running out of time, so let me jump into the better architecture. 01:55:21.600 |
So I wanted to point out that there is something new we're going to have to learn here, which 01:55:25.560 |
is how do you start with a numeric user_id and look up what their 5-element vector of factors is. 01:55:35.960 |
Now remember, when we have user_id's like 1, 2 and 3, one way to specify them is using one-hot encoding. 01:55:52.600 |
So one way to handle this situation would be if this was our user matrix, it was one 01:56:01.720 |
hot encoded, and then we had a factor matrix containing a whole bunch of random numbers 01:56:12.360 |
-- one way to do it would be to take a dot product or a matrix product of this and this. 01:56:29.200 |
And what that would do would be for this one here, it would basically say let's multiply 01:56:34.480 |
that by this, it would grab the first column of the matrix. 01:56:42.360 |
And this here would grab the second column of the matrix. 01:56:46.040 |
And this here would grab the third column of the matrix. 01:56:49.080 |
So one way to do this in Keras would be to represent our user_id's as one hot encodings, 01:56:56.720 |
and to create a user factor matrix just as a regular matrix like this, and then take a matrix product of the two. 01:57:07.800 |
That's horribly slow because if we have 10,000 users, then this thing is 10,000 wide and that's 01:57:16.040 |
a really big matrix multiplication when all we're actually doing is saying, for user_id number 1, take the first column. 01:57:22.520 |
For user_id number 2, take the second column, for user_id number 3, take the third column. 01:57:27.120 |
And so Keras has something which does this for us and it's called an embedding layer. 01:57:32.280 |
And embedding is literally something which takes an integer as an input and looks up 01:57:36.600 |
and grabs the corresponding column as output. 01:57:39.920 |
So it's doing exactly what we're seeing in this spreadsheet. 01:57:43.160 |
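Here is a tiny demonstration that the one-hot matrix product and a simple lookup give exactly the same answer, which is all an embedding layer is doing (numpy indexes by row rather than column, but the idea is identical):

```python
import numpy as np

n_users, n_factors = 4, 5
factors = np.random.rand(n_users, n_factors)   # one row of factors per user

user_ids = np.array([0, 2, 2, 1])
one_hot  = np.eye(n_users)[user_ids]           # (4, 4) one-hot encodings

via_matmul = one_hot @ factors                 # the slow way: a full matrix product
via_lookup = factors[user_ids]                 # the fast way: just grab the rows

assert np.allclose(via_matmul, via_lookup)
```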
How do you deal with missing values, so if a user has not rated a particular movie? 01:57:52.400 |
That's no problem, missing values are just ignored; if a rating is missing, I just set that cell to empty so it doesn't contribute to the loss. 01:57:57.520 |
How do you break up the training and test set? 01:58:02.000 |
I broke up the training and test set randomly by grabbing random numbers and saying are 01:58:09.280 |
they greater or less than 0.8 and then split my ratings into two groups based on that. 01:58:14.160 |
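A sketch of that split, plus the id remapping you would need so the raw userId and movieId values can index an embedding directly (the remapping isn't mentioned in the lecture, but the later sketches assume it):

```python
import numpy as np

# Remap the raw ids to contiguous integers 0..n-1.
ratings['userId']  = ratings.userId.astype('category').cat.codes
ratings['movieId'] = ratings.movieId.astype('category').cat.codes
n_users, n_movies = ratings.userId.nunique(), ratings.movieId.nunique()

msk = np.random.rand(len(ratings)) < 0.8       # roughly an 80/20 split
trn, val = ratings[msk], ratings[~msk]
```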
And you're choosing those from the ratings so that you have some ratings 01:58:18.640 |
from all users and you have some ratings for all movies? 01:58:30.480 |
In Keras, there's one other thing, I'm going to stop using the sequential model in Keras 01:58:38.200 |
and start using the functional model in Keras. 01:58:40.760 |
I'll talk more about this next week, but you can read about it during the week. 01:58:44.080 |
There are two ways of creating models in Keras, the sequential and the functional. 01:58:48.320 |
They do similar things, but the functional one is much more flexible, and it's going to be what we use here. 01:58:54.880 |
So this is going to look slightly unfamiliar, but the ideas are the same. 01:58:59.460 |
So we create an input layer for a user, and then we say now create an embedding layer 01:59:07.480 |
for n users, which is 671, and we want to create how many latent factors? 50 in this case. 01:59:19.360 |
And then I create a movie input, and then I create a movie embedding with 50 factors, 01:59:26.840 |
and then I say take the dot product of those, and that's our model. 01:59:34.540 |
So now please compile the model, and now train it, taking the userID and movieID as input, 01:59:42.120 |
the rating as the target, and run it for 6 epochs, and I get a 1.27 loss. 01:59:53.720 |
Notice that I'm not doing anything else clever, it's just that simple dot product. 01:59:59.840 |
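A sketch of that model in the functional API, written against the modern tf.keras names rather than the older Keras version used in the notebook, and reusing trn, val, n_users and n_movies from the split sketch above:

```python
from tensorflow.keras.layers import Input, Embedding, Dot, Flatten
from tensorflow.keras.models import Model

n_factors = 50

user_in  = Input(shape=(1,), name='user_in')
movie_in = Input(shape=(1,), name='movie_in')
u = Embedding(n_users,  n_factors)(user_in)    # (batch, 1, 50) user factors
m = Embedding(n_movies, n_factors)(movie_in)   # (batch, 1, 50) movie factors

x = Flatten()(Dot(axes=2)([u, m]))             # dot product of the two factor vectors

model = Model([user_in, movie_in], x)
model.compile(optimizer='adam', loss='mse')
model.fit([trn.userId, trn.movieId], trn.rating, epochs=6, batch_size=64,
          validation_data=([val.userId, val.movieId], val.rating))
```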
Here's how I add the bias: I use exactly the same kind of embedding inputs as before. 02:00:07.760 |
So my user and movie embeddings are the same. 02:00:11.120 |
And then I create bias by simply creating an embedding with just a single output. 02:00:18.400 |
And so then my new model is: do a dot product, and then add the user bias, and add the movie bias. 02:00:35.200 |
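The bias version is just two more single-output embeddings added on; again a modern-API sketch, reusing the inputs and factor embeddings from the previous sketch.

```python
from tensorflow.keras.layers import Add

ub = Flatten()(Embedding(n_users,  1)(user_in))   # one bias per user
mb = Flatten()(Embedding(n_movies, 1)(movie_in))  # one bias per movie

x = Flatten()(Dot(axes=2)([u, m]))
x = Add()([x, ub, mb])                            # dot product + user bias + movie bias

bias_model = Model([user_in, movie_in], x)
bias_model.compile(optimizer='adam', loss='mse')
bias_model.fit([trn.userId, trn.movieId], trn.rating, epochs=6, batch_size=64,
               validation_data=([val.userId, val.movieId], val.rating))
```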
Well, there are lots of sites on the internet where you can find benchmarks for MovieLens, 02:00:41.080 |
and on the 100,000 dataset, we're generally looking for an RMSE of about 0.89. 02:00:47.120 |
There are some more here; the best one here is about 0.9, here we are, 0.89, and this one is 02:01:00.560 |
on the 1,000,000 dataset, so let's go to the 100,000 one, where the RMSE is around 0.89. 02:01:10.360 |
So kind of high 0.8s, low 0.9s would be state-of-the-art according to these benchmarks. 02:01:16.600 |
So, we're on the right track, but we're not there yet. 02:01:21.000 |
So let's try something better, let's create a neural net. 02:01:25.960 |
We create a movie embedding and a user embedding, again with 50 factors, and this time we don't 02:01:31.600 |
take a dot product, we just concatenate the two vectors together, stick one on the end 02:01:37.820 |
And because we now have one big vector, we can create a neural net, create a dense layer, 02:01:43.480 |
add dropout, create an activation, compile it, and fit it. 02:01:51.120 |
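A sketch of that neural-net version, with illustrative layer sizes and dropout rates, again reusing the inputs and embeddings defined above: concatenate the two embeddings and put a small dense network on top.

```python
from tensorflow.keras.layers import Concatenate, Dense, Dropout

x = Concatenate()([Flatten()(u), Flatten()(m)])   # one 100-element vector per rating
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.7)(x)
x = Dense(1)(x)                                   # the predicted rating

nn = Model([user_in, movie_in], x)
nn.compile(optimizer='adam', loss='mse')
nn.fit([trn.userId, trn.movieId], trn.rating, epochs=5, batch_size=64,
       validation_data=([val.userId, val.movieId], val.rating))
```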
And after 5 epochs, we get something way better than state-of-the-art. 02:01:57.280 |
So we couldn't find anything better than about 0.89. 02:02:00.680 |
And so this whole notebook took me like half an hour to write, and so I don't claim to 02:02:06.080 |
be a collaborative filtering expert, but I think it's pretty cool that these things that 02:02:10.360 |
were written by people that write collaborative filtering software for a living, that's what 02:02:15.680 |
these websites basically are coming from, places that use LensKit. 02:02:21.100 |
So LensKit is a piece of software for recommender systems. 02:02:26.000 |
We have just killed their benchmark, and it took us 10 seconds to train. 02:02:33.720 |
And we're right on time, so we're going to take one last question. 02:02:36.480 |
So in the neural net, why is it that there are a number of factors so low? 02:02:46.960 |
Oh, actually I thought it was an equal, not a comma, never mind, we're good. 02:02:53.240 |
So that was a very, very quick introduction to embeddings, like as per usual in this class, 02:02:59.200 |
I kind of stick the new stuff in at the end and say go study it. 02:03:04.160 |
So your job this week is to keep improving State Farm, and hopefully have a go at the new fisheries competition. 02:03:10.960 |
By the way, in the last half hour, I just created this little notebook in which I basically 02:03:15.880 |
copied the Dogs and Cats Redux competition into something which does the same thing with 02:03:22.960 |
the fish data, and I quickly submitted a result. 02:03:27.880 |
So we currently have one of us in 18th place, yay. 02:03:34.760 |
But most importantly, download the movie lens data and have a play with that and we'll talk