Foundations of Deep Learning (Hugo Larochelle, Twitter)
Chapters
0:00 Intro
1:00 FOUNDATIONS OF DEEP LEARNING
9:45 CAPACITY OF NEURAL NETWORK
11:02 MACHINE LEARNING
15:45 LOSS FUNCTION
19:09 BACKPROPAGATION
24:59 ACTIVATION FUNCTION
27:30 FLOW GRAPH
29:38 REGULARIZATION
30:24 INITIALIZATION
33:26 MODEL SELECTION
36:33 KNOWING WHEN TO STOP
38:07 OTHER TRICKS OF THE TRADE
42:27 GRADIENT CHECKING
43:36 DEBUGGING ON SMALL DATASET
50:54 DROPOUT
55:10 BATCH NORMALIZATION
58:34 UNSUPERVISED PRE-TRAINING
58:37 NEURAL NETWORK ONLINE COURSE
00:00:01.440 |
So yeah, so I was asked to give this presentation, 00:00:07.080 |
which mostly goes over basic feedforward neural networks, 00:00:29.160 |
so it would mostly be review if you're familiar enough 00:00:31.600 |
with some machine learning and a little bit about neural nets. 00:00:38.280 |
You can go check out my online lectures on YouTube. 00:00:41.320 |
It's now taught by a much younger version of myself. 00:00:47.800 |
And I am not the guy doing a bunch of skateboarding. 00:00:52.920 |
So go check those out if you want more details. 00:00:58.880 |
I'll start with just describing and laying out 00:01:02.280 |
the notation on feedforward neural networks, 00:01:13.120 |
the different types of units, and the types of functions they compute. 00:01:17.680 |
And then I'll talk about how we actually train neural nets, 00:01:22.600 |
using backpropagation, which allows us to get a gradient for training, 00:01:29.840 |
and some of the things we do in practice to successfully train them. 00:01:33.680 |
And then I'll end by talking about some developments that 00:01:37.680 |
are specifically useful in the context of deep learning, that 00:01:41.240 |
is, neural networks with several hidden layers that came out 00:01:46.920 |
after the beginning of deep learning, say, in 2006. 00:01:49.880 |
That is, things like dropout, batch normalization, 00:01:52.200 |
and if I have some time, unsupervised pre-training. 00:01:57.560 |
So first, let's just lay out what a neural network is. 00:02:10.240 |
It's a model that takes as input some vector x, with one unit 00:02:16.400 |
for each of the dimensions in my input vector. 00:02:19.480 |
So each dimension is essentially a unit in that neural network. 00:02:23.640 |
And then it eventually produces, at its output layer, a prediction. 00:02:38.680 |
So if we're identifying digits in handwritten character recognition, for example. 00:03:02.380 |
The hidden layers in between allow us to capture and perform very sophisticated types of computation. 00:03:11.500 |
The way we compute all the layers in our neural net is as follows. 00:03:23.920 |
So A k is just the pre-activation at layer k. 00:03:28.080 |
And that is simply going to be a linear transformation 00:03:34.080 |
of the layer below. So I'm going to denote h k as the activation of the layer. 00:03:43.080 |
And so using that notation, the pre-activation at layer k 00:03:46.960 |
is going to correspond to taking the activation of the layer below, 00:03:57.640 |
multiplying it by a weight matrix, and adding a bias vector. Those weights essentially correspond to the connections between units in adjacent layers. 00:04:09.640 |
And then next, I'm going to get the hidden layer activation by applying an activation function g to the pre-activation. 00:04:14.680 |
This will introduce some non-linearity in the model. 00:04:22.000 |
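To make the notation concrete, here is a minimal numpy sketch of that recurrence, a pre-activation that is a linear transformation of the layer below, followed by a non-linearity. The function and variable names are my own, not from the talk, and tanh is just a placeholder choice of activation.

```python
import numpy as np

def layer_forward(h_prev, W, b, g):
    """One layer: pre-activation a(k) = b(k) + W(k) h(k-1), activation h(k) = g(a(k))."""
    a = b + W @ h_prev   # linear transformation of the activation below
    h = g(a)             # element-wise non-linearity
    return a, h

# Tiny illustration with made-up sizes: 4 inputs feeding 3 hidden units.
x = np.random.randn(4)
W1 = 0.1 * np.random.randn(3, 4)
b1 = np.zeros(3)
a1, h1 = layer_forward(x, W1, b1, np.tanh)
```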
So we have four common choices for the activation function. 00:04:36.000 |
But then I'll usually apply a different activation function 00:04:38.600 |
depending on the problem I'm trying to solve. 00:04:41.960 |
So having said that, let's go to some of the choices 00:04:48.260 |
So some of the activation functions you'll see. 00:04:50.800 |
One common one is this sigmoid activation function. 00:04:54.880 |
It's just 1 divided by 1 plus the exponential of minus the pre-activation. 00:05:01.360 |
The shape of this function is this here. 00:05:06.160 |
The pre-activation can vary from minus infinity to plus infinity, 00:05:11.680 |
but the output is bounded below and above, below by 0 and above by 1. 00:05:19.600 |
So if you have very large magnitude positive or negative pre-activations, the sigmoid saturates at 1 or at 0. 00:05:26.760 |
Another common choice is the hyperbolic tangent or tanh 00:05:38.720 |
And one that's become quite popular in neural nets 00:05:42.600 |
is what's known as the rectified linear activation function. 00:05:50.400 |
So when people talk about ReLUs, that refers to the use of this activation function. 00:05:56.560 |
It's different in that it's not bounded above, but it is bounded below. 00:06:00.340 |
And it will output 0's exactly if the pre-activation is negative. 00:06:08.040 |
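As a reference, these are the three hidden-unit activation functions just described, written as plain numpy functions (a sketch, not code from the lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # squashes into (0, 1)

def tanh(a):
    return np.tanh(a)                 # squashes into (-1, 1)

def relu(a):
    return np.maximum(0.0, a)         # exactly 0 for negative pre-activations, linear above
```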
So those are the choices of activation functions for the hidden layers. 00:06:12.060 |
And for the output layer, if we're performing 00:06:13.960 |
classification, as I said, in our output layer we'll have one unit per class, 00:06:24.720 |
and what we often do is interpret each unit's value as the probability, according 00:06:30.320 |
to the neural network, that the input belongs to the class whose label 00:06:39.600 |
would be like the index of that unit in the output layer. 00:06:46.840 |
So the output layer represents a multinomial distribution over all the different classes, and we compute it with what's called the softmax activation function. 00:06:57.400 |
You take your pre-activations, and you exponentiate them. 00:07:02.800 |
And then we divide each of the exponentiated pre-activations 00:07:06.440 |
by the sum of all the exponentiated pre-activations. 00:07:13.040 |
So it means that all my values in my output layer sum to 1. 00:07:17.920 |
And they're positive because I took the exponential. 00:07:20.080 |
So I can interpret that as a multinomial distribution 00:07:23.080 |
over the choice of all the C different classes. 00:07:26.760 |
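A small sketch of that computation (the subtraction of the max is an extra numerical-stability detail I'm adding; it doesn't change the result):

```python
import numpy as np

def softmax(a):
    """Exponentiate the pre-activations and divide by the sum of the exponentials."""
    e = np.exp(a - a.max())   # shift by the max so the exponentials don't overflow
    return e / e.sum()        # positive values that sum to 1
```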
So that's what I'll use as the activation function at the output layer. 00:07:32.160 |
And now, beyond the math in terms of conceptually 00:07:35.100 |
and also in the way we're going to program neural networks, 00:07:38.480 |
often what we'll do is that all these different operations, 00:07:41.020 |
the linear transformations, the different types of activation 00:07:43.520 |
functions, we'll essentially implement all of them as separate boxes or modules. 00:07:59.520 |
So we might have one box that corresponds to the computation of the pre-activation, 00:08:04.920 |
whose parameters are the weight matrix and the bias vector for that layer, and then boxes for the different activation functions, 00:08:21.880 |
so like a sigmoid object or a tanh object or a ReLU object. 00:08:27.200 |
And then we chain them into what ends up being a graph, which I refer 00:08:30.960 |
to as a flow graph, that represents the computation done 00:08:34.680 |
when you do a forward pass in your neural network 00:08:41.520 |
And you'll see that the different software packages presented over the weekend 00:08:45.840 |
will essentially exploit some of that representation. 00:08:51.600 |
It will also be handy for computing gradients, which I'll get to shortly. 00:08:57.720 |
And so that's how we perform predictions in neural networks. 00:09:13.400 |
We assign the input to the unit that has the highest activation, that 00:09:16.440 |
would correspond to classifying to the class that 00:09:19.160 |
has the highest probability according to the neural net. 00:09:26.200 |
what kind of problems can we solve with neural networks? 00:09:43.160 |
But essentially, if we have a single hidden layer 00:09:45.960 |
neural network, it's been shown that with a linear output, 00:09:51.040 |
we can approximate any continuous function arbitrarily well as long as we have enough hidden units. 00:09:54.560 |
So that is, there's a value for these biases and these weights such that 00:09:59.240 |
I can actually approximate the function as well as I want. 00:10:04.760 |
So this result applies if you use 00:10:07.560 |
non-linear activation functions like sigmoid and tanh. 00:10:11.360 |
So as I said in my video, if you want a bit more intuition 00:10:14.000 |
as to why that would be, you can go check that out. 00:10:19.920 |
It means that by focusing on this family of machine learning 00:10:23.880 |
models that are neural networks, I can pretty much potentially 00:10:27.720 |
represent any kind of classification function. 00:10:32.560 |
So how do we actually find the weights and the bias values? 00:10:48.280 |
How do we train a neural network to perform good classification? 00:10:54.280 |
So what we'll typically do is use a framework that's 00:10:59.900 |
known as empirical risk minimization or structural risk minimization, 00:11:08.940 |
where we frame the problem of learning as a problem of optimization. 00:11:12.860 |
So what we'll do is that we'll first choose a loss function 00:11:28.540 |
The loss function essentially tells us, for each of my different examples: 00:11:36.940 |
is this output good or bad given that the label is actually y? 00:11:42.660 |
And what I'll do, I'll also define a regularizer. 00:11:49.620 |
It's a function of the parameters, which you can think of as just a concatenation of all my biases and all my weights. 00:11:49.620 |
So those are all the parameters of my neural network. 00:12:05.980 |
For instance, you might want to have your weights not be too large. 00:12:05.980 |
That's a frequent intuition that we implement with a regularizer. 00:12:09.940 |
So the objective I'll minimize is the average loss of my neural network over my training 00:12:23.900 |
examples, so summing over all training examples. 00:12:28.860 |
Plus some weight here that's known as the weight decay, 00:12:33.340 |
some hyperparameter lambda, times my regularizer. 00:12:39.280 |
So I want to make my loss on my training set the smallest possible, while also trying 00:12:45.260 |
to satisfy my regularizer as much as possible. 00:12:48.660 |
And so now we have this optimization problem. 00:12:55.380 |
So finding this arg min here over my weights and my biases. 00:13:10.340 |
And the algorithm you'll see constantly in deep learning is stochastic gradient descent, 00:13:16.220 |
which is the one that we'll often use for training neural networks. 00:13:19.780 |
So SGD, stochastic gradient descent, functions as follows. 00:13:26.580 |
First, I initialize my parameters; that is, finding initial values for all my weight matrices and biases. 00:13:34.140 |
Then I iterate over epochs, so an epoch will be a full pass over all my examples. 00:13:40.100 |
So for a certain number of full iterations over my training set, 00:13:51.620 |
and for each example, I'll compute what is the gradient of my loss with respect to 00:13:58.580 |
all of my parameters, all my weights, and all my biases. 00:14:03.180 |
so nabla for the gradient of the loss function. 00:14:06.740 |
And here I'm indexing with respect to which parameter. 00:14:12.060 |
So I'm going to compute what is the gradient of my loss on that example, 00:14:17.240 |
plus lambda times the gradient of my regularizer. 00:14:21.160 |
And then I'm going to get a direction in which I should move. 00:14:25.220 |
Since the gradient tells me how to increase the loss, 00:14:28.620 |
I want to go in the opposite direction and decrease it. 00:14:35.540 |
And so this delta is going to be the direction in which I'll move, 00:14:35.540 |
scaled by some step size, which is often referred to as a learning rate, 00:14:43.940 |
and I'll add that update to my current values of my parameters, my biases and my weights. 00:14:49.300 |
And that's going to give me my new value for all my parameters. 00:14:53.360 |
And I iterate like that, going over all pairs x, y in my training set. 00:14:57.180 |
So that's how stochastic gradient descent works. 00:15:10.620 |
And that's essentially the learning procedure. 00:15:17.900 |
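As a rough sketch of that procedure (assuming `grad_fn` returns, for one example, the gradient of loss-plus-regularizer for every parameter; the names here are illustrative, not from the talk):

```python
def sgd(params, grad_fn, train_set, alpha=0.01, n_epochs=10):
    """Stochastic gradient descent: loop over epochs, then over (x, y) pairs,
    and move each parameter opposite to its gradient, scaled by the learning rate."""
    for epoch in range(n_epochs):            # an epoch = one full pass over the examples
        for x, y in train_set:
            grads = grad_fn(params, x, y)    # loss gradient + lambda * regularizer gradient
            for name in params:
                params[name] -= alpha * grads[name]   # step in the descent direction
    return params
```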
There are several ingredients we need to specify to be able to implement it and execute it. 00:15:20.860 |
We need a loss function, a choice for the loss function. 00:15:23.620 |
We need a procedure that's efficient for computing 00:15:26.860 |
the gradient of the loss with respect to my parameters. 00:15:30.620 |
We need to choose a regularizer if we want one. 00:15:33.300 |
And we need a way of initializing my parameters. 00:15:35.940 |
So next, what I'll do is I'll go through each of these ingredients. 00:15:48.060 |
So as I said, we will interpret the output layer 00:15:50.980 |
as assigning probabilities to each potential class in which the input could fall. 00:15:57.460 |
Well, in this case, something that would be natural is to maximize the probability 00:16:02.120 |
of the correct class, the actual class in which my example belongs. 00:16:06.340 |
I'd like to increase the value of the probability assigned to that class. 00:16:12.820 |
And so because we set up the problem as a minimization, instead 00:16:18.700 |
of maximizing the probability, what we'll actually do 00:16:20.980 |
is minimize the negative of the log probability, 00:16:32.100 |
So given my output layer and the true label y, 00:16:35.420 |
my loss will be minus the log of the probability of y 00:16:42.140 |
And that would be, well, take my output layer 00:16:44.980 |
and look at the unit, so index the unit corresponding to class y. 00:17:01.340 |
In certain places, you'll see, instead of talking about the negative log-likelihood, this written 00:17:11.820 |
as performing a sum over all possible classes. 00:17:20.820 |
So I have an indicator function that is 1 if y is equal to c, and 0 otherwise. 00:17:29.700 |
I'm going to multiply that by the log of the probability of class c. 00:17:35.660 |
And this function here, so this expression here, 00:17:39.380 |
is like a cross-entropy between the empirical distribution, 00:17:42.900 |
which assigns zero probability to all the other classes and probability 1 to the correct class, 00:17:50.900 |
and the distribution that my neural net is computing, which is f of x. 00:17:57.540 |
Here, I only mention it because in certain libraries, 00:17:59.700 |
it's actually mentioned as the cross-entropy loss. 00:18:17.300 |
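For reference, a minimal sketch of this loss given the softmax output vector (the helper names are hypothetical):

```python
import numpy as np

def nll_loss(probs, y):
    """Negative log-probability of the correct class y under the network output f(x)."""
    return -np.log(probs[y])

def cross_entropy(probs, y):
    """Same value, written as the sum over classes of an indicator times a log-probability."""
    onehot = np.zeros_like(probs)
    onehot[y] = 1.0
    return -np.sum(onehot * np.log(probs))
```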
If you want the actual derivation of all the details for all the layers, it's covered in my online videos. 00:18:28.620 |
If you haven't, just go check out the videos. 00:18:30.700 |
In any case, I'm going to go through what the algorithm is. 00:18:33.980 |
I'm going to highlight some of the key points 00:18:35.900 |
that will come up later in understanding how actually 00:18:41.380 |
So the basic idea is that we'll compute gradients layer by layer. 00:18:46.420 |
And we'll go from the top layer all the way to the bottom, 00:18:50.220 |
computing gradients for layers that are closer and closer 00:18:53.820 |
to the input as we go, and exploiting the chain rule 00:18:56.540 |
to reuse previous computations we've 00:18:59.820 |
made at upper layers to compute the gradients at the layers below. 00:19:04.980 |
So we usually start by computing what is the gradient at the output. 00:19:09.300 |
So what's the gradient of my loss with respect 00:19:15.380 |
to the pre-activation at the output layer? 00:19:21.140 |
So that's why I have here the gradient of the loss with respect to this vector, 00:19:25.140 |
the pre-activation at the very last layer. 00:19:32.460 |
And it turns out this gradient is super simple. 00:19:40.180 |
So what this means is E of y is just a vector filled 00:19:43.820 |
with a bunch of 0's and then the 1 at the correct class. 00:19:47.860 |
So if y was the fourth class, then in this case, 00:19:51.100 |
it would be this vector, where I have a 1 at the fourth position and 0's everywhere else. 00:19:58.700 |
And the single 1 at the position corresponding to the correct class 00:20:05.140 |
is essentially saying that the update 00:20:12.520 |
will increase the probability of the correct class. 00:20:15.460 |
And I'm going to subtract the current probabilities 00:20:18.660 |
assigned by my neural net to all of the classes. 00:20:23.580 |
And that's the current beliefs of the neural net 00:20:26.420 |
as to what's the probability of the input belonging to each class. 00:20:33.740 |
So that part is trying to decrease the probability of everything, 00:20:36.260 |
and specifically decrease it as much as the neural net currently believes in it. 00:20:42.900 |
And so if you think about the subtraction of these two 00:20:45.420 |
things, well, for the class that's the correct class, 00:20:48.260 |
I'm going to have 1 minus some number between 0 and 1, which is positive, 00:21:01.180 |
and for all the other classes I get minus their current probability, so I'm actually going to decrease the probability of those classes. 00:21:08.740 |
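That gradient is short enough to write out directly (a sketch; `probs` is the softmax output f(x)):

```python
import numpy as np

def output_preactivation_grad(probs, y):
    """Gradient of -log p(y|x) with respect to the output pre-activation: f(x) - e(y)."""
    e_y = np.zeros_like(probs)
    e_y[y] = 1.0               # one-hot vector with a 1 at the correct class
    return probs - e_y         # negative for the correct class, positive for the others
```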
And I'm going to take that pre-activation gradient. 00:21:11.580 |
I'm going to propagate it from the top to the bottom 00:21:15.340 |
and essentially iterating from the last layer, which 00:21:19.740 |
is the output layer, L plus 1, all the way down to the first hidden layer. 00:21:23.980 |
And as I'm going down, I'm going to compute the gradients for the parameters at that layer, then 00:21:28.500 |
compute what's the gradient for the pre-activation 00:21:31.420 |
at the layer below and then iterate like that. 00:21:38.180 |
So at layer k, I take what is the current gradient of the loss function with respect to the pre-activation, 00:21:46.220 |
and I can compute the gradient of the loss function with respect to the weights at that layer. 00:21:58.180 |
So in my notation, I assume that all the vectors are column vectors. 00:22:05.180 |
I take the pre-activation gradient and I multiply it by the transpose of the activations, 00:22:09.020 |
so the value of the layer right below, the layer k minus 1. 00:22:20.800 |
I'm going to get a matrix of the same size as my weight matrix. 00:22:27.500 |
Turns out that the gradient of the loss with respect to the biases is just the gradient 00:22:31.940 |
of the loss with respect to the pre-activation. 00:22:36.220 |
So that gives me now my gradients for my parameters. 00:22:40.420 |
The next thing I need is going to be the gradient of the pre-activations at the layer below. 00:22:44.700 |
Well, first, I'm going to get the gradient of the loss 00:22:48.980 |
function with respect to the activation at the layer below. 00:22:54.220 |
Well, that's just taking my pre-activation gradient vector 00:23:01.780 |
and multiplying it by the transpose of my weight matrix. 00:23:04.660 |
Super simple operation, just a linear transformation 00:23:07.580 |
of my gradients at layer k, linearly transformed 00:23:10.580 |
to get my gradients of the activation at the layer k minus 1. 00:23:15.180 |
And then to get the gradients of the pre-activation, 00:23:22.940 |
I multiply element-wise by another vector, which is the gradient of the activation function, 00:23:29.900 |
that is, each element corresponds to the partial derivative of my nonlinear activation function. 00:23:33.700 |
So this here, this refers to an element-wise product. 00:23:37.060 |
So I'm taking these two vectors, this vector here and that vector there, and 00:23:40.740 |
I'm going to do an element-wise product between the two. 00:23:43.740 |
And this second vector here is just the partial derivatives 00:23:49.620 |
of the activation function at each unit individually, that I've put together into a vector. 00:23:59.580 |
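Putting those steps together for one layer, a minimal sketch (my own naming; `dact_prev` stands for the vector of activation-function derivatives at layer k-1):

```python
import numpy as np

def layer_backward(grad_a, h_prev, W, dact_prev):
    """Given grad_a = dL/da(k), return the parameter gradients at layer k
    and the pre-activation gradient at layer k-1."""
    grad_W = np.outer(grad_a, h_prev)      # gradient vector times transposed activations below
    grad_b = grad_a                        # bias gradient equals the pre-activation gradient
    grad_h_prev = W.T @ grad_a             # linear transformation by the transposed weights
    grad_a_prev = grad_h_prev * dact_prev  # element-wise product with g'(a(k-1))
    return grad_W, grad_b, grad_a_prev
```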
So doing this whole backward pass, and doing all these iterations, is actually fairly cheap. 00:24:02.820 |
Complexity is essentially the same as the one of the forward pass: I'm 00:24:11.180 |
multiplying by matrices, in this case, the transpose of my weight matrices. 00:24:15.060 |
And then I'm also doing this nonlinear operation 00:24:17.620 |
where I'm multiplying by the gradient of the activation function. 00:24:24.220 |
One thing to notice is that here I'm doing this element-wise product. 00:24:30.060 |
So if the derivative of the activation function is very close to 0, then the pre-activation gradient will also be very close to 0. 00:24:36.460 |
And I highlight this point because essentially whenever-- 00:24:42.860 |
Whenever this gradient here, these partial derivatives, 00:24:46.060 |
come close to 0, then it means the gradient will not propagate well, which 00:24:50.420 |
means that you're not going to get a good gradient to update the layers below. 00:24:56.380 |
When will you see these terms here being close to 0? 00:24:59.380 |
Well, that's going to be when the partial derivatives of the activation function are close to 0. 00:25:05.780 |
So we can look at the partial derivatives, say, of the sigmoid activation function. Whenever the value 00:25:20.160 |
of the unit for a sigmoid unit is close to 1 or close to 0, 00:25:23.500 |
I essentially get a partial derivative that's close to 0. 00:25:39.860 |
So if my pre-activation is very large in magnitude, or in other words if my unit is very saturated, then gradients 00:25:42.940 |
will have a hard time propagating to the next layer. 00:25:59.980 |
Same thing for the tanh: if it's close to minus 1 or close to 1, the partial derivative is also close to 0. 00:26:14.140 |
And for the ReLU, the rectified linear activation function, the partial derivative is very simple. 00:26:21.180 |
You just check whether the pre-activation is greater than 0 or not. 00:26:27.780 |
So actually, you're going to multiply by 1 or 0 00:26:31.760 |
when you're performing the propagation through the ReLU. 00:26:40.020 |
So actually, here, the gradient is either passed through as is or zeroed out entirely; there's no gradual shrinking of the gradient towards 0. 00:26:52.140 |
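For reference, the three derivatives being multiplied in (a sketch; the sigmoid and tanh versions are written in terms of the unit's output, the ReLU version in terms of its pre-activation):

```python
import numpy as np

def sigmoid_deriv(h):
    return h * (1.0 - h)            # close to 0 when the output saturates near 0 or 1

def tanh_deriv(h):
    return 1.0 - h ** 2             # close to 0 when the output saturates near -1 or 1

def relu_deriv(a):
    return (a > 0).astype(float)    # exactly 1 or 0: the gradient passes through or is cut
```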
And beyond all the math, in terms of actually implementing this, you'll 00:26:58.800 |
see three different libraries that essentially allow you to avoid doing it by hand. 00:27:03.220 |
You actually usually don't write down backprop. 00:27:09.180 |
And it turns out there's a way of automatically differentiating 00:27:13.380 |
your loss function and getting gradients for free 00:27:16.220 |
in terms of effort, in terms of programming effort, 00:27:23.940 |
and you'll see essentially three different libraries doing it. 00:27:31.340 |
It works on the flow graph, extended by adding, at the very end, the computation of your loss. 00:27:35.600 |
And then each of these boxes, which are conceptually 00:27:38.020 |
objects that are taking arguments and computing 00:27:40.540 |
a value, you're going to augment them to also have a method that computes gradients. 00:27:47.600 |
You'll often see, actually, this expression being used, bprop. 00:27:54.500 |
Each box is told: what is the gradient of the loss with respect to myself? 00:27:57.300 |
And then it should propagate to its arguments, 00:28:00.340 |
so the things that are its parents in the flow graph, 00:28:02.780 |
the things it takes to compute its own value. 00:28:04.780 |
It's going to propagate to them, using the chain rule, what the gradient of the loss is with respect to them. 00:28:10.780 |
So what this means is that you would start the process 00:28:14.220 |
by initializing, well, the gradient of the loss with respect to itself, which is 1. 00:28:22.260 |
And then it's going to propagate to its argument, what 00:28:28.240 |
is the gradient of the loss with respect to f of x? 00:28:32.040 |
And then you're going to call bprop on this object here. 00:28:36.220 |
It says: I have the gradient of the loss with respect to myself, 00:28:39.300 |
and from this, I can compute what's the gradient of my argument. 00:28:53.020 |
And then I'm going to take the pre-activation here, 00:28:55.140 |
which now knows what is the gradient of the loss with respect to itself. 00:28:59.180 |
It's going to propagate to the weights and the biases, 00:29:04.200 |
informing them of what is the gradient of the loss with respect to them. 00:29:09.080 |
So you're essentially going through the flow graph, but in the opposite direction. 00:29:12.880 |
So the library torch, the basic library torch, 00:29:15.400 |
essentially functions like this quite explicitly. 00:29:18.440 |
You construct-- you chain these elements together. 00:29:21.000 |
And then when you're performing backpropagation, these bprop methods get called. 00:29:25.880 |
And then you have libraries like Torch autograd and Theano, 00:29:31.240 |
which are doing things slightly more sophisticated there. 00:29:38.540 |
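Conceptually, each box looks something like this (an illustrative sketch, not the actual API of any of those libraries):

```python
import numpy as np

class SigmoidBox:
    """One node in the flow graph: fprop computes its value,
    bprop receives dL/d(output) and returns dL/d(argument) via the chain rule."""
    def fprop(self, a):
        self.h = 1.0 / (1.0 + np.exp(-a))
        return self.h

    def bprop(self, grad_output):
        return grad_output * self.h * (1.0 - self.h)   # dL/da = dL/dh * dh/da
```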
OK, so that's a discussion of how you actually 00:29:40.940 |
compute gradients of the loss with respect to the parameters. 00:29:44.620 |
So that's another component we need in stochastic gradient descent. 00:29:50.100 |
Next, the regularizer: one that's often used is the L2 regularization. 00:29:53.500 |
So that's just the sum of the squares of all the weights. 00:29:57.100 |
And the gradient of that is just two times the weight. 00:30:06.840 |
Note that we usually regularize only the weights, not the biases; there's no particularly important reason for that. 00:30:10.960 |
There are much fewer biases, so it seems less important. 00:30:24.400 |
And then finally, and this is also a very important point, 00:30:28.660 |
you have to initialize the parameters before you actually start training. 00:30:37.640 |
So the biases, often we initialize them to 0. 00:30:40.600 |
There are certain exceptions, but for the most part, that's what we do. 00:30:44.720 |
But for the weights, there are a few things we can't do. 00:30:56.160 |
We can't initialize them all to 0 (but it's not a bad exercise to try to figure out why). 00:30:59.240 |
The reason is that essentially, when you do your first pass, 00:31:02.100 |
you're going to get gradients for these weights that are 0. 00:31:05.920 |
So you're going to be stuck at this 0 initialization. 00:31:20.600 |
Similarly, if we initialize all the weights to the same constant, then all the weights coming into a unit within the layer 00:31:24.720 |
are going to have exactly the same gradients, which 00:31:27.640 |
means they're going to be updated exactly the same way, 00:31:30.000 |
which means they're going to stay the same as each other. 00:31:32.280 |
So it's as if you have multiple copies of the same unit. 00:31:38.320 |
So you essentially have to break that initial symmetry 00:31:40.920 |
that you would create if you initialized everything to the same value. 00:31:46.260 |
So what we do is initialize the weights to some randomly generated value. 00:31:52.080 |
There are a few other recipes, but one of them 00:31:54.120 |
is to initialize them from some uniform distribution between minus some small value and plus that value. 00:32:01.440 |
There's a particular choice of interval that's commonly used, that has some theoretical grounding and was derived 00:32:06.880 |
in this paper here by Xavier Glorot and Yoshua 00:32:09.880 |
Bengio, which you can check out for some intuition as to how it was derived. 00:32:14.460 |
But essentially, the weights should be initially random and not too large, 00:32:23.160 |
so that initially the units are not already saturated. 00:32:31.280 |
If they are saturated, you're essentially going to get gradients very close to 0 right from the start. 00:32:35.120 |
So that's the main intuition: have weights that are random and not too large. 00:32:50.720 |
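A small sketch of an initialization in that spirit (the sqrt(6 / (fan_in + fan_out)) range is the uniform recipe from the Glorot & Bengio paper; the function name is mine):

```python
import numpy as np

def init_layer(n_in, n_out, rng=np.random):
    """Biases at 0; weights drawn uniformly from a small, size-dependent range."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    W = rng.uniform(-limit, limit, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b
```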
So now we can take a training set and have the neural net learn from that training set. 00:32:53.920 |
Now, there are other quantities in our neural network, the hyperparameters, 00:32:57.120 |
that we haven't specified how to choose. 00:33:02.560 |
So usually, we're going to have a separate validation set. 00:33:05.240 |
Most people here are familiar with machine learning, so you know the idea. 00:33:19.540 |
For example: what is the weight decay that I'm going to use? 00:33:32.780 |
Or the number of hidden units: maybe I want to try 100, 1,000, and 2,000, say. 00:33:42.420 |
So a grid search would just try all combinations of these values. 00:33:49.820 |
So that means that the more hyperparameters there are, 00:33:53.340 |
the number of configurations you have to try out blows up exponentially. 00:33:59.620 |
So another procedure that is now more and more common, 00:34:03.180 |
which is more practical, is to perform a form of random search. 00:34:07.580 |
In this case, what you do is for each parameter, 00:34:10.020 |
you actually determine a distribution of likely values. 00:34:26.920 |
For the learning rate, say, that might be a log uniform distribution from 0.001 to 0.01, say. 00:34:37.060 |
And then whenever I want a configuration to do an experiment with and get a performance on my validation 00:34:39.940 |
set, I just independently sample from these distributions 00:34:43.140 |
for each hyperparameter to get a full configuration of hyperparameters. 00:34:47.820 |
And then because I have this way of getting one experiment, 00:34:50.720 |
I do it independently for all of my jobs, all of my experiments 00:34:54.620 |
So in this case, if I know I have enough compute power 00:34:58.020 |
to do 50 experiments, I just sample 50 independent samples 00:35:02.120 |
from these distributions for hyperparameters, 00:35:04.100 |
perform these 50 experiments, and I just take the best one. 00:35:09.880 |
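A sketch of what that sampling can look like (the particular hyperparameters and ranges here are just placeholders):

```python
import numpy as np

def sample_config(rng=np.random):
    """Draw one configuration by sampling each hyperparameter independently."""
    return {
        "learning_rate": 10 ** rng.uniform(-3, -2),       # log-uniform between 0.001 and 0.01
        "n_hidden": int(rng.choice([100, 1000, 2000])),
        "weight_decay": 10 ** rng.uniform(-5, -2),
    }

# With a budget of 50 experiments: draw 50 independent configurations,
# run them all, and keep whichever does best on the validation set.
configs = [sample_config() for _ in range(50)]
```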
And one nice thing is that, unlike grid search, there are never any holes in the grid. 00:35:12.620 |
That is, you just specify how many experiments you do. 00:35:15.260 |
If one of your jobs died, well, you just have one less. 00:35:21.900 |
And also, one reason why it's particularly useful, 00:35:24.700 |
this approach, is that if you have a specific value in grid 00:35:29.380 |
search for one of the hyperparameters that just makes 00:35:32.100 |
the experiment not work at all-- so with learning rates that are too large, 00:35:38.620 |
it's quite possible that convergence of the optimization just fails-- 00:35:47.060 |
then all the experiments in the grid that use that specific value of the learning rate are wasted. 00:35:52.340 |
And you don't really get this sort of big waste of computation 00:35:56.220 |
if you do a random search, because most likely every experiment uses a different value, since you draw 00:36:01.660 |
samples, say, from a uniform distribution over some range. 00:36:12.420 |
There are also more advanced approaches, like methods based on machine learning, Bayesian 00:36:15.220 |
optimization, or what's sometimes known as sequential model-based optimization, 00:36:21.300 |
that work a bit better than random search. 00:36:27.780 |
So one thing to do, if you think you have an issue finding good hyperparameters, 00:36:29.940 |
is to investigate some of these more advanced methods. 00:36:34.380 |
Now, you do this for most of your hyperparameters, 00:36:37.180 |
but for the number of epochs, the number of times 00:36:40.060 |
you go through all of your examples in your training set, 00:36:44.700 |
what we usually do is not grid search or random search, but early stopping. 00:36:53.460 |
The observation is that training a neural net for 10 epochs is a subset of training a neural net 00:36:56.700 |
with all the other hyperparameters kept constant for more epochs, 00:37:10.300 |
so we can simply track what is the performance on the validation set after each epoch. 00:37:14.740 |
And what we will typically see is the training error 00:37:17.100 |
will go down, but the validation set performance will go down only at first. 00:37:31.260 |
And since the training curve cannot go below, usually, 00:37:34.460 |
some bound, then eventually the validation set performance will stop improving and start getting worse. 00:37:44.080 |
So what early stopping does is that if we reach a point where the validation set 00:37:46.260 |
performance hasn't improved for some certain number 00:37:49.020 |
of iterations, which we refer to as the look ahead, 00:37:54.660 |
we stop and keep the model at the epoch that had the best performance overall on the validation set. 00:37:58.700 |
So I have now a very cheap way of actually getting 00:38:01.460 |
the number of iterations or the number of epochs that I should train for. 00:38:09.500 |
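A sketch of that early-stopping loop (here `train_one_epoch` and `validation_error` stand in for your own training and evaluation code):

```python
def early_stopping(train_one_epoch, validation_error, max_epochs=200, lookahead=10):
    """Stop once the validation error hasn't improved for `lookahead` epochs,
    and report the epoch that had the best validation error."""
    best_error, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_error, best_epoch, since_best = error, epoch, 0
            # in practice you would also save a checkpoint of the parameters here
        else:
            since_best += 1
            if since_best >= lookahead:
                break
    return best_epoch, best_error
```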
So it's always useful to normalize your data. 00:38:13.060 |
It will often have the effect of speeding up training. 00:38:21.940 |
So what I mean by that is just subtract for each dimension 00:38:24.940 |
what is the average in the training set of that dimension, and then divide by the standard deviation of that dimension. 00:38:41.660 |
Another trick is to decay the learning rate. One schedule that's very simple is to start with a large learning 00:38:45.300 |
rate and then track the performance on the validation set. 00:38:48.140 |
And once the performance on the validation set stops improving, 00:38:50.980 |
you decrease your learning rate by some ratio. 00:38:54.780 |
And then you continue training for some time. 00:38:56.980 |
Hopefully, the validation set performance starts improving. 00:39:00.620 |
And then at some point, it stops improving, and then you stop. 00:39:11.700 |
And that can, again, work better than having a very small 00:39:14.900 |
learning rate and waiting for a longer time. 00:39:26.260 |
So I've presented stochastic gradient descent for training neural nets as based on a single example at a time. 00:39:31.300 |
But in practice, we actually use what's called mini-batches, small sets of, say, 64 or 128 examples. 00:39:40.580 |
And then we take the average of the loss of all these examples in the mini-batch, 00:39:45.020 |
and we compute the gradient of that average loss. 00:39:49.880 |
The reason why we do this is that it turns out 00:39:53.100 |
that you can very efficiently implement the forward pass 00:39:56.780 |
over all of these 64, 128 examples in my mini-batch 00:40:01.260 |
in one pass by, instead of doing several matrix-vector multiplications, 00:40:09.760 |
doing matrix-matrix multiplications, which are faster than doing multiple matrix-vector multiplications. 00:40:17.480 |
So the choice of the number of examples in your mini-batch is mostly optimized for speed, in terms of how quickly training 00:40:21.080 |
will proceed. 00:40:29.560 |
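In code, the change is just that the layer works on a matrix of examples instead of a single vector (a sketch; `X` holds one example per row):

```python
import numpy as np

def layer_forward_batch(X, W, b, g):
    """X has shape (batch_size, n_in); one matrix-matrix product computes
    the pre-activations of the whole mini-batch at once."""
    A = X @ W.T + b          # shape (batch_size, n_out)
    return g(A)
```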
Another trick is to use momentum. That is, instead of using, as the descent direction, just the current gradient, 00:40:35.220 |
I'm actually going to track a descent direction which is a running average of the gradients I've seen so far. 00:40:50.200 |
And beta now is a hyperparameter you have to optimize. 00:40:52.640 |
So what this does is, if all the update directions agree 00:40:56.840 |
across multiple updates, then it will start picking up momentum 00:41:00.720 |
and actually make bigger steps in those directions. 00:41:05.840 |
And then there are multiple, even more advanced methods. 00:41:19.440 |
In AdaGrad, for instance, the update is scaled for each dimension, so for each weight and each bias. 00:41:24.440 |
It's going to be scaled by what is the square root 00:41:28.800 |
of the cumulative sum of the squared gradients. 00:41:31.920 |
So what I track is I take my gradient vector at each step, 00:41:35.360 |
I do an element-wise square of all the dimensions, and I add that to a running sum. 00:41:44.920 |
And then for my descent direction, I take the gradient and divide it element-wise by the square root of that cumulative sum. 00:41:54.720 |
There's also RMSProp, which is essentially like AdaGrad, 00:41:59.640 |
except that instead of a cumulative sum, we're going to do an exponential moving average. 00:42:02.080 |
So we take the previous value times some factor 00:42:04.760 |
plus 1 minus this factor times the current squared gradient. 00:42:12.720 |
And then there's Adam, which is essentially a combination of RMSProp with momentum, which 00:42:17.960 |
I won't detail, but that's another method that's often actually implemented 00:42:21.440 |
in these different software packages and that people seem to use a lot. 00:42:36.720 |
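Sketches of two of the update directions just described, momentum and RMSProp (the hyperparameter names and default values are illustrative, not from the talk):

```python
import numpy as np

def momentum_direction(grad, velocity, alpha=0.01, beta=0.9):
    """Running average of past gradients: steps grow when successive updates agree."""
    velocity = beta * velocity - alpha * grad
    return velocity                      # add this to the parameters

def rmsprop_direction(grad, avg_sq, alpha=0.001, decay=0.9, eps=1e-8):
    """Scale each dimension by an exponential moving average of squared gradients
    (AdaGrad would keep a cumulative sum instead of a moving average)."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    return -alpha * grad / (np.sqrt(avg_sq) + eps), avg_sq
```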
Most of the time, you can get gradients without difficulty using the current tools that 00:42:38.680 |
are available in Torch or TensorFlow or Theano. 00:42:42.840 |
But sometimes you might need to implement certain gradients for a new module yourself. 00:42:49.520 |
If you do this, you should check that you've implemented them correctly. 00:42:53.760 |
And one way of doing that is to actually compare with a finite difference estimate. 00:43:03.160 |
So to a given parameter you add some very small epsilon value, say 10 to the minus 6, 00:43:07.220 |
and you compute what is the output of your module. 00:43:12.360 |
You compare that with the same computation but where you've subtracted the small quantity, 00:43:17.400 |
and divide the difference by 2 epsilon. So if epsilon converges to 0, then you actually 00:43:21.960 |
get the true partial derivative. But if it's just small, it's going to be an approximation, 00:43:26.600 |
which will be very close to your analytic gradient if your implementation is correct. 00:43:30.840 |
So you should definitely do that if you've actually 00:43:33.380 |
implemented some of the gradients in your code. 00:43:37.900 |
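A minimal finite-difference checker along those lines (assumes `theta` is a flat numpy array and `loss_fn(theta)` returns a scalar):

```python
import numpy as np

def finite_difference_grad(loss_fn, theta, eps=1e-6):
    """Estimate dL/dtheta[i] as (L(theta[i]+eps) - L(theta[i]-eps)) / (2*eps),
    to be compared against the analytic gradient."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta[i] += eps
        plus = loss_fn(theta)
        theta[i] -= 2 * eps
        minus = loss_fn(theta)
        theta[i] += eps                    # restore the original value
        approx[i] = (plus - minus) / (2 * eps)
    return approx
```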
Another useful check is to actually do a very small experiment on a small data set 00:43:47.760 |
first. So just take a random subset of, say, 50 examples. 00:43:52.000 |
Actually, just make sure that your code can overfit 00:43:54.640 |
to that data, can essentially classify it perfectly, 00:43:58.600 |
given enough capacity that you would think it should get it. 00:44:03.040 |
So if it's not the case, then there's a few things that might be going wrong. 00:44:09.920 |
It might be that the units are already saturated initially, 00:44:14.800 |
and nothing is happening because some of the gradients on some of the weights are essentially 0. 00:44:19.040 |
So you might want to check your initialization. 00:44:23.760 |
Or maybe you're using a model you implemented gradients for, 00:44:26.040 |
and maybe your gradients are not properly implemented. 00:44:28.920 |
Maybe you haven't normalized your input, which 00:44:33.040 |
makes it harder for stochastic gradient descent to work successfully. 00:44:40.080 |
Then you should also consider trying smaller learning rates. 00:44:44.940 |
This also gives you some idea of the magnitude of the learning rate to use when you're 00:44:53.060 |
ready to do a full experiment on a larger data set. 00:45:03.440 |
The thing with stochastic gradient descent is that it's a great algorithm that's very bug resistant. 00:45:06.800 |
You will potentially see some learning happening even when there is a bug, 00:45:17.800 |
so bugs can go unnoticed. It's fun when code is somewhat bug resistant, but here it can hide problems. 00:45:25.160 |
So do both, gradient checking and a small experiment. 00:45:35.040 |
Now let's turn to what you'll be learning quite a bit about in the next two days. 00:45:40.280 |
That is, the specific case for deep learning. 00:45:43.880 |
So I've already told you that if I have a neural net with enough 00:45:47.760 |
hidden units, theoretically, I can potentially 00:45:49.880 |
represent pretty much any function, any classification function. 00:45:59.000 |
So why would we want many hidden layers? There are a few motivations. The first one is taken directly from our own brains. 00:46:02.160 |
So we know in the visual cortex that the light that 00:46:05.320 |
hits our retina eventually goes through several regions of the brain, 00:46:12.520 |
where you have units that are-- or neurons that are essentially tuned to simple patterns first, 00:46:20.480 |
and then, as you go deeper, slightly more complex patterns that the units are tuned for, 00:46:25.680 |
until eventually you have neurons that are specific to certain objects. 00:46:30.960 |
And that's also what we want in an artificial, say, vision system. 00:46:37.760 |
We'd like to have a first layer that detects simple edges, 00:46:41.080 |
and then another layer that perhaps puts these edges 00:46:43.520 |
together, detecting slightly more complex things. 00:46:47.880 |
And then eventually have a layer that combines 00:46:49.880 |
these slightly less abstract or more abstract units into detectors for whole objects. 00:47:06.360 |
There's also a more theoretical motivation, based on studying Boolean functions, functions whose input you 00:47:10.160 |
can think of as a vector of just zeros and ones. 00:47:12.640 |
And you could show that there are certain functions for which, 00:47:17.080 |
if you had essentially a Boolean neural network, a logical circuit, 00:47:22.840 |
and you restricted the number of layers of that circuit, 00:47:26.280 |
then, in this case, 00:47:28.480 |
to represent certain Boolean functions exactly, 00:47:31.160 |
you would need an exponential number of units in a layer. 00:47:35.160 |
Whereas if you allowed yourself to have multiple layers, 00:47:37.160 |
then you could represent these functions more compactly. 00:47:42.960 |
So the intuition is that deeper networks can represent fairly complex functions in a more compact way. 00:47:48.520 |
And then there's the reason that they just work. 00:47:53.920 |
Deep learning has had great success in speech recognition, where it's 00:47:56.440 |
essentially revolutionized the field, where everyone's using deep neural networks now, 00:48:00.940 |
and same thing for visual object recognition, 00:48:05.400 |
where they've become kind of the method of choice for identifying objects in images. 00:48:16.640 |
This wasn't always the case, though, even though backprop was invented essentially in the 1980s. 00:48:22.840 |
So it turns out training deep neural networks is not that easy. 00:48:26.040 |
There are a few hurdles that one can be confronted with. 00:48:30.000 |
I've already mentioned one of the issues, which 00:48:32.120 |
is that some of the gradients might be fading as you go from the top layers to the bottom layers, 00:48:37.280 |
because we keep multiplying by the derivative of the activation function. 00:48:41.960 |
It could be that the lower layers get very small gradients, 00:48:44.560 |
are barely moving, and are exploring the space of correct features very slowly. 00:48:57.080 |
Or it could be that with deeper neural nets or bigger neural nets, we have a lot of capacity. 00:49:02.200 |
So perhaps sometimes we're actually overfitting. 00:49:04.240 |
We're in a situation where the set of all the functions that we 00:49:07.680 |
can represent with that neural net, represented 00:49:10.840 |
by this gray area here, does include, yes, something 00:49:19.780 |
close to the true classifying function, but the function we actually learn from limited data, compared to the real system 00:49:22.840 |
that I'd like to have, is going to be very different. 00:49:25.720 |
So in this case, I'm essentially overfitting, 00:49:35.880 |
So depending on the situation, one problem or the other is observed, overfitting or underfitting. 00:49:43.480 |
People have developed tools for fighting both situations. 00:49:46.140 |
And I'm going to rapidly touch a few of those, which you will see again over the next days. 00:50:03.600 |
If you're underfitting, and this is essentially why you're progressing very slowly 00:50:05.800 |
when you're training, well, if you're using GPUs 00:50:08.220 |
and are able to do more iterations over the same amount of time, that helps, 00:50:18.760 |
and this is partly why GPUs have been so game-changing. 00:50:22.520 |
Or you can use just better optimization methods also. 00:50:38.820 |
If I have time, I'll talk a little bit about that. 00:50:40.940 |
And there are other methods you might have heard of. 00:50:44.840 |
So I'll try to touch at least two methods that essentially address these situations. 00:50:51.360 |
So the first one that I'll talk about is dropout. 00:50:57.920 |
So the idea of if our neural net is essentially overfitting, 00:51:01.120 |
so it's too good at training on the training set, 00:51:04.520 |
well, we're essentially going to cripple training. 00:51:06.880 |
We're going to make it harder to fit the training set. 00:51:09.240 |
And the way we're going to do that in dropout is by stochastically removing hidden units. 00:51:16.120 |
So for each hidden unit, before we do a forward pass, 00:51:24.800 |
with probability half we'll multiply its activation by 0, and with probability half, we'll multiply it by 1. 00:51:27.500 |
So what this means is that if a unit is multiplied by 0, 00:51:30.840 |
it's effectively not in the neural net anymore. 00:51:37.880 |
So that means that in a layer, a unit cannot rely anymore on the presence of the other units in that layer. 00:51:52.360 |
And that was partly the motivation behind dropout: preventing units from co-adapting. 00:52:15.060 |
And in terms of how it impacts an implementation of backprop, 00:52:21.600 |
I just sample my binary masks for all my layers. 00:52:31.580 |
In the forward pass, I'm just multiplying by this binary mask here. 00:52:41.160 |
And in the backward pass, I also multiply by the mask when I get my gradient on the pre-activation. 00:52:43.800 |
And also, don't forget that the activations are now different. 00:52:47.080 |
They actually include the mask in my notation. 00:52:49.920 |
So it's a very simple change in the forward and backward pass 00:52:54.280 |
And also, another thing that I should emphasize 00:52:56.280 |
is that the mask is being resampled for every example. 00:52:59.600 |
So before you do a forward pass, you resample the mask. 00:53:03.280 |
You don't just sample it once and then use it the whole time. 00:53:07.560 |
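A sketch of the forward and backward change (with keep probability one half, as in the talk; the names are my own):

```python
import numpy as np

def dropout_forward(h, keep_prob=0.5, rng=np.random):
    """Resample a fresh binary mask for every forward pass and multiply it in."""
    mask = (rng.rand(*h.shape) < keep_prob).astype(float)
    return h * mask, mask

def dropout_backward(grad_h, mask):
    """The same mask multiplies the gradient on the way back."""
    return grad_h * mask
```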
And then at test time, because we don't really 00:53:09.560 |
like a model that sort of randomly changes its output, 00:53:13.600 |
which it will if we stochastically change the masks, what we do instead is remove the masks and multiply the activations by one half. 00:53:29.200 |
We can actually show that if you have a neural net 00:53:31.360 |
with a single hidden layer, doing this transformation is exactly 00:53:36.760 |
equivalent to doing a geometric average of all 00:53:39.640 |
the possible neural networks with all the different binary masks. 00:53:43.360 |
So one way of thinking about dropout 00:53:46.600 |
in the single layer case is that it's kind of an ensemble of all these networks, which 00:53:52.760 |
are all sharing the same weights but have different masks. 00:53:56.040 |
That intuition, though, doesn't transfer for deep neural nets 00:53:59.520 |
in the sense that you cannot show this result. 00:54:01.480 |
It really only applies to a single hidden layer. 00:54:10.640 |
So often, we tend to see that training a network 00:54:19.980 |
want to learn more about different variations 00:54:23.000 |
And I probably won't talk about unsupervised pre-training 00:54:28.920 |
for lack of time, but I'll talk about another thing 00:54:33.440 |
that's implemented in these different packages, which is batch normalization. 00:54:37.360 |
Batch normalization is kind of interesting in the sense that it helps with underfitting. 00:54:42.840 |
That is, certain networks that would otherwise underfit train better with it. 00:54:51.440 |
And it's also been observed that if you use batch normalization, dropout is not as useful. 00:54:56.720 |
That suggests that perhaps batch normalization is also acting a bit like a regularizer. 00:55:05.160 |
So you can have a regularizer that also, it turns out, helps optimization. 00:55:14.760 |
The idea is much like I've suggested: normalizing your inputs actually helps training. 00:55:21.120 |
Well, how about we also normalize all the hidden layers? 00:55:28.960 |
The difference is that I can compute the mean and the standard deviations 00:55:31.880 |
of my inputs once and for all because they're constant, 00:55:36.640 |
but the hidden layers keep changing because I'm training these parameters. 00:55:39.080 |
So the mean and the standard deviation of my units keep changing too, and it would be very expensive 00:55:45.920 |
if every time I did an update on my parameters, 00:55:48.120 |
I recomputed the means and the standard deviations over the whole training set. 00:55:52.400 |
So batch normalization addresses some of these issues as follows. 00:55:56.360 |
So the way it works is first, the normalization 00:55:59.520 |
is going to be applied on actually the pre-activation. 00:56:08.600 |
Second, since we don't want to compute means over the full training set, 00:56:12.720 |
I'm actually going to compute it on each mini-batch. 00:56:18.120 |
I'm going to take my small mini-batch of 64, 128 examples, 00:56:23.440 |
and I'm going to compute my means and standard deviations on that mini-batch. 00:56:28.920 |
Third, backpropagation is going to take into account the normalization: 00:56:33.280 |
we backpropagate through the computation of the mean and the standard deviation as well. 00:56:41.540 |
And then at test time, we use the global mean and global standard deviation. 00:56:45.880 |
That is, we do a full pass over the whole training set and get the means and standard deviations of all the units. 00:56:52.000 |
So that's essentially the pseudocode for that: for each unit, 00:57:05.240 |
I would compute what is the average for that unit's 00:57:08.280 |
pre-activation across my examples in my mini-batch, 00:57:11.400 |
compute my variance, and then subtract the mean 00:57:15.000 |
and divide by the square root of the variance, plus a small constant for numerical stability. 00:57:22.160 |
And then another thing is that actually batch normalization doesn't stop at the normalization. 00:57:30.400 |
It then actually performs a linear transformation on it, 00:57:37.160 |
with a scale and a shift parameter, which is going to be trained by gradient descent. 00:57:47.960 |
And part of the reason is that if I'm subtracting by the mean, 00:57:51.360 |
then, since each of these units has a bias parameter, 00:57:54.620 |
that bias essentially gets cancelled out by the subtraction. 00:58:02.720 |
So I have to add a bias back, but after the batch normalization, and that's what the shift parameter does. 00:58:13.760 |
So batch normalization adds a few parameters. 00:58:18.120 |
All right, and as I said, I'm just going to skip over this. 00:58:20.640 |
And I'm not showing what the gradients are when you backpropagate through the normalization. 00:58:24.760 |
It's described in the paper if you want to see the gradients. 00:58:38.080 |
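A sketch of the forward computation just described, on one mini-batch of pre-activations (gamma is the scale, beta the shift, and eps the small constant mentioned above):

```python
import numpy as np

def batchnorm_forward(A, gamma, beta, eps=1e-5):
    """A has shape (batch_size, n_units): normalize each unit over the mini-batch,
    then apply the learned linear transformation gamma * A_hat + beta."""
    mean = A.mean(axis=0)
    var = A.var(axis=0)
    A_hat = (A - mean) / np.sqrt(var + eps)
    return gamma * A_hat + beta
```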
If you actually want to learn about unsupervised 00:58:40.160 |
pre-training and why it works, I have videos on that. 00:59:00.520 |
So feel free to either go for a break or ask questions to Hugo. 00:59:10.960 |
If anyone has questions, you can go to the mic. 00:59:28.440 |
Yeah, so the first thing is that it's observed in practice. 00:59:37.200 |
because you have the non-linearity at 0 below. 00:59:40.520 |
So it means that units are going to be potentially 00:59:43.560 |
exactly sparse, essentially absent from the hidden layer. 00:59:49.480 |
There are a few reasons to explain why you get sparsity. 00:59:54.400 |
It turns out that this process of doing a linear 00:59:56.900 |
transformation followed by the ReLU activation function is similar to what 01:00:02.160 |
you would do when you're optimizing for sparse codes 01:00:04.840 |
in a sparse coding model, if you know about sparse coding. 01:00:07.700 |
There's essentially an optimization method for inferring sparse codes, 01:00:17.580 |
and it's mostly a sequence of linear transformations 01:00:20.920 |
followed by this ReLU-like activation function. 01:00:27.600 |
Otherwise, I don't know of a solid explanation 01:00:31.280 |
for why that is beyond what's observed in practice.