
Foundations of Deep Learning (Hugo Larochelle, Twitter)


Chapters

0:00 Intro
1:00 FOUNDATIONS OF DEEP LEARNING
9:45 CAPACITY OF NEURAL NETWORK
11:02 MACHINE LEARNING
15:45 LOSS FUNCTION
19:09 BACKPROPAGATION
24:59 ACTIVATION FUNCTION
27:30 FLOW GRAPH
29:38 REGULARIZATION
30:24 INITIALIZATION
33:26 MODEL SELECTION
36:33 KNOWING WHEN TO STOP
38:07 OTHER TRICKS OF THE TRADE
42:27 GRADIENT CHECKING
43:36 DEBUGGING ON SMALL DATASET
50:54 DROPOUT
55:10 BATCH NORMALIZATION
58:34 UNSUPERVISED PRE-TRAINING
58:37 NEURAL NETWORK ONLINE COURSE

Transcript

That's good. All right, cool. So yeah, so I was asked to give this presentation on the foundations of deep learning, which is mostly going over basic feedforward neural networks and motivating a little bit deep learning and some of the more recent developments and some of the topics that you'll see across the next two days.

So as Andrew mentioned, I have just an hour. So I'm going to go fairly quickly on a lot of these things, which I think would mostly be fine if you're familiar enough with some machine learning and a little bit about neural nets. But if you'd like to go into some of the more specific details, you can go check out my online lectures on YouTube.

It's now taught by a much younger version of myself. And so just search for Hugo Larochelle. And I am not the guy doing a bunch of skateboarding. I'm the geek teaching about neural nets. So go check those out if you want more details. But so what I'll cover today is-- I'll start with just describing and laying out the notation on feedforward neural networks, that is, models that take an input vector x-- that might be an image or some text-- and produces an output f of x.

So I'll just describe forward propagation and the different types of units and the type of functions we can represent with those. And then I'll talk about how we actually train neural nets, describing things like loss functions, backpropagation that allows us to get a gradient for training with stochastic gradient descent, and mention a few tricks of the trade, so some of the things we do in practice to successfully train neural nets.

And then I'll end by talking about some developments that are specifically useful in the context of deep learning, that is, neural networks with several hidden layers, developments that came out after the beginning of deep learning, say, in 2006. That is, things like dropout, batch normalization, and if I have some time, unsupervised pre-training.

So let's get started. And just talk about, assuming we have some neural network, how do they actually function? How do they make predictions? So let me lay down the notation. So a multilayer feedforward neural network is a model that takes as input some vector x, which I'm representing here with a different node for each of the dimensions in my input vector.

So each dimension is essentially a unit in that neural network. And then it eventually produces, at its output layer, an output. And we'll focus on classification mostly. So you'd have multiple units here. And each unit would correspond to one of the potential classes in which we would want to classify our input.

So if we're identifying digits in handwritten character images, and say we're focusing on digits, you'd have 10 digits, 0 through 9. So you'd have 10 output units. And to produce an output, the neural net will go through a series of hidden layers.

And those will be essentially the components that introduce non-linearity, which allows us to represent very sophisticated types of classification functions. So if we have L hidden layers, the way we compute all the layers in our neural net is as follows. We first start by computing what I'm going to call a pre-activation.

I'm going to denote that A. And I'm going to index the layers by k. So A k is just the pre-activation at layer k. And that is simply going to be a linear transformation of the previous layer. So I'm going to denote h k as the activation of the layer.

And by default, I'll assume that layer 0 is going to be the input. And so using that notation, the pre-activation at layer k is going to correspond to taking the activation at the previous layer, k minus 1, multiplying it by a matrix, Wk. Those are the parameters of the layer.

Those essentially correspond to the connections between the units of adjacent layers. And I'm going to add a bias vector. That's another parameter in my layer. So that gives me the pre-activation. And then next, I'm going to get a hidden layer activation by applying an activation function. This will introduce some non-linearity in the model.

So I'm going to call that function g. And we'll go over a few choices. So we have four common choices for the activation function. And so I do this from layer 1 to layer L. And when it comes to the output layer, I'll also compute a pre-activation by performing a linear transformation.

But then I'll usually apply a different activation function depending on the problem I'm trying to solve. So having said that, let's go to some of the choices for the activation function. So some of the activation functions you'll see. One common one is this sigmoid activation function. It's this function here.

It's just 1 divided by 1 plus the exponential of minus the pre-activation. The shape of this function, you can focus on that, is this here. It takes the pre-activation, which can vary from minus infinity to plus infinity. And it squashes this between 0 and 1. So it's bounded below by 0 and above by 1.

So it's a function that saturates if you have very large magnitude positive or negative pre-activations. Another common choice is the hyperbolic tangent or tanh activation function. This picture here. So it squashes everything. But instead of being between 0 and 1, it's between minus 1 and 1. And one that's become quite popular in neural nets is what's known as the rectified linear activation function.

Or in papers, you will see the ReLU unit, which refers to the use of this activation function. So this one is different from the others in that it's not bounded above, but it is bounded below. And it will output exactly 0 if the pre-activation is negative. So those are the choices of activation functions for the hidden layers.
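
Just to make those concrete, here is a minimal sketch of the three hidden-layer activation functions in NumPy (the function names are mine, purely for illustration):

```python
import numpy as np

def sigmoid(a):
    # squashes the pre-activation into (0, 1); saturates for large positive or negative a
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # squashes the pre-activation into (-1, 1); also saturates for large |a|
    return np.tanh(a)

def relu(a):
    # bounded below by 0, not bounded above; outputs exactly 0 for negative pre-activations
    return np.maximum(0.0, a)
```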

And for the output layer, if we're performing classification, as I said, in our output layer, we will have as many units as there are classes in which an input could belong. And what we'd like is potentially-- and what we often do is interpret each unit's activation as the probability, according to the neural network, that the input belongs to the corresponding class, that its label y is the corresponding class C.

So C would be like the index of that unit in the output layer. So we need an activation function that produces probabilities, produces a multinomial distribution over all the different classes. And the activation function we use for that is known as the softmax activation function. It is simply as follows.

You take your pre-activations, and you exponentiate them. So that's going to give us positive numbers. And then we divide each of the exponentiated pre-activations by the sum of all the exponentiated pre-activations. So because I'm normalizing this way, it means that all my values in my output layer are going to sum to 1.

And they're positive because I took the exponential. So I can interpret that as a multinomial distribution over the choice of all the C different classes. So that's what I'll use as the activation function at the output layer. And now, beyond the math, conceptually and also in the way we're going to program neural networks, what we'll often do is implement all these different operations, the linear transformations, the different types of activation functions, as objects, objects that take arguments.

And the arguments would essentially be what other things are being combined to produce the next value. So for instance, we would have an object that might correspond to the computation of pre-activation, which would take as argument what is the weight matrix and the bias vector for that layer and take some layer to transform.

And this object would compute its value by applying the linear transformation. And then we might have objects that correspond to specific activation functions, so like a sigmoid object or a tanh object or a ReLU object. And we just combine these objects together, chain them into what ends up being a graph, which I refer to as a flow graph, that represents the computation done when you do a forward pass in your neural network up until you reach the output layer.

So I mention it now because you'll see that the different software libraries presented over the weekend will essentially exploit some of that representation of the computation in neural nets. It will also be handy for computing gradients, which I'll talk about in a few minutes. And so that's how we perform predictions in neural networks.

So we get an input. We eventually reach an output layer that gives us a distribution over classes if we're performing classification. If I want to actually classify, I would just assign the class corresponding to the unit that has the highest activation, which corresponds to choosing the class that has the highest probability according to the neural net.
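
Putting the whole forward pass together, here is a minimal sketch, assuming the weights and biases are given as lists of NumPy arrays (the helper names and the choice of tanh as the default hidden activation are my own, just for illustration):

```python
import numpy as np

def softmax(a):
    # exponentiate and normalize; subtracting the max is a standard numerical-stability trick
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def forward(x, weights, biases, g=np.tanh):
    """Forward propagation: weights[k], biases[k] are the parameters of layer k + 1."""
    h = x  # h^0 is the input
    for W, b in zip(weights[:-1], biases[:-1]):
        a = W @ h + b      # pre-activation a^k = W^k h^(k-1) + b^k
        h = g(a)           # activation h^k = g(a^k)
    a_out = weights[-1] @ h + biases[-1]   # output pre-activation
    return softmax(a_out)                  # distribution over the classes

def predict(x, weights, biases):
    # classify into the class with the highest probability
    return int(np.argmax(forward(x, weights, biases)))
```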

But then you might ask the question, OK, what kind of problems can we solve with neural networks? Or more technically, what kind of functions can we represent mapping from some input x into some arbitrary output? And so if you go look at my videos, I try to give more intuition as to why we have this result here.

But essentially, if we have a single hidden layer neural network, it's been shown that with a linear output, we can approximate any continuous function arbitrarily well as long as we have enough hidden units. So that is, there's a value for these biases and these weights such that any continuous function, I can actually represent it as well as I want.

I just need to add enough hidden units. So this result applies if you use non-linear activation functions like the sigmoid and tanh. So as I said, in my videos, if you want a bit more intuition as to why that would be, you can go check that out. But that's a really nice result.

It means that by focusing on this family of machine learning models that are neural networks, I can potentially represent pretty much any kind of classification function. However, this result does not tell us how to actually find the weights and the bias values such that I can represent a given function.

It essentially doesn't tell us how to train a neural network. And so that's what we'll discuss next. So let's talk about that. How do we actually, from a data set, train a neural network to perform good classification for that problem? So what we'll typically do is use a framework that's very generic in machine learning, known as empirical risk minimization, or structural risk minimization if you're using regularization.

So this framework essentially transforms the problem of learning into a problem of optimization. So what we'll do is that we'll first choose a loss function that I'm denoting as l. And the loss function compares the output of my model, so the output layer of my neural network, with the actual target.

So I'm indexing with a superscript t here, which is essentially the index over all my different examples in my training set. And so my loss function will tell me, is this output good or bad, given that the label is actually y? And what I'll do is I'll also define a regularizer.

So theta here is-- you can think of it as just a concatenation of all my biases and all of my weights in my neural net. So those are all the parameters of my neural network. And the regularizer will essentially penalize certain values of these weights. So as I'll talk more specifically later on, for instance, you might want to have your weights not be too far from 0.

That's a frequent intuition that we implement with a regularizer. And so the optimization problem that we'll try to solve when learning is to minimize the average loss of my neural network over my training examples, so summing over all training examples (I have capital T examples), plus some weight here that's known as the weight decay, some hyperparameter lambda, times my regularizer.

So in other words, I'm going to try to make my loss on my training set as small as possible over all the training examples and also try to satisfy my regularizer as much as possible. And so now we have this optimization problem. And learning will just correspond to trying to solve this problem.
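
Written out (with the regularizer denoted Omega, my choice of symbol), the optimization problem just described is:

$$ \arg\min_{\theta} \; \frac{1}{T} \sum_{t=1}^{T} l\big(f(x^{(t)}; \theta),\, y^{(t)}\big) \;+\; \lambda\, \Omega(\theta) $$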

So finding this arg min here over my weights and my biases. And if I want to do this, I can just invoke some optimization procedure from the optimization community. And the one algorithm that you'll see constantly in deep learning is stochastic gradient descent. This is the optimization algorithm that we'll often use for training neural networks.

So SGD, stochastic gradient descent, functions as follows. You first initialize all of your parameters. That is finding initial values for all my weight matrices and all of my biases. And then for a certain number of epochs-- so an epoch will be a full pass over all my examples. That's what I'll call an epoch.

So for a certain number of full iterations over my training set, I'll draw each training example, so a pair: input x, target y. And then I'll compute what is the gradient of my loss with respect to my parameters. All of my parameters, all my weights, and all my biases.

That's what this notation here means: nabla for the gradient of the loss function, and here I'm indexing with respect to which parameter I want the gradient. So I'm going to compute what is the gradient of my loss function with respect to my parameters, plus lambda times the gradient of my regularizer as well.

And then I'm going to get a direction in which I should move my parameters. Since the gradient tells me how to increase the loss, I want to go in the opposite direction and decrease it. So my direction will be the opposite. So that's why I have a minus here.

And so this delta is going to be the direction in which I'll move my parameters by taking a step. And the step is just a step size alpha, which is often referred to as a learning rate, times my direction, which I just add to my current values of my parameters, my biases and my weights.

And that's going to give me my new value for all of my parameters. And I iterate like that, going over all pairs x, y, computing my gradient, taking a step in the opposite direction, and then doing that several times. So that's how stochastic gradient descent works. And that's essentially the learning procedure.
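
Here is a minimal sketch of that loop, assuming placeholder functions `grad_loss` and `grad_reg` that return the per-example loss gradients and the regularizer gradients for each parameter (these names are mine, not from any particular library):

```python
def sgd(params, grad_loss, grad_reg, train_set, alpha=0.01, lam=0.0, n_epochs=10):
    """params: list of NumPy arrays (all the weights and biases), updated in place."""
    for epoch in range(n_epochs):             # one epoch = one full pass over the training set
        for x, y in train_set:                # each training pair (input x, target y)
            grads = grad_loss(params, x, y)   # gradient of the loss w.r.t. each parameter
            regs = grad_reg(params)           # gradient of the regularizer
            for p, g, r in zip(params, grads, regs):
                delta = -(g + lam * r)        # descent direction: opposite of the gradient
                p += alpha * delta            # step of size alpha (the learning rate)
    return params
```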

It's represented by this procedure. So in this algorithm, there are a few things we need to specify to be able to implement it and execute it. We need a loss function, a choice for the loss function. We need a procedure that's efficient for computing the gradient of the loss with respect to my parameters.

We need to choose a regularizer if we want one. And we need a way of initializing my parameters. So next, what I'll do is I'll go through each of these four different things we need to choose before actually being able to execute stochastic gradient descent. So first, the loss function.

So as I said, we will interpret the output layer as assigning probabilities to each potential class in which I can classify my input x. Well, in this case, something that would be natural is to try to maximize the probability of the correct class, the actual class to which my example x t belongs.

I'd like to increase the value of the probability assigned, or computed, by my neural network. And so because we set up the problem as one in which we have a loss that we minimize, instead of maximizing the probability, what we'll actually do is minimize the negative log probability, so the negative log-likelihood of assigning x to the correct class y.

So this is represented here. So given my output layer and the true label y, my loss will be minus the log of the probability of y according to my neural net. And that would be, well, take my output layer and look at the unit, so index the unit corresponding to the correct class.

So that's why I'm indexing by y here. We take the log because numerically it turns out to be more stable. We get nicer-looking gradients. And sometimes in certain software packages, instead of talking about the negative log-likelihood or log probability, you'll see it referred to as the cross-entropy. And that's because you can think of this as performing a sum over all possible classes.

And then for each class, checking, well, is this potential class the target class? So I have an indicator function that is 1 if y is equal to c, so if my iterator class c is actually equal to the real class. I'm going to multiply that by the log of the probability actually assigned to that class c.

And this function here, so this expression here, is like a cross-entropy between the empirical distribution, which assigns zero probability to all the other classes, but a probability of 1 to the correct class, and the actual distribution over classes that my neural net is computing, which is f of x.
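
In symbols, restating the identity just described between the negative log-likelihood and the cross-entropy:

$$ l(f(x), y) \;=\; -\log f(x)_y \;=\; -\sum_{c} 1_{(y = c)} \log f(x)_c $$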

That's just a technical detail. You can just think about this. Here, I only mention it because in certain libraries, it's actually mentioned as the cross-entropy loss. So that's for the loss. Then we need also a procedure for computing what is the gradient of my loss with respect to all of my parameters in my neural net, so the biases and the weights.

You can go look at my videos if you want the actual derivation of all the details for all of these different expressions. I don't have time for that, so all I'll do-- and presumably, a lot of you actually have seen these derivations. If you haven't, just go check out the videos.

In any case, I'm going to go through what the algorithm is. I'm going to highlight some of the key points that will come up later in understanding how actually backpropagation functions. So the basic idea is that we'll compute gradients by exploiting the chain rule. And we'll go from the top layer all the way to the bottom, computing gradients for layers that are closer and closer to the input as we go, and exploiting the chain rule to exploit or reuse previous computations we've made at upper layers to compute the gradients at the layers below.

So we usually start by computing what is the gradient at the output layer. So what's the gradient of my loss with respect to my output layer? And actually, it's more convenient to compute the loss with respect to the pre-activation. It's actually a very simple expression. So that's why I have the gradient of this vector, a l plus 1.

That's the pre-activation at the very last layer. And the loss function is minus the log of f of x at index y. And it turns out this gradient is super simple. It's minus the quantity E of y minus f of x, where E of y is the one-hot vector for class y. So what this means is E of y is just a vector filled with a bunch of 0's and then a 1 at the correct class.

So if y was the fourth class, then in this case, it would be this vector, where I have a 1 at the fourth dimension. So E of y is just a vector. We call it the one-hot vector full of 0's. And the single 1 at the position corresponding to the correct class.

So what this part of the gradient is essentially saying is that I'm going to increase-- I want to increase the probability of the correct class. I want to increase the pre-activation, which will increase the probability of the correct class. And I'm going to subtract the current probabilities assigned by my neural net to all of the classes.

So f of x, that's my output layer. And that's the current belief of the neural net as to the probability of assigning the input to each class. So what this term is doing is essentially trying to decrease the probability of every class, and specifically to decrease it by as much as the neural net currently believes that the input belongs to it.

And so if you think about the subtraction of these two things, well, for the class that's the correct class, I'm going to have 1 minus some number between 0 and 1, because it's a probability. So that's going to be positive. So I'm going to increase the probability of the correct class.

And for everything else, it's going to be 0 minus a positive number. So it's going to be negative. So I'm actually going to decrease the probability of everything else. So intuitively, it makes sense. This gradient has the right behavior. And I'm going to take that pre-activation gradient. I'm going to propagate it from the top to the bottom and essentially iterating from the last layer, which is the output layer, L plus 1, all the way down to the first layer.

And as I'm going down, I'm going to compute the gradients with respect to my parameters and then compute what's the gradient for the pre-activation at the layer below and then iterate like that. So at each iteration of that loop, I take what is the current gradient of the loss function with respect to the pre-activation at the current layer.

And I can compute the gradient of the loss function with respect to my weight matrix. So not doing the derivation here, it's actually simply this vector. So in my notation, I assume that all the vectors are column vectors. So this pre-activation gradient vector, and I multiply it by the transpose of the activations, so the value of the layer right below, the layer k minus 1.

So because I take the transpose, that's a multiplication like this. And you can see if I do the outer product, essentially, between these two vectors, I'm going to get a matrix of the same size as my weight matrix. So it all checks out. That makes sense. Turns out that the gradient of the loss with respect to the bias is exactly the gradient of the loss with respect to the pre-activation.

So that's very simple. So that gives me now my gradients for my parameters. Now I need to compute, OK, what is going to be the gradient of the pre-activations at the layer below? Well, first, I'm going to get the gradient of the loss function with respect to the activation at the layer below.

Well, that's just taking my pre-activation gradient vector and multiplying it by the transpose of my weight matrix. Super simple operation, just a linear transformation of my gradients at layer k, linearly transformed to get my gradients of the activation at layer k minus 1.

And then to get the gradients of the pre-activation, so before the activation function, I'm going to take this gradient here, which is the gradient with respect to the activation at layer k minus 1. And then I apply the factor corresponding to the partial derivative of my nonlinear activation function.

So this here, this refers to an element-wise product. So I'm taking these two vectors, this vector here and this vector here. I'm going to do an element-wise product between the two. And this vector here is just the partial derivative of the activation function for each unit individually that I've put together into a vector.

This is what this corresponds to. Now, the key things to notice are, first, that this pass, computing all the gradients and doing all these iterations, is actually fairly cheap. The complexity is essentially the same as that of a forward pass. So all I'm doing is linear transformations, multiplying by matrices, in this case the transpose of my weight matrix.

And then I'm also doing this nonlinear operation where I'm multiplying by the gradient of the activation function. So that's the first thing to notice. And the second thing to notice is that here I'm doing this element-wise product. So if any of these terms here for a unit is very close to 0, then the pre-activation gradient is going to be 0 for the next layer.

And I highlight this point because essentially whenever-- that's something to think about a lot when you're training neural nets. Whenever this gradient here, these partial derivatives, come close to 0, then it means the gradient will not propagate well to the next layer, which means that you're not going to get a good gradient to update your parameters.
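
As a rough sketch of that backward pass (for the softmax output with the negative log-likelihood loss, the starting pre-activation gradient is f(x) minus the one-hot vector e(y); `g_prime` stands for the activation-function derivative discussed next; all the names here are mine, for illustration only):

```python
import numpy as np

def backward(y, weights, activations, pre_activations, f_x, g_prime):
    """Backpropagation through the layers described above.

    activations[k] is h^k (with activations[0] = x), pre_activations[k] is a^(k+1),
    f_x is the softmax output, g_prime is the derivative of the hidden activation function.
    Returns the gradients for every weight matrix and bias vector.
    """
    L = len(weights)
    grad_W, grad_b = [None] * L, [None] * L
    grad_a = f_x.copy()
    grad_a[y] -= 1.0                                   # gradient at the output: f(x) - e(y)
    for k in reversed(range(L)):
        grad_W[k] = np.outer(grad_a, activations[k])   # pre-activation gradient times h^(k-1) transposed
        grad_b[k] = grad_a                             # same as the pre-activation gradient
        if k > 0:
            grad_h = weights[k].T @ grad_a             # gradient w.r.t. the activation below
            grad_a = grad_h * g_prime(pre_activations[k - 1])   # element-wise product with g'
    return grad_W, grad_b
```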

Now, when does that happen? When will you see these terms here being close to 0? Well, that's going to be when the partial derivatives of these nonlinear activation functions are close to 0 or 0. So we can look at the partial derivatives, say, of the sigmoid function. It turns out it's super easy to compute.

It's just the sigmoid itself times 1 minus the sigmoid itself. So that means that whenever the activation of the unit for a sigmoid unit is close to 1 or close to 0, I essentially get a partial derivative that's close to 0. You can kind of see it here. The slope here is essentially flat, and the slope here is flat.

That's the value of the partial derivative. So in other words, if my pre-activations are very negative or very positive, or if my unit is very saturated, then gradients will have a hard time propagating to the next layer. That's the key insight here. Same thing for the tanh function. So it turns out the partial derivative is also easy to compute.

You just take the tanh value, square it, and subtract it from 1. And indeed, if the value is close to minus 1 or close to 1, you can see that the slope is flat. So again, if the unit is saturating, gradients will have a hard time propagating to the next layers.

And for the ReLU, the rectified linear activation function, the gradient is even simpler. You just check whether the pre-activation is greater than 0. If it is, the partial derivative is 1. If it's not, it's 0. So actually, you're going to multiply by 1 or 0. You essentially get a binary mask when you're performing the propagation through the ReLU.

And you can see it. The slope here is flat, and otherwise, you have a linear function. So actually, here, the shrinking of the gradient towards 0 is even more drastic. It's exactly multiplying by 0 if you have a unit that's saturated below. And beyond all the math, in terms of actually using those in practice, during the weekend, you'll see three different libraries that essentially compute these gradients for you.

You actually usually don't write down backprop. You just use all of these modules that you've implemented. And it turns out there's a way of automatically differentiating your loss function and getting gradients for free in terms of effort, in terms of programming effort, with respect to your parameters. So conceptually, the way you do this-- and you'll see essentially three different libraries doing it in slightly different ways.

What you do is you augment your flow graph by adding, at the very end, the computation of your loss function. And then each of these boxes, which are conceptually objects that are taking arguments and computing a value, you're going to augment them to also have a method that's a backprop or a bprop method.

You'll often see, actually, this expression being used, bprop. And what this method should do is take as input the gradient of the loss with respect to the node itself. And then it should propagate to its arguments -- so its parents in the flow graph, the things it takes to compute its own value -- using the chain rule, what their gradients are with respect to the loss.

So what this means is that you would start the process by initializing, well, the gradient of the loss with respect to itself is 1. And then you pass the bprop method here 1. And then it's going to propagate to its argument, what is, by using the chain rule, what is the gradient of the loss with respect to f of x?

And then you're going to call bprop on this object here. And it's going to compute, well, I have the gradient of the loss with respect to myself, f of x. From this, I can compute the gradient of the loss with respect to my argument, which is the pre-activation at layer 2.

So I'm going to reuse the computation I just got and update it using my-- what is essentially the Jacobian. And then I'm going to take the pre-activation here, which now knows what is the gradient of the loss with respect to itself, the pre-activation. It's going to propagate to the weights and the biases and the layer below, updating them with-- informing them of what is the gradient of the loss with respect to themselves.

And you continue like this, essentially going through the flow graph, but in the opposite direction. So the basic Torch library essentially functions like this quite explicitly. You chain these elements together. And then when you're performing backpropagation, you're going in the reverse order of these chained elements.
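
Conceptually, one of those boxes might look like this toy sketch of the fprop/bprop idea (this is not the actual API of Torch or any other library, just an illustration):

```python
import numpy as np

class Linear:
    """Pre-activation node: computes a = W h + b."""
    def __init__(self, W, b):
        self.W, self.b = W, b

    def fprop(self, h):
        self.h = h
        return self.W @ h + self.b

    def bprop(self, grad_output):
        # store the gradients for the parameters, return the gradient for the argument h
        self.grad_W = np.outer(grad_output, self.h)
        self.grad_b = grad_output
        return self.W.T @ grad_output

class Sigmoid:
    """Activation node: propagates gradients with the chain rule in bprop."""
    def fprop(self, a):
        self.out = 1.0 / (1.0 + np.exp(-a))
        return self.out

    def bprop(self, grad_output):
        # gradient w.r.t. my argument = gradient w.r.t. me, times the sigmoid's derivative
        return grad_output * self.out * (1.0 - self.out)
```

Backpropagation then just calls bprop on each node in the reverse order of the forward pass, starting from the loss node with a gradient of 1.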

And then you have libraries like Torch autograd and Theano and TensorFlow, which you'll learn about, which do things in a slightly more sophisticated way. And you'll learn about that later on. OK, so that's a discussion of how you actually compute gradients of the loss with respect to the parameters. So that's another component we need in stochastic gradient descent.

We can choose a regularizer. One that's often used is L2 regularization. So that's just the sum of the squares of all the weights. And the gradient of that is just twice the weight. So it's a super simple gradient to compute. We usually don't regularize the biases. There's no particularly important reason for that.

There are many fewer biases, so it seems less important. And this L2 regularization is often referred to as weight decay. So if you hear about weight decay, that often refers to L2 regularization. And then finally, and this is also a very important point, you have to initialize the parameters before you actually start doing backprop.

And there are a few tricky cases you need to make sure that you don't fall into. So the biases, often we initialize them to 0. There are certain exceptions, but for the most part, we initialize them to 0. But for the weights, there are a few things we can't do.

So we can't initialize the weights to 0, and especially if you have tanh activations. The reason-- and I won't explain it here, but it's not a bad exercise to try to figure out why-- is that essentially, when you do your first pass, you're going to get gradients for all your parameters that are going to be 0.

So you're going to be stuck at this 0 initialization. So we can't do that. We also can't initialize all the weights to exactly the same value. Again, you think about it a little bit. What's going to happen is essentially that all the weights coming into a unit within the layer are going to have exactly the same gradients, which means they're going to be updated exactly the same way, which means they're going to stay constant the same-- not constant, but they're going to stay the same-- the whole time.

So it's as if you have multiple copies of the same unit. So you essentially have to break that initial symmetry that you would create if you initialized everything to the same value. So what we end up doing most of the time is initialize the weights to some randomly generated value.

Often, we generate them-- there are a few recipes, but one of them is to initialize them from a uniform distribution between a lower and an upper bound. This is a recipe here that is often used, that has some theoretical grounding, and that was derived specifically for the tanh. There's this paper here by Xavier Glorot and Yoshua Bengio you can check out for some intuition as to how you should initialize the weights.

But essentially, they should be initially random, and they should be initially close to 0. Random to break symmetry, and close to 0 so that initially the units are not already saturated. Because if the units are saturated, then there are no gradients that are going to pass through the units.

You're essentially going to get gradients very close to 0 at the lower layers. So that's the main intuition, is to have weights that are small and random. So those are all the pieces we need for running stochastic gradient descent. So that allows us to take a training set and run a certain number of epochs, and have the neural net learn from that training set.
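
To make that concrete, here is a sketch of one common initialization recipe (the bound below is the widely cited sqrt(6 / (fan-in + fan-out)) formula associated with the Glorot and Bengio paper; treat the exact constant as an assumption and check the paper for the precise recipe):

```python
import numpy as np

def init_layer(n_in, n_out, rng=None):
    rng = rng or np.random.default_rng()
    # small random weights: random to break symmetry, small to avoid saturating the units
    bound = np.sqrt(6.0 / (n_in + n_out))
    W = rng.uniform(-bound, bound, size=(n_out, n_in))
    b = np.zeros(n_out)   # biases are usually just initialized to 0
    return W, b
```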

Now, there are other quantities in our neural network that we haven't specified how to choose them. So those are the hyperparameters. So usually, we're going to have a separate validation set. Most people here are familiar with machine learning, so that's a typical procedure. And then we need to select things like, OK, how many layers do I want?

How many units per layer do I want? What's the step size, the learning rate of my stochastic gradient descent procedure, that alpha number? What is the weight decay that I'm going to use? So a standard thing in machine learning is to perform a grid search. That is, if I have two hyperparameters, I list out a bunch of values I want to try.

So for the number of hidden units, maybe I want to try 100, 1,000, and 2,000, say. And then for the learning rate, maybe I want to try 0.01 and 0.001. So a grid search would just try all combinations of these three values for the hidden units and these two values for the learning rates.

So that means that the more hyperparameters there are, the more the number of configurations you have to try out blows up -- it grows exponentially. So another procedure that is now more and more common, and which is more practical, is to perform a form of random search. In this case, what you do is, for each hyperparameter, you determine a distribution of likely values you'd like to try.

So it could be-- so for the number of hidden units, maybe I do a uniform distribution over all integers from 100 to 1,000, say, or maybe a log uniform distribution. And for the learning rate, maybe, again, the log uniform distribution, but from 0.001 to 0.01, say. And then to get an experiment, so to get values for my hyperparameters to do an experiment with and get a performance on my validation set, I just independently sample from these distributions for each hyperparameter to get a full configuration for my experiment.

And then because I have this way of getting one experiment, I do it independently for all of my jobs, all of my experiments that I will do. So in this case, if I know I have enough compute power to do 50 experiments, I just sample 50 independent samples from these distributions for hyperparameters, perform these 50 experiments, and I just take the best one.

What's nice about it is that there are no-- unlike grid search, there are never any holes in the grid. That is, you just specify how many experiments you do. If one of your jobs died, well, you just have one less. But there's no hole in your experiment. And also, one reason why it's particularly useful, this approach, is that if you have a specific value in grid search for one of the hyperparameters that just makes the experiment not work at all-- so learning rates are a lot like this.

If you have a learning rate that's too high, it's quite possible that the optimization will not converge. Well, if you're using a grid search, it means that all the experiments that use that specific value of the learning rate are going to be garbage. They're all not going to be useful.

And you don't really get this sort of big waste of computation if you do a random search, because most likely, all the values of your hyperparameters are going to be unique, because they're samples, say, from a uniform distribution over some range. So that actually works quite well, and it's quite recommended.
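
A minimal sketch of that random search procedure (the `train_and_validate` function is a placeholder for whatever runs one experiment and returns a validation error, and the two sampled hyperparameters just mirror the examples above):

```python
import numpy as np

def random_search(train_and_validate, n_jobs=50, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_jobs):
        # sample each hyperparameter independently from its own distribution
        config = {
            "n_hidden": int(rng.integers(100, 1001)),    # uniform over the integers 100..1000
            "learning_rate": 10 ** rng.uniform(-3, -2),  # log-uniform between 0.001 and 0.01
        }
        val_error = train_and_validate(**config)
        if best is None or val_error < best[0]:
            best = (val_error, config)
    return best   # keep the configuration with the best validation performance
```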

And there are more advanced methods, like methods based on machine learning, Bayesian optimization, or sometimes known as sequential model-based optimization, that I won't talk about, but that works a bit better than random search. And that's another alternative if you think you have an issue finding good hyperparameters, is to investigate some of these more advanced methods.

Now, you do this for most of your hyperparameters, but for the number of epochs, the number of times you go through all of your examples in your training set, what we usually do is not grid search or random search, but we use a thing known as early stopping. The idea here is that if I've trained a neural net for 10 epochs, then training a neural net with all the other hyperparameters kept constant but for one more epoch is easy.

I just do one more epoch. So I shouldn't start over and then do, say, 11 epochs from scratch. And so what we would do is we would just track what is the performance on the validation set as I do more and more epochs. And what we will typically see is the training error will go down, but the validation set performance will go down and eventually go up.

The intuition here is that the gap between the performance on the training set and the performance on the validation set will tend to increase. And since the training curve cannot go below, usually, some bound, then eventually the validation set performance has to go up. Sometimes it won't necessarily go up, but it sort of stays stable.

So with early stopping, what we do is that if we reach a point where the validation set performance hasn't improved for a certain number of iterations, which we refer to as the lookahead, we just stop. We go back to the neural net that had the best performance overall on the validation set, and that's my neural network.
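
In code, early stopping might look like this sketch (assuming placeholder functions `train_one_epoch`, which updates the model in place, and `validation_error`, which evaluates it):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              lookahead=10, max_epochs=1000):
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        error = validation_error(model)
        if error < best_error:
            best_error = error
            best_model = copy.deepcopy(model)   # remember the best model seen so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= lookahead:
            break   # no improvement for `lookahead` epochs: stop
    return best_model
```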

So I now have a very cheap way of actually getting the number of iterations, or the number of epochs, over my training set. A few more tricks of the trade. So it's always useful to normalize your data. It will often have the effect of speeding up training. This applies if you have real-valued data; for binary data, we usually keep it as it is.

So what I mean by that is just subtract for each dimension what is the average in the training set of that dimension, and then dividing by the standard deviation of each dimension again in my input space. So this can speed up training. We often use a decay on the learning rate.

There are a few methods for doing this. One that's very simple is to start with a large learning rate and then track the performance on the validation set. And once on the validation set it stops improving, you decrease your learning rate by some ratio. Maybe you divide it by 2.

And then you continue training for some time. Hopefully, the validation set performance starts improving. And then at some point, it stops improving, and then you stop. Or you divide again by 2. So that sort of gives you an adaptive-- using the validation set, an adaptive way of changing your learning rate.

And that can, again, work better than having a very small learning rate and waiting for a longer time. So you make very fast progress initially, and then slower progress towards the end. Also, I've described so far the approach for training neural nets that is based on a single example at a time.

But in practice, we actually use what's called mini-batches. That is, we compute the loss function on a small subset of examples, say, 64, 128. And then we take the average of the loss of all these examples in that mini-batch. And that's actually-- we compute the gradient of this average loss on that mini-batch.

The reason why we do this is that it turns out that you can very efficiently implement the forward pass over all of these 64, 128 examples in my mini-batch in one pass by, instead of doing vector matrix multiplications when we compute the pre-activations, doing matrix-matrix multiplications, which are faster than doing multiple matrix-vector multiplications.

So in your code, often, there will be this other hyperparameter, the number of examples in your mini-batch, which is mostly optimized for speed, in terms of how quickly training will proceed. Other things to improve optimization might be using something like momentum. That is, instead of using the gradient of the loss function as the descent direction, I'm actually going to track a descent direction, which I'm going to compute as the gradient for my current example or mini-batch, plus some fraction of the previous update, the previous direction of update.

And beta now is a hyperparameter you have to optimize. So what this does is, if all the update directions agree across multiple updates, then it will start picking up momentum and actually make bigger steps in those directions. And then there are multiple, even more advanced methods for adding adaptive types of learning rates.

I mention them here very quickly, because you might see them in papers. There's a method known as AdaGrad, where the learning rate is actually scaled for each dimension, so for each weight and each bias. It's going to be scaled by the square root of the cumulative sum of the squared gradients.

So what I track is I take my gradient vector at each step. I do an element-wise square of all the dimensions of my gradient vector. And then I accumulate that in some variable that I'm denoting as gamma here. And then for my descent direction, I take the gradient, and I do an element-wise division by the square root of this cumulative sum of squared gradients.

There's also RMSProp, which is essentially like AdaGrad, but instead of doing a cumulative sum, we're going to do an exponential moving average. So we take the previous value times some factor plus 1 minus this factor times the current squared gradient. So that's RMSProp. And then there's Adam, which is essentially a combination of RMSProp with momentum, which is more involved.
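
To make those update rules concrete, here is a rough sketch of momentum and RMSProp for a single parameter array (a simplified illustration only; Adam, as mentioned, roughly combines the two ideas):

```python
import numpy as np

def momentum_step(param, grad, velocity, alpha=0.01, beta=0.9):
    # the descent direction accumulates a fraction beta of the previous direction
    velocity[:] = -grad + beta * velocity
    param += alpha * velocity
    return param, velocity

def rmsprop_step(param, grad, avg_sq_grad, alpha=0.001, decay=0.9, eps=1e-8):
    # exponential moving average of the element-wise squared gradients
    avg_sq_grad[:] = decay * avg_sq_grad + (1.0 - decay) * grad ** 2
    param -= alpha * grad / (np.sqrt(avg_sq_grad) + eps)
    return param, avg_sq_grad
```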

I won't have time to describe Adam here, but that's another method that's often implemented in these different software packages and that people seem to use with a lot of success. And finally, in terms of actually debugging your implementations-- so for instance, if you're lucky, you can build your neural network without difficulty using the current tools that are available in Torch or TensorFlow or Theano.

But maybe sometimes you actually have to implement certain gradients for a new module and a new box in your flow graph that isn't currently supported. If you do this, you should check that you've implemented your gradients correctly. And one way of doing that is to actually compare the gradients computed by your code with a finite difference of estimate.

So what you do is, for each parameter, you add some very small epsilon value, say 10 to the minus 6, and you compute the output of your module. Then you compute the same thing, but where you've subtracted the small quantity instead, take the difference of the two, and divide by 2 epsilon.

So as epsilon converges to 0, you actually get the partial derivative. But if it's just small, it's going to be an approximation. And usually, this finite difference estimate will be very close to a correct implementation of the real gradient. So you should definitely do that if you've actually implemented some of the gradients in your code.
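
Here is a minimal sketch of that finite-difference check for a list of scalar parameters (the `loss_fn` is a placeholder for whatever your module computes):

```python
def finite_difference_check(loss_fn, params, analytic_grads, eps=1e-6, tol=1e-4):
    """Compare analytic gradients with (loss(p + eps) - loss(p - eps)) / (2 * eps)."""
    for i in range(len(params)):
        original = params[i]
        params[i] = original + eps
        loss_plus = loss_fn(params)
        params[i] = original - eps
        loss_minus = loss_fn(params)
        params[i] = original                            # restore the parameter
        numeric = (loss_plus - loss_minus) / (2 * eps)
        if abs(numeric - analytic_grads[i]) > tol:
            print(f"parameter {i}: analytic {analytic_grads[i]:.6g} vs numeric {numeric:.6g}")
```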

And then another useful thing to do is to actually do a very small experiment on a small data set before you actually run your full experiment on your complete data set. So use, say, 50 examples. So just taking a random subset of 50 examples from your data set. Actually, just make sure that your code can overfit to that data, can essentially classify it perfectly, given enough capacity that you would think it should get it.

So if it's not the case, then there's a few things that you might want to investigate. Maybe your initialization is such that the units are already saturated initially, and so there's no actual optimization happening because some of the gradients on some of the weights are exactly zero. So you might want to check your initialization.

Maybe your gradients are just-- you're using a model you implemented gradients for, and maybe your gradients are not properly implemented. Maybe you haven't normalized your input, which creates some instability, making it harder for stochastic gradient descent to work successfully. Maybe your learning rate is too large. Then you should consider trying smaller learning rates.

That's actually a pretty good way of getting some idea of the magnitude of the learning rate you should be using. And then once you can actually overfit on your small training set, you're ready to do a full experiment on the larger data set. That said, this is not a replacement for gradient checking.

So backprop and stochastic gradient descent, it's a great algorithm that's very bug resistant. You will potentially see some learning happening, even if some of your gradients are wrong, or say, exactly zero. So that's great if you're an engineer and you're implementing things. It's fun when code is somewhat bug resistant.

But if you're actually doing science and trying to understand what's going on, that can be a complication. So do both, gradient checking and a small experiment like that. All right, and so for the last few minutes, I'll actually try to motivate what you'll be learning quite a bit about in the next two days.

That is, the specific case for deep learning. So I've already told you that if I have a neural net with enough hidden units, theoretically, I can potentially represent pretty much any function, any classification function. So why would I want multiple layers? So there are a few motivations behind this.

The first one is taken directly from our own brains. So we know in the visual cortex that the light that hits our retina eventually goes through several regions in the visual cortex. Eventually reaching an area known as V1, where you have units that are-- or neurons that are essentially tuned to small forms like edges.

And then it goes on to V4, where the patterns that the units are tuned for are slightly more complex. And then you reach AIT, where you actually have neurons that are specific to certain objects. And so the idea here is that perhaps that's also what we want in an artificial, say, vision system.

We'd like it, if it's detecting faces, to have a first layer that detects simple edges, and then another layer that perhaps puts these edges together, detecting slightly more complex things, like a nose or a mouth or eyes. And then eventually have a layer that combines these more abstract units to get something even more abstract, like a complete face.

There's also some theoretical justification for using multiple layers. So the early results were mostly based on studying Boolean functions, that is, functions whose input you can think of as a vector of just zeros and ones. And you can show that if you have essentially a Boolean neural network, or essentially a Boolean circuit, and you restrict the number of layers of that circuit, then there are certain Boolean functions that, to be represented exactly, would require an exponential number of units in those layers.

Whereas if you allowed yourself to have multiple layers, then you could represent these functions more compactly. And so that's another motivation, that perhaps with more layers, we can represent fairly complex functions in a more compact way. And then there's the reason that they just work. So we've seen in the past few years great success in speech recognition, where it's essentially revolutionized the field, where everyone's using deep learning for speech recognition, and same thing for visual object recognition, where, again, deep learning is sort of the method of choice for identifying objects in images.

So then why are we doing this only recently? Why didn't we do deep learning way back when backprop was invented, which is essentially in 1980s and even before that? So it turns out training deep neural networks is actually not that easy. There are a few hurdles that one can be confronted with.

I've already mentioned one of the issues, which is that some of the gradients might be fading as you go from the top layer to the bottom layer, because we keep multiplying by the derivative of the activation function. So that makes training hard. It could be that the lower layers, with very small gradients, are barely moving and barely exploring the space of correct features to learn for a given problem.

Sometimes that's the problem you find. You have a hard time just fitting your data, and you're essentially underfitting. Or it could be that with deeper neural nets or bigger neural nets, we have more parameters. So perhaps sometimes we're actually overfitting. We're in a situation where the set of all the functions that we can represent with the same neural net, represented by this gray area, does include, yes, the right function, but it's so large that for a finite training set, the odds that I'm going to find the one that's close to the true classifying function, the real system that I'd like to have, are going to be very low.

So in this case, I'm essentially overfitting, and that might also be the situation we're in. And unfortunately, there are many situations where one or the other problem is observed, overfitting or underfitting. And so we have essentially, in the field, developed tools for fighting both situations. And I'm going to rapidly touch on a few of those, which you will see come up later on in multiple talks.

So for the first hypothesis, which might be that you're underfitting, well, you can essentially just fight this by waiting longer, so training longer. If your gradients are too small, and this is essentially why you're progressing very slowly when you're training, well, if you're using GPUs and are able to do more iterations over the same training set in less time, that might just solve your problem of underfitting.

And I think we've seen some of that, and this is partly why GPUs have been so game-changing for deep learning. Or you can use just better optimization methods also. And if you're overfitting, well, we just need better regularization. I've been involved early on in my PhD on using unsupervised learning as a way to regularize neural nets.

If I have time, I'll talk a little bit about that. And there's another method you might have heard about known as dropout. So I'll try to touch at least two methods that are essentially trying to address some of these issues. So the first one that I'll talk about is dropout.

It's actually very easy, very simple. So the idea is, if our neural net is essentially overfitting, so it's too good at fitting the training set, well, we're essentially going to cripple training. We're going to make it harder to fit the training set. And the way we're going to do that in dropout is that we will stochastically remove hidden units independently.

So for each hidden unit, before we do a forward pass, we'll flip a coin. And with probability half, we will multiply the activation by 0. And with probability half, we'll multiply it by 1. So what this means is that if a unit is multiplied by 0, it's effectively not in the neural net anymore.

And we're doing this independently for each hidden unit. So that means that in a layer, a unit cannot rely anymore on the presence of any other unit to try to sort of synchronize and adapt to perform a complex classification or learn a complex feature. And that was partly the motivation behind dropout: this procedure might encourage types of features that are not co-adapted and are less likely to overfit.

So we often use 0.5 as the probability of dropping out a unit. It turns out it often, surprisingly, is the best value. But that's another hyperparameter you might want to tune. And in terms of how it impacts an implementation of backprop, it's very simple. So for the forward pass, before I do it, I just sample my binary masks for all my layers.

And then when I'm performing backprop, well, my gradient on the-- oh, sorry. So that's the forward pass. I'm just multiplying by this binary mask here. So super simple change. And then in terms of backprop, well, I'm also going to multiply by the mask when I get my gradient on the pre-activation.

And also, don't forget that the activations are now different. They actually include the mask in my notation. So it's a very simple change in the forward and backward pass when you're training. And also, another thing that I should emphasize is that the mask is being resampled for every example.

So before you do a forward pass, you resample the mask. You don't keep it-- sample it once and then use it the whole time. And then at test time, because we don't really like a model that sort of randomly changes its output, because it will if we stochastically change the masks, what we do is we replace the mask by the probability of dropping out a unit, or actually of keeping a unit.

So if we're using 0.5, that's just 0.5. We can actually show that if you have a neural net with a single hidden layer, doing this transformation at test time, multiplying by 0.5 is equivalent to doing a geometric average of all the possible neural networks with all the different binary mask patterns.

So essentially, one way of thinking about dropout in the single hidden layer case is that it's kind of an ensembling method where you have a lot of models, an exponential number of models, which are all sharing the same weights but have different masks. That intuition, though, doesn't transfer to deep neural nets, in the sense that you cannot show this result.

It really only applies to a single hidden layer. So in practice, it's very effective, but do expect some slowdown in training. So often, we tend to see that training a network to completion will take twice as many epochs if you're using dropout with 0.5. And here, you have the reference if you want to learn more about different variations of dropouts and so on.
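
A sketch of the dropout behavior just described, for one hidden layer (at training time a fresh binary mask is sampled for every example; at test time the activations are scaled by the keep probability; the function names are mine):

```python
import numpy as np

def dropout_forward(h, keep_prob=0.5, train=True, rng=np.random.default_rng()):
    if train:
        mask = (rng.random(h.shape) < keep_prob).astype(h.dtype)
        return h * mask, mask          # the same mask is reused in the backward pass
    return h * keep_prob, None         # test time: just multiply by the keep probability

def dropout_backward(grad_h, mask):
    # gradients only flow through the units that were kept
    return grad_h * mask
```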

And I probably won't talk about unsupervised pre-training for lack of time, but I'll talk about another thing that you'll definitely hear about, and that's implemented in these different packages, which is batch normalization. Batch normalization is kind of interesting in the sense that it's been shown to help optimization.

That is, certain networks that would otherwise underfit would not underfit as much anymore if you use batch normalization. But also, it's been shown that when you use batch normalization, dropout is not as useful. And dropout being a regularization method, that suggests that perhaps batch normalization is also regularizing in some way.

So these things are not one or the other. They're not mutually exclusive. You can have a regularizer that also, it turns out, helps you optimize better. So the intuition behind batch normalization is this: much like I've suggested that normalizing your inputs can help speed up training, well, how about we also normalize all the hidden layers when I'm doing my forward pass?

Now, the problem in doing this is that I can compute the mean and the standard deviations of my inputs once and for all because they're constant. But my hidden layers are constantly changing because I'm training these parameters. So the mean and the standard deviation of my units will change.

And so it would be very expensive if every time I did an update on my parameters, I recomputed the means and the standard deviations of all of my units. So batch normalization addresses some of these issues as follows. So the way it works is first, the normalization is going to be applied on actually the pre-activation.

So not the activation of the unit, but before the non-linearity. During training, to address the issue that we don't want to compute means over the full training set because that would be too slow, I'm actually going to compute it on each mini-batch. So I have to do mini-batch training here.

I'm going to take my small mini-batch of 64, 128 examples. And that's the set of examples on which I'm going to compute my means and standard deviations. And then when I do backprop, I'm actually going to take into account the normalization. So now there's going to be a gradient going through the computation of the mean and the standard deviation because they depend on the parameters of the neural network.

And then at test time, we'll just use the global mean and global standard deviation. Once I finish training, I can actually do a full pass over the whole training set and get all of my means and standard deviations. So that's essentially the pseudocode for that, taken out of the paper directly.

So if x is a pre-activation for a unit, and I have multiple pre-activations for that single unit across my mini-batch, I would compute the average of that unit's pre-activation across my examples in the mini-batch, compute my variance, and then subtract the mean and divide by the square root of the variance, plus some epsilon for numerical stability in case the variance is too close to zero.

And then another thing is that actually batch normalization doesn't just perform this normalization and outputs the normalized pre-activation. It then actually performs a linear transformation on it. So it multiplies it by this parameter gamma, which is going to be trained by gradient descent. And it's often called the gain parameter of batch normalization.

And it adds a bias beta. And the reason is that if I'm subtracting the mean, then each of these units had a bias parameter, and if I subtract the mean, essentially there's no bias anymore. It was present in the pre-activation, it was present in the mean, and now it's been subtracted out.

So I have to add the bias, but after the batch normalization, essentially. So these betas here are essentially the new bias parameters. And those will actually be trained. So we do gradient descent also on those. So batch normalization adds a few parameters. All right, and as I said, I'm just going to skip over this.
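
A rough sketch of that training-time computation for one unit's pre-activations across a mini-batch, including the gamma and beta parameters (this ignores the running statistics used at test time and the gradients through the mean and variance, which come next):

```python
import numpy as np

def batch_norm_forward(a, gamma, beta, eps=1e-5):
    """a: pre-activations of one unit across the mini-batch (1-D array)."""
    mean = a.mean()                             # mini-batch mean
    var = a.var()                               # mini-batch variance
    a_hat = (a - mean) / np.sqrt(var + eps)     # normalize; eps guards against a tiny variance
    return gamma * a_hat + beta                 # learned gain and new bias
```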

And I'm not showing what the gradients are when you backprop through the mean and so on. It's described in the paper if you want to see the gradients. But otherwise, in the different packages, you'll get the gradients automatically. It's usually already implemented. Skipping over that, I'll just finish.

If you actually want to learn about unsupervised pre-training and why it works, I have videos on that. So you can check that out. And I guess that's it. Thank you. Thanks, Hugo. So we have a few minutes for questions which are intermingled with a break. So feel free to either go for a break or ask questions to Hugo.

I believe there are microphones. And I'll also stick around. So if you want to ask me questions offline, that's also fine. If anyone has questions, you can go to the mic. Go to the microphone. Hi. Hi. You mentioned the ReLU adds sparsity. Can you explain why? Yeah, so the first thing is that it's observed in practice.

And it adds some sparsity in part because you have the non-linearity at 0 below. So it means that units are potentially going to be exactly zero, essentially absent from the hidden layer. There are a few reasons to explain why you get sparsity. It turns out that this process of doing a linear transformation followed by the ReLU activation function is very close to some of the steps you would do when you're optimizing for sparse codes in a sparse coding model, if you know about sparse coding.

So there's essentially an optimization method that, given some sparse coding model, will find the sparse hidden representation for some input. And it's mostly a sequence of linear transformations followed by this ReLU-like activation function. And I think this is partly the explanation. Otherwise, I don't know of a solid explanation for why that is beyond what's observed in practice.

Any more questions? If not, let's thank Hugo again. And we are reconvening in 10 minutes.