
Foundations of Deep Learning (Hugo Larochelle, Twitter)


Chapters

0:00 Intro
1:00 FOUNDATIONS OF DEEP LEARNING
9:45 CAPACITY OF NEURAL NETWORK
11:02 MACHINE LEARNING
15:45 LOSS FUNCTION
19:09 BACKPROPAGATION
24:59 ACTIVATION FUNCTION
27:30 FLOW GRAPH
29:38 REGULARIZATION
30:24 INITIALIZATION
33:26 MODEL SELECTION
36:33 KNOWING WHEN TO STOP
38:07 OTHER TRICKS OF THE TRADE
42:27 GRADIENT CHECKING
43:36 DEBUGGING ON SMALL DATASET
50:54 DROPOUT
55:10 BATCH NORMALIZATION
58:34 UNSUPERVISED PRE-TRAINING
58:37 NEURAL NETWORK ONLINE COURSE

Whisper Transcript

00:00:00.000 | That's good.
00:00:00.840 | All right, cool.
00:00:01.440 | So yeah, so I was asked to give this presentation
00:00:04.880 | on the foundations of deep learning, which
00:00:07.080 | is mostly going over basic feedforward neural networks
00:00:11.120 | and motivating a little bit deep learning
00:00:13.640 | and some of the more recent developments
00:00:15.960 | and some of the topics that you'll
00:00:17.480 | see across the next two days.
00:00:20.120 | So as Andrew mentioned, I have just an hour.
00:00:26.080 | So I'm going to go fairly quickly
00:00:27.520 | on a lot of these things, which I think
00:00:29.160 | would mostly be fine if you're familiar enough
00:00:31.600 | with some machine learning and a little bit about neural nets.
00:00:35.320 | But if you'd like to go into some
00:00:36.800 | of the more specific details, you
00:00:38.280 | can go check out my online lectures on YouTube.
00:00:41.320 | It's now taught by a much younger version of myself.
00:00:44.400 | And so just search for Hugo Larochelle.
00:00:47.800 | And I am not the guy doing a bunch of skateboarding.
00:00:50.640 | I'm the geek teaching about neural nets.
00:00:52.920 | So go check those out if you want more details.
00:00:56.560 | But so what I'll cover today is--
00:00:58.880 | I'll start with just describing and laying out
00:01:02.280 | the notation on feedforward neural networks, that
00:01:04.920 | is, models that take an input vector x--
00:01:07.280 | that might be an image or some text--
00:01:09.320 | and produces an output f of x.
00:01:11.360 | So I'll just describe forward propagation
00:01:13.120 | and the different types of units and the type of functions
00:01:16.040 | we can represent with those.
00:01:17.680 | And then I'll talk about how we actually train neural nets,
00:01:20.720 | describing things like loss functions,
00:01:22.600 | backpropagation that allows us to get a gradient for training
00:01:26.160 | with stochastic gradient descent,
00:01:27.700 | and mention a few tricks of the trade,
00:01:29.840 | so some of the things we do in practice to successfully train
00:01:32.720 | neural nets.
00:01:33.680 | And then I'll end by talking about some developments that
00:01:37.680 | are specifically useful in the context of deep learning, that
00:01:41.240 | is, neural networks with several hidden layers that came out
00:01:45.220 | at the very--
00:01:46.920 | after the beginning of deep learning, say, in 2006.
00:01:49.880 | That is, things like dropout, batch normalization,
00:01:52.200 | and if I have some time, unsupervised pre-training.
00:01:55.800 | So let's get started.
00:01:57.560 | And just talk about, assuming we have some neural network,
00:02:00.400 | how do they actually function?
00:02:01.680 | How do they make predictions?
00:02:04.280 | So let me lay down the notation.
00:02:06.800 | So a multilayer feedforward neural network
00:02:10.240 | is a model that takes as input some vector x, which
00:02:14.200 | I'm representing here with a different node
00:02:16.400 | for each of the dimensions in my input vector.
00:02:19.480 | So each dimension is essentially a unit in that neural network.
00:02:23.640 | And then it eventually produces, at its output layer,
00:02:27.080 | an output.
00:02:28.680 | And we'll focus on classification mostly.
00:02:31.160 | So you'd have multiple units here.
00:02:33.280 | And each unit would correspond to one
00:02:35.160 | of the potential classes in which we would
00:02:37.280 | want to classify our input.
00:02:38.680 | So if we're identifying digits in handwritten character
00:02:42.960 | images, and say we're focusing on digits,
00:02:46.120 | you'd have 10 digits.
00:02:47.120 | So you would have the digits from 0 to 9.
00:02:50.120 | So you'd have 10 output units.
00:02:52.420 | And to produce an output, the neural net
00:02:54.820 | will go through a series of hidden layers.
00:02:58.360 | And those will be essentially the components
00:03:01.000 | that introduce non-linearity that
00:03:02.380 | allows us to capture and perform very sophisticated types
00:03:06.340 | of classification functions.
00:03:08.600 | So if we have L hidden layers, the way
00:03:11.500 | we compute all the layers in our neural net is as follows.
00:03:16.040 | We first start by computing what I'm
00:03:17.940 | going to call a pre-activation.
00:03:20.040 | I'm going to denote that A. And I'm
00:03:22.200 | going to index the layers by k.
00:03:23.920 | So A k is just the pre-activation at layer k.
00:03:28.080 | And that is simply going to be a linear transformation
00:03:32.320 | of the previous layer.
00:03:34.080 | So I'm going to note h k as the activation on the layer.
00:03:38.320 | And by default, I'll assume that layer 0
00:03:41.200 | is going to be the input.
00:03:43.080 | And so using that notation, the pre-activation at layer k
00:03:46.960 | is going to correspond to taking the activation
00:03:49.720 | at the previous layer, k minus 1,
00:03:52.200 | multiplying it by a matrix, Wk.
00:03:54.680 | Those are the parameters of the layer.
00:03:57.640 | Those essentially correspond to the connections
00:04:00.520 | between the units between adjacent layers.
00:04:03.080 | And I'm going to add a bias vector.
00:04:05.120 | That's another parameter in my layer.
00:04:07.360 | So that gives me the pre-activation.
00:04:09.640 | And then next, I'm going to get a hidden layer activation
00:04:12.280 | by applying an activation function.
00:04:14.680 | This will introduce some non-linearity in the model.
00:04:17.760 | So I'm going to call that function g.
00:04:19.320 | And we'll go over a few choices.
00:04:22.000 | So we have four common choices for the activation function.
00:04:26.160 | And so I do this from layer 1 to layer L.
00:04:29.120 | And when it comes to the output layer,
00:04:31.320 | I'll also compute a pre-activation
00:04:33.480 | by performing a linear transformation.
00:04:36.000 | But then I'll usually apply a different activation function
00:04:38.600 | depending on the problem I'm trying to solve.
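To make the notation concrete, here is a minimal NumPy sketch of the forward pass just described. The layer sizes, the choice of sigmoid for g, and all variable names are illustrative assumptions, not anything prescribed in the talk.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    """Forward propagation: h^(0) = x, a^(k) = b^(k) + W^(k) h^(k-1), h^(k) = g(a^(k)).

    weights[k-1] is W^(k), with shape (units at layer k, units at layer k-1);
    biases[k-1] is b^(k)."""
    h = x                                  # h^(0) is the input
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h                      # pre-activation: linear transformation of the layer below
        h = sigmoid(a)                     # hidden activation, using g = sigmoid here
    a_out = biases[-1] + weights[-1] @ h   # output pre-activation
    return a_out                           # an output activation (e.g. softmax, below) is applied to this

# Tiny example: 4 input dimensions, one hidden layer of 3 units, 2 output units.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(3, 4)), rng.normal(scale=0.1, size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
print(forward(rng.normal(size=4), weights, biases))
```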
00:04:41.960 | So having said that, let's go to some of the choices
00:04:47.000 | for the activation function.
00:04:48.260 | So some of the activation functions you'll see.
00:04:50.800 | One common one is this sigmoid activation function.
00:04:53.680 | It's this function here.
00:04:54.880 | It's just 1 divided by 1 plus the exponential
00:04:58.680 | of minus the pre-activation.
00:05:01.360 | The shape of this function, you can focus on that, is this here.
00:05:04.720 | It takes the pre-activation, which
00:05:06.160 | can vary from minus infinity to plus infinity.
00:05:08.520 | And it squashes this between 0 and 1.
00:05:11.680 | So it's bounded below and above: below by 0,
00:05:15.720 | and above by 1.
00:05:17.480 | So it's a function that saturates
00:05:19.600 | if you have very large magnitude positive or negative
00:05:24.440 | pre-activations.
00:05:26.760 | Another common choice is the hyperbolic tangent or tanh
00:05:29.880 | activation function.
00:05:31.680 | This picture here.
00:05:32.600 | So it squashes everything.
00:05:33.720 | But instead of being between 0 and 1,
00:05:35.840 | it's between minus 1 and 1.
00:05:38.720 | And one that's become quite popular in neural nets
00:05:42.600 | is what's known as the rectified linear activation function.
00:05:46.120 | Or in papers, you will see the ReLU unit
00:05:50.400 | that refers to the use of this activation function.
00:05:54.880 | So this one is different from the others
00:05:56.560 | in that it's not bounded above, but it is bounded below.
00:06:00.340 | And it will output exactly 0 if the pre-activation
00:06:05.640 | is negative.
00:06:08.040 | So those are the choices of activation functions
00:06:10.120 | for the hidden layers.
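As a quick reference, here is a small sketch of the three hidden-unit activation functions just described, written in NumPy:

```python
import numpy as np

def sigmoid(a):
    # squashes the pre-activation into (0, 1); saturates for large positive or negative a
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # squashes into (-1, 1); also saturates at both ends
    return np.tanh(a)

def relu(a):
    # bounded below by 0, unbounded above; outputs exactly 0 for negative pre-activations
    return np.maximum(0.0, a)

a = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sigmoid(a), tanh(a), relu(a), sep="\n")
```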
00:06:12.060 | And for the output layer, if we're performing
00:06:13.960 | classification, as I said, in our output layer,
00:06:16.760 | we will have as many units as there
00:06:18.640 | are classes in which an input could belong.
00:06:21.440 | And what we'd like is potentially--
00:06:24.720 | and what we often do is interpret each unit's
00:06:27.600 | activation as the probability, according
00:06:30.320 | to the neural network, that the input belongs
00:06:33.600 | to the corresponding class, that its label y
00:06:36.480 | is the corresponding class C. So C
00:06:39.600 | would be like the index of that unit in the output layer.
00:06:43.120 | So we need an activation function
00:06:44.520 | that produces probabilities, produces
00:06:46.840 | a multinomial distribution over all the different classes.
00:06:50.080 | And the activation function we use for that
00:06:52.120 | is known as the softmax activation function.
00:06:55.280 | It is simply as follows.
00:06:57.400 | You take your pre-activations, and you exponentiate them.
00:07:00.440 | So that's going to give us positive numbers.
00:07:02.800 | And then we divide each of the exponentiated pre-activations
00:07:06.440 | by the sum of all the exponentiated pre-activations.
00:07:11.200 | So because I'm normalizing this way,
00:07:13.040 | it means that all my values in my output layer
00:07:16.640 | are going to sum to 1.
00:07:17.920 | And they're positive because I took the exponential.
00:07:20.080 | So I can interpret that as a multinomial distribution
00:07:23.080 | over the choice of all the C different classes.
00:07:26.760 | So that's what I'll use as the activation function
00:07:29.040 | at the output layer.
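Here is a short sketch of the softmax just described; subtracting the maximum pre-activation before exponentiating is a standard numerical-stability precaution that is not part of the lecture's definition, but it does not change the result.

```python
import numpy as np

def softmax(a):
    """Exponentiate the pre-activations, then normalize so the outputs sum to 1."""
    e = np.exp(a - np.max(a))   # positive numbers; the shift cancels out in the ratio
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
p = softmax(a)
print(p, p.sum())               # a valid multinomial distribution over the classes
```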
00:07:32.160 | And now, beyond the math in terms of conceptually
00:07:35.100 | and also in the way we're going to program neural networks,
00:07:38.480 | often what we'll do is that all these different operations,
00:07:41.020 | the linear transformations, the different types of activation
00:07:43.520 | functions, we'll essentially implement all of them
00:07:47.440 | as an object, an object that take arguments.
00:07:52.200 | And the arguments would essentially
00:07:53.660 | be what other things are being combined
00:07:55.400 | to produce the next value.
00:07:57.600 | So for instance, we would have an object
00:07:59.520 | that might correspond to the computation of pre-activation,
00:08:02.760 | which would take as argument what
00:08:04.920 | is the weight matrix and the bias vector for that layer
00:08:08.280 | and take some layer to transform.
00:08:10.920 | And this object would compute its value
00:08:13.560 | by applying the linear activation,
00:08:15.840 | the linear transformation.
00:08:17.120 | And then we might have objects that
00:08:18.620 | correspond to specific activation functions,
00:08:21.880 | so like a sigmoid object or a tanh object or a ReLU object.
00:08:25.360 | And we just combine these objects together,
00:08:27.200 | chain them into what ends up being a graph, which I refer
00:08:30.960 | to as a flow graph, that represents the computation done
00:08:34.680 | when you do a forward pass in your neural network
00:08:37.400 | up until you reach the output layer.
00:08:39.520 | So I mention it now because you'll
00:08:41.520 | see the different softwares that we presented over the weekend
00:08:45.840 | will essentially exploit some of that representation
00:08:50.000 | of the computation in neural nets.
00:08:51.600 | It will also be handy for computing gradients, which
00:08:53.880 | I'll talk about in a few minutes.
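A minimal sketch of that idea: each operation is a small object with a forward method, and chaining the objects gives the flow graph. The class and method names here are made up for illustration and are not the API of any particular library.

```python
import numpy as np

class Linear:
    """Pre-activation node: value = b + W @ h."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, h):
        return self.b + self.W @ h

class Sigmoid:
    """Activation node applying the sigmoid element-wise."""
    def forward(self, a):
        return 1.0 / (1.0 + np.exp(-a))

# Chain the objects into a (very small) flow graph and run a forward pass.
rng = np.random.default_rng(0)
graph = [Linear(rng.normal(scale=0.1, size=(3, 4)), np.zeros(3)), Sigmoid(),
         Linear(rng.normal(scale=0.1, size=(2, 3)), np.zeros(2))]
value = rng.normal(size=4)
for node in graph:
    value = node.forward(value)
print(value)
```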
00:08:57.720 | And so that's how we perform predictions in neural networks.
00:09:02.120 | So we get an input.
00:09:03.560 | We eventually reach an output layer
00:09:05.120 | that gives us a distribution over classes
00:09:06.960 | if we're performing classification.
00:09:08.680 | If I want to actually classify, I
00:09:10.360 | would just assign the class corresponding
00:09:13.400 | to the unit that has the highest activation, that
00:09:16.440 | would correspond to classifying to the class that
00:09:19.160 | has the highest probability according to the neural net.
00:09:21.720 | But then you might ask the question, OK,
00:09:26.200 | what kind of problems can we solve with neural networks?
00:09:29.240 | Or more technically, what kind of functions
00:09:31.440 | can we represent mapping from some input x
00:09:34.240 | into some arbitrary output?
00:09:36.520 | And so if you go look at my videos,
00:09:39.480 | I try to give more intuition as to why
00:09:41.920 | we have this result here.
00:09:43.160 | But essentially, if we have a single hidden layer
00:09:45.960 | neural network, it's been shown that with a linear output,
00:09:48.680 | we can approximate any continuous function
00:09:51.040 | arbitrarily well as long as we have enough hidden units.
00:09:54.560 | So that is, there's a value for these biases and these weights
00:09:57.360 | such that any continuous function,
00:09:59.240 | I can actually represent it as well as I want.
00:10:01.720 | I just need to add enough hidden units.
00:10:04.760 | So this result applies if you use activation functions,
00:10:07.560 | non-linear activation functions like sigmoid and tanh.
00:10:11.360 | So as I said in my video, if you want a bit more intuition
00:10:14.000 | as to why that would be, you can go check that out.
00:10:18.040 | But that's a really nice result.
00:10:19.920 | It means that by focusing on this family of machine learning
00:10:23.880 | models that are neural networks, I can pretty much potentially
00:10:27.720 | represent any kind of classification function.
00:10:30.640 | However, this result does not tell us
00:10:32.560 | how do we actually find the weights and the bias values
00:10:35.680 | such that I can represent a given function.
00:10:38.040 | It doesn't essentially tell us how
00:10:39.560 | do we train a neural network.
00:10:41.640 | And so that's what we'll discuss next.
00:10:44.800 | So let's talk about that.
00:10:45.960 | How do we actually, from a data set,
00:10:48.280 | train a neural network to perform good classification
00:10:51.560 | for that problem?
00:10:54.280 | So what we'll typically do is use a framework that's
00:10:58.100 | very generic in machine learning,
00:10:59.900 | known as empirical risk minimization or structural risk
00:11:03.060 | minimization if you're using regularization.
00:11:05.660 | So this framework essentially transforms
00:11:08.940 | a problem of learning as a problem of optimizing.
00:11:12.860 | So what we'll do is that we'll first choose a loss function
00:11:16.380 | that I'm noting as L. And the loss function,
00:11:19.540 | it compares the output of my model,
00:11:22.100 | so the output layer of my neural network,
00:11:23.980 | with the actual target.
00:11:25.660 | So I'm indexing with an exponent t here,
00:11:28.540 | which essentially serves as the index over all my different examples
00:11:32.660 | in my training set.
00:11:34.860 | And so my loss function will tell me,
00:11:36.940 | is this output good or bad given that the label is actually y?
00:11:42.660 | And what I'll do, I'll also define a regularizer.
00:11:47.760 | So theta here is--
00:11:49.620 | you can think of it as just a concatenation of all my biases
00:11:52.860 | and all of my weights in my neural net.
00:11:54.460 | So those are all the parameters of my neural network.
00:11:58.200 | And the regularizer will essentially
00:12:00.020 | penalize certain values of these weights.
00:12:03.220 | So as I'll talk more specifically later on,
00:12:05.980 | for instance, you might want to have your weights not
00:12:08.500 | be too far from 0.
00:12:09.940 | That's a frequent intuition that we implement with a regularizer.
00:12:14.200 | And so the optimization problem that we'll
00:12:16.660 | try to solve when learning is to minimize
00:12:19.740 | the average loss of my neural network over my training
00:12:23.900 | examples, so summing over all training examples.
00:12:26.100 | I have capital T examples.
00:12:28.860 | Plus some weight here that's known as the weight decay,
00:12:33.340 | some hyperparameter lambda, times my regularizer.
00:12:37.100 | So in other words, I'm going to try to make
00:12:39.280 | my loss on my training set as small as possible over all
00:12:43.140 | the training examples and also try
00:12:45.260 | to satisfy my regularizer as much as possible.
00:12:48.660 | And so now we have this optimization problem.
00:12:51.420 | And learning will just correspond
00:12:53.140 | to trying to solve this problem.
00:12:55.380 | So finding this arg min here over my weights and my biases.
00:13:00.940 | And if I want to do this, I can just
00:13:02.460 | invoke some optimization procedure
00:13:05.100 | from the optimization community.
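Written out, the optimization problem being described is, for T training examples, loss L, regularizer Omega, and weight decay lambda (notation adapted from the talk):

```latex
\hat{\theta} \;=\; \arg\min_{\theta}\;
\frac{1}{T}\sum_{t=1}^{T} L\!\left(f(x^{(t)};\theta),\, y^{(t)}\right)
\;+\; \lambda\,\Omega(\theta)
```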
00:13:08.860 | And the one algorithm that you'll
00:13:10.340 | see constantly in deep learning is stochastic gradient descent.
00:13:14.420 | This is the optimization algorithm
00:13:16.220 | that we'll often use for training neural networks.
00:13:19.780 | So SGD, stochastic gradient descent, functions as follows.
00:13:23.620 | You first initialize all of your parameters.
00:13:26.580 | That is finding initial values for all my weight matrices
00:13:29.740 | and all of my biases.
00:13:32.060 | And then for a certain number of epochs--
00:13:34.140 | so an epoch will be a full pass over all my examples.
00:13:37.740 | That's what I'll call an epoch.
00:13:40.100 | So for a certain number of full iterations over my training
00:13:44.620 | set, I'll draw each training example.
00:13:47.900 | So I pair x, input x, target y.
00:13:51.620 | And then I'll compute what is the gradient of my loss
00:13:55.900 | with respect to my parameters.
00:13:58.580 | All of my parameters, all my weights, and all my biases.
00:14:01.040 | This is what this notation here means--
00:14:03.180 | so nabla for the gradient of the loss function.
00:14:06.740 | And here I'm indexing with respect to which parameter
00:14:10.300 | I want the gradient for.
00:14:12.060 | So I'm going to compute what is the gradient of my loss
00:14:14.700 | function with respect to my parameters.
00:14:17.240 | And plus lambda times the gradient of my regularizer
00:14:20.220 | as well.
00:14:21.160 | And then I'm going to get a direction in which I should
00:14:23.500 | move my parameters.
00:14:25.220 | Since the gradient tells me how to increase the loss,
00:14:28.620 | I want to go in the opposite direction and decrease it.
00:14:31.060 | So my direction will be the opposite.
00:14:32.820 | So that's why I have a minus here.
00:14:35.540 | And so this delta is going to be the direction in which I'll
00:14:38.140 | move my parameters by taking a step.
00:14:40.780 | And the step is just a step size alpha,
00:14:43.940 | which is often referred to as a learning rate,
00:14:46.500 | times my direction, which I just add
00:14:49.300 | to my current values of my parameters, my biases
00:14:52.300 | and my weights.
00:14:53.360 | And that's going to give me my new value for all
00:14:56.120 | of my parameters.
00:14:57.180 | And I iterate like that, going over all pairs x, y's,
00:15:01.260 | computing my gradient, taking a step
00:15:03.420 | in the opposite direction, and then
00:15:05.280 | doing that several times.
00:15:07.620 | So that's how stochastic gradient descent works.
00:15:10.620 | And that's essentially the learning procedure.
00:15:12.540 | It's represented by this procedure.
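A compact sketch of that procedure; grad_loss and grad_regularizer are hypothetical stand-ins for the gradient computations discussed next (backpropagation and the regularizer gradient), and the hyperparameter values are arbitrary.

```python
def sgd(theta, data, grad_loss, grad_regularizer, alpha=0.01, lam=1e-4, n_epochs=10):
    """theta: list of parameter arrays (all the weights and biases).
    data: list of (x, y) training pairs.
    grad_loss(theta, x, y) and grad_regularizer(theta) must each return a list
    of arrays with the same shapes as theta (assumed interfaces)."""
    for epoch in range(n_epochs):                       # one epoch = one full pass over the data
        for x, y in data:
            g_loss = grad_loss(theta, x, y)
            g_reg = grad_regularizer(theta)
            for i in range(len(theta)):
                delta = -(g_loss[i] + lam * g_reg[i])   # descent direction: minus the gradient
                theta[i] = theta[i] + alpha * delta     # step of size alpha (the learning rate)
    return theta
```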
00:15:16.100 | So in this algorithm, there are a few things
00:15:17.900 | we need to specify to be able to implement it and execute it.
00:15:20.860 | We need a loss function, a choice for the loss function.
00:15:23.620 | We need a procedure that's efficient for computing
00:15:26.860 | the gradient of the loss with respect to my parameters.
00:15:30.620 | We need to choose a regularizer if we want one.
00:15:33.300 | And we need a way of initializing my parameters.
00:15:35.940 | So next, what I'll do is I'll go through each
00:15:37.960 | of these four different things we
00:15:39.780 | need to choose before actually being
00:15:41.700 | able to execute stochastic gradient descent.
00:15:45.660 | So first, the loss function.
00:15:48.060 | So as I said, we will interpret the output layer
00:15:50.980 | as assigning probabilities to each potential class in which
00:15:53.900 | I can classify my input x.
00:15:57.460 | Well, in this case, something that would be natural
00:15:59.780 | is to try to maximize the probability
00:16:02.120 | of the correct class, the actual class to which my example
00:16:05.500 | x t belongs.
00:16:06.340 | I'd like to increase the value of the probability assigned
00:16:09.900 | computed by my neural network.
00:16:12.820 | And so because we set up the problem in which we
00:16:16.380 | have a loss that we minimize, instead
00:16:18.700 | of maximizing the probability, what we'll actually do
00:16:20.980 | is minimize the negative of the actual log probability,
00:16:25.380 | so the log likelihood of assigning x
00:16:28.300 | to the correct class y.
00:16:30.340 | So this is represented here.
00:16:32.100 | So given my output layer and the true label y,
00:16:35.420 | my loss will be minus the log of the probability of y
00:16:40.820 | according to my neural net.
00:16:42.140 | And that would be, well, take my output layer
00:16:44.980 | and look at the unit, so index the unit corresponding
00:16:48.500 | to the correct class.
00:16:50.020 | So that's why I'm indexing by y here.
00:16:53.500 | We take the log because numerically it
00:16:55.740 | turns out to be more stable.
00:16:56.940 | We get nicer-looking gradients.
00:16:59.180 | And sometimes in certain software libraries,
00:17:01.340 | you'll see, instead of talking about the negative log
00:17:03.460 | likelihood or log probability, you'll
00:17:05.020 | see it referred to as the cross-entropy.
00:17:07.980 | And that's because you can think of this
00:17:11.820 | as performing a sum over all possible classes.
00:17:15.620 | And then for each class, checking, well,
00:17:17.540 | is this potential class the target class?
00:17:20.820 | So I have an indicator function that is 1 if y is equal to c,
00:17:24.660 | so if my iterator class c is actually
00:17:28.100 | equal to the real class.
00:17:29.700 | I'm going to multiply that by the log of the probability
00:17:33.140 | actually assigned to that class c.
00:17:35.660 | And this function here, so this expression here,
00:17:39.380 | is like a cross-entropy between the empirical distribution,
00:17:42.900 | which assigns zero probability to all the other classes,
00:17:46.100 | but a probability of 1 to the correct class,
00:17:48.500 | and the actual distribution over classes
00:17:50.900 | that my neural net is computing, which is f of x.
00:17:54.660 | That's just a technical detail.
00:17:56.060 | You can just think about this.
00:17:57.540 | Here, I only mention it because in certain libraries,
00:17:59.700 | it's actually mentioned as the cross-entropy loss.
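A small sketch of that loss, assuming f_x is the vector of output-layer probabilities for one example and y is the integer index of the correct class:

```python
import numpy as np

def nll_loss(f_x, y):
    """Negative log-likelihood (cross-entropy) of the correct class y."""
    return -np.log(f_x[y])

f_x = np.array([0.1, 0.7, 0.2])    # softmax output of the network
print(nll_loss(f_x, y=1))          # small loss: the correct class already gets high probability
print(nll_loss(f_x, y=0))          # larger loss when the correct class gets low probability
```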
00:18:03.740 | So that's for the loss.
00:18:06.060 | Then we need also a procedure for computing
00:18:08.260 | what is the gradient of my loss with respect
00:18:10.460 | to all of my parameters in my neural net,
00:18:12.620 | so the biases and the weights.
00:18:15.820 | You can go look at my videos if you
00:18:17.300 | want the actual derivation of all the details for all
00:18:20.220 | of these different expressions.
00:18:21.780 | I don't have time for that, so all I'll do--
00:18:23.740 | and presumably, a lot of you actually
00:18:26.300 | have seen these derivations.
00:18:28.620 | If you haven't, just go check out the videos.
00:18:30.700 | In any case, I'm going to go through what the algorithm is.
00:18:33.980 | I'm going to highlight some of the key points
00:18:35.900 | that will come up later in understanding how actually
00:18:39.220 | backpropagation functions.
00:18:41.380 | So the basic idea is that we'll compute gradients
00:18:44.820 | by exploiting the chain rule.
00:18:46.420 | And we'll go from the top layer all the way to the bottom,
00:18:50.220 | computing gradients for layers that are closer and closer
00:18:53.820 | to the input as we go, and exploiting the chain rule
00:18:56.540 | to exploit or reuse previous computations we've
00:18:59.820 | made at upper layers to compute the gradients at the layers
00:19:03.460 | below.
00:19:04.980 | So we usually start by computing what is the gradient
00:19:08.140 | at the output layer.
00:19:09.300 | So what's the gradient of my loss with respect
00:19:12.460 | to my output layer?
00:19:13.820 | And actually, it's more convenient
00:19:15.380 | to compute the loss with respect to the pre-activation.
00:19:18.340 | It's actually a very simple expression.
00:19:21.140 | So that's why I have the gradient of this vector,
00:19:24.180 | a l plus 1.
00:19:25.140 | That's the pre-activation at the very last layer of the loss
00:19:28.980 | function, which is minus the log f of x, y.
00:19:32.460 | And it turns out this gradient is super simple.
00:19:35.060 | It's minus E of y.
00:19:37.100 | So that's the one-hot vector for class y.
00:19:40.180 | So what this means is E of y is just a vector filled
00:19:43.820 | with a bunch of 0's and then the 1 at the correct class.
00:19:47.860 | So if y was the fourth class, then in this case,
00:19:51.100 | it would be this vector, where I have a 1 at the fourth
00:19:53.260 | dimension.
00:19:54.740 | So E of y is just a vector.
00:19:56.580 | We call it the one-hot vector full of 0's.
00:19:58.700 | And the single 1 at the position corresponding
00:20:01.620 | to the correct class.
00:20:03.540 | So what this part of the gradient
00:20:05.140 | is essentially saying is that I'm going to increase--
00:20:07.620 | I want to increase the probability
00:20:09.460 | of the correct class.
00:20:10.580 | I want to increase the pre-activation, which
00:20:12.520 | will increase the probability of the correct class.
00:20:15.460 | And I'm going to subtract what is the current probabilities
00:20:18.660 | assigned by my neural net to all of the classes.
00:20:21.860 | So f of x, that's my output layer.
00:20:23.580 | And that's the current belief of the neural net
00:20:26.420 | as to the probability of assigning
00:20:30.100 | the input to each class.
00:20:31.740 | So what this is doing is essentially
00:20:33.740 | trying to decrease the probability of everything
00:20:36.260 | and specifically decrease it as much as the neural net currently
00:20:39.900 | believes that the input belongs to it.
00:20:42.900 | And so if you think about the subtraction of these two
00:20:45.420 | things, well, for the class that's the correct class,
00:20:48.260 | I'm going to have 1 minus some number between 0 and 1,
00:20:51.140 | because it's a probability.
00:20:52.420 | So that's going to be positive.
00:20:53.700 | So I'm going to increase the probability
00:20:55.380 | of the correct class.
00:20:56.500 | And for everything else, it's going
00:20:58.000 | to be 0 minus a positive number.
00:20:59.820 | So it's going to be negative.
00:21:01.180 | So I'm actually going to decrease the probability
00:21:03.180 | of everything else.
00:21:04.180 | So intuitively, it makes sense.
00:21:05.620 | This gradient has the right behavior.
00:21:08.740 | And I'm going to take that pre-activation gradient.
00:21:11.580 | I'm going to propagate it from the top to the bottom
00:21:15.340 | and essentially iterating from the last layer, which
00:21:19.740 | is the output layer, L plus 1, all the way down
00:21:22.380 | to the first layer.
00:21:23.980 | And as I'm going down, I'm going to compute the gradients
00:21:26.780 | with respect to my parameters and then
00:21:28.500 | compute what's the gradient for the pre-activation
00:21:31.420 | at the layer below and then iterate like that.
00:21:34.580 | So at each iteration of that loop,
00:21:38.180 | I take what is the current gradient of the loss function
00:21:42.900 | with respect to the pre-activation
00:21:44.420 | at the current layer.
00:21:46.220 | And I can compute the gradient of the loss function
00:21:49.380 | with respect to my weight matrix.
00:21:51.500 | So not doing the derivation here,
00:21:54.460 | it's actually simply this vector.
00:21:58.180 | So in my notation, I assume that all the vectors
00:22:00.580 | are column vectors.
00:22:02.220 | So this pre-activation gradient vector,
00:22:05.180 | and I multiply it by the transpose of the activations,
00:22:09.020 | so the value of the layer right below, the layer k minus 1.
00:22:14.540 | So because I take the transpose, that's
00:22:16.100 | a multiplication like this.
00:22:17.380 | And you can see if I do the outer product,
00:22:19.180 | essentially, between these two vectors,
00:22:20.800 | I'm going to get a matrix of the same size as my weight matrix.
00:22:24.260 | So it all checks out.
00:22:25.940 | That makes sense.
00:22:27.500 | Turns out that the gradient of the loss with respect
00:22:29.680 | to the bias is exactly the gradient
00:22:31.940 | of the loss with respect to the pre-activation.
00:22:34.700 | So that's very simple.
00:22:36.220 | So that gives me now my gradients for my parameters.
00:22:38.700 | Now I need to compute, OK, what is
00:22:40.420 | going to be the gradient of the pre-activations
00:22:42.660 | at the layer below?
00:22:44.700 | Well, first, I'm going to get the gradient of the loss
00:22:48.980 | function with respect to the activation at the layer below.
00:22:54.220 | Well, that's just taking my pre-activation gradient vector
00:22:57.980 | and multiplying it by--
00:22:59.940 | for some reason, it doesn't show here on the slide--
00:23:01.780 | the transpose of my weight matrix.
00:23:04.660 | Super simple operation, just a linear transformation
00:23:07.580 | of my gradients at layer k, linear and transformed
00:23:10.580 | to get my gradients of the activation at the layer k
00:23:13.780 | minus 1.
00:23:15.180 | And then to get the gradients of the pre-activation,
00:23:17.900 | so before the activation function,
00:23:21.020 | I'm going to take this gradient here,
00:23:22.940 | which is the gradient of the activation function
00:23:25.540 | at the layer k minus 1.
00:23:27.220 | And then I apply the gradient corresponding
00:23:29.900 | to the partial derivative of my nonlinear activation function.
00:23:33.700 | So this here, this refers to an element-wise product.
00:23:37.060 | So I'm taking these two vectors, this vector here
00:23:39.860 | and this vector here.
00:23:40.740 | I'm going to do an element-wise product between the two.
00:23:43.740 | And this vector here is just the partial derivative
00:23:47.020 | of the activation function for each unit
00:23:49.620 | individually that I've put together into a vector.
00:23:52.700 | This is what this corresponds to.
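Putting those update rules together, here is a hedged sketch of one backward pass for a small network with sigmoid hidden units, a softmax output, and the negative log-likelihood loss; the variable names are mine, and the expressions follow the ones just described.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, y, weights, biases):
    """Return gradients of -log f(x)_y with respect to every W and b.

    weights[k] and biases[k] parameterize layer k+1; sigmoid hidden units, softmax output."""
    # Forward pass, storing the activations h^(0), ..., h^(L).
    hs = [x]
    for W, b in zip(weights[:-1], biases[:-1]):
        hs.append(sigmoid(b + W @ hs[-1]))
    a_out = biases[-1] + weights[-1] @ hs[-1]
    e = np.exp(a_out - np.max(a_out))
    f_x = e / e.sum()

    # Gradient at the output pre-activation: -(e(y) - f(x)), i.e. f(x) minus the one-hot vector.
    grad_a = f_x.copy()
    grad_a[y] -= 1.0

    grad_W, grad_b = [], []
    for k in reversed(range(len(weights))):
        grad_W.insert(0, np.outer(grad_a, hs[k]))    # pre-activation gradient times h^(k-1) transposed
        grad_b.insert(0, grad_a.copy())              # same as the pre-activation gradient
        if k > 0:
            grad_h = weights[k].T @ grad_a           # gradient w.r.t. the activation below
            grad_a = grad_h * hs[k] * (1.0 - hs[k])  # element-wise product with sigmoid'(a) = h(1 - h)
    return grad_W, grad_b
```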
00:23:55.180 | Now, the first key thing to notice is
00:23:57.060 | that this pass, computing all the gradients
00:23:59.580 | and doing all these iterations, is actually fairly cheap.
00:24:02.820 | The complexity is essentially the same as
00:24:05.540 | that of doing a forward pass.
00:24:07.700 | So all I'm doing are linear transformations
00:24:11.180 | multiplying by matrices, in this case, the transpose of my weight
00:24:14.220 | matrix.
00:24:15.060 | And then I'm also doing this nonlinear operation
00:24:17.620 | where I'm multiplying by the gradient of the activation
00:24:20.120 | function.
00:24:21.020 | So that's the first thing to notice.
00:24:22.860 | And the second thing to notice is
00:24:24.220 | that here I'm doing this element-wise product.
00:24:27.420 | So if any of these terms here for a unit
00:24:30.060 | is very close to 0, then the pre-activation gradient
00:24:33.900 | is going to be 0 for the next layer.
00:24:36.460 | And I highlight this point because essentially whenever--
00:24:39.740 | that's something to think about a lot when
00:24:41.500 | you're training neural nets.
00:24:42.860 | Whenever this gradient here, these partial derivatives,
00:24:46.060 | come close to 0, then it means the gradient will not
00:24:48.660 | propagate well to the next layer, which
00:24:50.420 | means that you're not going to get a good gradient to update
00:24:53.120 | your parameters.
00:24:54.980 | Now, when does that happen?
00:24:56.380 | When will you see these terms here being close to 0?
00:24:59.380 | Well, that's going to be when the partial derivatives
00:25:01.580 | of these nonlinear activation functions
00:25:03.700 | are close to 0 or 0.
00:25:05.780 | So we can look at the partial derivatives, say,
00:25:08.160 | of the sigmoid function.
00:25:10.140 | It turns out it's super easy to compute.
00:25:12.460 | It's just the sigmoid itself times 1
00:25:15.340 | minus the sigmoid itself.
00:25:18.100 | So that means that whenever the activation
00:25:20.160 | of the unit for a sigmoid unit is close to 1 or close to 0,
00:25:23.500 | I essentially get a partial derivative that's close to 0.
00:25:27.540 | You can kind of see it here.
00:25:28.780 | The slope here is essentially flat,
00:25:30.380 | and the slope here is flat.
00:25:31.700 | That's the value of the partial derivative.
00:25:35.260 | So in other words, if my pre-activations
00:25:37.940 | are very negative or very positive,
00:25:39.860 | or if my unit is very saturated, then gradients
00:25:42.940 | will have a hard time propagating to the next layer.
00:25:46.500 | That's the key insight here.
00:25:49.020 | Same thing for the tanh function.
00:25:51.820 | So it turns out the partial derivative
00:25:53.500 | is also easy to compute.
00:25:54.940 | You just take the tanh value, square it,
00:25:57.580 | and you're going to subtract it from 1.
00:25:59.980 | And indeed, if it's close to minus 1 or close to 1,
00:26:04.420 | you can see that the slope is flat.
00:26:07.180 | So again, if the unit is saturating,
00:26:09.420 | gradients will have a hard time propagating
00:26:11.900 | to the next layers.
00:26:14.140 | And for the ReLU, the rectified linear activation function,
00:26:18.140 | the gradient is even simpler.
00:26:21.180 | You just check whether the pre-activation is greater than 0.
00:26:24.080 | If it is, the partial derivative is 1.
00:26:26.300 | If it's not, it's 0.
00:26:27.780 | So actually, you're going to multiply by 1 or 0.
00:26:30.020 | You essentially get a binary mask
00:26:31.760 | when you're performing the propagation through the ReLU.
00:26:35.220 | And you can see it.
00:26:36.180 | The slope here is flat, and otherwise, you
00:26:38.100 | have a linear function.
00:26:40.020 | So actually, here, the shrinking of the gradient towards 0
00:26:43.940 | is even harder.
00:26:44.860 | It's exactly multiplying by 0 if you have
00:26:47.780 | a unit that's saturating below.
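For reference, the three partial derivatives just described, written out as a small sketch:

```python
import numpy as np

def sigmoid_prime(h):
    # written in terms of the activation itself: sigm(a) * (1 - sigm(a))
    return h * (1.0 - h)

def tanh_prime(h):
    # 1 - tanh(a)^2
    return 1.0 - h ** 2

def relu_prime(a):
    # 1 where the pre-activation is positive, 0 elsewhere: a binary mask
    return (a > 0).astype(float)

print(sigmoid_prime(np.array([0.01, 0.5, 0.99])))   # near-saturated units give near-zero gradients
print(tanh_prime(np.array([-0.99, 0.0, 0.99])))
print(relu_prime(np.array([-2.0, 0.5, 3.0])))
```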
00:26:52.140 | And beyond all the math, in terms of actually using those
00:26:56.820 | in practice, during the weekend, you'll
00:26:58.800 | see three different libraries that essentially allow you
00:27:01.780 | to compute these gradients for you.
00:27:03.220 | You actually usually don't write down backprop.
00:27:06.000 | You just use all of these modules
00:27:08.020 | that you've implemented.
00:27:09.180 | And it turns out there's a way of automatically differentiating
00:27:13.380 | your loss function and getting gradients for free
00:27:16.220 | in terms of effort, in terms of programming effort,
00:27:19.460 | with respect to your parameters.
00:27:21.500 | So conceptually, the way you do this--
00:27:23.940 | and you'll see essentially three different libraries doing it
00:27:26.460 | in slightly different ways.
00:27:28.580 | What you do is you augment your flow graph
00:27:31.340 | by adding, at the very end, the computation of your loss
00:27:34.420 | function.
00:27:35.600 | And then each of these boxes, which are conceptually
00:27:38.020 | objects that are taking arguments and computing
00:27:40.540 | a value, you're going to augment them to also have a method
00:27:45.420 | that's a backprop or a bprop method.
00:27:47.600 | You'll often see, actually, this expression being used, bprop.
00:27:50.660 | And what this method should do is
00:27:52.980 | that it should take as input, what
00:27:54.500 | is the gradient of the loss with respect to myself?
00:27:57.300 | And then it should propagate to its arguments--
00:28:00.340 | so its parents in the flow graph,
00:28:02.780 | the things it takes to compute its own value--
00:28:04.780 | it's going to propagate to them, using the chain rule, what
00:28:07.100 | their gradients are with respect to the loss.
00:28:10.780 | So what this means is that you would start the process
00:28:14.220 | by initializing, well, the gradient of the loss
00:28:16.860 | with respect to itself is 1.
00:28:19.360 | And then you pass the bprop method here 1.
00:28:22.260 | And then it's going to propagate to its argument,
00:28:26.420 | by using the chain rule, what
00:28:28.240 | is the gradient of the loss with respect to f of x.
00:28:32.040 | And then you're going to call bprop on this object here.
00:28:34.800 | And it's going to compute, well, I
00:28:36.220 | have the gradient of the loss with respect to myself,
00:28:38.460 | f of x.
00:28:39.300 | From this, I can compute what's the gradient of my argument,
00:28:42.700 | which is the pre-activation at layer 2,
00:28:45.140 | with respect to the loss.
00:28:46.580 | So I'm going to reuse the computation I just
00:28:48.380 | got and update it using my--
00:28:50.860 | what is essentially the Jacobian.
00:28:53.020 | And then I'm going to take the pre-activation here,
00:28:55.140 | which now knows what is the gradient of the loss with
00:28:57.340 | respect to itself, the pre-activation.
00:28:59.180 | It's going to propagate to the weights and the biases
00:29:02.020 | and the layer below, updating them with--
00:29:04.200 | informing them of what is the gradient of the loss
00:29:06.280 | with respect to themselves.
00:29:07.420 | And you continue like this, essentially
00:29:09.080 | going through the flow graph, but in the opposite direction.
00:29:12.880 | So the library torch, the basic library torch,
00:29:15.400 | essentially functions like this quite explicitly.
00:29:18.440 | You construct-- you chain these elements together.
00:29:21.000 | And then when you're performing backpropagation,
00:29:23.000 | you're going in the reverse order
00:29:24.340 | of these chained elements.
00:29:25.880 | And then you have libraries like torch-autograd and Theano
00:29:29.280 | and TensorFlow, which you'll learn about,
00:29:31.240 | which are doing things slightly more sophisticated there.
00:29:34.340 | And you'll learn about that later on.
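A minimal sketch of that fprop/bprop idea for a single linear node; the class and method names are illustrative only and do not reproduce the actual interface of Torch, torch-autograd, Theano, or TensorFlow.

```python
import numpy as np

class LinearNode:
    """Flow-graph node computing value = b + W @ h, with a bprop method."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, h):
        self.h = h                        # remember the argument for the backward pass
        return self.b + self.W @ h
    def bprop(self, grad_value):
        """Given dLoss/dvalue, store the parameter gradients and return dLoss/dh."""
        self.grad_W = np.outer(grad_value, self.h)
        self.grad_b = grad_value
        return self.W.T @ grad_value      # chain rule: propagate the gradient to the parent argument

rng = np.random.default_rng(0)
node = LinearNode(rng.normal(size=(2, 3)), np.zeros(2))
out = node.fprop(rng.normal(size=3))
grad_h = node.bprop(np.ones(2))           # pretend dLoss/dvalue is all ones
print(out, grad_h)
```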
00:29:38.540 | OK, so that's a discussion of how you actually
00:29:40.940 | compute gradients of the loss with respect to the parameters.
00:29:44.620 | So that's another component we need in stochastic gradient
00:29:47.540 | descent.
00:29:48.640 | We can choose a regularizer.
00:29:50.100 | One that's often used is the L2 regularization.
00:29:53.500 | So that's just the sum of the squared of all the weights.
00:29:57.100 | And the gradient of that is just twice the weight.
00:30:01.280 | So it's a super simple gradient to compute.
00:30:03.400 | We usually don't regularize the biases.
00:30:06.840 | There's no particularly important reason for that.
00:30:10.960 | There are much fewer biases, so it seems less important.
00:30:15.200 | And this L2 regularization
00:30:17.020 | is often referred to as weight decay.
00:30:18.760 | So if you hear about weight decay,
00:30:20.320 | that often refers to L2 regularization.
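For completeness, a short sketch of the L2 regularizer and its gradient, with the biases left out as noted above:

```python
import numpy as np

def l2_regularizer(weights):
    # sum of the squared weights over all weight matrices (no biases)
    return sum(np.sum(W ** 2) for W in weights)

def l2_gradient(weights):
    # the gradient is just 2 W for each weight matrix
    return [2.0 * W for W in weights]
```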
00:30:24.400 | And then finally, and this is also a very important point,
00:30:28.660 | you have to initialize the parameters before you actually
00:30:31.080 | start doing backprop.
00:30:32.120 | And there are a few tricky cases you
00:30:34.160 | need to make sure that you don't fall into.
00:30:37.640 | So the biases, often we initialize them to 0.
00:30:40.600 | There are certain exceptions, but for the most part,
00:30:42.880 | we initialize them to 0.
00:30:44.720 | But for the weights, there are a few things we can't do.
00:30:47.600 | So we can't initialize the weights to 0,
00:30:50.200 | and especially if you have tanh activations.
00:30:53.960 | The reason-- and I won't explain it here,
00:30:56.160 | but it's not a bad exercise to try to figure out why--
00:30:59.240 | is that essentially, when you do your first pass,
00:31:02.100 | you're going to get gradients for all your parameters that
00:31:04.520 | are going to be 0.
00:31:05.920 | So you're going to be stuck at this 0 initialization.
00:31:08.960 | So we can't do that.
00:31:11.280 | We also can't initialize all the weights
00:31:13.160 | to exactly the same value.
00:31:16.440 | Again, you think about it a little bit.
00:31:18.720 | What's going to happen is essentially
00:31:20.600 | that all the weights coming into a unit within the layer
00:31:24.720 | are going to have exactly the same gradients, which
00:31:27.640 | means they're going to be updated exactly the same way,
00:31:30.000 | which means they're going to stay the same
00:31:34.360 | the whole time.
00:31:35.080 | So it's as if you have multiple copies of the same unit.
00:31:38.320 | So you essentially have to break that initial symmetry
00:31:40.920 | that you would create if you initialized everything
00:31:43.080 | to the same value.
00:31:44.520 | So what we end up doing most of the time
00:31:46.260 | is initialize the weights to some randomly generated value.
00:31:50.480 | Often, we generate them--
00:31:52.080 | there are a few other recipes, but one of them
00:31:54.120 | is to initialize them from some uniform distribution
00:31:56.480 | between lower and upper bound.
00:31:59.360 | This is a recipe here that is often
00:32:01.440 | used that has some theoretical grounding that was derived
00:32:05.440 | specifically for the tanh.
00:32:06.880 | There's this paper here by Xavier Glorot and Yoshua
00:32:09.880 | Bengio you can check out for some intuition as to how
00:32:13.120 | you should initialize the weights.
00:32:14.460 | But essentially, they should be initially random,
00:32:17.100 | and they should be initially close to 0.
00:32:19.320 | Random to break symmetry, and close to 0
00:32:23.160 | so that initially the units are not already saturated.
00:32:27.040 | Because if the units are saturated,
00:32:28.680 | then there are no gradients that are
00:32:29.800 | going to pass through the units.
00:32:31.280 | You're essentially going to get gradients very close to 0
00:32:33.680 | at the lower layers.
00:32:35.120 | So that's the main intuition, is to have weights
00:32:37.360 | that are small and random.
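A sketch of that kind of initialization, using a uniform recipe in the spirit of the Glorot and Bengio paper; the exact bound below is one common form of it, so treat it as an assumption rather than the talk's exact formula.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Small random weights (to break symmetry and avoid saturation), zero biases."""
    bound = np.sqrt(6.0 / (n_in + n_out))            # uniform bound in the Glorot/Bengio style
    W = rng.uniform(-bound, bound, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

rng = np.random.default_rng(0)
sizes = [784, 500, 500, 10]                          # illustrative layer sizes
params = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
```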
00:32:40.200 | So those are all the pieces we need
00:32:45.360 | for running stochastic gradient descent.
00:32:47.200 | So that allows us to take a training set
00:32:48.860 | and run a certain number of epochs,
00:32:50.720 | and have the neural net learn from that training set.
00:32:53.920 | Now, there are other quantities in our neural network
00:32:57.120 | that we haven't specified how to choose them.
00:32:59.240 | So those are the hyperparameters.
00:33:02.560 | So usually, we're going to have a separate validation set.
00:33:05.240 | Most people here are familiar with machine learning,
00:33:06.880 | so that's a typical procedure.
00:33:08.440 | And then we need to select things like, OK,
00:33:10.320 | how many layers do I want?
00:33:11.560 | How many units per layer do I want?
00:33:14.200 | What's the step size, the learning rate
00:33:16.120 | of my stochastic gradient descent procedure,
00:33:17.920 | that alpha number?
00:33:19.540 | What is the weight decay that I'm going to use?
00:33:22.300 | So a standard thing in machine learning
00:33:24.340 | is to perform a grid search.
00:33:27.380 | That is, if I have two hyperparameters,
00:33:29.340 | I list out a bunch of values I want to try.
00:33:31.320 | So for the number of hidden units,
00:33:32.780 | maybe I want to try 100, 1,000, and 2,000, say.
00:33:36.940 | And then for the learning rate, maybe I
00:33:38.540 | want to try 0.01 and 0.001.
00:33:42.420 | So a grid search would just try all combinations
00:33:44.580 | of these three values for the hidden units
00:33:46.900 | and these two values for the learning rates.
00:33:49.820 | So that means that the more hyperparameters there are,
00:33:53.340 | the number of configurations you have to try out blows up
00:33:57.420 | and grows exponentially.
00:33:59.620 | So another procedure that is now more and more common,
00:34:03.180 | which is more practical, is to perform a form of random search.
00:34:07.580 | In this case, what you do is for each parameter,
00:34:10.020 | you actually determine a distribution of likely values
00:34:13.460 | you'd like to try.
00:34:14.220 | So it could be--
00:34:16.100 | so for the number of hidden units,
00:34:17.560 | maybe I do a uniform distribution
00:34:19.500 | over all integers from 100 to 1,000, say,
00:34:22.700 | or maybe a log uniform distribution.
00:34:25.220 | And for the learning rate, maybe, again,
00:34:26.920 | the log uniform distribution, but from 0.001 to 0.01, say.
00:34:32.840 | And then to get an experiment, so
00:34:35.260 | to get values for my hyperparameters
00:34:37.060 | to do an experiment with and get a performance on my validation
00:34:39.940 | set, I just independently sample from these distributions
00:34:43.140 | for each hyperparameter to get a full configuration
00:34:46.460 | for my experiment.
00:34:47.820 | And then because I have this way of getting one experiment,
00:34:50.720 | I do it independently for all of my jobs, all of my experiments
00:34:53.740 | that I will do.
00:34:54.620 | So in this case, if I know I have enough compute power
00:34:58.020 | to do 50 experiments, I just sample 50 independent samples
00:35:02.120 | from these distributions for hyperparameters,
00:35:04.100 | perform these 50 experiments, and I just take the best one.
00:35:07.860 | What's nice about it is that there are no--
00:35:09.880 | unlike grid search, there are never any holes in the grid.
00:35:12.620 | That is, you just specify how many experiments you do.
00:35:15.260 | If one of your jobs died, well, you just have one less.
00:35:18.460 | But there's no hole in your experiment.
00:35:21.900 | And also, one reason why it's particularly useful,
00:35:24.700 | this approach, is that if you have a specific value in grid
00:35:29.380 | search for one of the hyperparameters that just makes
00:35:32.100 | the experiment not work at all-- so learning rates
00:35:35.100 | are a lot like this.
00:35:36.300 | If you have a learning rate that's too high,
00:35:38.620 | it's quite possible that the optimization
00:35:42.180 | will not converge.
00:35:43.540 | Well, if you're using a grid search,
00:35:45.180 | it means that for all the experiments that
00:35:47.060 | use that specific value of the learning rate,
00:35:48.980 | they're all going to be garbage.
00:35:50.460 | They're all not going to be useful.
00:35:52.340 | And you don't really get this sort of big waste of computation
00:35:56.220 | if you do a random search, because most likely,
00:35:58.540 | all the values of your hyperparameters
00:36:00.060 | are going to be unique, because they're
00:36:01.660 | samples, say, from a uniform distribution over some range.
00:36:06.140 | So that actually works quite well,
00:36:08.620 | and it's quite recommended.
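A sketch of random search as described: sample each hyperparameter independently from its own distribution, once per experiment. The ranges below mirror the examples given in the talk.

```python
import numpy as np

def sample_config(rng):
    """Draw one hyperparameter configuration for one experiment."""
    return {
        "n_hidden": int(rng.integers(100, 1001)),          # uniform over 100..1000 hidden units
        "learning_rate": float(10 ** rng.uniform(-3, -2)), # log-uniform between 0.001 and 0.01
    }

rng = np.random.default_rng(0)
configs = [sample_config(rng) for _ in range(50)]          # 50 independent experiments
# Run one training job per configuration, then keep the one with the best validation error.
print(configs[0])
```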
00:36:10.580 | And there are more advanced methods,
00:36:12.420 | like methods based on machine learning, Bayesian
00:36:15.220 | optimization, or sometimes known as sequential model-based
00:36:18.460 | optimization, that I won't talk about,
00:36:21.300 | but that works a bit better than random search.
00:36:25.740 | And that's another alternative if you
00:36:27.780 | think you have an issue finding good hyperparameters,
00:36:29.940 | is to investigate some of these more advanced methods.
00:36:34.380 | Now, you do this for most of your hyperparameters,
00:36:37.180 | but for the number of epochs, the number of times
00:36:40.060 | you go through all of your examples in your training set,
00:36:44.700 | what we usually do is not grid search or random search,
00:36:49.020 | but we use a thing known as early stopping.
00:36:51.580 | The idea here is that if I've trained
00:36:53.460 | a neural net for 10 epochs, then training a neural net
00:36:56.700 | with all the other hyperparameters kept constant,
00:36:59.340 | but for one more epoch, is easy.
00:37:01.300 | I just do one more epoch.
00:37:02.820 | So I shouldn't start over and then do, say,
00:37:06.340 | 11 epochs from scratch.
00:37:08.580 | And so what we would do is we would just
00:37:10.300 | track what is the performance on the validation set
00:37:12.820 | as I do more and more epochs.
00:37:14.740 | And what we will typically see is the training error
00:37:17.100 | will go down, while the validation set error will go down
00:37:20.980 | at first and eventually go back up.
00:37:22.900 | The intuition here is that the gap
00:37:25.080 | between the performance on the training set
00:37:27.420 | and the performance on the validation set
00:37:29.120 | will tend to increase.
00:37:31.260 | And since the training curve cannot go below, usually,
00:37:34.460 | some bound, then eventually the validation set performance
00:37:38.280 | has to go up.
00:37:39.820 | Sometimes it won't necessarily go up,
00:37:41.320 | but it sort of stays stable.
00:37:42.700 | So with early stopping, what we do
00:37:44.080 | is that if we reach a point where the validation set
00:37:46.260 | performance hasn't improved for some certain number
00:37:49.020 | of iterations, which we refer to as the look ahead,
00:37:52.340 | we just stop.
00:37:53.340 | We go back to the neural net that
00:37:54.660 | had the best performance overall in the validation set,
00:37:56.960 | and that's my neural network.
00:37:58.700 | So I have now a very cheap way of actually getting
00:38:01.460 | the number of iterations or the number of epochs
00:38:03.960 | over my training set.
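A sketch of early stopping with a look-ahead (often called patience); train_one_epoch and validation_error are hypothetical stand-ins for your own training and evaluation code, and the model is assumed to be snapshottable with copy().

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              lookahead=10, max_epochs=1000):
    best_error, best_epoch, best_state = float("inf"), 0, None
    for epoch in range(max_epochs):
        train_one_epoch(model)                    # one more pass over the training set
        err = validation_error(model)
        if err < best_error:                      # keep the best model seen so far
            best_error, best_epoch, best_state = err, epoch, model.copy()
        elif epoch - best_epoch >= lookahead:     # no improvement for `lookahead` epochs: stop
            break
    return best_state, best_error
```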
00:38:07.220 | A few more tricks of the trade.
00:38:09.500 | So it's always useful to normalize your data.
00:38:13.060 | It will often have the effect of speeding up training.
00:38:16.780 | If you have real-valued data, you normalize it; for binary data,
00:38:19.340 | you usually keep it as it is.
00:38:21.940 | So what I mean by that is just subtract for each dimension
00:38:24.940 | what is the average in the training set of that dimension,
00:38:27.700 | and then dividing by the standard deviation
00:38:29.660 | of each dimension again in my input space.
00:38:33.500 | So this can speed up training.
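Concretely, per-dimension standardization using statistics from the training set only might look like the sketch below; adding a small constant to the standard deviation to avoid dividing by zero is my own precaution.

```python
import numpy as np

def standardize(X_train, X_test):
    """Subtract the per-dimension training mean, divide by the per-dimension training std."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8
    return (X_train - mean) / std, (X_test - mean) / std
```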
00:38:35.980 | We often use a decay on the learning rate.
00:38:40.020 | There are a few methods for doing this.
00:38:41.660 | One that's very simple is to start with a large learning
00:38:45.300 | rate and then track the performance on the validation set.
00:38:48.140 | And once the validation set performance stops improving,
00:38:50.980 | you decrease your learning rate by some ratio.
00:38:53.060 | Maybe you divide it by 2.
00:38:54.780 | And then you continue training for some time.
00:38:56.980 | Hopefully, the validation set performance starts improving.
00:39:00.620 | And then at some point, it stops improving, and then you stop.
00:39:04.020 | Or you divide again by 2.
00:39:05.460 | So that sort of gives you an adaptive--
00:39:08.180 | using the validation set, an adaptive way
00:39:10.260 | of changing your learning rate.
00:39:11.700 | And that can, again, work better than having a very small
00:39:14.900 | learning rate and waiting for a longer time.
00:39:16.860 | So making very fast progress initially,
00:39:18.660 | and then slower progress towards the end.
00:39:20.500 | Also, I've described so far the approach
00:39:26.260 | for training neural nets that is based on a single example
00:39:30.740 | at a time.
00:39:31.300 | But in practice, we actually use what's called mini-batches.
00:39:33.800 | That is, we compute the loss function
00:39:36.100 | on a small subset of examples, say, 64, 128.
00:39:40.580 | And then we take the average of the loss of all these examples
00:39:43.580 | in that mini-batch.
00:39:45.020 | And that's actually-- we compute the gradient
00:39:47.260 | of this average loss on that mini-batch.
00:39:49.880 | The reason why we do this is that it turns out
00:39:53.100 | that you can very efficiently implement the forward pass
00:39:56.780 | over all of these 64, 128 examples in my mini-batch
00:40:01.260 | in one pass by, instead of doing vector matrix multiplications
00:40:05.300 | when we compute the pre-activations,
00:40:07.280 | doing matrix-matrix multiplications, which
00:40:09.760 | are faster than doing multiple matrix-vector multiplications.
00:40:13.880 | So in your code, often, there will
00:40:15.800 | be this other hyperparameter-- the number of examples
00:40:17.480 | in your mini-batch-- which is mostly optimized for speed
00:40:21.080 | in terms of how quickly training will proceed.
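A sketch of mini-batch iteration; with the examples stacked as rows of a matrix, the forward pass becomes matrix-matrix products instead of repeated matrix-vector products.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield shuffled mini-batches of rows of X (inputs) and Y (targets)."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[idx]

def batched_preactivation(H, W, b):
    # H has shape (batch_size, n_in), W has shape (n_out, n_in): one matrix-matrix product
    return H @ W.T + b

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(10, 4)), rng.integers(0, 2, size=10)
for xb, yb in iterate_minibatches(X, Y, batch_size=4, rng=rng):
    pass   # compute the average loss and its gradient on this mini-batch
```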
00:40:25.240 | Other things to improve optimization
00:40:27.200 | might be using a thing like momentum.
00:40:29.560 | That is, instead of using, as the descent direction,
00:40:33.320 | the gradient of the loss function,
00:40:35.220 | I'm actually going to track a descent direction, which
00:40:39.040 | I'm going to compute as the current gradient
00:40:41.160 | for my current example or mini-batch,
00:40:44.080 | plus some fraction of the previous update,
00:40:47.040 | the previous direction of update.
00:40:50.200 | And beta now is a hyperparameter you have to optimize.
00:40:52.640 | So what this does is, if all the update directions agree
00:40:56.840 | across multiple updates, then it will start picking up momentum
00:41:00.720 | and actually make bigger steps in those directions.
00:41:05.840 | And then there are multiple, even more advanced methods
00:41:08.760 | for adding adaptive types of learning rates.
00:41:12.560 | I mentioned them here very quickly,
00:41:14.060 | because you might see them in papers.
00:41:15.600 | There's a method known as AdaGrad,
00:41:17.480 | where the learning rate is actually
00:41:19.440 | scaled for each dimension, so for each weight
00:41:23.260 | and each bias.
00:41:24.440 | It's going to be scaled by what is the square root
00:41:28.800 | of the cumulative sum of the squared gradients.
00:41:31.920 | So what I track is I take my gradient vector at each step.
00:41:35.360 | I do an element-wise square of all the dimensions
00:41:39.120 | of my gradients, my gradient vector.
00:41:41.160 | And then I accumulate that in some variable
00:41:43.160 | that I'm noting as gamma here.
00:41:44.920 | And then for my descent direction, I take the gradient,
00:41:47.840 | and I do an element-wise division
00:41:50.080 | by the square root of this cumulative sum
00:41:52.920 | of squared gradients.
00:41:54.720 | There's also RMSProp, which is essentially like AdaGrad,
00:41:57.440 | but instead of doing a cumulative sum,
00:41:59.640 | we're going to do an exponential moving average.
00:42:02.080 | So we take the previous value times some factor
00:42:04.760 | plus 1 minus this factor times the current squared gradient.
00:42:08.960 | So that's RMSProp.
00:42:10.380 | And then there's Adam, which is essentially
00:42:12.720 | a combination of RMSProp with momentum, which
00:42:15.380 | is more involved.
00:42:16.200 | And I won't have time to describe it here,
00:42:17.960 | but that's another method that's often actually implemented
00:42:21.440 | in these different softwares and that people seem
00:42:24.400 | to use with a lot of success.
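Hedged sketches of the momentum and RMSProp updates for a single parameter array, written from the descriptions above; the epsilon in RMSProp is a common precaution against dividing by zero, and Adam (which roughly combines the two) is omitted.

```python
import numpy as np

def momentum_step(theta, grad, velocity, alpha=0.01, beta=0.9):
    """The direction accumulates a fraction beta of the previous update direction."""
    velocity = beta * velocity - grad
    return theta + alpha * velocity, velocity

def rmsprop_step(theta, grad, avg_sq, alpha=0.001, decay=0.9, eps=1e-8):
    """avg_sq is an exponential moving average of the element-wise squared gradients."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    return theta - alpha * grad / (np.sqrt(avg_sq) + eps), avg_sq
```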
00:42:28.120 | And finally, in terms of actually debugging
00:42:31.400 | your implementations--
00:42:33.160 | so for instance, if you're lucky,
00:42:35.280 | you can build your neural network
00:42:36.720 | without difficulty using the current tools that
00:42:38.680 | are available in Torch or TensorFlow or Theano.
00:42:41.240 | But maybe sometimes you actually have
00:42:42.840 | to implement certain gradients for a new module
00:42:45.640 | and a new box in your flow graph that
00:42:47.920 | isn't currently supported.
00:42:49.520 | If you do this, you should check that you've implemented
00:42:51.960 | your gradients correctly.
00:42:53.760 | And one way of doing that is to actually compare
00:42:56.560 | the gradients computed by your code
00:42:58.560 | with a finite difference estimate.
00:43:01.240 | So what you do is, for each parameter,
00:43:03.160 | you add some very small epsilon value, say 10 to the minus 6,
00:43:07.220 | and you compute what is the output of your module.
00:43:10.760 | And then you subtract the same thing,
00:43:12.360 | but where you've subtracted the small quantity,
00:43:15.600 | and then you divide by 2 epsilon.
00:43:17.400 | So if epsilon converges to 0, then you actually
00:43:20.520 | get the partial derivative.
00:43:21.960 | But if it's just small, it's going to be an approximation.
00:43:24.240 | And usually, this finite difference estimate
00:43:26.600 | will be very close to a correct implementation
00:43:29.520 | of the real gradient.
00:43:30.840 | So you should definitely do that if you've actually
00:43:33.380 | implemented some of the gradients in your code.
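A minimal sketch of that finite-difference check (the function here is just a stand-in; in practice f would be your module's scalar output or loss):

```python
import numpy as np

def f(theta):
    # Stand-in scalar function; replace with your module's output / loss.
    return np.sum(theta ** 2) + np.sin(theta[0])

def analytic_grad(theta):
    g = 2 * theta
    g[0] += np.cos(theta[0])
    return g

theta = np.array([0.3, -1.2, 2.0])
eps = 1e-6
num_grad = np.zeros_like(theta)

for i in range(theta.size):
    plus, minus = theta.copy(), theta.copy()
    plus[i] += eps
    minus[i] -= eps
    num_grad[i] = (f(plus) - f(minus)) / (2 * eps)     # centered finite difference

# The two estimates should agree to several decimal places
# if the analytic gradient is implemented correctly.
print(np.max(np.abs(num_grad - analytic_grad(theta))))
```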
00:43:36.160 | And then another useful thing to do
00:43:37.900 | is to actually do a very small experiment on a small data set
00:43:41.880 | before you actually run your full experiment
00:43:44.320 | on your complete data set.
00:43:45.880 | So use, say, 50 examples.
00:43:47.760 | So just taking a random subset of 50 examples
00:43:50.400 | from your data set.
00:43:52.000 | Actually, just make sure that your code can overfit
00:43:54.640 | to that data, can essentially classify it perfectly,
00:43:58.600 | given enough capacity that you would think it should get it.
00:44:03.040 | So if it's not the case, then there's a few things
00:44:06.300 | that you might want to investigate.
00:44:08.320 | Maybe your initialization is such
00:44:09.920 | that the units are already saturated initially,
00:44:12.620 | and so there's no actual optimization
00:44:14.800 | happening because some of the gradients on some of the weights
00:44:17.440 | are exactly zero.
00:44:19.040 | So you might want to check your initialization.
00:44:22.160 | Maybe you're using a model
00:44:23.760 | you implemented the gradients for yourself,
00:44:26.040 | and those gradients are not properly implemented.
00:44:28.920 | Maybe you haven't normalized your input, which
00:44:31.040 | creates some instability, making it
00:44:33.040 | harder for stochastic gradient descent to work successfully.
00:44:38.400 | Maybe your learning rate is too large.
00:44:40.080 | Then you should consider trying smaller learning rates.
00:44:42.840 | That's actually a pretty good way of getting
00:44:44.940 | some idea of the magnitude of the learning rate
00:44:47.640 | you should be using.
00:44:49.320 | And then once you actually overfit
00:44:51.680 | in your small training set, you're
00:44:53.060 | ready to do a full experiment on a larger data set.
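A rough sketch of that sanity check: a deliberately over-parameterized network trained on 50 random examples (everything here is illustrative, not the lecture's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# 50-example sanity-check set with random labels: a network with enough
# capacity should be able to classify it (nearly) perfectly.
X = rng.standard_normal((50, 20))
y = rng.integers(0, 2, size=50)

W1 = rng.standard_normal((20, 100)) * 0.1; b1 = np.zeros(100)   # large hidden layer on purpose
W2 = rng.standard_normal((100, 1)) * 0.1;  b2 = np.zeros(1)
lr = 0.5

for step in range(5000):
    h = np.tanh(X @ W1 + b1)                          # forward pass
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))          # sigmoid output
    dz2 = (p - y[:, None]) / len(X)                   # backprop through the cross-entropy loss
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = dz2 @ W2.T * (1 - h ** 2)
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1                    # gradient step
    W2 -= lr * dW2; b2 -= lr * db2

train_acc = np.mean((p[:, 0] > 0.5) == y)
# Training accuracy should climb toward 1.0; if it stalls, revisit the
# checks above: initialization, gradients, input scaling, learning rate.
print("training accuracy:", train_acc)
```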
00:44:56.640 | That said, this is not a replacement
00:44:58.880 | for gradient checking.
00:45:00.240 | So backprop and stochastic gradient descent,
00:45:03.440 | it's a great algorithm that's very bug resistant.
00:45:06.800 | You will potentially see some learning happening,
00:45:10.540 | even if some of your gradients are wrong,
00:45:12.400 | or say, exactly zero.
00:45:13.840 | So that's great if you're an engineer
00:45:16.160 | and you're implementing things.
00:45:17.800 | It's fun when code is somewhat bug resistant.
00:45:19.960 | But if you're actually doing science
00:45:21.600 | and trying to understand what's going on,
00:45:23.960 | that can be a complication.
00:45:25.160 | So do both, gradient checking and a small experiment
00:45:29.160 | like that.
00:45:31.120 | All right, and so for the last few minutes,
00:45:33.000 | I'll actually try to motivate what
00:45:35.040 | you'll be learning quite a bit about in the next two days.
00:45:40.280 | That is, the specific case for deep learning.
00:45:43.880 | So I've already told you that if I have a neural net with enough
00:45:47.760 | hidden units, theoretically, I can potentially
00:45:49.880 | represent pretty much any function, any classification
00:45:53.000 | function.
00:45:53.920 | So why would I want multiple layers?
00:45:56.360 | So there are a few motivations behind this.
00:45:59.000 | The first one is taken directly from our own brains.
00:46:02.160 | So we know in the visual cortex that the light that
00:46:05.320 | hits our retina eventually goes through several regions
00:46:08.520 | in the visual cortex.
00:46:09.880 | Eventually reaching an area known as V1,
00:46:12.520 | where you have units that are-- or neurons that are essentially
00:46:16.040 | tuned to small forms like edges.
00:46:18.840 | And then it goes on to V4, where it's
00:46:20.480 | slightly more complex patterns that the units are tuned for.
00:46:23.880 | And then you reach AIT, where you actually
00:46:25.680 | have neurons that are specific to certain objects
00:46:28.840 | or certain object categories.
00:46:28.840 | And so the idea here is that perhaps that's
00:46:30.960 | also what we want in an artificial, say, vision system.
00:46:35.800 | We'd like it, if it's detecting faces,
00:46:37.760 | to have a first layer that detects simple edges,
00:46:41.080 | and then another layer that perhaps puts these edges
00:46:43.520 | together, detecting slightly more complex things,
00:46:45.840 | like a nose or a mouth or eyes.
00:46:47.880 | And then eventually have a layer that combines
00:46:49.880 | these more abstract units
00:46:54.520 | to get something even more abstract,
00:46:56.200 | like a complete face.
00:46:58.540 | There's also some theoretical justification
00:47:00.680 | for using multiple layers.
00:47:04.160 | So the early results were mostly based
00:47:06.360 | on studying Boolean functions, or a function that
00:47:08.920 | takes as input--
00:47:10.160 | can think of it as a vector of just zeros and ones.
00:47:12.640 | And you could show that,
00:47:17.080 | if you had essentially a Boolean neural network
00:47:19.960 | or essentially a Boolean circuit,
00:47:22.840 | and you restricted the number of layers of that circuit,
00:47:26.280 | then, to represent certain
00:47:28.480 | Boolean functions exactly,
00:47:31.160 | you would need an exponential number of units
00:47:33.600 | in each of these layers.
00:47:35.160 | Whereas if you allowed yourself to have multiple layers,
00:47:37.160 | then you could represent these functions more compactly.
00:47:39.840 | And so that's another motivation,
00:47:41.560 | that perhaps with more layers, we
00:47:42.960 | can represent fairly complex functions in a more compact way.
00:47:48.520 | And then there's the reason that they just work.
00:47:51.160 | So we've seen in the past few years
00:47:53.920 | great success in speech recognition, where it's
00:47:56.440 | essentially revolutionized the field, where everyone's using
00:47:59.020 | deep learning for speech recognition,
00:48:00.940 | and same thing for visual object recognition,
00:48:03.840 | where, again, deep learning is sort
00:48:05.400 | of the method of choice for identifying objects in images.
00:48:10.520 | So then why are we doing this only recently?
00:48:13.760 | Why didn't we do deep learning way back
00:48:16.640 | when backprop was invented, which was essentially in the 1980s
00:48:20.760 | and even before that?
00:48:22.840 | So it turns out training deep neural networks
00:48:24.760 | is actually not that easy.
00:48:26.040 | There are a few hurdles that one can be confronted with.
00:48:30.000 | I've already mentioned one of the issues, which
00:48:32.120 | is that some of the gradients might be fading as you go
00:48:35.640 | from the top layer to the bottom layer,
00:48:37.280 | because we keep multiplying by the derivative of the activation
00:48:39.940 | function.
00:48:40.520 | So that makes training hard.
00:48:41.960 | It could be that the lower layers have very small gradients
00:48:44.560 | and are barely moving, barely exploring the space of correct features
00:48:49.040 | to learn for a given problem.
00:48:51.320 | Sometimes that's the problem you find.
00:48:52.960 | You have a hard time just fitting your data,
00:48:54.880 | and you're essentially underfitting.
00:48:57.080 | Or it could be that with deeper neural nets or bigger
00:49:00.360 | neural nets, we have more parameters.
00:49:02.200 | So perhaps sometimes we're actually overfitting.
00:49:04.240 | We're in a situation where all the functions that we
00:49:07.680 | can represent with the same neural net represented
00:49:10.840 | by this gray area function actually includes, yes,
00:49:14.040 | the right function, but it's so large
00:49:15.660 | that for a finite training set, the odds
00:49:18.280 | that I'm going to find the one that's
00:49:19.780 | close to the true classifying function, the real system
00:49:22.840 | that I'd like to have, are going to be very low.
00:49:25.720 | So in this case, I'm essentially overfitting,
00:49:28.080 | and that might also be a situation we're in.
00:49:31.200 | And unfortunately, there are many situations
00:49:35.880 | where either problem, overfitting or underfitting, is observed.
00:49:40.920 | And so we essentially have, in the field,
00:49:43.480 | developed tools for fighting both situations.
00:49:46.140 | And I'm going to rapidly touch a few of those, which you will
00:49:49.600 | see will come up later on in multiple talks.
00:49:53.560 | So one of the first hypotheses, which
00:49:56.320 | might be that you're underfitting,
00:49:57.760 | well, you can essentially just fight this
00:49:59.960 | by waiting longer, so training longer.
00:50:01.880 | If you have your gradients are too small,
00:50:03.600 | and this is essentially why you're progressing very slowly
00:50:05.800 | when you're training, well, if you're using GPUs
00:50:08.220 | and are able to do more iterations over the same
00:50:11.480 | training set in less time, that might just
00:50:15.120 | solve your problem of underfitting.
00:50:16.800 | And I think we've seen some of that,
00:50:18.760 | and this is partly why GPUs have been so game-changing
00:50:21.340 | for deep learning.
00:50:22.520 | Or you can use just better optimization methods also.
00:50:26.120 | And if you're overfitting, well, we just
00:50:28.200 | need better regularization.
00:50:31.560 | I've been involved early on in my PhD
00:50:33.680 | on using unsupervised learning as a way
00:50:36.040 | to regularize neural nets.
00:50:38.820 | If I have time, I'll talk a little bit about that.
00:50:40.940 | And there's another method you might have heard
00:50:43.200 | about known as dropout.
00:50:44.840 | So I'll try to touch at least two methods that are essentially
00:50:49.560 | trying to address some of these issues.
00:50:51.360 | So the first one that I'll talk about is dropout.
00:50:54.720 | It's actually very easy, very simple.
00:50:57.920 | So the idea of if our neural net is essentially overfitting,
00:51:01.120 | so it's too good at training on the training set,
00:51:04.520 | well, we're essentially going to cripple training.
00:51:06.880 | We're going to make it harder to fit the training set.
00:51:09.240 | And the way we're going to do that in dropout
00:51:11.240 | is that we will stochastically remove
00:51:13.880 | hidden units independently.
00:51:16.120 | So for each hidden unit, before we do a forward pass,
00:51:18.880 | we'll flip a coin.
00:51:20.240 | And with probability half, we will
00:51:22.880 | multiply the activation by 0.
00:51:24.800 | And with probability half, we'll multiply it by 1.
00:51:27.500 | So what this means is that if a unit is multiplied by 0,
00:51:30.840 | it's effectively not in the neural net anymore.
00:51:34.120 | And we're doing this independently
00:51:36.360 | for each hidden unit.
00:51:37.880 | So that means that in a layer, a unit cannot rely anymore
00:51:41.840 | on the presence of any other unit
00:51:44.760 | to try to sort of synchronize and adapt
00:51:47.880 | to perform a complex classification
00:51:50.600 | or learn a complex feature.
00:51:52.360 | And that was partly the motivation behind dropout
00:51:54.680 | is that this procedure might encourage
00:51:57.480 | types of features that are not co-adapted
00:51:59.760 | and are less likely to overfit.
00:52:02.880 | So we often use 0.5 as the probability
00:52:05.960 | of dropping out a unit.
00:52:08.040 | It turns out it often, surprisingly,
00:52:10.040 | is the best value.
00:52:11.160 | But that's another hyperparameter
00:52:12.920 | you might want to tune.
00:52:15.060 | And in terms of how it impacts an implementation of backprop,
00:52:18.560 | it's very simple.
00:52:20.000 | So the forward pass, before I do it,
00:52:21.600 | I just sample my binary masks for all my layers.
00:52:24.760 | And then when I'm performing backprop, well,
00:52:28.240 | my gradient on the--
00:52:30.000 | oh, sorry.
00:52:30.480 | So that's the forward pass.
00:52:31.580 | I'm just multiplying by this binary mask here.
00:52:34.760 | So super simple change.
00:52:36.640 | And then in terms of backprop, well, I'm
00:52:39.000 | also going to multiply by the mask
00:52:41.160 | when I get my gradient on the pre-activation.
00:52:43.800 | And also, don't forget that the activations are now different.
00:52:47.080 | They actually include the mask in my notation.
00:52:49.920 | So it's a very simple change in the forward and backward pass
00:52:52.860 | when you're training.
00:52:54.280 | And also, another thing that I should emphasize
00:52:56.280 | is that the mask is being resampled for every example.
00:52:59.600 | So before you do a forward pass, you resample the mask.
00:53:02.160 | You don't sample it once
00:53:03.280 | and then use it the whole time.
00:53:07.560 | And then at test time, because we don't really
00:53:09.560 | like a model that sort of randomly changes its output,
00:53:13.600 | because it will if we stochastically change the masks,
00:53:16.840 | what we do is we replace the mask
00:53:18.920 | by the probability of keeping a unit.
00:53:23.600 | So if we're dropping units with probability 0.5,
00:53:25.480 | the keep probability is just 0.5.
00:53:29.200 | We can actually show that if you have a neural net
00:53:31.360 | with a single hidden layer, doing this transformation
00:53:34.720 | at test time, multiplying by 0.5 is
00:53:36.760 | equivalent to doing a geometric average of all
00:53:39.640 | the possible neural networks with all the different binary
00:53:42.120 | mask patterns.
00:53:43.360 | So it's essentially one way of thinking about dropout
00:53:46.600 | in the single layer case is that it's kind of an ensembling
00:53:49.020 | method where you have a lot of models,
00:53:51.040 | an exponential number of models, which
00:53:52.760 | are all sharing the same weights but have different masks.
00:53:56.040 | That intuition, though, doesn't transfer for deep neural nets
00:53:59.520 | in the sense that you cannot show this result.
00:54:01.480 | It really only applies to a single hidden layer.
00:54:05.920 | So in practice, it's very effective,
00:54:08.480 | but do expect some slowdown in training.
00:54:10.640 | So often, we tend to see that training a network
00:54:13.640 | to completion will take twice as many epochs
00:54:16.160 | if you're using dropout with 0.5.
00:54:18.360 | And here, you have the reference if you
00:54:19.980 | want to learn more about different variations
00:54:21.920 | of dropouts and so on.
00:54:23.000 | And I probably won't talk about unsupervised pre-training
00:54:28.920 | for lack of time, but I'll talk about another thing
00:54:31.040 | that you'll definitely probably hear about,
00:54:33.440 | and that's implemented in these different packages, which
00:54:35.740 | is batch normalization.
00:54:37.360 | Batch normalization is kind of interesting in the sense
00:54:39.660 | that it's been shown to better optimize.
00:54:42.840 | That is, certain networks that would otherwise underfit
00:54:46.000 | would not underfit as much anymore
00:54:48.440 | if you use batch normalization.
00:54:49.840 | But also, it's been shown that when
00:54:51.440 | you use batch normalization, dropout is not as useful.
00:54:54.720 | And dropout being a regularization method,
00:54:56.720 | that suggests that perhaps batch normalization is also
00:54:59.760 | regularizing in some way.
00:55:01.280 | So these things are not one or the other.
00:55:03.800 | They're not mutually exclusive.
00:55:05.160 | You can have a regularizer that also, it turns out,
00:55:07.440 | helps you better optimize.
00:55:10.240 | So the intuition behind batch normalization
00:55:14.760 | is much like I've suggested that normalizing your inputs actually
00:55:19.560 | can help speed up training.
00:55:21.120 | Well, how about we also normalize all the hidden layers
00:55:24.380 | when I'm doing my forward pass?
00:55:27.600 | Now, the problem in doing this is
00:55:28.960 | that I can compute the mean and the standard deviations
00:55:31.880 | of my inputs once and for all because they're constant.
00:55:35.000 | But my hidden layers are constantly
00:55:36.640 | changing because I'm training these parameters.
00:55:39.080 | So the mean and the standard deviation of my units
00:55:41.640 | will change.
00:55:42.800 | And so it would be very expensive
00:55:45.920 | if every time I did an update on my parameters,
00:55:48.120 | I recomputed the means and the standard deviations
00:55:50.540 | of all of my units.
00:55:52.400 | So batch normalization addresses some of these issues
00:55:54.980 | as follows.
00:55:56.360 | So the way it works is first, the normalization
00:55:59.520 | is going to be applied on actually the pre-activation.
00:56:02.240 | So not the activation of the unit,
00:56:03.660 | but before the non-linearity.
00:56:06.480 | During training, to address the issue
00:56:08.600 | that we don't want to compute means over the full training
00:56:11.120 | set because that would be too slow,
00:56:12.720 | I'm actually going to compute it on each mini-batch.
00:56:15.960 | So I have to do mini-batch training here.
00:56:18.120 | I'm going to take my small mini-batch of 64, 128 examples.
00:56:21.560 | And that's the set of examples on which
00:56:23.440 | I'm going to compute my means and standard deviations.
00:56:26.800 | And then when I do backprop, I'm actually
00:56:28.920 | going to take into account the normalization.
00:56:31.160 | So now there's going to be a gradient going
00:56:33.280 | through the computation of the mean and the standard deviation
00:56:36.360 | because they depend on the parameters
00:56:38.320 | of the neural network.
00:56:40.160 | And then at test time, we'll just
00:56:41.540 | use the global mean and global standard deviation.
00:56:44.200 | Once I finish training, I can actually
00:56:45.880 | do a full pass over the whole training set and get
00:56:48.120 | all of my means and standard deviations.
00:56:52.000 | So that's essentially the pseudocode for that,
00:56:54.680 | taken out of the paper directly.
00:56:57.360 | So if x is a pre-activation for a unit
00:57:00.520 | and I have multiple pre-activations
00:57:02.640 | for a single unit across my mini-batch,
00:57:05.240 | I would compute what is the average for that unit
00:57:08.280 | pre-activation across my examples in my mini-batch,
00:57:11.400 | compute my variance, and then subtract the mean
00:57:15.000 | and divide by the square root of the variance,
00:57:17.760 | plus some epsilon for numerical stability
00:57:19.680 | in case the variance is too close to zero.
00:57:22.160 | And then another thing is that actually batch normalization
00:57:25.200 | doesn't just perform this normalization
00:57:28.080 | and outputs the normalized pre-activation.
00:57:30.400 | It then actually performs a linear transformation on it.
00:57:34.640 | So it multiplies it by this parameter gamma,
00:57:37.160 | which is going to be trained by gradient descent.
00:57:40.160 | And it's often called the gain parameter
00:57:43.240 | of batch normalization.
00:57:45.760 | And it adds a bias beta.
00:57:47.960 | And the reason is that each of these units has a bias parameter,
00:57:51.360 | and if I'm subtracting the mean,
00:57:54.620 | then that bias effectively gets subtracted away.
00:57:58.280 | There's no bias anymore.
00:57:59.680 | It was present in the pre-activation,
00:58:01.360 | and now it's been removed by the normalization.
00:58:02.720 | So I have to add a bias back, but after the batch normalization,
00:58:05.580 | essentially.
00:58:06.360 | So these betas here are essentially
00:58:08.240 | the new bias parameters.
00:58:10.380 | And those will actually be trained.
00:58:11.800 | So we do gradient descent also on those.
00:58:13.760 | So batch normalization adds a few parameters.
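A minimal sketch of that forward computation for one layer (NumPy; names are illustrative, and at test time the mini-batch mean and variance would be replaced by statistics over the full training set, as described above):

```python
import numpy as np

def batch_norm_forward(a, gamma, beta, eps=1e-5):
    # a: pre-activations over a mini-batch, shape (batch_size, n_units)
    mu = a.mean(axis=0)                        # per-unit mean over the mini-batch
    var = a.var(axis=0)                        # per-unit variance over the mini-batch
    a_hat = (a - mu) / np.sqrt(var + eps)      # normalize; eps guards against near-zero variance
    return gamma * a_hat + beta                # learned gain and new bias, trained by gradient descent

rng = np.random.default_rng(0)
a = rng.standard_normal((128, 16)) * 3 + 2     # a mini-batch of pre-activations
gamma = np.ones(16)                            # gain parameters
beta = np.zeros(16)                            # the new bias parameters
out = batch_norm_forward(a, gamma, beta)       # the non-linearity would then be applied to `out`
```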
00:58:18.120 | All right, and as I said, I'm just going to skip over this.
00:58:20.640 | And I'm not showing what the gradients are when you back
00:58:23.200 | prop through the mean and so on.
00:58:24.760 | It's described in the paper if you want to see the gradients.
00:58:27.300 | But otherwise, in the different packages,
00:58:31.200 | you'll get the gradients automatically.
00:58:33.160 | It's usually been implemented.
00:58:35.580 | Skipping over that, I'll just finish.
00:58:38.080 | If you actually want to learn about unsupervised
00:58:40.160 | pre-training and why it works, I have videos on that.
00:58:42.480 | So you can check that out.
00:58:44.200 | And I guess that's it.
00:58:46.160 | Thank you.
00:58:46.960 | [APPLAUSE]
00:58:49.880 | Thanks, Hugo.
00:58:56.040 | So we have a few minutes for questions which
00:58:58.400 | are intermingled with a break.
00:59:00.520 | So feel free to either go for a break or ask questions to Hugo.
00:59:04.800 | I believe there are microphones.
00:59:06.220 | And I'll also stick around.
00:59:07.760 | So if you want to ask me questions offline,
00:59:09.880 | that's also fine.
00:59:10.960 | If anyone has questions, you can go to the mic.
00:59:13.040 | Go to the microphone.
00:59:24.080 | You mentioned the ReLU adds sparsity.
00:59:26.340 | Can you explain why?
00:59:28.440 | Yeah, so the first thing is that it's observed in practice.
00:59:34.400 | And it adds some sparsity in part
00:59:37.200 | because the non-linearity is exactly 0 below its threshold.
00:59:40.520 | So it means that units are going to be potentially
00:59:43.560 | exactly zero, essentially absent from the hidden layer.
00:59:49.480 | There are a few reasons to explain why you get sparsity.
00:59:54.400 | It turns out that this process of doing a linear
00:59:56.900 | transformation followed by the ReLU activation function
01:00:00.460 | is very close to some of the steps
01:00:02.160 | you would do when you're optimizing for sparse codes
01:00:04.840 | in a sparse coding model, if you know about sparse coding.
01:00:07.700 | So they're essentially an optimization method
01:00:10.280 | that, given some sparse coding model,
01:00:12.720 | will find what is the sparse representation,
01:00:15.640 | hidden representation for some input.
01:00:17.580 | And it's mostly a sequence of linear transformations
01:00:20.920 | followed by this ReLU-like activation function.
01:00:25.120 | And I think this is partly the explanation.
01:00:27.600 | Otherwise, I don't know of a solid explanation
01:00:31.280 | for why that is beyond what's observed in practice.
01:00:34.280 | Any more questions?
01:00:41.240 | If not, let's thank Hugo again.
01:00:43.120 | [APPLAUSE]
01:00:47.600 | And we are reconvening in 10 minutes.