Foundations of Deep Learning (Hugo Larochelle, Twitter)

00:00:00.000 | That's good.

00:00:00.840 | All right, cool.

00:00:01.440 | So yeah, so I was asked to give this presentation

00:00:04.880 | on the foundations of deep learning, which

00:00:07.080 | is mostly going over basic feedforward neural networks

00:00:11.120 | and motivating a little bit deep learning

00:00:13.640 | and some of the more recent developments

00:00:15.960 | and some of the topics that you'll

00:00:17.480 | see across the next two days.

00:00:20.120 | So as Andrew mentioned, I have just an hour.

00:00:26.080 | So I'm going to go fairly quickly

00:00:27.520 | on a lot of these things, which I think

00:00:29.160 | would mostly be fine if you're familiar enough

00:00:31.600 | with some machine learning and a little bit about neural nets.

00:00:35.320 | But if you'd like to go into some

00:00:36.800 | of the more specific details, you

00:00:38.280 | can go check out my online lectures on YouTube.

00:00:41.320 | It's now taught by a much younger version of myself.

00:00:44.400 | And so just search for Hugo Larochelle.

00:00:47.800 | And I am not the guy doing a bunch of skateboarding.

00:00:50.640 | I'm the geek teaching about neural nets.

00:00:52.920 | So go check those out if you want more details.

00:00:56.560 | But so what I'll cover today is--

00:00:58.880 | I'll start with just describing and laying out

00:01:02.280 | the notation on feedforward neural networks, that

00:01:04.920 | is, models that take an input vector x--

00:01:07.280 | that might be an image or some text--

00:01:09.320 | and produces an output f of x.

00:01:11.360 | So I'll just describe forward propagation

00:01:13.120 | and the different types of units and the type of functions

00:01:16.040 | we can represent with those.

00:01:17.680 | And then I'll talk about how we actually train neural nets,

00:01:20.720 | describing things like loss functions,

00:01:22.600 | backpropagation that allows us to get a gradient for training

00:01:26.160 | with stochastic gradient descent,

00:01:27.700 | and mention a few tricks of the trade,

00:01:29.840 | so some of the things we do in practice to successfully train

00:01:32.720 | neural nets.

00:01:33.680 | And then I'll end by talking about some developments that

00:01:37.680 | are specifically useful in the context of deep learning, that

00:01:41.240 | is, neural networks with several hidden layers that came out

00:01:45.220 | at the very--

00:01:46.920 | after the beginning of deep learning, say, in 2006.

00:01:49.880 | That is, things like dropout, batch normalization,

00:01:52.200 | and if I have some time, unsupervised pre-training.

00:01:55.800 | So let's get started.

00:01:57.560 | And just talk about, assuming we have some neural network,

00:02:00.400 | how do they actually function?

00:02:01.680 | How do they make predictions?

00:02:04.280 | So let me lay down the notation.

00:02:06.800 | So a multilayer feedforward neural network

00:02:10.240 | is a model that takes as input some vector x, which

00:02:14.200 | I'm representing here with a different node

00:02:16.400 | for each of the dimensions in my input vector.

00:02:19.480 | So each dimension is essentially a unit in that neural network.

00:02:23.640 | And then it eventually produces, at its output layer,

00:02:27.080 | an output.

00:02:28.680 | And we'll focus on classification mostly.

00:02:31.160 | So you'd have multiple units here.

00:02:33.280 | And each unit would correspond to one

00:02:35.160 | of the potential classes in which we would

00:02:37.280 | want to classify our input.

00:02:38.680 | So if we're identifying digits in handwritten character

00:02:42.960 | images, and say we're focusing on digits,

00:02:46.120 | you'd have 10 digits.

00:02:47.120 | So you would have a sort of 0 from 0 to 9.

00:02:50.120 | So you'd have 10 output units.

00:02:52.420 | And to produce an output, the neural net

00:02:54.820 | will go through a series of hidden layers.

00:02:58.360 | And those will be essentially the components

00:03:01.000 | that introduce non-linearity that

00:03:02.380 | allows us to capture and perform very sophisticated types

00:03:06.340 | of classification functions.

00:03:08.600 | So if we have L hidden layers, the way

00:03:11.500 | we compute all the layers in our neural net is as follows.

00:03:16.040 | We first start by computing what I'm

00:03:17.940 | going to call a pre-activation.

00:03:20.040 | I'm going to know that A. And I'm

00:03:22.200 | going to index the layers by k.

00:03:23.920 | So A k is just the pre-activation at layer k.

00:03:28.080 | And that is only simply going to be a linear transformation

00:03:32.320 | of the previous layer.

00:03:34.080 | So I'm going to note h k as the activation on the layer.

00:03:38.320 | And by default, I'll assume that layer 0

00:03:41.200 | is going to be the input.

00:03:43.080 | And so using that notation, the pre-activation at layer k

00:03:46.960 | is going to correspond to taking the activation

00:03:49.720 | at the previous layer, k minus 1,

00:03:52.200 | multiplying it by a matrix, Wk.

00:03:54.680 | Those are the parameters of the layer.

00:03:57.640 | Those essentially corresponds to the connections

00:04:00.520 | between the units between adjacent layers.

00:04:03.080 | And I'm going to add a bias vector.

00:04:05.120 | That's another parameter in my layer.

00:04:07.360 | So that gives me the pre-activation.

00:04:09.640 | And then next, I'm going to get a hidden layer activation

00:04:12.280 | by applying an activation function.

00:04:14.680 | This will introduce some non-linearity in the model.

00:04:17.760 | So I'm going to call that function g.

00:04:19.320 | And we'll go over a few choices.

00:04:22.000 | So we have four common choices for the activation function.

00:04:26.160 | And so I do this from layer 1 to layer L.

00:04:29.120 | And when it comes to the output layer,

00:04:31.320 | I'll also compute a pre-activation

00:04:33.480 | by performing a linear transformation.

00:04:36.000 | But then I'll usually apply a different activation function

00:04:38.600 | depending on the problem I'm trying to solve.

00:04:41.960 | So having said that, let's go to some of the choices

00:04:47.000 | for the activation function.

00:04:48.260 | So some of the activation functions you'll see.

00:04:50.800 | One common one is this sigmoid activation function.

00:04:53.680 | It's this function here.

00:04:54.880 | It's just 1 divided by 1 plus the exponential

00:04:58.680 | of minus the pre-activation.

00:05:01.360 | The shape of this function, you can focus on that, is this here.

00:05:04.720 | It takes the pre-activation, which

00:05:06.160 | can vary from minus infinite to plus infinite.

00:05:08.520 | And it squashes this between 0 and 1.

00:05:11.680 | So it's bounded by below and above, below by 0,

00:05:15.720 | and above by 1.

00:05:17.480 | So it's a function that saturates

00:05:19.600 | if you have very large magnitude positive or negative

00:05:24.440 | pre-activations.

00:05:26.760 | Another common choice is the hyperbolic tangent or tanh

00:05:29.880 | activation function.

00:05:31.680 | This picture here.

00:05:32.600 | So it squashes everything.

00:05:33.720 | But instead of being between 0 and 1,

00:05:35.840 | it's between minus 1 and 1.

00:05:38.720 | And one that's become quite popular in neural nets

00:05:42.600 | is what's known as the rectified linear activation function.

00:05:46.120 | Or in papers, you will see the ReLU unit

00:05:50.400 | that refers to the use of this activation function.

00:05:54.880 | So this one is different from the others

00:05:56.560 | in that it's not bounded above, but it is bounded below.

00:06:00.340 | And it will output 0's exactly if the pre-activation

00:06:05.640 | is negative.

00:06:08.040 | So those are the choices of activation functions

00:06:10.120 | for the hidden layers.

00:06:12.060 | And for the output layer, if we're performing

00:06:13.960 | classification, as I said, in our output layer,

00:06:16.760 | we will have as many units as there

00:06:18.640 | are classes in which an input could belong.

00:06:21.440 | And what we'd like is potentially--

00:06:24.720 | and what we often do is interpret each unit's

00:06:27.600 | activation as the probability, according

00:06:30.320 | to the neural network, that the input belongs

00:06:33.600 | to the corresponding class, that its label y

00:06:36.480 | is the corresponding class C. So C

00:06:39.600 | would be like the index of that unit in the output layer.

00:06:43.120 | So we need an activation function

00:06:44.520 | that produces probabilities, produces

00:06:46.840 | a multinomial distribution over all the different classes.

00:06:50.080 | And the activation function we use for that

00:06:52.120 | is known as the softmax activation function.

00:06:55.280 | It is simply as follows.

00:06:57.400 | You take your pre-activations, and you exponentiate them.

00:07:00.440 | So that's going to give us positive numbers.

00:07:02.800 | And then we divide each of the exponentiated pre-activations

00:07:06.440 | by the sum of all the exponentiated pre-activations.

00:07:11.200 | So because I'm normalizing this way,

00:07:13.040 | it means that all my values in my output layer

00:07:16.640 | are going to sum to 1.

00:07:17.920 | And they're positive because I took the exponential.

00:07:20.080 | So I can interpret that as a multinomial distribution

00:07:23.080 | over the choice of all the C different classes.

00:07:26.760 | So that's what I'll use as the activation function

00:07:29.040 | at the output layer.

00:07:32.160 | And now, beyond the math in terms of conceptually

00:07:35.100 | and also in the way we're going to program neural networks,

00:07:38.480 | often what we'll do is that all these different operations,

00:07:41.020 | the linear transformations, the different types of activation

00:07:43.520 | functions, we'll essentially implement all of them

00:07:47.440 | as an object, an object that take arguments.

00:07:52.200 | And the arguments would essentially

00:07:53.660 | be what other things are being combined

00:07:55.400 | to produce the next value.

00:07:57.600 | So for instance, we would have an object

00:07:59.520 | that might correspond to the computation of pre-activation,

00:08:02.760 | which would take as argument what

00:08:04.920 | is the weight matrix and the bias vector for that layer

00:08:08.280 | and take some layer to transform.

00:08:10.920 | And this object would compute its value

00:08:13.560 | by applying the linear activation,

00:08:15.840 | the linear transformation.

00:08:17.120 | And then we might have objects that

00:08:18.620 | correspond to specific activation functions,

00:08:21.880 | so like a sigmoid object or a tanh object or a ReLU object.

00:08:25.360 | And we just combine these objects together,

00:08:27.200 | chain them into what ends up being a graph, which I refer

00:08:30.960 | to as a flow graph, that represents the computation done

00:08:34.680 | when you do a forward pass in your neural network

00:08:37.400 | up until you reach the output layer.

00:08:39.520 | So I mention it now because you'll

00:08:41.520 | see the different softwares that we presented over the weekend

00:08:45.840 | will essentially exploit some of that representation

00:08:50.000 | of the computation in neural nets.

00:08:51.600 | It will also be handy for computing gradients, which

00:08:53.880 | I'll talk about in a few minutes.

00:08:57.720 | And so that's how we perform predictions in neural networks.

00:09:02.120 | So we get an input.

00:09:03.560 | We eventually reach an output layer

00:09:05.120 | that gives us a distribution over classes

00:09:06.960 | if we're performing classification.

00:09:08.680 | If I want to actually classify, I

00:09:10.360 | would just assign the class corresponding

00:09:13.400 | to the unit that has the highest activation, that

00:09:16.440 | would correspond to classifying to the class that

00:09:19.160 | has the highest probability according to the neural net.

00:09:21.720 | But then you might ask the question, OK,

00:09:26.200 | what kind of problems can we solve with neural networks?

00:09:29.240 | Or more technically, what kind of functions

00:09:31.440 | can we represent mapping from some input x

00:09:34.240 | into some arbitrary output?

00:09:36.520 | And so if you go look at my videos,

00:09:39.480 | I try to give more intuition as to why

00:09:41.920 | we have this result here.

00:09:43.160 | But essentially, if we have a single hidden layer

00:09:45.960 | neural network, it's been shown that with a linear output,

00:09:48.680 | we can approximate any continuous function

00:09:51.040 | arbitrarily well as long as we have enough hidden units.

00:09:54.560 | So that is, there's a value for these biases and these weights

00:09:57.360 | such that any continuous function,

00:09:59.240 | I can actually represent it as well as I want.

00:10:01.720 | I just need to add enough hidden units.

00:10:04.760 | So this result applies if you use activation functions,

00:10:07.560 | non-linear activation functions like sigmoid and tanh.

00:10:11.360 | So as I said in my video, if you want a bit more intuition

00:10:14.000 | as to why that would be, you can go check that out.

00:10:18.040 | But that's a really nice result.

00:10:19.920 | It means that by focusing on this family of machine learning

00:10:23.880 | models that are neural networks, I can pretty much potentially

00:10:27.720 | represent any kind of classification function.

00:10:30.640 | However, this result does not tell us

00:10:32.560 | how do we actually find the weights and the bias values

00:10:35.680 | such that I can represent a given function.

00:10:38.040 | It doesn't essentially tell us how

00:10:39.560 | do we train a neural network.

00:10:41.640 | And so that's what we'll discuss next.

00:10:44.800 | So let's talk about that.

00:10:45.960 | How do we actually, from a data set,

00:10:48.280 | train a neural network to perform good classification

00:10:51.560 | for that problem?

00:10:54.280 | So what we'll typically do is use a framework that's

00:10:58.100 | very generic in machine learning,

00:10:59.900 | known as empirical risk minimization or structural risk

00:11:03.060 | minimization if you're using regularization.

00:11:05.660 | So this framework essentially transforms

00:11:08.940 | a problem of learning as a problem of optimizing.

00:11:12.860 | So what we'll do is that we'll first choose a loss function

00:11:16.380 | that I'm noting as L. And the loss function,

00:11:19.540 | it compares the output of my model,

00:11:22.100 | so the output layer of my neural network,

00:11:23.980 | with the actual target.

00:11:25.660 | So I'm indexing with an exponent here with t

00:11:28.540 | to essentially as the index over all my different examples

00:11:32.660 | in my training set.

00:11:34.860 | And so my loss function will tell me,

00:11:36.940 | is this output good or bad given that the label is actually y?

00:11:42.660 | And what I'll do, I'll also define a regularizer.

00:11:47.760 | So theta here is--

00:11:49.620 | you can think of it as just a concatenation of all my biases

00:11:52.860 | and all of my weights in my neural net.

00:11:54.460 | So those are all the parameters of my neural network.

00:11:58.200 | And the regularizer will essentially

00:12:00.020 | penalize certain values of these weights.

00:12:03.220 | So as I'll talk more specifically later on,

00:12:05.980 | for instance, you might want to have your weights not

00:12:08.500 | be too far from 0.

00:12:09.940 | That's a frequent intuition that we implement with regularizer.

00:12:14.200 | And so the optimization problem that we'll

00:12:16.660 | try to solve when learning is to minimize

00:12:19.740 | the average loss of my neural network over my training

00:12:23.900 | examples, so summing over all training examples.

00:12:26.100 | I have capital T examples.

00:12:28.860 | Plus some weight here that's known as the weight decay,

00:12:33.340 | some hyperparameter lambda, times my regularizer.

00:12:37.100 | So in other words, I'm going to try to have

00:12:39.280 | my loss on my training set the smallest possible over all

00:12:43.140 | the training example and also try

00:12:45.260 | to satisfy my regularizer as much as possible.

00:12:48.660 | And so now we have this optimization problem.

00:12:51.420 | And learning will just correspond

00:12:53.140 | to trying to solve this problem.

00:12:55.380 | So finding this arg min here for over my weights and my biases.

00:13:00.940 | And if I want to do this, I can just

00:13:02.460 | invoke some optimization procedure

00:13:05.100 | from the optimization community.

00:13:08.860 | And the one algorithm that you'll

00:13:10.340 | see constantly in deep learning is stochastic gradient descent.

00:13:14.420 | This is the optimization algorithm

00:13:16.220 | that we'll often use for training neural networks.

00:13:19.780 | So SGD, stochastic gradient descent, functions as follows.

00:13:23.620 | You first initialize all of your parameters.

00:13:26.580 | That is finding initial values for all my weight matrices

00:13:29.740 | and all of my biases.

00:13:32.060 | And then for a certain number of epochs--

00:13:34.140 | so an epoch will be a full pass over all my examples.

00:13:37.740 | That's what I'll call an epoch.

00:13:40.100 | So for a certain number of full iterations over my training

00:13:44.620 | set, I'll draw each training example.

00:13:47.900 | So I pair x, input x, target y.

00:13:51.620 | And then I'll compute what is the gradient of my loss

00:13:55.900 | with respect to my parameters.

00:13:58.580 | All of my parameters, all my weights, and all my biases.

00:14:01.040 | This is what this notation here--

00:14:03.180 | so nabla for the gradient of the loss function.

00:14:06.740 | And here I'm indexing with respect to which parameter.

00:14:10.300 | I want the gradient.

00:14:12.060 | So I'm going to compute what is the gradient of my loss

00:14:14.700 | function with respect to my parameters.

00:14:17.240 | And plus lambda times the gradient of my regularizer

00:14:20.220 | as well.

00:14:21.160 | And then I'm going to get a direction in which I should

00:14:23.500 | move my parameters.

00:14:25.220 | Since the gradient tells me how to increase the loss,

00:14:28.620 | I want to go in the opposite direction and decrease it.

00:14:31.060 | So my direction will be the opposite.

00:14:32.820 | So that's why I have a minus here.

00:14:35.540 | And so this delta is going to be the direction in which I'll

00:14:38.140 | move my parameters by taking a step.

00:14:40.780 | And the step is just a step size alpha,

00:14:43.940 | which is often referred to as a learning rate,

00:14:46.500 | times my direction, which I just add

00:14:49.300 | to my current values of my parameters, my biases

00:14:52.300 | and my weights.

00:14:53.360 | And that's going to give me my new value for all

00:14:56.120 | of my parameters.

00:14:57.180 | And I iterate like that, going over all pairs x, y's,

00:15:01.260 | computing my gradient, taking a step

00:15:03.420 | side in the opposite direction, and then

00:15:05.280 | doing that several times.

00:15:07.620 | So that's how stochastic gradient descent works.

00:15:10.620 | And that's essentially the learning procedure.

00:15:12.540 | It's represented by this procedure.

00:15:16.100 | So in this algorithm, there are a few things

00:15:17.900 | we need to specify to be able to implement it and execute it.

00:15:20.860 | We need a loss function, a choice for the loss function.

00:15:23.620 | We need a procedure that's efficient for computing

00:15:26.860 | the gradient of the loss with respect to my parameters.

00:15:30.620 | We need to choose a regularizer if we want one.

00:15:33.300 | And we need a way of initializing my parameters.

00:15:35.940 | So next, what I'll do is I'll go through each

00:15:37.960 | of these four different things we

00:15:39.780 | need to choose before actually being

00:15:41.700 | able to execute stochastic gradient descent.

00:15:45.660 | So first, the loss function.

00:15:48.060 | So as I said, we will interpret the output layer

00:15:50.980 | as assigning probabilities to each potential class in which

00:15:53.900 | I can classify my input x.

00:15:57.460 | Well, in this case, something that would be natural

00:15:59.780 | is to try to maximize the probability

00:16:02.120 | of the correct class, the actual class in which my example

00:16:05.500 | x t belongs to.

00:16:06.340 | I'd like to increase the value of the probability assigned

00:16:09.340 | by--

00:16:09.900 | computed by my neural network.

00:16:12.820 | And so because we set up the problem in which we

00:16:16.380 | have a loss that we minimize, instead

00:16:18.700 | of maximizing the probability, what we'll actually do

00:16:20.980 | is minimize the negative and the actual log probability,

00:16:25.380 | so the log likelihood of assigning x

00:16:28.300 | to the correct class y.

00:16:30.340 | So this is represented here.

00:16:32.100 | So given my output layer and the true label y,

00:16:35.420 | my loss will be minus the log of the probability of y

00:16:40.820 | according to my neural net.

00:16:42.140 | And that would be, well, take my output layer

00:16:44.980 | and look at the unit, so index the unit corresponding

00:16:48.500 | to the correct class.

00:16:50.020 | So that's why I'm indexing by y here.

00:16:53.500 | We take the log because numerically it

00:16:55.740 | turns out to be more stable.

00:16:56.940 | We get nicer-looking gradients.

00:16:59.180 | And sometimes in certain softwares,

00:17:01.340 | you'll see instead of talking about the negative log

00:17:03.460 | likelihood or log probability, you'll

00:17:05.020 | see it referred as the cross-entropy.

00:17:07.980 | And that's because you can think of this

00:17:11.820 | as performing a sum over all possible classes.

00:17:15.620 | And then for each class, checking, well,

00:17:17.540 | is this potential class the target class?

00:17:20.820 | So I have an indicator function that is 1 if y is equal to c,

00:17:24.660 | so if my iterator class c is actually

00:17:28.100 | equal to the real class.

00:17:29.700 | I'm going to multiply that by the log of the probability

00:17:33.140 | actually assigned to that class c.

00:17:35.660 | And this function here, so this expression here,

00:17:39.380 | is like a cross-entropy between the empirical distribution,

00:17:42.900 | which assigns zero probability to all the other classes,

00:17:46.100 | but a probability of 1 to the correct class,

00:17:48.500 | and the actual distribution over classes

00:17:50.900 | that my neural net is computing, which is f of x.

00:17:54.660 | That's just a technical detail.

00:17:56.060 | You can just think about this.

00:17:57.540 | Here, I only mention it because in certain libraries,

00:17:59.700 | it's actually mentioned as the cross-entropy loss.

00:18:03.740 | So that's for the loss.

00:18:06.060 | Then we need also a procedure for computing

00:18:08.260 | what is the gradient of my loss with respect

00:18:10.460 | to all of my parameters in my neural net,

00:18:12.620 | so the biases and the weights.

00:18:15.820 | You can go look at my videos if you

00:18:17.300 | want the actual derivation of all the details for all

00:18:20.220 | of these different expressions.

00:18:21.780 | I don't have time for that, so all I'll do--

00:18:23.740 | and presumably, a lot of you actually

00:18:26.300 | have seen these derivations.

00:18:28.620 | If you haven't, just go check out the videos.

00:18:30.700 | In any case, I'm going to go through what the algorithm is.

00:18:33.980 | I'm going to highlight some of the key points

00:18:35.900 | that will come up later in understanding how actually

00:18:39.220 | backpropagation functions.

00:18:41.380 | So the basic idea is that we'll compute gradients

00:18:44.820 | by exploiting the chain rule.

00:18:46.420 | And we'll go from the top layer all the way to the bottom,

00:18:50.220 | computing gradients for layers that are closer and closer

00:18:53.820 | to the input as we go, and exploiting the chain rule

00:18:56.540 | to exploit or reuse previous computations we've

00:18:59.820 | made at upper layers to compute the gradients at the layers

00:19:03.460 | below.

00:19:04.980 | So we usually start by computing what is the gradient

00:19:08.140 | at the output layer.

00:19:09.300 | So what's the gradient of my loss with respect

00:19:12.460 | to my output layer?

00:19:13.820 | And actually, it's more convenient

00:19:15.380 | to compute the loss with respect to the pre-activation.

00:19:18.340 | It's actually a very simple expression.

00:19:21.140 | So that's why I have the gradient of this vector,

00:19:24.180 | a l plus 1.

00:19:25.140 | That's the pre-activation at the very last layer of the loss

00:19:28.980 | function, which is minus the log f of x, y.

00:19:32.460 | And it turns out this gradient is super simple.

00:19:35.060 | It's minus E of y.

00:19:37.100 | So that's the one-hot vector for class y.

00:19:40.180 | So what this means is E of y is just a vector filled

00:19:43.820 | with a bunch of 0's and then the 1 at the correct class.

00:19:47.860 | So if y was the fourth class, then in this case,

00:19:51.100 | it would be this vector, where I have a 1 at the fourth

00:19:53.260 | dimension.

00:19:54.740 | So E of y is just a vector.

00:19:56.580 | We call it the one-hot vector full of 0's.

00:19:58.700 | And the single 1 at the position corresponding

00:20:01.620 | to the correct class.

00:20:03.540 | So what this part of the gradient

00:20:05.140 | is essentially saying is that I'm going to increase--

00:20:07.620 | I want to increase the probability

00:20:09.460 | of the correct class.

00:20:10.580 | I want to increase the pre-activation, which

00:20:12.520 | will increase the probability of the correct class.

00:20:15.460 | And I'm going to subtract what is the current probabilities

00:20:18.660 | assigned by my neural net to all of the classes.

00:20:21.860 | So f of x, that's my output layer.

00:20:23.580 | And that's the current beliefs of the neural net

00:20:26.420 | as to in which class, what's the probability of assigning

00:20:30.100 | the input to each class.

00:20:31.740 | So what this is doing is essentially

00:20:33.740 | trying to decrease the probability of everything

00:20:36.260 | and specifically decrease it as much as the neural net currently

00:20:39.900 | believes that the input belongs to it.

00:20:42.900 | And so if you think about the subtraction of these two

00:20:45.420 | things, well, for the class that's the correct class,

00:20:48.260 | I'm going to have 1 minus some number between 0 and 1,

00:20:51.140 | because it's a probability.

00:20:52.420 | So that's going to be positive.

00:20:53.700 | So I'm going to increase the probability

00:20:55.380 | of the correct class.

00:20:56.500 | And for everything else, it's going

00:20:58.000 | to be 0 minus a positive number.

00:20:59.820 | So it's going to be negative.

00:21:01.180 | So I'm actually going to decrease the probability

00:21:03.180 | of everything else.

00:21:04.180 | So intuitively, it makes sense.

00:21:05.620 | This gradient has the right behavior.

00:21:08.740 | And I'm going to take that pre-activation gradient.

00:21:11.580 | I'm going to propagate it from the top to the bottom

00:21:15.340 | and essentially iterating from the last layer, which

00:21:19.740 | is the output layer, L plus 1, all the way down

00:21:22.380 | to the first layer.

00:21:23.980 | And as I'm going down, I'm going to compute the gradients

00:21:26.780 | with respect to my parameters and then

00:21:28.500 | compute what's the gradient for the pre-activation

00:21:31.420 | at the layer below and then iterate like that.

00:21:34.580 | So at each iteration of that loop,

00:21:38.180 | I take what is the current gradient of the loss function

00:21:42.900 | with respect to the pre-activation

00:21:44.420 | at the current layer.

00:21:46.220 | And I can compute the gradient of the loss function

00:21:49.380 | with respect to my weight matrix.

00:21:51.500 | So not doing the derivation here,

00:21:54.460 | it's actually simply this vector.

00:21:58.180 | So in my notation, I assume that all the vectors

00:22:00.580 | are column vectors.

00:22:02.220 | So this pre-activation gradient vector,

00:22:05.180 | and I multiply it by the transpose of the activations,

00:22:09.020 | so the value of the layer right below, the layer k minus 1.

00:22:14.540 | So because I take the transpose, that's

00:22:16.100 | a multiplication like this.

00:22:17.380 | And you can see if I do the outer product,

00:22:19.180 | essentially, between these two vectors,

00:22:20.800 | I'm going to get a matrix of the same size as my weight matrix.

00:22:24.260 | So it all checks out.

00:22:25.940 | That makes sense.

00:22:27.500 | Turns out that the gradient of the loss with respect

00:22:29.680 | to the bias is exactly the gradient

00:22:31.940 | of the loss with respect to the pre-activation.

00:22:34.700 | So that's very simple.

00:22:36.220 | So that gives me now my gradients for my parameters.

00:22:38.700 | Now I need to compute, OK, what is

00:22:40.420 | going to be the gradient of the pre-activations

00:22:42.660 | at the layer below?

00:22:44.700 | Well, first, I'm going to get the gradient of the loss

00:22:48.980 | function with respect to the activation at the layer below.

00:22:54.220 | Well, that's just taking my pre-activation gradient vector

00:22:57.980 | and multiplying it by--

00:22:59.940 | for some reason, it doesn't show here--

00:23:01.780 | and multiply it by the transpose of my weight matrix.

00:23:04.660 | Super simple operation, just a linear transformation

00:23:07.580 | of my gradients at layer k, linear and transformed

00:23:10.580 | to get my gradients of the activation at the layer k

00:23:13.780 | minus 1.

00:23:15.180 | And then to get the gradients of the pre-activation,

00:23:17.900 | so before the activation function,

00:23:21.020 | I'm going to take this gradient here,

00:23:22.940 | which is the gradient of the activation function

00:23:25.540 | at the layer k minus 1.

00:23:27.220 | And then I apply the gradient corresponding

00:23:29.900 | to the partial derivative of my nonlinear activation function.

00:23:33.700 | So this here, this refers to an element-wise product.

00:23:37.060 | So I'm taking these two vectors, this vector here

00:23:39.860 | and this vector here.

00:23:40.740 | I'm going to do an element-wise product between the two.

00:23:43.740 | And this vector here is just the partial derivative

00:23:47.020 | of the activation function for each unit

00:23:49.620 | individually that I've put together into a vector.

00:23:52.700 | This is what this corresponds to.

00:23:55.180 | Now, the key things to notice is first

00:23:57.060 | that this pass, computing all the gradients

00:23:59.580 | and doing all these iterations, is actually fairly cheap.

00:24:02.820 | Complexity is essentially the same as the one

00:24:05.540 | that's doing a forward pass.

00:24:07.700 | So all I'm doing are linear transformations

00:24:11.180 | multiplying by matrices, in this case, the transpose of my weight

00:24:14.220 | matrix.

00:24:15.060 | And then I'm also doing this nonlinear operation

00:24:17.620 | where I'm multiplying by the gradient of the activation

00:24:20.120 | function.

00:24:21.020 | So that's the first thing to notice.

00:24:22.860 | And the second thing to notice is

00:24:24.220 | that here I'm doing this element-wise product.

00:24:27.420 | So if any of these terms here for a unit

00:24:30.060 | is very close to 0, then the pre-activation gradient

00:24:33.900 | is going to be 0 for the next layer.

00:24:36.460 | And I highlight this point because essentially whenever--

00:24:39.740 | that's something to think about a lot when

00:24:41.500 | you're training neural nets.

00:24:42.860 | Whenever this gradient here, these partial derivatives,

00:24:46.060 | come close to 0, then it means the gradient will not

00:24:48.660 | propagate well to the next layer, which

00:24:50.420 | means that you're not going to get a good gradient to update

00:24:53.120 | your parameters.

00:24:54.980 | Now, when does that happen?

00:24:56.380 | When will you see these terms here being close to 0?

00:24:59.380 | Well, that's going to be when the partial derivatives

00:25:01.580 | of these nonlinear activation functions

00:25:03.700 | are close to 0 or 0.

00:25:05.780 | So we can look at the partial derivatives, say,

00:25:08.160 | of the sigmoid function.

00:25:10.140 | It turns out it's super easy to compute.

00:25:12.460 | It's just the sigmoid itself times 1

00:25:15.340 | minus the sigmoid itself.

00:25:18.100 | So that means that whenever the activation

00:25:20.160 | of the unit for a sigmoid unit is close to 1 or close to 0,

00:25:23.500 | I essentially get a partial derivative that's close to 0.

00:25:27.540 | You can kind of see it here.

00:25:28.780 | The slope here is essentially flat,

00:25:30.380 | and the slope here is flat.

00:25:31.700 | That's the value of the partial derivative.

00:25:35.260 | So in other words, if my pre-activations

00:25:37.940 | are very negative or very positive,

00:25:39.860 | or if my unit is very saturated, then gradients

00:25:42.940 | will have a hard time propagating to the next layer.

00:25:46.500 | That's the key insight here.

00:25:49.020 | Same thing for the tanh function.

00:25:51.820 | So it turns out the partial derivative

00:25:53.500 | is also easy to compute.

00:25:54.940 | You just take the tanh value, square it,

00:25:57.580 | and you're going to subtract it to 1.

00:25:59.980 | And indeed, if it's close to minus 1 or close to 1,

00:26:04.420 | you can see that the slope is flat.

00:26:07.180 | So again, if the unit is saturating,

00:26:09.420 | gradients will have a hard time propagating

00:26:11.900 | to the next layers.

00:26:14.140 | And for the ReLU, the rectified linear activation function,

00:26:18.140 | the gradient is even simpler.

00:26:21.180 | You just check whether the pre-activation is greater than

00:26:23.580 | 0.

00:26:24.080 | If it is, the partial derivative is 1.

00:26:26.300 | If it's not, it's 0.

00:26:27.780 | So actually, you're going to multiply by 1 or 0.

00:26:30.020 | You essentially get a binary mask

00:26:31.760 | when you're performing the propagation through the ReLU.

00:26:35.220 | And you can see it.

00:26:36.180 | The slope here is flat, and otherwise, you

00:26:38.100 | have a linear function.

00:26:40.020 | So actually, here, the shrinking of the gradient towards 0

00:26:43.940 | is even harder.

00:26:44.860 | It's exactly multiplying by 0 if you have

00:26:47.780 | a unit that's saturating below.

00:26:52.140 | And beyond all the math, in terms of actually using those

00:26:56.820 | in practice, during the weekend, you'll

00:26:58.800 | see three different libraries that essentially allows you

00:27:01.780 | to compute these gradients for you.

00:27:03.220 | You actually usually don't write down backprop.

00:27:06.000 | You just use all of these modules

00:27:08.020 | that you've implemented.

00:27:09.180 | And it turns out there's a way of automatically differentiating

00:27:13.380 | your loss function and getting gradients for free

00:27:16.220 | in terms of effort, in terms of programming effort,

00:27:19.460 | with respect to your parameters.

00:27:21.500 | So conceptually, the way you do this--

00:27:23.940 | and you'll see essentially three different libraries doing it

00:27:26.460 | in slightly different ways.

00:27:28.580 | What you do is you augment your flow graph

00:27:31.340 | by adding, at the very end, the computation of your loss

00:27:34.420 | function.

00:27:35.600 | And then each of these boxes, which are conceptually

00:27:38.020 | objects that are taking arguments and computing

00:27:40.540 | a value, you're going to augment them to also have a method

00:27:45.420 | that's a backprop or a bprop method.

00:27:47.600 | You'll often see, actually, this expression being used, bprop.

00:27:50.660 | And what this method should do is

00:27:52.980 | that it should take as input, what

00:27:54.500 | is the gradient of the loss with respect to myself?

00:27:57.300 | And then it should propagate to its arguments,

00:28:00.340 | so the things that its parents in the flow graph,

00:28:02.780 | the things it takes to compute its own value,

00:28:04.780 | it's going to propagate them using the chain rule, what

00:28:07.100 | is their gradients with respect to the loss?

00:28:10.780 | So what this means is that you would start the process

00:28:14.220 | by initializing, well, the gradient of the loss

00:28:16.860 | with respect to itself is 1.

00:28:19.360 | And then you pass the bprop method here 1.

00:28:22.260 | And then it's going to propagate to its argument, what

00:28:26.420 | is, by using the chain rule, what

00:28:28.240 | is the gradient of the loss with respect to f of x?

00:28:32.040 | And then you're going to call bprop on this object here.

00:28:34.800 | And it's going to compute, well, I

00:28:36.220 | have the gradient of the loss with respect to myself,

00:28:38.460 | f of x.

00:28:39.300 | From this, I can compute what's the gradient of my argument,

00:28:42.700 | which is the pre-activation at layer 2,

00:28:45.140 | with respect to the loss.

00:28:46.580 | So I'm going to reuse the computation I just

00:28:48.380 | got and update it using my--

00:28:50.860 | what is essentially the Jacobian.

00:28:53.020 | And then I'm going to take the pre-activation here,

00:28:55.140 | which now knows what is the gradient of the loss with

00:28:57.340 | respect to itself, the pre-activation.

00:28:59.180 | It's going to propagate to the weights and the biases

00:29:02.020 | and the layer below, updating them with--

00:29:04.200 | informing them of what is the gradient of the loss

00:29:06.280 | with respect to themselves.

00:29:07.420 | And you continue like this, essentially

00:29:09.080 | going through the flow graph, but in the opposite direction.

00:29:12.880 | So the library torch, the basic library torch,

00:29:15.400 | essentially functions like this quite explicitly.

00:29:18.440 | You construct-- you chain these elements together.

00:29:21.000 | And then when you're performing backpropagation,

00:29:23.000 | you're going in the reverse order

00:29:24.340 | of these chained elements.

00:29:25.880 | And then you have libraries like Torchautograd and Theano

00:29:29.280 | and TensorFlow, which you'll learn about,

00:29:31.240 | which are doing things slightly more sophisticated there.

00:29:34.340 | And you'll learn about that later on.

00:29:38.540 | OK, so that's a discussion of how you actually

00:29:40.940 | compute gradients of the loss with respect to the parameters.

00:29:44.620 | So that's another component we need in stochastic gradient

00:29:47.540 | descent.

00:29:48.640 | We can choose a regularizer.

00:29:50.100 | One that's often used is the L2 regularization.

00:29:53.500 | So that's just the sum of the squared of all the weights.

00:29:57.100 | And the gradient of that is just twice times the weight.

00:30:01.280 | So it's a super simple gradient to compute.

00:30:03.400 | We usually don't regularize the biases.

00:30:06.840 | There's no particularly important reason for that.

00:30:10.960 | There are much fewer biases, so it seems less important.

00:30:15.200 | And often, this L2 regularization

00:30:17.020 | is often referred to as weight decay.

00:30:18.760 | So if you hear about weight decay,

00:30:20.320 | that often refers to L2 regularization.

00:30:24.400 | And then finally, and this is also a very important point,

00:30:28.660 | you have to initialize the parameters before you actually

00:30:31.080 | start doing backprop.

00:30:32.120 | And there are a few tricky cases you

00:30:34.160 | need to make sure that you don't fall into.

00:30:37.640 | So the biases, often we initialize them to 0.

00:30:40.600 | There are certain exceptions, but for the most part,

00:30:42.880 | we initialize them to 0.

00:30:44.720 | But for the weights, there are a few things we can't do.

00:30:47.600 | So we can't initialize the weights to 0,

00:30:50.200 | and especially if you have tanh activations.

00:30:53.960 | The reason-- and I won't explain it here,

00:30:56.160 | but it's not a bad exercise to try to figure out why--

00:30:59.240 | is that essentially, when you do your first pass,

00:31:02.100 | you're going to get gradients for all your parameters that

00:31:04.520 | are going to be 0.

00:31:05.920 | So you're going to be stuck at this 0 initialization.

00:31:08.960 | So we can't do that.

00:31:11.280 | We also can't initialize all the weights

00:31:13.160 | to exactly the same value.

00:31:16.440 | Again, you think about it a little bit.

00:31:18.720 | What's going to happen is essentially

00:31:20.600 | that all the weights coming into a unit within the layer

00:31:24.720 | are going to have exactly the same gradients, which

00:31:27.640 | means they're going to be updated exactly the same way,

00:31:30.000 | which means they're going to stay constant the same--

00:31:32.280 | not constant, but they're going to stay the same--

00:31:34.360 | the whole time.

00:31:35.080 | So it's as if you have multiple copies of the same unit.

00:31:38.320 | So you essentially have to break that initial symmetry

00:31:40.920 | that you would create if you initialized everything

00:31:43.080 | to the same value.

00:31:44.520 | So what we end up doing most of the time

00:31:46.260 | is initialize the weights to some randomly generated value.

00:31:50.480 | Often, we generate them--

00:31:52.080 | there are a few other recipes, but one of them

00:31:54.120 | is to initialize them from some uniform distribution

00:31:56.480 | between lower and upper bound.

00:31:59.360 | This is a recipe here that is often

00:32:01.440 | used that has some theoretical grounding that was derived

00:32:05.440 | specifically for the tanh.

00:32:06.880 | There's this paper here by Xavier Guerroux and Yoshua

00:32:09.880 | Bengio you can check out for some intuition as to how

00:32:13.120 | you should initialize the weights.

00:32:14.460 | But essentially, they should be initially random,

00:32:17.100 | and they should be initially close to 0.

00:32:19.320 | Random to break symmetry, and close to 0

00:32:23.160 | so that initially the units are not already saturated.

00:32:27.040 | Because if the units are saturated,

00:32:28.680 | then there are no gradients that are

00:32:29.800 | going to pass through the units.

00:32:31.280 | You're essentially going to get gradients very close to 0

00:32:33.680 | at the lower layers.

00:32:35.120 | So that's the main intuition, is to have weights

00:32:37.360 | that are small and random.

00:32:40.200 | So those are all the pieces we need

00:32:45.360 | for running stochastic gradient descent.

00:32:47.200 | So that allows us to take a training set

00:32:48.860 | and run a certain number of epochs,

00:32:50.720 | and have the neural net learn from that training set.

00:32:53.920 | Now, there are other quantities in our neural network

00:32:57.120 | that we haven't specified how to choose them.

00:32:59.240 | So those are the hyperparameters.

00:33:02.560 | So usually, we're going to have a separate validation set.

00:33:05.240 | Most people here are familiar with machine learning,

00:33:06.880 | so that's a typical procedure.

00:33:08.440 | And then we need to select things like, OK,

00:33:10.320 | how many layers do I want?

00:33:11.560 | How many units per layer do I want?

00:33:14.200 | What's the step size, the learning rate

00:33:16.120 | of my stochastic gradient descent procedure,

00:33:17.920 | that alpha number?

00:33:19.540 | What is the weight decay that I'm going to use?

00:33:22.300 | So a standard thing in machine learning

00:33:24.340 | is to perform a grid search.

00:33:27.380 | That is, if I have two hyperparameters,

00:33:29.340 | I list out a bunch of values I want to try.

00:33:31.320 | So for the number of hidden units,

00:33:32.780 | maybe I want to try 100, 1,000, and 2,000, say.

00:33:36.940 | And then for the learning rate, maybe I

00:33:38.540 | want to try 0.01 and 0.001.

00:33:42.420 | So a grid search would just try all combinations

00:33:44.580 | of these three values for the hidden units

00:33:46.900 | and these two values for the learning rates.

00:33:49.820 | So that means that the more hyperparameters there are,

00:33:53.340 | the number of configurations you have to try out blows up

00:33:57.420 | and grows exponentially.

00:33:59.620 | So another procedure that is now more and more common,

00:34:03.180 | which is more practical, is to perform a form of random search.

00:34:07.580 | In this case, what you do is for each parameter,

00:34:10.020 | you actually determine a distribution of likely values

00:34:13.460 | you'd like to try.

00:34:14.220 | So it could be--

00:34:16.100 | so for the number of hidden units,

00:34:17.560 | maybe I do a uniform distribution

00:34:19.500 | over all integers from 100 to 1,000, say,

00:34:22.700 | or maybe a log uniform distribution.

00:34:25.220 | And for the learning rate, maybe, again,

00:34:26.920 | the log uniform distribution, but from 0.001 to 0.01, say.

00:34:32.840 | And then to get an experiment, so

00:34:35.260 | to get values for my hyperparameters

00:34:37.060 | to do an experiment with and get a performance on my validation

00:34:39.940 | set, I just independently sample from these distributions

00:34:43.140 | for each hyperparameter to get a full configuration

00:34:46.460 | for my experiment.

00:34:47.820 | And then because I have this way of getting one experiment,

00:34:50.720 | I do it independently for all of my jobs, all of my experiments

00:34:53.740 | that I will do.

00:34:54.620 | So in this case, if I know I have enough compute power

00:34:58.020 | to do 50 experiments, I just sample 50 independent samples

00:35:02.120 | from these distributions for hyperparameters,

00:35:04.100 | perform these 50 experiments, and I just take the best one.

00:35:07.860 | What's nice about it is that there are no--

00:35:09.880 | unlike grid search, there are never any holes in the grid.

00:35:12.620 | That is, you just specify how many experiments you do.

00:35:15.260 | If one of your jobs died, well, you just have one less.

00:35:18.460 | But there's no hole in your experiment.

00:35:21.900 | And also, one reason why it's particularly useful,

00:35:24.700 | this approach, is that if you have a specific value in grid

00:35:29.380 | search for one of the hyperparameters that just makes

00:35:32.100 | the experiment not work at all-- so learning rates

00:35:35.100 | are a lot like this.

00:35:36.300 | If you have a learning rate that's too high,

00:35:38.620 | it's quite possible that convergence of the optimization

00:35:42.180 | will not converge.

00:35:43.540 | Well, if you're using a grid search,

00:35:45.180 | it means that for all the experiments that

00:35:47.060 | use that specific value of the learning rate,

00:35:48.980 | they're all going to be garbage.

00:35:50.460 | They're all not going to be useful.

00:35:52.340 | And you don't really get this sort of big waste of computation

00:35:56.220 | if you do a random search, because most likely,

00:35:58.540 | all the values of your hyperparameters

00:36:00.060 | are going to be unique, because they're

00:36:01.660 | samples, say, from a uniform distribution over some range.

00:36:06.140 | So that actually works quite well,

00:36:08.620 | and it's quite recommended.

00:36:10.580 | And there are more advanced methods,

00:36:12.420 | like methods based on machine learning, Bayesian

00:36:15.220 | optimization, or sometimes known as sequential model-based

00:36:18.460 | optimization, that I won't talk about,

00:36:21.300 | but that works a bit better than random search.

00:36:25.740 | And that's another alternative if you

00:36:27.780 | think you have an issue finding good hyperparameters,

00:36:29.940 | is to investigate some of these more advanced methods.

00:36:34.380 | Now, you do this for most of your hyperparameters,

00:36:37.180 | but for the number of epochs, the number of times

00:36:40.060 | you go through all of your examples in your training set,

00:36:44.700 | what we usually do is not grid search or random search,

00:36:49.020 | but we use a thing known as early stopping.

00:36:51.580 | The idea here is that if I've trained

00:36:53.460 | a neural net for 10 epochs, while training a neural net

00:36:56.700 | with all the other hyperparameters kept constant,

00:36:59.340 | but one more epoch is easy.

00:37:01.300 | I just do one more epoch.

00:37:02.820 | So I shouldn't start over and then do, say,

00:37:06.340 | 11 epochs from scratch.

00:37:08.580 | And so what we would do is we would just

00:37:10.300 | track what is the performance on the validation set

00:37:12.820 | as I do more and more epochs.

00:37:14.740 | And what we will typically see is the training error

00:37:17.100 | will go down, but the validation set performance will go down

00:37:20.980 | and eventually go up.

00:37:22.900 | The intuition here is that the gap

00:37:25.080 | between the performance on the training set

00:37:27.420 | and the performance on the validation set

00:37:29.120 | will tend to increase.

00:37:31.260 | And since the training curve cannot go below, usually,

00:37:34.460 | some bound, then eventually the validation set performance

00:37:38.280 | has to go up.

00:37:39.820 | Sometimes it won't necessarily go up,

00:37:41.320 | but it sort of stays stable.

00:37:42.700 | So with early stopping, what we do

00:37:44.080 | is that if we reach a point where the validation set

00:37:46.260 | performance hasn't improved from some certain number

00:37:49.020 | of iterations, which we refer to as the look ahead,

00:37:52.340 | we just stop.

00:37:53.340 | We go back to the neural net that

00:37:54.660 | had the best performance overall in the validation set,

00:37:56.960 | and that's my neural network.

00:37:58.700 | So I have now a very cheap way of actually getting

00:38:01.460 | the number of iterations or the number of epochs

00:38:03.960 | over my training set.

00:38:07.220 | A few more tricks of the trade.

00:38:09.500 | So it's always useful to normalize your data.

00:38:13.060 | It will often have the effect of speeding up training.

00:38:16.780 | If you have real value data for binary data,

00:38:19.340 | that's usually keep it as it is.

00:38:21.940 | So what I mean by that is just subtract for each dimension

00:38:24.940 | what is the average in the training set of that dimension,

00:38:27.700 | and then dividing by the standard deviation

00:38:29.660 | of each dimension again in my input space.

00:38:33.500 | So this can speed up training.

00:38:35.980 | We often use a decay on the learning rate.

00:38:40.020 | There are a few methods for doing this.

00:38:41.660 | One that's very simple is to start with a large learning

00:38:45.300 | rate and then track the performance on the validation

00:38:47.460 | set.

00:38:48.140 | And once on the validation set it stops improving,

00:38:50.980 | you decrease your learning rate by some ratio.

00:38:53.060 | Maybe you divide it by 2.

00:38:54.780 | And then you continue training for some time.

00:38:56.980 | Hopefully, the validation set performance starts improving.

00:39:00.620 | And then at some point, it stops improving, and then you stop.

00:39:04.020 | Or you divide again by 2.

00:39:05.460 | So that sort of gives you an adaptive--

00:39:08.180 | using the validation set, an adaptive way

00:39:10.260 | of changing your learning rate.

00:39:11.700 | And that can, again, work better than having a very small

00:39:14.900 | learning rate than waiting for a longer time.

00:39:16.860 | So making very fast progress initially,

00:39:18.660 | and then slower progress towards the end.

00:39:20.500 | Also, I've described so far the approach

00:39:26.260 | for training neural nets that is based on a single example

00:39:30.740 | at a time.

00:39:31.300 | But in practice, we actually use what's called mini-batches.

00:39:33.800 | That is, we compute the loss function

00:39:36.100 | on a small subset of examples, say, 64, 128.

00:39:40.580 | And then we take the average of the loss of all these examples

00:39:43.580 | in that mini-batch.

00:39:45.020 | And that's actually-- we compute the gradient

00:39:47.260 | of this average loss on that mini-batch.

00:39:49.880 | The reason why we do this is that it turns out

00:39:53.100 | that you can very efficiently implement the forward pass

00:39:56.780 | over all of these 64, 128 examples in my mini-batch

00:40:01.260 | in one pass by, instead of doing vector matrix multiplications

00:40:05.300 | when we compute the pre-activations,

00:40:07.280 | doing matrix-matrix multiplications, which

00:40:09.760 | are faster than doing multiple matrix-vector multiplications.

00:40:13.880 | So in your code, often, there will

00:40:15.800 | be this other hyperparameter, which

00:40:17.480 | is mostly optimized for speed in terms of how quickly training

00:40:21.080 | will proceed of the number of examples in your mini-batch.

00:40:25.240 | Other things to improve optimization

00:40:27.200 | might be using a thing like momentum.

00:40:29.560 | That is, instead of using, as the descent direction,

00:40:33.320 | the gradient of the loss function,

00:40:35.220 | I'm actually going to track a descent direction, which

00:40:39.040 | I'm going to compute as the current gradient

00:40:41.160 | for my current example or mini-batch,

00:40:44.080 | plus some fraction of the previous update,

00:40:47.040 | the previous direction of update.

00:40:50.200 | And beta now is a hyperparameter you have to optimize.

00:40:52.640 | So what this does is, if all the update directions agree

00:40:56.840 | across multiple updates, then it will start picking up momentum

00:41:00.720 | and actually make bigger steps in those directions.

00:41:05.840 | And then there are multiple, even more advanced methods

00:41:08.760 | for adding adaptive types of learning rates.

00:41:12.560 | I mentioned them here very quickly,

00:41:14.060 | because you might see them in papers.

00:41:15.600 | There's a method known as AdaGrad,

00:41:17.480 | where the learning rate is actually

00:41:19.440 | scaled for each dimension, so for each weight

00:41:23.260 | and each biases.

00:41:24.440 | It's going to be scaled by what is the square root

00:41:28.800 | of the cumulative sum of the squared gradients.

00:41:31.920 | So what I track is I take my gradient vector at each step.

00:41:35.360 | I do an element-wise square of all the dimensions

00:41:39.120 | of my gradients, my gradient vector.

00:41:41.160 | And then I accumulate that in some variable

00:41:43.160 | that I'm noting as gamma here.

00:41:44.920 | And then for my descent direction, I take the gradient,

00:41:47.840 | and I do an element-wise division

00:41:50.080 | by the square root of this cumulative sum

00:41:52.920 | of squared gradients.

00:41:54.720 | There's also RMSProp, which is essentially like AdaGrad,

00:41:57.440 | but instead of doing a cumulative sum,

00:41:59.640 | we're going to do an exponential moving average.

00:42:02.080 | So we take the previous value times some factor

00:42:04.760 | plus 1 minus this factor times the current squared gradient.

00:42:08.960 | So that's RMSProp.

00:42:10.380 | And then there's Adam, which is essentially

00:42:12.720 | a combination of RMSProp with momentum, which

00:42:15.380 | is more involved.

00:42:16.200 | And I won't have time to describe it here,

00:42:17.960 | but that's another method that's often actually implemented

00:42:21.440 | in these different softwares and that people seem

00:42:24.400 | to use with a lot of success.

00:42:28.120 | And finally, in terms of actually debugging

00:42:31.400 | your implementations--

00:42:33.160 | so for instance, if you're lucky,

00:42:35.280 | you can build your neural network

00:42:36.720 | without difficulty using the current tools that

00:42:38.680 | are available in Torch or TensorFlow or Tiano.

00:42:41.240 | But maybe sometimes you actually have

00:42:42.840 | to implement certain gradients for a new module

00:42:45.640 | and a new box in your flow graph that

00:42:47.920 | isn't currently supported.

00:42:49.520 | If you do this, you should check that you've implemented

00:42:51.960 | your gradients correctly.

00:42:53.760 | And one way of doing that is to actually compare

00:42:56.560 | the gradients computed by your code

00:42:58.560 | with a finite difference of estimate.

00:43:01.240 | So what you do is, for each parameter,

00:43:03.160 | you add some very small epsilon value, say 10 to the minus 6,

00:43:07.220 | and you compute what is the output of your module.

00:43:10.760 | And then you subtract the same thing,

00:43:12.360 | but where you've subtracted the small quantity,

00:43:15.600 | and then you divide by 2 epsilon.

00:43:17.400 | So if epsilon converges to 0, then you actually

00:43:20.520 | get the partial derivative.

00:43:21.960 | But if it's just small, it's going to be an approximate.

00:43:24.240 | And usually, this finite difference estimate

00:43:26.600 | will be very close to a correct implementation

00:43:29.520 | of the real gradient.

00:43:30.840 | So you should definitely do that if you've actually

00:43:33.380 | implemented some of the gradients in your code.

00:43:36.160 | And then another useful thing to do

00:43:37.900 | is to actually do a very small experiment on a small data set

00:43:41.880 | before you actually run your full experiment

00:43:44.320 | on your complete data set.

00:43:45.880 | So use, say, 50 examples.

00:43:47.760 | So just taking a random subset of 50 examples

00:43:50.400 | from your data set.

00:43:52.000 | Actually, just make sure that your code can overfit

00:43:54.640 | to that data, can essentially classify it perfectly,

00:43:58.600 | given enough capacity that you would think it should get it.

00:44:03.040 | So if it's not the case, then there's a few things

00:44:06.300 | that you might want to investigate.

00:44:08.320 | Maybe your initialization is such

00:44:09.920 | that the units are already saturated initially,

00:44:12.620 | and so there's no actual optimization

00:44:14.800 | happening because some of the gradients on some of the weights

00:44:17.440 | are exactly zero.

00:44:19.040 | So you might want to check your initialization.

00:44:22.160 | Maybe your gradients are just--

00:44:23.760 | you're using a model you implemented gradients for,

00:44:26.040 | and maybe your gradients are not properly implemented.

00:44:28.920 | Maybe you haven't normalized your input, which

00:44:31.040 | creates some instability, making it

00:44:33.040 | harder for stochastic gradient descent to work successfully.

00:44:38.400 | Maybe your learning rate is too large.

00:44:40.080 | Then you should consider trying smaller learning rates.

00:44:42.840 | That's actually a pretty good way of adding

00:44:44.940 | some idea of the magnitude of the learning rate

00:44:47.640 | you should be using.

00:44:49.320 | And then once you actually overfit

00:44:51.680 | in your small training set, you're

00:44:53.060 | ready to do a full experiment on a larger data set.

00:44:56.640 | That said, this is not a replacement

00:44:58.880 | for gradient checking.

00:45:00.240 | So backprop and stochastic gradient descent,

00:45:03.440 | it's a great algorithm that's very bug resistant.

00:45:06.800 | You will potentially see some learning happening,

00:45:10.540 | even if some of your gradients are wrong,

00:45:12.400 | or say, exactly zero.

00:45:13.840 | So that's great if you're an engineer

00:45:16.160 | and you're implementing things.

00:45:17.800 | It's fun when code is somewhat bug resistant.

00:45:19.960 | But if you're actually doing science

00:45:21.600 | and trying to understand what's going on,

00:45:23.960 | that can be a complication.

00:45:25.160 | So do both, gradient checking and a small experiment

00:45:29.160 | like that.

00:45:31.120 | All right, and so for the last few minutes,

00:45:33.000 | I'll actually try to motivate what

00:45:35.040 | you'll be learning quite a bit about in the next two days.

00:45:40.280 | That is, the specific case for deep learning.

00:45:43.880 | So I've already told you that if I have a neural net with enough

00:45:47.760 | hidden units, theoretically, I can potentially

00:45:49.880 | represent pretty much any function, any classification

00:45:53.000 | function.

00:45:53.920 | So why would I want multiple layers?

00:45:56.360 | So there are a few motivations behind this.

00:45:59.000 | The first one is taken directly from our own brains.

00:46:02.160 | So we know in the visual cortex that the light that

00:46:05.320 | hits our retina eventually goes through several regions

00:46:08.520 | in the visual cortex.

00:46:09.880 | Eventually reaching an area known as V1,

00:46:12.520 | where you have units that are-- or neurons that are essentially

00:46:16.040 | tuned to small forms like edges.

00:46:18.840 | And then it goes on to V4, where it's

00:46:20.480 | slightly more complex patterns that the units are tuned for.

00:46:23.880 | And then you reach AIT, where you actually

00:46:25.680 | have neurons that are specific to certain objects

00:46:27.680 | or certain units.

00:46:28.840 | And so the idea here is that perhaps that's

00:46:30.960 | also what we want in an artificial, say, vision system.

00:46:35.800 | We'd like it, if it's detecting faces,

00:46:37.760 | to have a first layer that detects simple edges,

00:46:41.080 | and then another layer that perhaps puts these edges

00:46:43.520 | together, detecting slightly more complex things,

00:46:45.840 | like a nose or a mouth or eyes.

00:46:47.880 | And then eventually have a layer that combines

00:46:49.880 | these slightly less abstract or more abstract units

00:46:54.520 | to get something even more abstract,

00:46:56.200 | like a complete face.

00:46:58.540 | There's also some theoretical justification

00:47:00.680 | for using multiple layers.

00:47:04.160 | So the early results were mostly based

00:47:06.360 | on studying Boolean functions, or a function that

00:47:08.920 | takes as input--

00:47:10.160 | can think of it as a vector of just zeros and ones.

00:47:12.640 | And you could show that there are certain functions that,

00:47:17.080 | if you had essentially a Boolean neural network

00:47:19.960 | or essentially a Boolean circuit,

00:47:22.840 | and you restricted the number of layers of that circuit,

00:47:26.280 | that there are certain functions that, in this case,

00:47:28.480 | to represent certain Boolean functions exactly,

00:47:31.160 | you would need an exponential number of units

00:47:33.600 | in each of these layers.

00:47:35.160 | Whereas if you allowed yourself to have multiple layers,

00:47:37.160 | then you could represent these functions more compactly.

00:47:39.840 | And so that's another motivation,

00:47:41.560 | that perhaps with more layers, we

00:47:42.960 | can represent fairly complex functions in a more compact way.

00:47:48.520 | And then there's the reason that they just work.

00:47:51.160 | So we've seen in the past few years

00:47:53.920 | great success in speech recognition, where it's

00:47:56.440 | essentially revolutionized the field, where everyone's using

00:47:59.020 | deep learning for speech recognition,

00:48:00.940 | and same thing for visual object recognition,

00:48:03.840 | where, again, deep learning is sort

00:48:05.400 | of the method of choice for identifying objects in images.

00:48:10.520 | So then why are we doing this only recently?

00:48:13.760 | Why didn't we do deep learning way back

00:48:16.640 | when backprop was invented, which is essentially in 1980s

00:48:20.760 | and even before that?

00:48:22.840 | So it turns out training deep neural networks

00:48:24.760 | is actually not that easy.

00:48:26.040 | There are a few hurdles that one can be confronted with.

00:48:30.000 | I've already mentioned one of the issues, which

00:48:32.120 | is that some of the gradients might be fading as you go

00:48:35.640 | from the top layer to the bottom layer,

00:48:37.280 | because we keep multiplying by the derivative of the activation

00:48:39.940 | function.

00:48:40.520 | So that makes training hard.

00:48:41.960 | It could be that the lower layers at very small gradients

00:48:44.560 | are barely moving and exploring the space of correct features

00:48:49.040 | to learn for a given problem.

00:48:51.320 | Sometimes that's the problem you find.

00:48:52.960 | You have a hard time just fitting your data,

00:48:54.880 | and you're essentially underfitting.

00:48:57.080 | Or it could be that with deeper neural nets or bigger

00:49:00.360 | neural nets, we have more parameters.

00:49:02.200 | So perhaps sometimes we're actually overfitting.

00:49:04.240 | We're in a situation where all the functions that we

00:49:07.680 | can represent with the same neural net represented

00:49:10.840 | by this gray area function actually includes, yes,

00:49:14.040 | the right function, but it's so large

00:49:15.660 | that for a finite training set, the odds

00:49:18.280 | that I'm going to find the one that's

00:49:19.780 | close to the true classifying function, the real system

00:49:22.840 | that I'd like to have, is going to be very different.

00:49:25.720 | So in this case, I'm essentially overfitting,

00:49:28.080 | and that might also be a situation we're in.

00:49:31.200 | And unfortunately, there are many situations

00:49:35.880 | where one problem is observed, overfitting or underfitting.

00:49:40.920 | And so we essentially have, in the field,

00:49:43.480 | developed tools for fighting both situations.

00:49:46.140 | And I'm going to rapidly touch a few of those, which you will

00:49:49.600 | see will come up later on in multiple talks.

00:49:53.560 | So one of the first hypotheses, which

00:49:56.320 | might be that you're underfitting,

00:49:57.760 | well, you can essentially just fight this

00:49:59.960 | by waiting longer, so training longer.

00:50:01.880 | If you have your gradients are too small,

00:50:03.600 | and this is essentially why you're progressing very slowly

00:50:05.800 | when you're training, well, if you're using GPUs

00:50:08.220 | and are able to do more iterations over the same

00:50:11.480 | training set in less time, that might just

00:50:15.120 | solve your problem of underfitting.

00:50:16.800 | And I think we've seen some of that,

00:50:18.760 | and this is partly why GPUs have been so game-changing

00:50:21.340 | for deep learning.

00:50:22.520 | Or you can use just better optimization methods also.

00:50:26.120 | And if you're overfitting, well, we just

00:50:28.200 | need better regularization.

00:50:31.560 | I've been involved early on in my PhD

00:50:33.680 | on using unsupervised learning as a way

00:50:36.040 | to regularize neural nets.

00:50:38.820 | If I have time, I'll talk a little bit about that.

00:50:40.940 | And there's another method you might have heard

00:50:43.200 | about known as dropout.

00:50:44.840 | So I'll try to touch at least two methods that are essentially

00:50:49.560 | trying to address some of these issues.

00:50:51.360 | So the first one that I'll talk about is dropout.

00:50:54.720 | It's actually very easy, very simple.

00:50:57.920 | So the idea of if our neural net is essentially overfitting,

00:51:01.120 | so it's too good at training on the training set,

00:51:04.520 | well, we're essentially going to cripple training.

00:51:06.880 | We're going to make it harder to fit the training set.

00:51:09.240 | And the way we're going to do that in dropout

00:51:11.240 | is that we will stochastically remove

00:51:13.880 | hidden units independently.

00:51:16.120 | So for each hidden unit, before we do a forward pass,

00:51:18.880 | we'll flip a coin.

00:51:20.240 | And with probability half, we will

00:51:22.880 | multiply the activation by 0.

00:51:24.800 | And with probability half, we'll multiply it by 1.

00:51:27.500 | So what this means is that if a unit is multiplied by 0,

00:51:30.840 | it's effectively not in the neural net anymore.

00:51:34.120 | And we're doing this independently

00:51:36.360 | for each hidden units.

00:51:37.880 | So that means that in a layer, a unit cannot rely anymore

00:51:41.840 | on the presence on any other units

00:51:44.760 | to try to sort of synchronize and adapt

00:51:47.880 | to perform a complex classification

00:51:50.600 | or learn a complex feature.

00:51:52.360 | And that was partly the motivation behind dropout

00:51:54.680 | is that this procedure might encourage

00:51:57.480 | types of features that are not co-adapted

00:51:59.760 | and are less likely to overfit.

00:52:02.880 | So we often use 0.5 as the probability

00:52:05.960 | of dropping out a unit.

00:52:08.040 | It turns out it often, surprisingly,

00:52:10.040 | is the best value.

00:52:11.160 | But that's another hyperparameter

00:52:12.920 | you might want to tune.

00:52:15.060 | And in terms of how it impacts an implementation of backdrop,

00:52:18.560 | it's very simple.

00:52:20.000 | So the forward pass, before I do it,

00:52:21.600 | I just sample my binary masks for all my layers.

00:52:24.760 | And then when I'm performing backdrop, well,

00:52:28.240 | my gradient on the--

00:52:30.000 | oh, sorry.

00:52:30.480 | So that's the forward pass.

00:52:31.580 | I'm just multiplying by this binary mask here.

00:52:34.760 | So super simple change.

00:52:36.640 | And then in terms of backdrop, well, I'm

00:52:39.000 | also going to multiply by the mask

00:52:41.160 | when I get my gradient on the pre-activation.

00:52:43.800 | And also, don't forget that the activations are now different.

00:52:47.080 | They actually include the mask in my notation.

00:52:49.920 | So it's a very simple change in the forward and backward pass

00:52:52.860 | when you're training.

00:52:54.280 | And also, another thing that I should emphasize

00:52:56.280 | is that the mask is being resampled for every example.

00:52:59.600 | So before you do a forward pass, you resample the mask.

00:53:02.160 | You don't keep it--

00:53:03.280 | sample it once and then use it the whole time.

00:53:07.560 | And then at test time, because we don't really

00:53:09.560 | like a model that sort of randomly changes its output,

00:53:13.600 | because it will if we stochastically change the masks,

00:53:16.840 | what we do is we replace the mask

00:53:18.920 | by the probability of dropping out a unit,

00:53:23.600 | or actually of keeping a unit.

00:53:25.480 | So if we're using 0.5, that's just 0.5.

00:53:29.200 | We can actually show that if you have a neural net

00:53:31.360 | with a single hidden layer, doing this transformation

00:53:34.720 | at test time, multiplying by 0.5 is

00:53:36.760 | equivalent to doing a geometric average of all

00:53:39.640 | the possible neural networks with all the different binary

00:53:42.120 | mask patterns.

00:53:43.360 | So it's essentially one way of thinking about dropout

00:53:46.600 | in the single layer case is that it's kind of an assembling

00:53:49.020 | method where you have a lot of models,

00:53:51.040 | an exponential number of models, which

00:53:52.760 | are all sharing the same weights but have different masks.

00:53:56.040 | That intuition, though, doesn't transfer for deep neural nets

00:53:59.520 | in the sense that you cannot show this result.

00:54:01.480 | It really only applies to a single hidden layer.

00:54:05.920 | So in practice, it's very effective,

00:54:08.480 | but do expect some slowdown in training.

00:54:10.640 | So often, we tend to see that training a network

00:54:13.640 | to completion will take twice as many epochs

00:54:16.160 | if you're using dropout with 0.5.

00:54:18.360 | And here, you have the reference if you

00:54:19.980 | want to learn more about different variations

00:54:21.920 | of dropouts and so on.

00:54:23.000 | And I probably won't talk about unsupervised retraining

00:54:28.920 | for lack of time, but I'll talk about another thing

00:54:31.040 | that you'll definitely probably hear about,

00:54:33.440 | and that's implemented in these different packages, which

00:54:35.740 | is batch normalization.

00:54:37.360 | Batch normalization is kind of interesting in the sense

00:54:39.660 | that it's been shown to better optimize.

00:54:42.840 | That is, certain networks that would otherwise underfit

00:54:46.000 | would not underfit as much anymore

00:54:48.440 | if you use batch normalization.

00:54:49.840 | But also, it's been shown that when

00:54:51.440 | you use batch normalization, dropout is not as useful.

00:54:54.720 | And dropout being a regularization method,

00:54:56.720 | that suggests that perhaps batch normalization is also

00:54:59.760 | regularizing in some way.

00:55:01.280 | So these things are not one or the other.

00:55:03.800 | They're not mutually exclusive.

00:55:05.160 | You can have a regularizer that also, it turns out,

00:55:07.440 | helps you better optimize.

00:55:10.240 | So the intuition behind batch normalization

00:55:14.760 | is much like I've suggested that normalizing your inputs actually

00:55:19.560 | can help speeding up training.

00:55:21.120 | Well, how about we also normalize all the hidden layers

00:55:24.380 | when I'm doing my forward pass?

00:55:27.600 | Now, the problem in doing this is

00:55:28.960 | that I can compute the mean and the standard deviations

00:55:31.880 | of my inputs once and for all because they're constant.

00:55:35.000 | But my hidden layers are constantly

00:55:36.640 | changing because I'm training these parameters.

00:55:39.080 | So the mean and the standard deviation of my units

00:55:41.640 | will change.

00:55:42.800 | And so it would be very expensive

00:55:45.920 | if every time I did an update on my parameters,

00:55:48.120 | I recomputed the means and the standard deviations

00:55:50.540 | of all of my units.

00:55:52.400 | So batch normalization addresses some of these issues

00:55:54.980 | as follows.

00:55:56.360 | So the way it works is first, the normalization

00:55:59.520 | is going to be applied on actually the pre-activation.

00:56:02.240 | So not the activation of the unit,

00:56:03.660 | but before the non-linearity.

00:56:06.480 | During training, to address the issue

00:56:08.600 | that we don't want to compute means over the full training

00:56:11.120 | set because that would be too slow,

00:56:12.720 | I'm actually going to compute it on each mini-batch.

00:56:15.960 | So I have to do mini-batch training here.

00:56:18.120 | I'm going to take my small mini-batch of 64, 128 examples.

00:56:21.560 | And that's the set of examples on which

00:56:23.440 | I'm going to compute my means and standard deviations.

00:56:26.800 | And then when I do backprop, I'm actually

00:56:28.920 | going to take into account the normalization.

00:56:31.160 | So now there's going to be a gradient going

00:56:33.280 | through the computation of the mean and the standard deviation

00:56:36.360 | because they depend on the parameters

00:56:38.320 | of the neural network.

00:56:40.160 | And then at test time, we'll just

00:56:41.540 | use the global mean and global standard deviation.

00:56:44.200 | Once I finish training, I can actually

00:56:45.880 | do a full pass over the whole training set and get

00:56:48.120 | all of my means and standard deviations.

00:56:52.000 | So that's essentially the pseudocode for that,

00:56:54.680 | taken out of the paper directly.

00:56:57.360 | So if x is a pre-activation for a unit

00:57:00.520 | and have multiple pre-activations

00:57:02.640 | for a single unit across my mini-batch,

00:57:05.240 | I would compute what is the average for that unit

00:57:08.280 | pre-activation across my examples in my mini-batch,

00:57:11.400 | compute my variance, and then subtract the mean

00:57:15.000 | and divide by the square root of the variance,

00:57:17.760 | plus some epsilon for numerical stability

00:57:19.680 | in case the variance is too close to zero.

00:57:22.160 | And then another thing is that actually batch normalization

00:57:25.200 | doesn't just perform this normalization

00:57:28.080 | and outputs the normalized pre-activation.

00:57:30.400 | It then actually performs a linear transformation on it.

00:57:34.640 | So it multiplies it by this parameter gamma,

00:57:37.160 | which is going to be trained by gradient descent.

00:57:40.160 | And it's often called the gain parameter

00:57:43.240 | of batch normalization.

00:57:45.760 | And it adds a bias beta.

00:57:47.960 | And the reason is that if I'm subtracting by the mean,

00:57:51.360 | then each of these units have the bias parameter.

00:57:54.620 | So if I subtract it, then this essentially here,

00:57:58.280 | there's no bias anymore.

00:57:59.680 | It was present here, it was present here,

00:58:01.360 | and now it's been subtracted.

00:58:02.720 | So I have to add the bias, but after the batch normalization,

00:58:05.580 | essentially.

00:58:06.360 | So these betas here are essentially

00:58:08.240 | the new bias parameters.

00:58:10.380 | And those will actually be trained.

00:58:11.800 | So we do gradient descent also on those.

00:58:13.760 | So batch normalization adds a few parameters.

00:58:18.120 | All right, and as I said, I'm just going to skip over this.

00:58:20.640 | And I'm not showing what the gradients are when you back

00:58:23.200 | prop through the mean and so on.

00:58:24.760 | It's described in the paper if you want to see the gradients.

00:58:27.300 | But otherwise, in the different packages,

00:58:31.200 | you'll get the gradients automatically.

00:58:33.160 | It's usually been implemented.

00:58:35.580 | Skipping over that, I'll just finish.

00:58:38.080 | If you actually want to learn about unsupervised

00:58:40.160 | pre-training and why it works, I have videos on that.

00:58:42.480 | So you can check that out.

00:58:44.200 | And I guess that's it.

00:58:46.160 | Thank you.

00:58:46.960 | [APPLAUSE]

00:58:49.880 | Thanks, Hugo.

00:58:56.040 | So we have a few minutes for questions which

00:58:58.400 | are intermingled with a break.

00:59:00.520 | So feel free to either go for a break or ask questions to Hugo.

00:59:04.800 | I believe there are microphones.

00:59:06.220 | And I'll also stick around.

00:59:07.760 | So if you want to ask me questions offline,

00:59:09.880 | that's also fine.

00:59:10.960 | If anyone has questions, you can go to the mic.

00:59:13.040 | Go to the microphone.

00:59:23.080 | Hi.

00:59:23.580 | Hi.

00:59:24.080 | You mentioned the ReLU adds sparsity.

00:59:26.340 | Can you explain why?

00:59:28.440 | Yeah, so the first thing is that it's observed in practice.

00:59:34.400 | And it adds some sparsity in part

00:59:37.200 | because you have the non-linearity at 0 below.

00:59:40.520 | So it means that units are going to be potentially

00:59:43.560 | exactly sparse, essentially absent of the hidden layer.

00:59:49.480 | There are a few reasons to explain why you get sparsity.

00:59:54.400 | It turns out that this process of doing a linear

00:59:56.900 | transformation followed by the ReLU activation function

01:00:00.460 | is very close to some of the steps

01:00:02.160 | you would do when you're optimizing for sparse codes

01:00:04.840 | in a sparse coding model, if you know about sparse coding.

01:00:07.700 | So they're essentially an optimization method

01:00:10.280 | that, given some sparse coding model,

01:00:12.720 | will find what is the sparse representation,

01:00:15.640 | hidden representation for some input.

01:00:17.580 | And it's mostly a sequence of linear transformations

01:00:20.920 | followed by this ReLU-like activation function.

01:00:25.120 | And I think this is partly the explanation.

01:00:27.600 | Otherwise, I don't know of a solid explanation

01:00:31.280 | for why that is beyond what's observed in practice.

01:00:34.280 | Any more questions?

01:00:41.240 | If not, let's thank Hugo again.

01:00:43.120 | [APPLAUSE]

01:00:47.600 | And we are reconvening in 10 minutes.

Foundations of Deep Learning (Hugo Larochelle, Twitter)

Chapters