Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 3 - Backprop and Neural Networks
Chapters
0:00 Introduction
0:40 Assignments
4:51 Named Entity Recognition
8:16 Neural Network
13:45 Gradient Computing
21:29 Nonlinearities
23:54 Jacobians
26:34 Neural Net
35:32 Jacobian
37:06 Shape Convention
42:15 Jacobian vs Shape Convention
48:16 Computation Graph
54:12 Example
00:00:00.000 |
Hi, everyone. I'll get started. Okay. So we're now back to the second week of CS224N on natural 00:00:14.600 |
language processing with deep learning. Okay. So for today's lecture, what we're going to 00:00:20.040 |
be looking at is all the math details of doing neural net learning. First of all, looking 00:00:27.480 |
at how we can work out by hand gradients for training neural networks and then looking 00:00:34.320 |
at how it's done more algorithmically, which is known as the back propagation algorithm. 00:00:40.600 |
And correspondingly for you guys, well, I hope you've remembered that, you know, one 00:00:46.000 |
minute ago was when assignment one was due and everyone has handed that in. If by some 00:00:51.640 |
chance you haven't handed it in, really should hand it in as soon as possible. Best to preserve 00:00:57.400 |
those late days for the harder assignments. So, I mean, I actually forgot to mention, 00:01:02.200 |
we actually did make one change for this year to make it a bit easier when occasionally 00:01:07.940 |
people join the class a week late. If you want to, this year in the grading, assignment 00:01:13.960 |
one can be discounted and we'll just use your other four assignments. But if you've been 00:01:19.120 |
in the class so far for that 98% of people, well, since assignment one is the easiest 00:01:24.840 |
assignment, again, it's silly not to do it and have it as part of your grade. Okay. So 00:01:32.000 |
starting today, we've put out assignment two and assignment two is all about making sure 00:01:37.640 |
you really understand the math of neural networks and then the software that we use to do that 00:01:44.040 |
math. So, this is going to be a bit of a tough week for some. So, for some people who are 00:01:50.440 |
great on all their math and backgrounds, they'll feel like this is stuff they know well, nothing 00:01:57.380 |
very difficult. But I know there are quite a few of you for whom this lecture and week is 00:02:03.600 |
the biggest struggle of the course. We really do want people to actually have an understanding 00:02:09.900 |
of what goes on in neural network learning rather than viewing it as some kind of deep 00:02:15.440 |
magic. And I hope that some of the material we give today and that you read up on and 00:02:21.040 |
use in the assignment will really give you more of a sense of what these neural networks 00:02:26.360 |
are doing and how it is just math that's applied in a systematic large scale that works out 00:02:33.160 |
the answers and that this will be valuable and give you a deeper sense of what's going 00:02:38.080 |
on. But if this material seems very scary and difficult, you can take some refuge in 00:02:46.120 |
the fact that there's fast light at the end of the tunnel, since this is really the only 00:02:51.120 |
lecture that's heavily going through the math details of neural networks. After that, we'll 00:02:57.240 |
be kind of popping back up to a higher level. And by and large, after this week, we'll be 00:03:03.640 |
making use of software to do a lot of the complicated math for us. But nevertheless, 00:03:10.440 |
I hope this is valuable. I'll go through everything quickly today. But if this isn't stuff that 00:03:16.840 |
you know backwards, I really do encourage you to, you know, work through it and get 00:03:22.640 |
help as you need it. So do come along to our office hours. There are also a number of pieces 00:03:28.200 |
of tutorial material given in the syllabus. So there's both the lecture notes, there's 00:03:34.080 |
some materials from CS231N. In the list of readings, the very top reading is some material 00:03:41.800 |
put together by Kevin Clark a couple of years ago. And actually, that one's my favorite. 00:03:47.880 |
The presentation there fairly closely follows the presentation in this lecture of going 00:03:53.840 |
through matrix calculus. So, you know, personally, I'd recommend starting with that one. But 00:03:58.560 |
there are four different ones you can choose from if one of them seems more helpful to 00:04:03.000 |
you. Two other things on what's coming up. Actually, for Thursday's lecture, we make 00:04:11.360 |
a big change. And Thursday's lecture is probably the most linguistic lecture of the whole class, 00:04:16.640 |
where we go through the details of dependency grammar and dependency parsing. Some people 00:04:21.000 |
find that tough as well, but at least it'll be tough in a different way. And then one 00:04:25.720 |
other really good opportunity is this Friday, we have our second tutorial at 10am, which 00:04:32.240 |
is an introduction to PyTorch, which is the deep learning framework that we'll be using 00:04:37.360 |
for the rest of the class, once we've gone through these first two assignments where 00:04:41.640 |
you do things by yourself. So this is a great chance to get an intro to PyTorch. It'll be 00:04:47.320 |
really useful for later in the class. Okay. Today's material is really all about sort 00:04:55.920 |
of the math of neural networks. But just to sort of introduce a setting where we can work 00:05:02.040 |
through this, I'm going to introduce a simple NLP task and a simple form of classifier that 00:05:08.880 |
we can use for it. So the task of named entity recognition is a very common basic NLP task. 00:05:15.840 |
And the goal of this is you're looking through pieces of text and you're wanting to label 00:05:21.300 |
by labeling the words, which words belong to entity categories like persons, locations, 00:05:27.680 |
products, dates, times, etc. So for this piece of text, last night, Paris Hilton wowed in 00:05:35.120 |
the sequined gown. Samuel Quinn was arrested in the Hilton Hotel in Paris in April 1989. 00:05:42.320 |
Some words are being labeled as named entities as shown. These two sentences don't actually 00:05:48.160 |
belong together in the same article, but I chose those two sentences to illustrate the 00:05:54.080 |
basic point that it's not that you can just do this task by using a dictionary. Yes, a 00:05:59.760 |
dictionary is helpful to know that Paris can possibly be a location, but Paris can also 00:06:06.120 |
be a person name. So you have to use context to get named entity recognition right. Okay, 00:06:14.840 |
well how might we do that with a neural network? There are much more advanced ways of doing 00:06:21.960 |
this, but a simple yet already pretty good way of doing named entity recognition with 00:06:29.600 |
a simple neural net is to say, well, what we're going to do is use the word vectors 00:06:36.560 |
that we've learned about, and we're going to build up a context window of word vectors. 00:06:43.920 |
And then we're going to put those through a neural network layer and then feed it through 00:06:49.520 |
a softmax classifier of the kind that we, sorry, I said that wrong. And then we're going 00:06:55.760 |
to feed it through a logistic classifier of the kind that we saw when looking at negative 00:07:00.800 |
sampling, which is going to say for a particular entity type, such as location, is it high 00:07:08.280 |
probability location or is it not a high probability location? So for a sentence like the museums 00:07:15.740 |
in Paris are amazing to see, what we're going to do is for each word, say we're doing the 00:07:20.880 |
word Paris, we're going to form a window around it, say a plus or minus two word window. And 00:07:27.080 |
so for those five words, we're going to get word vectors for them from the kind of word 00:07:33.720 |
to vector or glove word vectors we've learned. And we're going to make a long vector out 00:07:38.880 |
of the concatenation of those five word vectors. So the word of interest is in the middle. 00:07:44.060 |
And then we're going to feed this vector to a classifier, which is at the end going to 00:07:50.900 |
have a probability of the word being a location. And then we could have another classifier 00:07:57.300 |
that says the probability of the word being a person name. And so once we've done that, 00:08:02.020 |
we're then going to run it at the next position. So we then say, well, is the word "are" a location? 00:08:07.660 |
And we'd feed a window of five words, which is then "in Paris are amazing to", and put it through 00:08:13.100 |
the same kind of classifier. And so this is the classifier that we'll use. So it's input 00:08:20.440 |
will be this word window. So if we have D dimensional word vectors, this will be a five 00:08:26.420 |
D vector. And then we're going to put it through a layer of a neural network. So the layer 00:08:32.500 |
of the neural network is going to multiply this vector by a matrix, add on a bias vector, 00:08:41.540 |
and then put that through a nonlinearity such as the softmax transformation that we've seen 00:08:49.980 |
before. And that will give us a hidden vector, which might be of a smaller dimensionality 00:08:56.340 |
such as this one here. And so then with that hidden vector, we're then going to take the 00:09:04.380 |
dot product of it with an extra vector here. Here's a U. So we take U dot product H. And 00:09:13.100 |
so when we do that, we're getting out a single number and that number can be any real number. 00:09:20.100 |
And so then finally, we're going to put that number through a logistic transform of the 00:09:27.020 |
same kind that we saw when doing negative sampling. The logistic transform will take 00:09:33.600 |
any real number and it will transform it into a probability that that word is a location. 00:09:40.620 |
So its output is the predicted probability of the word belonging to a particular class. 00:09:47.260 |
And so this could be our location classifier, which could classify each word in a window 00:09:53.500 |
as to what the probability is that it's a location word. And so this little neural network 00:09:59.900 |
here is the neural network I'm going to use today when going through some of the math. 00:10:06.580 |
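As a concrete reference for the math that follows, here is a minimal NumPy sketch of this window classifier; the sizes (d = 4, hidden size 3), the random initialization, and the variable names are illustrative assumptions, not the course's starter code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, n_window, n_hidden = 4, 5, 3          # illustrative sizes only
rng = np.random.default_rng(0)

# x: concatenation of the 5 word vectors in the window (length 5d)
x = rng.standard_normal(n_window * d)

# parameters of the one-layer classifier
W = rng.standard_normal((n_hidden, n_window * d))
b = rng.standard_normal(n_hidden)
u = rng.standard_normal(n_hidden)

z = W @ x + b            # linear layer
h = sigmoid(z)           # elementwise nonlinearity f
s = u @ h                # score: a single real number
p = sigmoid(s)           # probability that the center word is a location
print(p)
```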
But actually I'm going to make it even easier on myself. I'm going to throw away the logistic 00:10:13.620 |
function at the top, and I'm really just going to work through the math of the bottom three 00:10:18.640 |
quarters of this. If you look at Kevin Clark's handout that I just mentioned, he includes 00:10:25.620 |
when he works through it, also working through the logistic function. And we also saw working 00:10:32.120 |
through a softmax in the first lecture when I was working through some of the word2vec 00:10:36.720 |
model. Okay. So the overall question we want to be able to answer is, so here's 00:10:45.600 |
our stochastic gradient descent equation that we have existing parameters of our model, 00:10:53.940 |
and we want to update them based on our current loss, which is J of theta. So for getting 00:11:03.700 |
our loss here, that the true answer as to whether a word is a location or not will be 00:11:09.680 |
either, you know, one if it is a location or zero if it isn't. Our logistic classifier 00:11:16.140 |
will return some number like .9, and we'll use the distance away from what it should 00:11:21.620 |
have been squared as our loss. So we work out a loss, and then we're moving a little 00:11:28.260 |
distance in the negative of the gradient, which will be changing our parameter estimates 00:11:35.140 |
in such a way that they reduce the loss. And so this is already being written in terms 00:11:41.640 |
of a whole vector of parameters, which is being updated as to a new vector of parameters. 00:11:48.220 |
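As a plain-equation reconstruction of the update being described (with α the small learning rate, and the squared-error loss matching the "0 or 1 versus .9" example just given):

```latex
J(\theta) \;=\; \big(y_{\text{true}} - p_{\theta}\big)^{2},
\qquad
\theta^{\text{new}} \;=\; \theta^{\text{old}} \;-\; \alpha\,\nabla_{\theta}\, J(\theta)
```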
But you can also think about it that for each individual parameter, theta J, that we're 00:11:54.160 |
working out the partial derivative of the loss with respect to that parameter, and then 00:12:00.860 |
we're moving a little bit in the negative direction of that. That's going to give us 00:12:07.420 |
a new value for parameter theta J. And we're going to update all of the parameters of our 00:12:15.580 |
model as we learn. I mean, in particular, in contrast to what commonly happens in statistics, 00:12:23.100 |
we also we update not only the sort of parameters of our model that are sort of weights in the 00:12:30.260 |
classifier, but we also will update our data representation. So we'll also be changing 00:12:36.180 |
our word vectors as we learn. Okay. So to build neural nets, i.e. to train neural nets 00:12:42.740 |
based on data, what we need is to be able to compute this gradient of the parameters 00:12:49.860 |
so that we can then iteratively update the weights of the model and efficiently train 00:12:55.460 |
a model that has good weights, i.e. that has high accuracy. And so how can we do that? 00:13:03.860 |
Well, what I'm going to talk about today is first of all, how you can do it by hand. And 00:13:11.820 |
so for doing it by hand, this is basically a review of matrix calculus. And that'll take 00:13:19.860 |
quite a bit of the lecture. And then after we've talked about that for a while, I'll 00:13:26.940 |
then shift gears and introduce the back propagation algorithm, which is the central technology 00:13:33.540 |
for neural networks. And that technology is essentially the efficient application of calculus 00:13:41.220 |
on a large scale, as we'll come to talking about soon. So for computing gradients by 00:13:47.300 |
hand, what we're doing is matrix calculus. So we're working with vectors and matrices 00:13:55.900 |
and working out gradients. And this can seem like pretty scary stuff. And well, to the 00:14:05.620 |
extent that you're kind of scared and don't know what's going on, one choice is to work 00:14:13.260 |
out a non-vectorized gradient by just working out what the partial derivative is for one 00:14:20.620 |
parameter at a time. And I showed a little example of that in the first lecture. But 00:14:27.100 |
it's much, much faster and more useful to actually be able to work with vectorized gradients. 00:14:37.580 |
And in some sense, if you're not very confident, this is kind of almost a leap of faith. But 00:14:43.740 |
it really is the case that multivariable calculus is just like single variable calculus, except 00:14:50.620 |
you're using vectors and matrices. So providing you remember some basics of single variable 00:14:56.740 |
calculus, you really should be able to do this stuff and get it to work out. Lots of 00:15:02.900 |
other sources. I've mentioned the notes. You can also look at the textbook for Math 51, 00:15:10.940 |
which also has quite a lot of material on this. I know some of you have bad memories 00:15:15.860 |
of Math 51. Okay. So let's go through this and see how it works, ramping up from the 00:15:21.940 |
beginning. So the beginning of calculus is, you know, we have a function with one input 00:15:27.900 |
and one output, f of x equals x cubed. And so then its gradient is its slope, right? 00:15:34.940 |
So that's its derivative. So its derivative is 3x squared. And the way to think about 00:15:41.700 |
this is how much will the output change if we change the input a little bit, right? So 00:15:48.620 |
what we're wanting to do in our neural net models is change what they output so that 00:15:54.140 |
they do a better job of predicting the correct answers when we're doing supervised learning. 00:16:00.620 |
And so what we want to know is if we fiddle different parameters of the model, how much 00:16:05.500 |
of an effect will that have on the output? Because then we can choose how to fiddle them 00:16:09.900 |
in the right way to move things down, right? So, you know, when we're saying that the derivative 00:16:15.780 |
here is 3x squared, well, what we're saying is that if you're at x equals 1, if you fiddle 00:16:24.740 |
the input a little bit, the output will change three times as much, 3 times 1 squared. And 00:16:31.180 |
it does. So if I say what's the value at 1.01, it's about 1.03. It's changed three times 00:16:37.700 |
as much, and that's its slope. But at x equals 4, the derivative is 16 times 3, 48. So if 00:16:47.740 |
we fiddle the input a little, it'll change 48 times as much. And that's roughly what 00:16:52.580 |
happens. 4.01 cubed is 64.48. Now, of course, you know, this is just sort of showing it 00:17:00.020 |
for a small fiddle, but, you know, that's an approximation to the actual truth. Okay. 00:17:06.820 |
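That "wiggle the input a little" reasoning can be checked numerically in a couple of lines; a small sketch, where the step size 0.01 mirrors the 1.01 and 4.01 examples above:

```python
def f(x):
    return x ** 3

def numeric_slope(x, h=0.01):
    # how much the output changes, relative to a small change in the input
    return (f(x + h) - f(x)) / h

for x in (1.0, 4.0):
    print(x, numeric_slope(x), 3 * x ** 2)
# prints roughly: 1.0 3.0301 3.0   and   4.0 48.1201 48.0
```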
So then we sort of ramp up to the more complex cases, which are more reflective of what we 00:17:12.460 |
do with neural networks. So if we have a function with one output and n inputs, then we have 00:17:19.740 |
a gradient. So a gradient is a vector of partial derivatives with respect to each input. So 00:17:26.180 |
we've got n inputs, x1 to xn, and we're working out the partial derivative f with respect 00:17:31.660 |
to x1, the partial derivative f with respect to x2, et cetera. And we then get a vector 00:17:39.460 |
of partial derivatives, where each element of this vector is just like a simple derivative 00:17:46.380 |
with respect to one variable. Okay. So from that point, we just keep on ramping up for 00:17:53.380 |
what we do with neural networks. So commonly when we have something like a layer in a neural 00:17:59.860 |
network, we'll have a function with n inputs that will be like our word vectors. Then we 00:18:07.140 |
do something like multiply by a matrix, and then we'll have m outputs. So we have a function 00:18:13.700 |
now which is taking n inputs and is producing m outputs. So at this point, what we're calculating 00:18:23.100 |
for the gradient is what's called a Jacobian matrix. So for n inputs and m outputs, the 00:18:30.540 |
Jacobian is an m by n matrix of every combination of partial derivatives. So function f splits 00:18:40.280 |
up into these different sub-functions, f1 through fm, which generate each of the 00:18:47.660 |
m outputs. And so then we're taking the partial derivative f1 with respect to x1 through the 00:18:54.380 |
partial derivative of f1 with respect to xn. Then heading down, you know, we make it up 00:18:59.700 |
to the partial derivative of fm with respect to x1, et cetera. So we have every possible 00:19:05.580 |
partial derivative of an output variable with respect to one of the input variables. Okay. 00:19:14.940 |
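In symbols, for a function f from R^n to R^m, the Jacobian being described is the m-by-n matrix:

```latex
\frac{\partial f}{\partial x} \;=\;
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix},
\qquad
\left(\frac{\partial f}{\partial x}\right)_{ij} \;=\; \frac{\partial f_i}{\partial x_j}
```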
So in simple calculus, when you have a composition of one variable functions, so that if you 00:19:23.100 |
have y equals x squared and then z equals 3y, that's then z is a composition of two 00:19:33.260 |
functions of - or you're composing two functions to get z as a function of x. Then you can 00:19:40.020 |
work out the derivative of z with respect to x. And the way you do that is with the 00:19:45.220 |
chain rule. And so in the chain rule, you multiply derivatives. So dz dx equals dz dy 00:19:53.180 |
times dy dx. So dz dy is just 3 and dy dx is 2x. So we get 3 times 2x. So that overall 00:20:07.180 |
derivative here is 6x. And since if we multiply this together, we're really saying that z 00:20:14.940 |
equals 3x squared, you should trivially be able to see again, aha, its derivative is 00:20:21.940 |
6x. So that works. Okay. So once we move into vectors and matrices and Jacobians, it's actually 00:20:30.780 |
the same game. So when we're working with those, we can compose functions and work out 00:20:36.380 |
the derivatives by simply multiplying Jacobians. So if we have start with an input x and then 00:20:44.060 |
put it through the simplest form of neural network layer and say that z equals wx plus 00:20:51.100 |
b. So we multiply the x vector by matrix w and then add on a bias vector b. And then 00:20:57.620 |
typically we'd put things through a nonlinearity f. So f could be a sigmoid function. We'll 00:21:03.900 |
then say h equals f of z. So this is the composition of two functions in terms of vectors and matrices. 00:21:13.300 |
So we can use Jacobians and we can say the partial of h with respect to x is going to 00:21:18.980 |
be the product of the partial of h with respect to z and the partial of z with respect to 00:21:25.940 |
x. And this all does work out. So let's start going through some examples of how these things 00:21:35.220 |
work slightly more concretely. First, just particular Jacobians and then composing them 00:21:43.220 |
together. So one case we look at is the nonlinearities that we put a vector through. So this is something 00:21:51.620 |
like putting a vector through the sigmoid function f. And so if we have an intermediate 00:21:59.740 |
vector z and we're turning into a vector h by putting it through a logistic function, 00:22:05.820 |
we can say what is dh dz? Well, for this, formally this is a function that has n inputs 00:22:20.140 |
and n outputs. So at the end of the day, we're computing an n by n Jacobian. And so what 00:22:28.860 |
that's meaning is the elements of this n by n Jacobian are going to take the partial derivative 00:22:36.460 |
of each output with respect to each input. And well, what is that going to be in this 00:22:44.780 |
case? Well, in this case, because we're actually just computing element-wise a transformation 00:22:54.340 |
such as a logistic transform of each element zi, like the second equation here, if i equals 00:23:03.420 |
j, we've got something to compute. Whereas if i doesn't equal j, there's just the input 00:23:10.820 |
has no influence on the output. And so the derivative is zero. So if i doesn't equal 00:23:16.460 |
j, we're going to get a zero. And if i does equal j, then we're going to get the regular 00:23:22.820 |
one variable derivative of the logistic function, which if I remember correctly, you were asked 00:23:31.660 |
to compute. Now I can't remember what's assignment one or assignment two, but one of the two 00:23:36.660 |
asks you to compute it. So our Jacobian for this case looks like this. We have a diagonal 00:23:43.380 |
matrix with the derivatives of each element along the diagonal and everything else is 00:23:52.540 |
zero. Okay, so let's look at a couple of other Jacobians. So if we are asking, if we've got 00:24:02.180 |
this WX plus B basic neural network layer, and we're asking for the gradient with respect 00:24:08.900 |
to X, then what we're going to have coming out is that that's actually going to be the 00:24:15.420 |
matrix W. So this is where, what I hope you can do is look at the notes at home and work 00:24:24.740 |
through this exactly and see that this is actually the right answer. But this is the 00:24:32.820 |
way in which if you just have faith and think this is just like single variable calculus, 00:24:40.900 |
except I've now got vectors and matrices, the answer you get is actually what you expected 00:24:45.540 |
to get, because this is just like the derivative of AX plus B with respect to X, which is 00:24:52.780 |
A. So similarly, if we take the partial derivative with respect to B of WX plus B, we get out 00:25:01.740 |
the identity matrix. Okay, then one other Jacobian that we mentioned while in the first 00:25:10.260 |
lecture while working through Word2Vec is if you have the dot product of two vectors, 00:25:18.340 |
i.e. that's a number, that what you get coming out of that, so the partial derivative of 00:25:28.900 |
UTH with respect to U is H transpose. And at this point, there's some fine print that 00:25:36.660 |
I'm going to come back to in a minute. So this is the correct Jacobian, right? Because 00:25:43.580 |
in this case, we have the dimension of H inputs and we have one output. And so we want to 00:25:53.660 |
have a row vector. But there's a little bit more to say on that that I'll come back to 00:26:00.740 |
in about 20 slides. But this is the correct Jacobian. Okay, so if you are not familiar 00:26:08.900 |
with these kind of Jacobians, do please look at some of the notes that are available and 00:26:16.540 |
try and compute these in more detail element-wise and convince yourself that they really are 00:26:21.740 |
right. But I'm going to assume these now and show you what happens when we actually then 00:26:27.740 |
work out gradients for at least a mini little neural net. Okay. So here is most of this 00:26:41.660 |
neural net. I mean, as I commented, that, you know, really we'd be working out the partial 00:26:49.140 |
derivative of the loss J with respect to these variables. But for the example I'm doing here, 00:26:56.140 |
I've locked that off to keep it a little simpler and more manageable for the lecture. And so 00:27:00.980 |
we're going to just work out the partial derivative of the score S, which is a real number with 00:27:07.100 |
respect to the different parameters of this model, where the parameters of this model 00:27:13.060 |
are going to be the W and the B and the U and also the input, because we can update 00:27:22.500 |
the word vectors of different words based on tuning them to better 00:27:30.900 |
predict the classification outputs that we desire. So let's start off with a fairly easy 00:27:36.980 |
one where we want to update the bias vector B to have our system classify better. So to 00:27:45.980 |
be able to do that, what we want to work out is the partial derivatives of S with respect 00:27:51.380 |
to B. So we know how to put that into our stochastic gradient update for the B parameters. 00:28:00.020 |
Okay. So how do we go about doing these things? So the first step is we want to sort of break 00:28:07.100 |
things up into different functions of minimal complexity that compose together. So in particular, 00:28:15.900 |
this neural net layer, H equals F of WX plus B, it's still a little bit complex. So let's 00:28:21.780 |
decompose that one further step. So we have the input X, we then calculate the linear 00:28:30.980 |
transformation Z equals WX plus B. And then we put things through the sort of element 00:28:40.780 |
wise nonlinearity, H equals F of Z, and then we do the dot product with U. And, you know, 00:28:50.380 |
it's useful for working these things out to, you know, split into pieces like this, have 00:28:57.660 |
straight what your different variables are, and to know what the dimensionality of each 00:29:02.740 |
of these variables is. It's well worth just writing out the dimensionality of every variable 00:29:08.500 |
and making sure that the answers that you're computing are of the right dimensionality. 00:29:14.140 |
So at this point, though, what we can see is that calculating S is the product of three 00:29:22.020 |
- sorry, is the composition of three functions around X. So for working out the partials 00:29:31.020 |
of S with respect to B, it's the composition of the three functions shown on the left. 00:29:39.500 |
And so therefore, the gradient of S with respect to B, we're going to take the product of these 00:29:49.500 |
three partial derivatives. Okay, so how do - what do we - so, you know, we've got the 00:29:59.860 |
S equals UTH, so that's the sort of the top corresponding partial derivative, partial 00:30:06.740 |
derivative of H with respect to Z, partial derivative of Z with respect to B, which is 00:30:13.380 |
the first one that we're working out. Okay, so we want to work this out, and if we're 00:30:19.340 |
lucky, we remember those Jacobians I showed previously about the Jacobian for a vector 00:30:26.420 |
dot product, the Jacobian for the non-linearity, and the Jacobian for the simple linear transformation. 00:30:36.460 |
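Before composing them, each of the three building-block Jacobians can be checked numerically; a hedged finite-difference sketch, where the sizes and the choice of sigmoid for f are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def numeric_jacobian(func, inp, eps=1e-6):
    # bump one input coordinate at a time and record the change in every output
    out = np.atleast_1d(func(inp))
    J = np.zeros((out.size, inp.size))
    for j in range(inp.size):
        bumped = inp.copy()
        bumped[j] += eps
        J[:, j] = (np.atleast_1d(func(bumped)) - out) / eps
    return J

rng = np.random.default_rng(0)
n, m = 3, 4
u, h, z, b = (rng.standard_normal(n) for _ in range(4))
x = rng.standard_normal(m)
W = rng.standard_normal((n, m))

print(np.allclose(numeric_jacobian(lambda hh: u @ hh, h), u[None, :], atol=1e-4))    # d(u.h)/dh = u^T
fz = sigmoid(z)
print(np.allclose(numeric_jacobian(sigmoid, z), np.diag(fz * (1 - fz)), atol=1e-4))  # diag(f'(z))
print(np.allclose(numeric_jacobian(lambda bb: W @ x + bb, b), np.eye(n), atol=1e-4)) # d(Wx+b)/db = I
```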
And so we can use those. So for the partials of S with respect to H, well, that's going 00:30:45.940 |
to be UT using the first one. The partials of H with respect to Z, okay, so that's the 00:30:53.540 |
non-linearity, and so that's going to be the matrix, it's the diagonal matrix with the 00:31:00.580 |
element wise derivative F prime of Z and zero elsewhere. And then for the WX plus B, when 00:31:09.780 |
we're taking the partials with respect to B, that's just the identity matrix. So we 00:31:16.140 |
can simplify that down a little, the identity matrix disappears, and since UT is a vector 00:31:27.820 |
and this is a diagonal matrix, we can rewrite this as UT Hadamard product of F prime of 00:31:35.260 |
Z. I think this is the first time I've used this little circle for a Hadamard product, 00:31:41.380 |
but it's something that you'll see quite a bit in neural network work, since it's often 00:31:47.740 |
used. So when we have two vectors, UT and this vector here, sometimes you want to do 00:31:56.860 |
an element wise product. So the output of this will be a vector where you've taken the 00:32:02.140 |
first element of each and multiplied them, the second element of each and multiplied 00:32:06.180 |
them, et cetera, downwards. And so that's called the Hadamard product, and it's what 00:32:11.100 |
we're calculating to get a vector, which is the gradient of S with respect to 00:32:20.140 |
B. Okay. So that's good. So we now have a gradient of S with respect to B, and we could 00:32:30.260 |
use that in our stochastic gradient, but we don't stop there. We also want to work out 00:32:37.940 |
the gradient with respect to others of our parameters. So we might want to next go on 00:32:45.540 |
and work out the gradient of S with respect to W. Well, we can use the chain rule just 00:32:54.660 |
like we did before, right? So we've got the same product of functions and everything is 00:33:01.040 |
going to be the same apart from now taking the derivatives with respect to W rather than 00:33:08.220 |
B. So it's now going to be the partial of S with respect to H, H with respect to Z, 00:33:16.620 |
and Z with respect to W. And the important thing to notice here, and this leads into 00:33:24.220 |
what we do with the backpropagation algorithm, is wait a minute, this is very similar to 00:33:32.420 |
what we've already done. So when we were working out the gradients of S with respect to B, 00:33:38.620 |
the first two terms were exactly the same. It's only the last one that differs. So to 00:33:46.620 |
be able to build or to train neural networks efficiently, this is what happens all the 00:33:54.460 |
time. And it's absolutely essential that we use an algorithm that avoids repeated computation. 00:34:02.980 |
And so the idea we're going to develop is when we have this equation stack, that there's 00:34:09.220 |
sort of stuff that's above where we compute Z, and we're going to be sort of, that'll 00:34:15.420 |
be the same each time. And we want to compute something from that, that we can then sort 00:34:21.700 |
of feed downwards when working out the gradients with respect to W, X, or B. And so we do that 00:34:32.540 |
by defining delta, which is the composition of the partials that are above the linear transform. 00:34:42.660 |
And that's referred to as the local error signal. It's what's being passed in from above 00:34:48.620 |
the linear transform. And we've already computed the gradient of that in the preceding slides. 00:34:57.200 |
And so the final form of the partial of S with respect to B will be delta times the 00:35:07.700 |
remaining part. And well, we'd seen that, you know, for partial S with respect to B, 00:35:15.300 |
the partial of Z with respect to B is just the identity. So the end result was delta. 00:35:20.820 |
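As a consolidated restatement of the result just derived (with the circle the Hadamard/elementwise product introduced above, and transposes written so both factors are row vectors):

```latex
\delta \;=\; \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}
      \;=\; u^{T}\,\operatorname{diag}\!\big(f'(z)\big)
      \;=\; u^{T} \circ f'(z)^{T},
\qquad
\frac{\partial s}{\partial b} \;=\; \delta\,\frac{\partial z}{\partial b} \;=\; \delta\, I \;=\; \delta
```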
But this time, we're then going to have to work out the partial of Z with respect to 00:35:25.540 |
W and multiply that by delta. So that's the part that we still haven't yet done. So, and 00:35:35.260 |
this is where things get, in some sense, a little bit hairier. And so there's something 00:35:44.060 |
that's important to explain. So, you know, what should we have for the Jacobian of DSDW? 00:35:53.580 |
Well, that's a function that has one output. The output is just a score, a real number. 00:36:04.240 |
And then it has N times M inputs. So the Jacobian is a 1 by NM matrix, i.e. a very long 00:36:16.620 |
row vector. But that's correct math. But it turns out that that's kind of bad for our 00:36:24.940 |
neural networks. Because remember, what we want to do with our neural networks is do 00:36:30.420 |
stochastic gradient descent. And we want to say theta new equals theta old minus a small 00:36:38.220 |
multiplier times the gradient. And well, actually, the W matrix is an N by M matrix. And so we 00:36:53.620 |
couldn't actually do this subtraction if this gradient we calculate is just a huge row vector. 00:37:01.060 |
We'd like to have it as the same shape as the W matrix. In neural network land, when 00:37:07.900 |
we do this, we depart from pure math at this point. And we use what we call the shape convention. 00:37:15.900 |
So what we're going to say is, and you're meant to use this for answers in the assignment, 00:37:22.400 |
that the shape of the gradient we're always going to make to be the shape of the parameters. 00:37:28.900 |
And so therefore, DSDW, we are also going to represent as an N by M matrix just like 00:37:37.640 |
W. And we're going to reshape the Jacobian to place it into this matrix shape. 00:37:46.100 |
Okay, so if we want to place it into this matrix shape, what do we, what are we going 00:37:55.700 |
to want to get for DSDW? Well, we know that it's going to involve delta, our local error 00:38:07.100 |
signal. And then we have to work out something for DZDW. Well, since Z equals WX plus B, 00:38:23.420 |
you'd kind of expect that the answer should be X. And that's right. So the answer to DSDW 00:38:33.940 |
is going to be delta transpose times X transpose. And so the form that we're getting for this 00:38:42.420 |
derivative is going to be the product of the local error signal that comes from above versus 00:38:52.900 |
what we calculate from the local input X. So that shouldn't yet be obvious why that 00:39:00.300 |
is true. So let me just go through in a bit more detail why that's true. So when we want 00:39:07.140 |
to work out DSDW, right, it's sort of delta times DZDW, where what that's computing for 00:39:17.780 |
Z is WX plus B. So let's just consider for a moment what the derivative is with respect 00:39:25.860 |
to a single weight WIJ. So WIJ might be W2 3 that's shown in my little neural network 00:39:35.460 |
here. And so the first thing to notice is that WIJ only contributes to ZI. So it's going 00:39:46.380 |
into Z2, which then computes H2. And it has no effect whatsoever on H1. OK, so when we're 00:39:59.020 |
working out DZI DWIJ, it's going to be the derivative of WI dot X, that row of the matrix, 00:40:11.820 |
plus BI, which means that we've got a sum over K of WIK times XK. And then 00:40:22.140 |
for this sum, this is like one variable calculus that when we're taking the derivative of this 00:40:28.220 |
with respect to WIJ, every term in this sum is going to be 0. The derivative is going 00:40:36.220 |
to be 0 except for the one that involves WIJ. And then the derivative of that is just like 00:40:42.740 |
AX with respect to A. It's going to be X. So you get XJ out as the answer. And so the 00:40:52.180 |
end result of that is that when we're working out, what we want as the answer is that we're 00:40:59.620 |
going to get that these columns where X1 is all that's left, X2 is all that's left, through 00:41:10.180 |
XM is all that's left. And then that's multiplied by the vectors of the local error signal from 00:41:18.620 |
above. And what we want to compute is this outer product matrix where we're getting the 00:41:24.740 |
different combinations of the delta and the X. And so we can get the N by M matrix that 00:41:33.620 |
we'd like to have via our shape convention by taking delta transpose, which is N by 1 00:41:41.620 |
times X transpose, which is 1 by M. And then we get this outer product matrix. So that's 00:41:49.500 |
a kind of a hacky argument that I've made. It's certainly a way of doing things that 00:41:54.540 |
the dimensions work out and sort of make sense. There's a more detailed run through this that 00:42:01.300 |
appears in the lecture notes. And I encourage you to sort of also look at the more mathy 00:42:08.180 |
version of that. Here's a little bit more information about the shape convention. So 00:42:14.820 |
well, first of all, one more example of this. So when you're working out DSDB, that comes 00:42:29.580 |
out as a Jacobian that is a row vector. But similarly, you know, according to the shape 00:42:37.820 |
convention, we want our gradient to be the same shape as B and B is a column vector. 00:42:46.380 |
So that's sort of, again, they're different shapes and you have to transpose one to get 00:42:51.340 |
the other. And so effectively what we have is a disagreement between the Jacobian form. 00:42:58.180 |
So the Jacobian form makes sense for, you know, calculus and math, because if you want 00:43:03.620 |
to have it like I claimed that matrix calculus is just like single variable calculus apart 00:43:11.020 |
from using vectors and matrices, you can just multiply together the partials. That only 00:43:16.180 |
works out if you're using Jacobians. But on the other hand, if you want to do stochastic 00:43:23.860 |
gradient descent and be able to sort of subtract off a piece of the gradient, that only works 00:43:32.820 |
if you have the same shape matrix for the gradient as you do for the original matrix. 00:43:41.400 |
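To make the shape-convention result concrete, here is a hedged numeric check that the gradient of the score with respect to W is the outer product delta-transpose times x-transpose, an n-by-m matrix just like W; the sizes and the sigmoid nonlinearity are assumptions of the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(W, b, u, x):
    z = W @ x + b
    h = sigmoid(z)
    return u @ h

rng = np.random.default_rng(0)
n, m = 3, 5                       # hidden size n, input size m (the 5d window in the example)
W = rng.standard_normal((n, m))
b, u = rng.standard_normal(n), rng.standard_normal(n)
x = rng.standard_normal(m)

# analytic gradient under the shape convention: delta^T x^T, same shape as W
z = W @ x + b
delta = u * sigmoid(z) * (1 - sigmoid(z))      # local error signal, length n
grad_W = np.outer(delta, x)

# finite-difference check, one weight W[i, j] at a time
eps = 1e-6
num_W = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        Wp = W.copy(); Wp[i, j] += eps
        num_W[i, j] = (score(Wp, b, u, x) - score(W, b, u, x)) / eps

print(np.allclose(grad_W, num_W, atol=1e-4))   # expected: True
```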
And so this is a bit confusing, but that's just the reality. There are both of these 00:43:47.340 |
two things. So the Jacobian form is useful in doing the calculus. But for the answers 00:43:56.860 |
in the assignment, we want the answers to be presented using the shape convention so 00:44:03.700 |
that the gradient is shown in the same shape as the parameters. And therefore, you'll be 00:44:12.380 |
able to it's the right shape for doing a gradient update by just subtracting a small amount 00:44:19.100 |
of the gradient. So for working through things, there are then basically two choices. One 00:44:28.500 |
choice is to work through all the math using Jacobians and then right at the end to reshape 00:44:37.780 |
following the shape convention to give the answer. So that's what I did when I worked 00:44:42.740 |
out DSDB. We worked through it using Jacobians. We got an answer, but it turned out to be 00:44:53.620 |
a row vector. And so, well, then we have to transpose it at the end to get it into the 00:44:59.100 |
right shape for the shape convention. The alternative is to always follow the shape 00:45:08.740 |
convention. And that's kind of what I did when I was then working out DSDW. I didn't 00:45:16.540 |
fully use Jacobians. I said, oh, well, when we work out whatever it was, DZDW, let's work 00:45:26.260 |
out what shape we want it to be and what to fill in the cells with. And if you're sort 00:45:32.940 |
of trying to do it immediately with the shape convention, it's a little bit more hacky in 00:45:41.460 |
a way since you have to look at the dimensions for what you want and figure out when to transpose 00:45:47.740 |
or to reshape the matrix to be at the right shape. But the kind of informal reasoning 00:45:54.100 |
that I gave is what you do and what works. And one way of - and there are sort of hints 00:46:01.940 |
that you can use, right, that you know that your gradient should always be the same shape 00:46:07.300 |
as your parameters, and you know that the error message coming in will always have the 00:46:13.420 |
same dimensionality as that hidden layer, and you can sort of work it out always following 00:46:19.060 |
the shape convention. Okay. So that is, hey, doing this is all matrix calculus. So after 00:46:35.180 |
pausing for breath for a second, the rest of the lecture is then, okay, let's look at 00:46:46.580 |
how our software trains neural networks using what's referred to as the back propagation 00:46:55.380 |
algorithm. So the short answer is, you know, basically we've already done it. The rest 00:47:11.580 |
of the lecture is easy. So, you know, essentially I've just shown you what the back propagation 00:47:18.660 |
algorithm does. So the back propagation algorithm is judiciously taking and propagating derivatives 00:47:34.060 |
using the matrix chain rule. The rest of the back propagation algorithm is to say, okay, 00:47:43.620 |
when we have these neural networks, we have a lot of shared structure and shared derivatives. 00:47:52.360 |
So what we want to do is maximally efficiently reuse derivatives of higher layers when we're 00:48:02.180 |
computing derivatives for lower layers so that we minimize computation. And I already 00:48:08.380 |
pointed that out in the first half, but we want to systematically exploit that. And so 00:48:15.300 |
the way we do that in our computational systems is they construct computation graphs. So this 00:48:24.420 |
maybe looks a little bit like what you saw in a compiler's class if you did one, right, 00:48:31.060 |
that you're creating, I call it here a computation graph, but it's really a tree, right? So you're 00:48:37.300 |
creating here this tree of computations in this case, but in more general case, it's 00:48:44.380 |
some kind of directed graph of computations, which has source nodes, which are inputs, 00:48:54.340 |
either inputs like X or input parameters like W and B, and its interior nodes are operations. 00:49:03.920 |
And so then once we've constructed a graph, and so this graph corresponds to exactly the 00:49:09.380 |
example I did before, right? This was our little neural net that's in the top right. 00:49:14.100 |
And here's the corresponding computation graph of computing WX plus B, put it through the 00:49:21.460 |
sigmoid non-linearity F, and then taking the dot product of the resulting vector with U 00:49:28.620 |
gives us our output score S. Okay, so what we do to compute this is we pass along the 00:49:38.180 |
edges the results of operations. So this is WX, then Z, then H, and then our output is 00:49:44.420 |
S. And so the first thing we want to be able to do to compute with neural networks is to 00:49:50.980 |
be able to compute for different inputs what the output is. And so that's referred to as 00:49:57.780 |
forward propagation. And so we simply run this expression much like you'd standardly 00:50:07.740 |
do in a compiler to compute the value of S, and that's the forward propagation phase. 00:50:14.460 |
But the essential additional element of neural networks is that we then also want to be able 00:50:21.460 |
to send back gradients, which will tell us how to update the parameters of the model. 00:50:28.140 |
And so it's this ability to send back gradients, which gives us the ability for these models 00:50:35.180 |
to learn. Once we have a loss function at the end, we can work out how to change the 00:50:41.140 |
parameters of the model so that they more accurately produce the desired output, i.e. 00:50:48.580 |
they minimize the loss. And so it's doing that part that then is called back propagation. 00:50:56.740 |
So we then, once we forward propagated a value with our current parameters, we then head 00:51:05.540 |
backwards reversing the direction of the arrows and pass along gradients down to the different 00:51:13.460 |
parameters like B and W and U that we can use to change using stochastic gradient descent 00:51:21.260 |
what the value of B is and what the value of W is. So we start off with DSDS, which 00:51:27.620 |
is just one, and then we run our back propagation, and we're using the sort of same kind of composition 00:51:35.500 |
of Jacobians. So we have DSDH here and DSDZ, and we progressively pass back those gradients. 00:51:46.260 |
So we just need to work out how to efficiently and cleanly do this in a computational system. 00:51:53.940 |
And so let's sort of work through again a few of these cases. So the general situation 00:52:00.060 |
is we have a particular node. So a node is where some kind of operation like multiplication 00:52:10.340 |
or a non-linearity happens. And so the simplest case is that we've got one output and one 00:52:19.100 |
input. So we'll do that first. So that's like H equals F of Z. So what we have is an upstream 00:52:27.540 |
gradient, DSDH, and what we want to do is compute the downstream gradient of DSDZ. And 00:52:39.220 |
the way we're going to do that is say, well, for this function F, it's a function, it's 00:52:46.780 |
got a derivative, a gradient. So what we want to do is work out that local gradient, DHDZ, 00:52:56.020 |
and then that gives us everything that we need to work out DSDZ, because that's precisely 00:53:03.340 |
we're going to use the chain rule. We're going to say that DSDZ equals the product of DSDH 00:53:09.700 |
times DHDZ, where this is again using Jacobians. Okay. So the general principle that we're 00:53:17.140 |
going to use is the downstream gradient equals the upstream gradient times the local gradient. 00:53:23.740 |
Okay. Sometimes it gets a little bit more complicated. So we might have multiple inputs 00:53:29.980 |
to a function. So this is the matrix vector multiply. So Z equals WX. Okay. When there 00:53:37.820 |
are multiple inputs, we still have an upstream gradient, DSDZ. But what we're going to do 00:53:47.220 |
is work out a local gradient with respect to each input. So we have DZDW and DZDX. And 00:53:56.260 |
so then at that point, it's exactly the same for each piece of it. We're going to work 00:54:01.980 |
out the downstream gradients, DSDW and DSDX by using the chain rule with respect to the 00:54:11.180 |
local gradient. So let's go through an example of this. I mean, this is kind of a silly example. 00:54:20.900 |
It's not really an example that looks like a typical neural net, but it's sort of a simple 00:54:25.740 |
example where we can show some of the components of what we do. So what we're going to do is 00:54:31.620 |
want to calculate F of XYZ, which is being calculated as X plus Y times the max of Y 00:54:40.740 |
and Z. And we've got, you know, particular values that we're starting off with. X equals 00:54:47.900 |
1, Y equals 2, and Z equals 0. So these are the current values of our parameters. And 00:54:54.780 |
so we can say, okay, well, we want to build an expression tree for that. Here's our expression 00:55:01.860 |
tree. We're taking X plus Y, we're taking the max of Y and Z, and then we're multiplying 00:55:08.740 |
them. And so our forward propagation phase is just to run this. So we take the values 00:55:16.620 |
of our parameters and we simply start to compute with them, right? So we have 1, 2, 2, 0, and 00:55:24.580 |
we add them as 3, the max is 2, we multiply them, and that gives us 6. Okay. So then at 00:55:34.180 |
that point, we then want to go and work out how to do things for back propagation and 00:55:43.860 |
how these back propagation steps work. And so the first part of that is sort of working 00:55:50.460 |
out what our local gradients are going to be. So this is A here, and this is X and Y. 00:56:00.260 |
So dA/dX, since A equals X plus Y is just going to be 1, and dA/dy is also going to 00:56:08.140 |
be 1. Then for B equals the max of YZ, so this is this max node. So the local gradients 00:56:20.420 |
for that is, it's going to depend on whether Y is greater than Z. So dB/dy is going to 00:56:29.500 |
be 1 if and only if Y is greater than Z, which it is at our particular point here, so that's 00:56:37.740 |
1. And dB/dz is going to be 1 only if Z is greater than Y. So for our particular values 00:56:47.460 |
here, that one is going to be 0. And then finally, here, we're calculating the product 00:56:56.700 |
F equals AB. So for that, we're going to - sorry, that slide is a little imperfect. 00:57:10.100 |
So for the product, the derivative F with respect to A is equal to B, which is 2, and 00:57:16.740 |
the derivative F with respect to B is A equals 3. So that gives us all of the local gradients 00:57:23.900 |
at each node. And so then to run backpropagation, we start with dF/df, which is just 1, and 00:57:32.420 |
then we're going to work out the downstream equals the upstream times the local. Okay, 00:57:41.380 |
so the local - so when you have a product like this, note that sort of the gradients 00:57:48.260 |
flip. So we take upstream times the local, which is 2. So the downstream is 2. On this 00:58:03.820 |
side, dF/dB is 3. So we're taking upstream times local, that gives us 3. And so that 00:58:14.740 |
gives us backpropagates values to the plus and max nodes. And so then we continue along. 00:58:22.700 |
So for the max node, the local gradient dB/dy equals 1. So we're going to take upstream 00:58:32.180 |
is 3. So we're going to take 3 times 1, and that gives us 3. dB/dz is 0, because of the 00:58:42.220 |
fact that z's value is not the max. So we're taking 3 times 0, and saying the gradient 00:58:48.260 |
there is 0. So finally, doing the plus node, the local gradients for both x and y there 00:58:56.700 |
are 1. So we're just getting 2 times 1 in both cases, and we're saying the gradients 00:59:02.540 |
there are 2. Okay, and so again, at the end of the day, the interpretation here is that 00:59:11.180 |
this is giving us information as to if we wiggle the values of x, y, and z, how much 00:59:18.700 |
of a difference does it make to the output? What is the slope, the gradient, with respect 00:59:25.300 |
to the variable? So what we've seen is that since z isn't the max of y and z, if I change 00:59:35.860 |
the value of z a little, like if I make z 0.1 or minus 0.1, it makes no difference at 00:59:42.980 |
all to what I compute as the output. So therefore, the gradient there is 0. If I change the value 00:59:52.100 |
of x a little, then that is going to have an effect, and it's going to affect the output 01:00:02.340 |
by twice as much as the amount I change it. Right, so, and that's because the df/dx equals 01:00:17.540 |
2. So interestingly, so I mean, we can basically work that out. So if we imagine making sort 01:00:29.580 |
of x 2.1, well, then what we'd calculate the max is 2. Oh, sorry, sorry, if we make x 1.1, 01:00:41.740 |
we then get the max here is 2, and we get 1.1 plus 2 is 3.1. So we get 3.1 times 2. 01:00:51.780 |
So that'd be about 6.2. So changing x by 0.1 has added 0.2 to the value of f. Conversely, 01:01:02.100 |
for the value of y, we find the df/dy equals 5. So what we do when we've got two things 01:01:09.640 |
coming out here, as I'll go through again in a moment, is we're summing the gradient. 01:01:14.840 |
So again, 3 plus 2 equals 5. And empirically, that's what happens. So if we consider fiddling 01:01:21.400 |
the value of y a little, let's say we make it a value of 2.1, then the prediction is 01:01:28.840 |
they'll have five times as big an effect on the output value we compute. And well, what 01:01:35.060 |
do we compute? So we compute 1 plus 2.1. So that's 3.1. And we compute the max of 2.1 01:01:45.640 |
and 0 is 2.1. So we'll take the product of 2.1 and 3.1. And I calculate that in advance 01:01:53.480 |
as I can't really do this arithmetic in my head. And the product of those two is 6.51. 01:01:59.400 |
So it has gone up about by 0.5. So we've multiplied my fiddling it by 0.1 by five times to work 01:02:08.560 |
out the magnitude of the effect on the output. OK. So for this, before I did the case of 01:02:20.160 |
when we had one in and one out here and multiple ins and one out here, the case that I hadn't 01:02:32.480 |
actually dealt with is the case of when you have multiple outward branches. But that then 01:02:40.240 |
turned up in the computation of y. So once you have multiple outward branches, what you're 01:02:47.560 |
doing is you're summing. So that when you want to work out the df/dy, you've got a local 01:03:00.000 |
gradient. You've got two upstream gradients. And you're working it out with respect to 01:03:09.080 |
each of them as in the chain rule. And then you're summing them together to work out the 01:03:15.240 |
impact at the end. Right. So we also saw some of the other node intuitions, which is useful 01:03:25.840 |
to have doing this. So when you have an addition, that distributes the upstream gradient to 01:03:35.480 |
each of the things below it. When you have max, it's like a routing node. So when you 01:03:43.000 |
have max, you have the upstream gradient and it goes to one of the branches below it and 01:03:48.960 |
the rest of them get no gradient. When you then have a multiplication, it has this effect 01:03:58.640 |
of switching the gradient. So if you're taking 3 by 2, the gradient on the 2 side is 3 and 01:04:07.800 |
on the 3 side is 2. And if you think about in terms of how much effect you get from when 01:04:14.160 |
you're doing this sort of wiggling, that totally makes sense. Right. Because if you're multiplying 01:04:19.400 |
another number by 3, then any change here is going to be multiplied by 3 and vice versa. 01:04:28.920 |
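The whole worked example can be written out in a few lines; this sketch recomputes the forward pass, applies the downstream = upstream times local rule by hand, and checks the resulting gradients (2, 5, and 0) against the same small wiggles used above:

```python
def f(x, y, z):
    return (x + y) * max(y, z)

x, y, z = 1.0, 2.0, 0.0
a, b = x + y, max(y, z)            # forward pass: a = 3, b = 2, f = 6

# backward pass: downstream = upstream * local, summing over outward branches
df_da, df_db = b, a                # the product node swaps its inputs' values
df_dx = df_da * 1.0                # a = x + y distributes the gradient
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)   # two branches sum: 2 + 3 = 5
df_dz = df_db * (1.0 if z > y else 0.0)                 # max routes the gradient: 0 here

eps = 0.1
print(df_dx, (f(x + eps, y, z) - f(x, y, z)) / eps)   # 2 vs. roughly 2.0
print(df_dy, (f(x, y + eps, z) - f(x, y, z)) / eps)   # 5 vs. roughly 5.1
print(df_dz, (f(x, y, z + eps) - f(x, y, z)) / eps)   # 0 vs. 0.0
```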
Okay. So this is the kind of computation graph that we want to use to work out derivatives 01:04:39.880 |
in an automated computational fashion, which is the basis of the backpropagation algorithm. 01:04:47.360 |
But at that point, this is what we're doing, but there's still one mistake that we can 01:04:53.640 |
make. It would be wrong for us to sort of say, okay, well, first of all, we want to 01:04:58.360 |
work out the SDB. So look, we can start up here. We can propagate our upstream errors, 01:05:07.760 |
work out local gradients, upstream error, local gradient, and keep all the way down 01:05:13.880 |
and get the DSDB down here. Okay. Next we want to do it for DSDW. Let's just run it 01:05:25.020 |
all over again. Because if we did that, we'd be doing repeated computation, as I showed 01:05:31.360 |
in the first half, that this term is the same both times, this term is the same both times, 01:05:38.680 |
this term is the same both times, that only the bits at the end differ. So what we want 01:05:44.680 |
to do is avoid duplicated computation and compute all the gradients that we're going 01:05:53.680 |
to need successively so that we only do them once. And so that was analogous to when I 01:06:00.500 |
introduced this delta variable when we computed gradients by hand. So starting off here from 01:06:08.160 |
D, we're starting off here with DSDS is one. We then want to one time compute gradients 01:06:21.080 |
in the green here. One time compute the gradient in green here. That's all common work. Then 01:06:28.360 |
we're going to take the local gradient for DZDB and multiply that by the upstream gradient 01:06:38.480 |
to have worked out DSDB. And then we're going to take the same upstream gradient and then 01:06:47.080 |
work out the local gradient here. And then sort of propagate that down to give us DSDW. 01:06:56.000 |
So the end result is we want to sort of systematically work to forward computation forward in the 01:07:03.480 |
graph and backward computation back propagation backward in the graph in a way that we do 01:07:10.380 |
things efficiently. So this is the general form of the algorithm which works for an arbitrary 01:07:22.680 |
computation graph. So at the end of the day, we've got a single scalar output Z. And then 01:07:31.720 |
we have inputs and parameters which compute Z. And so once we have this computation graph 01:07:41.680 |
and I added in this funky extra arrow here to make it a more general computation graph, 01:07:48.320 |
well, we can always say that we can work out a starting point, something that doesn't depend 01:07:55.320 |
on anything. So in this case, both of these bottom two nodes don't depend on anything 01:08:01.040 |
else. So we can start with them and we can start to compute forward. We can compute values 01:08:07.920 |
for all of these sort of second row from the bottom nodes. And then we're able to compute 01:08:15.200 |
the third ones up. So we can have a topological sort of the nodes based on the dependencies 01:08:22.840 |
in this directed graph. And we can compute the value of each node given some subset of 01:08:30.240 |
its predecessors, which it depends on. And so doing that is referred to as the forward 01:08:35.560 |
propagation phase and gives us a computation of the scalar output Z using our current parameters 01:08:43.880 |
and our current inputs. And so then after that, we run back propagation. So for back 01:08:50.920 |
propagation, we initialize the output gradient, dz dz as one. And then we visit nodes in the 01:09:00.760 |
reverse order of the topological sort and we compute the gradients downward. And so 01:09:08.280 |
our recipe is that for each node as we head down, we're going to compute the gradient 01:09:16.360 |
of the node with respect to its successors, the things that it feeds into and how we compute 01:09:25.000 |
that gradient is using this chain rule that we've looked at. So this is sort of the generalized 01:09:31.400 |
form of the chain rule where we have multiple outputs. And so we're summing over the different 01:09:37.120 |
outputs and then for each output, we're computing the product of the upstream gradient and the 01:09:43.360 |
local gradient with respect to that node. And so we head downwards and we continue down 01:09:51.240 |
the reverse topological sort order and we work out the gradient with respect to each 01:09:57.880 |
variable in this graph. And so it hopefully looks kind of intuitive looking at this picture 01:10:09.960 |
that if you think of it like this, the big O complexity of forward propagation and backward 01:10:18.140 |
propagation is the same, right? In both cases, you're doing a linear pass through all of 01:10:24.920 |
these nodes, calculating values given predecessors on the way forward and then gradients given successors on the way back. I mean, 01:10:33.080 |
you have to do a little bit more work for working out the gradients sort of as shown 01:10:38.960 |
by this chain rule, but it's the same big O complexity. So if somehow you're implementing 01:10:44.320 |
stuff for yourself rather than relying on the software, and you're calculating the gradients 01:10:49.720 |
at a different order of complexity than forward propagation, it means that you're 01:10:54.840 |
doing something wrong. You're doing repeated work that you shouldn't have to do. Okay. 01:11:00.120 |
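Here is a minimal Python sketch of that general algorithm; the Node, Input, Add, Mul, and run names are illustrative choices, not the API of any real framework. It does one topological sort, a forward sweep in that order, and a backward sweep in reverse order that applies the chain rule, with += accumulating gradients whenever a node feeds into more than one successor.

```python
class Node:
    """One node in a computation graph (illustrative sketch)."""
    def __init__(self, inputs=()):
        self.inputs = list(inputs)   # predecessor nodes this node depends on
        self.value = None            # filled in by the forward pass
        self.grad = 0.0              # accumulates dZ/d(this node) in the backward pass

    def forward(self):
        raise NotImplementedError

    def local_grads(self):
        """Return d(self)/d(input) for each input, in order."""
        raise NotImplementedError

class Input(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value
    def forward(self):
        pass
    def local_grads(self):
        return []

class Add(Node):
    def forward(self):
        a, b = self.inputs
        self.value = a.value + b.value
    def local_grads(self):
        return [1.0, 1.0]

class Mul(Node):
    def forward(self):
        a, b = self.inputs
        self.value = a.value * b.value
    def local_grads(self):
        a, b = self.inputs
        return [b.value, a.value]        # d(ab)/da = b, d(ab)/db = a

def topological_order(output):
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for p in node.inputs:
                visit(p)
            order.append(node)           # predecessors are appended before successors
    visit(output)
    return order

def run(output):
    order = topological_order(output)
    for node in order:                   # forward pass: one linear sweep
        node.forward()
    output.grad = 1.0                    # dZ/dZ = 1
    for node in reversed(order):         # backward pass: reverse topological order
        for parent, local in zip(node.inputs, node.local_grads()):
            parent.grad += node.grad * local   # chain rule; += sums over successors

# Tiny example: z = (a + b) * b, so dz/da = b and dz/db = a + 2b.
a, b = Input(2.0), Input(3.0)
z = Mul([Add([a, b]), b])
run(z)
print(z.value, a.grad, b.grad)           # 15.0, 3.0, 8.0
```

The += on parent.grad is what implements the sum over outputs in the generalized chain rule, and the backward sweep is the same single linear pass over the nodes as the forward sweep, which is why the two have the same big O complexity.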
So this algorithm works for a completely arbitrary computation graph, any directed acyclic graph, 01:11:10.040 |
you can apply this algorithm. In general, what we find is that we build neural networks 01:11:17.120 |
that have a regular layer structure. So we have things like a vector of inputs, and then 01:11:22.280 |
that's multiplied by a matrix. It's transformed into another vector, which might be multiplied 01:11:28.680 |
by another matrix or summed with another vector or something, right? So once we're using that 01:11:33.760 |
kind of regular layer structure, we can then parallelize the computation by working out 01:11:40.360 |
the gradients in terms of Jacobians of vectors and matrices and do things in parallel much 01:11:49.360 |
more efficiently. Okay. So doing this is then referred to as automatic differentiation. 01:11:58.140 |
And so essentially, if you know the computation graph, you should be able to have your clever 01:12:06.680 |
computer system work out what the derivatives of everything are, and then apply back propagation 01:12:19.000 |
to work out how to update the parameters and learn. And there's actually an interesting 01:12:26.840 |
way in which history has gone backwards here, which I'll just note. So some of you 01:12:35.960 |
might be familiar with symbolic computation packages. So those are things like Mathematica. 01:12:45.160 |
So Mathematica, you can give it a symbolic form of a computation, and then it can work 01:12:52.480 |
out derivatives for you. So it should be the case that if you give a complete symbolic 01:12:58.160 |
form of a computation graph, then it should be able to work out all the derivatives for 01:13:04.520 |
you and you never have to work out a derivative by hand whatsoever. And that was actually 01:13:10.680 |
attempted in a famous deep learning library called Theano, which came out of Yoshua Bengio's 01:13:17.400 |
group at the University of Montreal, that it had a compiler that did that kind of symbolic 01:13:24.240 |
manipulation. But somehow that sort of proved a little bit too hard a road to follow. I 01:13:35.640 |
imagine it actually might come back again in the future. And so for modern deep learning 01:13:41.880 |
frameworks, which include both TensorFlow and PyTorch, they do 90% of this computation 01:13:51.680 |
of automatic differentiation for you, but they don't actually symbolically compute derivatives. 01:13:59.240 |
So for each particular node or layer of your deep learning system, somebody, either you 01:14:08.400 |
or the person who wrote that layer, has handwritten the local derivatives. But then everything 01:14:18.020 |
from that point on, the chain rule work of combining upstream 01:14:24.820 |
gradients with local gradients to work out downstream gradients, that's then all being 01:14:30.240 |
done automatically for back propagation on the computation graph. And so what that means 01:14:37.680 |
is for a whole neural network, you have a computation graph and it's going to have a 01:14:43.680 |
forward pass and a backward pass. And so for the forward pass, you're topologically sorting 01:14:52.120 |
the nodes based on their dependencies in the computation graph. And then for each node, 01:15:00.200 |
you're running forward, the forward computation on that node. And then for backward propagation, 01:15:06.740 |
you're reversing the topological sort of the graph. And then for each node in the graph, 01:15:12.520 |
you're running the backward propagation, which is the little bit of backprop, the chain rule 01:15:17.840 |
at that node. And then the result of doing that is you have gradients for your inputs 01:15:25.620 |
and parameters. And so the overall software runs this for you. And so what you 01:15:36.120 |
want to do is then actually write the code for particular nodes or layers in the graph. So 01:15:43.880 |
if I have a multiply gate, it's going to have a forward algorithm, which just computes that 01:15:50.680 |
the output is x times y in terms of the two inputs. And then I'm also going to want 01:15:57.680 |
to tell it how to calculate the local derivative. So I want to say, what is the 01:16:03.440 |
local derivative? So dL/dx and dL/dy in terms of the upstream gradient, dL/dz. And so I will 01:16:15.400 |
then manually work out how to calculate that. And normally what I have to do is I assume 01:16:23.240 |
the forward pass is being run first. And I'm going to shove into some local variables for 01:16:30.720 |
my class, the values that were used in the forward computation. So as well as computing 01:16:36.720 |
z equals x times y, I'm going to sort of remember what x and y were. So that then when I'm asked 01:16:44.720 |
to compute the backward pass, I'm then going to have implemented here what we saw earlier, 01:16:52.760 |
that when z is x times y, you're going to sort of swap the y and the x to work out the local 01:17:01.240 |
gradients. And so then I'm going to multiply those by the upstream gradient. And I'm going 01:17:06.640 |
to return those-- I've just written it here as a sort of a little list, but really it's going to be the pair of downstream gradients that gets passed back. 01:17:17.040 |
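Here is roughly what that multiply gate might look like as code; this is a minimal sketch in the style just described, with the class name and method signatures chosen for illustration rather than taken from any particular framework. The forward pass caches its inputs, and the backward pass swaps them to form the local gradients and multiplies each by the upstream gradient.

```python
class MultiplyGate:
    """Illustrative node computing z = x * y."""

    def forward(self, x, y):
        # Cache the inputs: the backward pass needs them for the local gradients.
        self.x, self.y = x, y
        return x * y

    def backward(self, dL_dz):
        # Local gradients: dz/dx = y and dz/dy = x (the "swap"),
        # each multiplied by the upstream gradient dL/dz.
        dL_dx = dL_dz * self.y
        dL_dy = dL_dz * self.x
        return [dL_dx, dL_dy]

# Forward must run before backward so the cached values exist.
gate = MultiplyGate()
z = gate.forward(3.0, 4.0)            # z = 12.0
dL_dx, dL_dy = gate.backward(2.0)     # upstream gradient 2.0 -> 8.0 and 6.0
```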
OK, so that's 98% of what I wanted to cover today. Just a couple of quick comments left. 01:17:28.280 |
So that can and should all be automated. Sometimes you want to just check if you're computing 01:17:36.120 |
the right gradients. And so the standard way of checking that you're computing the right 01:17:41.560 |
gradients is to manually work out the gradient by doing a numeric calculation of the gradient. 01:17:49.120 |
And so you can do that. So you can work out what the derivative of f with respect 01:17:56.720 |
to x should be by choosing some sort of small number h, like 10 to the minus 4, adding it to 01:18:04.760 |
x and subtracting it from x, and evaluating f at both points. The difference between those two x values is 2h, so you divide 01:18:10.520 |
the difference in f through by 2h. And you're simply working out the rise over the run, which is the slope 01:18:16.120 |
at that point with respect to x. And that's an approximation of the gradient of f with respect to x. 01:18:27.000 |
So this is so simple you can't make a mistake implementing it. And so therefore you can 01:18:32.200 |
use this to check whether your gradient values are correct or not. This isn't something that 01:18:39.820 |
you'd want to use much because not only is it approximate, but it's extremely slow. Because 01:18:45.800 |
to work this out, you have to re-run the forward computation for every parameter of the model. 01:18:52.120 |
So if you have a model with a million parameters, you're now doing a million times as much work 01:18:56.720 |
to get your gradients as you would do if you were actually using calculus and backprop. So calculus is a 01:19:02.760 |
good thing to know. But it can be really useful to check that the right values are being calculated. 01:19:10.340 |
In the old days when we hand wrote everything, this was kind of the key unit test that people 01:19:15.580 |
used everywhere. These days, most of the time you're reusing layers that are built into 01:19:21.340 |
PyTorch or some other deep learning framework. So it's much less needed. But sometimes you're 01:19:26.740 |
implementing your own layer and you really do want to check that things are implemented correctly. 01:19:31.700 |
There's a fine point in the way this is written. If you saw this in sort of high school calculus 01:19:39.180 |
class, you will have seen rise over run of f of x plus h minus f of x divided by h. It 01:19:50.460 |
turns out that doing this two-sided estimate like this is much, much more accurate than 01:19:56.600 |
doing a one-sided estimate. And so you're really much encouraged to use this approximation. 01:20:02.220 |
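As a sketch of that check, here is a small numeric-gradient function using the two-sided (central difference) estimate; the function name, the test function, and the tolerance are illustrative choices rather than anything from the course starter code.

```python
import numpy as np

def numeric_gradient(f, x, h=1e-4):
    """Two-sided estimate (f(x+h) - f(x-h)) / 2h, one coordinate at a time."""
    x = x.astype(float)                # work on a copy so the caller's x is untouched
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old                   # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)   # rise over run
    return grad

# Check an analytic gradient: for f(x) = sum(x**2), the gradient is 2x.
x = np.random.randn(5)
analytic = 2 * x
numeric = numeric_gradient(lambda v: np.sum(v ** 2), x)
print(np.allclose(analytic, numeric, atol=1e-6))   # expect True
```

Note that it calls f twice per parameter, which is exactly why this is only a sanity check and never how you would compute gradients during training.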
Okay. So at that point, we've mastered the core technology of neural nets. Backpropagation 01:20:10.060 |
is recursively and hence efficiently applying the chain rule along the computation graph 01:20:16.620 |
with this sort of key step that downstream gradient equals upstream gradient times local 01:20:24.580 |
gradient. And so for calculating with neural nets, we do the forward pass to work out values 01:20:31.300 |
with current parameters, then run backpropagation to work out the gradient of the currently 01:20:39.860 |
computed loss with respect to those parameters. 01:20:45.180 |
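That forward-then-backward pattern is exactly what the frameworks expose. Here is a minimal PyTorch sketch, with a made-up one-layer scoring model and a squared-error loss chosen purely for illustration:

```python
import torch

# Toy input, target, and parameters (shapes chosen arbitrarily for illustration).
x = torch.randn(4)
y_true = torch.tensor(1.0)
W = torch.randn(3, 4, requires_grad=True)
b = torch.randn(3, requires_grad=True)
u = torch.randn(3, requires_grad=True)

# Forward pass: compute the current loss (PyTorch records the computation graph).
h = torch.sigmoid(W @ x + b)
s = u @ h
loss = (s - y_true) ** 2

# Backward pass: backpropagate through the recorded graph.
loss.backward()

# Gradients of the loss with respect to each parameter are now populated.
print(W.grad.shape, b.grad.shape, u.grad.shape)
```

The call to loss.backward() is the reverse topological sweep described above: the framework combines the handwritten local gradients of each operation with the upstream gradients for you.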
Now to some extent, you know, with modern deep learning frameworks, you don't actually 01:20:50.620 |
have to know how to do any of this, right? It's the same as you don't have to know how 01:20:55.480 |
to implement a C compiler. You can just write C code and say GCC and it will compile it 01:21:03.740 |
and it will run the right stuff for you. And that's the kind of functionality you get from 01:21:10.220 |
the PyTorch framework. So do come along to the PyTorch tutorial this Friday and get a 01:21:16.100 |
sense about how easy it is to write neural networks using a framework like PyTorch or 01:21:22.020 |
TensorFlow. And you know, it's so easy. That's why, you know, high school students across 01:21:27.540 |
the nation are now doing their science projects, training deep learning systems, because you 01:21:33.340 |
don't actually have to understand very much: you just plug a few neural network layers together 01:21:39.380 |
and set it computing on some data. But, you know, we hope in this class that you actually 01:21:45.220 |
are also learning how these things are implemented. So you have a deeper understanding than that. 01:21:53.420 |
And you know, it turns out that sometimes you need to have a deeper understanding. So 01:21:57.980 |
back propagation doesn't always work perfectly. And so understanding what it's really doing 01:22:03.780 |
can be crucial to debugging things. And so we'll actually see an example of that fairly 01:22:08.940 |
soon when we start looking at recurrent models and some of the problems that they have, which 01:22:14.580 |
will require us to think a bit more deeply about what's happening in our gradient computations.