Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 3 - Backprop and Neural Networks
Chapters
0:00 Introduction
0:40 Assignments
4:51 Named Entity Recognition
8:16 Neural Network
13:45 Gradient Computing
21:29 Nonlinearities
23:54 Jacobians
26:34 Neural Net
35:32 Jacobian
37:06 Shape Convention
42:15 Jacobian vs Shape Convention
48:16 Computation Graph
54:12 Example
00:00:00.000 |
Hi, everyone. I'll get started. Okay. So we're now back to the second week of CS224N on natural 00:00:14.600 |
language processing with deep learning. Okay. So for today's lecture, what we're going to 00:00:20.040 |
be looking at is all the math details of doing neural net learning. First of all, looking 00:00:27.480 |
at how we can work out by hand gradients for training neural networks and then looking 00:00:34.320 |
at how it's done more algorithmically, which is known as the back propagation algorithm. 00:00:40.600 |
And correspondingly for you guys, well, I hope you've remembered that, you know, one 00:00:46.000 |
minute ago was when assignment one was due and everyone has handed that in. If by some 00:00:51.640 |
chance you haven't handed it in, really should hand it in as soon as possible. Best to preserve 00:00:57.400 |
those late days for the harder assignments. So, I mean, I actually forgot to mention, 00:01:02.200 |
we actually did make one change for this year to make it a bit easier when occasionally 00:01:07.940 |
people join the class a week late. If you want to, this year in the grading, assignment 00:01:13.960 |
one can be discounted and we'll just use your other four assignments. But if you've been 00:01:19.120 |
in the class so far for that 98% of people, well, since assignment one is the easiest 00:01:24.840 |
assignment, again, it's silly not to do it and have it as part of your grade. Okay. So 00:01:32.000 |
starting today, we've put out assignment two and assignment two is all about making sure 00:01:37.640 |
you really understand the math of neural networks and then the software that we use to do that 00:01:44.040 |
math. So, this is going to be a bit of a tough week for some. So, for some people who are 00:01:50.440 |
great on all their math and backgrounds, they'll feel like this is stuff they know well, nothing 00:01:57.380 |
very difficult. But I know there are quite a few of you for whom this lecture and week is 00:02:03.600 |
the biggest struggle of the course. We really do want people to actually have an understanding 00:02:09.900 |
of what goes on in neural network learning rather than viewing it as some kind of deep 00:02:15.440 |
magic. And I hope that some of the material we give today and that you read up on and 00:02:21.040 |
use in the assignment will really give you more of a sense of what these neural networks 00:02:26.360 |
are doing and how it is just math that's applied in a systematic large scale that works out 00:02:33.160 |
the answers and that this will be valuable and give you a deeper sense of what's going 00:02:38.080 |
on. But if this material seems very scary and difficult, you can take some refuge in 00:02:46.120 |
the fact that there's fast light at the end of the tunnel, since this is really the only 00:02:51.120 |
lecture that's heavily going through the math details of neural networks. After that, we'll 00:02:57.240 |
be kind of popping back up to a higher level. And by and large, after this week, we'll be 00:03:03.640 |
making use of software to do a lot of the complicated math for us. But nevertheless, 00:03:10.440 |
I hope this is valuable. I'll go through everything quickly today. But if this isn't stuff that 00:03:16.840 |
you know backwards, I really do encourage you to, you know, work through it and get 00:03:22.640 |
help as you need it. So do come along to our office hours. There are also a number of pieces 00:03:28.200 |
of tutorial material given in the syllabus. So there's both the lecture notes, there's 00:03:34.080 |
some materials from CS231N. In the list of readings, the very top reading is some material 00:03:41.800 |
put together by Kevin Clark a couple of years ago. And actually, that one's my favorite. 00:03:47.880 |
The presentation there fairly closely follows the presentation in this lecture of going 00:03:53.840 |
through matrix calculus. So, you know, personally, I'd recommend starting with that one. But 00:03:58.560 |
there are four different ones you can choose from if one of them seems more helpful to 00:04:03.000 |
you. Two other things on what's coming up. Actually, for Thursday's lecture, we make 00:04:11.360 |
a big change. And Thursday's lecture is probably the most linguistic lecture of the whole class, 00:04:16.640 |
where we go through the details of dependency grammar and dependency parsing. Some people 00:04:21.000 |
find that tough as well, but at least it'll be tough in a different way. And then one 00:04:25.720 |
other really good opportunity is this Friday, we have our second tutorial at 10am, which 00:04:32.240 |
is an introduction to PyTorch, which is the deep learning framework that we'll be using 00:04:37.360 |
for the rest of the class, once we've gone through these first two assignments where 00:04:41.640 |
you do things by yourself. So this is a great chance to get an intro to PyTorch. It'll be 00:04:47.320 |
really useful for later in the class. Okay. Today's material is really all about sort 00:04:55.920 |
of the math of neural networks. But just to sort of introduce a setting where we can work 00:05:02.040 |
through this, I'm going to introduce a simple NLP task and a simple form of classifier that 00:05:08.880 |
we can use for it. So the task of named entity recognition is a very common basic NLP task. 00:05:15.840 |
And the goal of this is you're looking through pieces of text and you're wanting to label 00:05:21.300 |
by labeling the words, which words belong to entity categories like persons, locations, 00:05:27.680 |
products, dates, times, etc. So for this piece of text, last night, Paris Hilton wowed in 00:05:35.120 |
the sequined gown. Samuel Quinn was arrested in the Hilton Hotel in Paris in April 1989. 00:05:42.320 |
Some words are being labeled as named entities as shown. These two sentences don't actually 00:05:48.160 |
belong together in the same article, but I chose those two sentences to illustrate the 00:05:54.080 |
basic point that it's not that you can just do this task by using a dictionary. Yes, a 00:05:59.760 |
dictionary is helpful to know that Paris can possibly be a location, but Paris can also 00:06:06.120 |
be a person name. So you have to use context to get named entity recognition right. Okay, 00:06:14.840 |
well how might we do that with a neural network? There are much more advanced ways of doing 00:06:21.960 |
this, but a simple yet already pretty good way of doing named entity recognition with 00:06:29.600 |
a simple neural net is to say, well, what we're going to do is use the word vectors 00:06:36.560 |
that we've learned about, and we're going to build up a context window of word vectors. 00:06:43.920 |
And then we're going to put those through a neural network layer and then feed it through 00:06:49.520 |
a softmax classifier of the kind that we, sorry, I said that wrong. And then we're going 00:06:55.760 |
to feed it through a logistic classifier of the kind that we saw when looking at negative 00:07:00.800 |
sampling, which is going to say for a particular entity type, such as location, is it high 00:07:08.280 |
probability location or is it not a high probability location? So for a sentence like the museums 00:07:15.740 |
in Paris are amazing to see, what we're going to do is for each word, say we're doing the 00:07:20.880 |
word Paris, we're going to form a window around it, say a plus or minus two word window. And 00:07:27.080 |
so for those five words, we're going to get word vectors for them from the kind of word 00:07:33.720 |
to vector or glove word vectors we've learned. And we're going to make a long vector out 00:07:38.880 |
of the concatenation of those five word vectors. So the word of interest is in the middle. 00:07:44.060 |
And then we're going to feed this vector to a classifier, which is at the end going to 00:07:50.900 |
have a probability of the word being a location. And then we could have another classifier 00:07:57.300 |
that says the probability of the word being a person name. And so once we've done that, 00:08:02.020 |
we're then going to run it at the next position. So we then say, well, is the word "are" a location? 00:08:07.660 |
And we'd feed a window of five words, which is then "in Paris are amazing to", and put it through 00:08:13.100 |
the same kind of classifier. And so this is the classifier that we'll use. So it's input 00:08:20.440 |
will be this word window. So if we have D dimensional word vectors, this will be a five 00:08:26.420 |
D vector. And then we're going to put it through a layer of a neural network. So the layer 00:08:32.500 |
of the neural network is going to multiply this vector by a matrix, add on a bias vector, 00:08:41.540 |
and then put that through a nonlinearity such as the softmax transformation that we've seen 00:08:49.980 |
before. And that will give us a hidden vector, which might be of a smaller dimensionality 00:08:56.340 |
such as this one here. And so then with that hidden vector, we're then going to take the 00:09:04.380 |
dot product of it with an extra vector here. Here's a U. So we take U dot product H. And 00:09:13.100 |
so when we do that, we're getting out a single number and that number can be any real number. 00:09:20.100 |
And so then finally, we're going to put that number through a logistic transform of the 00:09:27.020 |
same kind that we saw when doing negative sampling. The logistic transform will take 00:09:33.600 |
any real number and it will transform it into a probability that that word is a location. 00:09:40.620 |
So its output is the predicted probability of the word belonging to a particular class. 00:09:47.260 |
And so this could be our location classifier, which could classify each word in a window 00:09:53.500 |
as to what the probability is that it's a location word. And so this little neural network 00:09:59.900 |
here is the neural network I'm going to use today when going through some of the math. 00:10:06.580 |
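As a concrete reference for the math that follows, here is a minimal NumPy sketch of this window classifier; the sizes (d = 4, hidden size 3), the random initialization, and the variable names are illustrative assumptions, not the course's starter code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, n_window, n_hidden = 4, 5, 3          # illustrative sizes only
rng = np.random.default_rng(0)

# x: concatenation of the 5 word vectors in the window (length 5d)
x = rng.standard_normal(n_window * d)

# parameters of the one-layer classifier
W = rng.standard_normal((n_hidden, n_window * d))
b = rng.standard_normal(n_hidden)
u = rng.standard_normal(n_hidden)

z = W @ x + b            # linear layer
h = sigmoid(z)           # elementwise nonlinearity f
s = u @ h                # score: a single real number
p = sigmoid(s)           # probability that the center word is a location
print(p)
```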
But actually I'm going to make it even easier on myself. I'm going to throw away the logistic 00:10:13.620 |
function at the top, and I'm really just going to work through the math of the bottom three 00:10:18.640 |
quarters of this. If you look at Kevin Clark's handout that I just mentioned, he includes 00:10:25.620 |
when he works through it, also working through the logistic function. And we also saw working 00:10:32.120 |
through a softmax in the first lecture when I was working through some of the word2vec 00:10:36.720 |
model. Okay. So the overall question we want to be able to answer is, so here's 00:10:45.600 |
our stochastic gradient descent equation that we have existing parameters of our model, 00:10:53.940 |
and we want to update them based on our current loss, which is J of theta. So for getting 00:11:03.700 |
our loss here, that the true answer as to whether a word is a location or not will be 00:11:09.680 |
either, you know, one if it is a location or zero if it isn't. Our logistic classifier 00:11:16.140 |
will return some number like .9, and we'll use the distance away from what it should 00:11:21.620 |
have been squared as our loss. So we work out a loss, and then we're moving a little 00:11:28.260 |
distance in the negative of the gradient, which will be changing our parameter estimates 00:11:35.140 |
in such a way that they reduce the loss. And so this is already being written in terms 00:11:41.640 |
of a whole vector of parameters, which is being updated as to a new vector of parameters. 00:11:48.220 |
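As a plain-equation reconstruction of the update being described (with α the small learning rate, and the squared-error loss matching the "0 or 1 versus .9" example just given):

```latex
J(\theta) \;=\; \big(y_{\text{true}} - p_{\theta}\big)^{2},
\qquad
\theta^{\text{new}} \;=\; \theta^{\text{old}} \;-\; \alpha\,\nabla_{\theta}\, J(\theta)
```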
But you can also think about it that for each individual parameter, theta J, that we're 00:11:54.160 |
working out the partial derivative of the loss with respect to that parameter, and then 00:12:00.860 |
we're moving a little bit in the negative direction of that. That's going to give us 00:12:07.420 |
a new value for parameter theta J. And we're going to update all of the parameters of our 00:12:15.580 |
model as we learn. I mean, in particular, in contrast to what commonly happens in statistics, 00:12:23.100 |
we also we update not only the sort of parameters of our model that are sort of weights in the 00:12:30.260 |
classifier, but we also will update our data representation. So we'll also be changing 00:12:36.180 |
our word vectors as we learn. Okay. So to build neural nets, i.e. to train neural nets 00:12:42.740 |
based on data, what we need is to be able to compute this gradient of the parameters 00:12:49.860 |
so that we can then iteratively update the weights of the model and efficiently train 00:12:55.460 |
a model that has good weights, i.e. that has high accuracy. And so how can we do that? 00:13:03.860 |
Well, what I'm going to talk about today is first of all, how you can do it by hand. And 00:13:11.820 |
so for doing it by hand, this is basically a review of matrix calculus. And that'll take 00:13:19.860 |
quite a bit of the lecture. And then after we've talked about that for a while, I'll 00:13:26.940 |
then shift gears and introduce the back propagation algorithm, which is the central technology 00:13:33.540 |
for neural networks. And that technology is essentially the efficient application of calculus 00:13:41.220 |
on a large scale, as we'll come to talking about soon. So for computing gradients by 00:13:47.300 |
hand, what we're doing is matrix calculus. So we're working with vectors and matrices 00:13:55.900 |
and working out gradients. And this can seem like pretty scary stuff. And well, to the 00:14:05.620 |
extent that you're kind of scared and don't know what's going on, one choice is to work 00:14:13.260 |
out a non-vectorized gradient by just working out what the partial derivative is for one 00:14:20.620 |
parameter at a time. And I showed a little example of that in the first lecture. But 00:14:27.100 |
it's much, much faster and more useful to actually be able to work with vectorized gradients. 00:14:37.580 |
And in some sense, if you're not very confident, this is kind of almost a leap of faith. But 00:14:43.740 |
it really is the case that multivariable calculus is just like single variable calculus, except 00:14:50.620 |
you're using vectors and matrices. So providing you remember some basics of single variable 00:14:56.740 |
calculus, you really should be able to do this stuff and get it to work out. Lots of 00:15:02.900 |
other sources. I've mentioned the notes. You can also look at the textbook for Math 51, 00:15:10.940 |
which also has quite a lot of material on this. I know some of you have bad memories 00:15:15.860 |
of Math 51. Okay. So let's go through this and see how it works, ramping up from the 00:15:21.940 |
beginning. So the beginning of calculus is, you know, we have a function with one input 00:15:27.900 |
and one output, f of x equals x cubed. And so then its gradient is its slope, right? 00:15:34.940 |
So that's its derivative. So its derivative is 3x squared. And the way to think about 00:15:41.700 |
this is how much will the output change if we change the input a little bit, right? So 00:15:48.620 |
what we're wanting to do in our neural net models is change what they output so that 00:15:54.140 |
they do a better job of predicting the correct answers when we're doing supervised learning. 00:16:00.620 |
And so what we want to know is if we fiddle different parameters of the model, how much 00:16:05.500 |
of an effect will that have on the output? Because then we can choose how to fiddle them 00:16:09.900 |
in the right way to move things down, right? So, you know, when we're saying that the derivative 00:16:15.780 |
here is 3x squared, well, what we're saying is that if you're at x equals 1, if you fiddle 00:16:24.740 |
the input a little bit, the output will change three times as much, 3 times 1 squared. And 00:16:31.180 |
it does. So if I say what's the value at 1.01, it's about 1.03. It's changed three times 00:16:37.700 |
as much, and that's its slope. But at x equals 4, the derivative is 16 times 3, 48. So if 00:16:47.740 |
we fiddle the input a little, it'll change 48 times as much. And that's roughly what 00:16:52.580 |
happens. 4.01 cubed is 64.48. Now, of course, you know, this is just sort of showing it 00:17:00.020 |
for a small fiddle, but, you know, that's an approximation to the actual truth. Okay. 00:17:06.820 |
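That "wiggle the input a little" reasoning can be checked numerically in a couple of lines; a small sketch, where the step size 0.01 mirrors the 1.01 and 4.01 examples above:

```python
def f(x):
    return x ** 3

def numeric_slope(x, h=0.01):
    # how much the output changes, relative to a small change in the input
    return (f(x + h) - f(x)) / h

for x in (1.0, 4.0):
    print(x, numeric_slope(x), 3 * x ** 2)
# prints roughly: 1.0 3.0301 3.0   and   4.0 48.1201 48.0
```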
So then we sort of ramp up to the more complex cases, which are more reflective of what we 00:17:12.460 |
do with neural networks. So if we have a function with one output and n inputs, then we have 00:17:19.740 |
a gradient. So a gradient is a vector of partial derivatives with respect to each input. So 00:17:26.180 |
we've got n inputs, x1 to xn, and we're working out the partial derivative f with respect 00:17:31.660 |
to x1, the partial derivative f with respect to x2, et cetera. And we then get a vector 00:17:39.460 |
of partial derivatives, where each element of this vector is just like a simple derivative 00:17:46.380 |
with respect to one variable. Okay. So from that point, we just keep on ramping up for 00:17:53.380 |
what we do with neural networks. So commonly when we have something like a layer in a neural 00:17:59.860 |
network, we'll have a function with n inputs that will be like our word vectors. Then we 00:18:07.140 |
do something like multiply by a matrix, and then we'll have m outputs. So we have a function 00:18:13.700 |
now which is taking n inputs and is producing m outputs. So at this point, what we're calculating 00:18:23.100 |
for the gradient is what's called a Jacobian matrix. So for n inputs and m outputs, the 00:18:30.540 |
Jacobian is an m by n matrix of every combination of partial derivatives. So function f splits 00:18:40.280 |
up into these different sub-functions, f1 through fm, which generate each of the 00:18:47.660 |
m outputs. And so then we're taking the partial derivative f1 with respect to x1 through the 00:18:54.380 |
partial derivative of f1 with respect to xn. Then heading down, you know, we make it up 00:18:59.700 |
to the partial derivative of fm with respect to x1, et cetera. So we have every possible 00:19:05.580 |
partial derivative of an output variable with respect to one of the input variables. Okay. 00:19:14.940 |
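In symbols, for a function f from R^n to R^m, the Jacobian being described is the m-by-n matrix:

```latex
\frac{\partial f}{\partial x} \;=\;
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix},
\qquad
\left(\frac{\partial f}{\partial x}\right)_{ij} \;=\; \frac{\partial f_i}{\partial x_j}
```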
So in simple calculus, when you have a composition of one variable functions, so that if you 00:19:23.100 |
have y equals x squared and then z equals 3y, that's then z is a composition of two 00:19:33.260 |
functions of - or you're composing two functions to get z as a function of x. Then you can 00:19:40.020 |
work out the derivative of z with respect to x. And the way you do that is with the 00:19:45.220 |
chain rule. And so in the chain rule, you multiply derivatives. So dz dx equals dz dy 00:19:53.180 |
times dy dx. So dz dy is just 3 and dy dx is 2x. So we get 3 times 2x. So that overall 00:20:07.180 |
derivative here is 6x. And since if we multiply this together, we're really saying that z 00:20:14.940 |
equals 3x squared, you should trivially be able to see again, aha, its derivative is 00:20:21.940 |
6x. So that works. Okay. So once we move into vectors and matrices and Jacobians, it's actually 00:20:30.780 |
the same game. So when we're working with those, we can compose functions and work out 00:20:36.380 |
the derivatives by simply multiplying Jacobians. So if we have start with an input x and then 00:20:44.060 |
put it through the simplest form of neural network layer and say that z equals wx plus 00:20:51.100 |
b. So we multiply the x vector by matrix w and then add on a bias vector b. And then 00:20:57.620 |
typically we'd put things through a nonlinearity f. So f could be a sigmoid function. We'll 00:21:03.900 |
then say h equals f of z. So this is the composition of two functions in terms of vectors and matrices. 00:21:13.300 |
So we can use Jacobians and we can say the partial of h with respect to x is going to 00:21:18.980 |
be the product of the partial of h with respect to z and the partial of z with respect to 00:21:25.940 |
x. And this all does work out. So let's start going through some examples of how these things 00:21:35.220 |
work slightly more concretely. First, just particular Jacobians and then composing them 00:21:43.220 |
together. So one case we look at is the nonlinearities that we put a vector through. So this is something 00:21:51.620 |
like putting a vector through the sigmoid function f. And so if we have an intermediate 00:21:59.740 |
vector z and we're turning into a vector h by putting it through a logistic function, 00:22:05.820 |
we can say what is dh dz? Well, for this, formally this is a function that has n inputs 00:22:20.140 |
and n outputs. So at the end of the day, we're computing an n by n Jacobian. And so what 00:22:28.860 |
that's meaning is the elements of this n by n Jacobian are going to take the partial derivative 00:22:36.460 |
of each output with respect to each input. And well, what is that going to be in this 00:22:44.780 |
case? Well, in this case, because we're actually just computing element-wise a transformation 00:22:54.340 |
such as a logistic transform of each element zi, like the second equation here, if i equals 00:23:03.420 |
j, we've got something to compute. Whereas if i doesn't equal j, there's just the input 00:23:10.820 |
has no influence on the output. And so the derivative is zero. So if i doesn't equal 00:23:16.460 |
j, we're going to get a zero. And if i does equal j, then we're going to get the regular 00:23:22.820 |
one variable derivative of the logistic function, which if I remember correctly, you were asked 00:23:31.660 |
to compute. Now I can't remember what's assignment one or assignment two, but one of the two 00:23:36.660 |
asks you to compute it. So our Jacobian for this case looks like this. We have a diagonal 00:23:43.380 |
matrix with the derivatives of each element along the diagonal and everything else is 00:23:52.540 |
zero. Okay, so let's look at a couple of other Jacobians. So if we are asking, if we've got 00:24:02.180 |
this WX plus B basic neural network layer, and we're asking for the gradient with respect 00:24:08.900 |
to X, then what we're going to have coming out is that that's actually going to be the 00:24:15.420 |
matrix W. So this is where, what I hope you can do is look at the notes at home and work 00:24:24.740 |
through this exactly and see that this is actually the right answer. But this is the 00:24:32.820 |
way in which if you just have faith and think this is just like single variable calculus, 00:24:40.900 |
except I've now got vectors and matrices, the answer you get is actually what you expected 00:24:45.540 |
to get, because this is just like the derivative of AX plus B with respect to X, which is 00:24:52.780 |
A. So similarly, if we take the partial derivative with respect to B of WX plus B, we get out 00:25:01.740 |
the identity matrix. Okay, then one other Jacobian that we mentioned while in the first 00:25:10.260 |
lecture while working through Word2Vec is if you have the dot product of two vectors, 00:25:18.340 |
i.e. that's a number, that what you get coming out of that, so the partial derivative of 00:25:28.900 |
UTH with respect to U is H transpose. And at this point, there's some fine print that 00:25:36.660 |
I'm going to come back to in a minute. So this is the correct Jacobian, right? Because 00:25:43.580 |
in this case, we have the dimension of H inputs and we have one output. And so we want to 00:25:53.660 |
have a row vector. But there's a little bit more to say on that that I'll come back to 00:26:00.740 |
in about 20 slides. But this is the correct Jacobian. Okay, so if you are not familiar 00:26:08.900 |
with these kind of Jacobians, do please look at some of the notes that are available and 00:26:16.540 |
try and compute these in more detail element-wise and convince yourself that they really are 00:26:21.740 |
right. But I'm going to assume these now and show you what happens when we actually then 00:26:27.740 |
work out gradients for at least a mini little neural net. Okay. So here is most of this 00:26:41.660 |
neural net. I mean, as I commented, that, you know, really we'd be working out the partial 00:26:49.140 |
derivative of the loss J with respect to these variables. But for the example I'm doing here, 00:26:56.140 |
I've locked that off to keep it a little simpler and more manageable for the lecture. And so 00:27:00.980 |
we're going to just work out the partial derivative of the score S, which is a real number with 00:27:07.100 |
respect to the different parameters of this model, where the parameters of this model 00:27:13.060 |
are going to be the W and the B and the U and also the input, because we can update 00:27:22.500 |
the word vectors of different words based on tuning them to better 00:27:30.900 |
predict the classification outputs that we desire. So let's start off with a fairly easy 00:27:36.980 |
one where we want to update the bias vector B to have our system classify better. So to 00:27:45.980 |
be able to do that, what we want to work out is the partial derivatives of S with respect 00:27:51.380 |
to B. So we know how to put that into our stochastic gradient update for the B parameters. 00:28:00.020 |
Okay. So how do we go about doing these things? So the first step is we want to sort of break 00:28:07.100 |
things up into different functions of minimal complexity that compose together. So in particular, 00:28:15.900 |
this neural net layer, H equals F of WX plus B, it's still a little bit complex. So let's 00:28:21.780 |
decompose that one further step. So we have the input X, we then calculate the linear 00:28:30.980 |
transformation Z equals WX plus B. And then we put things through the sort of element 00:28:40.780 |
wise nonlinearity, H equals F of Z, and then we do the dot product with U. And, you know, 00:28:50.380 |
it's useful for working these things out to, you know, split into pieces like this, have 00:28:57.660 |
straight what your different variables are, and to know what the dimensionality of each 00:29:02.740 |
of these variables is. It's well worth just writing out the dimensionality of every variable 00:29:08.500 |
and making sure that the answers that you're computing are of the right dimensionality. 00:29:14.140 |
So at this point, though, what we can see is that calculating S is the product of three 00:29:22.020 |
- sorry, is the composition of three functions around X. So for working out the partials 00:29:31.020 |
of S with respect to B, it's the composition of the three functions shown on the left. 00:29:39.500 |
And so therefore, the gradient of S with respect to B, we're going to take the product of these 00:29:49.500 |
three partial derivatives. Okay, so how do - what do we - so, you know, we've got the 00:29:59.860 |
S equals UTH, so that's the sort of the top corresponding partial derivative, partial 00:30:06.740 |
derivative of H with respect to Z, partial derivative of Z with respect to B, which is 00:30:13.380 |
the first one that we're working out. Okay, so we want to work this out, and if we're 00:30:19.340 |
lucky, we remember those Jacobians I showed previously about the Jacobian for a vector 00:30:26.420 |
dot product, the Jacobian for the non-linearity, and the Jacobian for the simple linear transformation. 00:30:36.460 |
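Before composing them, each of the three building-block Jacobians can be checked numerically; a hedged finite-difference sketch, where the sizes and the choice of sigmoid for f are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def numeric_jacobian(func, inp, eps=1e-6):
    # bump one input coordinate at a time and record the change in every output
    out = np.atleast_1d(func(inp))
    J = np.zeros((out.size, inp.size))
    for j in range(inp.size):
        bumped = inp.copy()
        bumped[j] += eps
        J[:, j] = (np.atleast_1d(func(bumped)) - out) / eps
    return J

rng = np.random.default_rng(0)
n, m = 3, 4
u, h, z, b = (rng.standard_normal(n) for _ in range(4))
x = rng.standard_normal(m)
W = rng.standard_normal((n, m))

print(np.allclose(numeric_jacobian(lambda hh: u @ hh, h), u[None, :], atol=1e-4))    # d(u.h)/dh = u^T
fz = sigmoid(z)
print(np.allclose(numeric_jacobian(sigmoid, z), np.diag(fz * (1 - fz)), atol=1e-4))  # diag(f'(z))
print(np.allclose(numeric_jacobian(lambda bb: W @ x + bb, b), np.eye(n), atol=1e-4)) # d(Wx+b)/db = I
```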
And so we can use those. So for the partials of S with respect to H, well, that's going 00:30:45.940 |
to be UT using the first one. The partials of H with respect to Z, okay, so that's the 00:30:53.540 |
non-linearity, and so that's going to be the matrix, it's the diagonal matrix with the 00:31:00.580 |
element wise derivative F prime of Z and zero elsewhere. And then for the WX plus B, when 00:31:09.780 |
we're taking the partials with respect to B, that's just the identity matrix. So we 00:31:16.140 |
can simplify that down a little, the identity matrix disappears, and since UT is a vector 00:31:27.820 |
and this is a diagonal matrix, we can rewrite this as UT Hadamard product of F prime of 00:31:35.260 |
Z. I think this is the first time I've used this little circle for a Hadamard product, 00:31:41.380 |
but it's something that you'll see quite a bit in neural network work, since it's often 00:31:47.740 |
used. So when we have two vectors, UT and this vector here, sometimes you want to do 00:31:56.860 |
an element wise product. So the output of this will be a vector where you've taken the 00:32:02.140 |
first element of each and multiplied them, the second element of each and multiplied 00:32:06.180 |
them, et cetera, downwards. And so that's called the Hadamard product, and it's what 00:32:11.100 |
we're calculating to get a vector, which is the gradient of S with respect to 00:32:20.140 |
B. Okay. So that's good. So we now have a gradient of S with respect to B, and we could 00:32:30.260 |
use that in our stochastic gradient, but we don't stop there. We also want to work out 00:32:37.940 |
the gradient with respect to others of our parameters. So we might want to next go on 00:32:45.540 |
and work out the gradient of S with respect to W. Well, we can use the chain rule just 00:32:54.660 |
like we did before, right? So we've got the same product of functions and everything is 00:33:01.040 |
going to be the same apart from now taking the derivatives with respect to W rather than 00:33:08.220 |
B. So it's now going to be the partial of S with respect to H, H with respect to Z, 00:33:16.620 |
and Z with respect to W. And the important thing to notice here, and this leads into 00:33:24.220 |
what we do with the backpropagation algorithm, is wait a minute, this is very similar to 00:33:32.420 |
what we've already done. So when we were working out the gradients of S with respect to B, 00:33:38.620 |
the first two terms were exactly the same. It's only the last one that differs. So to 00:33:46.620 |
be able to build or to train neural networks efficiently, this is what happens all the 00:33:54.460 |
time. And it's absolutely essential that we use an algorithm that avoids repeated computation. 00:34:02.980 |
And so the idea we're going to develop is when we have this equation stack, that there's 00:34:09.220 |
sort of stuff that's above where we compute Z, and we're going to be sort of, that'll 00:34:15.420 |
be the same each time. And we want to compute something from that, that we can then sort 00:34:21.700 |
of feed downwards when working out the gradients with respect to W, X, or B. And so we do that 00:34:32.540 |
by defining delta, which is the composition of the partials that are above the linear transform. 00:34:42.660 |
And that's referred to as the local error signal. It's what's being passed in from above 00:34:48.620 |
the linear transform. And we've already computed the gradient of that in the preceding slides. 00:34:57.200 |
And so the final form of the partial of S with respect to B will be delta times the 00:35:07.700 |
remaining part. And well, we'd seen that, you know, for partial S with respect to B, 00:35:15.300 |
the partial of Z with respect to B is just the identity. So the end result was delta. 00:35:20.820 |
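As a consolidated restatement of the result just derived (with the circle the Hadamard/elementwise product introduced above, and transposes written so both factors are row vectors):

```latex
\delta \;=\; \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}
      \;=\; u^{T}\,\operatorname{diag}\!\big(f'(z)\big)
      \;=\; u^{T} \circ f'(z)^{T},
\qquad
\frac{\partial s}{\partial b} \;=\; \delta\,\frac{\partial z}{\partial b} \;=\; \delta\, I \;=\; \delta
```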
But this time, we're then going to have to work out the partial of Z with respect to 00:35:25.540 |
W and multiply that by delta. So that's the part that we still haven't yet done. So, and 00:35:35.260 |
this is where things get, in some sense, a little bit hairier. And so there's something 00:35:44.060 |
that's important to explain. So, you know, what should we have for the Jacobian of DSDW? 00:35:53.580 |
Well, that's a function that has one output. The output is just a score, a real number. 00:36:04.240 |
And then it has N times M inputs. So the Jacobian is a 1 by NM matrix, i.e. a very long 00:36:16.620 |
row vector. But that's correct math. But it turns out that that's kind of bad for our 00:36:24.940 |
neural networks. Because remember, what we want to do with our neural networks is do 00:36:30.420 |
stochastic gradient descent. And we want to say theta new equals theta old minus a small 00:36:38.220 |
multiplier times the gradient. And well, actually, the W matrix is an N by M matrix. And so we 00:36:53.620 |
couldn't actually do this subtraction if this gradient we calculate is just a huge row vector. 00:37:01.060 |
We'd like to have it as the same shape as the W matrix. In neural network land, when 00:37:07.900 |
we do this, we depart from pure math at this point. And we use what we call the shape convention. 00:37:15.900 |
So what we're going to say is, and you're meant to use this for answers in the assignment, 00:37:22.400 |
that the shape of the gradient we're always going to make to be the shape of the parameters. 00:37:28.900 |
And so therefore, DSDW, we are also going to represent as an N by M matrix just like 00:37:37.640 |
W. And we're going to reshape the Jacobian to place it into this matrix shape. 00:37:46.100 |
Okay, so if we want to place it into this matrix shape, what do we, what are we going 00:37:55.700 |
to want to get for DSDW? Well, we know that it's going to involve delta, our local error 00:38:07.100 |
signal. And then we have to work out something for DZDW. Well, since Z equals WX plus B, 00:38:23.420 |
you'd kind of expect that the answer should be X. And that's right. So the answer to DSDW 00:38:33.940 |
is going to be delta transpose times X transpose. And so the form that we're getting for this 00:38:42.420 |
derivative is going to be the product of the local error signal that comes from above versus 00:38:52.900 |
what we calculate from the local input X. So that shouldn't yet be obvious why that 00:39:00.300 |
is true. So let me just go through in a bit more detail why that's true. So when we want 00:39:07.140 |
to work out DSDW, right, it's sort of delta times DZDW, where what that's computing for 00:39:17.780 |
Z is WX plus B. So let's just consider for a moment what the derivative is with respect 00:39:25.860 |
to a single weight WIJ. So WIJ might be W2 3 that's shown in my little neural network 00:39:35.460 |
here. And so the first thing to notice is that WIJ only contributes to ZI. So it's going 00:39:46.380 |
into Z2, which then computes H2. And it has no effect whatsoever on H1. OK, so when we're 00:39:59.020 |
working out DZI DWIJ, it's going to be the derivative of WI dot X, that row of the matrix, 00:40:11.820 |
plus BI, which means that we've got a sum over K of WIK times XK. And then 00:40:22.140 |
for this sum, this is like one variable calculus that when we're taking the derivative of this 00:40:28.220 |
with respect to WIJ, every term in this sum is going to be 0. The derivative is going 00:40:36.220 |
to be 0 except for the one that involves WIJ. And then the derivative of that is just like 00:40:42.740 |
AX with respect to A. It's going to be X. So you get XJ out as the answer. And so the 00:40:52.180 |
end result of that is that when we're working out, what we want as the answer is that we're 00:40:59.620 |
going to get that these columns where X1 is all that's left, X2 is all that's left, through 00:41:10.180 |
XM is all that's left. And then that's multiplied by the vectors of the local error signal from 00:41:18.620 |
above. And what we want to compute is this outer product matrix where we're getting the 00:41:24.740 |
different combinations of the delta and the X. And so we can get the N by M matrix that 00:41:33.620 |
we'd like to have via our shape convention by taking delta transpose, which is N by 1 00:41:41.620 |
times X transpose, which is 1 by M. And then we get this outer product matrix. So that's 00:41:49.500 |
a kind of a hacky argument that I've made. It's certainly a way of doing things that 00:41:54.540 |
the dimensions work out and sort of make sense. There's a more detailed run through this that 00:42:01.300 |
appears in the lecture notes. And I encourage you to sort of also look at the more mathy 00:42:08.180 |
version of that. Here's a little bit more information about the shape convention. So 00:42:14.820 |
well, first of all, one more example of this. So when you're working out DSDB, that comes 00:42:29.580 |
out as a Jacobian that is a row vector. But similarly, you know, according to the shape 00:42:37.820 |
convention, we want our gradient to be the same shape as B and B is a column vector. 00:42:46.380 |
So that's sort of, again, they're different shapes and you have to transpose one to get 00:42:51.340 |
the other. And so effectively what we have is a disagreement between the Jacobian form. 00:42:58.180 |
So the Jacobian form makes sense for, you know, calculus and math, because if you want 00:43:03.620 |
to have it like I claimed that matrix calculus is just like single variable calculus apart 00:43:11.020 |
from using vectors and matrices, you can just multiply together the partials. That only 00:43:16.180 |
works out if you're using Jacobians. But on the other hand, if you want to do stochastic 00:43:23.860 |
gradient descent and be able to sort of subtract off a piece of the gradient, that only works 00:43:32.820 |
if you have the same shape matrix for the gradient as you do for the original matrix. 00:43:41.400 |
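To make the shape-convention result concrete, here is a hedged numeric check that the gradient of the score with respect to W is the outer product delta-transpose times x-transpose, an n-by-m matrix just like W; the sizes and the sigmoid nonlinearity are assumptions of the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(W, b, u, x):
    z = W @ x + b
    h = sigmoid(z)
    return u @ h

rng = np.random.default_rng(0)
n, m = 3, 5                       # hidden size n, input size m (the 5d window in the example)
W = rng.standard_normal((n, m))
b, u = rng.standard_normal(n), rng.standard_normal(n)
x = rng.standard_normal(m)

# analytic gradient under the shape convention: delta^T x^T, same shape as W
z = W @ x + b
delta = u * sigmoid(z) * (1 - sigmoid(z))      # local error signal, length n
grad_W = np.outer(delta, x)

# finite-difference check, one weight W[i, j] at a time
eps = 1e-6
num_W = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        Wp = W.copy(); Wp[i, j] += eps
        num_W[i, j] = (score(Wp, b, u, x) - score(W, b, u, x)) / eps

print(np.allclose(grad_W, num_W, atol=1e-4))   # expected: True
```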
And so this is a bit confusing, but that's just the reality. There are both of these 00:43:47.340 |
two things. So the Jacobian form is useful in doing the calculus. But for the answers 00:43:56.860 |
in the assignment, we want the answers to be presented using the shape convention so 00:44:03.700 |
that the gradient is shown in the same shape as the parameters. And therefore, you'll be 00:44:12.380 |
able to it's the right shape for doing a gradient update by just subtracting a small amount 00:44:19.100 |
of the gradient. So for working through things, there are then basically two choices. One 00:44:28.500 |
choice is to work through all the math using Jacobians and then right at the end to reshape 00:44:37.780 |
following the shape convention to give the answer. So that's what I did when I worked 00:44:42.740 |
out DSDB. We worked through it using Jacobians. We got an answer, but it turned out to be 00:44:53.620 |
a row vector. And so, well, then we have to transpose it at the end to get it into the 00:44:59.100 |
right shape for the shape convention. The alternative is to always follow the shape 00:45:08.740 |
convention. And that's kind of what I did when I was then working out DSDW. I didn't 00:45:16.540 |
fully use Jacobians. I said, oh, well, when we work out whatever it was, DZDW, let's work 00:45:26.260 |
out what shape we want it to be and what to fill in the cells with. And if you're sort 00:45:32.940 |
of trying to do it immediately with the shape convention, it's a little bit more hacky in 00:45:41.460 |
a way since you have to look at the dimensions for what you want and figure out when to transpose 00:45:47.740 |
or to reshape the matrix to be at the right shape. But the kind of informal reasoning 00:45:54.100 |
that I gave is what you do and what works. And one way of - and there are sort of hints 00:46:01.940 |
that you can use, right, that you know that your gradient should always be the same shape 00:46:07.300 |
as your parameters, and you know that the error message coming in will always have the 00:46:13.420 |
same dimensionality as that hidden layer, and you can sort of work it out always following 00:46:19.060 |
the shape convention. Okay. So that is, hey, doing this is all matrix calculus. So after 00:46:35.180 |
pausing for breath for a second, the rest of the lecture is then, okay, let's look at 00:46:46.580 |
how our software trains neural networks using what's referred to as the back propagation 00:46:55.380 |
algorithm. So the short answer is, you know, basically we've already done it. The rest 00:47:11.580 |
of the lecture is easy. So, you know, essentially I've just shown you what the back propagation 00:47:18.660 |
algorithm does. So the back propagation algorithm is judiciously taking and propagating derivatives 00:47:34.060 |
using the matrix chain rule. The rest of the back propagation algorithm is to say, okay, 00:47:43.620 |
when we have these neural networks, we have a lot of shared structure and shared derivatives. 00:47:52.360 |
So what we want to do is maximally efficiently reuse derivatives of higher layers when we're 00:48:02.180 |
computing derivatives for lower layers so that we minimize computation. And I already 00:48:08.380 |
pointed that out in the first half, but we want to systematically exploit that. And so 00:48:15.300 |
the way we do that in our computational systems is they construct computation graphs. So this 00:48:24.420 |
maybe looks a little bit like what you saw in a compiler's class if you did one, right, 00:48:31.060 |
that you're creating, I call it here a computation graph, but it's really a tree, right? So you're 00:48:37.300 |
creating here this tree of computations in this case, but in more general case, it's 00:48:44.380 |
some kind of directed graph of computations, which has source nodes, which are inputs, 00:48:54.340 |
either inputs like X or input parameters like W and B, and its interior nodes are operations. 00:49:03.920 |
And so then once we've constructed a graph, and so this graph corresponds to exactly the 00:49:09.380 |
example I did before, right? This was our little neural net that's in the top right. 00:49:14.100 |
And here's the corresponding computation graph of computing WX plus B, put it through the 00:49:21.460 |
sigmoid non-linearity F, and then taking the dot product of the resulting vector with U 00:49:28.620 |
gives us our output score S. Okay, so what we do to compute this is we pass along the 00:49:38.180 |
edges the results of operations. So this is WX, then Z, then H, and then our output is 00:49:44.420 |
S. And so the first thing we want to be able to do to compute with neural networks is to 00:49:50.980 |
be able to compute for different inputs what the output is. And so that's referred to as 00:49:57.780 |
forward propagation. And so we simply run this expression much like you'd standardly 00:50:07.740 |
do in a compiler to compute the value of S, and that's the forward propagation phase. 00:50:14.460 |
But the essential additional element of neural networks is that we then also want to be able 00:50:21.460 |
to send back gradients, which will tell us how to update the parameters of the model. 00:50:28.140 |
And so it's this ability to send back gradients, which gives us the ability for these models 00:50:35.180 |
to learn. Once we have a loss function at the end, we can work out how to change the 00:50:41.140 |
parameters of the model so that they more accurately produce the desired output, i.e. 00:50:48.580 |
they minimize the loss. And so it's doing that part that then is called back propagation. 00:50:56.740 |
So we then, once we forward propagated a value with our current parameters, we then head 00:51:05.540 |
backwards reversing the direction of the arrows and pass along gradients down to the different 00:51:13.460 |
parameters like B and W and U that we can use to change using stochastic gradient descent 00:51:21.260 |
what the value of B is and what the value of W is. So we start off with DSDS, which 00:51:27.620 |
is just one, and then we run our back propagation, and we're using the sort of same kind of composition 00:51:35.500 |
of Jacobians. So we have DSDH here and DSDZ, and we progressively pass back those gradients. 00:51:46.260 |
So we just need to work out how to efficiently and cleanly do this in a computational system. 00:51:53.940 |
And so let's sort of work through again a few of these cases. So the general situation 00:52:00.060 |
is we have a particular node. So a node is where some kind of operation like multiplication 00:52:10.340 |
or a non-linearity happens. And so the simplest case is that we've got one output and one 00:52:19.100 |
input. So we'll do that first. So that's like H equals F of Z. So what we have is an upstream 00:52:27.540 |
gradient, DSDH, and what we want to do is compute the downstream gradient of DSDZ. And 00:52:39.220 |
the way we're going to do that is say, well, for this function F, it's a function, it's 00:52:46.780 |
got a derivative, a gradient. So what we want to do is work out that local gradient, DHDZ, 00:52:56.020 |
and then that gives us everything that we need to work out DSDZ, because that's precisely 00:53:03.340 |
we're going to use the chain rule. We're going to say that DSDZ equals the product of DSDH 00:53:09.700 |
times DHDZ, where this is again using Jacobians. Okay. So the general principle that we're 00:53:17.140 |
going to use is the downstream gradient equals the upstream gradient times the local gradient. 00:53:23.740 |
Okay. Sometimes it gets a little bit more complicated. So we might have multiple inputs 00:53:29.980 |
to a function. So this is the matrix vector multiply. So Z equals WX. Okay. When there 00:53:37.820 |
are multiple inputs, we still have an upstream gradient, DSDZ. But what we're going to do 00:53:47.220 |
is work out a local gradient with respect to each input. So we have DZDW and DZDX. And 00:53:56.260 |
so then at that point, it's exactly the same for each piece of it. We're going to work 00:54:01.980 |
out the downstream gradients, DSDW and DSDX by using the chain rule with respect to the 00:54:11.180 |
local gradient. So let's go through an example of this. I mean, this is kind of a silly example. 00:54:20.900 |
It's not really an example that looks like a typical neural net, but it's sort of a simple 00:54:25.740 |
example where we can show some of the components of what we do. So what we're going to do is 00:54:31.620 |
want to calculate F of XYZ, which is being calculated as X plus Y times the max of Y 00:54:40.740 |
and Z. And we've got, you know, particular values that we're starting off with. X equals 00:54:47.900 |
1, Y equals 2, and Z equals 0. So these are the current values of our parameters. And 00:54:54.780 |
so we can say, okay, well, we want to build an expression tree for that. Here's our expression 00:55:01.860 |
tree. We're taking X plus Y, we're taking the max of Y and Z, and then we're multiplying 00:55:08.740 |
them. And so our forward propagation phase is just to run this. So we take the values 00:55:16.620 |
of our parameters and we simply start to compute with them, right? So we have 1, 2, 2, 0, and 00:55:24.580 |
we add them as 3, the max is 2, we multiply them, and that gives us 6. Okay. So then at 00:55:34.180 |
that point, we then want to go and work out how to do things for back propagation and 00:55:43.860 |
how these back propagation steps work. And so the first part of that is sort of working 00:55:50.460 |
out what our local gradients are going to be. So this is A here, and this is X and Y. 00:56:00.260 |
So dA/dX, since A equals X plus Y is just going to be 1, and dA/dy is also going to 00:56:08.140 |
be 1. Then for B equals the max of YZ, so this is this max node. So the local gradients 00:56:20.420 |
for that is, it's going to depend on whether Y is greater than Z. So dB/dy is going to 00:56:29.500 |
be 1 if and only if Y is greater than Z, which it is at our particular point here, so that's 00:56:37.740 |
1. And dB/dz is going to be 1 only if Z is greater than Y. So for our particular values 00:56:47.460 |
here, that one is going to be 0. And then finally, here, we're calculating the product 00:56:56.700 |
F equals AB. So for that, we're going to - sorry, that slide is a little imperfect. 00:57:10.100 |
So for the product, the derivative F with respect to A is equal to B, which is 2, and 00:57:16.740 |
the derivative F with respect to B is A equals 3. So that gives us all of the local gradients 00:57:23.900 |
at each node. And so then to run backpropagation, we start with dF/df, which is just 1, and 00:57:32.420 |
then we're going to work out the downstream equals the upstream times the local. Okay, 00:57:41.380 |
so the local - so when you have a product like this, note that sort of the gradients 00:57:48.260 |
flip. So we take upstream times the local, which is 2. So the downstream is 2. On this 00:58:03.820 |
side, dF/dB is 3. So we're taking upstream times local, that gives us 3. And so that 00:58:14.740 |
gives us backpropagates values to the plus and max nodes. And so then we continue along. 00:58:22.700 |
So for the max node, the local gradient dB/dy equals 1. So we're going to take upstream 00:58:32.180 |
is 3. So we're going to take 3 times 1, and that gives us 3. dB/dz is 0, because of the 00:58:42.220 |
fact that z's value is not the max. So we're taking 3 times 0, and saying the gradient 00:58:48.260 |
there is 0. So finally, doing the plus node, the local gradients for both x and y there 00:58:56.700 |
are 1. So we're just getting 2 times 1 in both cases, and we're saying the gradients 00:59:02.540 |
there are 2. Okay, and so again, at the end of the day, the interpretation here is that 00:59:11.180 |
this is giving us information as to if we wiggle the values of x, y, and z, how much 00:59:18.700 |
of a difference does it make to the output? What is the slope, the gradient, with respect 00:59:25.300 |
to the variable? So what we've seen is that since z isn't the max of y and z, if I change 00:59:35.860 |
the value of z a little, like if I make z 0.1 or minus 0.1, it makes no difference at 00:59:42.980 |
all to what I compute as the output. So therefore, the gradient there is 0. If I change the value 00:59:52.100 |
of x a little, then that is going to have an effect, and it's going to affect the output 01:00:02.340 |
by twice as much as the amount I change it. Right, so, and that's because the df/dx equals 01:00:17.540 |
2. So interestingly, so I mean, we can basically work that out. So if we imagine making sort 01:00:29.580 |
of x 2.1, well, then what we'd calculate the max is 2. Oh, sorry, sorry, if we make x 1.1, 01:00:41.740 |
we then get the max here is 2, and we get 1.1 plus 2 is 3.1. So we get 3.1 times 2. 01:00:51.780 |
So that'd be about 6.2. So changing x by 0.1 has added 0.2 to the value of f. Conversely, 01:01:02.100 |
for the value of y, we find the df/dy equals 5. So what we do when we've got two things 01:01:09.640 |
coming out here, as I'll go through again in a moment, is we're summing the gradient. 01:01:14.840 |
So again, 3 plus 2 equals 5. And empirically, that's what happens. So if we consider fiddling 01:01:21.400 |
the value of y a little, let's say we make it a value of 2.1, then the prediction is 01:01:28.840 |
they'll have five times as big an effect on the output value we compute. And well, what 01:01:35.060 |
do we compute? So we compute 1 plus 2.1. So that's 3.1. And we compute the max of 2.1 01:01:45.640 |
and 0 is 2.1. So we'll take the product of 2.1 and 3.1. And I calculate that in advance 01:01:53.480 |
as I can't really do this arithmetic in my head. And the product of those two is 6.51. 01:01:59.400 |
So it has gone up about by 0.5. So we've multiplied my fiddling it by 0.1 by five times to work 01:02:08.560 |
out the magnitude of the effect on the output. OK. So for this, before I did the case of 01:02:20.160 |
when we had one in and one out here and multiple ins and one out here, the case that I hadn't 01:02:32.480 |
actually dealt with is the case of when you have multiple outward branches. But that then 01:02:40.240 |
turned up in the computation of y. So once you have multiple outward branches, what you're 01:02:47.560 |
doing is you're summing. So that when you want to work out the df/dy, you've got a local 01:03:00.000 |
gradient. You've got two upstream gradients. And you're working it out with respect to 01:03:09.080 |
each of them as in the chain rule. And then you're summing them together to work out the 01:03:15.240 |
impact at the end. Right. So we also saw some of the other node intuitions, which is useful 01:03:25.840 |
to have doing this. So when you have an addition, that distributes the upstream gradient to 01:03:35.480 |
each of the things below it. When you have max, it's like a routing node. So when you 01:03:43.000 |
have max, you have the upstream gradient and it goes to one of the branches below it and 01:03:48.960 |
the rest of them get no gradient. When you then have a multiplication, it has this effect 01:03:58.640 |
of switching the gradient. So if you're taking 3 by 2, the gradient on the 2 side is 3 and 01:04:07.800 |
on the 3 side is 2. And if you think about in terms of how much effect you get from when 01:04:14.160 |
you're doing this sort of wiggling, that totally makes sense. Right. Because if you're multiplying 01:04:19.400 |
another number by 3, then any change here is going to be multiplied by 3 and vice versa. 01:04:28.920 |
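The whole worked example can be written out in a few lines; this sketch recomputes the forward pass, applies the downstream = upstream times local rule by hand, and checks the resulting gradients (2, 5, and 0) against the same small wiggles used above:

```python
def f(x, y, z):
    return (x + y) * max(y, z)

x, y, z = 1.0, 2.0, 0.0
a, b = x + y, max(y, z)            # forward pass: a = 3, b = 2, f = 6

# backward pass: downstream = upstream * local, summing over outward branches
df_da, df_db = b, a                # the product node swaps its inputs' values
df_dx = df_da * 1.0                # a = x + y distributes the gradient
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)   # two branches sum: 2 + 3 = 5
df_dz = df_db * (1.0 if z > y else 0.0)                 # max routes the gradient: 0 here

eps = 0.1
print(df_dx, (f(x + eps, y, z) - f(x, y, z)) / eps)   # 2 vs. roughly 2.0
print(df_dy, (f(x, y + eps, z) - f(x, y, z)) / eps)   # 5 vs. roughly 5.1
print(df_dz, (f(x, y, z + eps) - f(x, y, z)) / eps)   # 0 vs. 0.0
```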
Okay. So this is the kind of computation graph that we want to use to work out derivatives 01:04:39.880 |
in an automated computational fashion, which is the basis of the backpropagation algorithm. 01:04:47.360 |
But at that point, this is what we're doing, but there's still one mistake that we can 01:04:53.640 |
make. It would be wrong for us to sort of say, okay, well, first of all, we want to 01:04:58.360 |
work out the SDB. So look, we can start up here. We can propagate our upstream errors, 01:05:07.760 |
work out local gradients, upstream error, local gradient, and keep all the way down 01:05:13.880 |
and get the DSDB down here. Okay. Next we want to do it for DSDW. Let's just run it 01:05:25.020 |
all over again. Because if we did that, we'd be doing repeated computation, as I showed 01:05:31.360 |
in the first half, that this term is the same both times, this term is the same both times, 01:05:38.680 |
this term is the same both times, that only the bits at the end differ. So what we want 01:05:44.680 |
to do is avoid duplicated computation and compute all the gradients that we're going 01:05:53.680 |
to need successively so that we only do them once. And so that was analogous to when I 01:06:00.500 |
introduced this delta variable when we computed gradients by hand. So starting off here from 01:06:08.160 |
D, we're starting off here with DSDS is one. We then want to one time compute gradients 01:06:21.080 |
in the green here. One time compute the gradient in green here. That's all common work. Then 01:06:28.360 |
we're going to take the local gradient for DZDB and multiply that by the upstream gradient 01:06:38.480 |
to have worked out DSDB. And then we're going to take the same upstream gradient and then 01:06:47.080 |
work out the local gradient here. And then sort of propagate that down to give us DSDW. 01:06:56.000 |
So the end result is we want to sort of systematically work to forward computation forward in the 01:07:03.480 |
graph and backward computation back propagation backward in the graph in a way that we do 01:07:10.380 |
things efficiently. So this is the general form of the algorithm which works for an arbitrary 01:07:22.680 |
computation graph. So at the end of the day, we've got a single scalar output Z. And then 01:07:31.720 |
we have inputs and parameters which compute Z. And so once we have this computation graph 01:07:41.680 |
and I added in this funky extra arrow here to make it a more general computation graph, 01:07:48.320 |
well, we can always say that we can work out a starting point, something that doesn't depend 01:07:55.320 |
on anything. So in this case, both of these bottom two nodes don't depend on anything 01:08:01.040 |
else. So we can start with them and we can start to compute forward. We can compute values 01:08:07.920 |
for all of these sort of second row from the bottom nodes. And then we're able to compute 01:08:15.200 |
the third ones up. So we can have a topological sort of the nodes based on the dependencies 01:08:22.840 |
in this directed graph. And we can compute the value of each node given some subset of 01:08:30.240 |
its predecessors, which it depends on. And so doing that is referred to as the forward 01:08:35.560 |
propagation phase and gives us a computation of the scalar output Z using our current parameters 01:08:43.880 |
and our current inputs. And so then after that, we run back propagation. So for back 01:08:50.920 |
propagation, we initialize the output gradient, dz dz as one. And then we visit nodes in the 01:09:00.760 |
reverse order of the topological sort and we compute the gradients downward. And so 01:09:08.280 |
our recipe is that for each node as we head down, we're going to compute the gradient 01:09:16.360 |
of the node with respect to its successors, the things that it feeds into and how we compute 01:09:25.000 |
that gradient is using this chain rule that we've looked at. So this is sort of the generalized 01:09:31.400 |
form of the chain rule where we have multiple outputs. And so we're summing over the different 01:09:37.120 |
outputs and then for each output, we're computing the product of the upstream gradient and the 01:09:43.360 |
local gradient with respect to that node. And so we head downwards and we continue down 01:09:51.240 |
the reverse topological sort order and we work out the gradient with respect to each 01:09:57.880 |
variable in this graph. And so it hopefully looks kind of intuitive looking at this picture 01:10:09.960 |
that if you think of it like this, the big O complexity of forward propagation and backward 01:10:18.140 |
propagation is the same, right? In both cases, you're doing a linear pass through all of 01:10:24.920 |
these nodes, calculating values given predecessors on the way forward and then gradients given successors on the way back. I mean, 01:10:33.080 |
you have to do a little bit more work for working out the gradients sort of as shown 01:10:38.960 |
by this chain rule, but it's the same big O complexity. So if somehow you're implementing 01:10:44.320 |
stuff for yourself rather than relying on the software, and you're calculating the gradients 01:10:49.720 |
at a different order of complexity than forward propagation, it means that you're 01:10:54.840 |
doing something wrong. You're doing repeated work that you shouldn't have to do. Okay. 01:11:00.120 |
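Here is a minimal Python sketch of that general algorithm; the Node, Input, Add, Mul, and run names are illustrative choices, not the API of any real framework. It does one topological sort, a forward sweep in that order, and a backward sweep in reverse order that applies the chain rule, with += accumulating gradients whenever a node feeds into more than one successor.

```python
class Node:
    """One node in a computation graph (illustrative sketch)."""
    def __init__(self, inputs=()):
        self.inputs = list(inputs)   # predecessor nodes this node depends on
        self.value = None            # filled in by the forward pass
        self.grad = 0.0              # accumulates dZ/d(this node) in the backward pass

    def forward(self):
        raise NotImplementedError

    def local_grads(self):
        """Return d(self)/d(input) for each input, in order."""
        raise NotImplementedError

class Input(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value
    def forward(self):
        pass
    def local_grads(self):
        return []

class Add(Node):
    def forward(self):
        a, b = self.inputs
        self.value = a.value + b.value
    def local_grads(self):
        return [1.0, 1.0]

class Mul(Node):
    def forward(self):
        a, b = self.inputs
        self.value = a.value * b.value
    def local_grads(self):
        a, b = self.inputs
        return [b.value, a.value]        # d(ab)/da = b, d(ab)/db = a

def topological_order(output):
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for p in node.inputs:
                visit(p)
            order.append(node)           # predecessors are appended before successors
    visit(output)
    return order

def run(output):
    order = topological_order(output)
    for node in order:                   # forward pass: one linear sweep
        node.forward()
    output.grad = 1.0                    # dZ/dZ = 1
    for node in reversed(order):         # backward pass: reverse topological order
        for parent, local in zip(node.inputs, node.local_grads()):
            parent.grad += node.grad * local   # chain rule; += sums over successors

# Tiny example: z = (a + b) * b, so dz/da = b and dz/db = a + 2b.
a, b = Input(2.0), Input(3.0)
z = Mul([Add([a, b]), b])
run(z)
print(z.value, a.grad, b.grad)           # 15.0, 3.0, 8.0
```

The += on parent.grad is what implements the sum over outputs in the generalized chain rule, and the backward sweep is the same single linear pass over the nodes as the forward sweep, which is why the two have the same big O complexity.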
So this algorithm works for a completely arbitrary computation graph, any directed acyclic graph, 01:11:10.040 |
you can apply this algorithm. In general, what we find is that we build neural networks 01:11:17.120 |
that have a regular layer structure. So we have things like a vector of inputs, and then 01:11:22.280 |
that's multiplied by a matrix. It's transformed into another vector, which might be multiplied 01:11:28.680 |
by another matrix or summed with another vector or something, right? So once we're using that 01:11:33.760 |
kind of regular layer structure, we can then parallelize the computation by working out 01:11:40.360 |
the gradients in terms of Jacobians of vectors and matrices and do things in parallel much 01:11:49.360 |
more efficiently. Okay. So doing this is then referred to as automatic differentiation. 01:11:58.140 |
And so essentially, if you know the computation graph, you should be able to have your clever 01:12:06.680 |
computer system work out what the derivatives of everything are, and then apply back propagation 01:12:19.000 |
to work out how to update the parameters and learn. And there's actually an interesting 01:12:26.840 |
way in which history has gone backwards here, which I'll just note. So some of you 01:12:35.960 |
might be familiar with symbolic computation packages. So those are things like Mathematica. 01:12:45.160 |
So Mathematica, you can give it a symbolic form of a computation, and then it can work 01:12:52.480 |
out derivatives for you. So it should be the case that if you give a complete symbolic 01:12:58.160 |
form of a computation graph, then it should be able to work out all the derivatives for 01:13:04.520 |
you and you never have to work out a derivative by hand whatsoever. And that was actually 01:13:10.680 |
attempted in a famous deep learning library called Theano, which came out of Yoshua Bengio's 01:13:17.400 |
group at the University of Montreal, that it had a compiler that did that kind of symbolic 01:13:24.240 |
manipulation. But somehow that sort of proved a little bit too hard a road to follow. I 01:13:35.640 |
imagine it actually might come back again in the future. And so for modern deep learning 01:13:41.880 |
frameworks, which include both TensorFlow and PyTorch, they do 90% of this computation 01:13:51.680 |
of automatic differentiation for you, but they don't actually symbolically compute derivatives. 01:13:59.240 |
So for each particular node or layer of your deep learning system, somebody, either you 01:14:08.400 |
or the person who wrote that layer, has handwritten the local derivatives. But then everything 01:14:18.020 |
from that point on, the chain rule work of combining upstream 01:14:24.820 |
gradients with local gradients to work out downstream gradients, that's then all being 01:14:30.240 |
done automatically for back propagation on the computation graph. And so what that means 01:14:37.680 |
is for a whole neural network, you have a computation graph and it's going to have a 01:14:43.680 |
forward pass and a backward pass. And so for the forward pass, you're topologically sorting 01:14:52.120 |
the nodes based on their dependencies in the computation graph. And then for each node, 01:15:00.200 |
you're running forward, the forward computation on that node. And then for backward propagation, 01:15:06.740 |
you're reversing the topological sort of the graph. And then for each node in the graph, 01:15:12.520 |
you're running the backward propagation, which is the little bit of backprop, the chain rule 01:15:17.840 |
at that node. And then the result of doing that is you have gradients for your inputs 01:15:25.620 |
and parameters. And so the overall software runs this for you. And so what you 01:15:36.120 |
want to do is then actually write the code for particular nodes or layers in the graph. So 01:15:43.880 |
if I have a multiply gate, it's going to have a forward algorithm, which just computes that 01:15:50.680 |
the output is x times y in terms of the two inputs. And then I'm also going to want 01:15:57.680 |
to tell it how to calculate the local derivative. So I want to say, what is the 01:16:03.440 |
local derivative? So dL/dx and dL/dy in terms of the upstream gradient, dL/dz. And so I will 01:16:15.400 |
then manually work out how to calculate that. And normally what I have to do is I assume 01:16:23.240 |
the forward pass is being run first. And I'm going to shove into some local variables for 01:16:30.720 |
my class, the values that were used in the forward computation. So as well as computing 01:16:36.720 |
z equals x times y, I'm going to sort of remember what x and y were. So that then when I'm asked 01:16:44.720 |
to compute the backward pass, I'm then going to have implemented here what we saw earlier, 01:16:52.760 |
that when z is x times y, you're going to sort of swap the y and the x to work out the local 01:17:01.240 |
gradients. And so then I'm going to multiply those by the upstream gradient. And I'm going 01:17:06.640 |
to return those-- I've just written it here as a sort of a little list, but really it's going to be the pair of downstream gradients that gets passed back. 01:17:17.040 |
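Here is roughly what that multiply gate might look like as code; this is a minimal sketch in the style just described, with the class name and method signatures chosen for illustration rather than taken from any particular framework. The forward pass caches its inputs, and the backward pass swaps them to form the local gradients and multiplies each by the upstream gradient.

```python
class MultiplyGate:
    """Illustrative node computing z = x * y."""

    def forward(self, x, y):
        # Cache the inputs: the backward pass needs them for the local gradients.
        self.x, self.y = x, y
        return x * y

    def backward(self, dL_dz):
        # Local gradients: dz/dx = y and dz/dy = x (the "swap"),
        # each multiplied by the upstream gradient dL/dz.
        dL_dx = dL_dz * self.y
        dL_dy = dL_dz * self.x
        return [dL_dx, dL_dy]

# Forward must run before backward so the cached values exist.
gate = MultiplyGate()
z = gate.forward(3.0, 4.0)            # z = 12.0
dL_dx, dL_dy = gate.backward(2.0)     # upstream gradient 2.0 -> 8.0 and 6.0
```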
OK, so that's 98% of what I wanted to cover today. Just a couple of quick comments left. 01:17:28.280 |
So that can and should all be automated. Sometimes you want to just check if you're computing 01:17:36.120 |
the right gradients. And so the standard way of checking that you're computing the right 01:17:41.560 |
gradients is to manually work out the gradient by doing a numeric calculation of the gradient. 01:17:49.120 |
And so you can do that. So you can work out what the derivative of f with respect 01:17:56.720 |
to x should be by choosing some sort of small number h, like 10 to the minus 4, adding it to 01:18:04.760 |
x and subtracting it from x, and evaluating f at both points. The difference between those two x values is 2h, so you divide 01:18:10.520 |
the difference in f through by 2h. And you're simply working out the rise over the run, which is the slope 01:18:16.120 |
at that point with respect to x. And that's an approximation of the gradient of f with respect to x. 01:18:27.000 |
So this is so simple you can't make a mistake implementing it. And so therefore you can 01:18:32.200 |
use this to check whether your gradient values are correct or not. This isn't something that 01:18:39.820 |
you'd want to use much because not only is it approximate, but it's extremely slow. Because 01:18:45.800 |
to work this out, you have to re-run the forward computation for every parameter of the model. 01:18:52.120 |
So if you have a model with a million parameters, you're now doing a million times as much work 01:18:56.720 |
to get your gradients as you would do if you were actually using calculus and backprop. So calculus is a 01:19:02.760 |
good thing to know. But it can be really useful to check that the right values are being calculated. 01:19:10.340 |
In the old days when we hand wrote everything, this was kind of the key unit test that people 01:19:15.580 |
used everywhere. These days, most of the time you're reusing layers that are built into 01:19:21.340 |
PyTorch or some other deep learning framework. So it's much less needed. But sometimes you're 01:19:26.740 |
implementing your own layer and you really do want to check that things are implemented correctly. 01:19:31.700 |
There's a fine point in the way this is written. If you saw this in sort of high school calculus 01:19:39.180 |
class, you will have seen rise over run of f of x plus h minus f of x divided by h. It 01:19:50.460 |
turns out that doing this two-sided estimate like this is much, much more accurate than 01:19:56.600 |
doing a one-sided estimate. And so you're really much encouraged to use this approximation. 01:20:02.220 |
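As a sketch of that check, here is a small numeric-gradient function using the two-sided (central difference) estimate; the function name, the test function, and the tolerance are illustrative choices rather than anything from the course starter code.

```python
import numpy as np

def numeric_gradient(f, x, h=1e-4):
    """Two-sided estimate (f(x+h) - f(x-h)) / 2h, one coordinate at a time."""
    x = x.astype(float)                # work on a copy so the caller's x is untouched
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old                   # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)   # rise over run
    return grad

# Check an analytic gradient: for f(x) = sum(x**2), the gradient is 2x.
x = np.random.randn(5)
analytic = 2 * x
numeric = numeric_gradient(lambda v: np.sum(v ** 2), x)
print(np.allclose(analytic, numeric, atol=1e-6))   # expect True
```

Note that it calls f twice per parameter, which is exactly why this is only a sanity check and never how you would compute gradients during training.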
Okay. So at that point, we've mastered the core technology of neural nets. Backpropagation 01:20:10.060 |
is recursively and hence efficiently applying the chain rule along the computation graph 01:20:16.620 |
with this sort of key step that downstream gradient equals upstream gradient times local 01:20:24.580 |
gradient. And so for calculating with neural nets, we do the forward pass to work out values 01:20:31.300 |
with current parameters, then run backpropagation to work out the gradient of the currently 01:20:39.860 |
computed loss with respect to those parameters. 01:20:45.180 |
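That forward-then-backward pattern is exactly what the frameworks expose. Here is a minimal PyTorch sketch, with a made-up one-layer scoring model and a squared-error loss chosen purely for illustration:

```python
import torch

# Toy input, target, and parameters (shapes chosen arbitrarily for illustration).
x = torch.randn(4)
y_true = torch.tensor(1.0)
W = torch.randn(3, 4, requires_grad=True)
b = torch.randn(3, requires_grad=True)
u = torch.randn(3, requires_grad=True)

# Forward pass: compute the current loss (PyTorch records the computation graph).
h = torch.sigmoid(W @ x + b)
s = u @ h
loss = (s - y_true) ** 2

# Backward pass: backpropagate through the recorded graph.
loss.backward()

# Gradients of the loss with respect to each parameter are now populated.
print(W.grad.shape, b.grad.shape, u.grad.shape)
```

The call to loss.backward() is the reverse topological sweep described above: the framework combines the handwritten local gradients of each operation with the upstream gradients for you.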
Now to some extent, you know, with modern deep learning frameworks, you don't actually 01:20:50.620 |
have to know how to do any of this, right? It's the same as you don't have to know how 01:20:55.480 |
to implement a C compiler. You can just write C code and say GCC and it will compile it 01:21:03.740 |
and it will run the right stuff for you. And that's the kind of functionality you get from 01:21:10.220 |
the PyTorch framework. So do come along to the PyTorch tutorial this Friday and get a 01:21:16.100 |
sense about how easy it is to write neural networks using a framework like PyTorch or 01:21:22.020 |
TensorFlow. And you know, it's so easy. That's why, you know, high school students across 01:21:27.540 |
the nation are now doing their science projects, training deep learning systems, because you 01:21:33.340 |
don't actually have to understand very much: you just plug a few neural network layers together 01:21:39.380 |
and set it computing on some data. But, you know, we hope in this class that you actually 01:21:45.220 |
are also learning how these things are implemented. So you have a deeper understanding than that. 01:21:53.420 |
And you know, it turns out that sometimes you need to have a deeper understanding. So 01:21:57.980 |
back propagation doesn't always work perfectly. And so understanding what it's really doing 01:22:03.780 |
can be crucial to debugging things. And so we'll actually see an example of that fairly 01:22:08.940 |
soon when we start looking at recurrent models and some of the problems that they have, which 01:22:14.580 |
will require us to think a bit more deeply about what's happening in our gradient computations.