
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 3 - Backprop and Neural Networks


Chapters

0:00 Introduction
0:40 Assignments
4:51 Named Entity Recognition
8:16 Neural Network
13:45 Gradient Computing
21:29 Nonlinearities
23:54 Jacobians
26:34 Neural Net
35:32 Jacobian
37:06 Shape Convention
42:15 Jacobian vs Shape Convention
48:16 Computation Graph
54:12 Example

Transcript

Hi, everyone. I'll get started. Okay. So we're now back to the second week of CS224N on natural language processing with deep learning. Okay. So for today's lecture, what we're going to be looking at is all the math details of doing neural net learning. First of all, looking at how we can work out by hand gradients for training neural networks and then looking at how it's done more algorithmically, which is known as the back propagation algorithm.

And correspondingly for you guys, well, I hope you've remembered that, you know, one minute ago was when assignment one was due and everyone has handed that in. If by some chance you haven't handed it in, really should hand it in as soon as possible. Best to preserve those late days for the harder assignments.

So, I mean, I actually forgot to mention, we actually did make one change for this year to make it a bit easier when occasionally people join the class a week late. If you want, this year in the grading, assignment one can be discounted and we'll just use your other four assignments.

But if you've been in the class so far for that 98% of people, well, since assignment one is the easiest assignment, again, it's silly not to do it and have it as part of your grade. Okay. So starting today, we've put out assignment two and assignment two is all about making sure you really understand the math of neural networks and then the software that we use to do that math.

So, this is going to be a bit of a tough week for some. So, for some people who are great on all their math and backgrounds, they'll feel like this is stuff they know well, nothing very difficult. But I know there are quite a few of you who this lecture and week is the biggest struggle of the course.

We really do want people to actually have an understanding of what goes on in neural network learning rather than viewing it as some kind of deep magic. And I hope that some of the material we give today and that you read up on and use in the assignment will really give you more of a sense of what these neural networks are doing and how it is just math that's applied in a systematic large scale that works out the answers and that this will be valuable and give you a deeper sense of what's going on.

But if this material seems very scary and difficult, you can take some refuge in the fact that the light at the end of the tunnel comes fast, since this is really the only lecture that's heavily going through the math details of neural networks. After that, we'll be kind of popping back up to a higher level.

And by and large, after this week, we'll be making use of software to do a lot of the complicated math for us. But nevertheless, I hope this is valuable. I'll go through everything quickly today. But if this isn't stuff that you know backwards, I really do encourage you to, you know, work through it and get help as you need it.

So do come along to our office hours. There are also a number of pieces of tutorial material given in the syllabus. So there's both the lecture notes, there's some materials from CS231N. In the list of readings, the very top reading is some material put together by Kevin Clark a couple of years ago.

And actually, that one's my favorite. The presentation there fairly closely follows the presentation in this lecture of going through matrix calculus. So, you know, personally, I'd recommend starting with that one. But there are four different ones you can choose from if one of them seems more helpful to you.

Two other things on what's coming up. Actually, for Thursday's lecture, we make a big change. And Thursday's lecture is probably the most linguistic lecture of the whole class, where we go through the details of dependency grammar and dependency parsing. Some people find that tough as well, but at least it'll be tough in a different way.

And then one other really good opportunity is this Friday, we have our second tutorial at 10am, which is an introduction to PyTorch, which is the deep learning framework that we'll be using for the rest of the class, once we've gone through these first two assignments where you do things by yourself.

So this is a great chance to get an intro to PyTorch. It'll be really useful for later in the class. Okay. Today's material is really all about sort of the math of neural networks. But just to sort of introduce a setting where we can work through this, I'm going to introduce a simple NLP task and a simple form of classifier that we can use for it.

So the task of named entity recognition is a very common basic NLP task. And the goal of this is you're looking through pieces of text and you're wanting to label, by labeling the words, which words belong to entity categories like persons, locations, products, dates, times, etc. So for this piece of text: last night, Paris Hilton wowed in a sequined gown.

Samuel Quinn was arrested in the Hilton Hotel in Paris in April 1989. Some words are being labeled as named entities as shown. These two sentences don't actually belong together in the same article, but I chose those two sentences to illustrate the basic point that it's not that you can just do this task by using a dictionary.

Yes, a dictionary is helpful to know that Paris can possibly be a location, but Paris can also be a person name. So you have to use context to get named entity recognition right. Okay, well how might we do that with a neural network? There are much more advanced ways of doing this, but a simple yet already pretty good way of doing named entity recognition with a simple neural net is to say, well, what we're going to do is use the word vectors that we've learned about, and we're going to build up a context window of word vectors.

And then we're going to put those through a neural network layer and then feed it through a softmax classifier of the kind that we, sorry, I said that wrong. And then we're going to feed it through a logistic classifier of the kind that we saw when looking at negative sampling, which is going to say for a particular entity type, such as location, is it high probability location or is it not a high probability location?

So for a sentence like the museums in Paris are amazing to see, what we're going to do is for each word, say we're doing the word Paris, we're going to form a window around it, say a plus or minus two word window. And so for those five words, we're going to get word vectors for them from the kind of word to vector or glove word vectors we've learned.

And we're going to make a long vector out of the concatenation of those five word vectors. So the word of interest is in the middle. And then we're going to feed this vector to a classifier, which is at the end going to have a probability of the word being a location.

And then we could have another classifier that says the probability of the word being a person name. And so once we've done that, we're then going to run it at the next position. So we then say, well, is the word are a location? And we'd feed a window of five words as then in Paris are amazing to and put it through the same kind of classifier.

And so this is the classifier that we'll use. So its input will be this word window. So if we have D dimensional word vectors, this will be a five D vector. And then we're going to put it through a layer of a neural network. So the layer of the neural network is going to multiply this vector by a matrix, add on a bias vector, and then put that through an elementwise nonlinearity, such as the logistic (sigmoid) function that we've seen before.

And that will give us a hidden vector, which might be of a smaller dimensionality such as this one here. And so then with that hidden vector, we're then going to take the dot product of it with an extra vector here. Here's a U. So we take U dot product H.

And so when we do that, we're getting out a single number and that number can be any real number. And so then finally, we're going to put that number through a logistic transform of the same kind that we saw when doing negative sampling. The logistic transform will take any real number and it will transform it into a probability that that word is a location.

So its output is the predicted probability of the word belonging to a particular class. And so this could be our location classifier, which could classify each word in a window as to what the probability is that it's a location word. And so this little neural network here is the neural network I'm going to use today when going through some of the math.
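To make the whole pipeline concrete, here is a minimal NumPy sketch of that window classifier's forward pass. This is an editor's illustration, not the lecture's code: the dimensions, random values, and variable names are all made up for the example.

```python
import numpy as np

# Sketch of the window classifier: d = 4-dimensional word vectors, a +/-2 word
# window, so the input is a 5d = 20 dimensional concatenated vector.
d, window = 4, 5
n_hidden = 8

rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(window, d))    # stand-ins for "museums in Paris are amazing"
x = word_vecs.reshape(-1)                   # concatenate into one 5d vector

W = rng.normal(size=(n_hidden, window * d)) # layer weights
b = rng.normal(size=n_hidden)               # bias vector
u = rng.normal(size=n_hidden)               # output vector

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z = W @ x + b               # linear layer
h = sigmoid(z)              # elementwise nonlinearity
s = u @ h                   # the score: a single real number
prob_location = sigmoid(s)  # logistic transform -> probability of "location"
print(prob_location)
```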

But actually I'm going to make it even easier on myself. I'm going to throw away the logistic function at the top, and I'm really just going to work through the math of the bottom three quarters of this. If you look at Kevin Clark's handout that I just mentioned, he includes when he works through it, also working through the logistic function.

And we also saw working through a softmax in the first lecture when I was working through some of the word2vec model. Okay. So the overall question we want to be able to answer is this: here's our stochastic gradient descent equation, where we have existing parameters of our model, and we want to update them based on our current loss, which is J of theta.

So for getting our loss here, that the true answer as to whether a word is a location or not will be either, you know, one if it is a location or zero if it isn't. Our logistic classifier will return some number like .9, and we'll use the distance away from what it should have been squared as our loss.

So we work out a loss, and then we're moving a little distance in the negative of the gradient, which will be changing our parameter estimates in such a way that they reduce the loss. And so this is already being written in terms of a whole vector of parameters, which is being updated as to a new vector of parameters.

But you can also think about it that for each individual parameter, theta J, that we're working out the partial derivative of the loss with respect to that parameter, and then we're moving a little bit in the negative direction of that. That's going to give us a new value for parameter theta J.
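Written out as formulas (reconstructing the update being described, with alpha as the learning rate, i.e. the small step size):

```latex
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)
\qquad\text{or, per parameter,}\qquad
\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}
```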

And we're going to update all of the parameters of our model as we learn. I mean, in particular, in contrast to what commonly happens in statistics, we update not only the parameters of our model that are weights in the classifier, but we also will update our data representation.

So we'll also be changing our word vectors as we learn. Okay. So to build neural nets, i.e. to train neural nets based on data, what we need is to be able to compute this gradient of the parameters so that we can then iteratively update the weights of the model and efficiently train a model that has good weights, i.e.

that has high accuracy. And so how can we do that? Well, what I'm going to talk about today is first of all, how you can do it by hand. And so for doing it by hand, this is basically a review of matrix calculus. And that'll take quite a bit of the lecture.

And then after we've talked about that for a while, I'll then shift gears and introduce the back propagation algorithm, which is the central technology for neural networks. And that technology is essentially the efficient application of calculus on a large scale, as we'll come to talking about soon. So for computing gradients by hand, what we're doing is matrix calculus.

So we're working with vectors and matrices and working out gradients. And this can seem like pretty scary stuff. And well, to the extent that you're kind of scared and don't know what's going on, one choice is to work out a non-vectorized gradient by just working out what the partial derivative is for one parameter at a time.

And I showed a little example of that in the first lecture. But it's much, much faster and more useful to actually be able to work with vectorized gradients. And in some sense, if you're not very confident, this is kind of almost a leap of faith. But it really is the case that multivariable calculus is just like single variable calculus, except you're using vectors and matrices.

So providing you remember some basics of single variable calculus, you really should be able to do this stuff and get it to work out. Lots of other sources. I've mentioned the notes. You can also look at the textbook for Math 51, which also has quite a lot of material on this.

I know some of you have bad memories of Math 51. Okay. So let's go through this and see how it works, ramping up from the beginning. So the beginning of calculus is, you know, we have a function with one input and one output, f of x equals x cubed.

And so then its gradient is its slope, right? So that's its derivative. So its derivative is 3x squared. And the way to think about this is how much will the output change if we change the input a little bit, right? So what we're wanting to do in our neural net models is change what they output so that they do a better job of predicting the correct answers when we're doing supervised learning.

And so what we want to know is if we fiddle different parameters of the model, how much of an effect will that have on the output? Because then we can choose how to fiddle them in the right way to move things down, right? So, you know, when we're saying that the derivative here is 3x squared, well, what we're saying is that if you're at x equals 1, if you fiddle the input a little bit, the output will change three times as much, 3 times 1 squared.

And it does. So if I say what's the value at 1.01, it's about 1.03. It's changed three times as much, and that's its slope. But at x equals 4, the derivative is 16 times 3, 48. So if we fiddle the input a little, it'll change 48 times as much.

And that's roughly what happens. 4.01 cubed is 64.48. Now, of course, you know, this is just sort of showing it for a small fiddle, but, you know, that's an approximation to the actual truth. Okay. So then we sort of ramp up to the more complex cases, which are more reflective of what we do with neural networks.
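As a quick editor's aside before ramping up, those two slopes are easy to check numerically (a sketch; the 0.01 step is arbitrary):

```python
# A quick numeric check of the "wiggle the input" intuition for f(x) = x cubed.
f = lambda x: x ** 3

for x in (1.0, 4.0):
    analytic = 3 * x ** 2                   # the derivative 3x^2
    wiggle = (f(x + 0.01) - f(x)) / 0.01    # how much the output moves per unit of input wiggle
    print(x, analytic, round(wiggle, 4))    # 1.0: 3 vs ~3.0301; 4.0: 48 vs ~48.1201
```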

So if we have a function with one output and n inputs, then we have a gradient. So a gradient is a vector of partial derivatives with respect to each input. So we've got n inputs, x1 to xn, and we're working out the partial derivative f with respect to x1, the partial derivative f with respect to x2, et cetera.

And we then get a vector of partial derivatives, where each element of this vector is just like a simple derivative with respect to one variable. Okay. So from that point, we just keep on ramping up for what we do with neural networks. So commonly when we have something like a layer in a neural network, we'll have a function with n inputs that will be like our word vectors.

Then we do something like multiply by a matrix, and then we'll have m outputs. So we have a function now which is taking n inputs and is producing m outputs. So at this point, what we're calculating for the gradient is what's called a Jacobian matrix. So for n inputs and m outputs, the Jacobian is an m by n matrix of every combination of partial derivatives.

So function f splits up into these different sub functions, f1 through fm, which generate each of the m outputs. And so then we're taking the partial derivative of f1 with respect to x1 through the partial derivative of f1 with respect to xn. Then heading down, you know, we make it up to the partial derivative of fm with respect to x1, et cetera.
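Written out, the matrix being described, for a function f with n inputs and m outputs, is:

```latex
\frac{\partial \mathbf{f}}{\partial \mathbf{x}}
= \begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix},
\qquad
\left(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial f_i}{\partial x_j}
```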

So we have every possible partial derivative of an output variable with respect to one of the input variables. Okay. So in simple calculus, when you have a composition of one variable functions, so that if you have y equals x squared and then z equals 3y, that's then z is a composition of two functions of - or you're composing two functions to get z as a function of x.

Then you can work out the derivative of z with respect to x. And the way you do that is with the chain rule. And so in the chain rule, you multiply derivatives. So dz dx equals dz dy times dy dx. So dz dy is just 3 and dy dx is 2x.

So we get 3 times 2x. So that overall derivative here is 6x. And since if we multiply this together, we're really saying that z equals 3x squared, you should trivially be able to see again, aha, its derivative is 6x. So that works. Okay. So once we move into vectors and matrices and Jacobians, it's actually the same game.

So when we're working with those, we can compose functions and work out the derivatives by simply multiplying Jacobians. So if we start with an input x and put it through the simplest form of neural network layer, we say that z equals wx plus b. So we multiply the x vector by the matrix w and then add on a bias vector b.

And then typically we'd put things through a nonlinearity f. So f could be a sigmoid function. We'll then say h equals f of z. So this is the composition of two functions in terms of vectors and matrices. So we can use Jacobians and we can say the partial of h with respect to x is going to be the product of the partial of h with respect to z and the partial of z with respect to x.
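In symbols, that composition and its chain rule are:

```latex
\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}, \qquad \mathbf{h} = f(\mathbf{z}),
\qquad
\frac{\partial \mathbf{h}}{\partial \mathbf{x}}
= \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \, \frac{\partial \mathbf{z}}{\partial \mathbf{x}}
```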

And this all does work out. So let's start going through some examples of how these things work slightly more concretely. First, just particular Jacobians and then composing them together. So one case we look at is the nonlinearities that we put a vector through. So this is something like putting a vector through the sigmoid function f.

And so if we have an intermediate vector z and we're turning it into a vector h by putting it through a logistic function, we can say what is dh dz? Well, formally this is a function that has n inputs and n outputs. So at the end of the day, we're computing an n by n Jacobian.

And so what that's meaning is the elements of this n by n Jacobian are going to take the partial derivative of each output with respect to each input. And well, what is that going to be in this case? Well, in this case, because we're actually just computing element-wise a transformation such as a logistic transform of each element zi, like the second equation here, if i equals j, we've got something to compute.

Whereas if i doesn't equal j, the input just has no influence on the output. And so the derivative is zero. So if i doesn't equal j, we're going to get a zero. And if i does equal j, then we're going to get the regular one variable derivative of the logistic function, which if I remember correctly, you were asked to compute.

Now I can't remember what's assignment one or assignment two, but one of the two asks you to compute it. So our Jacobian for this case looks like this. We have a diagonal matrix with the derivatives of each element along the diagonal and everything else is zero. Okay, so let's look at a couple of other Jacobians.
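Before those, here is the diagonal Jacobian just described, written out for an elementwise h = f(z):

```latex
\frac{\partial h_i}{\partial z_j} =
\begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
\qquad\Longrightarrow\qquad
\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \operatorname{diag}\!\big(f'(\mathbf{z})\big)
```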

So if we are asking, if we've got this WX plus B basic neural network layer, and we're asking for the gradient with respect to X, then what we're going to have coming out is that that's actually going to be the matrix W. So this is where, what I hope you can do is look at the notes at home and work through this exactly and see that this is actually the right answer.

But this is the way in which if you just have faith and think this is just like single variable calculus, except I've now got vectors and matrices, the answer you get is actually what you expected to get, because this is just like the derivative of AX plus B with respect to X where it's A.

So similarly, if we take the partial derivative with respect to B of WX plus B, we get out the identity matrix. Okay, then one other Jacobian that we mentioned while in the first lecture while working through Word2Vec is if you have the dot product of two vectors, i.e. that's a number, that what you get coming out of that, so the partial derivative of UTH with respect to U is H transpose.

And at this point, there's some fine print that I'm going to come back to in a minute. So this is the correct Jacobian, right? Because in this case, we have the dimension of H inputs and we have one output. And so we want to have a row vector. But there's a little bit more to say on that that I'll come back to in about 20 slides.

But this is the correct Jacobian. Okay, so if you are not familiar with these kind of Jacobians, do please look at some of the notes that are available and try and compute these in more detail element-wise and convince yourself that they really are right. But I'm going to assume these now and show you what happens when we actually then work out gradients for at least a mini little neural net.
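If you want a concrete way to convince yourself, here is a small sketch (an editor's illustration, assuming PyTorch is available) that checks those three Jacobians numerically with torch.autograd.functional.jacobian; the sizes and values are arbitrary:

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
n, m = 3, 4                        # z and b have n entries, x has m entries
W = torch.randn(n, m)
x = torch.randn(m)
b = torch.randn(n)
u = torch.randn(n)
h = torch.randn(n)

# d(Wx + b)/dx should be the matrix W
print(torch.allclose(jacobian(lambda x_: W @ x_ + b, x), W))
# d(Wx + b)/db should be the identity matrix
print(torch.allclose(jacobian(lambda b_: W @ x + b_, b), torch.eye(n)))
# d(u . h)/du should be h (the row-vector Jacobian, returned here as a 1-D tensor)
print(torch.allclose(jacobian(lambda u_: u_ @ h, u), h))
```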

Okay. So here is most of this neural net. I mean, as I commented, really we'd be working out the partial derivative of the loss J with respect to these variables. But for the example I'm doing here, I've lopped that off to keep it a little simpler and more manageable for the lecture.

And so we're going to just work out the partial derivative of the score S, which is a real number, with respect to the different parameters of this model, where the parameters of this model are going to be the W and the B and the U and also the input, because we can also update the word vectors of different words, tuning them to better predict the classification outputs that we desire.

So let's start off with a fairly easy one where we want to update the bias vector B to have our system classify better. So to be able to do that, what we want to work out is the partial derivatives of S with respect to B. So we know how to put that into our stochastic gradient update for the B parameters.

Okay. So how do we go about doing these things? So the first step is we want to sort of break things up into different functions of minimal complexity that compose together. So in particular, this neural net layer, H equals F of WX plus B, it's still a little bit complex.

So let's decompose that one further step. So we have the input X, we then calculate the linear transformation Z equals WX plus B. And then we put things through the sort of element wise nonlinearity, H equals F of Z, and then we do the dot product with U. And, you know, it's useful for working these things out to, you know, split into pieces like this, have straight what your different variables are, and to know what the dimensionality of each of these variables is.

It's well worth just writing out the dimensionality of every variable and making sure that the answers that you're computing are of the right dimensionality. So at this point, though, what we can see is that calculating S is the product of three - sorry, is the composition of three functions around X.

So for working out the partials of S with respect to B, it's the composition of the three functions shown on the left. And so therefore, for the gradient of S with respect to B, we're going to take the product of these three partial derivatives. Okay, so what do we have? We've got S equals UT H at the top, so that gives the corresponding partial derivative of S with respect to H, then the partial derivative of H with respect to Z, and then the partial derivative of Z with respect to B, which is the case we're working out first.

Okay, so we want to work this out, and if we're lucky, we remember those Jacobians I showed previously about the Jacobian for a vector dot product, the Jacobian for the non-linearity, and the Jacobian for the simple linear transformation. And so we can use those. So for the partials of S with respect to H, well, that's going to be UT using the first one.

The partials of H with respect to Z, okay, so that's the non-linearity, and so that's going to be the matrix, it's the diagonal matrix with the element wise derivative F prime of Z and zero elsewhere. And then for the WX plus B, when we're taking the partials with respect to B, that's just the identity matrix.

So we can simplify that down a little: the identity matrix disappears, and since UT is a vector and this is a diagonal matrix, we can rewrite this as UT Hadamard product with F prime of Z. I think this is the first time I've used this little circle for a Hadamard product, but it's something that you'll see quite a bit in neural network work, since it's often used.

So when we have two vectors, UT and this vector here, sometimes you want to do an element wise product. So the output of this will be a vector where you've taken the first element of each and multiplied them, the second element of each and multiplied them, et cetera, downwards.

And so that's called the Hadamard product, and it's what we're calculating here to get a vector, which is the gradient of S with respect to B. Okay. So that's good. So we now have a gradient of S with respect to B, and we could use that in our stochastic gradient update, but we don't stop there.
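Before moving on, here is a minimal sketch (an editor's illustration, not the lecture's code) that checks this hand-derived gradient, U Hadamard F prime of Z, against PyTorch's autograd for small random values:

```python
import torch

torch.manual_seed(0)
n, m = 4, 5
W = torch.randn(n, m)
x = torch.randn(m)
b = torch.randn(n, requires_grad=True)
u = torch.randn(n)

z = W @ x + b
h = torch.sigmoid(z)               # f = sigmoid as the elementwise nonlinearity
s = u @ h                          # the score
s.backward()                       # autograd's dS/dB ends up in b.grad

zd = z.detach()
f_prime = torch.sigmoid(zd) * (1 - torch.sigmoid(zd))   # sigmoid'(z)
hand_grad = u * f_prime            # U Hadamard f'(z), the hand-derived gradient
print(torch.allclose(b.grad, hand_grad))                 # should print True
```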

We also want to work out the gradient with respect to others of our parameters. So we might want to next go on and work out the gradient of S with respect to W. Well, we can use the chain rule just like we did before, right? So we've got the same product of functions and everything is going to be the same apart from now taking the derivatives with respect to W rather than B.

So it's now going to be the partial of S with respect to H, H with respect to Z, and Z with respect to W. And the important thing to notice here, and this leads into what we do with the backpropagation algorithm, is wait a minute, this is very similar to what we've already done.

So when we were working out the gradients of S with respect to B, the first two terms were exactly the same. It's only the last one that differs. So to be able to build or to train neural networks efficiently, this is what happens all the time. And it's absolutely essential that we use an algorithm that avoids repeated computation.

And so the idea we're going to develop is when we have this equation stack, that there's sort of stuff that's above where we compute Z, and we're going to be sort of, that'll be the same each time. And we want to compute something from that, that we can then sort of feed downwards when working out the gradients with respect to W, X, or B.

And so we do that by defining delta, where delta is the composition of the partials that sit above the linear transform. And that's referred to as the local error signal. It's what's being passed in from above the linear transform. And we've already computed the gradient of that in the preceding slides.

And so the final form of the partial of S with respect to B will be delta times the remaining part. And well, we'd seen that, you know, for partial S with respect to B, the partial of Z with respect to B is just the identity. So the end result was delta.

But this time, we're then going to have to work out the partial of Z with respect to W and multiply that by delta. So that's the part that we still haven't yet done. And this is where things get, in some sense, a little bit hairier. And so there's something that's important to explain.

So, you know, what should we have for the Jacobian of DSDW? Well, that's a function that has one output. The output is just a score, a real number. And then it has N times M inputs. So the Jacobian is a 1 by NM matrix, i.e. a very long row vector.

But that's correct math. But it turns out that that's kind of bad for our neural networks. Because remember, what we want to do with our neural networks is do stochastic gradient descent. And we want to say theta new equals theta old minus a small multiplier times the gradient. And well, actually, the W matrix is an N by M matrix.

And so we couldn't actually do this subtraction if this gradient we calculate is just a huge row vector. We'd like to have it as the same shape as the W matrix. In neural network land, when we do this, we depart from pure math at this point. And we use what we call the shape convention.

So what we're going to say is, and you're meant to use this for answers in the assignment, that the shape of the gradient we're always going to make to be the shape of the parameters. And so therefore, DSDW, we are also going to represent as an N by M matrix just like W.

And we're going to reshape the Jacobian to place it into this matrix shape. Okay, so if we want to place it into this matrix shape, what do we, what are we going to want to get for DSDW? Well, we know that it's going to involve delta, our local error signal.

And then we have to work out something for DZDW. Well, since Z equals WX plus B, you'd kind of expect that the answer should be X. And that's right. So the answer to DSDW is going to be delta transpose times X transpose. And so the form that we're getting for this derivative is going to be the product of the local error signal that comes from above and what we calculate from the local input X.

So that shouldn't yet be obvious why that is true. So let me just go through in a bit more detail why that's true. So when we want to work out DSDW, right, it's sort of delta times DZDW, where what that's computing for Z is WX plus B. So let's just consider for a moment what the derivative is with respect to a single weight WIJ.

So WIJ might be W23, as shown in my little neural network here. And so the first thing to notice is that WIJ only contributes to ZI. So W23 is going into Z2, which then computes H2. And it has no effect whatsoever on H1. OK, so when we're working out DZI DWIJ, ZI is that row of the matrix times X, plus BI, which means that we've got a sum over K of WIK times XK.

And then for this sum, this is like one variable calculus that when we're taking the derivative of this with respect to WIJ, every term in this sum is going to be 0. The derivative is going to be 0 except for the one that involves WIJ. And then the derivative of that is just like AX with respect to A.

It's going to be X. So you get XJ out as the answer. And so the end result of that is that when we're working out, what we want as the answer is that we're going to get that these columns where X1 is all that's left, X2 is all that's left, through XM is all that's left.

And then that's multiplied by the vectors of the local error signal from above. And what we want to compute is this outer product matrix where we're getting the different combinations of the delta and the X. And so we can get the N by M matrix that we'd like to have via our shape convention by taking delta transpose, which is N by 1, times X transpose, which is 1 by M.

And then we get this outer product matrix. So that's a kind of a hacky argument that I've made. It's certainly a way of doing things that the dimensions work out and sort of make sense. There's a more detailed run through this that appears in the lecture notes. And I encourage you to sort of also look at the more mathy version of that.
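Here is the same shape-convention answer checked in code (an editor's sketch with made-up sizes): the hand-computed outer product delta transpose times X transpose comes out N by M, the same shape as W, and matches autograd:

```python
import torch

torch.manual_seed(0)
n, m = 4, 5
W = torch.randn(n, m, requires_grad=True)
x = torch.randn(m)
b = torch.randn(n)
u = torch.randn(n)

z = W @ x + b
h = torch.sigmoid(z)
s = u @ h
s.backward()                                   # autograd's dS/dW ends up in W.grad

zd = z.detach()
delta = u * torch.sigmoid(zd) * (1 - torch.sigmoid(zd))   # local error signal, shape (n,)
hand_grad = torch.outer(delta, x)                          # delta^T x^T: the n by m outer product
print(hand_grad.shape)                                     # torch.Size([4, 5]), same shape as W
print(torch.allclose(W.grad, hand_grad))                   # should print True
```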

Here's a little bit more information about the shape convention. So well, first of all, one more example of this. So when you're working out DSDB, its Jacobian comes out as a row vector. But similarly, you know, according to the shape convention, we want our gradient to be the same shape as B, and B is a column vector.

So that's sort of, again, they're different shapes and you have to transpose one to get the other. And so effectively what we have is a disagreement between the Jacobian form. So the Jacobian form makes sense for, you know, calculus and math, because if you want to have it like I claimed that matrix calculus is just like single variable calculus apart from using vectors and matrices, you can just multiply together the partials.

That only works out if you're using Jacobians. But on the other hand, if you want to do stochastic gradient descent and be able to sort of subtract off a piece of the gradient, that only works if you have the same shape matrix for the gradient as you do for the original matrix.

And so this is a bit confusing, but that's just the reality. There are both of these two things. So the Jacobian form is useful in doing the calculus. But for the answers in the assignment, we want the answers to be presented using the shape convention so that the gradient is shown in the same shape as the parameters.

And therefore, you'll be able to it's the right shape for doing a gradient update by just subtracting a small amount of the gradient. So for working through things, there are then basically two choices. One choice is to work through all the math using Jacobians and then right at the end to reshape following the shape convention to give the answer.

So that's what I did when I worked out DSDB. We worked through it using Jacobians. We got an answer, but it turned out to be a row vector. And so, well, then we have to transpose it at the end to get it into the right shape for the shape convention.

The alternative is to always follow the shape convention. And that's kind of what I did when I was then working out DSDW. I didn't fully use Jacobians. I said, oh, well, when we work out whatever it was, DZDW, let's work out what shape we want it to be and what to fill in the cells with.

And if you're sort of trying to do it immediately with the shape convention, it's a little bit more hacky in a way since you have to look at the dimensions for what you want and figure out when to transpose or to reshape the matrix to be at the right shape.

But the kind of informal reasoning that I gave is what you do and what works. And there are sort of hints that you can use, right: you know that your gradient should always be the same shape as your parameters, and you know that the error signal coming in will always have the same dimensionality as that hidden layer, and you can sort of work it out always following the shape convention.

Okay. So that is, hey, doing this is all matrix calculus. So after pausing for breath for a second, the rest of the lecture is then, okay, let's look at how our software trains neural networks using what's referred to as the back propagation algorithm. So the short answer is, you know, basically we've already done it.

The rest of the lecture is easy. So, you know, essentially I've just shown you what the back propagation algorithm does. So the back propagation algorithm is judiciously taking and propagating derivatives using the matrix chain rule. The rest of the back propagation algorithm is to say, okay, when we have these neural networks, we have a lot of shared structure and shared derivatives.

So what we want to do is maximally efficiently reuse derivatives of higher layers when we're computing derivatives for lower layers so that we minimize computation. And I already pointed that out in the first half, but we want to systematically exploit that. And so the way we do that in our computational systems is they construct computation graphs.

So this maybe looks a little bit like what you saw in a compilers class if you did one, right, that you're creating, I call it here a computation graph, but it's really a tree, right? So you're creating here this tree of computations in this case, but in the more general case, it's some kind of directed graph of computations, which has source nodes, which are inputs, either inputs like X or input parameters like W and B, and its interior nodes are operations.

And so then once we've constructed a graph, and so this graph corresponds to exactly the example I did before, right? This was our little neural net that's in the top right. And here's the corresponding computation graph: compute WX plus B, put it through the sigmoid non-linearity F, and then the dot product of the resulting vector with U gives us our output score S.

Okay, so what we do to compute this is we pass along the edges the results of operations. So this is WX, then Z, then H, and then our output is S. And so the first thing we want to be able to do to compute with neural networks is to be able to compute for different inputs what the output is.

And so that's referred to as forward propagation. And so we simply run this expression much like you'd standardly do in a compiler to compute the value of S, and that's the forward propagation phase. But the essential additional element of neural networks is that we then also want to be able to send back gradients, which will tell us how to update the parameters of the model.

And so it's this ability to send back gradients, which gives us the ability for these models to learn. Once we have a loss function at the end, we can work out how to change the parameters of the model so that they more accurately produce the desired output, i.e. they minimize the loss.

And so it's doing that part that then is called back propagation. So we then, once we forward propagated a value with our current parameters, we then head backwards reversing the direction of the arrows and pass along gradients down to the different parameters like B and W and U that we can use to change using stochastic gradient descent what the value of B is and what the value of W is.

So we start off with DSDS, which is just one, and then we run our back propagation, and we're using the same kind of composition of Jacobians. So we have DSDH here and DSDZ, and we progressively pass back those gradients. So we just need to work out how to efficiently and cleanly do this in a computational system.

And so let's sort of work through again a few of these cases. So the general situation is we have a particular node. So a node is where some kind of operation like multiplication or a non-linearity happens. And so the simplest case is that we've got one output and one input.

So we'll do that first. So that's like H equals F of Z. So what we have is an upstream gradient, DSDH, and what we want to do is compute the downstream gradient of DSDZ. And the way we're going to do that is say, well, for this function F, it's a function, it's got a derivative, a gradient.

So what we want to do is work out that local gradient, DHDZ, and then that gives us everything that we need to work out DSDZ, because that's precisely we're going to use the chain rule. We're going to say that DSDZ equals the product of DSDH times DHDZ, where this is again using Jacobians.

Okay. So the general principle that we're going to use is the downstream gradient equals the upstream gradient times the local gradient. Okay. Sometimes it gets a little bit more complicated. So we might have multiple inputs to a function. So this is the matrix vector multiply. So Z equals WX.

Okay. When there are multiple inputs, we still have an upstream gradient, DSDZ. But what we're going to do is work out a local gradient with respect to each input. So we have DZDW and DZDX. And so then at that point, it's exactly the same for each piece of it.

We're going to work out the downstream gradients, DSDW and DSDX by using the chain rule with respect to the local gradient. So let's go through an example of this. I mean, this is kind of a silly example. It's not really an example that looks like a typical neural net, but it's sort of a simple example where we can show some of the components of what we do.

So what we're going to do is want to calculate F of XYZ, which is being calculated as X plus Y times the max of Y and Z. And we've got, you know, particular values that we're starting off with. X equals 1, Y equals 2, and Z equals 0. So these are the current values of our parameters.

And so we can say, okay, well, we want to build an expression tree for that. Here's our expression tree. We're taking X plus Y, we're taking the max of Y and Z, and then we're multiplying them. And so our forward propagation phase is just to run this. So we take the values of our parameters and we simply start to compute with them, right?

So we have 1, 2, 2, 0, and we add them as 3, the max is 2, we multiply them, and that gives us 6. Okay. So then at that point, we then want to go and work out how to do things for back propagation and how these back propagation steps work.

And so the first part of that is sort of working out what our local gradients are going to be. So this is A here, and this is X and Y. So dA/dX, since A equals X plus Y is just going to be 1, and dA/dy is also going to be 1.

Then for B equals the max of YZ, so this is this max node. So the local gradients for that is, it's going to depend on whether Y is greater than Z. So dB/dy is going to be 1 if and only if Y is greater than Z, which it is at our particular point here, so that's 1.

And dB/dz is going to be 1 only if Z is greater than Y. So for our particular values here, that one is going to be 0. And then finally, here, we're calculating the product F equals AB. So for that, we're going to - sorry, that slide is a little imperfect.

So for the product, the derivative F with respect to A is equal to B, which is 2, and the derivative F with respect to B is A equals 3. So that gives us all of the local gradients at each node. And so then to run backpropagation, we start with dF/df, which is just 1, and then we're going to work out the downstream equals the upstream times the local.

Okay, so the local - so when you have a product like this, note that sort of the gradients flip. So we take upstream times the local, which is 2. So the downstream is 2. On this side, dF/dB is 3. So we're taking upstream times local, that gives us 3.

And so that gives us backpropagates values to the plus and max nodes. And so then we continue along. So for the max node, the local gradient dB/dy equals 1. So we're going to take upstream is 3. So we're going to take 3 times 1, and that gives us 3.

dB/dz is 0, because of the fact that z's value is not the max. So we're taking 3 times 0, and saying the gradient there is 0. So finally, doing the plus node, the local gradients for both x and y there are 1. So we're just getting 2 times 1 in both cases, and we're saying the gradients there are 2.

Okay, and so again, at the end of the day, the interpretation here is that this is giving us information as to if we wiggle the values of x, y, and z, how much of a difference does it make to the output? What is the slope, the gradient, with respect to the variable?

So what we've seen is that since z isn't the max of y and z, if I change the value of z a little, like if I make z 0.1 or minus 0.1, it makes no difference at all to what I compute as the output. So therefore, the gradient there is 0.

If I change the value of x a little, then that is going to have an effect, and it's going to affect the output by twice as much as the amount I change it. Right, and that's because df/dx equals 2. So interestingly, I mean, we can basically work that out.

So if we imagine making sort of x 2.1, well, then what we'd calculate the max is 2. Oh, sorry, sorry, if we make x 1.1, we then get the max here is 2, and we get 1.1 plus 2 is 3.1. So we get 3.1 times 2. So that'd be about 6.2.

So changing x by 0.1 has added 0.2 to the value of f. Conversely, for the value of y, we find the df/dy equals 5. So what we do when we've got two things coming out here, as I'll go through again in a moment, is we're summing the gradient. So again, 3 plus 2 equals 5.

And empirically, that's what happens. So if we consider fiddling the value of y a little, let's say we make it a value of 2.1, then the prediction is they'll have five times as big an effect on the output value we compute. And well, what do we compute? So we compute 1 plus 2.1.

So that's 3.1. And we compute the max of 2.1 and 0 is 2.1. So we'll take the product of 2.1 and 3.1. And I calculate that in advance as I can't really do this arithmetic in my head. And the product of those two is 6.51. So it has gone up about by 0.5.

So my fiddling it by 0.1 has been multiplied by five to give the magnitude of the effect on the output. OK. So before, I did the case where we have one in and one out here, and multiple ins and one out here. The case that I hadn't actually dealt with is the case where you have multiple outward branches.

But that then turned up in the computation of y. So once you have multiple outward branches, what you're doing is you're summing. So that when you want to work out the df/dy, you've got a local gradient. You've got two upstream gradients. And you're working it out with respect to each of them as in the chain rule.

And then you're summing them together to work out the impact at the end. Right. So we also saw some of the other node intuitions, which is useful to have doing this. So when you have an addition, that distributes the upstream gradient to each of the things below it. When you have max, it's like a routing node.

So when you have max, you have the upstream gradient and it goes to one of the branches below it and the rest of them get no gradient. When you then have a multiplication, it has this effect of switching the gradient. So if you're taking 3 by 2, the gradient on the 2 side is 3 and on the 3 side is 2.

And if you think about in terms of how much effect you get from when you're doing this sort of wiggling, that totally makes sense. Right. Because if you're multiplying another number by 3, then any change here is going to be multiplied by 3 and vice versa. Okay. So this is the kind of computation graph that we want to use to work out derivatives in an automated computational fashion, which is the basis of the backpropagation algorithm.
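As a check of the whole worked example, here is a short sketch (an editor's illustration) that recomputes f = (x + y) times max(y, z) at x = 1, y = 2, z = 0 with PyTorch autograd and recovers the gradients 2, 5, and 0:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(0.0, requires_grad=True)

f = (x + y) * torch.maximum(y, z)
f.backward()

print(f.item())                                        # 6.0
print(x.grad.item(), y.grad.item(), z.grad.item())     # 2.0 5.0 0.0
```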

But at that point, this is what we're doing, but there's still one mistake that we can make. It would be wrong for us to sort of say, okay, well, first of all, we want to work out the SDB. So look, we can start up here. We can propagate our upstream errors, work out local gradients, upstream error, local gradient, and keep all the way down and get the DSDB down here.

Okay. Next we want to do it for DSDW. Let's just run it all over again. Because if we did that, we'd be doing repeated computation, as I showed in the first half, that this term is the same both times, this term is the same both times, this term is the same both times, that only the bits at the end differ.

So what we want to do is avoid duplicated computation and compute all the gradients that we're going to need successively so that we only do them once. And so that was analogous to when I introduced this delta variable when we computed gradients by hand. So we're starting off here with DSDS, which is one.

We then want to one time compute gradients in the green here. One time compute the gradient in green here. That's all common work. Then we're going to take the local gradient for DZDB and multiply that by the upstream gradient to have worked out DSDB. And then we're going to take the same upstream gradient and then work out the local gradient here.

And then sort of propagate that down to give us DSDW. So the end result is we want to sort of systematically work to forward computation forward in the graph and backward computation back propagation backward in the graph in a way that we do things efficiently. So this is the general form of the algorithm which works for an arbitrary computation graph.

So at the end of the day, we've got a single scalar output Z. And then we have inputs and parameters which compute Z. And so once we have this computation graph and I added in this funky extra arrow here to make it a more general computation graph, well, we can always say that we can work out a starting point, something that doesn't depend on anything.

So in this case, both of these bottom two nodes don't depend on anything else. So we can start with them and we can start to compute forward. We can compute values for all of these sort of second row from the bottom nodes. And then we're able to compute the third ones up.

So we can have a topological sort of the nodes based on the dependencies in this directed graph. And we can compute the value of each node given some subset of its predecessors, which it depends on. And so doing that is referred to as the forward propagation phase and gives us a computation of the scalar output Z using our current parameters and our current inputs.

And so then after that, we run back propagation. So for back propagation, we initialize the output gradient, dz dz as one. And then we visit nodes in the reverse order of the topological sort and we compute the gradients downward. And so our recipe is that for each node as we head down, we're going to compute the gradient of the node with respect to its successors, the things that it feeds into and how we compute that gradient is using this chain rule that we've looked at.

So this is sort of the generalized form of the chain rule where we have multiple outputs. And so we're summing over the different outputs and then for each output, we're computing the product of the upstream gradient and the local gradient with respect to that node. And so we head downwards and we continue down the reverse topological sort order and we work out the gradient with respect to each variable in this graph.
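Here is a toy sketch of that general algorithm (an editor's illustration only, with invented class and function names): nodes are visited in topological order for the forward pass and in reverse for the backward pass, and each node's gradient is accumulated as the sum over its outputs of upstream gradient times local gradient.

```python
class Node:
    def __init__(self, parents=(), forward_fn=None, local_grad_fn=None, value=None):
        self.parents = list(parents)        # nodes this node depends on
        self.forward_fn = forward_fn        # computes this node's value from parent values
        self.local_grad_fn = local_grad_fn  # returns d(this)/d(parent) for each parent
        self.value = value                  # set directly for inputs / parameters
        self.grad = 0.0

def forward(topo_order):
    for node in topo_order:
        if node.forward_fn is not None:
            node.value = node.forward_fn([p.value for p in node.parents])

def backward(topo_order):
    topo_order[-1].grad = 1.0               # dz/dz = 1 at the single scalar output
    for node in reversed(topo_order):
        if node.local_grad_fn is None:
            continue                        # inputs and parameters have no parents
        local = node.local_grad_fn([p.value for p in node.parents])
        for parent, g in zip(node.parents, local):
            parent.grad += node.grad * g    # downstream += upstream * local

# Example: f = (x + y) * max(y, z) with x = 1, y = 2, z = 0
x, y, z = Node(value=1.0), Node(value=2.0), Node(value=0.0)
a = Node([x, y], lambda v: v[0] + v[1], lambda v: [1.0, 1.0])
b = Node([y, z], lambda v: max(v), lambda v: [float(v[0] >= v[1]), float(v[1] > v[0])])
f = Node([a, b], lambda v: v[0] * v[1], lambda v: [v[1], v[0]])

order = [x, y, z, a, b, f]                  # a topological sort of the graph
forward(order)
backward(order)
print(f.value, x.grad, y.grad, z.grad)      # 6.0 2.0 5.0 0.0
```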

And so it hopefully looks kind of intuitive looking at this picture that if you think of it like this, the big O complexity of forward propagation and backward propagation is the same, right? In both cases, you're doing a linear pass through all of these nodes and calculating values given predecessors and then values given successors.

I mean, you have to do a little bit more work for working out the gradients sort of as shown by this chain rule, but it's the same big O complexity. So if somehow you're implementing stuff for yourself rather than relying on the software and you're calculating the gradients as sort of a different order of complexity of forward propagation, it means that you're doing something wrong.

You're doing repeated work that you shouldn't have to do. Okay. So this algorithm works for a completely arbitrary computation graph, any directed acyclic graph, you can apply this algorithm. In general, what we find is that we build neural networks that have a regular layer structure. So we have things like a vector of inputs, and then that's multiplied by a matrix.

It's transformed into another vector, which might be multiplied by another matrix or summed with another matrix or something, right? So once we're using that kind of regular layer structure, we can then parallelize the computation by working out the gradients in terms of Jacobians of vectors and matrices and do things in parallel much more efficiently.

Okay. So doing this is then referred to as automatic differentiation. And so essentially, if you know the computation graph, you should be able to have your clever computer system work out what the derivatives of everything is, and then apply back propagation to work out how to update the parameters and learn.

And there's actually a sort of an interesting sort of thing of how history has gone backwards here, which I'll just note. So some of you might be familiar with symbolic computation packages. So those are things like Mathematica. So Mathematica, you can give it a symbolic form of a computation, and then it can work out derivatives for you.

So it should be the case that if you give a complete symbolic form of a computation graph, then it should be able to work out all the derivatives for you and you never have to work out a derivative by hand whatsoever. And that was actually attempted in a famous deep learning library called Theano, which came out of Yoshua Bengio's group at the University of Montreal, that had a compiler that did that kind of symbolic manipulation.

But somehow that sort of proved a little bit too hard a road to follow. I imagine it actually might come back again in the future. And so modern deep learning frameworks, which include both TensorFlow and PyTorch, do 90% of this computation of automatic differentiation for you, but they don't actually symbolically compute derivatives.

So for each particular node or layer of your deep learning system, somebody, either you or the person who wrote that layer, has handwritten the local derivatives. But then everything from that point on, the sort of the taking, doing the chain rule of combining upstream gradients with local gradients to work out downstream gradients, that's then all being done automatically for back propagation on the computation graph.

And so what that means is for a whole neural network, you have a computation graph and it's going to have a forward pass and a backward pass. And so for the forward pass, you're topologically sorting the nodes based on their dependencies in the computation graph. And then for each node, you're running forward, the forward computation on that node.

And then for backward propagation, you're reversing the topological sort of the graph. And then for each node in the graph, you're running the backward propagation, which is the little bit of backdrop, the chain rule at that node. And then the result of doing that is you have gradients for your inputs and parameters.

And so this is the overall software runs this for you. And so what you want to do is then actually have stuff for particular nodes or layers in the graph. So if I have a multiply gate, it's going to have a forward algorithm, which just computes that the output is X times Y in terms of the two inputs.

And then I'm also going to want to tell it how to calculate the local derivative. So I want to say, what is the local derivative? So DLDX and DLDY in terms of the upstream gradient, DLDZ. And so I will then manually work out how to calculate that.

And normally what I have to do is I assume the forward pass is being run first. And I'm going to shove into some local variables for my class, the values that were used in the forward computation. So as well as computing Z equals X times Y, I'm going to sort of remember what X and Y were.

So that then when I'm asked to compute the backward pass, I'm then going to have implemented here what we saw earlier of that when it's X, Y, you're going to sort of swap the Y and the X to work out the local gradients. And so then I'm going to multiply those by the upstream gradient.

And I'm going to return-- I've just written it here as a sort of a little list, but really it's going to be a NumPy vector of the gradients. OK, so that's 98% of what I wanted to cover today. Just a couple of quick comments left. So that can and should all be automated.
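To make that gate idea concrete, here is roughly what such a multiply gate might look like (an editor's sketch, not PyTorch's or the lecture's actual implementation):

```python
import numpy as np

class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y          # remember the inputs for the backward pass
        return x * y

    def backward(self, dL_dz):
        dL_dx = dL_dz * self.y         # local gradient dz/dx = y (note the swap)
        dL_dy = dL_dz * self.x         # local gradient dz/dy = x
        return np.array([dL_dx, dL_dy])

gate = MultiplyGate()
print(gate.forward(3.0, 2.0))          # 6.0
print(gate.backward(1.0))              # [2. 3.]: the gradients flip to the other input
```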

Sometimes you want to just check if you're computing the right gradients. And so the standard way of checking that you're computing the right gradients is to manually work out the gradient by doing a numeric calculation of the gradient. And so you can do that. So you can work out what the derivative of f with respect to X should be by choosing some sort of small number h, like 10 to the minus 4, adding it to X, and subtracting it from X.

And then, since the difference between those two inputs is 2h, we divide through by 2h. And you're simply working out the rise over the run, which is the slope at that point with respect to X. And that's an approximation of the gradient of f with respect to X at that value of X.
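Written as code, that two-sided check might look roughly like this (a sketch with illustrative names):

```python
import numpy as np

def numeric_gradient(f, x, h=1e-4):
    # Wiggle each parameter by +/- h and take rise over run, (f(x+h) - f(x-h)) / 2h.
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                     # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Example: f(x) = sum(x**3), whose analytic gradient is 3 * x**2
x = np.array([1.0, 4.0, -2.0])
print(numeric_gradient(lambda v: np.sum(v ** 3), x))   # ~[ 3. 48. 12.]
print(3 * x ** 2)                                      # [ 3. 48. 12.]
```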

So this is so simple you can't make a mistake implementing it. And so therefore you can use this to check whether your gradient values are correct or not. This isn't something that you'd want to use much because not only is it approximate, but it's extremely slow. Because to work this out, you have to run the forward computation for every parameter of the model.

So if you have a model with a million parameters, you're now doing roughly a million times as much work as you would to run backprop, which actually uses calculus. So calculus is a good thing to know. But it can be really useful to check that the right values are being calculated.

In the old days when we hand wrote everything, this was kind of the key unit test that people used everywhere. These days, most of the time you're reusing layers that are built into PyTorch or some other deep learning framework. So it's much less needed. But sometimes you're implementing your own layer and you really do want to check that things are implemented correctly.

There's a fine point in the way this is written. If you saw this in sort of high school calculus class, you will have seen rise over run of f of x plus h minus f of x divided by h. It turns out that doing this two-sided estimate like this is much, much more accurate than doing a one-sided estimate.

And so you're really much encouraged to use this approximation. Okay. So at that point, we've mastered the core technology of neural nets. Backpropagation is recursively and hence efficiently applying the chain rule along the computation graph with this sort of key step that downstream gradient equals upstream gradient times local gradient.

And so for calculating with neural nets, we do the forward pass to work out values with current parameters, then run backpropagation to work out the gradient of the loss, currently computed loss with respect to those parameters. Now to some extent, you know, with modern deep learning frameworks, you don't actually have to know how to do any of this, right?

It's the same as you don't have to know how to implement a C compiler. You can just write C code and say GCC and it will compile it and it will run the right stuff for you. And that's the kind of functionality you get from the PyTorch framework. So do come along to the PyTorch tutorial this Friday and get a sense about how easy it is to write neural networks using a framework like PyTorch or TensorFlow.

And you know, it's so easy. That's why, you know, high school students across the nation are now doing their science projects training deep learning systems, because you don't actually have to understand very much, just plug a few neural network layers together and set it computing on some data. But, you know, we hope in this class that you actually are also learning how these things are implemented.

So you have a deeper understanding than that. And you know, it turns out that sometimes you need to have a deeper understanding. So back propagation doesn't always work perfectly. And so understanding what it's really doing can be crucial to debugging things. And so we'll actually see an example of that fairly soon when we start looking at recurrent models and some of the problems that they have, which will require us to think a bit more deeply about what's happening in our gradient computations.

Okay. That's it for today. Thank you.