The spelled-out intro to neural networks and backpropagation: building micrograd
Chapters
0:00 intro
0:25 micrograd overview
8:08 derivative of a simple function with one input
14:12 derivative of a function with multiple inputs
19:09 starting the core Value object of micrograd and its visualization
32:10 manual backpropagation example #1: simple expression
51:10 preview of a single optimization step
52:52 manual backpropagation example #2: a neuron
69:02 implementing the backward function for each operation
77:32 implementing the backward function for a whole expression graph
82:28 fixing a backprop bug when one node is used multiple times
87:05 breaking up a tanh, exercising with more operations
99:31 doing the same thing but in PyTorch: comparison
103:55 building out a neural net library (multi-layer perceptron) in micrograd
111:04 creating a tiny dataset, writing the loss function
117:56 collecting all of the parameters of the neural net
121:12 doing gradient descent optimization manually, training the network
134:03 summary of what we learned, how to go towards modern neural nets
136:46 walkthrough of the full code of micrograd on github
141:10 real stuff: diving into PyTorch, finding their backward pass for tanh
144:39 conclusion
145:20 outtakes :)
00:00:00.000 |
Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade 00:00:04.800 |
and in this lecture I'd like to show you what neural network training looks like under the hood. 00:00:09.600 |
So in particular we are going to start with a blank Jupyter notebook and by the end of this 00:00:14.000 |
lecture we will define and train a neural net and you'll get to see everything that goes on 00:00:18.560 |
under the hood and exactly sort of how that works on an intuitive level. Now specifically what I 00:00:23.520 |
would like to do is I would like to take you through building of micrograd. Now micrograd 00:00:29.200 |
is this library that I released on github about two years ago but at the time I only uploaded 00:00:33.840 |
the source code and you'd have to go in by yourself and really figure out how it works. 00:00:38.480 |
So in this lecture I will take you through it step by step and kind of comment on all the 00:00:42.960 |
pieces of it. So what is micrograd and why is it interesting? 00:00:47.600 |
Micrograd is basically an autograd engine. Autograd is short for automatic gradient and 00:00:54.720 |
really what it does is it implements back propagation. Now back propagation is this 00:00:58.640 |
algorithm that allows you to efficiently evaluate the gradient of some kind of a loss function 00:01:04.960 |
with respect to the weights of a neural network and what that allows us to do then is we can 00:01:10.000 |
iteratively tune the weights of that neural network to minimize the loss function and 00:01:13.920 |
therefore improve the accuracy of the network. So back propagation would be at the mathematical 00:01:18.960 |
core of any modern deep neural network library like say pytorch or jax. So the functionality 00:01:24.800 |
of micrograd is I think best illustrated by an example. So if we just scroll down here 00:01:28.720 |
you'll see that micrograd basically allows you to build out mathematical expressions 00:01:33.520 |
and here what we are doing is we have an expression that we're building out where 00:01:38.000 |
you have two inputs a and b and you'll see that a and b are negative four and two but we are 00:01:44.880 |
wrapping those values into this value object that we are going to build out as part of micrograd. 00:01:50.480 |
So this value object will wrap the numbers themselves and then we are going to build 00:01:56.080 |
out a mathematical expression here where a and b are transformed into c d and eventually e f and g 00:02:03.120 |
and I'm showing some of the functionality of micrograd and the operations that it supports. 00:02:08.800 |
So you can add two value objects, you can multiply them, you can raise them to a constant power, 00:02:14.480 |
you can offset by one, negate, squash at zero, square, divide by a constant, divide by it, 00:02:21.520 |
etc. And so we're building out an expression graph with these two inputs a and b and we're 00:02:27.760 |
creating an output value of g and micrograd will in the background build out this entire 00:02:33.600 |
mathematical expression. So it will for example know that c is also a value, c was a result of 00:02:39.680 |
an addition operation and the child nodes of c are a and b, and it will maintain 00:02:47.360 |
pointers to a and b value objects so we'll basically know exactly how all of this is laid out. 00:02:52.480 |
And then not only can we do what we call the forward pass where we actually look at the value 00:02:57.520 |
of g of course that's pretty straightforward we will access that using the dot data attribute 00:03:03.120 |
and so the output of the forward pass the value of g is 24.7 it turns out but the big deal is 00:03:10.160 |
that we can also take this g value object and we can call dot backward and this will basically 00:03:16.240 |
initialize backpropagation at the node g. And what backpropagation is going to do is it's going to 00:03:21.920 |
start at g and it's going to go backwards through that expression graph and it's going to recursively 00:03:27.440 |
apply the chain rule from calculus and what that allows us to do then is we're going to evaluate 00:03:33.440 |
basically the derivative of g with respect to all the internal nodes like e d and c but also 00:03:40.400 |
with respect to the inputs a and b and then we can actually query this derivative of g with respect 00:03:47.200 |
to a for example that's a dot grad in this case it happens to be 138 and the derivative of g with 00:03:52.880 |
respect to b which also happens to be here 645 and this derivative we'll see soon is very important 00:04:00.000 |
information because it's telling us how a and b are affecting g through this mathematical expression 00:04:06.720 |
so in particular a dot grad is 138 so if we slightly nudge a and make it slightly larger 00:04:13.920 |
138 is telling us that g will grow and the slope of that growth is going to be 138 00:04:20.720 |
and the slope of growth of b is going to be 645 so that's going to tell us about how g will respond 00:04:26.800 |
if a and b get tweaked a tiny amount in a positive direction okay um now you might be confused about 00:04:34.720 |
what this expression is that we built out here and this expression by the way is completely meaningless 00:04:39.680 |
i just made it up i'm just flexing about the kinds of operations that are supported by micrograd 00:04:44.800 |
and what we actually really care about are neural networks but it turns out that neural networks 00:04:48.720 |
are just mathematical expressions just like this one, but actually a bit less crazy even 00:04:53.680 |
neural networks are just a mathematical expression they take the input data as an input and they take 00:05:00.080 |
the weights of a neural network as an input and it's a mathematical expression and the output 00:05:04.560 |
are your predictions of your neural net or the loss function we'll see this in a bit but basically 00:05:09.600 |
neural networks just happen to be a certain class of mathematical expressions but back propagation 00:05:14.800 |
is actually significantly more general it doesn't actually care about neural networks at all it only 00:05:19.440 |
tells us about arbitrary mathematical expressions and then we happen to use that machinery for 00:05:24.480 |
training of neural networks now one more note i would like to make at this stage is that as you 00:05:28.960 |
see here micrograd is a scalar valued autograd engine so it's working on the you know level of 00:05:34.320 |
individual scalars like negative four and two and we're taking neural nets and we're breaking them 00:05:38.320 |
down all the way to these atoms of individual scalars and all the little pluses and times and 00:05:42.960 |
it's just excessive and so obviously you would never be doing any of this in production it's 00:05:48.000 |
really just done for pedagogical reasons because it allows us to not have to deal with 00:05:52.000 |
these n-dimensional tensors that you would use in modern deep neural network library so this is 00:05:57.280 |
really done so that you understand and refactor out back propagation and chain rule and understanding 00:06:03.200 |
of neural training and then if you actually want to train bigger networks you have to be using these 00:06:08.320 |
tensors but none of the math changes this is done purely for efficiency we are basically taking 00:06:13.120 |
scalar value all the scalar values we're packaging them up into tensors which are just arrays of 00:06:18.240 |
these scalars and then because we have these large arrays we're making operations on those large 00:06:23.440 |
arrays that allows us to take advantage of the parallelism in a computer and all those operations 00:06:29.040 |
can be done in parallel and then the whole thing runs faster but really none of the math changes 00:06:33.520 |
and it's done purely for efficiency so i don't think that it's pedagogically useful to be dealing 00:06:37.760 |
with tensors from scratch and that's why i fundamentally wrote micrograd because you 00:06:42.640 |
can understand how things work at the fundamental level and then you can speed it up later okay so 00:06:48.320 |
here's the fun part my claim is that micrograd is what you need to train your own networks and 00:06:52.800 |
everything else is just efficiency so you'd think that micrograd would be a very complex piece of 00:06:57.280 |
code and that turns out to not be the case so if we just go to micrograd and you'll see that there's 00:07:04.480 |
only two files here in micrograd this is the actual engine it doesn't know anything about 00:07:09.280 |
neural nets and this is the entire neural nets library on top of micrograd so engine and nn.py 00:07:16.560 |
so the actual back propagation autograd engine that gives you the power of neural networks 00:07:23.280 |
is literally 100 lines of code of like very simple python which we'll understand by the end of this 00:07:31.120 |
lecture and then nn.py this neural network library built on top of the autograd engine 00:07:37.040 |
is like a joke it's like we have to define what is a neuron and then we have to define what is a layer 00:07:43.920 |
of neurons and then we define what is a multilayer perceptron which is just a sequence of layers of 00:07:48.800 |
neurons and so it's just a total joke so basically there's a lot of power that comes from only 150 00:07:56.400 |
lines of code and that's all you need to understand neural network training and everything 00:08:01.200 |
else is just efficiency and of course there's a lot to efficiency but fundamentally that's all 00:08:06.880 |
that's happening okay so now let's dive right in and implement micrograd step by step the first 00:08:11.600 |
thing i'd like to do is i'd like to make sure that you have a very good understanding intuitively of 00:08:15.600 |
what a derivative is and exactly what information it gives you so let's start with some basic 00:08:21.280 |
imports that i copy paste in every jupyter notebook always and let's define a function 00:08:26.880 |
a scalar valued function f of x as follows so i just made this up randomly i just wanted a 00:08:33.520 |
scalar valued function that takes a single scalar x and returns a single scalar y and we can call 00:08:39.280 |
this function of course so we can pass in, say, 3.0 and get 20 back now we can also plot this function 00:08:45.520 |
to get a sense of its shape you can tell from the mathematical expression that this is probably a 00:08:50.000 |
parabola it's a quadratic and so if we just create a set of scalar values that we can feed in 00:08:59.280 |
using for example np.arange from negative 5 to 5 in steps of 0.25 so x is just from 00:09:06.320 |
negative 5 to 5 not including 5 in steps of 0.25 and we can actually call this function on this 00:09:13.120 |
numpy array as well so we get a set of y's if we call f on x's and these y's are basically also 00:09:20.800 |
applying the function on every one of these elements independently and we can plot this 00:09:26.400 |
using matplotlib so plt.plot x's and y's and we get a nice parabola so previously here we fed in 00:09:33.760 |
3.0 somewhere here and we received 20 back which is here the y coordinate so now i'd like to think 00:09:39.840 |
through what is the derivative of this function at any single input point x right so what is the 00:09:46.160 |
derivative at different points x of this function now if you remember back to your calculus class 00:09:51.360 |
you've probably derived derivatives so we take this mathematical expression 3x squared minus 4x 00:09:56.400 |
plus 5 and you would write out on a piece of paper and you would you know apply the product rule and 00:10:00.560 |
all the other rules and derive the mathematical expression of the derivative of the 00:10:05.040 |
original function and then you could plug in different x's and see what the derivative is 00:10:08.720 |
we're not going to actually do that because no one in neural networks actually writes out the 00:10:14.800 |
expression for the neural net it would be a massive expression it would be you know thousands tens of 00:10:19.600 |
thousands of terms no one actually derives the derivative of course and so we're not going to 00:10:24.240 |
take this kind of like symbolic approach instead what i'd like to do is i'd like to look at the 00:10:28.160 |
definition of derivative and just make sure that we really understand what the derivative is 00:10:31.920 |
measuring what it's telling you about the function and so if we just look up derivative 00:10:36.720 |
we see that um okay so this is not a very good definition of derivative this is a definition 00:10:46.560 |
of what it means to be differentiable but if you remember from your calculus it is the limit as h 00:10:51.120 |
goes to zero of f of x plus h minus f of x over h so basically what it's saying is if you slightly 00:10:58.960 |
bump up you're at some point x that you're interested in or a and if you slightly bump up 00:11:04.000 |
you know you slightly increase it by a small number h how does the function respond with what 00:11:09.760 |
sensitivity does it respond where is the slope at that point does the function go up or does it go 00:11:14.480 |
down and by how much and that's the slope of that function the the slope of that response at that 00:11:20.880 |
point and so we can basically evaluate the derivative here numerically by taking a very 00:11:26.880 |
small h of course the definition would ask us to take h to zero we're just going to pick a very 00:11:31.760 |
small h, 0.001, and let's say we're interested in x = 3.0 so we can look at f of x which of course is 20 00:11:38.240 |
and now f of x plus h so if we slightly nudge x in a positive direction how is the function going to 00:11:44.560 |
respond and just looking at this do you expect f of x plus h to be slightly greater 00:11:50.320 |
than 20 or do you expect it to be slightly lower than 20 and since this 3 is here and this is 20 00:11:57.040 |
if we slightly go positively the function will respond positively so you'd expect this to be 00:12:02.160 |
slightly greater than 20 and now by how much is telling you the sort of the the strength of that 00:12:08.640 |
slope right the the size of the slope so f of x plus h minus f of x this is how much the function 00:12:14.720 |
responded in the positive direction and we have to normalize by the run so we have the rise over 00:12:21.360 |
run to get the slope so this of course is just a numerical approximation of the slope because we 00:12:27.440 |
have to make h very very small to converge to the exact amount now if i'm doing too many zeros 00:12:34.560 |
at some point i'm going to get an incorrect answer because we're using floating point arithmetic 00:12:40.400 |
and the representations of all these numbers in computer memory is finite and at some point we're 00:12:45.280 |
getting into trouble so we can converge towards the right answer with this approach but basically 00:12:50.960 |
at 3 the slope is 14 and you can see that by taking 3x squared minus 4x plus 5 and differentiating it 00:12:59.200 |
in our head so the derivative of 3x squared minus 4x plus 5 is 6x minus 4 and then we plug in x equals 3 so that's 18 minus 4 00:13:08.000 |
is 14 so this is correct so that's at 3 now how about the slope at say negative 3 would you expect 00:13:18.480 |
what would you expect for the slope now telling the exact value is really hard but what is the 00:13:23.200 |
sign of that slope so at negative 3 if we slightly go in the positive direction at x the function 00:13:30.400 |
would actually go down and so that tells you that the slope would be negative so we'll get a 00:13:34.560 |
number slightly below 20 and so if we take the slope we expect something negative: negative 22 okay 00:13:42.160 |
and at some point here of course the slope would be zero now for this specific function 00:13:48.320 |
i looked it up previously and it's at point 2 over 3 so at roughly 2 over 3 that's somewhere here 00:13:57.040 |
this derivative would be zero so basically at that precise point 00:14:01.120 |
yeah at that precise point if we nudge in a positive direction the function doesn't respond 00:14:08.160 |
it stays almost the same and so that's why the slope is zero (a small numerical check of these slopes is sketched below) 00:14:12.560 |
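Roughly, the numerical check being described looks like this (a sketch, assuming the same f(x) = 3x**2 - 4x + 5 and a small h as above):

```python
# A sketch of the numerical slope check described above,
# assuming f(x) = 3x**2 - 4x + 5 as defined earlier in the lecture.
def f(x):
    return 3*x**2 - 4*x + 5

h = 0.001
for x in [3.0, -3.0, 2/3]:
    slope = (f(x + h) - f(x)) / h   # rise over run
    print(f"x = {x:6.3f}   numerical slope ~ {slope:7.3f}")
# prints roughly 14 at x=3, roughly -22 at x=-3, and roughly 0 at x=2/3
```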
okay now let's look at a bit more complex case so we're going to start you know complexifying a bit so now we have a function 00:14:19.680 |
here with output variable d that is a function of three scalar inputs a b and c so a b and c are 00:14:27.280 |
some specific values three inputs into our expression graph and a single output d and so 00:14:33.440 |
if we just print d we get four and now what i'd like to do is i'd like to again look at the 00:14:38.720 |
derivatives of d with respect to a b and c and think through again just the intuition of what 00:14:45.600 |
this derivative is telling us so in order to evaluate this derivative we're going to get a 00:14:51.040 |
bit hacky here we're going to again have a very small value of h and then we're going to fix the 00:14:56.960 |
inputs at some values that we're interested in so these are the this is the point a b c at which 00:15:03.360 |
we're going to be evaluating the derivative of d with respect to all a b and c at that point 00:15:08.800 |
so there are the inputs and now we have d1 is that expression and then we're going to for example 00:15:14.880 |
look at the derivative of d with respect to a so we'll take a and we'll bump it by h and then we'll 00:15:20.400 |
get d2 to be the exact same function and now we're going to print, you know, d1 is d1 00:15:30.000 |
d2 is d2 and print slope so the derivative or slope here will be um of course d2 minus d1 00:15:42.800 |
divide h so d2 minus d1 is how much the function increased when we bumped the uh the specific 00:15:52.560 |
input that we're interested in by a tiny amount and this is the normalized by h to get the slope 00:16:02.720 |
so if i just run this we're going to print d1 which we know is four 00:16:14.240 |
now d2 will be bumped a will be bumped by h so let's just think through a little bit 00:16:23.760 |
what d2 will be uh printed out here in particular d1 will be four will d2 be a number slightly 00:16:33.200 |
greater than four or slightly lower than four and that's going to tell us the sign 00:16:38.320 |
of the derivative so we're bumping a by h b is minus three c is 10 so you can just intuitively 00:16:49.760 |
think through this derivative and what it's doing a will be slightly more positive and but b is a 00:16:55.840 |
negative number so if a is slightly more positive because b is negative three we're actually going 00:17:03.840 |
to be adding less to d so you'd actually expect that the value of the function will go down 00:17:13.760 |
so let's just see this yeah and so we went from four to 3.9996 and that tells you that the slope 00:17:22.000 |
will be negative so the slope will be a negative number because we went down and the exact 00:17:29.440 |
amount of the slope is negative three and you can also convince yourself 00:17:34.880 |
that negative three is the right answer uh mathematically and analytically because if you 00:17:39.280 |
have a times b plus c and you are you know you have calculus then uh differentiating a times b 00:17:45.440 |
plus c with respect to a gives you just b and indeed the value of b is negative three which 00:17:51.040 |
is the derivative that we have so you can tell that that's correct so now if we do this with b 00:17:56.320 |
so if we bump b by a little bit in a positive direction we'd get different slopes so what is 00:18:03.040 |
the influence of b on the output d so if we bump b by a tiny amount in a positive direction then 00:18:09.520 |
because a is positive we'll be adding more to d right so now what is the 00:18:16.960 |
sensitivity what is the slope of that addition and it might not surprise you that this should be 00:18:21.840 |
two and why is it two because dd by db, differentiating with respect to b, would 00:18:30.720 |
give us a and the value of a is two so that's also working well and then if c gets bumped a tiny 00:18:37.040 |
amount, by h, then of course a times b is unaffected and now c becomes slightly higher 00:18:44.240 |
what does that do to the function it makes it slightly higher because we're simply adding c 00:18:48.240 |
and it makes it slightly higher by the exact same amount that we added to c and so that tells 00:18:53.840 |
you that the slope is one; that will be the rate at which d will increase as we scale c, okay (see the numerical check sketched below) 00:19:05.360 |
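As a quick sketch, the three nudges just talked through can be written out like this (assuming the lecture's inputs a=2.0, b=-3.0, c=10.0):

```python
# Numerical estimates of the three partial derivatives of d = a*b + c.
h = 0.0001
a, b, c = 2.0, -3.0, 10.0
d1 = a*b + c                                  # 4.0

print("dd/da ~", ((a + h)*b + c - d1) / h)    # expect b = -3
print("dd/db ~", (a*(b + h) + c - d1) / h)    # expect a =  2
print("dd/dc ~", (a*b + (c + h) - d1) / h)    # expect 1
```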
so we now have some intuitive sense of what this derivative is telling you about the function 00:19:09.280 |
and we'd like to move to neural networks now as i mentioned neural networks will be pretty massive 00:19:13.120 |
expressions mathematical expressions so we need some data structures that maintain these expressions 00:19:17.680 |
and that's what we're going to start to build out now so we're going to build out this value object 00:19:23.760 |
that i showed you in the readme page of micrograd so let me copy paste a skeleton of the first very 00:19:31.440 |
simple value object so class value takes a single scalar value that it wraps and keeps track of 00:19:38.640 |
and that's it so we can for example do value of 2.0 and then we can get we can look at its content 00:19:47.760 |
and python will internally use the __repr__ function to return this string, oops, like that 00:19:57.280 |
so this is a value object with data equals two that we're creating here now we'd like to do is 00:20:04.080 |
like we'd like to be able to have not just like two values but we'd like to do a plus b right we'd 00:20:12.000 |
like to add them so currently you would get an error because python doesn't know how to add 00:20:17.760 |
two value objects so we have to tell it so here's addition 00:20:22.880 |
so you have to basically use these special double underscore methods in python to define these 00:20:30.880 |
operators for these objects so if we use this plus operator python will 00:20:39.760 |
internally call a.__add__(b), that's what will happen internally, and so b will be the other 00:20:47.280 |
and self will be a and so we see that what we're going to return is a new value object 00:20:54.000 |
and it's just uh it's going to be wrapping the plus of their data but remember now because 00:21:01.360 |
data is the actual python number, this operator here is just the typical floating 00:21:07.360 |
point addition, not an addition of value objects, and we'll return a new value 00:21:13.680 |
so now a plus b should work and it should print value of negative one because that's two plus 00:21:19.200 |
minus three there we go okay let's now implement multiply just so we can recreate this expression 00:21:25.920 |
here so multiply i think it won't surprise you will be fairly similar so instead of add we're 00:21:32.880 |
going to be using mul and then here of course we want to do times and so now we can create a c value 00:21:38.640 |
object which will be 10.0 and now we should be able to do a times b well let's just do a times b 00:21:45.280 |
first um that's value of negative six now and by the way i skipped over this a little bit uh suppose 00:21:52.960 |
that i didn't have the __repr__ function here, then you'll get some kind of an 00:21:57.680 |
ugly expression so what __repr__ is doing is it's providing us a way to print out like a nicer 00:22:03.120 |
looking expression in python uh so we don't just have something cryptic we actually are you know 00:22:09.040 |
it's value of negative six so this gives us a times b and then we should now be able to add 00:22:16.560 |
c to it because we've defined and told python how to do mul and add and so this 00:22:22.240 |
will basically be equivalent to a.__mul__(b) and then this new value object will be .__add__(c) 00:22:31.760 |
and so let's see if that worked yep so that worked well that gave us four which is what we expect 00:22:37.680 |
from before and i believe you can just call them manually as well, there we go (a minimal sketch of the Value class so far is below) 00:22:45.600 |
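For reference, here is a minimal sketch of the Value object as built up to this point (just data, __repr__, __add__ and __mul__; the graph bookkeeping comes next):

```python
# A minimal sketch of the Value class so far.
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        # a + b calls a.__add__(b); wrap the plain float sum in a new Value
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
print(a*b + c)                     # Value(data=4.0)
print(a.__mul__(b).__add__(c))     # the same thing, called manually
```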
okay so now what we are missing is the connective tissue of this expression as i mentioned we want to keep 00:22:50.000 |
these expression graphs so we need to know and keep pointers about what values produce what other 00:22:55.440 |
values so here for example we are going to introduce a new variable which we'll call 00:23:00.320 |
children and by default it will be an empty tuple and then we're actually going to keep a slightly 00:23:05.120 |
different variable in the class which we'll call underscore prev which will be the set of children 00:23:10.320 |
this is how i did it in the original micrograd, looking at my code here, i can't remember 00:23:16.320 |
exactly the reason i believe it was efficiency but this underscore children will be a tuple 00:23:20.800 |
for convenience but then when we actually maintain it in the class it will be just this set 00:23:24.400 |
i believe for efficiency so now when we are creating a value like this with a constructor 00:23:32.800 |
children will be empty and _prev will be the empty set but when we are creating a value through 00:23:37.360 |
addition or multiplication we're going to feed in the children of this value which in this case 00:23:43.520 |
is self and other so those are the children here so now we can do d._prev and we'll see that 00:23:54.160 |
the children of d, we now know, are this value of negative six and the value of 10 and this of course 00:24:00.880 |
is the value resulting from a times b and the c value which is 10 now the last piece of information 00:24:08.400 |
we don't know so we know now the children of every single value but we don't know what operation 00:24:12.960 |
created this value so we need one more element here let's call it underscore op and by default 00:24:19.920 |
this is the empty string for leaves and then we'll just maintain it here and now the operation will 00:24:26.880 |
be just a simple string and in the case of addition it's plus in the case of multiplication 00:24:32.000 |
it's times so now we not just have d._prev, we also have d._op, and we know that d was 00:24:39.600 |
produced by an addition of those two values and so now we have the full mathematical expression 00:24:45.040 |
and we're building out this data structure and we know exactly how each value came to be by what 00:24:50.320 |
expression and from what other values (a quick sketch of the Value class with these additions is below) 00:24:56.880 |
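A quick sketch of the class with the _children/_prev and _op bookkeeping just described (names follow micrograd's):

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)   # the Values that produced this one
        self._op = _op                # the operation that produced it; '' for a leaf

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a*b + c
print(d._prev)   # {Value(data=-6.0), Value(data=10.0)}
print(d._op)     # '+'
```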
now because these expressions are about to get quite a bit larger we'd like a way to nicely visualize these expressions that we're building out so for that 00:25:02.560 |
i'm going to copy paste a bunch of slightly scary code that's going to visualize this 00:25:07.840 |
these expression graphs for us so here's the code and i'll explain it in a bit but first let me just 00:25:13.040 |
show you what this code does basically what it does is it creates a new function draw dot that 00:25:18.400 |
we can call on some root node and then it's going to visualize it so if we call draw dot on d which 00:25:25.040 |
is this final value here that is a times b plus c it creates something like this so this is d 00:25:32.240 |
and you see that this is a times b creating an intermediate value, plus c gives us this output node d 00:25:38.880 |
so that's draw dot of d and i'm not going to go through this in complete detail you can take a 00:25:45.840 |
look at graphviz and its api, graphviz is an open source graph visualization software 00:25:50.800 |
and what we're doing here is we're building out this graph in the graphviz api and you can basically 00:25:57.600 |
see that trace is this helper function that enumerates all the nodes and edges in the graph 00:26:02.160 |
so that just builds a set of all the nodes and edges and then we iterate through all the nodes 00:26:06.720 |
and we create special node objects for them using dot.node and then we also create edges 00:26:15.120 |
using dot.edge and the only thing that's like slightly tricky here is you'll notice that i 00:26:20.400 |
basically add these fake nodes which are these operation nodes so for example this node here 00:26:25.680 |
is just like a plus node and i create these special op nodes here and i connect them accordingly 00:26:36.240 |
so these nodes of course are not actual nodes in the original graph they're not actually a value 00:26:42.960 |
object the only value objects here are the things in squares those are actual value objects or 00:26:48.960 |
representations thereof and these op nodes are just created in this draw dot routine so that it 00:26:54.160 |
looks nice (a condensed sketch of trace and draw_dot is below) 00:27:00.320 |
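Here is a condensed sketch of what that trace/draw_dot code is doing (an approximation of the notebook code, assuming the Value class sketched above and the graphviz Python package):

```python
from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and edges in the expression graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})  # draw left to right
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # a rectangular node for every Value in the graph
        dot.node(name=uid, label=f"data {n.data:.4f}", shape='record')
        if n._op:
            # a fake op node for the operation that produced this Value
            dot.node(name=uid + n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    for n1, n2 in edges:
        # connect each child to the op node of the Value it feeds into
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot
```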
let's also add labels to these graphs just so we know what variables are where, so let's create a special underscore label, or let's just do label equals empty by default, and save it in each 00:27:08.560 |
node and then here we're going to do label as a label is b label is c and then let's create a 00:27:25.680 |
special e equals a times b and e dot label will be e, it's kind of naughty, and d will be e plus c 00:22:37.200 |
and d dot label will be d, okay so nothing really changes, i just added 00:27:45.840 |
that new e variable and then here when we are printing this i'm going to print the label here 00:27:54.080 |
so this will be a percent s bar and this will be n dot label 00:27:58.240 |
and so now we have the label on the left here so it says a b creating e and then e plus c 00:28:07.680 |
creates d just like we have it here and finally let's make this expression just one layer deeper 00:28:13.520 |
so d will not be the final output node instead after d we are going to create a new value object 00:28:21.760 |
called f we're going to start running out of variables soon f will be negative 2.0 and its 00:28:27.920 |
label will of course just be f and then L capital L will be the output of our graph and L will be 00:28:36.160 |
d times f okay so L will be negative 8 is the output so now we don't just draw a d we draw L 00:28:50.000 |
okay and somehow the label of L is undefined, oops, that label has to be explicitly 00:28:57.680 |
sort of given to it, there we go, so L is the output (the full little expression we've built is written out below) 00:29:03.600 |
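Written out as plain code, the little expression graph at this point is roughly (label is the attribute just added for nicer visualization):

```python
a = Value(2.0);  a.label = 'a'
b = Value(-3.0); b.label = 'b'
c = Value(10.0); c.label = 'c'
e = a * b;       e.label = 'e'
d = e + c;       d.label = 'd'
f = Value(-2.0); f.label = 'f'
L = d * f;       L.label = 'L'
print(L.data)    # -8.0, the result of the forward pass
```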
so let's quickly recap what we've done: we are able to build out mathematical expressions using only plus and times so far 00:29:08.160 |
they are scalar valued along the way and we can do this forward pass and build out a mathematical 00:29:15.440 |
expression so we have multiple inputs here a b c and f going into a mathematical expression 00:29:21.520 |
that produces a single output L and this here is visualizing the forward pass so the output of the 00:29:28.000 |
forward pass is negative 8 that's the value now what we'd like to do next is we'd like to run 00:29:33.920 |
back propagation and in back propagation we are going to start here at the end and we're going to 00:29:39.600 |
reverse and calculate the gradient along all these intermediate values and really what 00:29:45.920 |
we're computing for every single value here we're going to compute the derivative of that node with 00:29:52.800 |
respect to L so the derivative of L with respect to L is just one and then we're going to derive 00:30:01.520 |
what is the derivative of L with respect to f with respect to d with respect to c with respect to e 00:30:07.600 |
with respect to b and with respect to a and in a neural network setting you'd be very interested 00:30:13.040 |
in the derivative of basically this loss function L with respect to the weights of a neural network 00:30:18.800 |
and here of course we have just these variables a b c and f but some of these will eventually 00:30:23.840 |
represent the weights of a neural net and so we'll need to know how those weights are impacting the 00:30:29.200 |
loss function so we'll be interested basically in the derivative of the output with respect to some 00:30:33.760 |
of its leaf nodes and those leaf nodes will be the weights of the neural net and the other leaf 00:30:38.800 |
nodes of course will be the data itself but usually we will not want or use the derivative of 00:30:43.920 |
the loss function with respect to data because the data is fixed but the weights will be iterated on 00:30:49.680 |
using the gradient information so next we are going to create a variable inside the value class 00:30:55.440 |
that maintains the derivative of L with respect to that value and we will call this variable grad 00:31:03.840 |
so there's self.data and there's self.grad, and initially it will be zero, and remember that 00:31:10.320 |
zero basically means no effect so at initialization we're assuming that every value 00:31:15.520 |
does not affect the output right because if the gradient is zero that means that 00:31:21.920 |
changing this variable is not changing the loss function so by default we assume that the gradient 00:31:27.680 |
is zero and then now that we have grad and it's you know 0.0 we are going to be able to visualize 00:31:38.080 |
it here after data so here grad is %.4f and this will be n.grad and now we are going to be 00:31:47.040 |
showing both the data and the grad initialized at zero and we are just about getting ready to 00:31:55.600 |
calculate the back propagation and of course this grad again as I mentioned is representing 00:32:00.400 |
the derivative of the output, in this case L, with respect to this value, so this 00:32:06.320 |
is the derivative of L with respect to f with respect to d and so on so let's now fill in those 00:32:11.680 |
gradients and actually do back propagation manually so let's start filling in these gradients 00:32:15.840 |
and start all the way at the end as I mentioned here first we are interested to fill in this 00:32:20.400 |
gradient here so what is the derivative of L with respect to L in other words if I change L 00:32:26.800 |
by a tiny amount h how much does L change it changes by h so it's proportional and therefore 00:32:35.440 |
the derivative will be one we can of course measure these or estimate these numerical 00:32:40.320 |
gradients numerically just like we've seen before so if I take this expression and I create a def 00:32:46.800 |
lol function here and put this here now the reason I'm creating a gating function lol here is because 00:32:53.840 |
I don't want to pollute or mess up the global scope here this is just kind of like a little 00:32:57.840 |
staging area and as you know in Python all of these will be local variables to this function 00:33:02.560 |
so I'm not changing any of the global scope here so here L1 will be L 00:33:07.520 |
and then copy pasting this expression we're going to add a small amount h 00:33:15.280 |
and for example a right and this would be measuring the derivative of L with respect to a 00:33:24.480 |
so here this will be L2 and then we want to print that derivative so print L2 minus L1 which is how 00:33:33.280 |
much L changed and then normalize it by h so this is the rise over run and we have to be careful 00:33:40.400 |
because L is a value node so we actually want its data so that these are floats dividing by h 00:33:48.880 |
and this should print the derivative of L with respect to a because a is the one that we bumped 00:33:53.920 |
a little bit by h so what is the derivative of L with respect to a, it's six (a sketch of this little gradient check is below) 00:34:03.520 |
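A sketch of that throwaway gradient-check function: build the expression twice, nudge a by h the second time, and compare the outputs numerically.

```python
def lol():
    h = 0.001

    a = Value(2.0); b = Value(-3.0); c = Value(10.0)
    e = a * b; d = e + c
    f = Value(-2.0)
    L = d * f
    L1 = L.data

    a = Value(2.0 + h); b = Value(-3.0); c = Value(10.0)   # bump a by h
    e = a * b; d = e + c
    f = Value(-2.0)
    L = d * f
    L2 = L.data

    print((L2 - L1) / h)   # numerical estimate of dL/da, roughly 6

lol()
```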
okay and obviously if we change L itself by h then, well, this looks really awkward, but 00:34:14.000 |
changing L by h you see the derivative here is one that's kind of like the base case of 00:34:22.560 |
what we are doing here so basically we can come up here and we can manually set L.grad to one 00:34:29.600 |
this is our manual back propagation L.grad is one and let's redraw and we'll see that we filled in 00:34:37.280 |
grad is one for L we're now going to continue the back propagation so let's here look at the 00:34:42.160 |
derivatives of L with respect to D and F let's do a D first so what we are interested in if I 00:34:49.120 |
create a markdown on here is we'd like to know basically we have that L is D times F and we'd 00:34:54.560 |
like to know what is D L by D D what is that and if you know your calculus L is D times F so what 00:35:04.400 |
is D L by D D it would be F and if you don't believe me we can also just derive it because 00:35:11.200 |
the proof would be fairly straightforward we go to the definition of the derivative which is F of x 00:35:18.560 |
plus h minus F of x divide h as a limit limit of h goes to zero of this kind of expression so when 00:35:26.160 |
we have L is D times F then increasing D by h would give us the output of D plus h times F 00:35:34.480 |
that's basically F of x plus h right minus D times F and then divide h and symbolically 00:35:44.880 |
expanding out here we would have basically D times F plus h times F minus D times F divide h 00:35:51.600 |
and then you see how the D F minus D F cancels so you're left with h times F 00:35:56.160 |
divide h which is F so in the limit as h goes to zero of you know derivative 00:36:06.720 |
definition we just get F in the case of D times F so symmetrically D L by D F will just be D 00:36:17.120 |
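For reference, the little derivation just talked through, written out (for L = d·f):

```latex
\[
\frac{\partial L}{\partial d}
  = \lim_{h \to 0} \frac{(d+h)\,f - d\,f}{h}
  = \lim_{h \to 0} \frac{h\,f}{h}
  = f,
\qquad
\frac{\partial L}{\partial f} = d \quad \text{(by symmetry)}.
\]
```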
so what we have is that F dot grad we see now is just the value of D which is four 00:36:25.120 |
and we see that D dot grad is just the value of F 00:36:34.560 |
and so the value of F is negative two so we'll set those manually 00:36:41.600 |
let me erase this markdown node and then let's redraw what we have 00:36:46.640 |
okay and let's just make sure that these were correct so we seem to think that D L by D D is 00:36:55.040 |
negative two so let's double check um let me erase this plus h from before and now we want the 00:37:00.400 |
derivative with respect to F so let's just come here when I create F and let's do a plus h here 00:37:05.840 |
and this should print a derivative of L with respect to F so we expect to see four yeah and 00:37:11.360 |
this is four up to floating point funkiness and then D L by D D should be F which is negative two 00:37:22.320 |
grad is negative two so if we again come here and we change D 00:37:26.080 |
D dot data plus equals h right here, so we've added a little h, and then we see how L 00:37:35.040 |
changed and we expect to print negative two there we go so we've numerically verified what we're 00:37:45.760 |
doing here is kind of like an inline gradient check gradient check is when we are deriving 00:37:50.720 |
this like back propagation and getting the derivative with respect to all the intermediate 00:37:54.720 |
results and then numerical gradient is just you know um estimating it using small step size now 00:38:01.760 |
we're getting to the crux of back propagation so this will be the most important node to understand 00:38:08.320 |
because if you understand the gradient for this node 00:38:12.720 |
you understand all of back propagation and 00:38:17.040 |
all of training of neural nets basically so we need to derive D L by D C in other words the 00:38:23.920 |
derivative of L with respect to C because we've computed all these other gradients already now 00:38:29.840 |
we're coming here and we're continuing the back propagation manually so we want D L by D C and 00:38:35.840 |
we'll also derive D L by D E now here's the problem how do we derive D L by D C we actually 00:38:44.640 |
know the derivative L with respect to D so we know how L is sensitive to D but how is L sensitive to 00:38:52.240 |
C so if we wiggle C how does that impact L through D so we know dL by dD 00:39:01.920 |
and we also here know how C impacts D and so just very intuitively if you know the 00:39:06.800 |
impact that C is having on D and the impact that D is having on L then you should be able to somehow 00:39:12.720 |
put that information together to figure out how C impacts L and indeed this is what we can actually 00:39:18.480 |
do so in particular we know just concentrating on D first let's look at how what is the derivative 00:39:24.400 |
basically of D with respect to C so in other words what is D D by D C so here we know that D 00:39:33.040 |
is C plus E, that's what we know, and now we're interested in D D by D C if you just know 00:39:40.400 |
your calculus again and you remember then differentiating C plus E with respect to C 00:39:44.880 |
you know that that gives you 1.0 and we can also go back to the basics and derive this 00:39:50.880 |
because again we can go to our f of x plus h minus f of x divide by h that's the definition 00:39:57.280 |
of a derivative as h goes to zero and so here focusing on C and its effect on D we can basically 00:40:05.120 |
do the f of x plus h, which will be C incremented by h, plus E, that's the first evaluation of our 00:40:12.160 |
function minus C plus E and then divide h and so what is this just expanding this out this will be 00:40:21.840 |
C plus h plus E minus C minus E, divide h, and then you see here how C minus C cancels, E minus E 00:40:29.680 |
cancels we're left with h over h which is 1.0 and so by symmetry also D D by D E will be 1.0 as well 00:40:42.000 |
so basically the derivative of a sum expression is very simple and this is the local derivative 00:40:48.240 |
so I call this the local derivative because we have the final output value all the way at the 00:40:52.480 |
end of this graph and we're now like a small node here and this is a little plus node and 00:40:58.320 |
it the little plus node doesn't know anything about the rest of the graph that it's embedded 00:41:02.880 |
in all it knows is that it did plus it took a C and an E added them and created D and this plus 00:41:09.760 |
node also knows the local influence of C on D or rather the derivative of D with respect to C 00:41:16.000 |
and it also knows the derivative of D with respect to E but that's not what we want that's just the 00:41:21.520 |
local derivative what we actually want is D L by D C and L is here just one step away 00:41:29.120 |
but in the general case this little plus node could be embedded in like a massive graph 00:41:34.560 |
so again we know how L impacts D and now we know how C and E impact D how do we put that 00:41:41.360 |
information together to write D L by D C and the answer of course is the chain rule in calculus 00:41:46.880 |
and so I pulled up chain rule here from Wikipedia and I'm going to go through this very briefly 00:41:55.520 |
so chain rule Wikipedia sometimes can be very confusing and calculus can 00:42:00.480 |
be very confusing like this is the way I learned chain rule and it was very confusing like what 00:42:06.960 |
is happening it's just complicated so I like this expression much better 00:42:11.200 |
if a variable Z depends on a variable Y which itself depends on a variable X 00:42:16.960 |
then Z depends on X as well obviously through the intermediate variable Y 00:42:21.680 |
and in this case the chain rule is expressed as if you want D Z by D X then you take the D Z by D Y 00:42:30.400 |
and you multiply it by D Y by D X so the chain rule fundamentally is telling you how we chain 00:42:38.240 |
these derivatives together correctly so to differentiate through a function composition 00:42:45.440 |
we have to apply a multiplication of those derivatives, so that's really what the chain rule is telling us, written out below 00:42:53.360 |
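In the notation of that Wikipedia passage (z depends on y, which depends on x), the rule is simply:

```latex
\[
\frac{dz}{dx} \;=\; \frac{dz}{dy} \cdot \frac{dy}{dx}
\]
```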
and there's a nice little intuitive explanation here which I also think is kind of cute 00:42:59.120 |
the chain rule states that knowing the instantaneous rate of change of Z with respect to Y 00:43:02.880 |
and Y relative to X allows one to calculate the instantaneous rate of change of Z relative to X 00:43:07.120 |
as a product of those two rates of change simply the product of those two so here's a good one if 00:43:14.480 |
a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man 00:43:19.200 |
then the car travels two times four eight times as fast as the man and so this makes it very clear 00:43:26.800 |
that the correct thing to do sort of is to multiply so car is twice as fast as bicycle 00:43:33.680 |
and bicycle is four times as fast as man so the car will be eight times as fast as the man 00:43:39.440 |
and so we can take these intermediate rates of change if you will and multiply them together 00:43:45.760 |
and that justifies the chain rule intuitively so have a look at chain rule but here really what 00:43:52.560 |
it means for us is there's a very simple recipe for deriving what we want which is dL by dc 00:43:58.000 |
and what we have so far is: we want dL by dc, and we know what is the impact of d on L, so we know dL 00:44:11.760 |
by dd, the derivative of L with respect to d, we know that that's negative two and now because of 00:44:18.240 |
this local reasoning that we've done here we know dd by dc so how does c impact d and in particular 00:44:28.000 |
this is a plus node so the local derivative is simply 1.0 it's very simple and so the chain 00:44:34.720 |
rule tells us that dL by dc going through this intermediate variable will just be simply dL by 00:44:44.000 |
dd times dd by dc that's chain rule so this is identical to what's happening here except 00:44:57.120 |
z is our L y is our d and x is our c so we literally just have to multiply these and because 00:45:10.320 |
these local derivatives like dd by dc are just one we basically just copy over dL by dd because 00:45:17.680 |
this is just times one so what does it do so because dL by dd is negative two what is dL by dc 00:45:24.640 |
well it's the local gradient 1.0 times dL by dd which is negative two so literally what a plus 00:45:32.880 |
node does you can look at it that way is it literally just routes the gradient because 00:45:37.760 |
the plus nodes local derivatives are just one and so in the chain rule one times dL by dd 00:45:44.400 |
is just dL by dd and so that derivative just gets routed to both c and to e in this case 00:45:54.560 |
so basically we have that e dot grad or let's start with c since that's the one we've looked at 00:46:02.480 |
is negative two times one negative two and in the same way by symmetry e dot grad will be negative 00:46:12.080 |
two that's the claim so we can set those we can redraw and you see how we just assign negative 00:46:20.880 |
two negative two so this back propagating signal which is carrying the information of like what is 00:46:25.680 |
the derivative of L with respect to all the intermediate nodes we can imagine it almost 00:46:30.080 |
like flowing backwards through the graph and a plus node will simply distribute the derivative 00:46:34.800 |
to all the leaf nodes sorry to all the children nodes of it so this is the claim and now let's 00:46:40.480 |
verify it so let me remove the plus h here from before and now instead what we're going to do is 00:46:46.400 |
we want to increment c so c dot data will be incremented by h and when I run this we expect 00:46:52.080 |
to see negative two negative two and then of course for e so e dot data plus equals h and 00:47:00.960 |
we expect to see negative two simple so those are the derivatives of these internal nodes 00:47:10.160 |
and now we're going to recurse our way backwards again and we're again going to apply the chain 00:47:17.200 |
rule so here we go our second application of chain rule and we will apply it all the way 00:47:21.920 |
through the graph we just happen to only have one more node remaining we have that dl by de 00:47:27.840 |
as we have just calculated is negative two so we know that so we know the derivative of l with 00:47:33.920 |
respect to e and now we want dl by da right and the chain rule is telling us that that's just dl by de 00:47:46.800 |
times the local gradient so what is the local gradient basically de by da we have to look at 00:47:58.000 |
that so i'm a little times node inside a massive graph and i only know that i did a times b and i 00:48:06.960 |
produced an e so now what is de by da and de by db that's the only thing that i sort of know about 00:48:14.400 |
that's my local gradient so because we have that e is a times b we're asking what is de by da 00:48:22.480 |
and of course we just did that here we had a times so i'm not going to re-derive it but if you want 00:48:31.040 |
to differentiate this with respect to a you'll just get b right the value of b which in this case 00:48:37.680 |
is negative 3.0 so basically we have that dl by da well let me just do it right here we have that 00:48:47.520 |
a dot grad and we are applying chain rule here is dl by de which we see here is negative two 00:48:54.800 |
times what is de by da it's the value of b which is negative three that's it 00:49:05.680 |
and then we have b dot grad is again dl by de which is negative two just the same way times 00:49:13.440 |
what is de by db, it's the value of a which is 2.0, that's the value of a, so these are our claimed 00:49:26.080 |
derivatives let's redraw and we see here that a dot grad turns out to be six because that is 00:49:33.440 |
negative two times negative three and b dot grad is negative four times sorry is negative two times 00:49:40.080 |
two which is negative four so those are our claims let's delete this and let's verify them 00:49:47.680 |
we have a here a dot data plus equals h so the claim is that a dot grad is six let's verify 00:49:59.360 |
six and we have b dot data plus equals h so nudging b by h and looking at what happens 00:50:10.320 |
we claim it's negative four and indeed it's negative four plus minus again float oddness 00:50:16.080 |
and that's it this that was the manual back propagation all the way from here to all the 00:50:26.880 |
leaf nodes and we've done it piece by piece and really all we've done is as you saw we iterated 00:50:32.400 |
through all the nodes one by one and locally applied the chain rule 00:50:37.280 |
we always know what is the derivative of l with respect to 00:50:42.880 |
this little output and then we look at how this output was produced this output was produced 00:50:47.600 |
through some operation and we have the pointers to the children nodes of this operation and so 00:50:53.200 |
in this little operation we know what the local derivatives are and we just multiply them onto 00:50:57.920 |
the derivative always so we just go through and recursively multiply on the local derivatives 00:51:04.080 |
and that's what back propagation is: it's just a recursive application of the chain rule backwards through the computation graph (the gradients we derived by hand are collected in the sketch below) 00:51:08.160 |
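Collected in one place, the manual backward pass for this little graph looks like this (a sketch, filling in .grad by hand with the chain rule, using the a, b, c, e, d, f, L from the forward pass above):

```python
L.grad = 1.0              # dL/dL
f.grad = d.data           # dL/df = d           =  4.0
d.grad = f.data           # dL/dd = f           = -2.0
c.grad = d.grad * 1.0     # the plus node routes the gradient: dd/dc = 1, so -2.0
e.grad = d.grad * 1.0     # dd/de = 1, so -2.0
a.grad = e.grad * b.data  # de/da = b  =>  -2 * -3 =  6.0
b.grad = e.grad * a.data  # de/db = a  =>  -2 *  2 = -4.0
```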
let's see this power in action just very briefly, what 00:51:13.680 |
we're going to do is we're going to nudge our inputs to try to make l go up so in particular 00:51:20.560 |
what we're doing is we want a dot data we're going to change it and if we want l to go up 00:51:26.000 |
that means we just have to go in the direction of the gradient so a should increase in the 00:51:31.840 |
direction of gradient by like some small step amount this is the step size and we don't just 00:51:37.040 |
want this for a but also for b, also for c, also for f, those are leaf nodes which we usually have 00:51:49.280 |
control over and if we nudge in direction of the gradient we expect a positive influence on l 00:51:55.680 |
so we expect l to go up positively so it should become less negative it should go up to say 00:52:02.640 |
negative you know six or something like that it's hard to tell exactly and we'd have to 00:52:08.240 |
rerun the forward pass so let me just do that here this would be the forward pass f would be 00:52:18.960 |
unchanged this is effectively the forward pass and now if we print l dot data we expect because 00:52:26.080 |
we nudged all the values all the inputs in the direction of gradient we expected less negative 00:52:30.640 |
l we expect it to go up so maybe it's negative six or so let's see what happens okay negative seven 00:52:37.520 |
and this is basically one step of an optimization that will end up running and really this gradient 00:52:45.120 |
just gives us some power because we know how to influence the final outcome, and this will be extremely useful for training neural nets, as we'll see (a sketch of this nudge-and-rerun step is below) 00:52:49.760 |
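A sketch of that single nudge-and-rerun step, using the gradients filled in above:

```python
step = 0.01
# nudge the leaf nodes in the direction of their gradients
a.data += step * a.grad
b.data += step * b.grad
c.data += step * c.grad
f.data += step * f.grad

# rerun the forward pass with the nudged leaves
e = a * b
d = e + c
L = d * f
print(L.data)   # slightly less negative than -8.0 (about -7.3 with these numbers)
```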
so now i would like to do one more example 00:52:56.080 |
of manual back propagation using a bit more complex and useful example we are going to 00:53:02.960 |
back propagate through a neuron so we want to eventually build up neural networks and in the 00:53:10.160 |
simplest case these are multi-layer perceptrons as they're called so this is a two-layer neural 00:53:14.800 |
net and it's got these hidden layers made up of neurons and these neurons are fully connected to 00:53:19.040 |
each other now biologically neurons are very complicated devices but we have very simple 00:53:23.920 |
mathematical models of them and so this is a very simple mathematical model of a neuron 00:53:29.360 |
you have some inputs x's and then you have these synapses that have weights on them so 00:53:35.600 |
the w's are weights and then the synapse interacts with the input to this neuron 00:53:43.360 |
multiplicatively so what flows to the cell body of this neuron is w times x but there's multiple 00:53:50.720 |
inputs so there's many w times x's flowing to the cell body the cell body then has also like some 00:53:56.960 |
bias so this is kind of like the innate sort of trigger happiness of this neuron so this bias 00:54:04.240 |
can make it a bit more trigger happy or a bit less trigger happy regardless of the input but basically 00:54:08.960 |
we're taking all the w times x of all the inputs adding the bias and then we take it through an 00:54:15.040 |
activation function and this activation function is usually some kind of a squashing function 00:54:19.760 |
like a sigmoid or tanh or something like that so as an example we're going to use the tanh in this 00:54:26.320 |
example numpy has an np.tanh so we can call it on an arange and we can plot it, this is the tanh function 00:54:38.080 |
and you see that the inputs as they come in get squashed on the y coordinate here so 00:54:44.000 |
right at zero we're going to get exactly zero and then as you go more positive in the input 00:54:50.080 |
then you'll see that the function will only go up to one and then plateau out and so if you pass in 00:54:57.200 |
very positive inputs we're going to cap it smoothly at one and on the negative side we're going to cap 00:55:02.320 |
it smoothly to negative one so that's tanh and that's the squashing function or an activation 00:55:09.040 |
function and what comes out of this neuron is just the activation function applied to the 00:55:14.000 |
dot product of the weights and the inputs so let's write one out i'm going to copy paste because 00:55:27.360 |
i don't want to type too much but okay so here we have the inputs x1 x2 so this is a two-dimensional 00:55:33.360 |
neuron so two inputs are going to come in these are thought of as the weights of this neuron 00:55:38.160 |
weights w1 w2 and these weights again are the synaptic strengths for each input and this is 00:55:45.600 |
the bias of the neuron b and now what we want to do is according to this model we need to multiply 00:55:53.440 |
x1 times w1 and x2 times w2 and then we need to add bias on top of it and it gets a little messy 00:56:02.720 |
here but all we are trying to do is x1 w1 plus x2 w2 plus b and these are multiplied here except 00:56:10.320 |
i'm doing it in small steps so that we actually have pointers to all these intermediate nodes 00:56:15.040 |
so we have an x1 w1 variable and then an x2 w2 variable and i'm also labeling them so n is now 00:56:23.760 |
the cell body raw activation without the activation function for now 00:56:29.680 |
and this should be enough to basically plot it so draw dot of n 00:56:34.560 |
gives us x1 times w1 x2 times w2 being added then the bias gets added on top of this and this n 00:56:47.280 |
is this sum so we're now going to take it through an activation function and let's say we use the 00:56:53.440 |
tan h so that we produce the output so what we'd like to do here is we'd like to do the output 00:56:59.600 |
and i'll call it o is n dot tan h okay but we haven't yet written the tan h now the reason 00:57:08.880 |
that we need to implement another tan h function here is that tan h is a hyperbolic function and 00:57:16.000 |
we've only so far implemented plus and the times and you can't make a tan h out of just pluses and 00:57:20.960 |
times you also need exponentiation so tan h is this kind of a formula here you can use either 00:57:27.840 |
one of these and you see that there's exponentiation involved which we have not implemented yet for 00:57:32.560 |
our little value node here so we're not going to be able to produce tanh yet and we have to go back 00:57:36.800 |
up and implement something like it now one option here is we could actually implement 00:57:44.880 |
exponentiation right and we could return the exp of a value instead of a tan h of a value 00:57:51.760 |
because if we had exp then we have everything else that we need so because we know how to add 00:57:58.560 |
and we know how to multiply so we'd be able to create tanh 00:58:04.960 |
if we knew how to exp but for the purposes of this example i specifically wanted to 00:58:10.000 |
show you that we don't necessarily need to have the most atomic pieces in this value object 00:58:18.160 |
we can actually like create functions at arbitrary points of abstraction they can be complicated 00:58:25.200 |
functions but they can be also very very simple functions like a plus and it's totally up to us 00:58:30.000 |
the only thing that matters is that we know how to differentiate through any one function 00:58:34.160 |
so we take some inputs and we make an output the only thing that matters it can be arbitrarily 00:58:38.560 |
complex function as long as you know how to create the local derivative if you know the 00:58:43.360 |
local derivative of how the inputs impact the output then that's all you need so we're going 00:58:47.760 |
to cluster up all of this expression and we're not going to break it down to its atomic pieces 00:58:53.200 |
we're just going to directly implement tanh so let's do that def tanh and then out will be a 00:59:00.320 |
value of and we need this expression here so um let me actually copy paste 00:59:11.040 |
let's grab n which is self.data and then this i believe is the tanh math.exp of 00:59:24.560 |
two x minus one over math.exp of two x plus one maybe i can call this x just so that it matches exactly 00:59:34.320 |
okay and now this will be t and uh children of this node there's just one child and i'm 00:59:44.800 |
wrapping it in a tuple so this is a tuple of one object just self and here the name of this 00:59:50.240 |
operation will be tan h and we're going to return that okay so now value should be 01:00:00.640 |
implementing tan h and now we can scroll all the way down here and we can actually do n dot tan h 01:00:06.560 |
and that's going to return the tan h output of n and now we should be able to draw dot of o 01:00:13.360 |
not of n so let's see how that worked there we go n went through tan h to produce this output 01:00:23.200 |
so now tan h is a sort of uh our little micro grad supported node here as an operation 01:00:31.120 |
and as long as we know the derivative of tan h then we'll be able to back propagate through it 01:00:38.240 |
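A minimal sketch of that tanh method, assuming the Value class built earlier in the lecture (with a data field, a children tuple and an _op label):

```python
import math

# inside the Value class (sketch)
def tanh(self):
    x = self.data
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)  # tanh(x)
    out = Value(t, (self,), 'tanh')
    return out
```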
now let's see this tan h in action currently it's not squashing too much because the input to it is 01:00:43.600 |
pretty low so if the bias is increased to say eight then we'll see that what's flowing into the tanh 01:00:51.680 |
now is two and tan h is squashing it to 0.96 so we're already hitting the tail of this tan h and 01:00:59.920 |
it will sort of smoothly go up to one and then plateau out over there okay so now i'm going to 01:01:04.240 |
do something slightly strange i'm going to change this bias from eight to this number 6.88 etc 01:01:11.840 |
and i'm going to do this for specific reasons because we're about to start back propagation 01:01:16.400 |
and i want to make sure that our numbers come out nice they're not like very crazy numbers 01:01:21.600 |
they're nice numbers that we can sort of understand in our head let me also add those 01:01:25.840 |
labels o is short for output here so that's the o okay so 0.88 flows into tanh comes out 0.7 01:01:35.120 |
so on so now we're going to do back propagation and we're going to fill in all the gradients 01:01:40.240 |
so what is the derivative o with respect to all the inputs here and of course in a typical neural 01:01:47.280 |
network setting what we really care about the most is the derivative of these neurons on the 01:01:52.720 |
weights specifically the w2 and w1 because those are the weights that we're going to be changing 01:01:57.760 |
part of the optimization and the other thing that we have to remember is here we have only a single 01:02:02.240 |
neuron but in the neural net you typically have many neurons and they're connected 01:02:07.120 |
so this is only like a one small neuron a piece of a much bigger puzzle and eventually there's a 01:02:11.280 |
loss function that sort of measures the accuracy of the neural net and we're back propagating with 01:02:15.440 |
respect to that accuracy and trying to increase it so let's start off back propagation here and 01:02:21.280 |
what is the derivative of o with respect to o the base case sort of we know always is that 01:02:27.920 |
the gradient is just 1.0 so let me fill it in and then let me split out the drawing function here 01:02:40.320 |
and then here cell clear this output here okay so now when we draw o we'll see that 01:02:52.320 |
o that grad is 1 so now we're going to back propagate through the tan h so to back propagate 01:02:57.840 |
through tan h we need to know the local derivative of tan h so if we have that o is tan h of n 01:03:07.200 |
then what is do by dn now what you could do is you could come here and you could take this 01:03:14.640 |
expression and you could do your calculus derivative taking and that would work but 01:03:20.880 |
we can also just scroll down wikipedia here into a section that hopefully tells us that derivative 01:03:27.520 |
d by dx of tan h of x is any of these i like this one 1 minus tan h square of x 01:03:34.480 |
so this is 1 minus tan h of x squared so basically what this is saying is that do by dn 01:03:43.760 |
is 1 minus tan h of n squared and we already have tan h of n it's just o so it's 1 minus o squared 01:03:55.760 |
so o is the output here so the output is this number o dot data is this number and then 01:04:08.080 |
what this is saying is that do by dn is 1 minus this squared so 1 minus o dot data squared 01:04:14.880 |
is 0.5 conveniently so the local derivative of this tan h operation here is 0.5 01:04:23.120 |
and so that would be do by dn so we can fill in that n dot grad 01:04:35.440 |
so this is exactly 0.5 one half so now we're going to continue the back propagation 01:04:47.680 |
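As a quick standalone check of that arithmetic (not from the notebook, just illustrating the formula):

```python
import math

o = math.tanh(0.8814)   # n is about 0.8814 in this example
print(1 - o**2)         # local derivative do/dn, about 0.5
```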
this is 0.5 and this is a plus node so how is backprop going to what is backprop going to do 01:04:55.680 |
here and if you remember our previous example a plus is just a distributor of gradient so this 01:05:02.160 |
gradient will simply flow to both of these equally and that's because the local derivative of this 01:05:06.960 |
operation is one for every one of its nodes so one times 0.5 is 0.5 so therefore we know that 01:05:14.880 |
this node here its grad is just 0.5 and we know that b dot grad is also 0.5 01:05:14.880 |
so let's set those and let's draw so those are 0.5 continuing we have another plus 0.5 again 01:05:33.280 |
we'll just distribute so 0.5 will flow to both of these so we can set their grads 01:05:39.520 |
x1 w1 and x2 w2 to 0.5 and let's redraw pluses are my favorite uh operations to back 01:05:51.280 |
propagate through because it's very simple so now what's flowing into these expressions is 0.5 01:05:57.760 |
and so really again keep in mind what the derivative is telling us at every point in time 01:06:01.360 |
along here this is saying that if we want the output of this neuron to increase 01:06:06.480 |
then the influence on these expressions is positive on the output both of them are positive 01:06:20.480 |
so now back propagating to x2 w2 first this is a times node so we know that the local derivative 01:06:27.120 |
is you know the other term so if we want to calculate x2 dot grad 01:06:31.440 |
then can you think through what it's going to be 01:06:35.120 |
so x2 dot grad will be w2 dot data times this x2 w2 dot grad right 01:06:49.840 |
and w2 dot grad will be x2 dot data times x2 w2 dot grad 01:06:57.840 |
right so that's the little local piece of chain rule 01:07:03.040 |
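Written out explicitly, using the variable names the notebook is assumed to use, that little piece of chain rule is:

```python
# x2w2 = x2 * w2, so locally d(x2w2)/dx2 = w2.data and d(x2w2)/dw2 = x2.data;
# each local derivative is multiplied by the gradient flowing in from above
x2.grad = w2.data * x2w2.grad   # 1.0 * 0.5 = 0.5
w2.grad = x2.data * x2w2.grad   # 0.0 * 0.5 = 0.0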
let's set them and let's redraw so here we see that the gradient on our weight 01:07:10.560 |
2 is 0 because x2's data was 0 right but x2 will have the gradient 0.5 because data here was 1 01:07:18.560 |
and so what's interesting here right is because the input x2 was 0 then because of the way the 01:07:25.440 |
times works of course this gradient will be 0 and think about intuitively why that is 01:07:30.640 |
derivative always tells us the influence of this on the final output if i wiggle w2 how is the 01:07:39.200 |
output changing it's not changing because we're multiplying by 0 so because it's not changing 01:07:44.400 |
there is no derivative and 0 is the correct answer because we're squashing that 0 01:07:49.600 |
and let's do it here 0.5 should come here and flow through this times and so we'll have that 01:07:58.240 |
x1 dot grad is can you think through a little bit what what this should be 01:08:06.160 |
the local derivative of times with respect to x1 is going to be w1 so w1's data times 01:08:14.240 |
x1 w1 dot grad and w1 dot grad will be x1 dot data times x1 w1 dot grad 01:08:25.760 |
let's see what those came out to be so this is 0.5 so this would be negative 1.5 and this would be 01:08:33.360 |
1 and we've back propagated through this expression these are the actual final derivatives 01:08:39.040 |
so if we want this neuron's output to increase we know that what's necessary is that 01:08:45.440 |
w2 we have no gradient w2 doesn't actually matter to this neuron right now but this neuron this 01:08:52.640 |
weight should go up so if this weight goes up then this neuron's output would have gone up 01:08:59.360 |
and proportionally because the gradient is 1 okay so doing the back propagation manually is 01:09:04.080 |
obviously ridiculous so we are now going to put an end to this suffering and we're going to see 01:09:08.560 |
how we can implement the backward pass a bit more automatically we're not going to be doing all of 01:09:13.280 |
it manually out here it's now pretty obvious to us by example how these pluses and times are back 01:09:18.560 |
propagating gradients so let's go up to the value object and we're going to start codifying what 01:09:25.520 |
we've seen in the examples below so we're going to do this by storing a special self.backward 01:09:32.720 |
and underscore backward and this will be a function which is going to do that little piece 01:09:39.920 |
of chain rule at each little node that took inputs and produced an output we're going 01:09:45.520 |
to store how we are going to chain the outputs gradient into the inputs gradients so by default 01:09:54.000 |
this will be a function that doesn't do anything so and you can also see that here in the value in 01:10:01.600 |
micro grad so we have this backward function by default doesn't do anything this is an empty 01:10:09.040 |
function and that would be sort of the case for example for a leaf node for leaf node there's 01:10:13.360 |
nothing to do but now if when we're creating these out values these out values are an addition of 01:10:22.160 |
self and other and so we will want to set out's backward to be the function that 01:10:30.560 |
propagates the gradient so let's define what should happen 01:10:37.200 |
and we're going to store it in a closure let's define what should happen when we call outs grad 01:10:47.760 |
for addition our job is to take out's grad and propagate it into self's grad and other's grad 01:10:56.960 |
so basically we want to set self dot grad to something and we want to set other dot grad 01:11:02.720 |
to something okay and the way we saw below how chain rule works we want to take the local derivative 01:11:10.720 |
times the um sort of global derivative i should call it which is the derivative of the final 01:11:16.240 |
output of the expression with respect to outs data with respect to out so the local derivative 01:11:25.680 |
of self in an addition is 1.0 so it's just 1.0 times outs grad that's the chain rule 01:11:35.120 |
and other dot grad will be 1.0 times out grad and what you're basically seeing here 01:11:40.960 |
is that out's grad will simply be copied onto self's grad and other's grad as we saw happens 01:11:47.760 |
for an addition operation so we're going to later call this function to propagate the gradient 01:11:53.440 |
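A sketch of what __add__ looks like with this _backward closure attached; note the plain assignments, which is what the lecture uses at this point (they become += later when the accumulation bug is fixed):

```python
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        # local derivative of addition is 1.0 for each input,
        # so out's gradient is simply passed through
        self.grad = 1.0 * out.grad
        other.grad = 1.0 * out.grad
    out._backward = _backward

    return out
```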
having done an addition let's now do multiplication we're going to also define backward 01:11:59.280 |
and we're going to set its backward to be backward 01:12:07.760 |
and we want to chain out grad into self dot grad 01:12:11.840 |
and others dot grad and this will be a little piece of chain rule for multiplication 01:12:19.600 |
so we'll have so what should this be can you think through 01:12:23.920 |
so what is the local derivative here the local derivative was other dot data 01:12:35.360 |
and then oops other dot data and then times out dot grad that's chain rule 01:12:40.960 |
and here we have self dot data times out dot grad that's what we've been doing 01:12:52.320 |
and then we want to set outs backwards to be just backward 01:13:00.480 |
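And the corresponding sketch for __mul__ (again with plain assignments for now):

```python
def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        # local derivative of a*b is b with respect to a, and a with respect to b
        self.grad = other.data * out.grad
        other.grad = self.data * out.grad
    out._backward = _backward

    return out
```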
and here for the tanh we need to back propagate we have out dot grad and we want to chain it into self dot grad 01:13:07.440 |
and self dot grad will be the local derivative of this operation that we've done here which is 01:13:14.880 |
tan h and so we saw that the local gradient is 1 minus the tan h of x squared which here is t 01:13:22.000 |
that's the local derivative because that's t is the output of this tan h 01:13:27.360 |
so 1 minus t squared is the local derivative and then gradient has to be multiplied because of the 01:13:33.840 |
chain rule so outs grad is chained through the local gradient into self dot grad and that should 01:13:40.000 |
be basically it so we're going to redefine our value node we're going to swing all the way down 01:13:46.160 |
here and we're going to redefine our expression make sure that all the grads are zero okay 01:13:56.240 |
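For reference, the tanh method with the _backward just derived looks roughly like this (still plain assignment at this stage):

```python
import math

def tanh(self):
    x = self.data
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        # local derivative of tanh is 1 - tanh(x)**2, and t is exactly tanh(x)
        self.grad = (1 - t**2) * out.grad
    out._backward = _backward

    return out
```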
but now we don't have to do this manually anymore we are going to basically be calling 01:14:01.120 |
the dot backward in the right order so first we want to call outs dot backward 01:14:09.040 |
so o was the outcome of tanh right so calling o's backward 01:14:22.160 |
will be this function this is what it will do now we have to be careful because 01:14:28.480 |
there's a times out dot grad and out dot grad remember is initialized to zero 01:14:34.640 |
so here we see grad zero so as a base case we need to set o dot grad to 1.0 01:14:46.640 |
to initialize this with one and then once this is one we can call o dot backward 01:14:56.560 |
and what that should do is it should propagate this grad through tan h so the local derivative 01:15:03.360 |
times the global derivative which is initialized at one so this should 01:15:16.000 |
dope so i thought about redoing it but i figured i should just leave the error in here because it's 01:15:21.120 |
pretty funny why is nonetype object not callable it's because i screwed up we're trying to save 01:15:28.880 |
these functions so this is correct this here you don't want to call the function because that 01:15:35.040 |
returns none these functions return none we just want to store the function so let me redefine the 01:15:40.640 |
value object and then we're going to come back in redefine the expression draw dot everything is 01:15:47.120 |
great o dot grad is one and now this should work of course okay so o dot underscore 01:15:56.960 |
backward should make this grad now be 0.5 if we redraw and if everything went correctly 0.5 yay 01:16:07.760 |
n dot backward sorry n dot underscore backward so that seems to have worked 01:16:15.600 |
so n's underscore backward routed the gradient to both of these so this is looking great 01:16:23.040 |
now we could of course call uh b dot grad b dot backward sorry what's going to happen 01:16:32.000 |
well b doesn't have a backward because b is a leaf node b's backward is by 01:16:38.960 |
initialization the empty function so nothing would happen but we can still call it on it 01:16:50.720 |
then we expect this 0.5 to get further routed right so there we go 0.5 0.5 01:17:00.960 |
and then finally we want to call it here on x2 w2 01:17:11.840 |
let's do both of those and there we go so we get 0.5 negative 1.5 and 1 exactly as we did before 01:17:25.360 |
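In other words, the manual sequence just performed was roughly the following, assuming the notebook's variable names:

```python
o.grad = 1.0          # base case
o._backward()         # fills in n.grad
n._backward()         # routes the gradient to x1w1x2w2 and b
b._backward()         # leaf node: does nothing
x1w1x2w2._backward()  # routes the gradient to x1w1 and x2w2
x2w2._backward()      # fills in x2.grad and w2.grad
x1w1._backward()      # fills in x1.grad and w1.grad
```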
but now we've done it through calling that backward sort of manually so we have the one 01:17:33.520 |
last piece to get rid of which is us calling underscore backward manually so let's think 01:17:38.640 |
through what we are actually doing we've laid out a mathematical expression and now we're trying to 01:17:43.920 |
go backwards through that expression so going backwards through the expression just means that 01:17:49.840 |
we never want to call a dot backward for any node before we've done sort of everything after it 01:17:58.800 |
so we have to do everything after it before we're ever going to call dot backward on any one node 01:18:03.840 |
we have to get all of its full dependencies everything that it depends on has to propagate 01:18:08.960 |
to it before we can continue back propagation so this ordering of graphs can be achieved using 01:18:15.680 |
something called topological sort so topological sort is basically a laying out of a graph such 01:18:23.280 |
that all the edges go only from left to right basically so here we have a graph it's a 01:18:28.720 |
directed acyclic graph a dag and these are two different topological orders of it i believe 01:18:35.520 |
where basically you'll see that it's a laying out of the nodes such that all the edges go only one 01:18:39.520 |
way from left to right and implementing topological sort you can look in wikipedia and so on i'm not 01:18:46.240 |
going to go through it in detail but basically this is what builds a topological graph we maintain a 01:18:54.960 |
set of visited nodes and then we are going through starting at some root node which for us is o 01:19:03.040 |
that's where we want to start the topological sort and starting at o we go through all of its children 01:19:08.720 |
and we need to lay them out from left to right and basically this starts at o if it's not visited 01:19:16.080 |
then it marks it as visited and then it iterates through all of its children and calls build 01:19:22.080 |
topological on them and then after it's gone through all the children it adds itself so basically 01:19:29.040 |
this node that we're going to call it on like say o is only going to add itself to the topo list 01:19:36.480 |
after all of the children have been processed and that's how this function is guaranteeing that 01:19:42.160 |
you're only going to be in the list once all your children are in the list and that's the invariant 01:19:46.480 |
that is being maintained so if we built up on o and then inspect this list we're going to see that 01:19:53.120 |
it ordered our value objects and the last one is the value of 0.707 which is the output 01:20:01.760 |
so this is o and then this is n and then all the other nodes get laid out before it 01:20:08.000 |
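The build_topo function being described is roughly this, assuming each Value keeps its children in _prev:

```python
topo = []
visited = set()

def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)   # a node is added only after all of its children

build_topo(o)
# topo now ends with o; reversed(topo) goes from the output back to the leaves
```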
so that builds the topological graph and really what we're doing now is we're just calling 01:20:14.720 |
dot underscore backward on all of the nodes in a topological order so if we just reset the 01:20:21.360 |
gradients they're all zero what did we do we started by setting o dot grad to be one 01:20:30.400 |
that's the base case then we built a topological order 01:20:35.920 |
and then we went for node in reversed of topo in the reverse order because this list goes from 01:20:50.800 |
you know we need to go through it in reversed order so starting at o node backward and this 01:20:59.040 |
should be it there we go those are the correct derivatives finally we are going to hide this 01:21:08.880 |
functionality so i'm going to copy this and we're going to hide it inside the value class because 01:21:15.200 |
we don't want to have all that code lying around so instead of an underscore backward we're now 01:21:20.160 |
going to define an actual backward so that backward without the underscore and that's 01:21:26.640 |
going to do all the stuff that we just derived so let me just clean this up a little bit so 01:21:33.360 |
build a topological graph starting at self so build topo of self will populate the topological 01:21:46.000 |
order into the topo list which is a local variable then we set self dot grad to be one 01:21:51.440 |
and then for each node in the reversed list so starting at us and going to all the children 01:21:58.160 |
underscore backward and that should be it so save come down here we define 01:22:11.120 |
okay all the grads are zero and now what we can do is o dot backward without the underscore and 01:22:17.360 |
there we go and that's uh that's back propagation at least for one neuron 01:22:24.960 |
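Putting it together, the backward method tucked into Value is roughly:

```python
def backward(self):
    # build the topological order starting at self
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # base case, then apply the chain rule from the output back to the leaves
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()
```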
now we shouldn't be too happy with ourselves actually because we have a 01:22:32.000 |
bad bug and we have not surfaced the bug because of some specific conditions that we have 01:22:37.840 |
to think about right now so here's the simplest case that shows the bug say i create a single node a 01:22:45.760 |
and then i create a b that is a plus a and then i call backward 01:22:52.480 |
so what's going to happen is a is three and then a b is a plus a so there's two arrows on top of 01:23:00.960 |
each other here then we can see that b is of course the forward pass works b is just a plus a 01:23:08.640 |
which is six but the gradient here is not actually correct that we calculate it automatically 01:23:14.240 |
and that's because um of course uh just doing calculus in your head the derivative of b with 01:23:23.200 |
respect to a should be uh two one plus one not the one we're getting so intuitively what's happening here right 01:23:32.240 |
so b is the result of a plus a and then we call backward on it so let's go up and see what that 01:23:38.800 |
does um b is the result of addition so out is b and then when we call backward what happened is 01:23:50.880 |
self dot grad was set to one and then other dot grad was set to one but because we're doing a plus 01:23:59.120 |
a self and other are actually the exact same object so we are overriding the gradient we are 01:24:06.000 |
setting it to one and then we are setting it again to one and that's why it stays at one 01:24:12.160 |
so that's a problem there's another way to see this in a little bit more complicated expression 01:24:18.320 |
so here we have a and b and then d will be the multiplication of the two and e will be the 01:24:30.320 |
addition of the two and then we multiply e times d to get f and then we call f dot backward 01:24:36.560 |
and these gradients if you check will be incorrect so fundamentally what's happening here again is 01:24:45.040 |
basically we're going to see an issue any time we use a variable more than once 01:24:48.560 |
until now in these expressions above every variable is used exactly once so we didn't 01:24:53.520 |
see the issue but here if a variable is used more than once what's going to happen during 01:24:57.920 |
backward pass we're back propagating from f to e to d so far so good but now e calls a backward 01:25:04.720 |
and it deposits its gradients to a and b but then we come back to d and call backward and it overwrites 01:25:11.520 |
those gradients at a and b so that's obviously a problem and the solution here if you look at 01:25:19.280 |
the multivariate case of the chain rule and its generalization there the solution there is 01:25:25.040 |
basically that we have to accumulate these gradients these gradients add and so instead of 01:25:31.600 |
setting those gradients we can simply do plus equals we need to accumulate those gradients 01:25:39.120 |
plus equals plus equals plus equals plus equals and this will be okay remember because we are 01:25:48.400 |
initializing them at zero so they start at zero and then any contribution that flows backwards 01:25:55.920 |
will simply add so now if we redefine this one because the plus equals this now works 01:26:06.000 |
because a dot grad started at zero and we called b dot backward we deposit one and then we deposit 01:26:11.840 |
one again and now this is two which is correct and here this will also work and we'll get correct 01:26:17.440 |
gradients because when we call e dot backward we will deposit the gradients from this branch 01:26:22.320 |
and then we get to back to d dot backward it will deposit its own gradients and then those 01:26:27.440 |
gradients simply add on top of each other and so we just accumulate those gradients and that fixes the issue 01:26:32.080 |
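Concretely, each _backward closure changes from plain assignment to accumulation, for example in __add__:

```python
def _backward():
    self.grad += 1.0 * out.grad    # was: self.grad = 1.0 * out.grad
    other.grad += 1.0 * out.grad   # was: other.grad = 1.0 * out.grad
out._backward = _backward
```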
okay now before we move on let me actually do a bit of cleanup here and delete 01:26:37.680 |
some of these some of this intermediate work so i'm not going to need any of this now that 01:26:42.880 |
we've derived all of it um we are going to keep this because i want to come back to it 01:26:48.720 |
delete the tanh delete our modeling example delete the step delete this keep the code that draws 01:26:59.680 |
and then delete this example and leave behind only the definition of value 01:27:04.160 |
and now let's come back to this non-linearity here that we implemented the tanh now i told you 01:27:09.920 |
that we could have broken down tanh into its explicit atoms in terms of other expressions 01:27:15.840 |
if we had the exp function so if you remember tanh is defined like this and we chose to develop tanh 01:27:21.760 |
as a single function and we can do that because we know its derivative and we can back propagate 01:27:26.000 |
through it but we can also break down tanh and express it as a function of exp and i would 01:27:31.600 |
like to do that now because i want to prove to you that you get all the same results and all the same 01:27:35.280 |
gradients um but also because it forces us to implement a few more expressions it forces us to 01:27:40.880 |
do exponentiation addition subtraction division and things like that and i think it's a good 01:27:45.760 |
exercise to go through a few more of these okay so let's scroll up to the definition of value 01:27:52.080 |
and here one thing that we currently can't do is take a value of say 2.0 and you know 01:27:59.040 |
here for example we want to add a constant one to it and we can't do something like this 01:28:03.920 |
and we can't do it because it says int object has no attribute data that's because a plus one 01:28:10.160 |
comes right here to add and then other is the integer one and then here python is trying to 01:28:16.160 |
access one dot data and that's not a thing and that's because basically one is not a value object 01:28:21.280 |
and we only have addition for value objects so as a matter of convenience so that we can create 01:28:26.800 |
expressions like this and make them make sense we can simply do something like this 01:28:30.640 |
basically we let other alone if other is an instance of value but if it's not an instance 01:28:38.000 |
of value we're going to assume that it's a number like an integer or float and we're going to simply 01:28:41.840 |
wrap it in in value and then other will just become value of other and then other will have 01:28:46.880 |
a data attribute and this should work so if i just say this redefine value then this should work 01:28:52.560 |
there we go okay and now let's do the exact same thing for multiply because we can't do something 01:28:57.680 |
like this again for the exact same reason so we just have to go to mul and if other is not a value 01:29:05.600 |
then let's wrap it in value let's redefine value and now this works now here's a kind of unfortunate 01:29:12.080 |
and not obvious part a times two works we saw that but two times a is that going to work 01:29:18.320 |
you'd expect it to right but actually it will not and the reason it won't is because python doesn't 01:29:25.120 |
know like when when you do a times two basically um so a times two python will go and it will 01:29:32.000 |
basically do something like a dot mul of two that's basically what it will call but then 01:29:38.000 |
two times a is the same as two dot mul of a and two can't multiply a value and so it's 01:29:45.840 |
really confused about that so instead what happens is in python the way this works is you are free to 01:29:50.800 |
define something called __rmul__ and __rmul__ is kind of like a fallback so if python can't do 01:29:58.800 |
two times a it will check if um if by any chance a knows how to multiply two and that will be called 01:30:06.720 |
into __rmul__ so because python can't do two times a it will check is there an __rmul__ in value and 01:30:13.680 |
because there is it will now call that and what we'll do here is we will swap the order of the 01:30:19.520 |
operands so basically two times a will redirect to __rmul__ and __rmul__ will basically call a times two 01:30:26.000 |
and that's how that will work so redefining that with __rmul__ two times a becomes four 01:30:32.880 |
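A sketch of the two conveniences just described, shown on __mul__ (the same wrapping also goes into __add__):

```python
def __mul__(self, other):
    # wrap raw ints/floats so expressions like `a * 2` work
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward

    return out

def __rmul__(self, other):
    # fallback Python calls for `2 * a`, when int.__mul__ can't handle a Value;
    # just swap the operands
    return self * other
```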
okay now looking at the other elements that we still need we need to know how to exponentiate and how to 01:30:36.320 |
divide so let's first do the exponentiation part we're going to introduce a 01:30:41.920 |
single function exp here and exp is going to mirror tanh in the sense that it's a simple 01:30:48.880 |
single function that transforms a single scalar value and outputs a single scalar value so we pop 01:30:54.000 |
out the python number we use math.exp to exponentiate it create a new value object everything 01:30:59.520 |
that we've seen before the tricky part of course is how do you back propagate through e to the x 01:31:04.800 |
and so here you can potentially pause the video and think about what should go here 01:31:10.000 |
okay so basically we need to know what is the local derivative of e to the x so d by dx of e 01:31:19.520 |
to the x is famously just e to the x and we've already just calculated e to the x and it's inside 01:31:25.520 |
out dot data so we can do out dot data times out dot grad that's the chain rule so we're just 01:31:32.640 |
chaining on to the current running grad and this is what the expression looks like it looks a little 01:31:37.680 |
confusing but this is what it is and that's the exponentiation so redefining we should now be able 01:31:43.600 |
to call a.exp and hopefully the backward pass works as well okay and the last thing we'd like 01:31:49.600 |
to do of course is we'd like to be able to divide now i actually will implement something slightly 01:31:54.880 |
more powerful than division because division is just a special case of something a bit more 01:31:58.960 |
powerful so in particular just by rearranging if we have some kind of a b equals value of 4.0 here 01:32:06.880 |
we'd like to basically be able to do a divide b and we'd like this to be able to give us 0.5 01:32:10.720 |
now division actually can be reshuffled as follows if we have a divide b that's actually 01:32:17.280 |
the same as a multiplying 1 over b and that's the same as a multiplying b to the power of negative 01:32:23.040 |
1 and so what i'd like to do instead is i basically like to implement the operation of x to the k for 01:32:29.440 |
some constant k so it's an integer or a float and we would like to be able to differentiate this and 01:32:36.160 |
then as a special case negative 1 will be division and so i'm doing that just because it's more 01:32:42.960 |
general and yeah you might as well do it that way so basically what i'm saying is we can redefine 01:32:49.760 |
division which we will put here somewhere yeah we can put it here somewhere what i'm saying is that 01:32:56.800 |
we can redefine division so self divide other this can actually be rewritten as self times 01:33:03.040 |
other to the power of negative one and now value raised to the power of negative one we have now 01:33:10.000 |
defined that so here's so we need to implement the pow function where am i going to put the pow 01:33:17.280 |
function maybe here somewhere so this function will be called 01:33:23.840 |
when we try to raise a value to some power and other will be that power now i'd like to make 01:33:29.520 |
sure that other is only an int or a float usually other is some kind of a different value object 01:33:35.360 |
but here other will be forced to be an int or a float otherwise the math won't work for for 01:33:42.320 |
what we're trying to achieve in the specific case that would be a different derivative expression 01:33:46.480 |
if we wanted other to be a value so here we create the output value which is just uh you know this 01:33:52.880 |
data raised to the power of other and other here could be for example negative one that's what we 01:33:57.280 |
are hoping to achieve and then uh this is the backward stub and this is the fun part which is 01:34:03.600 |
what is the uh chain rule expression here for back for um back propagating through the power function 01:34:12.320 |
where the power is to the power of some kind of a constant so this is the exercise and maybe 01:34:17.040 |
pause the video here and see if you can figure it out yourself as to what we should put here 01:34:20.720 |
okay so um you can actually go here and look at derivative rules as an example and we see lots of 01:34:33.360 |
derivative rules that you can hopefully know from calculus in particular what we're looking for is 01:34:37.600 |
the power rule because that's telling us that if we're trying to take d by dx of x to the n which 01:34:42.880 |
is what we're doing here then that is just n times x to the n minus one right okay so that's telling 01:34:52.160 |
us about the local derivative of this power operation so all we want here basically n is now 01:34:59.760 |
other and self.data is x and so this now becomes other which is n times self.data which is now a 01:35:10.800 |
python int or a float it's not a value object we're accessing the data attribute raised to the 01:35:17.600 |
power of other minus one or n minus one i can put brackets around this but this doesn't matter 01:35:23.440 |
because power takes precedence over multiply in python so that would have been okay and that's 01:35:29.840 |
the local derivative only but now we have to chain it and we chain it simply by multiplying by 01:35:34.560 |
out dot grad that's chain rule and this should technically work and we're going to find out soon 01:35:42.240 |
but now if we do this this should now work and we get 0.5 so the forward pass works but does the 01:35:49.760 |
backward pass work and i realized that we actually also have to know how to subtract so right now a 01:35:55.520 |
minus b will not work to make it work we need one more piece of code here and basically this is the 01:36:04.080 |
subtraction and the way we're going to implement subtraction is we're going to implement it by 01:36:09.440 |
addition of a negation and then to implement negation we're going to multiply by negative one 01:36:13.760 |
so just again using the stuff we've already built and just expressing it in terms of what we have 01:36:18.960 |
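For reference, here is roughly what this batch of new operations looks like on Value, following the descriptions above rather than the exact library code:

```python
import math

def exp(self):
    x = self.data
    out = Value(math.exp(x), (self,), 'exp')
    def _backward():
        # d/dx e**x = e**x, which we already have in out.data
        self.grad += out.data * out.grad
    out._backward = _backward
    return out

def __pow__(self, other):
    assert isinstance(other, (int, float)), "only int/float powers supported"
    out = Value(self.data**other, (self,), f'**{other}')
    def _backward():
        # power rule: d/dx x**n = n * x**(n-1)
        self.grad += other * self.data**(other - 1) * out.grad
    out._backward = _backward
    return out

def __truediv__(self, other):   # self / other
    return self * other**-1

def __neg__(self):              # -self
    return self * -1

def __sub__(self, other):       # self - other
    return self + (-other)
```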
and a minus b is now working okay so now let's scroll again to this expression here for this 01:36:23.840 |
neuron and let's just compute the backward pass here once we've defined o and let's draw it so 01:36:32.160 |
here's the gradients for all these leaf nodes for this two-dimensional neuron that has a tanh that 01:36:36.880 |
we've seen before so now what i'd like to do is i'd like to break up this tanh into this expression 01:36:43.520 |
here so let me copy paste this here and now instead of we'll preserve the label and we will 01:36:51.280 |
change how we define o so in particular we're going to implement this formula here so we need 01:36:57.200 |
e to the 2x minus 1 over e to the 2x plus 1 so e to the 2x we need to take 2 times n and we need to 01:37:04.960 |
exponentiate it that's e to the 2x and then because we're using it twice let's create an intermediate 01:37:10.800 |
variable e and then define o as e minus 1 over e plus 1 01:37:21.840 |
and that should be it and then we should be able to draw dot of o so now before i run this what do 01:37:29.120 |
we expect to see number one we're expecting to see a much longer graph here because we've broken up 01:37:35.280 |
tanh into a bunch of other operations but those operations are mathematically equivalent and so 01:37:40.480 |
what we're expecting to see is number one the same result here so the forward pass works and 01:37:46.000 |
number two because of that mathematical equivalence we expect to see the same backward pass and the 01:37:50.800 |
same gradients on these leaf nodes so these gradients should be identical so let's run this 01:37:56.400 |
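The decomposed forward pass just described, written out (assuming n is the pre-activation Value from the notebook):

```python
e = (2 * n).exp()        # e**(2n)
o = (e - 1) / (e + 1)    # tanh(n) expressed with exp, sub and div
o.label = 'o'
```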
so number one let's verify that instead of a single tanh node we have now exp and we have plus 01:38:04.880 |
we have times negative one this is the division and we end up with the same forward pass here 01:38:11.120 |
and then the gradients we have to be careful because they're in slightly different order 01:38:14.800 |
potentially the gradients for w2 x2 should be 0 and 0.5 w2 and x2 are 0 and 0.5 and w1 x1 are 1 01:38:23.440 |
and negative 1.5 1 and negative 1.5 so that means that both our forward passes and backward passes 01:38:30.160 |
were correct because this turned out to be equivalent to tanh before and so the reason 01:38:36.640 |
I wanted to go through this exercise is number one we got to practice a few more operations 01:38:40.960 |
and writing more backwards passes and number two I wanted to illustrate the point that the 01:38:46.320 |
level at which you implement your operations is totally up to you you can implement backward 01:38:52.400 |
passes for tiny expressions like a single individual plus or a single times or you can 01:38:57.040 |
implement them for say tanh which potentially you can see as a composite operation 01:39:02.960 |
because it's made up of all these more atomic operations but really all of this is kind of 01:39:07.280 |
like a fake concept all that matters is we have some kind of inputs and some kind of an output 01:39:11.360 |
and this output is a function of the inputs in some way and as long as you can do forward pass 01:39:15.360 |
and the backward pass of that little operation it doesn't matter what that operation is and how 01:39:21.440 |
composite it is if you can write the local gradients you can chain the gradient and you 01:39:25.840 |
can continue back propagation so the design of what those functions are is completely up to you 01:39:30.800 |
so now I would like to show you how you can do the exact same thing but using a modern deep neural 01:39:36.000 |
network library like for example PyTorch which I've roughly modeled micro grad by and so PyTorch 01:39:44.080 |
is something you would use in production and I'll show you how you can do the exact same thing but 01:39:48.080 |
in PyTorch API so I'm just going to copy paste it in and walk you through it a little bit this is 01:39:52.800 |
what it looks like so we're going to import PyTorch and then we need to define these value 01:40:00.000 |
objects like we have here now micro grad is a scalar valued engine so we only have scalar values 01:40:07.440 |
like 2.0 but in PyTorch everything is based around tensors and like I mentioned tensors are just 01:40:13.120 |
n-dimensional arrays of scalars so that's why things get a little bit more complicated here 01:40:18.640 |
I just need a scalar valued tensor a tensor with just a single element but by default when you work 01:40:24.720 |
with PyTorch you would use more complicated tensors like this so if I import PyTorch 01:40:31.600 |
then I can create tensors like this and this tensor for example is a two by three array of 01:40:40.080 |
scalars in a single compact representation so you can check its shape we see that it's a two by three 01:40:47.120 |
array and so on so this is usually what you would work with in the actual libraries so here I'm 01:40:54.160 |
creating a tensor that has only a single element 2.0 and then I'm casting it to be double because 01:41:03.920 |
Python is by default using double precision for its floating point numbers so I'd like everything 01:41:08.560 |
to be identical by default the data type of these tensors will be float 32 so it's only using a 01:41:15.280 |
single precision float so I'm casting it to double so that we have float 64 just like in Python 01:41:21.600 |
so I'm casting to double and then we get something similar to value of 2.0 the next thing I have to 01:41:28.640 |
do is because these are leaf nodes by default PyTorch assumes that they do not require gradients 01:41:33.760 |
so I need to explicitly say that all of these nodes require gradients okay so this is going 01:41:38.960 |
to construct scalar valued one element tensors make sure that PyTorch knows that they require 01:41:44.880 |
gradients now by default these are set to false by the way because of efficiency reasons because 01:41:50.240 |
usually you would not want gradients for leaf nodes like the inputs to the network and this 01:41:55.760 |
is just trying to be efficient in the most common cases so once we've defined all of our values 01:42:01.440 |
in PyTorch land we can perform arithmetic just like we can here in micrograd land so this would 01:42:06.240 |
just work and then there's a torch.tanh also and what we get back is a tensor again and we can 01:42:13.520 |
just like in micrograd it's got a data attribute and it's got grad attributes so these tensor 01:42:19.200 |
objects just like in micrograd have a dot data and a dot grad and the only difference here is that 01:42:24.800 |
we need to call a dot item because otherwise PyTorch dot item basically takes a single tensor 01:42:33.120 |
of one element and it just returns that element stripping out the tensor so let me just run this 01:42:38.800 |
and hopefully we are going to get this is going to print the forward pass which is 0.707 and this 01:42:45.360 |
will be the gradients which hopefully are 0.5 0 negative 1.5 and 1 so if we just run this 01:42:52.240 |
there we go 0.7 so the forward pass agrees and then 0.5 0 negative 1.5 and 1 01:42:59.680 |
so PyTorch agrees with us and just to show you here basically oh here's a tensor with a single 01:43:06.400 |
element and it's a double and we can call dot item on it to just get the single number out 01:43:13.440 |
so that's what item does and o is a tensor object like i mentioned and it's got a backward function 01:43:19.920 |
just like we've implemented and then all of these also have a dot grad so like x2 for example has 01:43:25.440 |
a grad and it's a tensor and we can pop out the individual number with dot item so basically 01:43:32.880 |
torch can do what we did in micrograd as a special case when your tensors are all single 01:43:38.960 |
element tensors but the big deal with PyTorch is that everything is significantly more efficient 01:43:44.080 |
because we are working with these tensor objects and we can do lots of operations in parallel 01:43:48.800 |
on all of these tensors but otherwise what we've built very much agrees with the API of PyTorch 01:43:55.040 |
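The PyTorch version just walked through looks roughly like this, with the values assumed to match the notebook (x1=2, x2=0, w1=-3, w2=1, b≈6.8814):

```python
import torch

x1 = torch.tensor([2.0]).double()
x2 = torch.tensor([0.0]).double()
w1 = torch.tensor([-3.0]).double()
w2 = torch.tensor([1.0]).double()
b = torch.tensor([6.8813735870195432]).double()
for t in (x1, x2, w1, w2, b):
    t.requires_grad = True        # leaf tensors don't require grad by default

n = x1 * w1 + x2 * w2 + b
o = torch.tanh(n)

print(o.data.item())              # forward pass, about 0.7071
o.backward()
print(x2.grad.item(), w2.grad.item(), x1.grad.item(), w1.grad.item())
# about 0.5, 0.0, -1.5, 1.0
```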
okay so now that we have some machinery to build out pretty complicated mathematical expressions 01:43:59.760 |
we can also start building up neural nets and as I mentioned neural nets are just a specific class 01:44:04.480 |
of mathematical expressions so we're going to start building out a neural net piece by piece 01:44:09.280 |
and eventually we'll build out a two-layer multi-layer perceptron as it's called 01:44:13.920 |
and I'll show you exactly what that means let's start with a single individual neuron we've 01:44:18.080 |
implemented one here but here I'm going to implement one that also subscribes to the 01:44:22.720 |
PyTorch API in how it designs its neural network modules so just like we saw that we can like 01:44:28.800 |
match the API of PyTorch on the autograd side we're going to try to do that on the neural 01:44:34.400 |
network modules so here's class neuron and just for the sake of efficiency I'm going to copy paste 01:44:41.760 |
some sections that are relatively straightforward so the constructor will take number of inputs to 01:44:48.640 |
this neuron which is how many inputs come to a neuron so this one for example has three inputs 01:44:55.120 |
and then it's going to create a weight that is some random number between negative one and one 01:44:59.440 |
for every one of those inputs and a bias that controls the overall trigger happiness of this 01:45:04.640 |
neuron and then we're going to implement a def underscore underscore call of self and x 01:45:12.720 |
some input x and really what we want to do here is w times x plus b where w times x here is 01:45:18.800 |
a dot product specifically now if you haven't seen call let me just return 0.0 here for now 01:45:26.000 |
the way this works now is we can have an x which is say like 2.0 3.0 then we can initialize a neuron 01:45:32.160 |
that is two-dimensional because these are two numbers and then we can feed those two numbers 01:45:36.720 |
into that neuron to get an output and so when you use this notation n of x python will use call 01:45:44.960 |
so currently call just returns 0.0 now we'd like to actually do the forward pass of this neuron 01:45:52.960 |
instead so we're going to do here first is we need to basically multiply all of the elements of 01:45:59.920 |
w with all of the elements of x pairwise we need to multiply them so the first thing we're going 01:46:05.200 |
to do is we're going to zip up self.w and x and in python zip takes two iterators and it creates 01:46:13.280 |
a new iterator that iterates over the tuples of their corresponding entries so for example just 01:46:18.960 |
to show you we can print this list and still return 0.0 here so we see that 01:46:34.880 |
these w's are paired up with the x's w with x and now what we want to do is 01:46:42.640 |
for wi xi in this zip we want to multiply wi times xi and then we want to sum all of that together 01:46:56.960 |
to come up with an activation and add also self.b on top so that's the raw activation and then 01:47:04.160 |
of course we need to pass that through a non-linearity so what we're going to be returning 01:47:07.760 |
is act dot 10h and here's out so now we see that we are getting some outputs and we get a different 01:47:16.080 |
output from a neuron each time because we are initializing different weights and biases and 01:47:21.280 |
then to be a bit more efficient here actually sum by the way takes a second optional parameter 01:47:27.200 |
which is the start and by default the start is zero so these elements of this sum will be added 01:47:34.080 |
on top of zero to begin with but actually we can just start with self.b and then we just have an 01:47:39.280 |
expression like this and then the generator expression here must be parenthesized in python 01:47:49.360 |
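Putting the pieces together, the Neuron class built up in this stretch looks roughly like this (a reconstruction following the narration, not necessarily the exact notebook code):

```python
import random

class Neuron:
    def __init__(self, nin):
        # one random weight per input, plus a bias controlling the trigger happiness
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # w . x + b, then squash through the tanh non-linearity
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()
```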
there we go yep so now we can forward a single neuron next up we're going to define a layer of 01:47:58.400 |
neurons so here we have a schematic for an MLP so we see that in these MLPs each layer this is one layer 01:48:06.320 |
has actually a number of neurons and they're not connected to each other but all of them are fully 01:48:09.760 |
connected to the input so what is a layer of neurons it's just it's just a set of neurons 01:48:14.560 |
evaluated independently so in the interest of time i'm going to do something fairly straightforward 01:48:21.360 |
here it's literally a layer is just a list of neurons and then how many neurons do we have 01:48:30.640 |
we take that as an input argument here how many neurons do you want in your layer number of outputs 01:48:35.120 |
in this layer and so we just initialize completely independent neurons with this given dimensionality 01:48:41.280 |
and when we call on it we just independently evaluate them so now instead of a neuron we can 01:48:48.000 |
make a layer of neurons they are two-dimensional neurons and let's have three of them and now we 01:48:52.720 |
see that we have three independent evaluations of three different neurons right okay and finally 01:48:59.680 |
let's complete this picture and define an entire multi-layer perceptron or MLP and as we can see 01:49:05.200 |
here in an MLP these layers just feed into each other sequentially so let's come here and i'm 01:49:10.480 |
just going to copy the code here in interest of time so an MLP is very similar we're taking the 01:49:17.200 |
number of inputs as before but now instead of taking a single n out which is the number of 01:49:22.240 |
neurons in a single layer we're going to take a list of n outs and this list defines the sizes 01:49:27.440 |
of all the layers that we want in our MLP so here we just put them all together and then iterate 01:49:32.720 |
over consecutive pairs of these sizes and create layer objects for them and then in the call 01:49:38.560 |
function we are just calling them sequentially so that's an MLP really and let's actually 01:49:43.520 |
re-implement this picture so we want three input neurons and then two layers of four and an output 01:49:48.320 |
unit so we want a three-dimensional input say this is an example input we want three inputs into two 01:49:57.680 |
layers of four and one output and this of course is an MLP and there we go that's a forward pass 01:50:05.360 |
of an MLP to make this a little bit nicer you see how we have just a single element but it's wrapped 01:50:10.320 |
in a list because layer always returns lists so for convenience return outs at zero if len outs 01:50:18.320 |
is exactly a single element else return the full list and this will allow us to just get a single value 01:50:24.560 |
out at the last layer that only has a single neuron and finally we should be able to draw dot 01:50:30.000 |
of n of x and as you might imagine these expressions are now getting 01:50:35.440 |
relatively involved so this is an entire MLP that we're defining now 01:50:40.880 |
all the way until a single output okay and so obviously you would never differentiate on pen 01:50:51.920 |
and paper these expressions but with micrograd we will be able to back propagate all the way 01:50:56.720 |
through this and back propagate into these weights of all these neurons so let's see how that works 01:51:04.080 |
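And the Layer and MLP classes from this stretch, again reconstructed from the narration and assuming the Neuron class above:

```python
class Layer:
    def __init__(self, nin, nout):
        # nout independent neurons, each taking nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        # e.g. MLP(3, [4, 4, 1]): 3 inputs, two hidden layers of 4, one output
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# usage matching the lecture's example
x = [2.0, 3.0, -1.0]
n = MLP(3, [4, 4, 1])
n(x)   # a single Value, the output of the whole network
```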
okay so let's create ourselves a very simple example data set here so this data set has four 01:51:09.920 |
examples and so we have four possible inputs into the neural net and we have four desired targets 01:51:17.440 |
so we'd like the neural net to assign or output 1.0 when it's fed this example negative one when 01:51:24.880 |
it's fed these examples and one when it's fed this example so it's a very simple binary classifier 01:51:29.600 |
neural net basically that we would like here now let's think what the neural net currently thinks 01:51:34.320 |
about these four examples we can just get their predictions basically we can just call n of x 01:51:40.160 |
for x and x's and then we can print so these are the outputs of the neural net on those four 01:51:47.680 |
examples so the first one is 0.91 but we'd like it to be 1 so we should push this one higher this 01:51:55.920 |
one we want to be higher this one says 0.88 and we want this to be negative 1 this is 0.88 we 01:52:03.600 |
want it to be negative 1 and this one is 0.88 we want it to be 1 so how do we make the neural net 01:52:10.000 |
and how do we tune the weights to better predict the desired targets and the trick used in deep 01:52:18.080 |
learning to achieve this is to calculate a single number that somehow measures the total performance 01:52:24.000 |
of your neural net and we call this single number the loss so the loss first is a single number that 01:52:32.320 |
we're going to define that basically measures how well the neural net is performing right now we 01:52:36.480 |
have the intuitive sense that it's not performing very well because we're not very much close to 01:52:40.080 |
this so the loss will be high and we'll want to minimize the loss so in particular in this case 01:52:45.760 |
what we're going to do is we're going to implement the mean squared error loss so what this is doing 01:52:50.480 |
is we're going to basically iterate for y ground truth and y output in zip of ys and ypred 01:53:01.120 |
so we're going to pair up the ground truths with the predictions and the zip iterates over 01:53:07.360 |
tuples of them and for each y ground truth and y output we're going to subtract them 01:53:14.240 |
and square them so let's first see what these losses are these are individual loss components 01:53:21.680 |
and so basically for each one of the four we are taking the prediction and the ground truth 01:53:29.440 |
we are subtracting them and squaring them so because this one is so close to its target 01:53:36.240 |
0.91 is almost one subtracting them gives a very small number so here we would get like a negative 01:53:43.360 |
0.1 and then squaring it just makes sure that regardless of whether we are more negative or 01:53:50.160 |
more positive we always get a positive number instead of squaring we could also take 01:53:56.000 |
for example the absolute value we need to discard the sign and so you see that the expression is 01:54:00.880 |
arranged so that you only get zero exactly when y out is equal to y ground truth when those two are 01:54:07.120 |
equal so your prediction is exactly the target you are going to get zero and if your prediction is 01:54:11.920 |
not the target you are going to get some other number so here for example we are way off and so 01:54:17.200 |
that's why the loss is quite high and the more off we are the greater the loss will be so we don't 01:54:25.120 |
want high loss we want low loss and so the final loss here will be just the sum of all of these 01:54:32.720 |
numbers so you see that this should be zero roughly plus zero roughly but plus seven so 01:54:41.200 |
loss should be about seven here and now we want to minimize the loss we want the loss to be low 01:54:49.280 |
because if loss is low then every one of the predictions is equal to its target so the loss 01:54:57.120 |
the lowest it can be is zero and the greater it is the worse off the neural net is predicting 01:55:02.560 |
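The dataset and loss just described, written out (values reconstructed from the lecture's example; note that sum starts from 0, so Value also needs a small __radd__ fallback analogous to __rmul__):

```python
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]   # desired targets

ypred = [n(x) for x in xs]    # current predictions of the net
# squared error per example, summed into a single loss Value
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
```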
so now of course if we do loss.backward something magical happened when i hit enter 01:55:09.760 |
and the magical thing of course that happened is that we can look at n.layers.neuron 01:55:15.280 |
and the layers at say like the first layer and then the neurons at zero 01:55:20.000 |
because remember that mlp has the layers which is a list and each layer has neurons which is a list 01:55:28.720 |
and that gives us an individual neuron and then it's got some weights 01:55:32.000 |
and so we can for example look at the weights at zero 01:55:40.160 |
oops it's not called weights it's called w and that's a value but now this value also has a grad 01:55:48.000 |
because of the backward pass and so we see that because this gradient here on this particular 01:55:54.240 |
weight of this particular neuron of this particular layer is negative we see that 01:55:58.480 |
its influence on the loss is also negative so slightly increasing this particular weight 01:56:03.760 |
of this neuron of this layer would make the loss go down and we actually have this information for 01:56:10.400 |
every single one of our neurons and all their parameters actually it's worth looking at 01:56:14.640 |
also the draw dot of loss by the way so previously we looked at the draw dot of a single neuron 01:56:20.720 |
forward pass and that was already a large expression but what is this expression we actually 01:56:25.920 |
forwarded every one of those four examples and then we have the loss on top of them with the 01:56:31.360 |
mean squared error and so this is a really massive graph because this graph that we've built up now 01:56:38.400 |
oh my gosh this graph that we've built up now which is kind of excessive it's excessive because 01:56:45.440 |
it has four forward passes of a neural net for every one of the examples and then it has the 01:56:50.080 |
loss on top and it ends with the value of the loss which was 7.12 and this loss will now back 01:56:56.240 |
propagate through all the four forward passes all the way through just every single intermediate 01:57:01.520 |
value of the neural net all the way back to of course the parameters of the weights which are 01:57:06.640 |
the input so these weight parameters here are inputs to this neural net and these numbers here 01:57:14.320 |
these scalars are inputs to the neural net so if we went around here we will probably find 01:57:20.640 |
some of these examples this 1.0 potentially maybe this 1.0 or you know some of the others and you'll 01:57:26.480 |
see that they all have gradients as well the thing is these gradients on the input data are not that 01:57:32.160 |
useful to us and that's because the input data seems to be not changeable it's it's a given to 01:57:39.040 |
the problem and so it's a fixed input we're not going to be changing it or messing with it even 01:57:43.120 |
though we do have gradients for it but some of these gradients here will be for the neural network 01:57:50.320 |
parameters the w's and the b's and those of course we do want to change okay so now we're going 01:57:56.640 |
to want some convenience code to gather up all of the parameters of the neural net so that we can 01:58:01.920 |
operate on all of them simultaneously and every one of them we will nudge a tiny amount based on 01:58:09.040 |
the gradient information so let's collect the parameters of the neural net all in one array 01:58:14.800 |
so let's create a parameters of self that just returns self.w which is a list 01:58:21.360 |
concatenated with a list of self.b so this will just return a list list plus list just 01:58:30.640 |
you know gives you a list so that's parameters of neuron and i'm calling it this way because 01:58:36.080 |
also pytorch has a parameters on every single nn module and it does exactly what we're doing here 01:58:42.240 |
it just returns the parameter tensors for us it's the parameter scalars now layer is also a module 01:58:50.320 |
so it will have parameters self and basically what we want to do here is something like this like 01:58:57.680 |
params is here and then for neuron in self.neurons we want to get neuron.parameters 01:59:10.560 |
and we want to params.extend right so these are the parameters of this neuron and then we want 01:59:17.280 |
to put them on top of params so params.extend of ps and then we want to return params 01:59:23.920 |
so this there's way too much code so actually there's a way to simplify this which is return 01:59:33.920 |
p for neuron in self.neurons for p in neuron.parameters 01:59:44.000 |
so it's a single list comprehension in python you can sort of nest them like this and you can 01:59:50.400 |
then create the desired array so these are identical we can take this out 02:00:01.440 |
def parameters self and return p for layer in self.layers for p in layer.parameters 02:00:17.440 |
and that should be good now let me pop out this so we don't re-initialize our network 02:00:31.680 |
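Written out, the parameters methods just described look roughly like this; only the new methods are shown, with __init__ and __call__ as built earlier in the lecture:

```python
class Neuron:
    # ... __init__ and __call__ as built earlier ...
    def parameters(self):
        # list + list concatenation: all the weights plus the bias
        return self.w + [self.b]

class Layer:
    # ... __init__ and __call__ as built earlier ...
    def parameters(self):
        # nested list comprehension: flatten the parameters of every neuron
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    # ... __init__ and __call__ as built earlier ...
    def parameters(self):
        # flatten the parameters of every layer
        return [p for layer in self.layers for p in layer.parameters()]
```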
okay so unfortunately we will have to probably re-initialize the network because 02:00:39.120 |
we just added this functionality to the class and of course i want to get 02:00:43.840 |
all the n.parameters but that's not going to work because this is the old class 02:00:48.240 |
okay so unfortunately we do have to re-initialize the network which will change some of the numbers 02:00:55.680 |
but let me do that so that we pick up the new api we can now do n.parameters 02:00:59.280 |
and these are all the weights and biases inside the entire neural net so in total this mlp has 41 02:01:08.720 |
parameters and now we'll be able to change them if we recalculate the loss here we see that 02:01:18.560 |
unfortunately we have slightly different predictions and slightly different loss 02:01:24.880 |
um but that's okay okay so we see that this neuron's gradient is slightly negative we can 02:01:33.200 |
also look at its data right now which is 0.85 so this is the current value of this neuron and this 02:01:39.840 |
is its gradient on the loss so what we want to do now is we want to iterate for every p in 02:01:47.040 |
n.parameters so for all the 41 parameters of this neural net we actually want to change 02:01:53.360 |
p.data slightly according to the gradient information okay so dot dot dot to do here 02:02:01.360 |
but this will be basically a tiny update in this gradient descent scheme and gradient descent 02:02:09.520 |
we are thinking of the gradient as a vector pointing in the direction of increased loss 02:02:19.120 |
and so in gradient descent we are modifying p.data by a small step size in the direction 02:02:26.480 |
of the gradient so the step size as an example could be like a very small number like 0.01 is 02:02:30.960 |
the step size times p.grad right but we have to think through some of the signs here so 02:02:39.120 |
in particular working with this specific example here we see that if we just left it like this 02:02:47.040 |
then this neuron's value would be currently increased by a tiny amount of the gradient 02:02:52.560 |
the gradient is negative so this value of this neuron would go slightly down it would become 02:02:58.480 |
like 0.84 or something like that but if this neuron's value goes lower that would actually 02:03:07.280 |
increase the loss that's because the derivative of this neuron is negative so increasing 02:03:16.560 |
this makes the loss go down so increasing it is what we want to do instead of decreasing it 02:03:22.880 |
so basically what we're missing here is we're actually missing a negative sign 02:03:25.440 |
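In code, the update being described is just the following; 0.01 is the illustrative step size mentioned above, and the leading minus sign is exactly the point being made:

```python
step_size = 0.01
for p in n.parameters():
    # step against the gradient so that the loss decreases
    p.data += -step_size * p.grad
```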
and again this other interpretation and that's because we want to minimize the loss we don't 02:03:30.880 |
want to maximize the loss we want to decrease it and the other interpretation as i mentioned 02:03:35.600 |
is you can think of the gradient vector so basically just the vector of all the gradients 02:03:40.400 |
as pointing in the direction of increasing the loss but then we want to decrease it so we actually 02:03:46.800 |
want to go in the opposite direction and so you can convince yourself that this sort of like does 02:03:51.360 |
the right thing here with the negative because we want to minimize the loss so if we nudge all the 02:03:56.480 |
parameters by a tiny amount then we'll see that this data will have changed a little bit so now 02:04:04.640 |
this neuron is a tiny amount greater value so 0.854 went to 0.857 and that's a good thing because 02:04:15.600 |
slightly increasing this neuron data makes the loss go down according to the gradient and so 02:04:23.120 |
the correction has happened sign-wise and so now what we would expect of course is that because 02:04:29.360 |
we've changed all these parameters we expect that the loss should have gone down a bit so we want to 02:04:35.680 |
re-evaluate the loss let me basically this is just a data definition that hasn't changed but the 02:04:42.960 |
forward pass here of the network we can recalculate and actually let me do it outside here so that we 02:04:51.920 |
can compare the two loss values so here if i recalculate the loss we'd expect the new loss now 02:04:59.280 |
to be slightly lower than this number so hopefully what we're getting now is a tiny bit lower than 02:05:04.240 |
4.36 okay and remember the way we've arranged this is that low loss means that our predictions 02:05:13.440 |
are matching the targets so our predictions now are probably slightly closer to the targets 02:05:20.240 |
and now all we have to do is we have to iterate this process 02:05:23.440 |
so again we've done the forward pass and this is the loss now we can do loss.backward 02:05:29.280 |
let me take these out and we can do a step size and now we should have a slightly lower loss 02:05:36.320 |
4.36 goes to 3.9 and okay so we've done the forward pass here's the backward pass 02:05:47.520 |
3.47 and you get the idea we just continue doing this and this is gradient descent we're just 02:05:57.520 |
iteratively doing forward pass backward pass update forward pass backward pass update and 02:06:02.880 |
the neural net is improving its predictions so here if we look at y pred now y pred 02:06:12.720 |
we see that this value should be getting closer to one so this value should be getting more positive 02:06:18.400 |
these should be getting more negative and this one should be also getting more positive 02:06:24.640 |
actually we may be able to afford to go a bit faster let's try a slightly higher learning rate 02:06:39.200 |
if you go too fast by the way if you try to make it too big of a step you may actually overstep 02:06:44.400 |
out of overconfidence because again remember we don't actually know the loss function exactly 02:06:51.280 |
the loss function has all kinds of structure and we only know about the very local dependence of 02:06:56.320 |
all these parameters on the loss but if we step too far we may step into you know a part of the 02:07:01.520 |
loss that is completely different and that can destabilize training and make your loss actually 02:07:05.840 |
blow up even so the loss is now 0.04 so actually the predictions should be really quite close let's 02:07:13.600 |
take a look so you see how this is almost one almost negative one almost one we can continue 02:07:20.080 |
going so yep backward update oops there we go so we went way too fast and we actually overstepped 02:07:31.440 |
so we got too eager where are we now oops okay 7e-9 so this is very very low loss 02:07:40.080 |
and the predictions are basically perfect so somehow we basically we were doing way too big 02:07:48.880 |
updates and we briefly exploded but then somehow we ended up getting into a really good spot 02:07:52.880 |
so usually this learning rate and the tuning of it is a subtle art you want to set your learning 02:07:58.640 |
rate if it's too low you're going to take way too long to converge but if it's too high the 02:08:03.280 |
whole thing gets unstable and you might actually even explode the loss depending on your loss 02:08:07.760 |
function so finding the step size to be just right is a pretty subtle art sometimes 02:08:13.040 |
when you're using sort of vanilla gradient descent but we happen to get into a good spot 02:08:17.680 |
we can look at n dot parameters so this is the setting of weights and biases that makes our 02:08:27.520 |
network predict the desired targets very very close and basically we've successfully trained 02:08:37.120 |
a neural net okay let's make this a tiny bit more respectable and implement an actual training loop 02:08:42.240 |
and what that looks like so this is the data definition that stays this is the forward pass 02:08:48.320 |
so for k in range you know we're going to take a bunch of steps 02:08:55.040 |
first you do the forward pass then we evaluate the loss 02:09:01.040 |
let's reinitialize the neural net from scratch and here's the data 02:09:06.960 |
and we first do forward pass then we do the backward pass 02:09:12.480 |
and then we do an update that's gradient descent 02:09:21.680 |
and then we should be able to iterate this and we should be able to print the current step 02:09:29.600 |
the current loss let's just print the sort of number of the loss and that should be it 02:09:40.480 |
and then the learning rate 0.01 is a little too small 0.1 we saw is like a little bit dangerous 02:09:44.960 |
so let's go somewhere in between and we'll optimize this for not 10 steps but let's go for 02:09:51.680 |
say 20 steps let me erase all of this junk and let's run the optimization 02:10:00.800 |
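As a sketch, the loop being assembled here looks roughly like this; the 0.05 step size is an assumed in-between value, and note that, as discussed just below, this version is still missing the zero-grad step:

```python
for k in range(20):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # backward pass
    loss.backward()

    # update (gradient descent)
    for p in n.parameters():
        p.data += -0.05 * p.grad

    print(k, loss.data)
```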
and you see how we've actually converged slower in a more controlled manner and got to a loss that 02:10:09.280 |
is very low so i expect y_pred to be quite good there we go 02:10:16.240 |
and that's it okay so this is kind of embarrassing but we actually have a really terrible bug 02:10:27.600 |
in here and it's a subtle bug and it's a very common bug and i can't believe i've done it for 02:10:33.920 |
the 20th time in my life especially on camera and i could have re-shot the whole thing but i think 02:10:39.600 |
it's pretty funny and you know you get to appreciate a bit what working with neural 02:10:44.720 |
nets maybe is like sometimes we are guilty of a common bug i've actually tweeted the most common 02:10:53.520 |
neural net mistakes a long time ago now and i'm not really gonna explain any of these except for 02:11:01.520 |
we are guilty of number three you forgot to zero grad before dot backward what is that 02:11:06.800 |
basically what's happening and it's a subtle bug and i'm not sure if you saw it 02:11:12.160 |
is that all of these weights here have a dot data and a dot grad and the dot grad starts at zero 02:11:21.120 |
and then we do backward and we fill in the gradients and then we do an update on the data 02:11:27.520 |
but we don't flush the grad it stays there so when we do the second forward pass and we do 02:11:34.400 |
backward again remember that all the backward operations do a plus equals on the grad and so 02:11:39.760 |
these gradients just add up and they never get reset to zero so basically we didn't zero grad 02:11:47.280 |
so here's how we zero grad before backward we need to iterate over all the parameters 02:11:54.160 |
and we need to make sure that p dot grad is set to zero we need to reset it to zero just like it is 02:12:01.040 |
in the constructor so remember all the way here for all these value nodes grad is reset to zero 02:12:07.360 |
and then all these backward passes do a plus equals from that grad but we need to make sure 02:12:12.240 |
that we reset these grads to zero so that when we do backward all of them start at zero and the 02:12:18.640 |
actual backward pass accumulates the loss derivatives into the grads so this is zero grad 02:12:26.960 |
in pytorch and we will get a slightly different optimization let's reset the 02:12:33.760 |
neural net the data is the same this is now i think correct and we get 02:12:40.240 |
a much slower descent we still end up with pretty good results and we can continue this 02:12:47.280 |
a bit more to get down lower and lower and lower yeah so the only reason that the previous thing 02:12:57.600 |
worked even though it's extremely buggy is that this is a very very simple problem 02:13:04.880 |
and it's very easy for this neural net to fit this data and so the grads ended up accumulating 02:13:12.000 |
and it effectively gave us a massive step size and it made us converge extremely fast 02:13:16.880 |
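For completeness, a sketch of the corrected loop with the zero-grad step in place (the 0.05 step size is again just an assumed value):

```python
for k in range(20):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # zero grad: reset accumulated gradients before backward,
    # because every backward pass does += on p.grad
    for p in n.parameters():
        p.grad = 0.0

    # backward pass
    loss.backward()

    # update
    for p in n.parameters():
        p.data += -0.05 * p.grad

    print(k, loss.data)
```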
but basically now we have to do more steps to get to very low values of loss 02:13:23.600 |
and get y pred to be really good we can try to take a slightly bigger step 02:13:29.040 |
yeah we're gonna get closer and closer to one minus one and one so working with neural nets 02:13:40.640 |
is sometimes tricky because you may have lots of bugs in the code and your network might actually 02:13:49.280 |
work just like ours worked but chances are that if we had a more complex problem then actually 02:13:55.200 |
this bug would have made us not optimize the loss very well and we were only able to get away with 02:13:59.520 |
it because the problem is very simple so let's now bring everything together and summarize what 02:14:05.680 |
we learned what are neural nets neural nets are these mathematical expressions fairly simple 02:14:12.000 |
mathematical expressions in the case of a multilayer perceptron that take the data as input and they 02:14:18.320 |
also take as input the weights the parameters of the neural net it's a mathematical expression for the forward 02:14:23.360 |
pass followed by a loss function and the loss function tries to measure the accuracy of the 02:14:28.320 |
predictions and usually the loss will be low when your predictions are matching your targets or 02:14:33.440 |
when the network is basically behaving well so we manipulate the loss function so that when 02:14:38.960 |
the loss is low the network is doing what you want it to do on your problem and then we backward the 02:14:45.440 |
loss use back propagation to get the gradient and then we know how to tune all the parameters to 02:14:50.800 |
decrease the loss locally but then we have to iterate that process many times in what's called 02:14:55.200 |
the gradient descent so we simply follow the gradient information and that minimizes the loss 02:15:00.960 |
and the loss is arranged so that when the loss is minimized the network is doing what you want it to 02:15:05.120 |
do and yeah so we just have a blob of neural stuff and we can make it do arbitrary things 02:15:12.640 |
and that's what gives neural nets their power it's you know this is a very tiny network with 02:15:17.120 |
41 parameters but you can build significantly more complicated neural nets with billions at 02:15:24.160 |
this point almost trillions of parameters and it's a massive blob of neural tissue simulated neural 02:15:29.840 |
tissue roughly speaking and you can make it do extremely complex problems and these neural nets 02:15:36.560 |
then have all kinds of very fascinating emergent properties when you try to make them do 02:15:42.800 |
significantly hard problems as in the case of gpt for example we have massive amounts of text from 02:15:49.680 |
the internet and we're trying to get a neural net to take like a few words and try to 02:15:54.400 |
predict the next word in a sequence that's the learning problem and it turns out that when you 02:15:58.480 |
train this on all of internet the neural net actually has like really remarkable emergent 02:16:02.880 |
properties but that neural net would have hundreds of billions of parameters but it works on 02:16:08.400 |
fundamentally the exact same principles the neural net of course will be a bit more complex but 02:16:13.360 |
otherwise evaluating the gradient is there and will be identical and the gradient descent would 02:16:13.360 |
be there and would be basically identical but people usually use slightly different updates 02:16:24.720 |
this is a very simple stochastic gradient descent update and the loss function would not be a mean 02:16:30.640 |
squared error they would be using something called the cross entropy loss for predicting the next 02:16:34.880 |
token so there's a few more details but fundamentally the neural network setup and 02:16:38.560 |
neural network training is identical and pervasive and now you understand intuitively how that works 02:16:44.800 |
under the hood in the beginning of this video i told you that by the end of it you would understand 02:16:48.800 |
everything in micrograd and then we would slowly build it up let me briefly prove that to you 02:16:54.000 |
so i'm going to step through all the code that is in micrograd as of today actually potentially some 02:16:58.640 |
of the code will change by the time you watch this video because i intend to continue developing 02:17:02.160 |
micrograd but let's look at what we have so far at least __init__.py is empty when you go to 02:17:07.680 |
engine.py that has the value class everything here you should mostly recognize so we have the .data and .grad 02:17:13.600 |
attributes we have the backward function we have the previous set of children and the operation 02:17:18.320 |
that produced this value we have addition multiplication and raising to a scalar power 02:17:24.320 |
we have the relu non-linearity which is slightly different type of non-linearity than tanh that 02:17:29.200 |
we used in this video both of them are non-linearities and notably tanh is not actually 02:17:34.160 |
present in micrograd as of right now but i intend to add it later we have the backward which is 02:17:39.680 |
identical and then all of these other operations which are built up on top of operations here 02:17:45.440 |
so value should be very recognizable except for the non-linearity used in this video 02:17:49.040 |
there's no massive difference between relu and tanh and sigmoid and these other non-linearities 02:17:55.200 |
they're all roughly equivalent and can be used in MLPs so i use tanh because it's a bit smoother 02:18:00.240 |
and because it's a little bit more complicated than relu and therefore it stressed a little 02:18:04.480 |
bit more the local gradients and working with those derivatives which i thought would be useful 02:18:09.440 |
and then nn.py is the neural networks library as i mentioned so you should recognize 02:18:15.120 |
identical implementation of neuron layer and MLP notably or not so much we have a class module here 02:18:21.840 |
it's a parent class of all these modules i did that because there's an nn.module class 02:18:27.120 |
in pytorch and so this exactly matches that api and nn.module in pytorch has also a zero grad 02:18:32.720 |
which i refactored out here so that's the end of micrograd really then there's a test which you'll 02:18:40.720 |
see basically creates two chunks of code one in micrograd and one in pytorch and we'll make sure 02:18:47.520 |
that the forward and the backward passes agree identically for a slightly less complicated 02:18:51.920 |
expression a slightly more complicated expression everything agrees so we agree with pytorch and all 02:18:57.360 |
of these operations and finally there's a demo.ipynb here and it's a bit more complicated 02:19:02.880 |
binary classification demo than the one i covered in this lecture so we only had a tiny data set 02:19:07.680 |
of four examples here we have a bit more complicated example with lots of blue points and lots of red 02:19:13.680 |
points and we're trying to again build a binary classifier to distinguish two-dimensional points 02:19:18.800 |
as red or blue it's a bit more complicated MLP here it's a bigger MLP and the loss is a bit 02:19:25.440 |
more complicated because it supports batches so because our data set was so tiny we always did a 02:19:31.680 |
forward pass on the entire data set of four examples but when your data set is like a million 02:19:36.400 |
examples what we usually do in practice is we basically pick out some random subset we call 02:19:42.000 |
that a batch and then we only process the batch forward backward and update so we don't have to 02:19:47.120 |
forward the entire training set so this supports batching because there's a lot more examples here 02:19:52.640 |
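A sketch of the kind of batching being described, assuming the dataset lives in numpy arrays X and y; the names and the batch size are illustrative and the demo notebook's exact code may differ:

```python
import numpy as np

batch_size = 32
# pick a random subset of row indices and slice out that mini-batch
ri = np.random.permutation(X.shape[0])[:batch_size]
Xb, yb = X[ri], y[ri]
# the forward pass, backward pass, and update then run only on Xb, yb
```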
we do a forward pass the loss is slightly different this is a max margin loss that i 02:19:58.720 |
implement here the one that we used was the mean squared error loss because it's the simplest one 02:20:04.560 |
there's also the binary cross-entropy loss all of them can be used for binary classification 02:20:09.040 |
and don't make too much of a difference in the simple examples that we looked at so far 02:20:12.720 |
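For intuition, a hinge-style max-margin loss for labels in {-1, +1} can be sketched like this, using the relu op from engine.py; the scores and yb names are assumptions and the demo's exact expression may differ:

```python
# zero loss when a score has the right sign with margin >= 1,
# and a linearly growing loss the more the margin is violated
losses = [(1 + -yi * scorei).relu() for yi, scorei in zip(yb, scores)]
data_loss = sum(losses) * (1.0 / len(losses))
```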
there's something called L2 regularization used here this has to do with generalization of the 02:20:18.640 |
neural net and controls the overfitting in machine learning setting but i did not cover these concepts 02:20:23.920 |
in this video potentially later and the training loop you should recognize so forward 02:20:29.520 |
backward with zero grad and update and so on you'll notice that in the update here the learning rate 02:20:36.720 |
is scaled as a function of number of iterations and it shrinks and this is something called 02:20:42.720 |
learning rate decay so in the beginning you have a high learning rate and as the network 02:20:47.040 |
sort of stabilizes near the end you bring down the learning rate to capture the fine details at the end 02:20:52.000 |
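One simple way to write such a schedule, just to show the shape of the idea (the exact constants in the demo may differ):

```python
num_steps = 100
for k in range(num_steps):
    # ... forward pass, zero grad, backward pass ...

    # learning rate decay: start large and shrink linearly with the step number
    learning_rate = 1.0 - 0.9 * k / num_steps
    for p in model.parameters():
        p.data -= learning_rate * p.grad
```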
and at the end we see the decision surface of the neural net and we see that it learns to 02:20:57.440 |
separate out the red and the blue area based on the data points so that's the slightly more 02:21:02.880 |
complicated example in the demo notebook demo.ipynb that you're free to go over but yeah as of today 02:21:08.720 |
that is micrograd i also wanted to show you a little bit of real stuff so that you get to see 02:21:13.040 |
how this is actually implemented in a production grade library like pytorch so in particular i 02:21:17.760 |
wanted to find and show you the backward pass for tanh in pytorch so here in 02:21:23.520 |
micrograd we see that the backward pass for tanh is 1 minus t squared where t is the output of the tanh 02:21:30.560 |
of x times out.grad which is the chain rule so we're looking for something that looks like this 02:21:37.360 |
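As a reminder, this is roughly the tanh method on Value built earlier in the lecture; the (1 - t**2) * out.grad line in its backward is what we are hunting for in the PyTorch codebase:

```python
import math

def tanh(self):
    x = self.data
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        # local derivative of tanh is (1 - t**2), chained with out.grad
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward

    return out
```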
now i went to pytorch which has an open source github codebase and i looked through a lot of 02:21:45.760 |
its code and honestly i spent about 15 minutes and i couldn't find tanh and that's because these 02:21:52.800 |
libraries unfortunately they grow in size and entropy and if you just search for tanh you get 02:21:58.320 |
apparently 2,800 results in 406 files so i don't know what these files are doing honestly 02:22:05.760 |
and why there are so many mentions of tanh but unfortunately these libraries are quite complex 02:22:11.920 |
they're meant to be used not really inspected eventually i did stumble on someone who tries to 02:22:19.840 |
change the tanh backward code for some reason and someone here pointed to the cpu kernel and the 02:22:24.880 |
cuda kernel for tanh backward so this basically depends on whether you're using pytorch on a cpu device 02:22:31.440 |
or on a gpu which are different devices and i haven't covered this but this is the tanh 02:22:36.080 |
backward kernel for cpu and the reason it's so large is that number one this is like if you're 02:22:45.200 |
using a complex type which we haven't even talked about if you're using a specific data type of 02:22:49.280 |
bfloat16 which we haven't talked about and then if you're not then this is the kernel and deep here 02:22:56.560 |
we see something that resembles our backward pass so they have a times 1 minus b squared so this 02:23:04.720 |
b here must be the output of the tanh and this is the out.grad so here we found it deep inside 02:23:14.080 |
pytorch in this location for some reason inside a binary ops kernel even though tanh is not actually a binary 02:23:19.600 |
op and then this is the gpu kernel we're not complex we're here and here we go with one 02:23:29.280 |
line of code so we did find it but basically unfortunately these code bases are very large and 02:23:35.440 |
micrograd is very very simple but if you actually want to use real stuff 02:23:40.400 |
finding the code for it can actually be difficult i also wanted to show you a little 02:23:45.840 |
example here where pytorch is showing you how you can register a new type of function that you want 02:23:50.560 |
to add to pytorch as a lego building block so here if you want to for example add a legendre polynomial 02:23:57.280 |
3 here's how you could do it you will register it as a class that subclasses torch.autograd.Function 02:24:06.320 |
and then you have to tell pytorch how to forward your new function and how to backward through it 02:24:12.080 |
so as long as you can do the forward pass of this little function piece that you want to add 02:24:16.240 |
and as long as you know the local derivative the local gradients which are implemented in the 02:24:20.480 |
backward pytorch will be able to back propagate through your function and then you can use this 02:24:24.960 |
as a lego block in a larger lego castle of all the different lego blocks that pytorch already has 02:24:30.320 |
and so that's the only thing you have to tell pytorch and everything would just work 02:24:34.240 |
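A sketch of what that registration looks like, along the lines of the Legendre polynomial example PyTorch shows (the code below is illustrative rather than the exact tutorial code):

```python
import torch

class LegendrePolynomial3(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        # stash the input so the backward pass can use it
        ctx.save_for_backward(input)
        # P3(x) = 0.5 * (5x^3 - 3x)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        # local derivative P3'(x) = 1.5 * (5x^2 - 1), chained with grad_output
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)

# usage: apply it like any other op and autograd handles the rest
x = torch.linspace(-1, 1, 5, requires_grad=True)
y = LegendrePolynomial3.apply(x)
y.sum().backward()
```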
and you can register new types of functions in this way following this example and that is 02:24:39.360 |
everything that i wanted to cover in this lecture so i hope you enjoyed building out micrograd with 02:24:43.440 |
me i hope you find it interesting insightful and yeah i will post a lot of the links that are 02:24:50.560 |
related to this video in the video description below i will also probably post a link to a 02:24:55.280 |
discussion forum or discussion group where you can ask questions related to this video and then i can 02:25:00.800 |
answer or someone else can answer your questions and i may also do a follow-up video that answers 02:25:05.760 |
some of the most common questions but for now that's it i hope you enjoyed it if you did then 02:25:11.360 |
please like and subscribe so that youtube knows to feature this video to more people 02:25:22.480 |
now here's the problem we know dl by wait what is the problem 02:25:29.520 |
and that's everything i wanted to cover in this lecture so i hope 02:25:34.960 |
you enjoyed us building up micrograd micrograd 02:25:39.120 |
okay now let's do the exact same thing for multiply because we can't do something like 02:25:45.520 |
a times two whoops i know what happened there