The spelled-out intro to neural networks and backpropagation: building micrograd
Chapters
0:00 intro
0:25 micrograd overview
8:08 derivative of a simple function with one input
14:12 derivative of a function with multiple inputs
19:09 starting the core Value object of micrograd and its visualization
32:10 manual backpropagation example #1: simple expression
51:10 preview of a single optimization step
52:52 manual backpropagation example #2: a neuron
69:02 implementing the backward function for each operation
77:32 implementing the backward function for a whole expression graph
82:28 fixing a backprop bug when one node is used multiple times
87:05 breaking up a tanh, exercising with more operations
99:31 doing the same thing but in PyTorch: comparison
103:55 building out a neural net library (multi-layer perceptron) in micrograd
111:04 creating a tiny dataset, writing the loss function
117:56 collecting all of the parameters of the neural net
121:12 doing gradient descent optimization manually, training the network
134:03 summary of what we learned, how to go towards modern neural nets
136:46 walkthrough of the full code of micrograd on github
141:10 real stuff: diving into PyTorch, finding their backward pass for tanh
144:39 conclusion
145:20 outtakes :)
00:00:00.000 |
Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade 00:00:04.800 |
and in this lecture I'd like to show you what neural network training looks like under the hood. 00:00:09.600 |
So in particular we are going to start with a blank Jupyter notebook and by the end of this 00:00:14.000 |
lecture we will define and train a neural net and you'll get to see everything that goes on 00:00:18.560 |
under the hood and exactly sort of how that works on an intuitive level. Now specifically what I 00:00:23.520 |
would like to do is I would like to take you through building of micrograd. Now micrograd 00:00:29.200 |
is this library that I released on github about two years ago but at the time I only uploaded 00:00:33.840 |
the source code and you'd have to go in by yourself and really figure out how it works. 00:00:38.480 |
So in this lecture I will take you through it step by step and kind of comment on all the 00:00:42.960 |
pieces of it. So what is micrograd and why is it interesting? 00:00:47.600 |
Micrograd is basically an autograd engine. Autograd is short for automatic gradient and 00:00:54.720 |
really what it does is it implements back propagation. Now back propagation is this 00:00:58.640 |
algorithm that allows you to efficiently evaluate the gradient of some kind of a loss function 00:01:04.960 |
with respect to the weights of a neural network and what that allows us to do then is we can 00:01:10.000 |
iteratively tune the weights of that neural network to minimize the loss function and 00:01:13.920 |
therefore improve the accuracy of the network. So back propagation would be at the mathematical 00:01:18.960 |
core of any modern deep neural network library like say pytorch or jax. So the functionality 00:01:24.800 |
of micrograd is I think best illustrated by an example. So if we just scroll down here 00:01:28.720 |
you'll see that micrograd basically allows you to build out mathematical expressions 00:01:33.520 |
and here what we are doing is we have an expression that we're building out where 00:01:38.000 |
you have two inputs a and b and you'll see that a and b are negative four and two but we are 00:01:44.880 |
wrapping those values into this value object that we are going to build out as part of micrograd. 00:01:50.480 |
So this value object will wrap the numbers themselves and then we are going to build 00:01:56.080 |
out a mathematical expression here where a and b are transformed into c d and eventually e f and g 00:02:03.120 |
and I'm showing some of the functionality of micrograd and the operations that it supports. 00:02:08.800 |
So you can add two value objects, you can multiply them, you can raise them to a constant power, 00:02:14.480 |
you can offset by one, negate, squash at zero, square, divide by a constant, divide by it, 00:02:21.520 |
etc. And so we're building out an expression graph with these two inputs a and b and we're 00:02:27.760 |
creating an output value of g and micrograd will in the background build out this entire 00:02:33.600 |
mathematical expression. So it will for example know that c is also a value, c was a result of 00:02:39.680 |
an addition operation and the child nodes of c are a and b, and it will maintain 00:02:47.360 |
pointers to a and b value objects so we'll basically know exactly how all of this is laid out. 00:02:52.480 |
And then not only can we do what we call the forward pass where we actually look at the value 00:02:57.520 |
of g of course that's pretty straightforward we will access that using the dot data attribute 00:03:03.120 |
and so the output of the forward pass the value of g is 24.7 it turns out but the big deal is 00:03:10.160 |
that we can also take this g value object and we can call dot backward and this will basically 00:03:16.240 |
initialize backpropagation at the node g. And what backpropagation is going to do is it's going to 00:03:21.920 |
start at g and it's going to go backwards through that expression graph and it's going to recursively 00:03:27.440 |
apply the chain rule from calculus and what that allows us to do then is we're going to evaluate 00:03:33.440 |
basically the derivative of g with respect to all the internal nodes like e d and c but also 00:03:40.400 |
with respect to the inputs a and b and then we can actually query this derivative of g with respect 00:03:47.200 |
to a for example that's a dot grad in this case it happens to be 138 and the derivative of g with 00:03:52.880 |
respect to b which also happens to be here 645 and this derivative we'll see soon is very important 00:04:00.000 |
information because it's telling us how a and b are affecting g through this mathematical expression 00:04:06.720 |
so in particular a dot grad is 138 so if we slightly nudge a and make it slightly larger 00:04:13.920 |
138 is telling us that g will grow and the slope of that growth is going to be 138 00:04:20.720 |
and the slope of growth of b is going to be 645 so that's going to tell us about how g will respond 00:04:26.800 |
if a and b get tweaked a tiny amount in a positive direction okay um now you might be confused about 00:04:34.720 |
what this expression is that we built out here and this expression by the way is completely meaningless 00:04:39.680 |
i just made it up i'm just flexing about the kinds of operations that are supported by micrograd 00:04:44.800 |
and what we actually really care about are neural networks but it turns out that neural networks 00:04:48.720 |
are just mathematical expressions just like this one, but actually a bit less crazy even 00:04:53.680 |
neural networks are just a mathematical expression they take the input data as an input and they take 00:05:00.080 |
the weights of a neural network as an input and it's a mathematical expression and the output 00:05:04.560 |
are your predictions of your neural net or the loss function we'll see this in a bit but basically 00:05:09.600 |
neural networks just happen to be a certain class of mathematical expressions but back propagation 00:05:14.800 |
is actually significantly more general it doesn't actually care about neural networks at all it only 00:05:19.440 |
tells us about arbitrary mathematical expressions and then we happen to use that machinery for 00:05:24.480 |
training of neural networks now one more note i would like to make at this stage is that as you 00:05:28.960 |
see here micrograd is a scalar valued autograd engine so it's working on the you know level of 00:05:34.320 |
individual scalars like negative four and two and we're taking neural nets and we're breaking them 00:05:38.320 |
down all the way to these atoms of individual scalars and all the little pluses and times and 00:05:42.960 |
it's just excessive and so obviously you would never be doing any of this in production it's 00:05:48.000 |
really just done for pedagogical reasons because it allows us to not have to deal with 00:05:52.000 |
these n-dimensional tensors that you would use in modern deep neural network library so this is 00:05:57.280 |
really done so that you understand and refactor out back propagation and chain rule and understanding 00:06:03.200 |
of neural training and then if you actually want to train bigger networks you have to be using these 00:06:08.320 |
tensors but none of the math changes this is done purely for efficiency we are basically taking 00:06:13.120 |
scalar value all the scalar values we're packaging them up into tensors which are just arrays of 00:06:18.240 |
these scalars and then because we have these large arrays we're making operations on those large 00:06:23.440 |
arrays that allows us to take advantage of the parallelism in a computer and all those operations 00:06:29.040 |
can be done in parallel and then the whole thing runs faster but really none of the math changes 00:06:33.520 |
and it's done purely for efficiency so i don't think that it's pedagogically useful to be dealing 00:06:37.760 |
with tensors from scratch and that's why i fundamentally wrote micrograd because you 00:06:42.640 |
can understand how things work at the fundamental level and then you can speed it up later okay so 00:06:48.320 |
here's the fun part my claim is that micrograd is what you need to train your own networks and 00:06:52.800 |
everything else is just efficiency so you'd think that micrograd would be a very complex piece of 00:06:57.280 |
code and that turns out to not be the case so if we just go to micrograd and you'll see that there's 00:07:04.480 |
only two files here in micrograd this is the actual engine it doesn't know anything about 00:07:09.280 |
neural nets and this is the entire neural nets library on top of micrograd so engine and nn.py 00:07:16.560 |
so the actual back propagation autograd engine that gives you the power of neural networks 00:07:23.280 |
is literally 100 lines of code of like very simple python which we'll understand by the end of this 00:07:31.120 |
lecture and then nn.py this neural network library built on top of the autograd engine 00:07:37.040 |
is like a joke it's like we have to define what is a neuron and then we have to define what is a layer 00:07:43.920 |
of neurons and then we define what is a multilayer perceptron which is just a sequence of layers of 00:07:48.800 |
neurons and so it's just a total joke so basically there's a lot of power that comes from only 150 00:07:56.400 |
lines of code and that's all you need to understand neural network training and everything 00:08:01.200 |
else is just efficiency and of course there's a lot to efficiency but fundamentally that's all 00:08:06.880 |
that's happening okay so now let's dive right in and implement micrograd step by step the first 00:08:11.600 |
thing i'd like to do is i'd like to make sure that you have a very good understanding intuitively of 00:08:15.600 |
what a derivative is and exactly what information it gives you so let's start with some basic 00:08:21.280 |
imports that i copy paste in every jupyter notebook always and let's define a function 00:08:26.880 |
a scalar valued function f of x as follows so i just made this up randomly i just wanted a 00:08:33.520 |
scalar valued function that takes a single scalar x and returns a single scalar y and we can call 00:08:39.280 |
this function of course so we can pass in, say, 3.0 and get 20 back now we can also plot this function 00:08:45.520 |
to get a sense of its shape you can tell from the mathematical expression that this is probably a 00:08:50.000 |
parabola it's a quadratic and so if we just create a set of scalar values that we can feed in 00:08:59.280 |
using for example np.arange from negative 5 to 5 in steps of 0.25 so x is just from 00:09:06.320 |
negative 5 to 5 not including 5 in steps of 0.25 and we can actually call this function on this 00:09:13.120 |
numpy array as well so we get a set of y's if we call f on x's and these y's are basically also 00:09:20.800 |
applying the function on every one of these elements independently and we can plot this 00:09:26.400 |
using matplotlib so plt.plot x's and y's and we get a nice parabola so previously here we fed in 00:09:33.760 |
3.0 somewhere here and we received 20 back which is here the y coordinate so now i'd like to think 00:09:39.840 |
through what is the derivative of this function at any single input point x right so what is the 00:09:46.160 |
derivative at different points x of this function now if you remember back to your calculus class 00:09:51.360 |
you've probably derived derivatives so we take this mathematical expression 3x squared minus 4x 00:09:56.400 |
plus 5 and you would write out on a piece of paper and you would you know apply the product rule and 00:10:00.560 |
all the other rules and derive the mathematical expression of the derivative of the 00:10:05.040 |
original function and then you could plug in different x's and see what the derivative is 00:10:08.720 |
we're not going to actually do that because no one in neural networks actually writes out the 00:10:14.800 |
expression for the neural net it would be a massive expression it would be you know thousands tens of 00:10:19.600 |
thousands of terms no one actually derives the derivative of course and so we're not going to 00:10:24.240 |
take this kind of like symbolic approach instead what i'd like to do is i'd like to look at the 00:10:28.160 |
definition of derivative and just make sure that we really understand what the derivative is 00:10:31.920 |
measuring what it's telling you about the function and so if we just look up derivative 00:10:36.720 |
we see that um okay so this is not a very good definition of derivative this is a definition 00:10:46.560 |
of what it means to be differentiable but if you remember from your calculus it is the limit as h 00:10:51.120 |
goes to zero of f of x plus h minus f of x over h so basically what it's saying is if you slightly 00:10:58.960 |
bump up you're at some point x that you're interested in or a and if you slightly bump up 00:11:04.000 |
you know you slightly increase it by a small number h how does the function respond with what 00:11:09.760 |
sensitivity does it respond where is the slope at that point does the function go up or does it go 00:11:14.480 |
down and by how much and that's the slope of that function the the slope of that response at that 00:11:20.880 |
point and so we can basically evaluate the derivative here numerically by taking a very 00:11:26.880 |
small h of course the definition would ask us to take h to zero we're just going to pick a very 00:11:31.760 |
small h, 0.001, and let's say we're interested in x = 3.0 so we can look at f of x which of course is 20 00:11:38.240 |
and now f of x plus h so if we slightly nudge x in a positive direction how is the function going to 00:11:44.560 |
respond and just looking at this do you expect f of x plus h to be slightly greater 00:11:50.320 |
than 20 or do you expect it to be slightly lower than 20 and since this 3 is here and this is 20 00:11:57.040 |
if we slightly go positively the function will respond positively so you'd expect this to be 00:12:02.160 |
slightly greater than 20 and now by how much is telling you the sort of the the strength of that 00:12:08.640 |
slope right the the size of the slope so f of x plus h minus f of x this is how much the function 00:12:14.720 |
responded in the positive direction and we have to normalize by the run so we have the rise over 00:12:21.360 |
run to get the slope so this of course is just a numerical approximation of the slope because we 00:12:27.440 |
have to make h very very small to converge to the exact amount now if i'm doing too many zeros 00:12:34.560 |
at some point i'm going to get an incorrect answer because we're using floating point arithmetic 00:12:40.400 |
and the representations of all these numbers in computer memory is finite and at some point we're 00:12:45.280 |
getting into trouble so we can converge towards the right answer with this approach but basically 00:12:50.960 |
at 3 the slope is 14 and you can see that by taking 3x squared minus 4x plus 5 and differentiating it 00:12:59.200 |
in our head so the derivative of 3x squared minus 4x plus 5 is 6x minus 4 and then we plug in x equals 3 so that's 18 minus 4 00:13:08.000 |
is 14 so this is correct so that's at 3 now how about the slope at say negative 3 would you expect 00:13:18.480 |
what would you expect for the slope now telling the exact value is really hard but what is the 00:13:23.200 |
sign of that slope so at negative 3 if we slightly go in the positive direction at x the function 00:13:30.400 |
would actually go down and so that tells you that the slope would be negative so we'll get a 00:13:34.560 |
number slightly below 20 and so if we take the slope we expect something negative: negative 22 okay 00:13:42.160 |
and at some point here of course the slope would be zero now for this specific function 00:13:48.320 |
i looked it up previously and it's at point 2 over 3 so at roughly 2 over 3 that's somewhere here 00:13:57.040 |
this derivative would be zero so basically at that precise point 00:14:01.120 |
yeah at that precise point if we nudge in a positive direction the function doesn't respond 00:14:08.160 |
it stays almost the same and so that's why the slope is zero (a small numerical check of these slopes is sketched below) 00:14:12.560 |
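Roughly, the numerical check being described looks like this (a sketch, assuming the same f(x) = 3x**2 - 4x + 5 and a small h as above):

```python
# A sketch of the numerical slope check described above,
# assuming f(x) = 3x**2 - 4x + 5 as defined earlier in the lecture.
def f(x):
    return 3*x**2 - 4*x + 5

h = 0.001
for x in [3.0, -3.0, 2/3]:
    slope = (f(x + h) - f(x)) / h   # rise over run
    print(f"x = {x:6.3f}   numerical slope ~ {slope:7.3f}")
# prints roughly 14 at x=3, roughly -22 at x=-3, and roughly 0 at x=2/3
```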
okay now let's look at a bit more complex case so we're going to start you know complexifying a bit so now we have a function 00:14:19.680 |
here with output variable d that is a function of three scalar inputs a b and c so a b and c are 00:14:27.280 |
some specific values three inputs into our expression graph and a single output d and so 00:14:33.440 |
if we just print d we get four and now what i'd like to do is i'd like to again look at the 00:14:38.720 |
derivatives of d with respect to a b and c and think through again just the intuition of what 00:14:45.600 |
this derivative is telling us so in order to evaluate this derivative we're going to get a 00:14:51.040 |
bit hacky here we're going to again have a very small value of h and then we're going to fix the 00:14:56.960 |
inputs at some values that we're interested in so these are the this is the point a b c at which 00:15:03.360 |
we're going to be evaluating the derivative of d with respect to all a b and c at that point 00:15:08.800 |
so there are the inputs and now we have d1 is that expression and then we're going to for example 00:15:14.880 |
look at the derivative of d with respect to a so we'll take a and we'll bump it by h and then we'll 00:15:20.400 |
get d2 to be the exact same function and now we're going to print, you know, d1 is d1 00:15:30.000 |
d2 is d2 and print slope so the derivative or slope here will be um of course d2 minus d1 00:15:42.800 |
divide h so d2 minus d1 is how much the function increased when we bumped the uh the specific 00:15:52.560 |
input that we're interested in by a tiny amount and this is the normalized by h to get the slope 00:16:02.720 |
so if i just run this we're going to print d1 which we know is four 00:16:14.240 |
now d2 will be bumped a will be bumped by h so let's just think through a little bit 00:16:23.760 |
what d2 will be uh printed out here in particular d1 will be four will d2 be a number slightly 00:16:33.200 |
greater than four or slightly lower than four and that's going to tell us the sign 00:16:38.320 |
of the derivative so we're bumping a by h b is minus three c is 10 so you can just intuitively 00:16:49.760 |
think through this derivative and what it's doing a will be slightly more positive and but b is a 00:16:55.840 |
negative number so if a is slightly more positive because b is negative three we're actually going 00:17:03.840 |
to be adding less to d so you'd actually expect that the value of the function will go down 00:17:13.760 |
so let's just see this yeah and so we went from four to 3.9996 and that tells you that the slope 00:17:22.000 |
will be negative so the slope will be a negative number because we went down and the exact 00:17:29.440 |
amount of the slope is negative three and you can also convince yourself 00:17:34.880 |
that negative three is the right answer uh mathematically and analytically because if you 00:17:39.280 |
have a times b plus c and you are you know you have calculus then uh differentiating a times b 00:17:45.440 |
plus c with respect to a gives you just b and indeed the value of b is negative three which 00:17:51.040 |
is the derivative that we have so you can tell that that's correct so now if we do this with b 00:17:56.320 |
so if we bump b by a little bit in a positive direction we'd get different slopes so what is 00:18:03.040 |
the influence of b on the output d so if we bump b by a tiny amount in a positive direction then 00:18:09.520 |
because a is positive we'll be adding more to d right so now what is the 00:18:16.960 |
sensitivity what is the slope of that addition and it might not surprise you that this should be 00:18:21.840 |
two and why is it two because dd by db, differentiating with respect to b, would 00:18:30.720 |
give us a and the value of a is two so that's also working well and then if c gets bumped a tiny 00:18:37.040 |
amount, by h, then of course a times b is unaffected and now c becomes slightly higher 00:18:44.240 |
what does that do to the function it makes it slightly higher because we're simply adding c 00:18:48.240 |
and it makes it slightly higher by the exact same amount that we added to c and so that tells 00:18:53.840 |
you that the slope is one; that will be the rate at which d will increase as we scale c, okay (see the numerical check sketched below) 00:19:05.360 |
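As a quick sketch, the three nudges just talked through can be written out like this (assuming the lecture's inputs a=2.0, b=-3.0, c=10.0):

```python
# Numerical estimates of the three partial derivatives of d = a*b + c.
h = 0.0001
a, b, c = 2.0, -3.0, 10.0
d1 = a*b + c                                  # 4.0

print("dd/da ~", ((a + h)*b + c - d1) / h)    # expect b = -3
print("dd/db ~", (a*(b + h) + c - d1) / h)    # expect a =  2
print("dd/dc ~", (a*b + (c + h) - d1) / h)    # expect 1
```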
so we now have some intuitive sense of what this derivative is telling you about the function 00:19:09.280 |
and we'd like to move to neural networks now as i mentioned neural networks will be pretty massive 00:19:13.120 |
expressions mathematical expressions so we need some data structures that maintain these expressions 00:19:17.680 |
and that's what we're going to start to build out now so we're going to build out this value object 00:19:23.760 |
that i showed you in the readme page of micrograd so let me copy paste a skeleton of the first very 00:19:31.440 |
simple value object so class value takes a single scalar value that it wraps and keeps track of 00:19:38.640 |
and that's it so we can for example do value of 2.0 and then we can get we can look at its content 00:19:47.760 |
and python will internally use the __repr__ function to return this string, oops, like that 00:19:57.280 |
so this is a value object with data equals two that we're creating here now we'd like to do is 00:20:04.080 |
like we'd like to be able to have not just like two values but we'd like to do a plus b right we'd 00:20:12.000 |
like to add them so currently you would get an error because python doesn't know how to add 00:20:17.760 |
two value objects so we have to tell it so here's addition 00:20:22.880 |
so you have to basically use these special double underscore methods in python to define these 00:20:30.880 |
operators for these objects so if we use this plus operator python will 00:20:39.760 |
internally call a.__add__(b), that's what will happen internally, and so b will be the other 00:20:47.280 |
and self will be a and so we see that what we're going to return is a new value object 00:20:54.000 |
and it's just uh it's going to be wrapping the plus of their data but remember now because 00:21:01.360 |
data is the actual python number, this operator here is just the typical floating 00:21:07.360 |
point addition, not an addition of value objects, and we'll return a new value 00:21:13.680 |
so now a plus b should work and it should print value of negative one because that's two plus 00:21:19.200 |
minus three there we go okay let's now implement multiply just so we can recreate this expression 00:21:25.920 |
here so multiply i think it won't surprise you will be fairly similar so instead of add we're 00:21:32.880 |
going to be using mul and then here of course we want to do times and so now we can create a c value 00:21:38.640 |
object which will be 10.0 and now we should be able to do a times b well let's just do a times b 00:21:45.280 |
first um that's value of negative six now and by the way i skipped over this a little bit uh suppose 00:21:52.960 |
that i didn't have the __repr__ function here, then you'll get some kind of an 00:21:57.680 |
ugly expression so what __repr__ is doing is it's providing us a way to print out like a nicer 00:22:03.120 |
looking expression in python uh so we don't just have something cryptic we actually are you know 00:22:09.040 |
it's value of negative six so this gives us a times b and then we should now be able to add 00:22:16.560 |
c to it because we've defined and told python how to do mul and add and so this 00:22:22.240 |
will basically be equivalent to a.__mul__(b) and then this new value object will be .__add__(c) 00:22:31.760 |
and so let's see if that worked yep so that worked well that gave us four which is what we expect 00:22:37.680 |
from before and i believe you can just call them manually as well, there we go (a minimal sketch of the Value class so far is below) 00:22:45.600 |
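For reference, here is a minimal sketch of the Value object as built up to this point (just data, __repr__, __add__ and __mul__; the graph bookkeeping comes next):

```python
# A minimal sketch of the Value class so far.
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        # a + b calls a.__add__(b); wrap the plain float sum in a new Value
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
print(a*b + c)                     # Value(data=4.0)
print(a.__mul__(b).__add__(c))     # the same thing, called manually
```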
okay so now what we are missing is the connective tissue of this expression as i mentioned we want to keep 00:22:50.000 |
these expression graphs so we need to know and keep pointers about what values produce what other 00:22:55.440 |
values so here for example we are going to introduce a new variable which we'll call 00:23:00.320 |
children and by default it will be an empty tuple and then we're actually going to keep a slightly 00:23:05.120 |
different variable in the class which we'll call underscore prev which will be the set of children 00:23:10.320 |
this is how i did it in the original micrograd, looking at my code here, i can't remember 00:23:16.320 |
exactly the reason i believe it was efficiency but this underscore children will be a tuple 00:23:20.800 |
for convenience but then when we actually maintain it in the class it will be just this set 00:23:24.400 |
i believe for efficiency so now when we are creating a value like this with a constructor 00:23:32.800 |
children will be empty and _prev will be the empty set but when we are creating a value through 00:23:37.360 |
addition or multiplication we're going to feed in the children of this value which in this case 00:23:43.520 |
is self and other so those are the children here so now we can do d._prev and we'll see that 00:23:54.160 |
the children of d, we now know, are this value of negative six and the value of 10 and this of course 00:24:00.880 |
is the value resulting from a times b and the c value which is 10 now the last piece of information 00:24:08.400 |
we don't know so we know now the children of every single value but we don't know what operation 00:24:12.960 |
created this value so we need one more element here let's call it underscore op and by default 00:24:19.920 |
this is the empty string for leaves and then we'll just maintain it here and now the operation will 00:24:26.880 |
be just a simple string and in the case of addition it's plus in the case of multiplication 00:24:32.000 |
it's times so now we not just have d._prev, we also have d._op, and we know that d was 00:24:39.600 |
produced by an addition of those two values and so now we have the full mathematical expression 00:24:45.040 |
and we're building out this data structure and we know exactly how each value came to be by what 00:24:50.320 |
expression and from what other values (a quick sketch of the Value class with these additions is below) 00:24:56.880 |
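A quick sketch of the class with the _children/_prev and _op bookkeeping just described (names follow micrograd's):

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)   # the Values that produced this one
        self._op = _op                # the operation that produced it; '' for a leaf

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a*b + c
print(d._prev)   # {Value(data=-6.0), Value(data=10.0)}
print(d._op)     # '+'
```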
now because these expressions are about to get quite a bit larger we'd like a way to nicely visualize these expressions that we're building out so for that 00:25:02.560 |
i'm going to copy paste a bunch of slightly scary code that's going to visualize this 00:25:07.840 |
these expression graphs for us so here's the code and i'll explain it in a bit but first let me just 00:25:13.040 |
show you what this code does basically what it does is it creates a new function draw dot that 00:25:18.400 |
we can call on some root node and then it's going to visualize it so if we call draw dot on d which 00:25:25.040 |
is this final value here that is a times b plus c it creates something like this so this is d 00:25:32.240 |
and you see that this is a times b creating an intermediate value, plus c gives us this output node d 00:25:38.880 |
so that's draw dot of d and i'm not going to go through this in complete detail you can take a 00:25:45.840 |
look at graphviz and its api, graphviz is an open source graph visualization software 00:25:50.800 |
and what we're doing here is we're building out this graph in the graphviz api and you can basically 00:25:57.600 |
see that trace is this helper function that enumerates all the nodes and edges in the graph 00:26:02.160 |
so that just builds a set of all the nodes and edges and then we iterate through all the nodes 00:26:06.720 |
and we create special node objects for them using dot.node and then we also create edges 00:26:15.120 |
using dot.edge and the only thing that's like slightly tricky here is you'll notice that i 00:26:20.400 |
basically add these fake nodes which are these operation nodes so for example this node here 00:26:25.680 |
is just like a plus node and i create these special op nodes here and i connect them accordingly 00:26:36.240 |
so these nodes of course are not actual nodes in the original graph they're not actually a value 00:26:42.960 |
object the only value objects here are the things in squares those are actual value objects or 00:26:48.960 |
representations thereof and these op nodes are just created in this draw dot routine so that it 00:26:54.160 |
looks nice (a condensed sketch of trace and draw_dot is below) 00:27:00.320 |
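Here is a condensed sketch of what that trace/draw_dot code is doing (an approximation of the notebook code, assuming the Value class sketched above and the graphviz Python package):

```python
from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and edges in the expression graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})  # draw left to right
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # a rectangular node for every Value in the graph
        dot.node(name=uid, label=f"data {n.data:.4f}", shape='record')
        if n._op:
            # a fake op node for the operation that produced this Value
            dot.node(name=uid + n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    for n1, n2 in edges:
        # connect each child to the op node of the Value it feeds into
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot
```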
let's also add labels to these graphs just so we know what variables are where, so let's create a special underscore label, or let's just do label equals empty by default, and save it in each 00:27:08.560 |
node and then here we're going to do label as a label is b label is c and then let's create a 00:27:25.680 |
special e equals a times b and e dot label will be e, it's kind of naughty, and d will be e plus c 00:22:37.200 |
and d dot label will be d, okay so nothing really changes, i just added 00:27:45.840 |
that new e variable and then here when we are printing this i'm going to print the label here 00:27:54.080 |
so this will be a percent s bar and this will be n dot label 00:27:58.240 |
and so now we have the label on the left here so it says a b creating e and then e plus c 00:28:07.680 |
creates d just like we have it here and finally let's make this expression just one layer deeper 00:28:13.520 |
so d will not be the final output node instead after d we are going to create a new value object 00:28:21.760 |
called f we're going to start running out of variables soon f will be negative 2.0 and its 00:28:27.920 |
label will of course just be f and then L capital L will be the output of our graph and L will be 00:28:36.160 |
d times f okay so L will be negative 8 is the output so now we don't just draw a d we draw L 00:28:50.000 |
okay and somehow the label of L is undefined, oops, that label has to be explicitly 00:28:57.680 |
sort of given to it, there we go, so L is the output (the full little expression we've built is written out below) 00:29:03.600 |
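Written out as plain code, the little expression graph at this point is roughly (label is the attribute just added for nicer visualization):

```python
a = Value(2.0);  a.label = 'a'
b = Value(-3.0); b.label = 'b'
c = Value(10.0); c.label = 'c'
e = a * b;       e.label = 'e'
d = e + c;       d.label = 'd'
f = Value(-2.0); f.label = 'f'
L = d * f;       L.label = 'L'
print(L.data)    # -8.0, the result of the forward pass
```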
so let's quickly recap what we've done: we are able to build out mathematical expressions using only plus and times so far 00:29:08.160 |
they are scalar valued along the way and we can do this forward pass and build out a mathematical 00:29:15.440 |
expression so we have multiple inputs here a b c and f going into a mathematical expression 00:29:21.520 |
that produces a single output L and this here is visualizing the forward pass so the output of the 00:29:28.000 |
forward pass is negative 8 that's the value now what we'd like to do next is we'd like to run 00:29:33.920 |
back propagation and in back propagation we are going to start here at the end and we're going to 00:29:39.600 |
reverse and calculate the gradient along all these intermediate values and really what 00:29:45.920 |
we're computing for every single value here we're going to compute the derivative of that node with 00:29:52.800 |
respect to L so the derivative of L with respect to L is just one and then we're going to derive 00:30:01.520 |
what is the derivative of L with respect to f with respect to d with respect to c with respect to e 00:30:07.600 |
with respect to b and with respect to a and in a neural network setting you'd be very interested 00:30:13.040 |
in the derivative of basically this loss function L with respect to the weights of a neural network 00:30:18.800 |
and here of course we have just these variables a b c and f but some of these will eventually 00:30:23.840 |
represent the weights of a neural net and so we'll need to know how those weights are impacting the 00:30:29.200 |
loss function so we'll be interested basically in the derivative of the output with respect to some 00:30:33.760 |
of its leaf nodes and those leaf nodes will be the weights of the neural net and the other leaf 00:30:38.800 |
nodes of course will be the data itself but usually we will not want or use the derivative of 00:30:43.920 |
the loss function with respect to data because the data is fixed but the weights will be iterated on 00:30:49.680 |
using the gradient information so next we are going to create a variable inside the value class 00:30:55.440 |
that maintains the derivative of L with respect to that value and we will call this variable grad 00:31:03.840 |
so there's self.data and there's self.grad, and initially it will be zero, and remember that 00:31:10.320 |
zero basically means no effect so at initialization we're assuming that every value 00:31:15.520 |
does not affect the output right because if the gradient is zero that means that 00:31:21.920 |
changing this variable is not changing the loss function so by default we assume that the gradient 00:31:27.680 |
is zero and then now that we have grad and it's you know 0.0 we are going to be able to visualize 00:31:38.080 |
it here after data so here grad is %.4f and this will be n.grad and now we are going to be 00:31:47.040 |
showing both the data and the grad initialized at zero and we are just about getting ready to 00:31:55.600 |
calculate the back propagation and of course this grad again as I mentioned is representing 00:32:00.400 |
the derivative of the output, in this case L, with respect to this value, so this 00:32:06.320 |
is the derivative of L with respect to f with respect to d and so on so let's now fill in those 00:32:11.680 |
gradients and actually do back propagation manually so let's start filling in these gradients 00:32:15.840 |
and start all the way at the end as I mentioned here first we are interested to fill in this 00:32:20.400 |
gradient here so what is the derivative of L with respect to L in other words if I change L 00:32:26.800 |
by a tiny amount h how much does L change it changes by h so it's proportional and therefore 00:32:35.440 |
the derivative will be one we can of course measure these or estimate these numerical 00:32:40.320 |
gradients numerically just like we've seen before so if I take this expression and I create a def 00:32:46.800 |
lol function here and put this here now the reason I'm creating a gating function lol here is because 00:32:53.840 |
I don't want to pollute or mess up the global scope here this is just kind of like a little 00:32:57.840 |
staging area and as you know in Python all of these will be local variables to this function 00:33:02.560 |
so I'm not changing any of the global scope here so here L1 will be L 00:33:07.520 |
and then copy pasting this expression we're going to add a small amount h 00:33:15.280 |
and for example a right and this would be measuring the derivative of L with respect to a 00:33:24.480 |
so here this will be L2 and then we want to print that derivative so print L2 minus L1 which is how 00:33:33.280 |
much L changed and then normalize it by h so this is the rise over run and we have to be careful 00:33:40.400 |
because L is a value node so we actually want its data so that these are floats dividing by h 00:33:48.880 |
and this should print the derivative of L with respect to a because a is the one that we bumped 00:33:53.920 |
a little bit by h so what is the derivative of L with respect to a, it's six (a sketch of this little gradient check is below) 00:34:03.520 |
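A sketch of that throwaway gradient-check function: build the expression twice, nudge a by h the second time, and compare the outputs numerically.

```python
def lol():
    h = 0.001

    a = Value(2.0); b = Value(-3.0); c = Value(10.0)
    e = a * b; d = e + c
    f = Value(-2.0)
    L = d * f
    L1 = L.data

    a = Value(2.0 + h); b = Value(-3.0); c = Value(10.0)   # bump a by h
    e = a * b; d = e + c
    f = Value(-2.0)
    L = d * f
    L2 = L.data

    print((L2 - L1) / h)   # numerical estimate of dL/da, roughly 6

lol()
```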
okay and obviously if we change L itself by h then, well, this looks really awkward, but 00:34:14.000 |
changing L by h you see the derivative here is one that's kind of like the base case of 00:34:22.560 |
what we are doing here so basically we can come up here and we can manually set L.grad to one 00:34:29.600 |
this is our manual back propagation L.grad is one and let's redraw and we'll see that we filled in 00:34:37.280 |
grad is one for L we're now going to continue the back propagation so let's here look at the 00:34:42.160 |
derivatives of L with respect to D and F let's do a D first so what we are interested in if I 00:34:49.120 |
create a markdown on here is we'd like to know basically we have that L is D times F and we'd 00:34:54.560 |
like to know what is D L by D D what is that and if you know your calculus L is D times F so what 00:35:04.400 |
is D L by D D it would be F and if you don't believe me we can also just derive it because 00:35:11.200 |
the proof would be fairly straightforward we go to the definition of the derivative which is F of x 00:35:18.560 |
plus h minus F of x divide h as a limit limit of h goes to zero of this kind of expression so when 00:35:26.160 |
we have L is D times F then increasing D by h would give us the output of D plus h times F 00:35:34.480 |
that's basically F of x plus h right minus D times F and then divide h and symbolically 00:35:44.880 |
expanding out here we would have basically D times F plus h times F minus D times F divide h 00:35:51.600 |
and then you see how the D F minus D F cancels so you're left with h times F 00:35:56.160 |
divide h which is F so in the limit as h goes to zero of you know derivative 00:36:06.720 |
definition we just get F in the case of D times F so symmetrically D L by D F will just be D 00:36:17.120 |
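For reference, the little derivation just talked through, written out (for L = d·f):

```latex
\[
\frac{\partial L}{\partial d}
  = \lim_{h \to 0} \frac{(d+h)\,f - d\,f}{h}
  = \lim_{h \to 0} \frac{h\,f}{h}
  = f,
\qquad
\frac{\partial L}{\partial f} = d \quad \text{(by symmetry)}.
\]
```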
so what we have is that F dot grad we see now is just the value of D which is four 00:36:25.120 |
and we see that D dot grad is just the value of F 00:36:34.560 |
and so the value of F is negative two so we'll set those manually 00:36:41.600 |
let me erase this markdown node and then let's redraw what we have 00:36:46.640 |
okay and let's just make sure that these were correct so we seem to think that D L by D D is 00:36:55.040 |
negative two so let's double check um let me erase this plus h from before and now we want the 00:37:00.400 |
derivative with respect to F so let's just come here when I create F and let's do a plus h here 00:37:05.840 |
and this should print a derivative of L with respect to F so we expect to see four yeah and 00:37:11.360 |
this is four up to floating point funkiness and then D L by D D should be F which is negative two 00:37:22.320 |
grad is negative two so if we again come here and we change D 00:37:26.080 |
D dot data plus equals h right here, so we've added a little h, and then we see how L 00:37:35.040 |
changed and we expect to print negative two there we go so we've numerically verified what we're 00:37:45.760 |
doing here is kind of like an inline gradient check gradient check is when we are deriving 00:37:50.720 |
this like back propagation and getting the derivative with respect to all the intermediate 00:37:54.720 |
results and then numerical gradient is just you know um estimating it using small step size now 00:38:01.760 |
we're getting to the crux of back propagation so this will be the most important node to understand 00:38:08.320 |
because if you understand the gradient for this node 00:38:12.720 |
you understand all of back propagation and 00:38:17.040 |
all of training of neural nets basically so we need to derive D L by D C in other words the 00:38:23.920 |
derivative of L with respect to C because we've computed all these other gradients already now 00:38:29.840 |
we're coming here and we're continuing the back propagation manually so we want D L by D C and 00:38:35.840 |
we'll also derive D L by D E now here's the problem how do we derive D L by D C we actually 00:38:44.640 |
know the derivative L with respect to D so we know how L is sensitive to D but how is L sensitive to 00:38:52.240 |
C so if we wiggle C how does that impact L through D so we know dL by dD 00:39:01.920 |
and we also here know how C impacts D and so just very intuitively if you know the 00:39:06.800 |
impact that C is having on D and the impact that D is having on L then you should be able to somehow 00:39:12.720 |
put that information together to figure out how C impacts L and indeed this is what we can actually 00:39:18.480 |
do so in particular we know just concentrating on D first let's look at how what is the derivative 00:39:24.400 |
basically of D with respect to C so in other words what is D D by D C so here we know that D 00:39:33.040 |
is C plus E, that's what we know, and now we're interested in D D by D C if you just know 00:39:40.400 |
your calculus again and you remember then differentiating C plus E with respect to C 00:39:44.880 |
you know that that gives you 1.0 and we can also go back to the basics and derive this 00:39:50.880 |
because again we can go to our f of x plus h minus f of x divide by h that's the definition 00:39:57.280 |
of a derivative as h goes to zero and so here focusing on C and its effect on D we can basically 00:40:05.120 |
do the f of x plus h, which will be C incremented by h, plus E, that's the first evaluation of our 00:40:12.160 |
function minus C plus E and then divide h and so what is this just expanding this out this will be 00:40:21.840 |
C plus h plus E minus C minus E, divide h, and then you see here how C minus C cancels, E minus E 00:40:29.680 |
cancels we're left with h over h which is 1.0 and so by symmetry also D D by D E will be 1.0 as well 00:40:42.000 |
so basically the derivative of a sum expression is very simple and this is the local derivative 00:40:48.240 |
so I call this the local derivative because we have the final output value all the way at the 00:40:52.480 |
end of this graph and we're now like a small node here and this is a little plus node and 00:40:58.320 |
it the little plus node doesn't know anything about the rest of the graph that it's embedded 00:41:02.880 |
in all it knows is that it did plus it took a C and an E added them and created D and this plus 00:41:09.760 |
node also knows the local influence of C on D or rather the derivative of D with respect to C 00:41:16.000 |
and it also knows the derivative of D with respect to E but that's not what we want that's just the 00:41:21.520 |
local derivative what we actually want is D L by D C and L is here just one step away 00:41:29.120 |
but in the general case this little plus node could be embedded in like a massive graph 00:41:34.560 |
so again we know how L impacts D and now we know how C and E impact D how do we put that 00:41:41.360 |
information together to write D L by D C and the answer of course is the chain rule in calculus 00:41:46.880 |
and so I pulled up chain rule here from Wikipedia and I'm going to go through this very briefly 00:41:55.520 |
so chain rule Wikipedia sometimes can be very confusing and calculus can 00:42:00.480 |
be very confusing like this is the way I learned chain rule and it was very confusing like what 00:42:06.960 |
is happening it's just complicated so I like this expression much better 00:42:11.200 |
if a variable Z depends on a variable Y which itself depends on a variable X 00:42:16.960 |
then Z depends on X as well obviously through the intermediate variable Y 00:42:21.680 |
and in this case the chain rule is expressed as if you want D Z by D X then you take the D Z by D Y 00:42:30.400 |
and you multiply it by D Y by D X so the chain rule fundamentally is telling you how we chain 00:42:38.240 |
these derivatives together correctly so to differentiate through a function composition 00:42:45.440 |
we have to apply a multiplication of those derivatives, so that's really what the chain rule is telling us, written out below 00:42:53.360 |
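In the notation of that Wikipedia passage (z depends on y, which depends on x), the rule is simply:

```latex
\[
\frac{dz}{dx} \;=\; \frac{dz}{dy} \cdot \frac{dy}{dx}
\]
```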
and there's a nice little intuitive explanation here which I also think is kind of cute 00:42:59.120 |
the chain rule states that knowing the instantaneous rate of change of Z with respect to Y 00:43:02.880 |
and Y relative to X allows one to calculate the instantaneous rate of change of Z relative to X 00:43:07.120 |
as a product of those two rates of change simply the product of those two so here's a good one if 00:43:14.480 |
a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man 00:43:19.200 |
then the car travels two times four eight times as fast as the man and so this makes it very clear 00:43:26.800 |
that the correct thing to do sort of is to multiply so car is twice as fast as bicycle 00:43:33.680 |
and bicycle is four times as fast as man so the car will be eight times as fast as the man 00:43:39.440 |
and so we can take these intermediate rates of change if you will and multiply them together 00:43:45.760 |
and that justifies the chain rule intuitively so have a look at chain rule but here really what 00:43:52.560 |
it means for us is there's a very simple recipe for deriving what we want which is dL by dc 00:43:58.000 |
and what we have so far is: we want dL by dc, and we know what is the impact of d on L, so we know dL 00:44:11.760 |
by dd, the derivative of L with respect to d, we know that that's negative two and now because of 00:44:18.240 |
this local reasoning that we've done here we know dd by dc so how does c impact d and in particular 00:44:28.000 |
this is a plus node so the local derivative is simply 1.0 it's very simple and so the chain 00:44:34.720 |
rule tells us that dL by dc going through this intermediate variable will just be simply dL by 00:44:44.000 |
dd times dd by dc that's chain rule so this is identical to what's happening here except 00:44:57.120 |
z is our L y is our d and x is our c so we literally just have to multiply these and because 00:45:10.320 |
these local derivatives like dd by dc are just one we basically just copy over dL by dd because 00:45:17.680 |
this is just times one so what does it do so because dL by dd is negative two what is dL by dc 00:45:24.640 |
well it's the local gradient 1.0 times dL by dd which is negative two so literally what a plus 00:45:32.880 |
node does you can look at it that way is it literally just routes the gradient because 00:45:37.760 |
the plus nodes local derivatives are just one and so in the chain rule one times dL by dd 00:45:44.400 |
is just dL by dd and so that derivative just gets routed to both c and to e in this case 00:45:54.560 |
so basically we have that e dot grad or let's start with c since that's the one we've looked at 00:46:02.480 |
is negative two times one negative two and in the same way by symmetry e dot grad will be negative 00:46:12.080 |
two that's the claim so we can set those we can redraw and you see how we just assign negative 00:46:20.880 |
two negative two so this back propagating signal which is carrying the information of like what is 00:46:25.680 |
the derivative of L with respect to all the intermediate nodes we can imagine it almost 00:46:30.080 |
like flowing backwards through the graph and a plus node will simply distribute the derivative 00:46:34.800 |
to all the leaf nodes sorry to all the children nodes of it so this is the claim and now let's 00:46:40.480 |
verify it so let me remove the plus h here from before and now instead what we're going to do is 00:46:46.400 |
we want to increment c so c dot data will be incremented by h and when I run this we expect 00:46:52.080 |
to see negative two negative two and then of course for e so e dot data plus equals h and 00:47:00.960 |
we expect to see negative two simple so those are the derivatives of these internal nodes 00:47:10.160 |
and now we're going to recurse our way backwards again and we're again going to apply the chain 00:47:17.200 |
rule so here we go our second application of chain rule and we will apply it all the way 00:47:21.920 |
through the graph we just happen to only have one more node remaining we have that dl by de 00:47:27.840 |
as we have just calculated is negative two so we know that so we know the derivative of l with 00:47:33.920 |
respect to e and now we want dl by da right and the chain rule is telling us that that's just dl by de 00:47:46.800 |
times the local gradient so what is the local gradient basically de by da we have to look at 00:47:58.000 |
that so i'm a little times node inside a massive graph and i only know that i did a times b and i 00:48:06.960 |
produced an e so now what is de by da and de by db that's the only thing that i sort of know about 00:48:14.400 |
that's my local gradient so because we have that e is a times b we're asking what is de by da 00:48:22.480 |
and of course we just did that here we had a times so i'm not going to re-derive it but if you want 00:48:31.040 |
to differentiate this with respect to a you'll just get b right the value of b which in this case 00:48:37.680 |
is negative 3.0 so basically we have that dl by da well let me just do it right here we have that 00:48:47.520 |
a dot grad and we are applying chain rule here is dl by de which we see here is negative two 00:48:54.800 |
times what is de by da it's the value of b which is negative three that's it 00:49:05.680 |
and then we have b dot grad is again dl by de which is negative two just the same way times 00:49:13.440 |
what is de by db, it's the value of a which is 2.0, that's the value of a, so these are our claimed 00:49:26.080 |
derivatives let's redraw and we see here that a dot grad turns out to be six because that is 00:49:33.440 |
negative two times negative three and b dot grad is negative four times sorry is negative two times 00:49:40.080 |
two which is negative four so those are our claims let's delete this and let's verify them 00:49:47.680 |
we have a here a dot data plus equals h so the claim is that a dot grad is six let's verify 00:49:59.360 |
six and we have b dot data plus equals h so nudging b by h and looking at what happens 00:50:10.320 |
we claim it's negative four and indeed it's negative four plus minus again float oddness 00:50:16.080 |
and that's it this that was the manual back propagation all the way from here to all the 00:50:26.880 |
leaf nodes and we've done it piece by piece and really all we've done is as you saw we iterated 00:50:32.400 |
through all the nodes one by one and locally applied the chain rule 00:50:37.280 |
we always know what is the derivative of l with respect to 00:50:42.880 |
this little output and then we look at how this output was produced this output was produced 00:50:47.600 |
through some operation and we have the pointers to the children nodes of this operation and so 00:50:53.200 |
in this little operation we know what the local derivatives are and we just multiply them onto 00:50:57.920 |
the derivative always so we just go through and recursively multiply on the local derivatives 00:51:04.080 |
and that's what back propagation is: it's just a recursive application of the chain rule backwards through the computation graph (the gradients we derived by hand are collected in the sketch below) 00:51:08.160 |
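Collected in one place, the manual backward pass for this little graph looks like this (a sketch, filling in .grad by hand with the chain rule, using the a, b, c, e, d, f, L from the forward pass above):

```python
L.grad = 1.0              # dL/dL
f.grad = d.data           # dL/df = d           =  4.0
d.grad = f.data           # dL/dd = f           = -2.0
c.grad = d.grad * 1.0     # the plus node routes the gradient: dd/dc = 1, so -2.0
e.grad = d.grad * 1.0     # dd/de = 1, so -2.0
a.grad = e.grad * b.data  # de/da = b  =>  -2 * -3 =  6.0
b.grad = e.grad * a.data  # de/db = a  =>  -2 *  2 = -4.0
```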
let's see this power in action just very briefly, what 00:51:13.680 |
we're going to do is we're going to nudge our inputs to try to make l go up so in particular 00:51:20.560 |
what we're doing is we want a dot data we're going to change it and if we want l to go up 00:51:26.000 |
that means we just have to go in the direction of the gradient so a should increase in the 00:51:31.840 |
direction of gradient by like some small step amount this is the step size and we don't just 00:51:37.040 |
want this for a but also for b, also for c, also for f, those are leaf nodes which we usually have 00:51:49.280 |
control over and if we nudge in direction of the gradient we expect a positive influence on l 00:51:55.680 |
so we expect l to go up positively so it should become less negative it should go up to say 00:52:02.640 |
negative you know six or something like that it's hard to tell exactly and we'd have to 00:52:08.240 |
rerun the forward pass so let me just do that here this would be the forward pass f would be 00:52:18.960 |
unchanged this is effectively the forward pass and now if we print l dot data we expect because 00:52:26.080 |
we nudged all the values all the inputs in the direction of gradient we expected less negative 00:52:30.640 |
l we expect it to go up so maybe it's negative six or so let's see what happens okay negative seven 00:52:37.520 |
and this is basically one step of an optimization that will end up running and really this gradient 00:52:45.120 |
just gives us some power because we know how to influence the final outcome, and this will be extremely useful for training neural nets, as we'll see (a sketch of this nudge-and-rerun step is below) 00:52:49.760 |
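A sketch of that single nudge-and-rerun step, using the gradients filled in above:

```python
step = 0.01
# nudge the leaf nodes in the direction of their gradients
a.data += step * a.grad
b.data += step * b.grad
c.data += step * c.grad
f.data += step * f.grad

# rerun the forward pass with the nudged leaves
e = a * b
d = e + c
L = d * f
print(L.data)   # slightly less negative than -8.0 (about -7.3 with these numbers)
```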
so now i would like to do one more example 00:52:56.080 |
of manual back propagation using a bit more complex and useful example we are going to 00:53:02.960 |
back propagate through a neuron so we want to eventually build up neural networks and in the 00:53:10.160 |
simplest case these are multi-layer perceptrons as they're called so this is a two-layer neural 00:53:14.800 |
net and it's got these hidden layers made up of neurons and these neurons are fully connected to 00:53:19.040 |
each other now biologically neurons are very complicated devices but we have very simple 00:53:23.920 |
mathematical models of them and so this is a very simple mathematical model of a neuron 00:53:29.360 |
you have some inputs x's and then you have these synapses that have weights on them so 00:53:35.600 |
the w's are weights and then the synapse interacts with the input to this neuron 00:53:43.360 |
multiplicatively so what flows to the cell body of this neuron is w times x but there's multiple 00:53:50.720 |
inputs so there's many w times x's flowing to the cell body the cell body then has also like some 00:53:56.960 |
bias so this is kind of like the innate sort of trigger happiness of this neuron so this bias 00:54:04.240 |
can make it a bit more trigger happy or a bit less trigger happy regardless of the input but basically 00:54:08.960 |
we're taking all the w times x of all the inputs adding the bias and then we take it through an 00:54:15.040 |
activation function and this activation function is usually some kind of a squashing function 00:54:19.760 |
like a sigmoid or tanh or something like that so as an example we're going to use the tanh in this 00:54:26.320 |
example numpy has an np.tanh so we can call it on an arange and we can plot it, this is the tanh function 00:54:38.080 |
and you see that the inputs as they come in get squashed on the y coordinate here so 00:54:44.000 |
right at zero we're going to get exactly zero and then as you go more positive in the input 00:54:50.080 |
then you'll see that the function will only go up to one and then plateau out and so if you pass in 00:54:57.200 |
very positive inputs we're going to cap it smoothly at one and on the negative side we're going to cap 00:55:02.320 |
it smoothly to negative one so that's tanh and that's the squashing function or an activation 00:55:09.040 |
function and what comes out of this neuron is just the activation function applied to the 00:55:14.000 |
dot product of the weights and the inputs so let's write one out i'm going to copy paste because 00:55:27.360 |
i don't want to type too much but okay so here we have the inputs x1 x2 so this is a two-dimensional 00:55:33.360 |
neuron so two inputs are going to come in these are thought of as the weights of this neuron 00:55:38.160 |
weights w1 w2 and these weights again are the synaptic strengths for each input and this is 00:55:45.600 |
the bias of the neuron b and now what we want to do is according to this model we need to multiply 00:55:53.440 |
x1 times w1 and x2 times w2 and then we need to add bias on top of it and it gets a little messy 00:56:02.720 |
here but all we are trying to do is x1 w1 plus x2 w2 plus b and these are multiplied here except 00:56:10.320 |
i'm doing it in small steps so that we actually have pointers to all these intermediate nodes 00:56:15.040 |
so we have an x1 w1 variable and then an x2 w2 variable and i'm also labeling them so n is now 00:56:23.760 |
the cell body raw activation without the activation function for now 00:56:29.680 |
and this should be enough to basically plot it so draw dot of n 00:56:34.560 |
gives us x1 times w1 x2 times w2 being added then the bias gets added on top of this and this n 00:56:47.280 |
is this sum so we're now going to take it through an activation function and let's say we use the 00:56:53.440 |
tan h so that we produce the output so what we'd like to do here is we'd like to do the output 00:56:59.600 |
and i'll call it o is n dot tan h okay but we haven't yet written the tan h now the reason 00:57:08.880 |
that we need to implement another tan h function here is that tan h is a hyperbolic function and 00:57:16.000 |
we've only so far implemented plus and the times and you can't make a tan h out of just pluses and 00:57:20.960 |
times you also need exponentiation so tan h is this kind of a formula here you can use either 00:57:27.840 |
one of these and you see that there's exponentiation involved which we have not implemented yet for 00:57:32.560 |
our little value node here so we're not going to be able to produce tanh yet and we have to go back 00:57:36.800 |
up and implement something like it now one option here is we could actually implement 00:57:44.880 |
exponentiation right and we could return the exp of a value instead of a tan h of a value 00:57:51.760 |
because if we had exp then we have everything else that we need so because we know how to add 00:57:58.560 |
and we know how to multiply so we'd be able to create tanh 00:58:04.960 |
if we knew how to exp but for the purposes of this example i specifically wanted to 00:58:10.000 |
show you that we don't necessarily need to have the most atomic pieces in this value object 00:58:18.160 |
we can actually like create functions at arbitrary points of abstraction they can be complicated 00:58:25.200 |
functions but they can be also very very simple functions like a plus and it's totally up to us 00:58:30.000 |
the only thing that matters is that we know how to differentiate through any one function 00:58:34.160 |
so we take some inputs and we make an output the only thing that matters it can be arbitrarily 00:58:38.560 |
complex function as long as you know how to create the local derivative if you know the 00:58:43.360 |
local derivative of how the inputs impact the output then that's all you need so we're going 00:58:47.760 |
to cluster up all of this expression and we're not going to break it down to its atomic pieces 00:58:53.200 |
we're just going to directly implement tanh so let's do that def tanh and then out will be a 00:59:00.320 |
value of and we need this expression here so um let me actually copy paste 00:59:11.040 |
let's grab n which is self.data and then this i believe is the tanh math.exp of 00:59:24.560 |
two x minus one over math.exp of two x plus one maybe i can call this x just so that it matches exactly 00:59:34.320 |
okay and now this will be t and uh children of this node there's just one child and i'm 00:59:44.800 |
wrapping it in a tuple so this is a tuple of one object just self and here the name of this 00:59:50.240 |
operation will be tan h and we're going to return that okay so now value should be 01:00:00.640 |
implementing tan h and now we can scroll all the way down here and we can actually do n dot tan h 01:00:06.560 |
and that's going to return the tan h output of n and now we should be able to draw dot of o 01:00:13.360 |
not of n so let's see how that worked there we go n went through tan h to produce this output 01:00:23.200 |
so now tan h is a sort of uh our little micro grad supported node here as an operation 01:00:31.120 |
and as long as we know the derivative of tan h then we'll be able to back propagate through it 01:00:38.240 |
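A minimal sketch of that tanh method, assuming the Value class built earlier in the lecture (with a data field, a children tuple and an _op label):

```python
import math

# inside the Value class (sketch)
def tanh(self):
    x = self.data
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)  # tanh(x)
    out = Value(t, (self,), 'tanh')
    return out
```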
now let's see this tan h in action currently it's not squashing too much because the input to it is 01:00:43.600 |
pretty low so if the bias is increased to say eight then we'll see that what's flowing into the tanh 01:00:51.680 |
now is two and tan h is squashing it to 0.96 so we're already hitting the tail of this tan h and 01:00:59.920 |
it will sort of smoothly go up to one and then plateau out over there okay so now i'm going to 01:01:04.240 |
do something slightly strange i'm going to change this bias from eight to this number 6.88 etc 01:01:11.840 |
and i'm going to do this for specific reasons because we're about to start back propagation 01:01:16.400 |
and i want to make sure that our numbers come out nice they're not like very crazy numbers 01:01:21.600 |
they're nice numbers that we can sort of understand in our head let me also add those 01:01:25.840 |
labels o is short for output here so that's the o okay so 0.88 flows into tanh comes out 0.7 01:01:35.120 |
so on so now we're going to do back propagation and we're going to fill in all the gradients 01:01:40.240 |
so what is the derivative o with respect to all the inputs here and of course in a typical neural 01:01:47.280 |
network setting what we really care about the most is the derivative of these neurons on the 01:01:52.720 |
weights specifically the w2 and w1 because those are the weights that we're going to be changing 01:01:57.760 |
part of the optimization and the other thing that we have to remember is here we have only a single 01:02:02.240 |
neuron but in the neural net you typically have many neurons and they're connected 01:02:07.120 |
so this is only like a one small neuron a piece of a much bigger puzzle and eventually there's a 01:02:11.280 |
loss function that sort of measures the accuracy of the neural net and we're back propagating with 01:02:15.440 |
respect to that accuracy and trying to increase it so let's start off back propagation here and 01:02:21.280 |
what is the derivative of o with respect to o the base case sort of we know always is that 01:02:27.920 |
the gradient is just 1.0 so let me fill it in and then let me split out the drawing function here 01:02:40.320 |
and then here cell clear this output here okay so now when we draw o we'll see that 01:02:52.320 |
o that grad is 1 so now we're going to back propagate through the tan h so to back propagate 01:02:57.840 |
through tan h we need to know the local derivative of tan h so if we have that o is tan h of n 01:03:07.200 |
then what is do by dn now what you could do is you could come here and you could take this 01:03:14.640 |
expression and you could do your calculus derivative taking and that would work but 01:03:20.880 |
we can also just scroll down wikipedia here into a section that hopefully tells us that derivative 01:03:27.520 |
d by dx of tan h of x is any of these i like this one 1 minus tan h square of x 01:03:34.480 |
so this is 1 minus tan h of x squared so basically what this is saying is that do by dn 01:03:43.760 |
is 1 minus tan h of n squared and we already have tan h of n it's just o so it's 1 minus o squared 01:03:55.760 |
so o is the output here so the output is this number o dot data is this number and then 01:04:08.080 |
what this is saying is that do by dn is 1 minus this squared so 1 minus o dot data squared 01:04:14.880 |
is 0.5 conveniently so the local derivative of this tan h operation here is 0.5 01:04:23.120 |
and so that would be do by dn so we can fill in that n dot grad 01:04:35.440 |
so this is exactly 0.5 one half so now we're going to continue the back propagation 01:04:47.680 |
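As a quick standalone check of that arithmetic (not from the notebook, just illustrating the formula):

```python
import math

o = math.tanh(0.8814)   # n is about 0.8814 in this example
print(1 - o**2)         # local derivative do/dn, about 0.5
```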
this is 0.5 and this is a plus node so how is backprop going to what is backprop going to do 01:04:55.680 |
here and if you remember our previous example a plus is just a distributor of gradient so this 01:05:02.160 |
gradient will simply flow to both of these equally and that's because the local derivative of this 01:05:06.960 |
operation is one for every one of its nodes so one times 0.5 is 0.5 so therefore we know that 01:05:14.880 |
this node here its grad is just 0.5 and we know that b dot grad is also 0.5 01:05:14.880 |
so let's set those and let's draw so those are 0.5 continuing we have another plus 0.5 again 01:05:33.280 |
we'll just distribute so 0.5 will flow to both of these so we can set their grads 01:05:39.520 |
x1 w1 and x2 w2 to 0.5 and let's redraw pluses are my favorite uh operations to back 01:05:51.280 |
propagate through because it's very simple so now what's flowing into these expressions is 0.5 01:05:57.760 |
and so really again keep in mind what the derivative is telling us at every point in time 01:06:01.360 |
along here this is saying that if we want the output of this neuron to increase 01:06:06.480 |
then the influence on these expressions is positive on the output both of them are positive 01:06:20.480 |
so now back propagating to x2 w2 first this is a times node so we know that the local derivative 01:06:27.120 |
is you know the other term so if we want to calculate x2 dot grad 01:06:31.440 |
then can you think through what it's going to be 01:06:35.120 |
so x2 dot grad will be w2 dot data times this x2 w2 dot grad right 01:06:49.840 |
and w2 dot grad will be x2 dot data times x2 w2 dot grad 01:06:57.840 |
right so that's the little local piece of chain rule 01:07:03.040 |
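Written out explicitly, using the variable names the notebook is assumed to use, that little piece of chain rule is:

```python
# x2w2 = x2 * w2, so locally d(x2w2)/dx2 = w2.data and d(x2w2)/dw2 = x2.data;
# each local derivative is multiplied by the gradient flowing in from above
x2.grad = w2.data * x2w2.grad   # 1.0 * 0.5 = 0.5
w2.grad = x2.data * x2w2.grad   # 0.0 * 0.5 = 0.0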
let's set them and let's redraw so here we see that the gradient on our weight 01:07:10.560 |
2 is 0 because x2's data was 0 right but x2 will have the gradient 0.5 because data here was 1 01:07:18.560 |
and so what's interesting here right is because the input x2 was 0 then because of the way the 01:07:25.440 |
times works of course this gradient will be 0 and think about intuitively why that is 01:07:30.640 |
derivative always tells us the influence of this on the final output if i wiggle w2 how is the 01:07:39.200 |
output changing it's not changing because we're multiplying by 0 so because it's not changing 01:07:44.400 |
there is no derivative and 0 is the correct answer because we're squashing that 0 01:07:49.600 |
and let's do it here 0.5 should come here and flow through this times and so we'll have that 01:07:58.240 |
x1 dot grad is can you think through a little bit what what this should be 01:08:06.160 |
the local derivative of times with respect to x1 is going to be w1 so w1's data times 01:08:14.240 |
x1 w1 dot grad and w1 dot grad will be x1 dot data times x1 w1 dot grad 01:08:25.760 |
let's see what those came out to be so this is 0.5 so this would be negative 1.5 and this would be 01:08:33.360 |
1 and we've back propagated through this expression these are the actual final derivatives 01:08:39.040 |
so if we want this neuron's output to increase we know that what's necessary is that 01:08:45.440 |
w2 we have no gradient w2 doesn't actually matter to this neuron right now but this neuron this 01:08:52.640 |
weight should go up so if this weight goes up then this neuron's output would have gone up 01:08:59.360 |
and proportionally because the gradient is 1 okay so doing the back propagation manually is 01:09:04.080 |
obviously ridiculous so we are now going to put an end to this suffering and we're going to see 01:09:08.560 |
how we can implement the backward pass a bit more automatically we're not going to be doing all of 01:09:13.280 |
it manually out here it's now pretty obvious to us by example how these pluses and times are back 01:09:18.560 |
propagating gradients so let's go up to the value object and we're going to start codifying what 01:09:25.520 |
we've seen in the examples below so we're going to do this by storing a special self.backward 01:09:32.720 |
and underscore backward and this will be a function which is going to do that little piece 01:09:39.920 |
of chain rule at each little node that took inputs and produced an output we're going 01:09:45.520 |
to store how we are going to chain the outputs gradient into the inputs gradients so by default 01:09:54.000 |
this will be a function that doesn't do anything so and you can also see that here in the value in 01:10:01.600 |
micro grad so we have this backward function by default doesn't do anything this is an empty 01:10:09.040 |
function and that would be sort of the case for example for a leaf node for leaf node there's 01:10:13.360 |
nothing to do but now if when we're creating these out values these out values are an addition of 01:10:22.160 |
self and other and so we will want to set out's backward to be the function that 01:10:30.560 |
propagates the gradient so let's define what should happen 01:10:37.200 |
and we're going to store it in a closure let's define what should happen when we call outs grad 01:10:47.760 |
for addition our job is to take out's grad and propagate it into self's grad and other's grad 01:10:56.960 |
so basically we want to set self dot grad to something and we want to set other dot grad 01:11:02.720 |
to something okay and the way we saw below how chain rule works we want to take the local derivative 01:11:10.720 |
times the um sort of global derivative i should call it which is the derivative of the final 01:11:16.240 |
output of the expression with respect to outs data with respect to out so the local derivative 01:11:25.680 |
of self in an addition is 1.0 so it's just 1.0 times outs grad that's the chain rule 01:11:35.120 |
and other dot grad will be 1.0 times out grad and what you're basically seeing here 01:11:40.960 |
is that out's grad will simply be copied onto self's grad and other's grad as we saw happens 01:11:47.760 |
for an addition operation so we're going to later call this function to propagate the gradient 01:11:53.440 |
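A sketch of what __add__ looks like with this _backward closure attached; note the plain assignments, which is what the lecture uses at this point (they become += later when the accumulation bug is fixed):

```python
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        # local derivative of addition is 1.0 for each input,
        # so out's gradient is simply passed through
        self.grad = 1.0 * out.grad
        other.grad = 1.0 * out.grad
    out._backward = _backward

    return out
```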
having done an addition let's now do multiplication we're going to also define backward 01:11:59.280 |
and we're going to set its backward to be backward 01:12:07.760 |
and we want to chain out grad into self dot grad 01:12:11.840 |
and others dot grad and this will be a little piece of chain rule for multiplication 01:12:19.600 |
so we'll have so what should this be can you think through 01:12:23.920 |
so what is the local derivative here the local derivative was other dot data 01:12:35.360 |
and then oops other dot data and then times out dot grad that's chain rule 01:12:40.960 |
and here we have self dot data times out dot grad that's what we've been doing 01:12:52.320 |
and then we want to set outs backwards to be just backward 01:13:00.480 |
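And the corresponding sketch for __mul__ (again with plain assignments for now):

```python
def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        # local derivative of a*b is b with respect to a, and a with respect to b
        self.grad = other.data * out.grad
        other.grad = self.data * out.grad
    out._backward = _backward

    return out
```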
and here for the tanh we need to back propagate we have out dot grad and we want to chain it into self dot grad 01:13:07.440 |
and self dot grad will be the local derivative of this operation that we've done here which is 01:13:14.880 |
tan h and so we saw that the local gradient is 1 minus the tan h of x squared which here is t 01:13:22.000 |
that's the local derivative because that's t is the output of this tan h 01:13:27.360 |
so 1 minus t squared is the local derivative and then gradient has to be multiplied because of the 01:13:33.840 |
chain rule so outs grad is chained through the local gradient into self dot grad and that should 01:13:40.000 |
be basically it so we're going to redefine our value node we're going to swing all the way down 01:13:46.160 |
here and we're going to redefine our expression make sure that all the grads are zero okay 01:13:56.240 |
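For reference, the tanh method with the _backward just derived looks roughly like this (still plain assignment at this stage):

```python
import math

def tanh(self):
    x = self.data
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        # local derivative of tanh is 1 - tanh(x)**2, and t is exactly tanh(x)
        self.grad = (1 - t**2) * out.grad
    out._backward = _backward

    return out
```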
but now we don't have to do this manually anymore we are going to basically be calling 01:14:01.120 |
the dot backward in the right order so first we want to call outs dot backward 01:14:09.040 |
so o was the outcome of tanh right so calling o's backward 01:14:22.160 |
will be this function this is what it will do now we have to be careful because 01:14:28.480 |
there's a times out dot grad and out dot grad remember is initialized to zero 01:14:34.640 |
so here we see grad zero so as a base case we need to set o dot grad to 1.0 01:14:46.640 |
to initialize this with one and then once this is one we can call o dot backward 01:14:56.560 |
and what that should do is it should propagate this grad through tan h so the local derivative 01:15:03.360 |
times the global derivative which is initialized at one so this should 01:15:16.000 |
dope so i thought about redoing it but i figured i should just leave the error in here because it's 01:15:21.120 |
pretty funny why is nonetype object not callable it's because i screwed up we're trying to save 01:15:28.880 |
these functions so this is correct this here you don't want to call the function because that 01:15:35.040 |
returns none these functions return none we just want to store the function so let me redefine the 01:15:40.640 |
value object and then we're going to come back in redefine the expression draw dot everything is 01:15:47.120 |
great o dot grad is one and now this should work of course okay so o dot underscore 01:15:56.960 |
backward should make this grad now be 0.5 if we redraw and if everything went correctly 0.5 yay 01:16:07.760 |
n dot backward sorry n dot underscore backward so that seems to have worked 01:16:15.600 |
so n's underscore backward routed the gradient to both of these so this is looking great 01:16:23.040 |
now we could of course call uh b dot grad b dot backward sorry what's going to happen 01:16:32.000 |
well b doesn't have a backward because b is a leaf node b's backward is by 01:16:38.960 |
initialization the empty function so nothing would happen but we can still call it on it 01:16:50.720 |
then we expect this 0.5 to get further routed right so there we go 0.5 0.5 01:17:00.960 |
and then finally we want to call it here on x2 w2 01:17:11.840 |
let's do both of those and there we go so we get 0.5 negative 1.5 and 1 exactly as we did before 01:17:25.360 |
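In other words, the manual sequence just performed was roughly the following, assuming the notebook's variable names:

```python
o.grad = 1.0          # base case
o._backward()         # fills in n.grad
n._backward()         # routes the gradient to x1w1x2w2 and b
b._backward()         # leaf node: does nothing
x1w1x2w2._backward()  # routes the gradient to x1w1 and x2w2
x2w2._backward()      # fills in x2.grad and w2.grad
x1w1._backward()      # fills in x1.grad and w1.grad
```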
but now we've done it through calling that backward sort of manually so we have the one 01:17:33.520 |
last piece to get rid of which is us calling underscore backward manually so let's think 01:17:38.640 |
through what we are actually doing we've laid out a mathematical expression and now we're trying to 01:17:43.920 |
go backwards through that expression so going backwards through the expression just means that 01:17:49.840 |
we never want to call a dot backward for any node before we've done sort of everything after it 01:17:58.800 |
so we have to do everything after it before we're ever going to call dot backward on any one node 01:18:03.840 |
we have to get all of its full dependencies everything that it depends on has to propagate 01:18:08.960 |
to it before we can continue back propagation so this ordering of graphs can be achieved using 01:18:15.680 |
something called topological sort so topological sort is basically a laying out of a graph such 01:18:23.280 |
that all the edges go only from left to right basically so here we have a graph it's a 01:18:28.720 |
directed acyclic graph a dag and these are two different topological orders of it i believe 01:18:35.520 |
where basically you'll see that it's a laying out of the nodes such that all the edges go only one 01:18:39.520 |
way from left to right and implementing topological sort you can look in wikipedia and so on i'm not 01:18:46.240 |
going to go through it in detail but basically this is what builds a topological graph we maintain a 01:18:54.960 |
set of visited nodes and then we are going through starting at some root node which for us is o 01:19:03.040 |
that's where we want to start the topological sort and starting at o we go through all of its children 01:19:08.720 |
and we need to lay them out from left to right and basically this starts at o if it's not visited 01:19:16.080 |
then it marks it as visited and then it iterates through all of its children and calls build 01:19:22.080 |
topological on them and then after it's gone through all the children it adds itself so basically 01:19:29.040 |
this node that we're going to call it on like say o is only going to add itself to the topo list 01:19:36.480 |
after all of the children have been processed and that's how this function is guaranteeing that 01:19:42.160 |
you're only going to be in the list once all your children are in the list and that's the invariant 01:19:46.480 |
that is being maintained so if we built up on o and then inspect this list we're going to see that 01:19:53.120 |
it ordered our value objects and the last one is the value of 0.707 which is the output 01:20:01.760 |
so this is o and then this is n and then all the other nodes get laid out before it 01:20:08.000 |
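The build_topo function being described is roughly this, assuming each Value keeps its children in _prev:

```python
topo = []
visited = set()

def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)   # a node is added only after all of its children

build_topo(o)
# topo now ends with o; reversed(topo) goes from the output back to the leaves
```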
so that builds the topological graph and really what we're doing now is we're just calling 01:20:14.720 |
dot underscore backward on all of the nodes in a topological order so if we just reset the 01:20:21.360 |
gradients they're all zero what did we do we started by setting o dot grad to be one 01:20:30.400 |
that's the base case then we built a topological order 01:20:35.920 |
and then we went for node in reversed of topo in the reverse order because this list goes from 01:20:50.800 |
you know we need to go through it in reversed order so starting at o node backward and this 01:20:59.040 |
should be it there we go those are the correct derivatives finally we are going to hide this 01:21:08.880 |
functionality so i'm going to copy this and we're going to hide it inside the value class because 01:21:15.200 |
we don't want to have all that code lying around so instead of an underscore backward we're now 01:21:20.160 |
going to define an actual backward so that backward without the underscore and that's 01:21:26.640 |
going to do all the stuff that we just derived so let me just clean this up a little bit so 01:21:33.360 |
build a topological graph starting at self so build topo of self will populate the topological 01:21:46.000 |
order into the topo list which is a local variable then we set self dot grad to be one 01:21:51.440 |
and then for each node in the reversed list so starting at us and going to all the children 01:21:58.160 |
underscore backward and that should be it so save come down here we define 01:22:11.120 |
okay all the grads are zero and now what we can do is o dot backward without the underscore and 01:22:17.360 |
there we go and that's uh that's back propagation at least for one neuron 01:22:24.960 |
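Putting it together, the backward method tucked into Value is roughly:

```python
def backward(self):
    # build the topological order starting at self
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # base case, then apply the chain rule from the output back to the leaves
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()
```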
now we shouldn't be too happy with ourselves actually because we have a 01:22:32.000 |
bad bug and we have not surfaced the bug because of some specific conditions that we have 01:22:37.840 |
to think about right now so here's the simplest case that shows the bug say i create a single node a 01:22:45.760 |
and then i create a b that is a plus a and then i call backward 01:22:52.480 |
so what's going to happen is a is three and then a b is a plus a so there's two arrows on top of 01:23:00.960 |
each other here then we can see that b is of course the forward pass works b is just a plus a 01:23:08.640 |
which is six but the gradient here is not actually correct that we calculate it automatically 01:23:14.240 |
and that's because um of course uh just doing calculus in your head the derivative of b with 01:23:23.200 |
respect to a should be uh two one plus one not the one we're getting so intuitively what's happening here right 01:23:32.240 |
so b is the result of a plus a and then we call backward on it so let's go up and see what that 01:23:38.800 |
does um b is the result of addition so out is b and then when we call backward what happened is 01:23:50.880 |
self dot grad was set to one and then other dot grad was set to one but because we're doing a plus 01:23:59.120 |
a self and other are actually the exact same object so we are overriding the gradient we are 01:24:06.000 |
setting it to one and then we are setting it again to one and that's why it stays at one 01:24:12.160 |
so that's a problem there's another way to see this in a little bit more complicated expression 01:24:18.320 |
so here we have a and b and then d will be the multiplication of the two and e will be the 01:24:30.320 |
addition of the two and then we multiply e times d to get f and then we call f dot backward 01:24:36.560 |
and these gradients if you check will be incorrect so fundamentally what's happening here again is 01:24:45.040 |
basically we're going to see an issue any time we use a variable more than once 01:24:48.560 |
until now in these expressions above every variable is used exactly once so we didn't 01:24:53.520 |
see the issue but here if a variable is used more than once what's going to happen during 01:24:57.920 |
backward pass we're back propagating from f to e to d so far so good but now e calls a backward 01:25:04.720 |
and it deposits its gradients to a and b but then we come back to d and call backward and it overwrites 01:25:11.520 |
those gradients at a and b so that's obviously a problem and the solution here if you look at 01:25:19.280 |
the multivariate case of the chain rule and its generalization there the solution there is 01:25:25.040 |
basically that we have to accumulate these gradients these gradients add and so instead of 01:25:31.600 |
setting those gradients we can simply do plus equals we need to accumulate those gradients 01:25:39.120 |
plus equals plus equals plus equals plus equals and this will be okay remember because we are 01:25:48.400 |
initializing them at zero so they start at zero and then any contribution that flows backwards 01:25:55.920 |
will simply add so now if we redefine this one because the plus equals this now works 01:26:06.000 |
because a dot grad started at zero and we called b dot backward we deposit one and then we deposit 01:26:11.840 |
one again and now this is two which is correct and here this will also work and we'll get correct 01:26:17.440 |
gradients because when we call e dot backward we will deposit the gradients from this branch 01:26:22.320 |
and then we get to back to d dot backward it will deposit its own gradients and then those 01:26:27.440 |
gradients simply add on top of each other and so we just accumulate those gradients and that fixes the issue 01:26:32.080 |
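Concretely, each _backward closure changes from plain assignment to accumulation, for example in __add__:

```python
def _backward():
    self.grad += 1.0 * out.grad    # was: self.grad = 1.0 * out.grad
    other.grad += 1.0 * out.grad   # was: other.grad = 1.0 * out.grad
out._backward = _backward
```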
okay now before we move on let me actually do a bit of cleanup here and delete 01:26:37.680 |
some of these some of this intermediate work so i'm not going to need any of this now that 01:26:42.880 |
we've derived all of it um we are going to keep this because i want to come back to it 01:26:48.720 |
delete the tanh delete our modeling example delete the step delete this keep the code that draws 01:26:59.680 |
and then delete this example and leave behind only the definition of value 01:27:04.160 |
and now let's come back to this non-linearity here that we implemented the tanh now i told you 01:27:09.920 |
that we could have broken down tanh into its explicit atoms in terms of other expressions 01:27:15.840 |
if we had the exp function so if you remember tanh is defined like this and we chose to develop tanh 01:27:21.760 |
as a single function and we can do that because we know its derivative and we can back propagate 01:27:26.000 |
through it but we can also break down tanh and express it as a function of exp and i would 01:27:31.600 |
like to do that now because i want to prove to you that you get all the same results and all the same 01:27:35.280 |
gradients um but also because it forces us to implement a few more expressions it forces us to 01:27:40.880 |
do exponentiation addition subtraction division and things like that and i think it's a good 01:27:45.760 |
exercise to go through a few more of these okay so let's scroll up to the definition of value 01:27:52.080 |
and here one thing that we currently can't do is take a value of say 2.0 and you know 01:27:59.040 |
here for example we want to add a constant one to it and we can't do something like this 01:28:03.920 |
and we can't do it because it says int object has no attribute data that's because a plus one 01:28:10.160 |
comes right here to add and then other is the integer one and then here python is trying to 01:28:16.160 |
access one dot data and that's not a thing and that's because basically one is not a value object 01:28:21.280 |
and we only have addition for value objects so as a matter of convenience so that we can create 01:28:26.800 |
expressions like this and make them make sense we can simply do something like this 01:28:30.640 |
basically we let other alone if other is an instance of value but if it's not an instance 01:28:38.000 |
of value we're going to assume that it's a number like an integer or float and we're going to simply 01:28:41.840 |
wrap it in in value and then other will just become value of other and then other will have 01:28:46.880 |
a data attribute and this should work so if i just say this redefine value then this should work 01:28:52.560 |
there we go okay and now let's do the exact same thing for multiply because we can't do something 01:28:57.680 |
like this again for the exact same reason so we just have to go to mul and if other is not a value 01:29:05.600 |
then let's wrap it in value let's redefine value and now this works now here's a kind of unfortunate 01:29:12.080 |
and not obvious part a times two works we saw that but two times a is that going to work 01:29:18.320 |
you'd expect it to right but actually it will not and the reason it won't is because python doesn't 01:29:25.120 |
know like when when you do a times two basically um so a times two python will go and it will 01:29:32.000 |
basically do something like a dot mul of two that's basically what it will call but then 01:29:38.000 |
two times a is the same as two dot mul of a and two can't multiply a value and so it's 01:29:45.840 |
really confused about that so instead what happens is in python the way this works is you are free to 01:29:50.800 |
define something called __rmul__ and __rmul__ is kind of like a fallback so if python can't do 01:29:58.800 |
two times a it will check if um if by any chance a knows how to multiply two and that will be called 01:30:06.720 |
into __rmul__ so because python can't do two times a it will check is there an __rmul__ in value and 01:30:13.680 |
because there is it will now call that and what we'll do here is we will swap the order of the 01:30:19.520 |
operands so basically two times a will redirect to __rmul__ and __rmul__ will basically call a times two 01:30:26.000 |
and that's how that will work so redefining that with __rmul__ two times a becomes four 01:30:32.880 |
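A sketch of the two conveniences just described, shown on __mul__ (the same wrapping also goes into __add__):

```python
def __mul__(self, other):
    # wrap raw ints/floats so expressions like `a * 2` work
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward

    return out

def __rmul__(self, other):
    # fallback Python calls for `2 * a`, when int.__mul__ can't handle a Value;
    # just swap the operands
    return self * other
```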
okay now looking at the other elements that we still need we need to know how to exponentiate and how to 01:30:36.320 |
divide so let's first do the exponentiation part we're going to introduce a 01:30:41.920 |
single function exp here and exp is going to mirror tanh in the sense that it's a simple 01:30:48.880 |
single function that transforms a single scalar value and outputs a single scalar value so we pop 01:30:54.000 |
out the python number we use math.exp to exponentiate it create a new value object everything 01:30:59.520 |
that we've seen before the tricky part of course is how do you back propagate through e to the x 01:31:04.800 |
and so here you can potentially pause the video and think about what should go here 01:31:10.000 |
okay so basically we need to know what is the local derivative of e to the x so d by dx of e 01:31:19.520 |
to the x is famously just e to the x and we've already just calculated e to the x and it's inside 01:31:25.520 |
out dot data so we can do out dot data times out dot grad that's the chain rule so we're just 01:31:32.640 |
chaining on to the current running grad and this is what the expression looks like it looks a little 01:31:37.680 |
confusing but this is what it is and that's the exponentiation so redefining we should now be able 01:31:43.600 |
to call a.exp and hopefully the backward pass works as well okay and the last thing we'd like 01:31:49.600 |
to do of course is we'd like to be able to divide now i actually will implement something slightly 01:31:54.880 |
more powerful than division because division is just a special case of something a bit more 01:31:58.960 |
powerful so in particular just by rearranging if we have some kind of a b equals value of 4.0 here 01:32:06.880 |
we'd like to basically be able to do a divide b and we'd like this to be able to give us 0.5 01:32:10.720 |
now division actually can be reshuffled as follows if we have a divide b that's actually 01:32:17.280 |
the same as a multiplying 1 over b and that's the same as a multiplying b to the power of negative 01:32:23.040 |
1 and so what i'd like to do instead is i basically like to implement the operation of x to the k for 01:32:29.440 |
some constant k so it's an integer or a float and we would like to be able to differentiate this and 01:32:36.160 |
then as a special case negative 1 will be division and so i'm doing that just because it's more 01:32:42.960 |
general and yeah you might as well do it that way so basically what i'm saying is we can redefine 01:32:49.760 |
division which we will put here somewhere yeah we can put it here somewhere what i'm saying is that 01:32:56.800 |
we can redefine division so self divide other this can actually be rewritten as self times 01:33:03.040 |
other to the power of negative one and now value raised to the power of negative one we have now 01:33:10.000 |
defined that so here's so we need to implement the pow function where am i going to put the pow 01:33:17.280 |
function maybe here somewhere so this function will be called 01:33:23.840 |
when we try to raise a value to some power and other will be that power now i'd like to make 01:33:29.520 |
sure that other is only an int or a float usually other is some kind of a different value object 01:33:35.360 |
but here other will be forced to be an int or a float otherwise the math won't work for for 01:33:42.320 |
what we're trying to achieve in the specific case that would be a different derivative expression 01:33:46.480 |
if we wanted other to be a value so here we create the output value which is just uh you know this 01:33:52.880 |
data raised to the power of other and other here could be for example negative one that's what we 01:33:57.280 |
are hoping to achieve and then uh this is the backward stub and this is the fun part which is 01:34:03.600 |
what is the uh chain rule expression here for back for um back propagating through the power function 01:34:12.320 |
where the power is to the power of some kind of a constant so this is the exercise and maybe 01:34:17.040 |
pause the video here and see if you can figure it out yourself as to what we should put here 01:34:20.720 |
okay so um you can actually go here and look at derivative rules as an example and we see lots of 01:34:33.360 |
derivative rules that you can hopefully know from calculus in particular what we're looking for is 01:34:37.600 |
the power rule because that's telling us that if we're trying to take d by dx of x to the n which 01:34:42.880 |
is what we're doing here then that is just n times x to the n minus one right okay so that's telling 01:34:52.160 |
us about the local derivative of this power operation so all we want here basically n is now 01:34:59.760 |
other and self.data is x and so this now becomes other which is n times self.data which is now a 01:35:10.800 |
python int or a float it's not a value object we're accessing the data attribute raised to the 01:35:17.600 |
power of other minus one or n minus one i can put brackets around this but this doesn't matter 01:35:23.440 |
because power takes precedence over multiply in python so that would have been okay and that's 01:35:29.840 |
the local derivative only but now we have to chain it and we chain it simply by multiplying by 01:35:34.560 |
out dot grad that's chain rule and this should technically work and we're going to find out soon 01:35:42.240 |
but now if we do this this should now work and we get 0.5 so the forward pass works but does the 01:35:49.760 |
backward pass work and i realized that we actually also have to know how to subtract so right now a 01:35:55.520 |
minus b will not work to make it work we need one more piece of code here and basically this is the 01:36:04.080 |
subtraction and the way we're going to implement subtraction is we're going to implement it by 01:36:09.440 |
addition of a negation and then to implement negation we're going to multiply by negative one 01:36:13.760 |
so just again using the stuff we've already built and just expressing it in terms of what we have 01:36:18.960 |
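For reference, here is roughly what this batch of new operations looks like on Value, following the descriptions above rather than the exact library code:

```python
import math

def exp(self):
    x = self.data
    out = Value(math.exp(x), (self,), 'exp')
    def _backward():
        # d/dx e**x = e**x, which we already have in out.data
        self.grad += out.data * out.grad
    out._backward = _backward
    return out

def __pow__(self, other):
    assert isinstance(other, (int, float)), "only int/float powers supported"
    out = Value(self.data**other, (self,), f'**{other}')
    def _backward():
        # power rule: d/dx x**n = n * x**(n-1)
        self.grad += other * self.data**(other - 1) * out.grad
    out._backward = _backward
    return out

def __truediv__(self, other):   # self / other
    return self * other**-1

def __neg__(self):              # -self
    return self * -1

def __sub__(self, other):       # self - other
    return self + (-other)
```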
and a minus b is now working okay so now let's scroll again to this expression here for this 01:36:23.840 |
neuron and let's just compute the backward pass here once we've defined o and let's draw it so 01:36:32.160 |
here's the gradients for all these leaf nodes for this two-dimensional neuron that has a tanh that 01:36:36.880 |
we've seen before so now what i'd like to do is i'd like to break up this tanh into this expression 01:36:43.520 |
here so let me copy paste this here and now instead of we'll preserve the label and we will 01:36:51.280 |
change how we define o so in particular we're going to implement this formula here so we need 01:36:57.200 |
e to the 2x minus 1 over e to the 2x plus 1 so e to the 2x we need to take 2 times n and we need to 01:37:04.960 |
exponentiate it that's e to the 2x and then because we're using it twice let's create an intermediate 01:37:10.800 |
variable e and then define o as e minus 1 over e plus 1 01:37:21.840 |
and that should be it and then we should be able to draw dot of o so now before i run this what do 01:37:29.120 |
we expect to see number one we're expecting to see a much longer graph here because we've broken up 01:37:35.280 |
tanh into a bunch of other operations but those operations are mathematically equivalent and so 01:37:40.480 |
what we're expecting to see is number one the same result here so the forward pass works and 01:37:46.000 |
number two because of that mathematical equivalence we expect to see the same backward pass and the 01:37:50.800 |
same gradients on these leaf nodes so these gradients should be identical so let's run this 01:37:56.400 |
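The decomposed forward pass just described, written out (assuming n is the pre-activation Value from the notebook):

```python
e = (2 * n).exp()        # e**(2n)
o = (e - 1) / (e + 1)    # tanh(n) expressed with exp, sub and div
o.label = 'o'
```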
so number one let's verify that instead of a single tanh node we have now exp and we have plus 01:38:04.880 |
we have times negative one this is the division and we end up with the same forward pass here 01:38:11.120 |
and then the gradients we have to be careful because they're in slightly different order 01:38:14.800 |
potentially the gradients for w2 x2 should be 0 and 0.5 w2 and x2 are 0 and 0.5 and w1 x1 are 1 01:38:23.440 |
and negative 1.5 1 and negative 1.5 so that means that both our forward passes and backward passes 01:38:30.160 |
were correct because this turned out to be equivalent to tanh before and so the reason 01:38:36.640 |
I wanted to go through this exercise is number one we got to practice a few more operations 01:38:40.960 |
and writing more backwards passes and number two I wanted to illustrate the point that the 01:38:46.320 |
level at which you implement your operations is totally up to you you can implement backward 01:38:52.400 |
passes for tiny expressions like a single individual plus or a single times or you can 01:38:57.040 |
implement them for say tanh which potentially you can see as a composite operation 01:39:02.960 |
because it's made up of all these more atomic operations but really all of this is kind of 01:39:07.280 |
like a fake concept all that matters is we have some kind of inputs and some kind of an output 01:39:11.360 |
and this output is a function of the inputs in some way and as long as you can do forward pass 01:39:15.360 |
and the backward pass of that little operation it doesn't matter what that operation is and how 01:39:21.440 |
composite it is if you can write the local gradients you can chain the gradient and you 01:39:25.840 |
can continue back propagation so the design of what those functions are is completely up to you 01:39:30.800 |
so now I would like to show you how you can do the exact same thing but using a modern deep neural 01:39:36.000 |
network library like for example PyTorch which I've roughly modeled micro grad by and so PyTorch 01:39:44.080 |
is something you would use in production and I'll show you how you can do the exact same thing but 01:39:48.080 |
in PyTorch API so I'm just going to copy paste it in and walk you through it a little bit this is 01:39:52.800 |
what it looks like so we're going to import PyTorch and then we need to define these value 01:40:00.000 |
objects like we have here now micro grad is a scalar valued engine so we only have scalar values 01:40:07.440 |
like 2.0 but in PyTorch everything is based around tensors and like I mentioned tensors are just 01:40:13.120 |
n-dimensional arrays of scalars so that's why things get a little bit more complicated here 01:40:18.640 |
I just need a scalar valued tensor a tensor with just a single element but by default when you work 01:40:24.720 |
with PyTorch you would use more complicated tensors like this so if I import PyTorch 01:40:31.600 |
then I can create tensors like this and this tensor for example is a two by three array of 01:40:40.080 |
scalars in a single compact representation so you can check its shape we see that it's a two by three 01:40:47.120 |
array and so on so this is usually what you would work with in the actual libraries so here I'm 01:40:54.160 |
creating a tensor that has only a single element 2.0 and then I'm casting it to be double because 01:41:03.920 |
Python is by default using double precision for its floating point numbers so I'd like everything 01:41:08.560 |
to be identical by default the data type of these tensors will be float 32 so it's only using a 01:41:15.280 |
single precision float so I'm casting it to double so that we have float 64 just like in Python 01:41:21.600 |
so I'm casting to double and then we get something similar to value of 2.0 the next thing I have to 01:41:28.640 |
do is because these are leaf nodes by default PyTorch assumes that they do not require gradients 01:41:33.760 |
so I need to explicitly say that all of these nodes require gradients okay so this is going 01:41:38.960 |
to construct scalar valued one element tensors make sure that PyTorch knows that they require 01:41:44.880 |
gradients now by default these are set to false by the way because of efficiency reasons because 01:41:50.240 |
usually you would not want gradients for leaf nodes like the inputs to the network and this 01:41:55.760 |
is just trying to be efficient in the most common cases so once we've defined all of our values 01:42:01.440 |
in PyTorch land we can perform arithmetic just like we can here in micrograd land so this would 01:42:06.240 |
just work and then there's a torch.tanh also and what we get back is a tensor again and we can 01:42:13.520 |
just like in micrograd it's got a data attribute and it's got grad attributes so these tensor 01:42:19.200 |
objects just like in micrograd have a dot data and a dot grad and the only difference here is that 01:42:24.800 |
we need to call a dot item because otherwise PyTorch dot item basically takes a single tensor 01:42:33.120 |
of one element and it just returns that element stripping out the tensor so let me just run this 01:42:38.800 |
and hopefully we are going to get this is going to print the forward pass which is 0.707 and this 01:42:45.360 |
will be the gradients which hopefully are 0.5 0 negative 1.5 and 1 so if we just run this 01:42:52.240 |
there we go 0.7 so the forward pass agrees and then 0.5 0 negative 1.5 and 1 01:42:59.680 |
so PyTorch agrees with us and just to show you here basically oh here's a tensor with a single 01:43:06.400 |
element and it's a double and we can call dot item on it to just get the single number out 01:43:13.440 |
so that's what item does and o is a tensor object like i mentioned and it's got a backward function 01:43:19.920 |
just like we've implemented and then all of these also have a dot grad so like x2 for example has 01:43:25.440 |
a grad and it's a tensor and we can pop out the individual number with dot item so basically 01:43:32.880 |
torch can do what we did in micrograd as a special case when your tensors are all single 01:43:38.960 |
element tensors but the big deal with PyTorch is that everything is significantly more efficient 01:43:44.080 |
because we are working with these tensor objects and we can do lots of operations in parallel 01:43:48.800 |
on all of these tensors but otherwise what we've built very much agrees with the API of PyTorch 01:43:55.040 |
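The PyTorch version just walked through looks roughly like this, with the values assumed to match the notebook (x1=2, x2=0, w1=-3, w2=1, b≈6.8814):

```python
import torch

x1 = torch.tensor([2.0]).double()
x2 = torch.tensor([0.0]).double()
w1 = torch.tensor([-3.0]).double()
w2 = torch.tensor([1.0]).double()
b = torch.tensor([6.8813735870195432]).double()
for t in (x1, x2, w1, w2, b):
    t.requires_grad = True        # leaf tensors don't require grad by default

n = x1 * w1 + x2 * w2 + b
o = torch.tanh(n)

print(o.data.item())              # forward pass, about 0.7071
o.backward()
print(x2.grad.item(), w2.grad.item(), x1.grad.item(), w1.grad.item())
# about 0.5, 0.0, -1.5, 1.0
```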
okay so now that we have some machinery to build out pretty complicated mathematical expressions 01:43:59.760 |
we can also start building up neural nets and as I mentioned neural nets are just a specific class 01:44:04.480 |
of mathematical expressions so we're going to start building out a neural net piece by piece 01:44:09.280 |
and eventually we'll build out a two-layer multi-layer perceptron as it's called 01:44:13.920 |
and I'll show you exactly what that means let's start with a single individual neuron we've 01:44:18.080 |
implemented one here but here I'm going to implement one that also subscribes to the 01:44:22.720 |
PyTorch API in how it designs its neural network modules so just like we saw that we can like 01:44:28.800 |
match the API of PyTorch on the autograd side we're going to try to do that on the neural 01:44:34.400 |
network modules so here's class neuron and just for the sake of efficiency I'm going to copy paste 01:44:41.760 |
some sections that are relatively straightforward so the constructor will take number of inputs to 01:44:48.640 |
this neuron which is how many inputs come to a neuron so this one for example has three inputs 01:44:55.120 |
and then it's going to create a weight that is some random number between negative one and one 01:44:59.440 |
for every one of those inputs and a bias that controls the overall trigger happiness of this 01:45:04.640 |
neuron and then we're going to implement a def underscore underscore call of self and x 01:45:12.720 |
some input x and really what we want to do here is w times x plus b where w times x here is 01:45:18.800 |
a dot product specifically now if you haven't seen call let me just return 0.0 here for now 01:45:26.000 |
the way this works now is we can have an x which is say like 2.0 3.0 then we can initialize a neuron 01:45:32.160 |
that is two-dimensional because these are two numbers and then we can feed those two numbers 01:45:36.720 |
into that neuron to get an output and so when you use this notation n of x python will use call 01:45:44.960 |
so currently call just returns 0.0 now we'd like to actually do the forward pass of this neuron 01:45:52.960 |
instead so we're going to do here first is we need to basically multiply all of the elements of 01:45:59.920 |
w with all of the elements of x pairwise we need to multiply them so the first thing we're going 01:46:05.200 |
to do is we're going to zip up self.w and x and in python zip takes two iterators and it creates 01:46:13.280 |
a new iterator that iterates over the tuples of their corresponding entries so for example just 01:46:18.960 |
to show you we can print this list and still return 0.0 here so we see that 01:46:34.880 |
these w's are paired up with the x's w with x and now what we want to do is 01:46:42.640 |
for wi xi in this zip we want to multiply wi times xi and then we want to sum all of that together 01:46:56.960 |
to come up with an activation and add also self.b on top so that's the raw activation and then 01:47:04.160 |
of course we need to pass that through a non-linearity so what we're going to be returning 01:47:07.760 |
is act dot 10h and here's out so now we see that we are getting some outputs and we get a different 01:47:16.080 |
output from a neuron each time because we are initializing different weights and biases and 01:47:21.280 |
then to be a bit more efficient here actually sum by the way takes a second optional parameter 01:47:27.200 |
which is the start and by default the start is zero so these elements of this sum will be added 01:47:34.080 |
on top of zero to begin with but actually we can just start with self.b and then we just have an 01:47:39.280 |
expression like this and then the generator expression here must be parenthesized in python 01:47:49.360 |
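Putting the pieces together, the Neuron class built up in this stretch looks roughly like this (a reconstruction following the narration, not necessarily the exact notebook code):

```python
import random

class Neuron:
    def __init__(self, nin):
        # one random weight per input, plus a bias controlling the trigger happiness
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # w . x + b, then squash through the tanh non-linearity
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()
```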
there we go yep so now we can forward a single neuron next up we're going to define a layer of 01:47:58.400 |
neurons so here we have a schematic for an MLP so we see that in these MLPs each layer this is one layer 01:48:06.320 |
has actually a number of neurons and they're not connected to each other but all of them are fully 01:48:09.760 |
connected to the input so what is a layer of neurons it's just it's just a set of neurons 01:48:14.560 |
evaluated independently so in the interest of time i'm going to do something fairly straightforward 01:48:21.360 |
here it's literally a layer is just a list of neurons and then how many neurons do we have 01:48:30.640 |
we take that as an input argument here how many neurons do you want in your layer number of outputs 01:48:35.120 |
in this layer and so we just initialize completely independent neurons with this given dimensionality 01:48:41.280 |
and when we call on it we just independently evaluate them so now instead of a neuron we can 01:48:48.000 |
make a layer of neurons they are two-dimensional neurons and let's have three of them and now we 01:48:52.720 |
see that we have three independent evaluations of three different neurons right okay and finally 01:48:59.680 |
let's complete this picture and define an entire multi-layer perceptron or MLP and as we can see 01:49:05.200 |
here in an MLP these layers just feed into each other sequentially so let's come here and i'm 01:49:10.480 |
just going to copy the code here in interest of time so an MLP is very similar we're taking the 01:49:17.200 |
number of inputs as before but now instead of taking a single n out which is the number of 01:49:22.240 |
neurons in a single layer we're going to take a list of n outs and this list defines the sizes 01:49:27.440 |
of all the layers that we want in our MLP so here we just put them all together and then iterate 01:49:32.720 |
over consecutive pairs of these sizes and create layer objects for them and then in the call 01:49:38.560 |
function we are just calling them sequentially so that's an MLP really and let's actually 01:49:43.520 |
re-implement this picture so we want three input neurons and then two layers of four and an output 01:49:48.320 |
unit so we want a three-dimensional input say this is an example input we want three inputs into two 01:49:57.680 |
layers of four and one output and this of course is an MLP and there we go that's a forward pass 01:50:05.360 |
of an MLP to make this a little bit nicer you see how we have just a single element but it's wrapped 01:50:10.320 |
in a list because layer always returns lists so for convenience return outs at zero if len outs 01:50:18.320 |
is exactly a single element else return the full list and this will allow us to just get a single value 01:50:24.560 |
out at the last layer that only has a single neuron and finally we should be able to draw dot 01:50:30.000 |
of n of x and as you might imagine these expressions are now getting 01:50:35.440 |
relatively involved so this is an entire MLP that we're defining now 01:50:40.880 |
all the way until a single output okay and so obviously you would never differentiate on pen 01:50:51.920 |
and paper these expressions but with micrograd we will be able to back propagate all the way 01:50:56.720 |
through this and back propagate into these weights of all these neurons so let's see how that works 01:51:04.080 |
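And the Layer and MLP classes from this stretch, again reconstructed from the narration and assuming the Neuron class above:

```python
class Layer:
    def __init__(self, nin, nout):
        # nout independent neurons, each taking nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        # e.g. MLP(3, [4, 4, 1]): 3 inputs, two hidden layers of 4, one output
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# usage matching the lecture's example
x = [2.0, 3.0, -1.0]
n = MLP(3, [4, 4, 1])
n(x)   # a single Value, the output of the whole network
```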
okay so let's create ourselves a very simple example data set here so this data set has four 01:51:09.920 |
examples and so we have four possible inputs into the neural net and we have four desired targets 01:51:17.440 |
so we'd like the neural net to assign or output 1.0 when it's fed this example negative one when 01:51:24.880 |
it's fed these examples and one when it's fed this example so it's a very simple binary classifier 01:51:29.600 |
neural net basically that we would like here now let's think what the neural net currently thinks 01:51:34.320 |
about these four examples we can just get their predictions basically we can just call n of x 01:51:40.160 |
for x and x's and then we can print so these are the outputs of the neural net on those four 01:51:47.680 |
examples so the first one is 0.91 but we'd like it to be 1 so we should push this one higher this 01:51:55.920 |
one we want to be higher this one says 0.88 and we want this to be negative 1 this is 0.88 we 01:52:03.600 |
want it to be negative 1 and this one is 0.88 we want it to be 1 so how do we make the neural net 01:52:10.000 |
and how do we tune the weights to better predict the desired targets and the trick used in deep 01:52:18.080 |
learning to achieve this is to calculate a single number that somehow measures the total performance 01:52:24.000 |
of your neural net and we call this single number the loss so the loss first is a single number that 01:52:32.320 |
we're going to define that basically measures how well the neural net is performing right now we 01:52:36.480 |
have the intuitive sense that it's not performing very well because we're not very much close to 01:52:40.080 |
this so the loss will be high and we'll want to minimize the loss so in particular in this case 01:52:45.760 |
what we're going to do is we're going to implement the mean squared error loss so what this is doing 01:52:50.480 |
is we're going to basically iterate for y ground truth and y output in zip of ys and ypred 01:53:01.120 |
so we're going to pair up the ground truths with the predictions and the zip iterates over 01:53:07.360 |
tuples of them and for each y ground truth and y output we're going to subtract them 01:53:14.240 |
and square them so let's first see what these losses are these are individual loss components 01:53:21.680 |
and so basically for each one of the four we are taking the prediction and the ground truth 01:53:29.440 |
we are subtracting them and squaring them so because this one is so close to its target 01:53:36.240 |
0.91 is almost one subtracting them gives a very small number so here we would get like a negative 01:53:43.360 |
0.1 and then squaring it just makes sure that regardless of whether we are more negative or 01:53:50.160 |
more positive we always get a positive number instead of squaring we could also take 01:53:56.000 |
for example the absolute value we need to discard the sign and so you see that the expression is 01:54:00.880 |
arranged so that you only get zero exactly when y out is equal to y ground truth when those two are 01:54:07.120 |
equal so your prediction is exactly the target you are going to get zero and if your prediction is 01:54:11.920 |
not the target you are going to get some other number so here for example we are way off and so 01:54:17.200 |
that's why the loss is quite high and the more off we are the greater the loss will be so we don't 01:54:25.120 |
want high loss we want low loss and so the final loss here will be just the sum of all of these 01:54:32.720 |
numbers so you see that this should be zero roughly plus zero roughly but plus seven so 01:54:41.200 |
loss should be about seven here and now we want to minimize the loss we want the loss to be low 01:54:49.280 |
because if loss is low then every one of the predictions is equal to its target so the loss 01:54:57.120 |
the lowest it can be is zero and the greater it is the worse off the neural net is predicting 01:55:02.560 |
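The dataset and loss just described, written out (values reconstructed from the lecture's example; note that sum starts from 0, so Value also needs a small __radd__ fallback analogous to __rmul__):

```python
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]   # desired targets

ypred = [n(x) for x in xs]    # current predictions of the net
# squared error per example, summed into a single loss Value
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
```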
so now of course if we do loss.backward something magical happened when i hit enter 01:55:09.760 |
and the magical thing of course that happened is that we can look at n.layers.neuron 01:55:15.280 |
and the layers at say like the first layer and then the neurons at zero 01:55:20.000 |
because remember that mlp has the layers which is a list and each layer has neurons which is a list 01:55:28.720 |
and that gives us an individual neuron and then it's got some weights 01:55:32.000 |
and so we can for example look at the weights at zero 01:55:40.160 |
oops it's not called weights it's called w and that's a value but now this value also has a grad 01:55:48.000 |
because of the backward pass and so we see that because this gradient here on this particular 01:55:54.240 |
weight of this particular neuron of this particular layer is negative we see that 01:55:58.480 |
its influence on the loss is also negative so slightly increasing this particular weight 01:56:03.760 |
of this neuron of this layer would make the loss go down and we actually have this information for 01:56:10.400 |
every single one of our neurons and all their parameters actually it's worth looking at 01:56:14.640 |
also the draw dot of loss by the way so previously we looked at the draw dot of a single neuron 01:56:20.720 |
forward pass and that was already a large expression but what is this expression we actually 01:56:25.920 |
forwarded every one of those four examples and then we have the loss on top of them with the 01:56:31.360 |
mean squared error and so this is a really massive graph because this graph that we've built up now 01:56:38.400 |
oh my gosh this graph that we've built up now which is kind of excessive it's excessive because 01:56:45.440 |
it has four forward passes of a neural net for every one of the examples and then it has the 01:56:50.080 |
loss on top and it ends with the value of the loss which was 7.12 and this loss will now back 01:56:56.240 |
propagate through all the four forward passes all the way through just every single intermediate 01:57:01.520 |
value of the neural net all the way back to of course the parameters of the weights which are 01:57:06.640 |
the input so these weight parameters here are inputs to this neural net and these numbers here 01:57:14.320 |
these scalars are inputs to the neural net so if we went around here we will probably find 01:57:20.640 |
some of these examples this 1.0 potentially maybe this 1.0 or you know some of the others and you'll 01:57:26.480 |
see that they all have gradients as well the thing is these gradients on the input data are not that 01:57:32.160 |
useful to us and that's because the input data seems to be not changeable it's it's a given to 01:57:39.040 |
the problem and so it's a fixed input we're not going to be changing it or messing with it even 01:57:43.120 |
though we do have gradients for it but some of these gradients here will be for the neural network 01:57:50.320 |
parameters the w's and the b's and those of course we do want to change okay so now we're going 01:57:56.640 |
to want some convenience code to gather up all of the parameters of the neural net so that we can 01:58:01.920 |
operate on all of them simultaneously and every one of them we will nudge a tiny amount based on 01:58:09.040 |
the gradient information so let's collect the parameters of the neural net all in one array 01:58:14.800 |
so let's create a parameters of self that just returns self.w which is a list 01:58:21.360 |
concatenated with a list of self.b so this will just return a list list plus list just 01:58:30.640 |
you know gives you a list so that's parameters of neuron and i'm calling it this way because 01:58:36.080 |
also pytorch has a parameters on every single nn module and it does exactly what we're doing here 01:58:42.240 |
it just returns the parameter tensors for us it's the parameter scalars now layer is also a module 01:58:50.320 |
so it will have parameters self and basically what we want to do here is something like this like 01:58:57.680 |
params is here and then for neuron in self.neurons we want to get neuron.parameters 01:59:10.560 |
and we want to params.extend right so these are the parameters of this neuron and then we want 01:59:17.280 |
to put them on top of params so params.extend of ps and then we want to return params 01:59:23.920 |
so this there's way too much code so actually there's a way to simplify this which is return 01:59:33.920 |
p for neuron in self.neurons for p in neuron.parameters 01:59:44.000 |
so it's a single list comprehension in python you can sort of nest them like this and you can 01:59:50.400 |
then create the desired array so these are identical we can take this out 02:00:01.440 |
def parameters self and return p for layer in self.layers for p in layer.parameters 02:00:17.440 |
and that should be good now let me pop out this so we don't re-initialize our network 02:00:31.680 |
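Written out, the parameters methods just described look roughly like this; only the new methods are shown, with __init__ and __call__ as built earlier in the lecture:

```python
class Neuron:
    # ... __init__ and __call__ as built earlier ...
    def parameters(self):
        # list + list concatenation: all the weights plus the bias
        return self.w + [self.b]

class Layer:
    # ... __init__ and __call__ as built earlier ...
    def parameters(self):
        # nested list comprehension: flatten the parameters of every neuron
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    # ... __init__ and __call__ as built earlier ...
    def parameters(self):
        # flatten the parameters of every layer
        return [p for layer in self.layers for p in layer.parameters()]
```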
okay so unfortunately we will have to probably re-initialize the network because 02:00:39.120 |
we just added this functionality to the class and of course i want to get 02:00:43.840 |
all the n.parameters but that's not going to work because this is the old class 02:00:48.240 |
okay so unfortunately we do have to re-initialize the network which will change some of the numbers 02:00:55.680 |
but let me do that so that we pick up the new api we can now do n.parameters 02:00:59.280 |
and these are all the weights and biases inside the entire neural net so in total this mlp has 41 02:01:08.720 |
parameters and now we'll be able to change them if we recalculate the loss here we see that 02:01:18.560 |
unfortunately we have slightly different predictions and slightly different loss 02:01:24.880 |
um but that's okay okay so we see that this neuron's gradient is slightly negative we can 02:01:33.200 |
also look at its data right now which is 0.85 so this is the current value of this neuron and this 02:01:39.840 |
is its gradient on the loss so what we want to do now is we want to iterate for every p in 02:01:47.040 |
n.parameters so for all the 41 parameters of this neural net we actually want to change 02:01:53.360 |
p.data slightly according to the gradient information okay so dot dot dot to do here 02:02:01.360 |
but this will be basically a tiny update in this gradient descent scheme and gradient descent 02:02:09.520 |
we are thinking of the gradient as a vector pointing in the direction of increased loss 02:02:19.120 |
and so in gradient descent we are modifying p.data by a small step size in the direction 02:02:26.480 |
of the gradient so the step size as an example could be like a very small number like 0.01 is 02:02:30.960 |
the step size times p.grad right but we have to think through some of the signs here so 02:02:39.120 |
in particular working with this specific example here we see that if we just left it like this 02:02:47.040 |
then this neuron's value would be currently increased by a tiny amount of the gradient 02:02:52.560 |
the gradient is negative so this value of this neuron would go slightly down it would become 02:02:58.480 |
like 0.84 or something like that but if this neuron's value goes lower that would actually 02:03:07.280 |
increase the loss that's because the derivative of this neuron is negative so increasing 02:03:16.560 |
this makes the loss go down so increasing it is what we want to do instead of decreasing it 02:03:22.880 |
so basically what we're missing here is we're actually missing a negative sign 02:03:25.440 |
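In code, the update being described is just the following; 0.01 is the illustrative step size mentioned above, and the leading minus sign is exactly the point being made:

```python
step_size = 0.01
for p in n.parameters():
    # step against the gradient so that the loss decreases
    p.data += -step_size * p.grad
```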
and again this other interpretation and that's because we want to minimize the loss we don't 02:03:30.880 |
want to maximize the loss we want to decrease it and the other interpretation as i mentioned 02:03:35.600 |
is you can think of the gradient vector so basically just the vector of all the gradients 02:03:40.400 |
as pointing in the direction of increasing the loss but then we want to decrease it so we actually 02:03:46.800 |
want to go in the opposite direction and so you can convince yourself that this sort of like does 02:03:51.360 |
the right thing here with the negative because we want to minimize the loss so if we nudge all the 02:03:56.480 |
parameters by a tiny amount then we'll see that this data will have changed a little bit so now 02:04:04.640 |
this neuron is a tiny amount greater value so 0.854 went to 0.857 and that's a good thing because 02:04:15.600 |
slightly increasing this neuron data makes the loss go down according to the gradient and so 02:04:23.120 |
the correction has happened sign-wise and so now what we would expect of course is that because 02:04:29.360 |
we've changed all these parameters we expect that the loss should have gone down a bit so we want to 02:04:35.680 |
re-evaluate the loss let me basically this is just a data definition that hasn't changed but the 02:04:42.960 |
forward pass here of the network we can recalculate and actually let me do it outside here so that we 02:04:51.920 |
can compare the two loss values so here if i recalculate the loss we'd expect the new loss now 02:04:59.280 |
to be slightly lower than this number so hopefully what we're getting now is a tiny bit lower than 02:05:04.240 |
4.36 okay and remember the way we've arranged this is that low loss means that our predictions 02:05:13.440 |
are matching the targets so our predictions now are probably slightly closer to the targets 02:05:20.240 |
and now all we have to do is we have to iterate this process 02:05:23.440 |
so again we've done the forward pass and this is the loss now we can do loss.backward 02:05:29.280 |
let me take these out and we can do a step size and now we should have a slightly lower loss 02:05:36.320 |
4.36 goes to 3.9 and okay so we've done the forward pass here's the backward pass 02:05:47.520 |
3.47 and you get the idea we just continue doing this and this is gradient descent we're just 02:05:57.520 |
iteratively doing forward pass backward pass update forward pass backward pass update and 02:06:02.880 |
the neural net is improving its predictions so here if we look at y pred now y pred 02:06:12.720 |
we see that this value should be getting closer to one so this value should be getting more positive 02:06:18.400 |
these should be getting more negative and this one should be also getting more positive 02:06:24.640 |
actually we may be able to afford to go a bit faster let's try a slightly higher learning rate 02:06:39.200 |
if you go too fast by the way if you try to make it too big of a step you may actually overstep 02:06:44.400 |
out of overconfidence because again remember we don't actually know the loss function exactly 02:06:51.280 |
the loss function has all kinds of structure and we only know about the very local dependence of 02:06:56.320 |
all these parameters on the loss but if we step too far we may step into you know a part of the 02:07:01.520 |
loss that is completely different and that can destabilize training and make your loss actually 02:07:05.840 |
blow up even so the loss is now 0.04 so actually the predictions should be really quite close let's 02:07:13.600 |
take a look so you see how this is almost one almost negative one almost one we can continue 02:07:20.080 |
going so yep backward update oops there we go so we went way too fast and we actually overstepped 02:07:31.440 |
so we got too eager where are we now oops okay 7e-9 so this is very very low loss 02:07:40.080 |
and the predictions are basically perfect so somehow we basically we were doing way too big 02:07:48.880 |
updates and we briefly exploded but then somehow we ended up getting into a really good spot 02:07:52.880 |
so usually this learning rate and the tuning of it is a subtle art you want to set your learning 02:07:58.640 |
rate if it's too low you're going to take way too long to converge but if it's too high the 02:08:03.280 |
whole thing gets unstable and you might actually even explode the loss depending on your loss 02:08:07.760 |
function so finding the step size to be just right is a pretty subtle art sometimes 02:08:13.040 |
when you're using sort of vanilla gradient descent but we happen to get into a good spot 02:08:17.680 |
we can look at n dot parameters so this is the setting of weights and biases that makes our 02:08:27.520 |
network predict the desired targets very very close and basically we've successfully trained 02:08:37.120 |
a neural net okay let's make this a tiny bit more respectable and implement an actual training loop 02:08:42.240 |
and what that looks like so this is the data definition that stays this is the forward pass 02:08:48.320 |
so for k in range you know we're going to take a bunch of steps 02:08:55.040 |
first you do the forward pass then we evaluate the loss 02:09:01.040 |
let's reinitialize the neural net from scratch and here's the data 02:09:06.960 |
and we first do forward pass then we do the backward pass 02:09:12.480 |
and then we do an update that's gradient descent 02:09:21.680 |
and then we should be able to iterate this and we should be able to print the current step 02:09:29.600 |
the current loss let's just print the sort of number of the loss and that should be it 02:09:40.480 |
and then the learning rate 0.01 is a little too small 0.1 we saw is like a little bit dangerous 02:09:44.960 |
so let's go somewhere in between and we'll optimize this for not 10 steps but let's go for 02:09:51.680 |
say 20 steps let me erase all of this junk and let's run the optimization 02:10:00.800 |
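As a sketch, the loop being assembled here looks roughly like this; the 0.05 step size is an assumed in-between value, and note that, as discussed just below, this version is still missing the zero-grad step:

```python
for k in range(20):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # backward pass
    loss.backward()

    # update (gradient descent)
    for p in n.parameters():
        p.data += -0.05 * p.grad

    print(k, loss.data)
```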
and you see how we've actually converged slower in a more controlled manner and got to a loss that 02:10:09.280 |
is very low so i expect y_pred to be quite good there we go 02:10:16.240 |
and that's it okay so this is kind of embarrassing but we actually have a really terrible bug 02:10:27.600 |
in here and it's a subtle bug and it's a very common bug and i can't believe i've done it for 02:10:33.920 |
the 20th time in my life especially on camera and i could have re-shot the whole thing but i think 02:10:39.600 |
it's pretty funny and you know you get to appreciate a bit what working with neural 02:10:44.720 |
nets maybe is like sometimes we are guilty of a common bug i've actually tweeted the most common 02:10:53.520 |
neural net mistakes a long time ago now and i'm not really gonna explain any of these except for 02:11:01.520 |
we are guilty of number three you forgot to zero grad before dot backward what is that 02:11:06.800 |
basically what's happening and it's a subtle bug and i'm not sure if you saw it 02:11:12.160 |
is that all of these weights here have a dot data and a dot grad and the dot grad starts at zero 02:11:21.120 |
and then we do backward and we fill in the gradients and then we do an update on the data 02:11:27.520 |
but we don't flush the grad it stays there so when we do the second forward pass and we do 02:11:34.400 |
backward again remember that all the backward operations do a plus equals on the grad and so 02:11:39.760 |
these gradients just add up and they never get reset to zero so basically we didn't zero grad 02:11:47.280 |
so here's how we zero grad before backward we need to iterate over all the parameters 02:11:54.160 |
and we need to make sure that p dot grad is set to zero we need to reset it to zero just like it is 02:12:01.040 |
in the constructor so remember all the way here for all these value nodes grad is reset to zero 02:12:07.360 |
and then all these backward passes do a plus equals from that grad but we need to make sure 02:12:12.240 |
that we reset these grads to zero so that when we do backward all of them start at zero and the 02:12:18.640 |
actual backward pass accumulates the loss derivatives into the grads so this is zero grad 02:12:26.960 |
in pytorch and we will get a slightly different optimization let's reset the 02:12:33.760 |
neural net the data is the same this is now i think correct and we get 02:12:40.240 |
a much slower descent we still end up with pretty good results and we can continue this 02:12:47.280 |
a bit more to get down lower and lower and lower yeah so the only reason that the previous thing 02:12:57.600 |
worked even though it's extremely buggy is that this is a very very simple problem 02:13:04.880 |
and it's very easy for this neural net to fit this data and so the grads ended up accumulating 02:13:12.000 |
and it effectively gave us a massive step size and it made us converge extremely fast 02:13:16.880 |
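For completeness, a sketch of the corrected loop with the zero-grad step in place (the 0.05 step size is again just an assumed value):

```python
for k in range(20):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # zero grad: reset accumulated gradients before backward,
    # because every backward pass does += on p.grad
    for p in n.parameters():
        p.grad = 0.0

    # backward pass
    loss.backward()

    # update
    for p in n.parameters():
        p.data += -0.05 * p.grad

    print(k, loss.data)
```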
but basically now we have to do more steps to get to very low values of loss 02:13:23.600 |
and get y pred to be really good we can try to take a slightly bigger step 02:13:29.040 |
yeah we're gonna get closer and closer to one minus one and one so working with neural nets 02:13:40.640 |
is sometimes tricky because you may have lots of bugs in the code and your network might actually 02:13:49.280 |
work just like ours worked but chances are that if we had a more complex problem then actually 02:13:55.200 |
this bug would have made us not optimize the loss very well and we were only able to get away with 02:13:59.520 |
it because the problem is very simple so let's now bring everything together and summarize what 02:14:05.680 |
we learned what are neural nets neural nets are these mathematical expressions fairly simple 02:14:12.000 |
mathematical expressions in the case of a multilayer perceptron that take the data as input and they 02:14:18.320 |
also take as input the weights the parameters of the neural net it's a mathematical expression for the forward 02:14:23.360 |
pass followed by a loss function and the loss function tries to measure the accuracy of the 02:14:28.320 |
predictions and usually the loss will be low when your predictions are matching your targets or 02:14:33.440 |
when the network is basically behaving well so we manipulate the loss function so that when 02:14:38.960 |
the loss is low the network is doing what you want it to do on your problem and then we backward the 02:14:45.440 |
loss use back propagation to get the gradient and then we know how to tune all the parameters to 02:14:50.800 |
decrease the loss locally but then we have to iterate that process many times in what's called 02:14:55.200 |
the gradient descent so we simply follow the gradient information and that minimizes the loss 02:15:00.960 |
and the loss is arranged so that when the loss is minimized the network is doing what you want it to 02:15:05.120 |
do and yeah so we just have a blob of neural stuff and we can make it do arbitrary things 02:15:12.640 |
and that's what gives neural nets their power it's you know this is a very tiny network with 02:15:17.120 |
41 parameters but you can build significantly more complicated neural nets with billions at 02:15:24.160 |
this point almost trillions of parameters and it's a massive blob of neural tissue simulated neural 02:15:29.840 |
tissue roughly speaking and you can make it do extremely complex problems and these neural nets 02:15:36.560 |
then have all kinds of very fascinating emergent properties when you try to make them do 02:15:42.800 |
significantly hard problems as in the case of gpt for example we have massive amounts of text from 02:15:49.680 |
the internet and we're trying to get a neural net to take like a few words and try to 02:15:54.400 |
predict the next word in a sequence that's the learning problem and it turns out that when you 02:15:58.480 |
train this on all of internet the neural net actually has like really remarkable emergent 02:16:02.880 |
properties but that neural net would have hundreds of billions of parameters but it works on 02:16:08.400 |
fundamentally the exact same principles the neural net of course will be a bit more complex but 02:16:13.360 |
otherwise evaluating the gradient is there and will be identical and the gradient descent would 02:16:13.360 |
be there and would be basically identical but people usually use slightly different updates 02:16:24.720 |
this is a very simple stochastic gradient descent update and the loss function would not be a mean 02:16:30.640 |
squared error they would be using something called the cross entropy loss for predicting the next 02:16:34.880 |
token so there's a few more details but fundamentally the neural network setup and 02:16:38.560 |
neural network training is identical and pervasive and now you understand intuitively how that works 02:16:44.800 |
under the hood in the beginning of this video i told you that by the end of it you would understand 02:16:48.800 |
everything in micrograd and then we would slowly build it up let me briefly prove that to you 02:16:54.000 |
so i'm going to step through all the code that is in micrograd as of today actually potentially some 02:16:58.640 |
of the code will change by the time you watch this video because i intend to continue developing 02:17:02.160 |
micrograd but let's look at what we have so far at least __init__.py is empty when you go to 02:17:07.680 |
engine.py that has the value class everything here you should mostly recognize so we have the .data and .grad 02:17:13.600 |
attributes we have the backward function we have the previous set of children and the operation 02:17:18.320 |
that produced this value we have addition multiplication and raising to a scalar power 02:17:24.320 |
we have the relu non-linearity which is slightly different type of non-linearity than tanh that 02:17:29.200 |
we used in this video both of them are non-linearities and notably tanh is not actually 02:17:34.160 |
present in micrograd as of right now but i intend to add it later we have the backward which is 02:17:39.680 |
identical and then all of these other operations which are built up on top of operations here 02:17:45.440 |
so value should be very recognizable except for the non-linearity used in this video 02:17:49.040 |
there's no massive difference between relu and tanh and sigmoid and these other non-linearities 02:17:55.200 |
they're all roughly equivalent and can be used in MLPs so i use tanh because it's a bit smoother 02:18:00.240 |
and because it's a little bit more complicated than relu and therefore it stressed a little 02:18:04.480 |
bit more the local gradients and working with those derivatives which i thought would be useful 02:18:09.440 |
and then nn.py is the neural networks library as i mentioned so you should recognize 02:18:15.120 |
identical implementation of neuron layer and MLP notably or not so much we have a class module here 02:18:21.840 |
it's a parent class of all these modules i did that because there's an nn.module class 02:18:27.120 |
in pytorch and so this exactly matches that api and nn.module in pytorch has also a zero grad 02:18:32.720 |
which i refactored out here so that's the end of micrograd really then there's a test which you'll 02:18:40.720 |
see basically creates two chunks of code one in micrograd and one in pytorch and we'll make sure 02:18:47.520 |
that the forward and the backward passes agree identically for a slightly less complicated 02:18:51.920 |
expression a slightly more complicated expression everything agrees so we agree with pytorch and all 02:18:57.360 |
of these operations and finally there's a demo.ipynb here and it's a bit more complicated 02:19:02.880 |
binary classification demo than the one i covered in this lecture so we only had a tiny data set 02:19:07.680 |
of four examples here we have a bit more complicated example with lots of blue points and lots of red 02:19:13.680 |
points and we're trying to again build a binary classifier to distinguish two-dimensional points 02:19:18.800 |
as red or blue it's a bit more complicated MLP here it's a bigger MLP and the loss is a bit 02:19:25.440 |
more complicated because it supports batches so because our data set was so tiny we always did a 02:19:31.680 |
forward pass on the entire data set of four examples but when your data set is like a million 02:19:36.400 |
examples what we usually do in practice is we basically pick out some random subset we call 02:19:42.000 |
that a batch and then we only process the batch forward backward and update so we don't have to 02:19:47.120 |
forward the entire training set so this supports batching because there's a lot more examples here 02:19:52.640 |
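A sketch of the kind of batching being described, assuming the dataset lives in numpy arrays X and y; the names and the batch size are illustrative and the demo notebook's exact code may differ:

```python
import numpy as np

batch_size = 32
# pick a random subset of row indices and slice out that mini-batch
ri = np.random.permutation(X.shape[0])[:batch_size]
Xb, yb = X[ri], y[ri]
# the forward pass, backward pass, and update then run only on Xb, yb
```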
we do a forward pass the loss is slightly different this is a max margin loss that i 02:19:58.720 |
implement here the one that we used was the mean squared error loss because it's the simplest one 02:20:04.560 |
there's also the binary cross-entropy loss all of them can be used for binary classification 02:20:09.040 |
and don't make too much of a difference in the simple examples that we looked at so far 02:20:12.720 |
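For intuition, a hinge-style max-margin loss for labels in {-1, +1} can be sketched like this, using the relu op from engine.py; the scores and yb names are assumptions and the demo's exact expression may differ:

```python
# zero loss when a score has the right sign with margin >= 1,
# and a linearly growing loss the more the margin is violated
losses = [(1 + -yi * scorei).relu() for yi, scorei in zip(yb, scores)]
data_loss = sum(losses) * (1.0 / len(losses))
```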
there's something called L2 regularization used here this has to do with generalization of the 02:20:18.640 |
neural net and controls the overfitting in machine learning setting but i did not cover these concepts 02:20:23.920 |
in this video potentially later and the training loop you should recognize so forward 02:20:29.520 |
backward with zero grad and update and so on you'll notice that in the update here the learning rate 02:20:36.720 |
is scaled as a function of number of iterations and it shrinks and this is something called 02:20:42.720 |
learning rate decay so in the beginning you have a high learning rate and as the network 02:20:47.040 |
sort of stabilizes near the end you bring down the learning rate to capture the fine details at the end 02:20:52.000 |
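One simple way to write such a schedule, just to show the shape of the idea (the exact constants in the demo may differ):

```python
num_steps = 100
for k in range(num_steps):
    # ... forward pass, zero grad, backward pass ...

    # learning rate decay: start large and shrink linearly with the step number
    learning_rate = 1.0 - 0.9 * k / num_steps
    for p in model.parameters():
        p.data -= learning_rate * p.grad
```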
and at the end we see the decision surface of the neural net and we see that it learns to 02:20:57.440 |
separate out the red and the blue area based on the data points so that's the slightly more 02:21:02.880 |
complicated example in the demo notebook demo.ipynb that you're free to go over but yeah as of today 02:21:08.720 |
that is micrograd i also wanted to show you a little bit of real stuff so that you get to see 02:21:13.040 |
how this is actually implemented in a production grade library like pytorch so in particular i 02:21:17.760 |
wanted to find and show you the backward pass for tanh in pytorch so here in 02:21:23.520 |
micrograd we see that the backward pass for tanh is 1 minus t squared where t is the output of the tanh 02:21:30.560 |
of x times out.grad which is the chain rule so we're looking for something that looks like this 02:21:37.360 |
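As a reminder, this is roughly the tanh method on Value built earlier in the lecture; the (1 - t**2) * out.grad line in its backward is what we are hunting for in the PyTorch codebase:

```python
import math

def tanh(self):
    x = self.data
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        # local derivative of tanh is (1 - t**2), chained with out.grad
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward

    return out
```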
now i went to pytorch which has an open source github codebase and i looked through a lot of 02:21:45.760 |
its code and honestly i spent about 15 minutes and i couldn't find tanh and that's because these 02:21:52.800 |
libraries unfortunately they grow in size and entropy and if you just search for tanh you get 02:21:58.320 |
apparently 2,800 results in 406 files so i don't know what these files are doing honestly 02:22:05.760 |
and why there are so many mentions of tanh but unfortunately these libraries are quite complex 02:22:11.920 |
they're meant to be used not really inspected eventually i did stumble on someone who tries to 02:22:19.840 |
change the tanh backward code for some reason and someone here pointed to the cpu kernel and the 02:22:24.880 |
cuda kernel for tanh backward so this basically depends on whether you're using pytorch on a cpu device 02:22:31.440 |
or on a gpu which are different devices and i haven't covered this but this is the tanh 02:22:36.080 |
backward kernel for cpu and the reason it's so large is that number one this is like if you're 02:22:45.200 |
using a complex type which we haven't even talked about if you're using a specific data type of 02:22:49.280 |
bfloat16 which we haven't talked about and then if you're not then this is the kernel and deep here 02:22:56.560 |
we see something that resembles our backward pass so they have a times 1 minus b squared so this 02:23:04.720 |
b here must be the output of the tanh and this is the out.grad so here we found it deep inside 02:23:14.080 |
pytorch in this location for some reason inside a binary ops kernel even though tanh is not actually a binary 02:23:19.600 |
op and then this is the gpu kernel we're not complex we're here and here we go with one 02:23:29.280 |
line of code so we did find it but basically unfortunately these code bases are very large and 02:23:35.440 |
micrograd is very very simple but if you actually want to use real stuff 02:23:40.400 |
finding the code for it can actually be difficult i also wanted to show you a little 02:23:45.840 |
example here where pytorch is showing you how you can register a new type of function that you want 02:23:50.560 |
to add to pytorch as a lego building block so here if you want to for example add a legendre polynomial 02:23:57.280 |
3 here's how you could do it you will register it as a class that subclasses torch.autograd.Function 02:24:06.320 |
and then you have to tell pytorch how to forward your new function and how to backward through it 02:24:12.080 |
so as long as you can do the forward pass of this little function piece that you want to add 02:24:16.240 |
and as long as you know the local derivative the local gradients which are implemented in the 02:24:20.480 |
backward pytorch will be able to back propagate through your function and then you can use this 02:24:24.960 |
as a lego block in a larger lego castle of all the different lego blocks that pytorch already has 02:24:30.320 |
and so that's the only thing you have to tell pytorch and everything would just work 02:24:34.240 |
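A sketch of what that registration looks like, along the lines of the Legendre polynomial example PyTorch shows (the code below is illustrative rather than the exact tutorial code):

```python
import torch

class LegendrePolynomial3(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        # stash the input so the backward pass can use it
        ctx.save_for_backward(input)
        # P3(x) = 0.5 * (5x^3 - 3x)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        # local derivative P3'(x) = 1.5 * (5x^2 - 1), chained with grad_output
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)

# usage: apply it like any other op and autograd handles the rest
x = torch.linspace(-1, 1, 5, requires_grad=True)
y = LegendrePolynomial3.apply(x)
y.sum().backward()
```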
and you can register new types of functions in this way following this example and that is 02:24:39.360 |
everything that i wanted to cover in this lecture so i hope you enjoyed building out micrograd with 02:24:43.440 |
me i hope you find it interesting insightful and yeah i will post a lot of the links that are 02:24:50.560 |
related to this video in the video description below i will also probably post a link to a 02:24:55.280 |
discussion forum or discussion group where you can ask questions related to this video and then i can 02:25:00.800 |
answer or someone else can answer your questions and i may also do a follow-up video that answers 02:25:05.760 |
some of the most common questions but for now that's it i hope you enjoyed it if you did then 02:25:11.360 |
please like and subscribe so that youtube knows to feature this video to more people 02:25:22.480 |
now here's the problem we know dl by wait what is the problem 02:25:29.520 |
and that's everything i wanted to cover in this lecture so i hope 02:25:34.960 |
you enjoyed us building up micrograd micrograd 02:25:39.120 |
okay now let's do the exact same thing for multiply because we can't do something like 02:25:45.520 |
a times two whoops i know what happened there