Stanford CS224N NLP with Deep Learning | 2023 | PyTorch Tutorial, Drew Kaul
00:00:05.500 |
And so today I kind of just want to cover the fundamentals of PyTorch, 00:00:10.400 |
really just kind of see what are the similarities between PyTorch and NumPy and Python, 00:00:18.700 |
and see how we can build up a lot of the building blocks that we'll need in order to define more complex models. 00:00:25.500 |
So specifically we're going to talk today about tensors, what are tensor objects, how do we manipulate them, 00:00:32.300 |
what is AutoGrad, how PyTorch helps us compute different gradients, 00:00:37.500 |
and finally how we actually do optimization and how we write the training loop for our neural networks. 00:00:42.500 |
And if we have time at the end, then we'll try and go through a bit of a demo to kind of put everything together 00:00:48.800 |
and see how everything comes together when you want to solve an actual NLP task. 00:00:58.800 |
So if you go to the course website, there is a notebook, 00:01:02.700 |
and you can just make a copy of this Colab notebook and then just run the cells as we go. 00:01:08.800 |
And so to start, today we're talking about PyTorch, like I said. 00:01:13.400 |
It's a deep learning framework that really does two main things. 00:01:16.900 |
One is it makes it very easy to author and manipulate tensors and make use of your GPU 00:01:22.800 |
so that you can actually leverage a lot of that capability. 00:01:25.800 |
And two is it makes the process of authoring neural networks much simpler. 00:01:31.300 |
You can now use different building blocks like linear layers and different loss functions 00:01:36.300 |
and compose them in different ways in order to author the types of models that you need for your specific use cases. 00:01:43.300 |
And so PyTorch is one of the two main frameworks along with TensorFlow. 00:01:48.200 |
In this class, we'll focus on PyTorch, but they're quite similar. 00:01:51.800 |
And so we'll start by importing Torch, and we'll import the neural network module, which is Torch.nn. 00:01:58.800 |
And for this first part of the tutorial, I want to talk a bit about tensors. 00:02:04.300 |
One thing that you guys are all familiar with now is NumPy arrays. 00:02:09.800 |
And so pretty much you can think about tensors as the equivalent in PyTorch to NumPy arrays. 00:02:16.800 |
They're essentially multi-dimensional arrays that you can manipulate in different ways, 00:02:21.800 |
and you'll essentially use them to represent your data, to be able to actually manipulate it, 00:02:28.300 |
and perform all the different matrix operations that underlie your neural network. 00:02:33.800 |
And so in this case, for example, if we're thinking of an image, 00:02:38.800 |
one way you can think about it in terms of a tensor is it's a 256 by 256 tensor, 00:02:44.800 |
where it has a width of 256 pixels and a height of 256 pixels. 00:02:50.300 |
And for instance, if we have a batch of images, and those images contain three channels, like red, green, and blue, 00:02:57.300 |
then we might have a four-dimensional tensor, which is the batch size by the number of channels, by the width and the height. 00:03:04.800 |
And so everything we're going to see today is all going to be represented as tensors, 00:03:08.800 |
which you can just think of as multi-dimensional arrays. 00:03:12.800 |
And so to kind of get some intuition about this, 00:03:15.800 |
we're going to spend a little bit of time going through essentially lists of lists, 00:03:20.800 |
and how we can convert them into tensors, and how we can manipulate them with different operations. 00:03:26.800 |
So to start off with, we just have a simple list of lists that you're all familiar with. 00:03:38.800 |
And so here, the way we'll create this tensor is by doing torch dot tensor, 00:03:44.800 |
and then essentially writing the same syntax that we had before. 00:03:49.800 |
Just write out the list of lists that represents that particular tensor. 00:03:55.800 |
And so in this case, we get back a tensor object, which is the same shape and contains the same data. 00:04:02.800 |
And so now, the second thing with a tensor is that it contains a data type. 00:04:08.800 |
For instance, there are floating point numbers of varying precision that you can use. 00:04:13.800 |
You can have integers. You can have different data types that actually populate your tensor. 00:04:18.800 |
And so by default, I believe this will be float 32, 00:04:21.800 |
but you can explicitly specify which data type your tensor is by passing in the dtype argument. 00:04:28.800 |
And so we see here now, even though we wrote in a bunch of integers, 00:04:33.800 |
they have a decimal point, which indicates that they're floating point numbers. 00:04:39.800 |
We can create another tensor, in this case with data type float 32. 00:04:45.800 |
And in this third example, you see that we create another tensor. 00:04:51.800 |
We don't actually specify the data type, but PyTorch essentially implicitly takes the data type to be floating point, 00:04:59.800 |
since we actually passed in a floating point number into this tensor. 00:05:03.800 |
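As a rough sketch of what those cells might look like — the exact values here are illustrative, not necessarily the ones in the notebook:

    import torch

    # Create a tensor from a list of lists; the dtype is inferred from the data.
    data = torch.tensor([[1, 2], [3, 4]])                         # integer data -> integer dtype

    # Explicitly request 32-bit floats with the dtype argument.
    data_float = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

    # Passing in a floating point number makes the whole tensor floating point.
    data_inferred = torch.tensor([[1, 2], [3.5, 4]])

    print(data.dtype, data_float.dtype, data_inferred.dtype)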
So pretty much at a high level, tensors are like multi-dimensional arrays. 00:05:13.800 |
Okay, so now, great, we know how to create tensors. 00:05:17.800 |
We know that ultimately everything that we work with, all the data we have is going to be expressed as tensors. 00:05:23.800 |
Now the question is, what are the functions that we have to manipulate them? 00:05:27.800 |
And so we have some basic utilities that can help us instantiate tensors easily. 00:05:36.800 |
These are two ways to create tensors of a particular shape, in this case tensors of all zeros or tensors of all ones. 00:05:45.800 |
And you'll see that this will be very helpful when you do your homeworks. 00:05:50.800 |
Typically, you'll just need to create a zero matrix, 00:05:54.800 |
and it'll be very easy to just specify the shape here without having to write everything out super explicitly. 00:06:00.800 |
And then you can update that tensor as needed. 00:06:04.800 |
Another thing you can do is, just like we have ranges in Python, 00:06:09.800 |
so if you want to loop over a bunch of numbers, you can specify a range. 00:06:14.800 |
You can also use torch.arange to actually instantiate a tensor with a particular range. 00:06:22.800 |
In this case, we just looped over the numbers 1 through 10. 00:06:26.800 |
You could reshape this and make it 1 through 5 and then 6 through 10. 00:06:30.800 |
That's another way to be able to instantiate tensors. 00:06:34.800 |
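A quick sketch of those helpers (the shapes and ranges here are just examples):

    import torch

    zeros = torch.zeros(3, 4)                # 3 x 4 tensor of all zeros
    ones = torch.ones(3, 4)                  # 3 x 4 tensor of all ones
    r = torch.arange(1, 11)                  # tensor([1, 2, ..., 10]), like Python's range
    r2 = torch.arange(1, 11).reshape(2, 5)   # reshape into 1-5 and 6-10
    print(zeros.shape, ones.shape, r, r2)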
And finally, something to note is that when we apply particular operations, 00:06:41.800 |
such as just simple Python operations like addition or multiplication, 00:06:46.800 |
by default they're going to be element-wise, so they'll apply to all the elements in our tensor. 00:06:52.800 |
So in this case, we took our tensor, I think this one was probably from earlier above, 00:06:58.800 |
and we added 2 everywhere. Here we multiplied everything by 2. 00:07:04.800 |
But the PyTorch semantics for broadcasting work pretty much the same as the NumPy semantics. 00:07:10.800 |
So if you have different matrix operations where you need to batch across a particular dimension, 00:07:19.800 |
PyTorch will be smart about it and it will actually make sure that you broadcast over the appropriate dimensions. 00:07:25.800 |
Although of course you have to make sure that the shapes are compatible based on the actual broadcasting rules. 00:07:31.800 |
So we'll get to that in a little bit when we look at reshaping and how different operations have those semantics. 00:07:40.800 |
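For instance, a sketch of element-wise operations and broadcasting, with made-up values:

    import torch

    x = torch.tensor([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)

    print(x + 2)        # adds 2 to every element
    print(x * 2)        # multiplies every element by 2

    # Broadcasting: a (2, 3) tensor plus a (3,) tensor adds the row to each row of x.
    row = torch.tensor([10., 20., 30.])
    print(x + row)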
In this case — I'm actually not personally aware of how you would define a jagged tensor that has unequal dimensions. 00:07:50.800 |
But typically we don't want to do that because it makes our computation a lot more complex. 00:07:56.800 |
And so in cases where we have, you know, for instance, we have different sentences that we turn into tokens, 00:08:03.800 |
we might have different length sentences in our training set. 00:08:06.800 |
We'll actually pad all the dimensions to be the same because ultimately we want to do everything with matrix operations. 00:08:13.800 |
And so in order to do that, we need to have a matrix of a fixed shape. 00:08:16.800 |
But yeah, that's a good point. I'm not sure if there is a way to do that, but typically we just get around this by padding. 00:08:25.800 |
Okay, so now we know how to define tensors. We can do some interesting things with them. 00:08:30.800 |
So here we've created two tensors. One of them is a 3 by 2 tensor. 00:08:39.800 |
And I think the answer is written up here, but what do we expect the shape to be when we multiply these two tensors? 00:08:47.800 |
We have a 3 by 2 tensor and a 2 by 4 tensor, so we should get back a 3 by 4 tensor. 00:08:54.800 |
And so more generally, we can use Matmul in order to do matrix multiplication. 00:09:02.800 |
It also implements batched matrix multiplication. 00:09:06.800 |
And so I won't go over the entire review of broadcasting semantics, 00:09:11.800 |
but the main gist is that the dimensions of two tensors are compatible if you can left-pad the shapes with ones 00:09:19.800 |
so that the dimensions that line up either A, have the same number in that dimension, 00:09:25.800 |
or B, one of them is a dummy dimension. One of them has a 1. 00:09:28.800 |
And in that case, in those dummy dimensions, PyTorch will actually make sure to copy over the tensor as many times as needed 00:09:36.800 |
so that you can then actually perform the operation. 00:09:39.800 |
And that's useful when you want to do things like batched dot products or batched matrix multiplications. 00:09:45.800 |
And I guess the final point here is there's also a shorthand notation that you can use. 00:09:52.800 |
So instead of kind of having to type out Matmul every time, you can just use the at operator, similar to NumPy. 00:09:59.800 |
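In code, that might look something like this, using the shapes from the example:

    import torch

    a = torch.ones(3, 2)
    b = torch.ones(2, 4)

    c = torch.matmul(a, b)   # matrix multiply: shape (3, 4)
    d = a @ b                # same thing with the shorthand operator
    print(c.shape, torch.equal(c, d))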
Effectively, that's kind of where we get into how batching works. 00:10:03.800 |
So for example, if you had, let's say, two tensors that have some batch dimension, 00:10:13.800 |
and then one of them is M by 1, and the other one is 1 by N. 00:10:19.800 |
And if you do a batched matrix multiply to those two tensors, 00:10:23.800 |
now what you effectively do is you preserve the batch dimension, 00:10:27.800 |
and then you're doing a matrix multiplication between an M by 1 tensor and a 1 by N. 00:10:32.800 |
So you get something that's the batch dimension by M by N. 00:10:36.800 |
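As a small sketch of that batched case — the batch size and the values of M and N here are made up:

    import torch

    batch, m, n = 8, 5, 7
    a = torch.randn(batch, m, 1)   # shape (8, 5, 1)
    b = torch.randn(batch, 1, n)   # shape (8, 1, 7)

    out = a @ b                    # batched matrix multiply: shape (8, 5, 7)
    print(out.shape)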
So effectively, I think the full semantics are written out on the PyTorch website. 00:10:44.800 |
But you're right, you don't just have these cases where you have two two-dimensional tensors; 00:10:50.800 |
as long as the dimensions match up based on those semantics I was describing, the multiplication will work. 00:10:56.800 |
Alternatively, you can do what I do, which is just multiply it anyways, 00:11:00.800 |
and then if it throws an error, print out the shapes and kind of work from there. 00:11:04.800 |
That tends to be faster in my opinion in a lot of ways. 00:11:10.800 |
All right, so yeah, let's keep going through some of the other different functionalities here. 00:11:20.800 |
And one of the key things that we always want to look at is the shape. 00:11:25.800 |
So in this case, we just have a 1D tensor of length 3, and we can print out its shape. 00:11:33.800 |
In general, this is kind of one of the key debugging steps 00:11:36.800 |
and something that I'll try and emphasize a lot throughout this session, 00:11:40.800 |
which is printing the shapes of all of your tensors is probably your best resource. 00:11:45.800 |
When it comes to debugging, it's kind of one of the hardest things to intuit exactly what's going on 00:11:50.800 |
once you start stacking a lot of different operations together. 00:11:54.800 |
So printing out the shapes at each point and seeing do they match what you expect is something important, 00:12:00.800 |
and it's better to rely on that than just on the error message that PyTorch gives you, 00:12:06.800 |
because under the hood, PyTorch might implement certain optimizations 00:12:10.800 |
and actually reshape the underlying tensor you have. 00:12:26.800 |
And we can have a more complex, in this case, 3-dimensional tensor, 00:12:34.800 |
and we can print out the shape and we can see all of the dimensions here. 00:12:39.800 |
And so now you're like, "Okay, great, we have tensors, we can look at their shapes." 00:12:46.800 |
And so now let's get into kind of what are the operations that we can apply to these tensors. 00:12:52.800 |
And so one of them is it's very easy to reshape tensors. 00:12:58.800 |
So in this case, we're creating a tensor with 15 elements, 00:13:06.800 |
and now we're reshaping it so it's a 5 by 3 tensor here. 00:13:11.800 |
And so you might wonder, "Well, what's the point of that?" 00:13:15.800 |
And it's because a lot of times when we are doing machine learning, 00:13:22.800 |
we might take our data and reshape it 00:13:25.800 |
so that instead of being one long, flattened list of things, it's organized into batches — 00:13:31.800 |
or in some cases we have batches of sentences or sequences, 00:13:39.800 |
and each of the elements in that sequence has an embedding of a particular dimension. 00:13:44.800 |
And so based on the types of operations that you're trying to do, 00:13:48.800 |
you'll sometimes need to reshape those tensors, 00:13:51.800 |
and sometimes you'll want to transpose dimensions in particular 00:13:56.800 |
if you want to, for instance, reorganize your data. 00:14:09.800 |
View will create a view of the underlying tensor, 00:14:11.800 |
and so I think the underlying tensor will still have the same shape. 00:14:22.800 |
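A quick sketch of reshape, view, and transpose (illustrative shapes):

    import torch

    x = torch.arange(1, 16)        # 15 elements: 1 through 15
    y = x.reshape(5, 3)            # now a 5 x 3 tensor
    z = x.view(5, 3)               # a view that shares the same underlying data
    print(x.shape, y.shape, z.shape)

    # Transposing swaps two dimensions, e.g. (5, 3) -> (3, 5).
    print(y.T.shape)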
All right, and then finally, like I said at the beginning, 00:14:25.800 |
your intuition about PyTorch tensors can simply be that they're like NumPy arrays — 00:14:35.800 |
except now we can use them with GPUs, and it's very optimized. 00:14:47.800 |
And if you have some NumPy code and you have a bunch of NumPy arrays, 00:14:50.800 |
you can directly convert them into PyTorch tensors by simply casting them, 00:14:56.800 |
and you can also take those tensors and convert them back to NumPy arrays. 00:15:03.800 |
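A sketch of converting back and forth between NumPy and PyTorch:

    import numpy as np
    import torch

    arr = np.array([[1., 2.], [3., 4.]])

    t = torch.from_numpy(arr)      # NumPy array -> PyTorch tensor (shares memory)
    t2 = torch.tensor(arr)         # or copy the data into a new tensor

    back = t.numpy()               # tensor -> NumPy array
    print(type(t), type(back))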
All right, and so one of the things you might be asking is, well, what's so great about tensors? 00:15:11.800 |
And one of the great things about them is that they support vectorized operations very easily. 00:15:16.800 |
Essentially, we can parallelize a lot of different computations and do them, 00:15:21.800 |
for instance, across a batch of data all at once, 00:15:24.800 |
and one of those operations you might want to do, for instance, is a sum. 00:15:41.800 |
You can take a tensor that's shaped five by seven, 00:15:43.800 |
and now you can compute different operations on it that essentially collapse the dimensionality. 00:15:50.800 |
So the first one is sum, and so you can take it and you can sum across both the rows as well as the columns, 00:15:56.800 |
and so one way I like to think about this to kind of keep them straight is that 00:16:01.800 |
the dimension that you specify in the sum is the dimension you're collapsing. 00:16:06.800 |
So in this case, if you take the data and sum over dimension zero — 00:16:11.800 |
you know the shape of the underlying tensor is five by seven — 00:16:18.800 |
you should be left with something that's just shaped seven. 00:16:21.800 |
And if you see the actual tensor, you've got 75, 80, 85, 90, 00:16:26.800 |
you've got this tensor which is shaped seven. 00:16:29.800 |
Alternatively, you can think about whether or not you're kind of summing across the rows or summing across the columns. 00:16:35.800 |
But it's not just sum, it applies to other operations as well. 00:16:39.800 |
You can compute standard deviations, you can normalize your data, 00:16:43.800 |
you can do other operations which essentially batch across the entire set of data. 00:16:48.800 |
And not only do these apply over one dimension, 00:16:52.800 |
but here you can see that if you don't specify any dimensions, 00:16:55.800 |
then by default the operation actually applies to the entire tensor. 00:16:59.800 |
So here we end up just taking the sum of the entire thing. 00:17:03.800 |
So if you think about it, the zeroth dimension is the number of rows, 00:17:06.800 |
there are five rows and there are seven columns. 00:17:08.800 |
So if we sum out the row dimension, then we're actually summing down each column. 00:17:19.800 |
But I like to think about it more just in terms of the dimensions to keep it straight, 00:17:22.800 |
rather than rows or columns, because it can get confusing. 00:17:27.800 |
If you sum over a given dimension, then effectively you've taken something whose shape is dimension zero by dimension one, and collapsed the dimension you specified. 00:17:36.800 |
And then from there you can kind of figure out, okay, which way did I actually sum, to check if you were right. 00:17:41.800 |
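To make that concrete, here's a sketch with a 5 by 7 tensor — the actual values in the notebook may differ:

    import torch

    data = torch.arange(35, dtype=torch.float32).reshape(5, 7)   # shape (5, 7)

    print(data.sum(dim=0).shape)   # collapse dim 0 (the rows)    -> shape (7,)
    print(data.sum(dim=1).shape)   # collapse dim 1 (the columns) -> shape (5,)
    print(data.sum())              # no dim given -> sum of the entire tensor (a scalar)
    print(data.std(dim=0).shape)   # the same pattern works for std, mean, etc.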
NumPy implements a lot of this vectorization, 00:17:45.800 |
and I believe in the homework that you have right now, 00:17:49.800 |
I think part of your job is to vectorize a lot of these things. 00:17:52.800 |
So the big advantage with PyTorch is that essentially it's optimized to be able to take advantage of your GPU. 00:17:59.800 |
When we actually start building out neural networks that are bigger, that involve more computation, 00:18:04.800 |
we're going to be doing a lot of these matrix multiplication operations, 00:18:08.800 |
and it's going to be a lot better if we can make use of the GPU instead of just the processor. 00:18:13.800 |
And so that's where PyTorch really comes in handy. 00:18:17.800 |
In addition, it also defines a lot of those neural network modules for you, as we'll see later, 00:18:22.800 |
so that now you don't need to worry about, for instance, 00:18:26.800 |
implementing a basic linear layer, backpropagation from scratch, or your optimizer. 00:18:32.800 |
All of those things will be built in and you can just call the respective APIs to make use of them. 00:18:37.800 |
Whereas in Python and NumPy you might have to do a lot of that coding yourself. 00:18:53.800 |
So this is a quiz except I think it tells you the answer so it's not much of a quiz. 00:18:59.800 |
But pretty much, you know, what would you do if now I told you, instead of summing over this tensor, to compute the average? 00:19:10.800 |
And so there's, there's two different ways you could compute the average. 00:19:13.800 |
You could compute the average across the rows or across the columns. 00:19:18.800 |
And so essentially now we get back to this question of, well, 00:19:23.800 |
which dimension am I actually going to reduce over? 00:19:28.800 |
If we want the average of each row, then we need to reduce over the second of the two dimensions — 00:19:32.800 |
there are really just the zeroth and the first. 00:19:36.800 |
So the first dimension is what we want to average over, 00:19:40.800 |
because we want to preserve the zeroth dimension. 00:19:43.800 |
And so that's why for the row average you see the dim equals one. 00:19:47.800 |
And for the column average, the same reasoning is why you see the dim equals zero. 00:19:52.800 |
And so if we run this code, we'll see the shapes that we expect. 00:20:01.800 |
If we average across the rows, then an object that's two by three should just become an object that's shape two. 00:20:06.800 |
It's just one-dimensional — almost a vector, you can think of it. 00:20:10.800 |
And if we are averaging across the columns, there are three columns, so we get something that's shape three. 00:20:26.800 |
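In code, the row and column averages might look like this, with a small 2 by 3 example:

    import torch

    data = torch.tensor([[1., 2., 3.],
                         [4., 5., 6.]])      # shape (2, 3)

    row_avg = data.mean(dim=1)    # collapse dim 1 -> shape (2,): average of each row
    col_avg = data.mean(dim=0)    # collapse dim 0 -> shape (3,): average of each column
    print(row_avg, col_avg)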
The key takeaway, I guess, is this general intuition about how we deal with shapes, 00:20:29.800 |
and how some of these operations manipulate shapes. 00:20:37.800 |
But I think you'll find that the semantics are very similar to NumPy. 00:20:43.800 |
So one of the things that you can do in NumPy is take these NumPy arrays and index them 00:20:51.800 |
in many different ways: you can create copies of them, 00:20:54.800 |
and you can index across particular dimensions to select out different elements. 00:21:02.800 |
And so in this case, let's take this example tensor, 00:21:09.800 |
And the first thing you always want to do when you have a new tensor is 00:21:14.800 |
print out its shape and understand what you're working with. 00:21:21.800 |
I may have shown this already, but what will x bracket zero print out? 00:21:27.800 |
What happens if we index into just the first element? 00:21:38.800 |
Well, our tensor is really just a list of three things. 00:21:41.800 |
Each of those things happens to also be a two by two tensor. 00:21:47.800 |
So we get back the first of those things — in this case, one, two, three, four. 00:21:53.800 |
And if you provide a colon in a particular dimension, 00:21:56.800 |
it means essentially take everything along that dimension. 00:22:03.800 |
So when we index with just x bracket zero, we're essentially putting a colon for all the other dimensions. 00:22:06.800 |
So it's essentially saying, grab the first thing along the zeroth dimension, 00:22:11.800 |
and then grab everything along the other two dimensions. 00:22:18.800 |
Now what if we grab just the zeroth element along the first dimension? 00:22:31.800 |
Remember, the zeroth dimension is where these three things live, 00:22:34.800 |
and the first dimension is now each of these two rows within those things — 00:22:39.800 |
so like one, two and three, four, then five, six and seven, eight, and so on. 00:22:44.800 |
So if we index into the first dimension and get the zeroth element, 00:22:51.800 |
then we're going to end up with one, two, five, six, and nine, ten. 00:23:00.800 |
And to check the shape, you can go back to the trick I mentioned before: 00:23:03.800 |
we're slicing across the first dimension, 00:23:14.800 |
so we collapse that two in the middle, and we're left with something that's three by two. 00:23:17.800 |
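A sketch of that indexing, using the 3 by 2 by 2 tensor from the example:

    import torch

    x = torch.tensor([[[1,  2], [3,  4]],
                      [[5,  6], [7,  8]],
                      [[9, 10], [11, 12]]])   # shape (3, 2, 2)

    print(x[0])         # first block along dim 0: [[1, 2], [3, 4]]
    print(x[0, :, :])   # the same thing, with the colons written out explicitly
    print(x[:, 0])      # zeroth row of each block: [[1, 2], [5, 6], [9, 10]], shape (3, 2)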
So it might seem a little bit trivial kind of going through this in a lot of detail, 00:23:23.800 |
but I think it's important, because it can get tricky when your tensor shapes get more complicated. 00:23:30.800 |
And so I won't go through every example here, since a lot of them reinforce the same thing. 00:23:38.800 |
Just like NumPy, you can choose to get a range of elements. 00:23:43.800 |
In this case, we're taking this new tensor, 00:23:49.800 |
which is one through 15 rearranged as a five by three tensor. 00:23:55.800 |
And if we take the zeroth through the third row, we get back just those rows. 00:24:02.800 |
And we can do the same thing, but now with slicing across multiple dimensions. 00:24:07.800 |
And I think the final point I want to talk about here is list indexing, 00:24:18.800 |
which is a very clever shorthand for being able to essentially select out multiple elements at once. 00:24:28.800 |
So if you want to get the zeroth, the second, and the fourth element of our matrix, 00:24:35.800 |
then instead of indexing with a particular number or a slice, you can just pass in a list of indices. 00:24:47.800 |
And so here, if we take out the zeroth, the second, and the fourth, we get back exactly those rows. 00:24:59.800 |
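A sketch of range slicing and list indexing, using the 5 by 3 tensor from before:

    import torch

    m = torch.arange(1, 16).reshape(5, 3)   # 1 through 15 as a 5 x 3 tensor

    print(m[0:3])             # rows 0, 1, 2
    print(m[0:3, 0:2])        # slicing across multiple dimensions at once
    print(m[[0, 2, 4]])       # list indexing: rows 0, 2, and 4 together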
Yeah, again, these are kind of a lot of examples to just reiterate the same point, 00:25:04.800 |
which is that you can slice across your data in multiple ways, 00:25:08.800 |
and at different points you're going to need to do that. 00:25:11.800 |
So being familiar with the shapes, so that you understand the underlying output you expect, is important. 00:25:18.800 |
In this case, for instance, we're indexing into the first and the second dimensions, 00:25:27.800 |
and so we're going to end up getting essentially the top left element of each of those three things in our tensor. 00:25:35.800 |
If we scroll all the way up here, we'll get this one, we'll get this five, and we'll get this nine, 00:25:41.800 |
because we go across all of the zeroth dimension, 00:25:45.800 |
and then across the first and the second, we only take the zeroth element in both of those positions. 00:26:00.800 |
And also, of course, you can, you know, apply all of the colons to get back the original tensor. 00:26:11.800 |
Okay, and then I think the last thing when it comes to indexing is conversions. 00:26:17.800 |
So typically, when we're writing code with neural networks, ultimately we're going to, you know, 00:26:24.800 |
process some data through a network, and we're going to get a loss. 00:26:27.800 |
And that loss needs to be a scalar, and then we're going to compute gradients with respect to that loss. 00:26:32.800 |
So one thing to keep in mind is that sometimes you might have an operation, 00:26:37.800 |
and it fails because it was actually expecting a scalar value rather than a tensor. 00:26:42.800 |
And so you can extract out the scalar from this single-element tensor by just calling dot item. 00:26:50.800 |
So in this case, you know, if you have a tensor, which is just literally one, 00:26:54.800 |
then you can actually get the Python scalar that corresponds to it by calling dot item. 00:27:00.800 |
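For example, a minimal sketch:

    import torch

    t = torch.tensor([1])
    value = t.item()           # extract the Python scalar from a single-element tensor
    print(value, type(value))  # 1 <class 'int'>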
So now we can get into the more interesting stuff. 00:27:02.800 |
One of the really cool things with PyTorch is Autograd. 00:27:06.800 |
And what Autograd is, is PyTorch essentially provides an automatic differentiation package, 00:27:14.800 |
where when you define your neural network, you're essentially defining many nodes that compute some function. 00:27:23.800 |
And in the forward pass, you're kind of running your data through those nodes. 00:27:27.800 |
But what PyTorch is doing on the back end is that at each of those points, 00:27:31.800 |
it's going to actually store the gradients and accumulate them, 00:27:36.800 |
so that every time you do your backwards pass, 00:27:39.800 |
you apply the chain rule to be able to calculate all these different gradients, 00:27:46.800 |
And then you will have access to all of those gradients to be able to actually then run your favorite optimizer 00:27:52.800 |
and optimize, you know, with SGD or with Adam or whichever optimizer you choose. 00:27:59.800 |
And so that's kind of one of the great features. 00:28:02.800 |
You don't have to worry about actually writing the code that computes all of these gradients 00:28:06.800 |
and actually caches all of them properly, applies the chain rule, does all these steps. 00:28:11.800 |
You can abstract all of that away with just one call to dot backward. 00:28:16.800 |
And so in this case, we'll run through a little bit of an example 00:28:20.800 |
where we'll see the gradients getting computed automatically. 00:28:25.800 |
So in this case, we're going to initialize a tensor with requires_grad set to true. 00:28:35.800 |
It just means that, for a given tensor, 00:28:38.800 |
PyTorch will store the gradient associated with it. 00:28:44.800 |
And you might wonder, well, you know, why do we have this? 00:28:49.800 |
You know, wouldn't we always want to store the gradient? 00:28:51.800 |
And the answer is, at train time, you need the gradients in order to actually train your network. 00:28:56.800 |
But at inference time, you'd actually want to disable your gradients, 00:29:00.800 |
and you can actually do that because it's a lot of extra computation that's not needed 00:29:04.800 |
since you're not making any updates to your network anymore. 00:29:13.800 |
We don't have any gradients yet, because we haven't actually called backward 00:29:18.800 |
on some quantity computed from this particular tensor. 00:29:25.800 |
We haven't actually computed those gradients yet. 00:29:29.800 |
So right now, the dot grad attribute, which will actually store the gradient associated with that tensor, is none. 00:29:36.800 |
And so now let's just define a really simple function. 00:29:39.800 |
We have x. We're going to define the function y equals 3x squared. 00:29:45.800 |
And so now we're going to call y dot backward. 00:29:49.800 |
And so now what happens is, when we actually print out x dot grad, we see 12. 00:29:58.800 |
And the reason is that our function y is 3x squared. 00:30:03.800 |
If we compute the gradient of that function, we're going to get 6x, and our actual value of x was 2, so the gradient is 12. 00:30:15.800 |
And we see that when we print out x dot grad, that's what we get. 00:30:25.800 |
Then we define another function z of x, call z dot backward, and print out x dot grad again. 00:30:29.800 |
And now we see that — oh, I may not have run this in the right order. 00:30:35.800 |
Okay. So here in the second one that I re-ran, we see that it says 24. 00:30:42.800 |
And so you might be wondering, well, I just did the same thing twice. 00:30:47.800 |
And the answer is that by default, PyTorch will accumulate the gradients. 00:30:52.800 |
So it won't actually rewrite the gradient each time you compute it. 00:30:58.800 |
And the reason is because when you actually have backpropagation for your network, 00:31:02.800 |
you want to accumulate the gradients, you know, across all of your examples, 00:31:10.800 |
But this also means that every time you have a training iteration for your network, you have to zero out the gradients, 00:31:17.800 |
because you don't want the previous gradients from the last epoch, 00:31:21.800 |
where you iterated through all of your training data, 00:31:23.800 |
to mess with the current update that you're doing. 00:31:27.800 |
So that's kind of one thing to note, and that's essentially why, when we write the training loop, we will see 00:31:36.800 |
that you have to run zero grad in order to zero out the gradient. 00:31:39.800 |
Yes, so I accidentally ran the cells in the wrong order. 00:31:44.800 |
Maybe to make it more clear, let me put this one first. 00:31:48.800 |
So this is actually what it should look like, 00:31:53.800 |
which is that we ran it once, and I ran this cell first. 00:31:58.800 |
And then we ran it a second time, and we get 24. 00:32:02.800 |
Yes, so if you have all of your tensors defined with gradients enabled, then when you call backward, 00:32:10.800 |
it's going to compute all of those partials, all of those gradients. 00:32:14.800 |
Yes, so what's happening here is just the way PyTorch works: 00:32:23.800 |
we've essentially made two different backward passes. 00:32:28.800 |
We've called it once on this function y, which is a function of x, 00:32:33.800 |
and we've called it once on z, which is also a function of x. 00:32:36.800 |
And so you're right, we can't actually disambiguate which came from what, 00:32:42.800 |
But typically, that's actually exactly what we want, 00:32:46.800 |
because what we want is to be able to run our network 00:32:49.800 |
and accumulate the gradient across all of the training examples 00:32:53.800 |
that define our loss, and then perform our optimizer step. 00:32:57.800 |
So yes, even though both are gradients with respect to the same x, it doesn't matter, 00:33:00.800 |
because in practice, each of those backward calls really corresponds to a different example contributing to the loss. 00:33:05.800 |
And so we're not interested in, you know, the gradient from one example, 00:33:09.800 |
we're actually interested in the overall gradient. 00:33:14.800 |
What's happening here is that, in the backward pass, 00:33:17.800 |
you can imagine there's the x tensor, and alongside it a dot grad field. 00:33:28.800 |
And what that field is storing is the accumulated gradient 00:33:32.800 |
from every single time that you've called dot backward on something that 00:33:39.800 |
essentially has some dependency on x and will have a non-zero gradient. 00:33:43.800 |
And so the first time we call it, the gradient will be 12, 00:33:49.800 |
The second time we do it with z, it's also still 12. 00:33:53.800 |
But the point is that dot grad doesn't actually get overwritten 00:33:56.800 |
each time you call dot backward — it simply adds to it, it accumulates. 00:34:00.800 |
And kind of the intuition there is that ultimately, 00:34:04.800 |
you're going to want to compute the gradient with respect to the loss, 00:34:09.800 |
and that loss is going to be made up of many different examples. 00:34:12.800 |
And so you need to accumulate the gradient from all of those 00:34:17.800 |
And then of course, you'll have to zero that out 00:34:19.800 |
because every time you make one pass through all of your data, 00:34:22.800 |
you don't want that next batch of data to also be double counting the gradients from the previous pass. 00:34:41.800 |
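Putting the autograd example together, here's a sketch of roughly what we just ran (z here is just another function of x, used for illustration):

    import torch

    x = torch.tensor(2.0, requires_grad=True)

    y = 3 * x ** 2
    y.backward()
    print(x.grad)      # dy/dx = 6x = 12 at x = 2

    z = 3 * x ** 2     # another function of x (illustrative)
    z.backward()
    print(x.grad)      # gradients accumulate: 12 + 12 = 24

    x.grad.zero_()     # zero out the accumulated gradient before the next pass
    print(x.grad)      # 0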
Okay, so now let's get to one of the final pieces of the puzzle, which is neural networks. 00:34:47.800 |
And once we have that and we have our optimization, 00:34:51.800 |
we'll finally be able to figure out how do we actually train a neural network? 00:34:54.800 |
What does that look like, and why is it so clean and efficient to do in PyTorch? 00:35:04.800 |
The idea is that we're going to be defining neural networks in terms of existing building blocks, 00:35:09.800 |
in terms of existing APIs, which will implement, 00:35:12.800 |
for instance, linear layers or different activation functions that we need. 00:35:17.800 |
So we're going to import torch.nn, because that is the neural network package, and we'll start with the linear layer. 00:35:27.800 |
The way the linear layer works in PyTorch is it takes in two arguments. 00:35:31.800 |
It takes in the input dimension and then the output dimension. 00:35:36.800 |
And so pretty much what it does is it takes in some input, 00:35:41.800 |
which has some arbitrary number of dimensions, with the input dimension in the very last place. 00:35:48.800 |
And it will essentially output something with that same set of dimensions, 00:35:52.800 |
except with the output dimension in the very last place. 00:35:56.800 |
And you can think of the linear layer as essentially just performing a simple AX plus B. 00:36:02.800 |
By default, it's going to apply a bias, 00:36:08.800 |
but you can also disable that if you don't want a bias term. 00:36:35.800 |
And all we're going to do, once we define it by instantiating with nn dot linear, 00:36:41.800 |
is take whatever the name of our layer is — in this case, we called it linear — 00:36:45.800 |
and just apply it with parentheses, as if it were a function, to whatever input we have. 00:36:50.800 |
And that actually does the actual forward pass through this linear layer to get our output. 00:37:00.800 |
And so you can see that the original shape was two by three by four. 00:37:06.800 |
Then we pass it through this linear layer which has an output dimension of size two. 00:37:10.800 |
And so ultimately our output is two by three by two, which is good. 00:37:15.800 |
That's what we expect — no shape error. 00:37:18.800 |
But, you know, something common that you'll see is 00:37:21.800 |
maybe you get a little confused and you instantiate the layer with, 00:37:29.800 |
let's say, two by two — you match the wrong dimension. 00:37:34.800 |
And so here we're going to get a shape error. 00:37:38.800 |
And you see that the error message isn't as helpful, because under the hood it's actually changed the shape of what we were working with. 00:37:43.800 |
We said this was two by three by four, but the error reports something different. 00:37:48.800 |
In this case it's obvious what went wrong, because we instantiated the input with the shape right there. 00:37:55.800 |
But if we didn't have the shape, then one simple thing we could do is actually just print out the shape. 00:38:01.800 |
And we'd see, okay, this last dimension is size four, 00:38:03.800 |
so I actually need to change my input dimension in my linear layer to be size four. 00:38:14.800 |
And you'll also notice on this, um, output we have this grad function. 00:38:18.800 |
And so that's because we're actually computing and storing the gradients here, uh, for our tensor. 00:38:31.800 |
Yeah, so typically we think of the first dimension as the batch dimension. 00:38:37.800 |
You can think of it like this: if you had a batch of images, it would be the number of images; 00:38:43.800 |
if you had a batch of sentences, it would be essentially the number of sentences or sequences. 00:38:48.800 |
Um, pretty much that is usually considered the batch dimension. 00:38:52.800 |
The star indicates that there can be an arbitrary number of dimensions. 00:38:55.800 |
So for instance, if we had images, this could be a four-dimensional tensor object. 00:39:01.800 |
It could be the batch size by the number of channels, by the height, by the width. 00:39:06.800 |
But in general, there's no fixed number of dimensions. 00:39:10.800 |
Your input tensor can be any number of dimensions. 00:39:13.800 |
The key is just that that last dimension needs to match up with the input dimension of your linear layer. 00:39:22.800 |
So essentially we're saying that we're going to map this last dimension, 00:39:28.800 |
which is four-dimensional to now two-dimensional. 00:39:31.800 |
Um, so in general, you know, you can think of this as if we're stacking a neural network, 00:39:35.800 |
this is kind of the input dimension size, and this would be like the hidden dimension size. 00:39:43.800 |
And so one thing we can do is we can actually print out the parameters, 00:39:46.800 |
and we can actually see what the values of our linear layer are — 00:39:49.800 |
or in general, for any layer that we define in our neural network, what its parameters are. 00:39:55.800 |
And in this case, we see that there are two sets of parameters, 00:40:00.800 |
because we have a bias as well as the actual weight matrix of the linear layer itself. 00:40:07.800 |
And so both of them store gradients, and in this case, 00:40:13.800 |
these are what the current values of these parameters are. 00:40:23.800 |
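As a sketch of the linear layer example — the shapes match what we discussed, though the random values will differ:

    import torch
    import torch.nn as nn

    inp = torch.randn(2, 3, 4)           # (batch, *, input_dim) with input_dim = 4

    linear = nn.Linear(4, 2)             # maps the last dimension from 4 to 2
    out = linear(inp)
    print(out.shape)                     # torch.Size([2, 3, 2])

    # The layer's learnable parameters: a weight matrix and a bias vector.
    for name, param in linear.named_parameters():
        print(name, param.shape)         # weight: (2, 4), bias: (2,)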
Okay, so now let's go through some of the other module layers. 00:40:32.800 |
nn.linear is one of the layers you have access to. 00:40:35.800 |
You have a couple of other different layers that are pretty common. 00:40:38.800 |
You have 2D convolutions, you have transpose convolutions, 00:40:42.800 |
you have batch norm layers when you need to do normalization in your network. 00:40:46.800 |
You can do upsampling, you can do max pooling, and so on. 00:40:51.800 |
But the main key here is that all of them are built-in building blocks 00:40:54.800 |
that you can just call, just like we did with nn.linear. 00:40:58.800 |
And so let's just go, I guess, I'm running out of time, 00:41:03.800 |
but let's just try and go through these last few layers, 00:41:06.800 |
and then I'll wrap up by kind of showing an example that puts it all together. 00:41:10.800 |
So in this case, we can define an activation function — here, a sigmoid. 00:41:20.800 |
And so now we can define our network as this very simple thing, 00:41:23.800 |
which had one linear layer and then an activation. 00:41:28.800 |
And in general, when we compose these layers together, 00:41:32.800 |
we don't need to write it out line by line, applying each layer one at a time. 00:41:39.800 |
In this case, we can use nn.sequential and list all of the layers. 00:41:43.800 |
So here we have our linear layer followed by our sigmoid. 00:41:47.800 |
And then now we're just essentially passing the input through this whole set of layers all at once. 00:41:54.800 |
So we take our input, we call a block on the input, and we get the output. 00:42:00.800 |
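A sketch of composing layers with nn.Sequential (dimensions are illustrative):

    import torch
    import torch.nn as nn

    block = nn.Sequential(
        nn.Linear(4, 2),   # linear layer: last dim 4 -> 2
        nn.Sigmoid(),      # activation applied element-wise
    )

    inp = torch.randn(2, 3, 4)
    out = block(inp)       # passes the input through every layer in order
    print(out.shape)       # torch.Size([2, 3, 2])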
And so let's just kind of see putting it all together, 00:42:05.800 |
and what does it look like when we train one? 00:42:07.800 |
So here we're going to actually define a multilayer perceptron. 00:42:11.800 |
And the way it works is, to define a neural network, you create a class that extends nn.Module. 00:42:17.800 |
The key here is there's really two main things you have to define when you create your own network. 00:42:23.800 |
So in the init function, you actually initialize all the parameters you need. 00:42:27.800 |
In this case, we initialize an input size, a hidden size, and the model itself. 00:42:34.800 |
In this case, it's a simple model which consists of a linear layer followed by an activation, 00:42:41.800 |
followed by another linear layer, followed by a final activation. 00:42:45.800 |
And the second function we have to define is the forward, 00:42:48.800 |
which actually does the forward pass of the network. 00:42:51.800 |
And so here our forward function takes in our input x. 00:42:56.800 |
In general, it could take in some arbitrary amount of inputs into this function, 00:43:01.800 |
but essentially it needs to specify how you actually compute the output. 00:43:07.800 |
Here it just takes the input x, passes it through the model that we just defined, and returns the output. 00:43:14.800 |
And again, you could do this more explicitly by kind of doing what we did earlier, 00:43:19.800 |
where we could actually write out all of the layers individually instead of wrapping them into one object 00:43:26.800 |
and then doing a line-by-line operation for each one of these layers. 00:43:33.800 |
Once we define our class, it's very simple to use it: 00:43:39.800 |
we instantiate our model by calling MultilayerPerceptron with our parameters, and then call it on our input. 00:43:48.800 |
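A sketch of what that multilayer perceptron class might look like — the exact sizes and activations in the notebook may differ:

    import torch
    import torch.nn as nn

    class MultilayerPerceptron(nn.Module):
        def __init__(self, input_size, hidden_size):
            super().__init__()
            # Linear layer, activation, another linear layer, final activation
            # (ReLU and Sigmoid here are illustrative choices).
            self.model = nn.Sequential(
                nn.Linear(input_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, input_size),
                nn.Sigmoid(),
            )

        def forward(self, x):
            # The forward pass just runs the input through the model we defined.
            return self.model(x)

    model = MultilayerPerceptron(input_size=5, hidden_size=3)
    print(model(torch.randn(2, 5)).shape)   # torch.Size([2, 5])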
So that's great, but this is all just the forward pass. 00:43:55.800 |
And so this is the final step, which is we have optimization built in to PyTorch. 00:44:00.800 |
So we have this backward function, which goes and computes all these gradients in the backward pass. 00:44:05.800 |
And now the only step left is to actually update the parameters using those gradients. 00:44:10.800 |
And so here, we'll import the torch.optim package, which contains all of the optimizers that you need. 00:44:18.800 |
Essentially, this part is just creating some random data so that we have something for the model to fit. 00:44:25.800 |
But this is really the key here, which is we'll instantiate our model that we defined. 00:44:30.800 |
We'll define the Adam optimizer, and we'll define it with a particular learning rate. 00:44:37.800 |
We'll define a loss function, which is again another built-in module. 00:44:41.800 |
In this case, we're using the cross-entropy loss. 00:44:44.800 |
And finally, to calculate our predictions, all we do simply is just call model on our actual input. 00:44:51.800 |
And to calculate our loss, we just call our loss function on our predictions and our true labels. 00:45:00.800 |
And now when we put it all together, this is what the training loop looks like. 00:45:05.800 |
We have some number of epochs that we want to train our network. 00:45:08.800 |
For each of these epochs, the first thing we do is we take our optimizer and we zero out the gradient. 00:45:13.800 |
And the reason we do that is because, like many of you noted, we actually are accumulating the gradient. 00:45:19.800 |
We're not resetting it every time we call dot backward. 00:45:23.800 |
We get our model predictions by doing a forward pass. 00:45:27.800 |
We then compute the loss between the predictions and the true values, and call backward on that loss. 00:45:35.800 |
This is what actually computes all of the gradients in the backward pass from our loss. 00:45:41.800 |
And the final step is we call dot step on our optimizer. 00:45:47.800 |
And this will take an optimization step, updating the parameters using those gradients. 00:45:49.800 |
And so if we run this code, we end up seeing that we're able to start with some training loss, 00:45:55.800 |
which is relatively high, and in 10 epochs, we're able to essentially completely fit our data. 00:46:02.800 |
And if we print out our model parameters and we printed them out from the start as well, 00:46:06.800 |
we'd see that they've changed as we've actually done this optimization. 00:46:13.800 |
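Putting it all together, here's a sketch of the training loop we just walked through — the data, model sizes, and learning rate are illustrative stand-ins:

    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Some random data to fit (illustrative): 10 examples, 5 features, 2 classes.
    x = torch.randn(10, 5)
    y = torch.randint(0, 2, (10,))

    # A small stand-in model (the lecture used the MultilayerPerceptron class defined above).
    model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 2))

    optimizer = optim.Adam(model.parameters(), lr=1e-1)
    loss_function = nn.CrossEntropyLoss()

    n_epochs = 10
    for epoch in range(n_epochs):
        optimizer.zero_grad()            # zero out the accumulated gradients
        y_pred = model(x)                # forward pass
        loss = loss_function(y_pred, y)  # loss between predictions and true labels
        loss.backward()                  # backward pass: compute the gradients
        optimizer.step()                 # take an optimization step
        print(f"epoch {epoch}: loss = {loss.item():.4f}")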
But I think the key takeaway is that a lot of the things that you're doing at the beginning of this class 00:46:19.800 |
are really about understanding the basics of how neural networks work, 00:46:23.800 |
how you actually implement them, how you implement the backward pass. 00:46:27.800 |
The great thing about PyTorch is that once you get to the very next assignment, 00:46:30.800 |
you'll see that now that you have a good underlying understanding of those things, 00:46:34.800 |
you can abstract away a lot of that complexity — how you do backprop, 00:46:38.800 |
how you store all these gradients, how you compute them, 00:46:41.800 |
how you actually run the optimizer — and let PyTorch handle all of that for you. 00:46:45.800 |
And you can use all of these building blocks, all these different neural network layers, 00:46:49.800 |
to now define your own networks that you can use to solve whatever problems you need.