Stanford CS224N NLP with Deep Learning | 2023 | PyTorch Tutorial, Drew Kaul
00:00:05.500 |
And so today I kind of just want to cover the fundamentals of PyTorch, 00:00:10.400 |
really just kind of see what are the similarities between PyTorch and NumPy and Python, 00:00:18.700 |
and see how we can build up a lot of the building blocks that we'll need in order to define more complex models. 00:00:25.500 |
So specifically we're going to talk today about tensors, what are tensor objects, how do we manipulate them, 00:00:32.300 |
what is AutoGrad, how PyTorch helps us compute different gradients, 00:00:37.500 |
and finally how we actually do optimization and how we write the training loop for our neural networks. 00:00:42.500 |
And if we have time at the end, then we'll try and go through a bit of a demo to kind of put everything together 00:00:48.800 |
and see how everything comes together when you want to solve an actual NLP task. 00:00:58.800 |
So if you go to the course website, there is a notebook, 00:01:02.700 |
and you can just make a copy of this Colab notebook and then just run the cells as we go. 00:01:08.800 |
And so to start, today we're talking about PyTorch, like I said. 00:01:13.400 |
It's a deep learning framework that really does two main things. 00:01:16.900 |
One is it makes it very easy to author and manipulate tensors and make use of your GPU 00:01:22.800 |
so that you can actually leverage a lot of that capability. 00:01:25.800 |
And two is it makes the process of authoring neural networks much simpler. 00:01:31.300 |
You can now use different building blocks like linear layers and different loss functions 00:01:36.300 |
and compose them in different ways in order to author the types of models that you need for your specific use cases. 00:01:43.300 |
And so PyTorch is one of the two main frameworks along with TensorFlow. 00:01:48.200 |
In this class, we'll focus on PyTorch, but they're quite similar. 00:01:51.800 |
And so we'll start by importing Torch, and we'll import the neural network module, which is Torch.nn. 00:01:58.800 |
And for this first part of the tutorial, I want to talk a bit about tensors. 00:02:04.300 |
One thing that you guys are all familiar with now is NumPy arrays. 00:02:09.800 |
And so pretty much you can think about tensors as the equivalent in PyTorch to NumPy arrays. 00:02:16.800 |
They're essentially multi-dimensional arrays that you can manipulate in different ways, 00:02:21.800 |
and you'll essentially use them to represent your data, to be able to actually manipulate it, 00:02:28.300 |
and perform all the different matrix operations that underlie your neural network. 00:02:33.800 |
And so in this case, for example, if we're thinking of an image, 00:02:38.800 |
one way you can think about it in terms of a tensor is it's a 256 by 256 tensor, 00:02:44.800 |
where it has a width of 256 pixels and a height of 256 pixels. 00:02:50.300 |
And for instance, if we have a batch of images, and those images contain three channels, like red, green, and blue, 00:02:57.300 |
then we might have a four-dimensional tensor, which is the batch size by the number of channels, by the width and the height. 00:03:04.800 |
And so everything we're going to see today is all going to be represented as tensors, 00:03:08.800 |
which you can just think of as multi-dimensional arrays. 00:03:12.800 |
And so to kind of get some intuition about this, 00:03:15.800 |
we're going to spend a little bit of time going through essentially lists of lists, 00:03:20.800 |
and how we can convert them into tensors, and how we can manipulate them with different operations. 00:03:26.800 |
So to start off with, we just have a simple list of lists that you're all familiar with. 00:03:38.800 |
And so here, the way we'll create this tensor is by doing torch dot tensor, 00:03:44.800 |
and then essentially writing the same syntax that we had before. 00:03:49.800 |
Just write out the list of lists that represents that particular tensor. 00:03:55.800 |
And so in this case, we get back a tensor object, which is the same shape and contains the same data. 00:04:02.800 |
And so now, the second thing with a tensor is that it contains a data type. 00:04:08.800 |
For instance, there are floating point numbers of varying precision that you can use. 00:04:13.800 |
You can have integers. You can have different data types that actually populate your tensor. 00:04:18.800 |
And so by default, I believe this will be float 32, 00:04:21.800 |
but you can explicitly specify which data type your tensor is by passing in the dtype argument. 00:04:28.800 |
And so we see here now, even though we wrote in a bunch of integers, 00:04:33.800 |
they have a decimal point, which indicates that they're floating point numbers. 00:04:39.800 |
We can create another tensor, in this case with data type float 32. 00:04:45.800 |
And in this third example, you see that we create another tensor. 00:04:51.800 |
We don't actually specify the data type, but PyTorch essentially implicitly takes the data type to be floating point, 00:04:59.800 |
since we actually passed in a floating point number into this tensor. 00:05:03.800 |
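As a rough sketch of what those cells might look like — the exact values here are illustrative, not necessarily the ones in the notebook:

    import torch

    # Create a tensor from a list of lists; the dtype is inferred from the data.
    data = torch.tensor([[1, 2], [3, 4]])                         # integer data -> integer dtype

    # Explicitly request 32-bit floats with the dtype argument.
    data_float = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

    # Passing in a floating point number makes the whole tensor floating point.
    data_inferred = torch.tensor([[1, 2], [3.5, 4]])

    print(data.dtype, data_float.dtype, data_inferred.dtype)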
So pretty much at a high level, tensors are like multi-dimensional arrays. 00:05:13.800 |
Okay, so now, great, we know how to create tensors. 00:05:17.800 |
We know that ultimately everything that we work with, all the data we have is going to be expressed as tensors. 00:05:23.800 |
Now the question is, what are the functions that we have to manipulate them? 00:05:27.800 |
And so we have some basic utilities that can help us instantiate tensors easily. 00:05:36.800 |
These are two ways to create tensors of a particular shape, in this case tensors of all zeros or tensors of all ones. 00:05:45.800 |
And you'll see that this will be very helpful when you do your homeworks. 00:05:50.800 |
Typically, you'll just need to create a zero matrix, 00:05:54.800 |
and it'll be very easy to just specify the shape here without having to write everything out super explicitly. 00:06:00.800 |
And then you can update that tensor as needed. 00:06:04.800 |
Another thing you can do is, just like we have ranges in Python, 00:06:09.800 |
so if you want to loop over a bunch of numbers, you can specify a range. 00:06:14.800 |
You can also use torch.arange to actually instantiate a tensor with a particular range. 00:06:22.800 |
In this case, we just looped over the numbers 1 through 10. 00:06:26.800 |
You could reshape this and make it 1 through 5 and then 6 through 10. 00:06:30.800 |
That's another way to be able to instantiate tensors. 00:06:34.800 |
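A quick sketch of those helpers (the shapes and ranges here are just examples):

    import torch

    zeros = torch.zeros(3, 4)                # 3 x 4 tensor of all zeros
    ones = torch.ones(3, 4)                  # 3 x 4 tensor of all ones
    r = torch.arange(1, 11)                  # tensor([1, 2, ..., 10]), like Python's range
    r2 = torch.arange(1, 11).reshape(2, 5)   # reshape into 1-5 and 6-10
    print(zeros.shape, ones.shape, r, r2)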
And finally, something to note is that when we apply particular operations, 00:06:41.800 |
such as just simple Python operations like addition or multiplication, 00:06:46.800 |
by default they're going to be element-wise, so they'll apply to all the elements in our tensor. 00:06:52.800 |
So in this case, we took our tensor, I think this one was probably from earlier above, 00:06:58.800 |
and we added 2 everywhere. Here we multiplied everything by 2. 00:07:04.800 |
But the PyTorch semantics for broadcasting work pretty much the same as the NumPy semantics. 00:07:10.800 |
So if you have different matrix operations where you need to batch across a particular dimension, 00:07:19.800 |
PyTorch will be smart about it and it will actually make sure that you broadcast over the appropriate dimensions. 00:07:25.800 |
Although of course you have to make sure that the shapes are compatible based on the actual broadcasting rules. 00:07:31.800 |
So we'll get to that in a little bit when we look at reshaping and how different operations have those semantics. 00:07:40.800 |
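For instance, a sketch of element-wise operations and broadcasting, with made-up values:

    import torch

    x = torch.tensor([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)

    print(x + 2)        # adds 2 to every element
    print(x * 2)        # multiplies every element by 2

    # Broadcasting: a (2, 3) tensor plus a (3,) tensor adds the row to each row of x.
    row = torch.tensor([10., 20., 30.])
    print(x + row)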
In this case — I'm actually not personally aware of how you would define a jagged tensor that has unequal dimensions. 00:07:50.800 |
But typically we don't want to do that because it makes our computation a lot more complex. 00:07:56.800 |
And so in cases where we have, you know, for instance, we have different sentences that we turn into tokens, 00:08:03.800 |
we might have different length sentences in our training set. 00:08:06.800 |
We'll actually pad all the dimensions to be the same because ultimately we want to do everything with matrix operations. 00:08:13.800 |
And so in order to do that, we need to have a matrix of a fixed shape. 00:08:16.800 |
But yeah, that's a good point. I'm not sure if there is a way to do that, but typically we just get around this by padding. 00:08:25.800 |
Okay, so now we know how to define tensors. We can do some interesting things with them. 00:08:30.800 |
So here we've created two tensors. One of them is a 3 by 2 tensor. 00:08:39.800 |
And I think the answer is written up here, but what do we expect the shape to be when we multiply these two tensors? 00:08:47.800 |
We have a 3 by 2 tensor and a 2 by 4 tensor, so we should get back a 3 by 4 tensor. 00:08:54.800 |
And so more generally, we can use Matmul in order to do matrix multiplication. 00:09:02.800 |
It also implements batched matrix multiplication. 00:09:06.800 |
And so I won't go over the entire review of broadcasting semantics, 00:09:11.800 |
but the main gist is that the dimensions of two tensors are compatible if you can left-pad the shapes with ones 00:09:19.800 |
so that the dimensions that line up either A, have the same number in that dimension, 00:09:25.800 |
or B, one of them is a dummy dimension. One of them has a 1. 00:09:28.800 |
And in that case, in those dummy dimensions, PyTorch will actually make sure to copy over the tensor as many times as needed 00:09:36.800 |
so that you can then actually perform the operation. 00:09:39.800 |
And that's useful when you want to do things like batched dot products or batched matrix multiplications. 00:09:45.800 |
And I guess the final point here is there's also a shorthand notation that you can use. 00:09:52.800 |
So instead of kind of having to type out Matmul every time, you can just use the at operator, similar to NumPy. 00:09:59.800 |
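In code, that might look something like this, using the shapes from the example:

    import torch

    a = torch.ones(3, 2)
    b = torch.ones(2, 4)

    c = torch.matmul(a, b)   # matrix multiply: shape (3, 4)
    d = a @ b                # same thing with the shorthand operator
    print(c.shape, torch.equal(c, d))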
Effectively, that's kind of where we get into how batching works. 00:10:03.800 |
So for example, if you had, let's say, two tensors that have some batch dimension, 00:10:13.800 |
and then one of them is M by 1, and the other one is 1 by N. 00:10:19.800 |
And if you do a batched matrix multiply to those two tensors, 00:10:23.800 |
now what you effectively do is you preserve the batch dimension, 00:10:27.800 |
and then you're doing a matrix multiplication between an M by 1 tensor and a 1 by N. 00:10:32.800 |
So you get something that's the batch dimension by M by N. 00:10:36.800 |
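As a small sketch of that batched case — the batch size and the values of M and N here are made up:

    import torch

    batch, m, n = 8, 5, 7
    a = torch.randn(batch, m, 1)   # shape (8, 5, 1)
    b = torch.randn(batch, 1, n)   # shape (8, 1, 7)

    out = a @ b                    # batched matrix multiply: shape (8, 5, 7)
    print(out.shape)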
So effectively, I think the full semantics are written out on the PyTorch website. 00:10:44.800 |
But you're right, you don't just have these cases where you have two two-dimensional tensors; 00:10:50.800 |
as long as the dimensions match up based on those semantics I was describing, the multiplication will work. 00:10:56.800 |
Alternatively, you can do what I do, which is just multiply it anyways, 00:11:00.800 |
and then if it throws an error, print out the shapes and kind of work from there. 00:11:04.800 |
That tends to be faster in my opinion in a lot of ways. 00:11:10.800 |
All right, so yeah, let's keep going through some of the other different functionalities here. 00:11:20.800 |
And one of the key things that we always want to look at is the shape. 00:11:25.800 |
So in this case, we just have a 1D tensor of length 3, and we can print out its shape. 00:11:33.800 |
In general, this is kind of one of the key debugging steps 00:11:36.800 |
and something that I'll try and emphasize a lot throughout this session, 00:11:40.800 |
which is printing the shapes of all of your tensors is probably your best resource. 00:11:45.800 |
When it comes to debugging, it's kind of one of the hardest things to intuit exactly what's going on 00:11:50.800 |
once you start stacking a lot of different operations together. 00:11:54.800 |
So printing out the shapes at each point and seeing do they match what you expect is something important, 00:12:00.800 |
and it's better to rely on that than just on the error message that PyTorch gives you, 00:12:06.800 |
because under the hood, PyTorch might implement certain optimizations 00:12:10.800 |
and actually reshape the underlying tensor you have. 00:12:26.800 |
And we can have a more complex, in this case, 3-dimensional tensor, 00:12:34.800 |
and we can print out the shape and we can see all of the dimensions here. 00:12:39.800 |
And so now you're like, "Okay, great, we have tensors, we can look at their shapes." 00:12:46.800 |
And so now let's get into kind of what are the operations that we can apply to these tensors. 00:12:52.800 |
And so one of them is it's very easy to reshape tensors. 00:12:58.800 |
So in this case, we're creating a tensor with 15 elements, 00:13:06.800 |
and now we're reshaping it so it's a 5 by 3 tensor here. 00:13:11.800 |
And so you might wonder, "Well, what's the point of that?" 00:13:15.800 |
And it's because a lot of times when we are doing machine learning, 00:13:22.800 |
we might take our data and reshape it 00:13:25.800 |
so that instead of being one long, flattened list of things, it's organized into batches — 00:13:31.800 |
or in some cases we have batches of sentences or sequences, 00:13:39.800 |
and each of the elements in that sequence has an embedding of a particular dimension. 00:13:44.800 |
And so based on the types of operations that you're trying to do, 00:13:48.800 |
you'll sometimes need to reshape those tensors, 00:13:51.800 |
and sometimes you'll want to transpose dimensions in particular 00:13:56.800 |
if you want to, for instance, reorganize your data. 00:14:09.800 |
View will create a view of the underlying tensor, 00:14:11.800 |
and so I think the underlying tensor will still have the same shape. 00:14:22.800 |
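A quick sketch of reshape, view, and transpose (illustrative shapes):

    import torch

    x = torch.arange(1, 16)        # 15 elements: 1 through 15
    y = x.reshape(5, 3)            # now a 5 x 3 tensor
    z = x.view(5, 3)               # a view that shares the same underlying data
    print(x.shape, y.shape, z.shape)

    # Transposing swaps two dimensions, e.g. (5, 3) -> (3, 5).
    print(y.T.shape)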
All right, and then finally, like I said at the beginning, 00:14:25.800 |
your intuition about PyTorch tensors can simply be that they're like NumPy arrays — 00:14:35.800 |
except now we can use them with GPUs, and it's very optimized. 00:14:47.800 |
And if you have some NumPy code and you have a bunch of NumPy arrays, 00:14:50.800 |
you can directly convert them into PyTorch tensors by simply casting them, 00:14:56.800 |
and you can also take those tensors and convert them back to NumPy arrays. 00:15:03.800 |
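A sketch of converting back and forth between NumPy and PyTorch:

    import numpy as np
    import torch

    arr = np.array([[1., 2.], [3., 4.]])

    t = torch.from_numpy(arr)      # NumPy array -> PyTorch tensor (shares memory)
    t2 = torch.tensor(arr)         # or copy the data into a new tensor

    back = t.numpy()               # tensor -> NumPy array
    print(type(t), type(back))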
All right, and so one of the things you might be asking is, well, what's so great about tensors? 00:15:11.800 |
And one of the great things about them is that they support vectorized operations very easily. 00:15:16.800 |
Essentially, we can parallelize a lot of different computations and do them, 00:15:21.800 |
for instance, across a batch of data all at once, 00:15:24.800 |
and one of those operations you might want to do, for instance, is a sum. 00:15:41.800 |
You can take a tensor that's shaped five by seven, 00:15:43.800 |
and now you can compute different operations on it that essentially collapse the dimensionality. 00:15:50.800 |
So the first one is sum, and so you can take it and you can sum across both the rows as well as the columns, 00:15:56.800 |
and so one way I like to think about this to kind of keep them straight is that 00:16:01.800 |
the dimension that you specify in the sum is the dimension you're collapsing. 00:16:06.800 |
So in this case, if you take the data and sum over dimension zero — 00:16:11.800 |
you know the shape of the underlying tensor is five by seven — 00:16:18.800 |
you should be left with something that's just shaped seven. 00:16:21.800 |
And if you see the actual tensor, you've got 75, 80, 85, 90, 00:16:26.800 |
you've got this tensor which is shaped seven. 00:16:29.800 |
Alternatively, you can think about whether or not you're kind of summing across the rows or summing across the columns. 00:16:35.800 |
But it's not just sum, it applies to other operations as well. 00:16:39.800 |
You can compute standard deviations, you can normalize your data, 00:16:43.800 |
you can do other operations which essentially batch across the entire set of data. 00:16:48.800 |
And not only do these apply over one dimension, 00:16:52.800 |
but here you can see that if you don't specify any dimensions, 00:16:55.800 |
then by default the operation actually applies to the entire tensor. 00:16:59.800 |
So here we end up just taking the sum of the entire thing. 00:17:03.800 |
So if you think about it, the zeroth dimension is the number of rows, 00:17:06.800 |
there are five rows and there are seven columns. 00:17:08.800 |
So if we sum out the row dimension, then we're actually summing down each column. 00:17:19.800 |
But I like to think about it more just in terms of the dimensions to keep it straight, 00:17:22.800 |
rather than rows or columns, because it can get confusing. 00:17:27.800 |
If you sum over a given dimension, then effectively you've taken something whose shape is dimension zero by dimension one, and collapsed the dimension you specified. 00:17:36.800 |
And then from there you can kind of figure out, okay, which way did I actually sum, to check if you were right. 00:17:41.800 |
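To make that concrete, here's a sketch with a 5 by 7 tensor — the actual values in the notebook may differ:

    import torch

    data = torch.arange(35, dtype=torch.float32).reshape(5, 7)   # shape (5, 7)

    print(data.sum(dim=0).shape)   # collapse dim 0 (the rows)    -> shape (7,)
    print(data.sum(dim=1).shape)   # collapse dim 1 (the columns) -> shape (5,)
    print(data.sum())              # no dim given -> sum of the entire tensor (a scalar)
    print(data.std(dim=0).shape)   # the same pattern works for std, mean, etc.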
NumPy implements a lot of this vectorization, 00:17:45.800 |
and I believe in the homework that you have right now, 00:17:49.800 |
I think part of your job is to vectorize a lot of these things. 00:17:52.800 |
So the big advantage with PyTorch is that essentially it's optimized to be able to take advantage of your GPU. 00:17:59.800 |
When we actually start building out neural networks that are bigger, that involve more computation, 00:18:04.800 |
we're going to be doing a lot of these matrix multiplication operations, 00:18:08.800 |
and it's going to be a lot better if we can make use of the GPU instead of just the processor. 00:18:13.800 |
And so that's where PyTorch really comes in handy. 00:18:17.800 |
In addition, it also defines a lot of those neural network modules for you, as we'll see later, 00:18:22.800 |
so that now you don't need to worry about, for instance, 00:18:26.800 |
implementing a basic linear layer, backpropagation from scratch, or your optimizer. 00:18:32.800 |
All of those things will be built in and you can just call the respective APIs to make use of them. 00:18:37.800 |
Whereas in Python and NumPy you might have to do a lot of that coding yourself. 00:18:53.800 |
So this is a quiz except I think it tells you the answer so it's not much of a quiz. 00:18:59.800 |
But pretty much, you know, what would you do if now I told you, instead of summing over this tensor, to compute the average? 00:19:10.800 |
And so there's, there's two different ways you could compute the average. 00:19:13.800 |
You could compute the average across the rows or across the columns. 00:19:18.800 |
And so essentially now we get back to this question of, well, 00:19:23.800 |
which dimension am I actually going to reduce over? 00:19:28.800 |
If we want the average of each row, then we need to reduce over the second of the two dimensions — 00:19:32.800 |
there are really just the zeroth and the first. 00:19:36.800 |
So the first dimension is what we want to average over, 00:19:40.800 |
because we want to preserve the zeroth dimension. 00:19:43.800 |
And so that's why for the row average you see the dim equals one. 00:19:47.800 |
And for the column average, the same reasoning is why you see the dim equals zero. 00:19:52.800 |
And so if we run this code, we'll see the shapes that we expect. 00:20:01.800 |
If we average across the rows, then an object that's two by three should just become an object that's shape two. 00:20:06.800 |
It's just one-dimensional — almost a vector, you can think of it. 00:20:10.800 |
And if we are averaging across the columns, there are three columns, so we get something that's shape three. 00:20:26.800 |
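In code, the row and column averages might look like this, with a small 2 by 3 example:

    import torch

    data = torch.tensor([[1., 2., 3.],
                         [4., 5., 6.]])      # shape (2, 3)

    row_avg = data.mean(dim=1)    # collapse dim 1 -> shape (2,): average of each row
    col_avg = data.mean(dim=0)    # collapse dim 0 -> shape (3,): average of each column
    print(row_avg, col_avg)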
The key takeaway, I guess, is this general intuition about how we deal with shapes, 00:20:29.800 |
and how some of these operations manipulate shapes. 00:20:37.800 |
But I think you'll find that the semantics are very similar to NumPy. 00:20:43.800 |
So one of the things that you can do in NumPy is take these NumPy arrays and index them 00:20:51.800 |
in many different ways: you can create copies of them, 00:20:54.800 |
and you can index across particular dimensions to select out different elements. 00:21:02.800 |
And so in this case, let's take this example tensor, 00:21:09.800 |
And the first thing you always want to do when you have a new tensor is 00:21:14.800 |
print out its shape and understand what you're working with. 00:21:21.800 |
I may have shown this already, but what will x bracket zero print out? 00:21:27.800 |
What happens if we index into just the first element? 00:21:38.800 |
Well, our tensor is really just a list of three things. 00:21:41.800 |
Each of those things happens to also be a two by two tensor. 00:21:47.800 |
So we get back the first of those things — in this case, one, two, three, four. 00:21:53.800 |
And if you provide a colon in a particular dimension, 00:21:56.800 |
it means essentially take everything along that dimension. 00:22:03.800 |
So when we index with just x bracket zero, we're essentially putting a colon for all the other dimensions. 00:22:06.800 |
So it's essentially saying, grab the first thing along the zeroth dimension, 00:22:11.800 |
and then grab everything along the other two dimensions. 00:22:18.800 |
Now what if we grab just the zeroth element along the first dimension? 00:22:31.800 |
Remember, the zeroth dimension is where these three things live, 00:22:34.800 |
and the first dimension is now each of these two rows within those things — 00:22:39.800 |
so like one, two and three, four, then five, six and seven, eight, and so on. 00:22:44.800 |
So if we index into the first dimension and get the zeroth element, 00:22:51.800 |
then we're going to end up with one, two, five, six, and nine, ten. 00:23:00.800 |
And to check the shape, you can go back to the trick I mentioned before: 00:23:03.800 |
we're slicing across the first dimension, 00:23:14.800 |
so we collapse that two in the middle, and we're left with something that's three by two. 00:23:17.800 |
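A sketch of that indexing, using the 3 by 2 by 2 tensor from the example:

    import torch

    x = torch.tensor([[[1,  2], [3,  4]],
                      [[5,  6], [7,  8]],
                      [[9, 10], [11, 12]]])   # shape (3, 2, 2)

    print(x[0])         # first block along dim 0: [[1, 2], [3, 4]]
    print(x[0, :, :])   # the same thing, with the colons written out explicitly
    print(x[:, 0])      # zeroth row of each block: [[1, 2], [5, 6], [9, 10]], shape (3, 2)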
So it might seem a little bit trivial kind of going through this in a lot of detail, 00:23:23.800 |
but I think it's important, because it can get tricky when your tensor shapes get more complicated. 00:23:30.800 |
And so I won't go through every example here, since a lot of them reinforce the same thing. 00:23:38.800 |
Just like NumPy, you can choose to get a range of elements. 00:23:43.800 |
In this case, we're taking this new tensor, 00:23:49.800 |
which is one through 15 rearranged as a five by three tensor. 00:23:55.800 |
And if we take the zeroth through the third row, we get back just those rows. 00:24:02.800 |
And we can do the same thing, but now with slicing across multiple dimensions. 00:24:07.800 |
And I think the final point I want to talk about here is list indexing, 00:24:18.800 |
which is a very clever shorthand for being able to essentially select out multiple elements at once. 00:24:28.800 |
So if you want to get the zeroth, the second, and the fourth element of our matrix, 00:24:35.800 |
then instead of indexing with a particular number or a slice, you can just pass in a list of indices. 00:24:47.800 |
And so here, if we take out the zeroth, the second, and the fourth, we get back exactly those rows. 00:24:59.800 |
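A sketch of range slicing and list indexing, using the 5 by 3 tensor from before:

    import torch

    m = torch.arange(1, 16).reshape(5, 3)   # 1 through 15 as a 5 x 3 tensor

    print(m[0:3])             # rows 0, 1, 2
    print(m[0:3, 0:2])        # slicing across multiple dimensions at once
    print(m[[0, 2, 4]])       # list indexing: rows 0, 2, and 4 together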
Yeah, again, these are kind of a lot of examples to just reiterate the same point, 00:25:04.800 |
which is that you can slice across your data in multiple ways, 00:25:08.800 |
and at different points you're going to need to do that. 00:25:11.800 |
So being familiar with the shapes, so that you understand the underlying output you expect, is important. 00:25:18.800 |
In this case, for instance, we're indexing into the first and the second dimensions, 00:25:27.800 |
and so we're going to end up getting essentially the top left element of each of those three things in our tensor. 00:25:35.800 |
If we scroll all the way up here, we'll get this one, we'll get this five, and we'll get this nine, 00:25:41.800 |
because we go across all of the zeroth dimension, 00:25:45.800 |
and then across the first and the second, we only take the zeroth element in both of those positions. 00:26:00.800 |
And also, of course, you can, you know, apply all of the colons to get back the original tensor. 00:26:11.800 |
Okay, and then I think the last thing when it comes to indexing is conversions. 00:26:17.800 |
So typically, when we're writing code with neural networks, ultimately we're going to, you know, 00:26:24.800 |
process some data through a network, and we're going to get a loss. 00:26:27.800 |
And that loss needs to be a scalar, and then we're going to compute gradients with respect to that loss. 00:26:32.800 |
So one thing to keep in mind is that sometimes you might have an operation, 00:26:37.800 |
and it fails because it was actually expecting a scalar value rather than a tensor. 00:26:42.800 |
And so you can extract out the scalar from this single-element tensor by just calling dot item. 00:26:50.800 |
So in this case, you know, if you have a tensor, which is just literally one, 00:26:54.800 |
then you can actually get the Python scalar that corresponds to it by calling dot item. 00:27:00.800 |
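For example, a minimal sketch:

    import torch

    t = torch.tensor([1])
    value = t.item()           # extract the Python scalar from a single-element tensor
    print(value, type(value))  # 1 <class 'int'>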
So now we can get into the more interesting stuff. 00:27:02.800 |
One of the really cool things with PyTorch is Autograd. 00:27:06.800 |
And what Autograd is, is PyTorch essentially provides an automatic differentiation package, 00:27:14.800 |
where when you define your neural network, you're essentially defining many nodes that compute some function. 00:27:23.800 |
And in the forward pass, you're kind of running your data through those nodes. 00:27:27.800 |
But what PyTorch is doing on the back end is that at each of those points, 00:27:31.800 |
it's going to actually store the gradients and accumulate them, 00:27:36.800 |
so that every time you do your backwards pass, 00:27:39.800 |
you apply the chain rule to be able to calculate all these different gradients, 00:27:46.800 |
And then you will have access to all of those gradients to be able to actually then run your favorite optimizer 00:27:52.800 |
and optimize, you know, with SGD or with Adam or whichever optimizer you choose. 00:27:59.800 |
And so that's kind of one of the great features. 00:28:02.800 |
You don't have to worry about actually writing the code that computes all of these gradients 00:28:06.800 |
and actually caches all of them properly, applies the chain rule, does all these steps. 00:28:11.800 |
You can abstract all of that away with just one call to dot backward. 00:28:16.800 |
And so in this case, we'll run through a little bit of an example 00:28:20.800 |
where we'll see the gradients getting computed automatically. 00:28:25.800 |
So in this case, we're going to initialize a tensor with requires_grad set to true. 00:28:35.800 |
It just means that, for a given tensor, 00:28:38.800 |
PyTorch will store the gradient associated with it. 00:28:44.800 |
And you might wonder, well, you know, why do we have this? 00:28:49.800 |
You know, wouldn't we always want to store the gradient? 00:28:51.800 |
And the answer is, at train time, you need the gradients in order to actually train your network. 00:28:56.800 |
But at inference time, you'd actually want to disable your gradients, 00:29:00.800 |
and you can actually do that because it's a lot of extra computation that's not needed 00:29:04.800 |
since you're not making any updates to your network anymore. 00:29:13.800 |
We don't have any gradients yet, because we haven't actually called backward 00:29:18.800 |
on some quantity computed from this particular tensor. 00:29:25.800 |
We haven't actually computed those gradients yet. 00:29:29.800 |
So right now, the dot grad attribute, which will actually store the gradient associated with that tensor, is none. 00:29:36.800 |
And so now let's just define a really simple function. 00:29:39.800 |
We have x. We're going to define the function y equals 3x squared. 00:29:45.800 |
And so now we're going to call y dot backward. 00:29:49.800 |
And so now what happens is, when we actually print out x dot grad, we see 12. 00:29:58.800 |
And the reason is that our function y is 3x squared. 00:30:03.800 |
If we compute the gradient of that function, we're going to get 6x, and our actual value of x was 2, so the gradient is 12. 00:30:15.800 |
And we see that when we print out x dot grad, that's what we get. 00:30:25.800 |
Then we define another function z of x, call z dot backward, and print out x dot grad again. 00:30:29.800 |
And now we see that — oh, I may not have run this in the right order. 00:30:35.800 |
Okay. So here in the second one that I re-ran, we see that it says 24. 00:30:42.800 |
And so you might be wondering, well, I just did the same thing twice. 00:30:47.800 |
And the answer is that by default, PyTorch will accumulate the gradients. 00:30:52.800 |
So it won't actually rewrite the gradient each time you compute it. 00:30:58.800 |
And the reason is because when you actually have backpropagation for your network, 00:31:02.800 |
you want to accumulate the gradients, you know, across all of your examples, 00:31:10.800 |
But this also means that every time you have a training iteration for your network, you have to zero out the gradients, 00:31:17.800 |
because you don't want the previous gradients from the last epoch, 00:31:21.800 |
where you iterated through all of your training data, 00:31:23.800 |
to mess with the current update that you're doing. 00:31:27.800 |
So that's kind of one thing to note, and that's essentially why, when we write the training loop, we will see 00:31:36.800 |
that you have to run zero grad in order to zero out the gradient. 00:31:39.800 |
Yes, so I accidentally ran the cells in the wrong order. 00:31:44.800 |
Maybe to make it more clear, let me put this one first. 00:31:48.800 |
So this is actually what it should look like, 00:31:53.800 |
which is that we ran it once, and I ran this cell first. 00:31:58.800 |
And then we ran it a second time, and we get 24. 00:32:02.800 |
Yes, so if you have all of your tensors defined with gradients enabled, then when you call backward, 00:32:10.800 |
it's going to compute all of those partials, all of those gradients. 00:32:14.800 |
Yes, so what's happening here is just the way PyTorch works: 00:32:23.800 |
we've essentially made two different backward passes. 00:32:28.800 |
We've called it once on this function y, which is a function of x, 00:32:33.800 |
and we've called it once on z, which is also a function of x. 00:32:36.800 |
And so you're right, we can't actually disambiguate which came from what, 00:32:42.800 |
But typically, that's actually exactly what we want, 00:32:46.800 |
because what we want is to be able to run our network 00:32:49.800 |
and accumulate the gradient across all of the training examples 00:32:53.800 |
that define our loss, and then perform our optimizer step. 00:32:57.800 |
So yes, even though both are gradients with respect to the same x, it doesn't matter, 00:33:00.800 |
because in practice, each of those backward calls really corresponds to a different example contributing to the loss. 00:33:05.800 |
And so we're not interested in, you know, the gradient from one example, 00:33:09.800 |
we're actually interested in the overall gradient. 00:33:14.800 |
What's happening here is that, in the backward pass, 00:33:17.800 |
you can imagine there's the x tensor, and alongside it a dot grad field. 00:33:28.800 |
And what that field is storing is the accumulated gradient 00:33:32.800 |
from every single time that you've called dot backward on something that 00:33:39.800 |
essentially has some dependency on x and will have a non-zero gradient. 00:33:43.800 |
And so the first time we call it, the gradient will be 12, 00:33:49.800 |
The second time we do it with z, it's also still 12. 00:33:53.800 |
But the point is that dot grad doesn't actually get overwritten 00:33:56.800 |
each time you call dot backward — it simply adds to it, it accumulates. 00:34:00.800 |
And kind of the intuition there is that ultimately, 00:34:04.800 |
you're going to want to compute the gradient with respect to the loss, 00:34:09.800 |
and that loss is going to be made up of many different examples. 00:34:12.800 |
And so you need to accumulate the gradient from all of those 00:34:17.800 |
And then of course, you'll have to zero that out 00:34:19.800 |
because every time you make one pass through all of your data, 00:34:22.800 |
you don't want that next batch of data to also be double counting the gradients from the previous pass. 00:34:41.800 |
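Putting the autograd example together, here's a sketch of roughly what we just ran (z here is just another function of x, used for illustration):

    import torch

    x = torch.tensor(2.0, requires_grad=True)

    y = 3 * x ** 2
    y.backward()
    print(x.grad)      # dy/dx = 6x = 12 at x = 2

    z = 3 * x ** 2     # another function of x (illustrative)
    z.backward()
    print(x.grad)      # gradients accumulate: 12 + 12 = 24

    x.grad.zero_()     # zero out the accumulated gradient before the next pass
    print(x.grad)      # 0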
Okay, so now let's get to one of the final pieces of the puzzle, which is neural networks. 00:34:47.800 |
And once we have that and we have our optimization, 00:34:51.800 |
we'll finally be able to figure out how do we actually train a neural network? 00:34:54.800 |
What does that look like, and why is it so clean and efficient to do in PyTorch? 00:35:04.800 |
The idea is that we're going to be defining neural networks in terms of existing building blocks, 00:35:09.800 |
in terms of existing APIs, which will implement, 00:35:12.800 |
for instance, linear layers or different activation functions that we need. 00:35:17.800 |
So we're going to import torch.nn, because that is the neural network package, and we'll start with the linear layer. 00:35:27.800 |
The way the linear layer works in PyTorch is it takes in two arguments. 00:35:31.800 |
It takes in the input dimension and then the output dimension. 00:35:36.800 |
And so pretty much what it does is it takes in some input, 00:35:41.800 |
which has some arbitrary number of dimensions, with the input dimension in the very last place. 00:35:48.800 |
And it will essentially output something with that same set of dimensions, 00:35:52.800 |
except with the output dimension in the very last place. 00:35:56.800 |
And you can think of the linear layer as essentially just performing a simple AX plus B. 00:36:02.800 |
By default, it's going to apply a bias, 00:36:08.800 |
but you can also disable that if you don't want a bias term. 00:36:35.800 |
And all we're going to do, once we define it by instantiating with nn dot linear, 00:36:41.800 |
is take whatever the name of our layer is — in this case, we called it linear — 00:36:45.800 |
and just apply it with parentheses, as if it were a function, to whatever input we have. 00:36:50.800 |
And that actually does the actual forward pass through this linear layer to get our output. 00:37:00.800 |
And so you can see that the original shape was two by three by four. 00:37:06.800 |
Then we pass it through this linear layer which has an output dimension of size two. 00:37:10.800 |
And so ultimately our output is two by three by two, which is good. 00:37:15.800 |
That's what we expect — no shape error. 00:37:18.800 |
But, you know, something common that you'll see is 00:37:21.800 |
maybe you get a little confused and you instantiate the layer with, 00:37:29.800 |
let's say, two by two — you match the wrong dimension. 00:37:34.800 |
And so here we're going to get a shape error. 00:37:38.800 |
And you see that the error message isn't as helpful, because under the hood it's actually changed the shape of what we were working with. 00:37:43.800 |
We said this was two by three by four, but the error reports something different. 00:37:48.800 |
In this case it's obvious what went wrong, because we instantiated the input with the shape right there. 00:37:55.800 |
But if we didn't have the shape, then one simple thing we could do is actually just print out the shape. 00:38:01.800 |
And we'd see, okay, this last dimension is size four, 00:38:03.800 |
so I actually need to change my input dimension in my linear layer to be size four. 00:38:14.800 |
And you'll also notice on this, um, output we have this grad function. 00:38:18.800 |
And so that's because we're actually computing and storing the gradients here, uh, for our tensor. 00:38:31.800 |
Yeah, so typically we think of the first dimension as the batch dimension. 00:38:37.800 |
You can think of it like this: if you had a batch of images, it would be the number of images; 00:38:43.800 |
if you had a batch of sentences, it would be essentially the number of sentences or sequences. 00:38:48.800 |
Um, pretty much that is usually considered the batch dimension. 00:38:52.800 |
The star indicates that there can be an arbitrary number of dimensions. 00:38:55.800 |
So for instance, if we had images, this could be a four-dimensional tensor object. 00:39:01.800 |
It could be the batch size by the number of channels, by the height, by the width. 00:39:06.800 |
But in general, there's no fixed number of dimensions. 00:39:10.800 |
Your input tensor can be any number of dimensions. 00:39:13.800 |
The key is just that that last dimension needs to match up with the input dimension of your linear layer. 00:39:22.800 |
So essentially we're saying that we're going to map this last dimension, 00:39:28.800 |
which is four-dimensional to now two-dimensional. 00:39:31.800 |
Um, so in general, you know, you can think of this as if we're stacking a neural network, 00:39:35.800 |
this is kind of the input dimension size, and this would be like the hidden dimension size. 00:39:43.800 |
And so one thing we can do is we can actually print out the parameters, 00:39:46.800 |
and we can actually see what the values of our linear layer are — 00:39:49.800 |
or in general, for any layer that we define in our neural network, what its parameters are. 00:39:55.800 |
And in this case, we see that there are two sets of parameters, 00:40:00.800 |
because we have a bias as well as the actual weight matrix of the linear layer itself. 00:40:07.800 |
And so both of them store gradients, and in this case, 00:40:13.800 |
these are what the current values of these parameters are. 00:40:23.800 |
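As a sketch of the linear layer example — the shapes match what we discussed, though the random values will differ:

    import torch
    import torch.nn as nn

    inp = torch.randn(2, 3, 4)           # (batch, *, input_dim) with input_dim = 4

    linear = nn.Linear(4, 2)             # maps the last dimension from 4 to 2
    out = linear(inp)
    print(out.shape)                     # torch.Size([2, 3, 2])

    # The layer's learnable parameters: a weight matrix and a bias vector.
    for name, param in linear.named_parameters():
        print(name, param.shape)         # weight: (2, 4), bias: (2,)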
Okay, so now let's go through some of the other module layers. 00:40:32.800 |
nn.linear is one of the layers you have access to. 00:40:35.800 |
You have a couple of other different layers that are pretty common. 00:40:38.800 |
You have 2D convolutions, you have transpose convolutions, 00:40:42.800 |
you have batch norm layers when you need to do normalization in your network. 00:40:46.800 |
You can do upsampling, you can do max pooling, and so on. 00:40:51.800 |
But the main key here is that all of them are built-in building blocks 00:40:54.800 |
that you can just call, just like we did with nn.linear. 00:40:58.800 |
And so let's just go, I guess, I'm running out of time, 00:41:03.800 |
but let's just try and go through these last few layers, 00:41:06.800 |
and then I'll wrap up by kind of showing an example that puts it all together. 00:41:10.800 |
So in this case, we can define an activation function — here, a sigmoid. 00:41:20.800 |
And so now we can define our network as this very simple thing, 00:41:23.800 |
which had one linear layer and then an activation. 00:41:28.800 |
And in general, when we compose these layers together, 00:41:32.800 |
we don't need to write it out line by line, applying each layer one at a time. 00:41:39.800 |
In this case, we can use nn.sequential and list all of the layers. 00:41:43.800 |
So here we have our linear layer followed by our sigmoid. 00:41:47.800 |
And then now we're just essentially passing the input through this whole set of layers all at once. 00:41:54.800 |
So we take our input, we call a block on the input, and we get the output. 00:42:00.800 |
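A sketch of composing layers with nn.Sequential (dimensions are illustrative):

    import torch
    import torch.nn as nn

    block = nn.Sequential(
        nn.Linear(4, 2),   # linear layer: last dim 4 -> 2
        nn.Sigmoid(),      # activation applied element-wise
    )

    inp = torch.randn(2, 3, 4)
    out = block(inp)       # passes the input through every layer in order
    print(out.shape)       # torch.Size([2, 3, 2])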
And so let's just kind of see putting it all together, 00:42:05.800 |
and what does it look like when we train one? 00:42:07.800 |
So here we're going to actually define a multilayer perceptron. 00:42:11.800 |
And the way it works is, to define a neural network, you create a class that extends nn.Module. 00:42:17.800 |
The key here is there's really two main things you have to define when you create your own network. 00:42:23.800 |
So in the init function, you actually initialize all the parameters you need. 00:42:27.800 |
In this case, we initialize an input size, a hidden size, and the model itself. 00:42:34.800 |
In this case, it's a simple model which consists of a linear layer followed by an activation, 00:42:41.800 |
followed by another linear layer, followed by a final activation. 00:42:45.800 |
And the second function we have to define is the forward, 00:42:48.800 |
which actually does the forward pass of the network. 00:42:51.800 |
And so here our forward function takes in our input x. 00:42:56.800 |
In general, it could take in some arbitrary amount of inputs into this function, 00:43:01.800 |
but essentially it needs to specify how you actually compute the output. 00:43:07.800 |
Here it just takes the input x, passes it through the model that we just defined, and returns the output. 00:43:14.800 |
And again, you could do this more explicitly by kind of doing what we did earlier, 00:43:19.800 |
where we could actually write out all of the layers individually instead of wrapping them into one object 00:43:26.800 |
and then doing a line-by-line operation for each one of these layers. 00:43:33.800 |
Once we define our class, it's very simple to use it: 00:43:39.800 |
we instantiate our model by calling MultilayerPerceptron with our parameters, and then call it on our input. 00:43:48.800 |
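A sketch of what that multilayer perceptron class might look like — the exact sizes and activations in the notebook may differ:

    import torch
    import torch.nn as nn

    class MultilayerPerceptron(nn.Module):
        def __init__(self, input_size, hidden_size):
            super().__init__()
            # Linear layer, activation, another linear layer, final activation
            # (ReLU and Sigmoid here are illustrative choices).
            self.model = nn.Sequential(
                nn.Linear(input_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, input_size),
                nn.Sigmoid(),
            )

        def forward(self, x):
            # The forward pass just runs the input through the model we defined.
            return self.model(x)

    model = MultilayerPerceptron(input_size=5, hidden_size=3)
    print(model(torch.randn(2, 5)).shape)   # torch.Size([2, 5])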
So that's great, but this is all just the forward pass. 00:43:55.800 |
And so this is the final step, which is we have optimization built in to PyTorch. 00:44:00.800 |
So we have this backward function, which goes and computes all these gradients in the backward pass. 00:44:05.800 |
And now the only step left is to actually update the parameters using those gradients. 00:44:10.800 |
And so here, we'll import the torch.optim package, which contains all of the optimizers that you need. 00:44:18.800 |
Essentially, this part is just creating some random data so that we have something for the model to fit. 00:44:25.800 |
But this is really the key here, which is we'll instantiate our model that we defined. 00:44:30.800 |
We'll define the Adam optimizer, and we'll define it with a particular learning rate. 00:44:37.800 |
We'll define a loss function, which is again another built-in module. 00:44:41.800 |
In this case, we're using the cross-entropy loss. 00:44:44.800 |
And finally, to calculate our predictions, all we do simply is just call model on our actual input. 00:44:51.800 |
And to calculate our loss, we just call our loss function on our predictions and our true labels. 00:45:00.800 |
And now when we put it all together, this is what the training loop looks like. 00:45:05.800 |
We have some number of epochs that we want to train our network. 00:45:08.800 |
For each of these epochs, the first thing we do is we take our optimizer and we zero out the gradient. 00:45:13.800 |
And the reason we do that is because, like many of you noted, we actually are accumulating the gradient. 00:45:19.800 |
We're not resetting it every time we call dot backward. 00:45:23.800 |
We get our model predictions by doing a forward pass. 00:45:27.800 |
We then compute the loss between the predictions and the true values, and call backward on that loss. 00:45:35.800 |
This is what actually computes all of the gradients in the backward pass from our loss. 00:45:41.800 |
And the final step is we call dot step on our optimizer. 00:45:47.800 |
And this will take an optimization step, updating the parameters using those gradients. 00:45:49.800 |
And so if we run this code, we end up seeing that we're able to start with some training loss, 00:45:55.800 |
which is relatively high, and in 10 epochs, we're able to essentially completely fit our data. 00:46:02.800 |
And if we print out our model parameters and we printed them out from the start as well, 00:46:06.800 |
we'd see that they've changed as we've actually done this optimization. 00:46:13.800 |
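Putting it all together, here's a sketch of the training loop we just walked through — the data, model sizes, and learning rate are illustrative stand-ins:

    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Some random data to fit (illustrative): 10 examples, 5 features, 2 classes.
    x = torch.randn(10, 5)
    y = torch.randint(0, 2, (10,))

    # A small stand-in model (the lecture used the MultilayerPerceptron class defined above).
    model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 2))

    optimizer = optim.Adam(model.parameters(), lr=1e-1)
    loss_function = nn.CrossEntropyLoss()

    n_epochs = 10
    for epoch in range(n_epochs):
        optimizer.zero_grad()            # zero out the accumulated gradients
        y_pred = model(x)                # forward pass
        loss = loss_function(y_pred, y)  # loss between predictions and true labels
        loss.backward()                  # backward pass: compute the gradients
        optimizer.step()                 # take an optimization step
        print(f"epoch {epoch}: loss = {loss.item():.4f}")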
But I think the key takeaway is that a lot of the things that you're doing at the beginning of this class 00:46:19.800 |
are really about understanding the basics of how neural networks work, 00:46:23.800 |
how you actually implement them, how you implement the backward pass. 00:46:27.800 |
The great thing about PyTorch is that once you get to the very next assignment, 00:46:30.800 |
you'll see that now that you have a good underlying understanding of those things, 00:46:34.800 |
you can abstract away a lot of that complexity — how you do backprop, 00:46:38.800 |
how you store all these gradients, how you compute them, 00:46:41.800 |
how you actually run the optimizer — and let PyTorch handle all of that for you. 00:46:45.800 |
And you can use all of these building blocks, all these different neural network layers, 00:46:49.800 |
to now define your own networks that you can use to solve whatever problems you need.