Stanford CS224N NLP with Deep Learning | 2023 | PyTorch Tutorial, Drew Kaul


Transcript

And so today I kind of just want to cover the fundamentals of PyTorch, really just kind of see what are the similarities between PyTorch and NumPy and Python, which you guys are used to at this point, and see how we can build up a lot of the building blocks that we'll need in order to define more complex models.

So specifically we're going to talk today about tensors, what are tensor objects, how do we manipulate them, what is AutoGrad, how PyTorch helps us compute different gradients, and finally how we actually do optimization and how we write the training loop for our neural networks. And if we have time at the end, then we'll try and go through a bit of a demo to kind of put everything together and see how everything comes together when you want to solve an actual NLP task.

All right, so let's get started. So if you go to the course website, there is a notebook, and you can just make a copy of this Colab notebook and then just run the cells as we go. And so to start, today we're talking about PyTorch, like I said. It's a deep learning framework that really does two main things.

One is it makes it very easy to author and manipulate tensors and make use of your GPU so that you can actually leverage a lot of that capability. And two is it makes the process of authoring neural networks much simpler. You can now use different building blocks like linear layers and different loss functions and compose them in different ways in order to author the types of models that you need for your specific use cases.

And so PyTorch is one of the two main frameworks along with TensorFlow. In this class, we'll focus on PyTorch, but they're quite similar. And so we'll start by importing Torch, and we'll import the neural network module, which is Torch.nn. And for this first part of the tutorial, I want to talk a bit about tensors.

One thing that you guys are all familiar with now is NumPy arrays. And so pretty much you can think about tensors as the equivalent in PyTorch to NumPy arrays. They're essentially multi-dimensional arrays that you can manipulate in different ways, and you'll essentially use them to represent your data, to be able to actually manipulate it, and perform all the different matrix operations that underlie your neural network.

And so in this case, for example, if we're thinking of an image, one way you can think about it in terms of a tensor is it's a 256 by 256 tensor, where it has a width of 256 pixels and a height of 256 pixels. And for instance, if we have a batch of images, and those images contain three channels, like red, green, and blue, then we might have a four-dimensional tensor, which is the batch size by the number of channels, by the width and the height.
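Just to make those shapes concrete, here is a tiny sketch (the batch size of 32 is made up, and torch.zeros is a helper we'll properly introduce in a moment):

```python
import torch

# A hypothetical batch of 32 RGB images, each 256 x 256 pixels.
# PyTorch usually lays this out as (batch size, channels, height, width).
images = torch.zeros(32, 3, 256, 256)
print(images.shape)  # torch.Size([32, 3, 256, 256])
```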

And so everything we're going to see today is all going to be represented as tensors, which you can just think of as multi-dimensional arrays. And so to kind of get some intuition about this, we're going to spend a little bit of time going through essentially lists of lists, and how we can convert them into tensors, and how we can manipulate them with different operations.

So to start off with, we just have a simple list of lists that you're all familiar with. In this case, it's a two-by-three list. And now we want to create a tensor. And so here, the way we'll create this tensor is by doing torch dot tensor, and then essentially writing the same syntax that we had before.

Just write out the list of lists that represents that particular tensor. And so in this case, we get back a tensor object, which is the same shape and contains the same data. And so now, the second thing with a tensor is that it contains a data type. So there's different data types.

For instance, there are floating-point numbers at varying levels of precision. You can have integers. You can have different data types that actually populate your tensor. And so by default, PyTorch will infer the data type from the values you pass in, but you can explicitly specify which data type your tensor is by passing in the dtype argument.

And so we see here now, even though we wrote in a bunch of integers, they have a decimal point, which indicates that they're floating point numbers. And so same thing here. We can create another tensor, in this case with data type float 32. And in this third example, you see that we create another tensor.

We don't actually specify the data type, but PyTorch essentially implicitly takes the data type to be floating point, since we actually passed in a floating point number into this tensor. So pretty much at a high level, tensors are like multi-dimensional arrays. We can specify the data type for them.
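As a minimal recap of those three cases (the exact values here are placeholders, not necessarily the notebook's):

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6]])                        # dtype inferred as int64
b = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)   # dtype forced to float32
c = torch.tensor([[1.0, 2, 3], [4, 5, 6]])                      # float32, inferred from the 1.0
print(a.dtype, b.dtype, c.dtype)
```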

We can populate them just like NumPy arrays. Okay, so now, great, we know how to create tensors. We know that ultimately everything that we work with, all the data we have is going to be expressed as tensors. Now the question is, what are the functions that we have to manipulate them?

And so we have some basic utilities that can help us instantiate tensors easily, specifically torch.zeros and torch.ones. These are two ways to create tensors of a particular shape, in this case tensors of all zeros or tensors of all ones. And you'll see that this will be very helpful when you do your homeworks.

Typically, you'll just need to create a zero matrix, and it'll be very easy to specify the shape here without having to write everything out super explicitly. And then you can update that tensor as needed. Another thing you can do is, just like we have ranges in Python, so if you want to loop over a bunch of numbers, you can specify a range.

You can also use torch.arange to actually instantiate a tensor with a particular range. In this case, we get the numbers 1 through 10. You could reshape this and make it 1 through 5 and then 6 through 10. That's another way to be able to instantiate tensors.
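A minimal sketch of those helpers (the shapes here are arbitrary):

```python
import torch

zeros = torch.zeros(2, 3)     # a 2 x 3 tensor of all zeros
ones = torch.ones(2, 3)       # a 2 x 3 tensor of all ones
r = torch.arange(1, 11)       # tensor([1, 2, ..., 10]), just like Python's range
print(r.reshape(2, 5))        # rows 1..5 and 6..10, as described above
```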

And finally, something to note is that when we apply particular operations, such as just simple Python operations like addition or multiplication, by default they're going to be element-wise, so they'll apply to all the elements in our tensor. So in this case, we took our tensor, I think this one was probably from earlier above, and we added 2 everywhere.

Here we multiplied everything by 2. And the PyTorch semantics for broadcasting work pretty much the same as the NumPy semantics. So if you have different matrix operations where you need to batch across a particular dimension, PyTorch will be smart about it and will actually make sure that you broadcast over the appropriate dimensions.

Although of course you have to make sure that the shapes are compatible based on the actual broadcasting rules. So we'll get to that in a little bit when we look at reshaping and how different operations have those semantics. As for the question, I'm not personally aware of how you would define a jagged tensor that has unequal dimensions.

But typically we don't want to do that because it makes our computation a lot more complex. And so in cases where we have, you know, for instance, we have different sentences that we turn into tokens, we might have different length sentences in our training set. We'll actually pad all the dimensions to be the same because ultimately we want to do everything with matrix operations.

And so in order to do that, we need to have a matrix of a fixed shape. But yeah, that's a good point. I'm not sure if there is a way to do that, but typically we just get around this by padding. Okay, so now we know how to define tensors.

We can do some interesting things with them. So here we've created two tensors. One of them is a 3 by 2 tensor. The other one is a 2 by 4 tensor. And I think the answer is written up here, but what do we expect is the shape when we multiply these two tensors?

So we have a 3 by 2 tensor and a 2 by 4 tensor. Yeah, 3 by 4. And so more generally, we can use torch.matmul in order to do matrix multiplication. It also implements batched matrix multiplication. And so I won't go over the entire review of broadcasting semantics, but the main gist is that the dimensions of two tensors are compatible if you can left-pad the shapes with ones so that the dimensions that line up either A, have the same number in that dimension, or B, one of them is a dummy dimension.

One of them has a 1. And in that case, in those dummy dimensions, PyTorch will actually make sure to copy over the tensor as many times as needed so that you can then actually perform the operation. And that's useful when you want to do things like batched dot products or batched matrix multiplications.
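Here is a small sketch of both points, matrix multiplication and NumPy-style broadcasting (the values are arbitrary):

```python
import torch

a = torch.ones(3, 2)
b = torch.ones(2, 4)
print(torch.matmul(a, b).shape)    # torch.Size([3, 4])

# Element-wise broadcasting: the (2,)-shaped tensor below is treated as (1, 2)
# and copied across the 3 rows of `a` before multiplying.
c = a * torch.tensor([10.0, 20.0])
print(c.shape)                     # torch.Size([3, 2])
```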

And I guess the final point here is there's also a shorthand notation that you can use. So instead of having to type out matmul every time, you can just use the @ operator, similar to NumPy. Effectively, that's where we get into how batching works. So for example, say you had two tensors that have some batch dimension, and then one of them is M by 1, and the other one is 1 by N.

And if you do a batched matrix multiply to those two tensors, now what you effectively do is you preserve the batch dimension, and then you're doing a matrix multiplication between an M by 1 tensor and a 1 by N. So you get something that's the batch dimension by M by N.

I think the full semantics are written out on the PyTorch website for how the matrix multiplication works. But you're right, you don't just have these cases where you have two two-dimensional tensors. You can have an arbitrary number of dimensions, and as long as the dimensions match up based on those semantics I was describing, then you can multiply them.
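A quick sketch of that batched case with the @ shorthand (the batch size of 8 and the sizes M = 5, N = 3 are made up):

```python
import torch

# (8, 5, 1) @ (8, 1, 3): the batch dimension of 8 is carried through,
# and the last two dimensions are matrix-multiplied, giving (8, 5, 3).
u = torch.randn(8, 5, 1)
v = torch.randn(8, 1, 3)
print((u @ v).shape)   # torch.Size([8, 5, 3])
```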

Alternatively, you can do what I do, which is just multiply it anyways, and then if it throws an error, print out the shapes and kind of work from there. That tends to be faster in my opinion in a lot of ways. But yeah, that's a good point. All right, so yeah, let's keep going through some of the other different functionalities here.

So we can define another tensor, and kind of one of the key things that we always want to look at is the shape. So in this case, we just have a 1D tensor of length 3, so the torch.size just gives us 3. In general, this is kind of one of the key debugging steps and something that I'll try and emphasize a lot throughout this session, which is printing the shapes of all of your tensors is probably your best resource.

When it comes to debugging, it's kind of one of the hardest things to intuit exactly what's going on once you start stacking a lot of different operations together. So printing out the shapes at each point and seeing do they match what you expect is something important, and it's better to rely on that than just on the error message that PyTorch gives you, because under the hood, PyTorch might implement certain optimizations and actually reshape the underlying tensor you have, so you may not see the numbers you expect.

So it's always great to print out the shape. And so yeah, let's... So again, we can always print out the shape, and we can have a more complex, in this case, 3-dimensional tensor, which is 3 by 2 by 4, and we can print out the shape and we can see all of the dimensions here.
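A tiny sketch of that habit, on a 1D and a 3D tensor:

```python
import torch

v = torch.tensor([1.0, 2.0, 3.0])
print(v.shape)      # torch.Size([3]) -- a 1D tensor of length 3

t = torch.zeros(3, 2, 4)
print(t.shape)      # torch.Size([3, 2, 4])
print(t.size(0))    # 3 -- you can also ask for one dimension at a time
```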

And so now you're like, "Okay, great, we have tensors, we can look at their shapes, but what do we actually do with them?" And so now let's get into kind of what are the operations that we can apply to these tensors. And so one of them is it's very easy to reshape tensors.

So in this case, we're creating a tensor with 15 elements, the numbers 1 to 15, and now we're reshaping it so it's a 5 by 3 tensor. And so you might wonder, "Well, what's the point of that?" And it's because a lot of times when we're doing machine learning, we actually want to learn in batches. So we might take our data and reshape it so that instead of being a long, flattened list of things, we actually have a set of batches, or in some cases a set of batches of sentences or sequences of a particular length, where each element in the sequence has an embedding of a particular dimension.

And so based on the types of operations that you're trying to do, you'll sometimes need to reshape those tensors, and sometimes you'll want to transpose dimensions if you want to, for instance, reorganize your data. So that's another operation to keep in mind. As for the question about view versus reshape:

I believe view creates a view of the underlying tensor without copying the data, so it shares the same storage, whereas reshape may return a copy when it has to. All right, and then finally, like I said at the beginning, your intuition about PyTorch tensors can simply be that they're a nice, easy way to work with NumPy-style arrays, but with some great extra properties: we can use them on GPUs, everything is very optimized, and we can compute gradients quickly. And to emphasize this point, if you have some NumPy code and a bunch of NumPy arrays, you can directly convert them into PyTorch tensors by passing them to torch.tensor, and you can also take those tensors and convert them back to NumPy arrays.
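A minimal sketch of reshaping, views, and the NumPy round trip (shapes are arbitrary):

```python
import torch
import numpy as np

x = torch.arange(1, 16)     # a 1D tensor holding the numbers 1..15
y = x.reshape(5, 3)         # the same 15 values, laid out as 5 x 3
z = x.view(5, 3)            # a view: shares the same underlying storage as x
print(x.shape, y.shape, z.shape)

arr = np.ones((2, 3))
t = torch.tensor(arr)       # NumPy array -> PyTorch tensor (this makes a copy)
back = t.numpy()            # PyTorch tensor -> NumPy array
print(type(t), type(back))
```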

All right, and so one of the things you might be asking is, why do we care about tensors? What makes them good? And one of the great things about them is that they support vectorized operations very easily. Essentially, we can parallelize a lot of different computations and do them, for instance, across a batch of data all at once, and one of those operations you might want to do, for instance, is a sum.

So you can take, in this case, a tensor which is shaped five by seven, and now you can compute different operations on it that essentially collapse the dimensionality. So the first one is sum, and you can sum across both the rows as well as the columns. And one way I like to think about this, to keep them straight, is that the dimension that you specify in the sum is the dimension you're collapsing.

So in this case, if you take the data and sum over dimension zero, because you know the shape of the underlying tensor is five by seven, you've collapsed the zeroth dimension, so you should be left with something that's just shaped seven. And if you see the actual tensor, you've got 75, 80, 85, 90, you've got this tensor which is shaped seven.

Alternatively, you can think about whether or not you're kind of summing across the rows or summing across the columns. But it's not just sum, it applies to other operations as well. You can compute standard deviations, you can normalize your data, you can do other operations which essentially batch across the entire set of data.

And not only do these apply over one dimension, but here you can see that if you don't specify any dimensions, then by default the operation actually applies to the entire tensor. So here we end up just taking the sum of the entire thing. So if you think about it, the zeroth dimension is the number of rows, there are five rows and there are seven columns.

So if we sum out the rows, then we're actually summing down each column, and so now we only have seven values. But I like to think about it more just in terms of the dimensions to keep it straight, rather than rows or columns, because it can get confusing. If you're summing out dimension zero, then effectively you've taken something whose shape is dimension zero by dimension one down to just whatever the dimension one shape is.

And then from there you can kind of figure out, okay, which way did I actually sum to check if you were right. NumPy implements a lot of this vectorization, and I believe in the homework that you have right now, I think part of your job is to vectorize a lot of these things.
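A small sketch of those reductions on a five-by-seven example like the one above (the values here are just 0 through 34, not necessarily the notebook's):

```python
import torch

data = torch.arange(35, dtype=torch.float32).reshape(5, 7)
print(data.sum(dim=0).shape)   # torch.Size([7]) -- dimension 0 (the rows) is collapsed
print(data.sum(dim=1).shape)   # torch.Size([5]) -- dimension 1 (the columns) is collapsed
print(data.sum())              # no dim argument: one scalar for the whole tensor

m = torch.arange(6, dtype=torch.float32).reshape(2, 3)
print(m.mean(dim=1))           # row averages: shape (2,)
print(m.mean(dim=0))           # column averages: shape (3,)
print(m.std())                 # standard deviation over the entire tensor
```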

So the big advantage with PyTorch is that essentially it's optimized to be able to take advantage of your GPU. When we actually start building out neural networks that are bigger, that involve more computation, we're going to be doing a lot of these matrix multiplication operations that it's going to be a lot better for our processor if we can make use of the GPU.

And so that's where PyTorch really comes in handy, in addition to also defining a lot of those neural network modules for you, as we'll see later, so that you don't need to worry about, for instance, implementing a basic linear layer, backpropagation from scratch, or your optimizer.

All of those things will be built in and you can just call the respective APIs to make use of them. Whereas in Python and NumPy you might have to do a lot of that coding yourself. Yeah. All right. So we'll keep going. So this is a quiz except I think it tells you the answer so it's not much of a quiz.

But pretty much, you know, what would you do if now I told you instead of, you know, summing over this tensor, I want you to compute the average. And so there's, there's two different ways you could compute the average. You could compute the average across the rows or across the columns.

And so essentially now we get back to this question of, well, which dimension am I actually going to reduce over? And so here, if we want to preserve the rows, then we need to average over the other dimension. There are really only two dimensions here, the zeroth and the first.

So the first dimension is the one we want to average over, because we want to preserve the zeroth dimension. And so that's why for row average you see dim equals one, and for column average, by the same reasoning, you see dim equals zero. And so if we run this code, we'll see the shapes that we expect.

If we're taking the average over rows, then an object that's two by three should just become an object that's two. It's just a one-dimensional, almost a vector you can think of. And if we are averaging across the columns, there's three columns. So now our average should have three values.

And so now we're left with a one-dimensional tensor of length three. So yeah, does that kind of make sense? That's the general intuition about how we deal with shapes and how some of these operations manipulate them. So now we'll get into indexing. This can get a little bit tricky, but I think you'll find that the semantics are very similar to NumPy.

So one of the things that you can do in NumPy is take these NumPy arrays and slice across them in many different ways; you can create copies of them, and you can index across particular dimensions to select out different elements, different rows, or different columns.

And so in this case, let's take this example tensor, which is three by two by two. Um, and first thing you always want to do when you have a new tensor, print out its shape, understand what you're working with. And so I guess, uh, I may have shown this already, but what will x bracket zero print out?

What happens if we index into just the first element? What's the shape of this? Yeah, two by two, right? Because if you think about it, our tensor is really just a list of three things. Each of those things happens to also be a two by two tensor. So we get a two by two object, in this case, the first thing, one, two, three, four.

And so just like NumPy, if you provide a colon in a particular dimension, it means essentially copy over that dimension. So if we do x bracket zero implicitly, we're essentially putting a colon for all the other dimensions. So it's essentially saying, grab the first thing along the zeroth dimension, and then grab everything along the other two dimensions.

If we now take just the zeroth element along the first dimension, what are we going to get? Well, if you look, the zeroth dimension runs over these three blocks, and the first dimension is each of the two rows within those blocks.

So that's one, two and three, four; five, six and seven, eight; nine, ten and eleven, twelve. So if we index into the first dimension and get the zeroth element, then we're going to end up with one, two, five, six, and nine, ten. And even if that's a little bit tricky, you can go back to the trick I mentioned before, where we're slicing across the first dimension.

So if we look at the shape of our tensor, it's three by two by two. If we collapse the first dimension, that two in the middle, we're left with something that's three by two. So it might seem a little bit trivial kind of going through this in a lot of detail, but I think it's important because it can get tricky when your tensor shapes get more complicated, how to actually reason about this.

And so I won't go through every example here since a lot of them reinforce the same thing, but I'll just highlight a few things. Just like NumPy, you can choose to get a range of elements. In this case, we're taking this new tensor, which is one through 15 rearranged as a five by three tensor.

And if we take the zeroth through third row, exclusive, we'll get the first three rows. And we can do the same thing, but now with slicing across multiple dimensions. And I think the final point I want to talk about here is list indexing. List indexing is also present in NumPy, and it's a very clever shorthand for being able to essentially select out multiple elements at once.

So in this case, what you can do is, if you want to get the zeroth, the second, and the fourth element of our matrix, you can just, instead of indexing with a particular number or set of numbers, index with a list of indices. So in this case, if we go up to our tensor, if we take out the zeroth, the second, and the fourth, we should see those three rows, and that's what we end up getting.
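Here is a compact sketch of the indexing patterns discussed so far, on tensors shaped like the examples above:

```python
import torch

x = torch.arange(1, 13).reshape(3, 2, 2)   # the 3 x 2 x 2 example
print(x[0].shape)       # torch.Size([2, 2]) -- first block along dimension 0
print(x[:, 0].shape)    # torch.Size([3, 2]) -- dimension 1 is collapsed instead

m = torch.arange(1, 16).reshape(5, 3)      # 1 through 15 as a 5 x 3 tensor
print(m[0:3])           # rows 0, 1, 2 (the end index is exclusive)
print(m[[0, 2, 4]])     # list indexing: rows 0, 2, and 4 all at once
```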

Yeah, again, these are a lot of examples that reiterate the same point, which is that you can slice across your data in multiple ways, and at different points you're going to need to do that. So being familiar with the shapes, so that you understand what output you expect, is important.

In this case, for instance, we're slicing across the first and the second dimensions, and we're keeping just the zeroth element in each. And so we're going to end up getting essentially the top-left element of each of those three blocks in our tensor. If we scroll all the way up here, we'll get this one, we'll get this five, and we'll get this nine, because we go across all of the zeroth dimension, and then along the first and the second dimensions we only take the zeroth element in both of those positions.

And so that's why we get one, five, nine. And also, of course, you can, you know, apply all of the colons to get back the original tensor. Okay, and then I think the last thing when it comes to indexing is conversions. So typically, when we're writing code with neural networks, ultimately we're going to, you know, process some data through a network, and we're going to get a loss.

And that loss needs to be a scalar, and then we're going to compute gradients with respect to that loss. So one thing to keep in mind is that sometimes you might have an operation, and it fails because it was actually expecting a scalar value rather than a tensor. And so you can extract out the scalar from this one-by-one tensor by just calling dot item.

So in this case, you know, if you have a tensor, which is just literally one, then you can actually get the Python scalar that corresponds to it by calling dot item. So now we can get into the more interesting stuff. One of the really cool things with PyTorch is Autograd.

And what Autograd is, is PyTorch essentially provides an automatic differentiation package, where when you define your neural network, you're essentially defining many nodes that compute some function. And in the forward pass, you're kind of running your data through those nodes. But what PyTorch is doing on the back end is that at each of those points, it's going to actually store the gradients and accumulate them, so that every time you do your backwards pass, you apply the chain rule to be able to calculate all these different gradients, and PyTorch caches those gradients.

And then you will have access to all of those gradients to be able to actually then run your favorite optimizer and optimize, you know, with SGD or with Adam or whichever optimizer you choose. And so that's kind of one of the great features. You don't have to worry about actually writing the code that computes all of these gradients and actually caches all of them properly, applies the chain rule, does all these steps.

You can abstract all of that away with just one call to dot backward. And so in this case, we'll run through a little bit of an example where we'll see the gradients getting computed automatically. So in this case, we're going to initialize a tensor with requires_grad set to true.

It just means that for this tensor, PyTorch will store the gradient associated with it. And you might wonder, well, why do we have this? Wouldn't we always want to store the gradient? And the answer is, at train time, you need the gradients in order to actually train your network.

But at inference time, you'd actually want to disable your gradients, and you can actually do that because it's a lot of extra computation that's not needed since you're not making any updates to your network anymore. And so let's create this right now. We don't have any gradients being computed because we haven't actually called backwards to actually compute some quantity with respect to this particular tensor.

We haven't actually computed those gradients yet. So right now, the dot grad feature, which will actually store the gradient associated with that tensor, is none. And so now let's just define a really simple function. We have x. We're going to define the function y equals 3x squared. And so now we're going to call y dot backward.

And so now what happens is when we actually print out x dot grad, what we should expect to see is number 12. And the reason is that our function y is 3x squared. If we compute the gradient of that function, we're going to get 6x, and our actual value was 2.

So the actual gradient is going to be 12. And we see that when we print out x dot grad, that's what we get. And now we'll just run it again. Let's set z equal to 3x squared. We call z dot backwards, and we print out x dot grad again.

And now we see that- I may not run this in the right order. Okay. So here in the second one that I re-ran, we see that it says 24. And so you might be wondering, well, I just did the same thing twice. Shouldn't I see 12 again? And the answer is that by default, PyTorch will accumulate the gradients.

So it won't actually rewrite the gradient each time you compute it. It will sum it. And the reason is because when you actually have backpropagation for your network, you want to accumulate the gradients, you know, across all of your examples, and then actually apply your update. You don't want to overwrite the gradient.

But this also means that every time you have a training iteration for your network, you need to zero out the gradient, because you don't want the previous gradients from the last epoch, where you iterated through all of your training data, to mess with the current update that you're doing.

So that's kind of one thing to note, which is that that's essentially why we will see when we actually write the training loop, you have to run zero grad in order to zero out the gradient. Yes, so I accidentally ran the cells in the wrong order. Maybe to make it more clear, let me put this one first.

So this is actually what it should look like, which is that we ran it once, and I ran this cell first. And it has 12. And then we ran it a second time, and we get 24. Yes, so if you have all of your tensors defined, then when you actually call dot backwards, if it's a function of multiple variables, it's going to compute all of those partials, all of those gradients.
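To make the intended cell order unambiguous, here is that whole sequence as one small sketch (same 3x² example):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)   # ask PyTorch to track gradients for x
print(x.grad)        # None -- nothing has been backpropagated yet

y = 3 * x ** 2
y.backward()         # dy/dx = 6x, which is 12 at x = 2
print(x.grad)        # tensor(12.)

z = 3 * x ** 2
z.backward()
print(x.grad)        # tensor(24.) -- gradients accumulate rather than overwrite

x.grad.zero_()       # zero it out before an unrelated backward pass, like zero_grad does
print(x.grad)        # tensor(0.)
```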

Yes, so what's happening here is that the way PyTorch works is that it's storing the accumulated gradient at x. And so we've essentially made two different backwards passes. We've called it once on this function y, which is a function of x, and we've called it once on z, which is also a function of x.

And so you're right, we can't actually disambiguate which came from what, we just see the accumulated gradient. But typically, that's actually exactly what we want, because what we want is to be able to run our network and accumulate the gradient across all of the training examples that define our loss, and then perform our optimizer step.

So yes, even with respect to one thing, it doesn't matter, because in practice, each of those things is really a different example in our set of training examples. And so we're not interested in, you know, the gradient from one example, we're actually interested in the overall gradient. So going back to this example, what's happening here is that in the backwards pass, what it's doing is, you can imagine there's the x tensor, and then there's the dot grad attribute, which is another separate tensor.

It's going to be the same shape as x. And what that is storing is it's storing the accumulated gradient from every single time that you've called dot backward on a quantity that, um, essentially has some dependency on x that will have a non-zero gradient. And so the first time we call it, the gradient will be 12, because 6x, 6 times 2, 12.

The second time we do it with z, the new gradient is also 12. But the point is that dot grad doesn't actually overwrite the gradient each time you call dot backward; it simply adds them, it accumulates them. And the intuition there is that ultimately, you're going to want to compute the gradient with respect to the loss, and that loss is going to be made up of many different examples.

And so you need to accumulate the gradient from all of those in order to make a single update. And then of course, you'll have to zero that out because every time you make one pass through all of your data, you don't want that next batch of data to also be double counting the previous batch's update.

You want to keep those separate. And so we'll see that in a second. Yeah. All right. So now we're going to move on to one of the final pieces of the puzzle, which is neural networks. How do we actually use them in PyTorch? And once we have that and we have our optimization, we'll finally be able to figure out how do we actually train a neural network?

What does that look like and why it's so clean and efficient when you do it in PyTorch? So the first thing that you want to do is, we're going to be defining neural networks in terms of existing building blocks, in terms of existing APIs, which will implement, for instance, linear layers or different activation functions that we need.

So we're going to import torch.nn because that is the neural network package that we're going to make use of. And so let's start with the linear layer. The way the linear layer works in PyTorch is it takes in two arguments. It takes in the input dimension and then the output dimension.

And so pretty much what it does is it takes in some input, which has some arbitrary number of leading dimensions and then finally the input dimension, and it will output a tensor with those same leading dimensions but with the output dimension in the very last place. And you can think of the linear layer as essentially just performing a simple Ax plus b.

By default, it's going to, um, it's going to apply a bias, but you can also disable that if you don't want a bias term. And so let's look at a small example. So- so here we have our input, and we're going to create a linear layer, in this case, as an input size of four, an output size of two.

And all we're going to do is once we define it by instantiating with nn dot linear, whatever the name of our layer is, in this case, we called it linear, we just essentially apply it with parentheses as if it were a function to whatever input. And that actually does the actual forward pass through this linear layer to get our output.

And so you can see that the original shape was two by three by four. Then we pass it through this linear layer which has an output dimension of size two. And so ultimately our output is two by three by two, which is good. That's what we expect. There's no shape error.

But something common that you'll see is that you get a little confused and instantiate the layer with the wrong input dimension, say two by two. And so here we're going to get a shape error. And you see that the error message isn't as helpful because it's actually changed the shape of what we were working with.

We said this was two by three by four; under the hood, PyTorch has changed this to a six by four. In this case it's obvious because we instantiated the input with that shape. But if we didn't know the shape, then one simple thing we could do is actually just print it out.

And we'd see, okay, this last dimension is size four, so I actually need to change the input dimension in my linear layer to be size four. And you'll also notice on this output we have this grad_fn. And that's because we're actually computing and storing gradients here for our tensor.

Yeah, so typically we think of the first dimension as the batch dimension. So in this case, that's the N. You can think of it as, if you had a batch of images, it would be the number of images. If you had a training corpus of text, it would be essentially the number of sentences or sequences.

Um, pretty much that is usually considered the batch dimension. The star indicates that there can be an arbitrary number of dimensions. So for instance, if we had images, this could be a four-dimensional tensor object. It could be the batch size by the number of channels, by the height, by the width.

But in general, there's no fixed number of dimensions. Your input tensor can be any number of dimensions. The key is just that that last dimension needs to match up with the input dimension of your linear layer. The two is the output size. So essentially we're saying that we're going to map this last dimension, which is four-dimensional to now two-dimensional.

Um, so in general, you know, you can think of this as if we're stacking a neural network, this is kind of the input dimension size, and this would be like the hidden dimension size. And so one thing we can do is we can actually print out the parameters, and we can actually see what are the values of our linear layer, or in general for any layer that we define in our neural network, what are the actual parameters.

And in this case, we see that there are two sets of parameters, because we have a bias as well as the weight matrix of the linear layer itself. And so both of them store gradients, and these are what the current values of those parameters are; they'll change as we train the network.
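A minimal sketch of that linear-layer example (the 2 x 3 x 4 input shape mirrors the one described above):

```python
import torch
import torch.nn as nn

linear = nn.Linear(4, 2)     # maps the last dimension from size 4 to size 2
x = torch.randn(2, 3, 4)     # any leading dimensions are fine
output = linear(x)
print(output.shape)          # torch.Size([2, 3, 2])

# The learnable parameters: a (2, 4) weight matrix and a (2,) bias vector.
for name, param in linear.named_parameters():
    print(name, param.shape, param.requires_grad)
```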

Okay, so now let's go through some of the other module layers. Um, so in general, nn.linear is one of the layers you have access to. You have a couple of other different layers that are pretty common. You have 2D convolutions, you have transpose convolutions, you have batch norm layers when you need to do normalization in your network.

You can do upsampling, you can do max pooling, you can do lots of different operators. But the main key here is that all of them are built-in building blocks that you can just call, just like we did with nn.linear. And so let's just go, I guess, I'm running out of time, but let's just try and go through these last few layers, and then I'll wrap up by kind of showing an example that puts it all together.

So in this case, we can define an activation function, which is typical with our networks. We need to introduce nonlinearities. In this case, we use the sigmoid function. And so now we can define our, our network as this very simple thing, which had one linear layer and then an activation.

And in general, when we compose these layers together, we don't need to actually write every single line by line applying the next layer. We can actually stack all of them together. In this case, we can use nn.sequential and list all of the layers. So here we have our linear layer followed by our sigmoid.
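A minimal sketch of that pattern (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(4, 2),   # linear layer: last dimension goes from 4 to 2
    nn.Sigmoid(),      # element-wise nonlinearity
)

x = torch.randn(5, 4)
output = block(x)      # one call runs x through every layer in order
print(output.shape)    # torch.Size([5, 2])
```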

And then now we're just essentially passing the input through this whole set of layers all at once. So we take our input, we call a block on the input, and we get the output. And so let's just kind of see putting it all together, what does it look like to define a network, and what does it look like when we train one?

So here we're going to actually define a multilayer perceptron. And the way it works is to define a neural network, you extend the nn.module class. The key here is there's really two main things you have to define when you create your own network. One is the initialization. So in the init function, you actually initialize all the parameters you need.

In this case, we initialize an input size, a hidden size, and we actually define the model itself. In this case, it's a simple model which consists of a linear layer followed by an activation, followed by another linear layer, followed by a final activation. And the second function we have to define is the forward, which actually does the forward pass of the network.

And so here our forward function takes in our input x. In general, it could take an arbitrary number of inputs, but essentially it needs to define how you actually compute the output. And in this case, it's very simple: it just takes the input x, passes it through the network that we just defined, and returns the output.

And again, you could do this more explicitly by kind of doing what we did earlier, where we could actually write out all of the layers individually instead of wrapping them into one object and then doing a line-by-line operation for each one of these layers. If we define our class, it's very simple to use it.
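A sketch of what such a class might look like; the layer sizes and the choice of activations (ReLU, then a final sigmoid) are assumptions here, not necessarily what the notebook uses:

```python
import torch
import torch.nn as nn

class MultilayerPerceptron(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Linear -> activation -> linear -> final activation, wrapped in one Sequential.
        self.model = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, input_size),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # The forward pass just runs the input through the stack defined above.
        return self.model(x)

x = torch.randn(8, 16)                                      # a made-up batch of 8 inputs
model = MultilayerPerceptron(input_size=16, hidden_size=32)
print(model(x).shape)                                       # torch.Size([8, 16])
```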

We can now just instantiate some input, instantiate our model by calling multilayer perceptron with our parameters, and then just pass it through our model. So that's great, but this is all just the forward pass. How do we actually train the network? How do we actually make it better? And so this is the final step, which is we have optimization built in to PyTorch.

So we have this backward function, which goes and computes all these gradients in the backward pass. And now the only step left is to actually update the parameters using those gradients. And so here, we'll import the torch.optim package, which contains all of the optimizers that you need. Essentially, this part is just creating some random data so that we have something to fit.

But this is really the key here, which is we'll instantiate the model that we defined. We'll define the Adam optimizer with a particular learning rate. We'll define a loss function, which is again another built-in module; in this case, we're using the cross-entropy loss. And finally, to calculate our predictions, all we simply do is call the model on our actual input.

And to calculate our loss, we just call our loss function on our predictions and our true labels. And we extract the scalar here. And now when we put it all together, this is what the training loop looks like. We have some number of epochs that we want to train our network.

For each of these epochs, the first thing we do is we take our optimizer and we zero out the gradient. And the reason we do that is because, like many of you noted, we actually are accumulating the gradient. We're not resetting it every time we call dot backward. So we zero out the gradient.

We get our model predictions by doing a forward pass. We then compute the loss between the predictions and the true values. Finally, we call loss dot backward. This is what actually computes all of the gradients in the backward pass from our loss. And the final step is we call dot step on our optimizer.

In this case, we're using Adam. And this will take an optimization step, updating the parameters using the gradients we just computed. And so if we run this code, we end up seeing that we start with some training loss which is relatively high, and in 10 epochs, we're able to essentially completely fit our data.
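For reference, here is a self-contained sketch of that training loop with made-up data; the model, the sizes, and the exact loss setup are stand-ins for whatever the notebook actually defines:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Made-up data: 64 examples with 16 features each, and binary class labels.
x = torch.randn(64, 16)
y = torch.randint(0, 2, (64,))

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

n_epochs = 10
for epoch in range(n_epochs):
    optimizer.zero_grad()        # clear the gradients accumulated by the previous step
    preds = model(x)             # forward pass: shape (64, 2) of class scores
    loss = loss_fn(preds, y)     # compare predictions against the true labels
    loss.backward()              # backward pass: compute all the gradients
    optimizer.step()             # update the parameters using those gradients
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```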

And if we print out our model parameters and we printed them out from the start as well, we'd see that they've changed as we've actually done this optimization. And so I'll kind of wrap it up here. But I think the key takeaway is that a lot of the things that you're doing at the beginning of this class are really about understanding the basics of how neural networks work, how you actually implement them, how you implement the backward pass.

The great thing about PyTorch is that once you get to the very next assignment, you'll see that now that you have a good underlying understanding of those things, you can abstract a lot of the complexity of how do you do backprop, how do you store all these gradients, how do you compute them, how do you actually run the optimizer, and let PyTorch handle all of that for you.

And you can use all of these building blocks, all these different neural network layers, to now define your own networks that you can use to solve whatever problems you need. Thank you.