
Building makemore Part 5: Building a WaveNet


Chapters

0:00 intro
1:40 starter code walkthrough
6:56 let’s fix the learning rate plot
9:16 pytorchifying our code: layers, containers, torch.nn, fun bugs
17:11 overview: WaveNet
19:33 dataset bump the context size to 8
19:55 re-running baseline code on block_size 8
21:36 implementing WaveNet
37:41 training the WaveNet: first pass
38:50 fixing batchnorm1d bug
45:21 re-training WaveNet with bug fix
46:07 scaling up our WaveNet
46:58 experimental harness
47:44 WaveNet but with “dilated causal convolutions”
51:34 torch.nn
52:28 the development process of building deep neural nets
54:17 going forward
55:26 improve on my loss! how far can we improve a WaveNet on this data?

Whisper Transcript

00:00:00.000 | Hi everyone. Today we are continuing our implementation of MakeMore,
00:00:03.760 | our favorite character-level language model. Now, you'll notice that the background behind
00:00:08.000 | me is different. That's because I am in Kyoto and it is awesome. So I'm in a hotel room here.
00:00:12.560 | Now, over the last few lectures we've built up to this architecture that is a multi-layer
00:00:18.160 | perceptron character-level language model. So we see that it receives three previous characters
00:00:22.720 | and tries to predict the fourth character in a sequence using a very simple multi-layer
00:00:26.720 | perceptron with one hidden layer of neurons with tanh nonlinearities.
00:00:30.720 | So what I'd like to do now in this lecture is I'd like to complexify this architecture.
00:00:34.880 | In particular, we would like to take more characters in a sequence as an input, not just
00:00:39.440 | three. And in addition to that, we don't just want to feed them all into a single hidden layer
00:00:44.720 | because that squashes too much information too quickly. Instead, we would like to make a deeper
00:00:49.200 | model that progressively fuses this information to make its guess about the next character in
00:00:54.160 | a sequence. And so we'll see that as we make this architecture more complex, we're actually
00:00:59.680 | going to arrive at something that looks very much like a WaveNet. So WaveNet is this paper
00:01:04.800 | published by DeepMind in 2016, and it is also a language model basically, but it tries to predict
00:01:12.000 | audio sequences instead of character-level sequences or word-level sequences. But fundamentally,
00:01:17.600 | the modeling setup is identical. It is an autoregressive model and it tries to predict
00:01:23.120 | the next character in a sequence. And the architecture actually takes this interesting
00:01:27.520 | hierarchical sort of approach to predicting the next character in a sequence with this tree-like
00:01:33.440 | structure. And this is the architecture, and we're going to implement it in the course of this video.
00:01:38.960 | So let's get started. So the starter code for part five is very similar to where we ended up in
00:01:44.400 | part three. Recall that part four was the manual backpropagation exercise that is kind of an aside.
00:01:50.160 | So we are coming back to part three, copy-pasting chunks out of it, and that is our starter code
00:01:54.400 | for part five. I've changed very few things otherwise. So a lot of this should look familiar
00:01:58.720 | to you if you've gone through part three. So in particular, very briefly, we are doing imports.
00:02:03.760 | We are reading our data set of words, and we are processing the data set of words into individual
00:02:10.480 | examples, and none of this data generation code has changed. And basically, we have lots and lots
00:02:15.200 | of examples. In particular, we have 182,000 examples of three characters trying to predict
00:02:22.000 | the fourth one. And we've broken up every one of these words into little problems of given three
00:02:27.520 | characters, predict the fourth one. So this is our data set, and this is what we're trying to get the
00:02:31.120 | neural net to do. Now, in part three, we started to develop our code around these layer modules
00:02:38.480 | that are, for example, a class linear. And we're doing this because we want to think of these
00:02:43.440 | modules as building blocks, like Lego bricks, that we can sort of like
00:02:48.960 | stack up into neural networks. And we can feed data between these layers and stack them up into
00:02:54.240 | sort of graphs. Now, we also developed these layers to have APIs and signatures very similar
00:03:01.280 | to those that are found in PyTorch. So we have torch.nn, and it's got all these layer building
00:03:06.160 | blocks that you would use in practice. And we were developing all of these to mimic the APIs of
00:03:10.720 | these. So for example, we have linear. So there will also be a torch.nn.linear, and its signature
00:03:17.360 | will be very similar to our signature, and the functionality will be also quite identical as far
00:03:21.680 | as I'm aware. So we have the linear layer with the batch norm 1D layer and the tanh layer that we
00:03:27.200 | developed previously. And linear just does a matrix multiply in the forward pass of this module.
00:03:33.600 | Batch norm, of course, is this crazy layer that we developed in the previous lecture. And what's
00:03:38.880 | crazy about it is, well, there's many things. Number one, it has these running mean and
00:03:44.080 | variances that are trained outside of back propagation. They are trained using exponential
00:03:49.840 | moving average inside this layer, what we call the forward pass. In addition to that,
00:03:55.600 | there's this training flag because the behavior of batch norm is different during train time
00:04:00.400 | and evaluation time. And so suddenly, we have to be very careful that batch norm is in its
00:04:04.160 | correct state, that it's in the evaluation state or training state. So that's something to now keep
00:04:08.560 | track of, something that sometimes introduces bugs because you forget to put it into the right mode.
00:04:13.840 | And finally, we saw that batch norm couples the statistics or the activations across the examples
00:04:20.240 | in the batch. So normally, we thought of the batch as just an efficiency thing. But now,
00:04:25.360 | we are coupling the computation across batch elements, and it's done for the purposes of
00:04:30.880 | controlling the activation statistics as we saw in the previous video. So it's a very weird layer,
00:04:36.320 | and it leads to a lot of bugs, partly, for example, because you have to modulate the training and
00:04:41.280 | eval phase and so on. In addition, for example, you have to wait for the mean and the variance
00:04:48.720 | to settle and to actually reach a steady state. And so you have to make sure that you... Basically,
00:04:54.400 | there's state in this layer, and state is harmful, usually. Now, I brought out the generator object.
00:05:02.640 | Previously, we had a generator equals G and so on inside these layers. I've discarded that in
00:05:07.600 | favor of just initializing the torch RNG outside here just once globally, just for simplicity.
00:05:15.840 | And then here, we are starting to build out some of the neural network elements. This should look
00:05:20.480 | very familiar. We have our embedding table C, and then we have a list of layers. And it's a linear,
00:05:27.120 | which feeds into a batch norm, which feeds into a tanh, and then a linear output layer. And its weights are scaled down,
00:05:33.040 | so we are not confidently wrong at initialization. We see that this is about 12,000 parameters.
00:05:38.400 | We're telling PyTorch that the parameters require gradients. The optimization is, as far as I'm
00:05:44.320 | aware, identical and should look very, very familiar. Nothing changed here. The loss function
00:05:50.480 | looks very crazy. We should probably fix this. And that's because 32 batch elements are too few.
00:05:57.120 | And so you can get very lucky or unlucky in any one of these batches, and it creates a very thick
00:06:02.560 | loss function. So we're going to fix that soon. Now, once we want to evaluate the trained neural
00:06:08.560 | network, we need to remember, because of the batch norm layers, to set all the layers to
00:06:12.560 | be training equals false. This only matters for the batch norm layer so far. And then we evaluate.
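In code, that step is just flipping the flag on every layer; a minimal sketch, assuming the list is named layers as in the notebook:

```python
# put every layer into evaluation mode (only BatchNorm1d actually changes behavior so far)
for layer in layers:
    layer.training = False
```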
00:06:18.480 | We see that currently we have a validation loss of 2.10, which is fairly good, but there's still
00:06:25.760 | a ways to go. But even at 2.10, we see that when we sample from the model, we actually get relatively
00:06:31.520 | name-like results that do not exist in a training set. So for example, Yvonne, Kilo, Pras, Alaya,
00:06:40.640 | et cetera. So certainly not unreasonable, I would say, but not amazing. And we can still push this
00:06:48.560 | validation loss even lower and get much better samples that are even more name-like. So let's
00:06:53.760 | improve this model now. OK, first, let's fix this graph, because it is daggers in my eyes,
00:06:59.520 | and I just can't take it anymore. So loss_i, if you recall, is a Python list of floats. So for
00:07:07.040 | example, the first 10 elements look like this. Now, what we'd like to do basically is we need
00:07:12.480 | to average up some of these values to get a more representative value along the way.
00:07:19.520 | So one way to do this is the following. In PyTorch, if I create, for example, a tensor of
00:07:25.680 | the first 10 numbers, then this is currently a one-dimensional array. But recall that I can view
00:07:30.880 | this array as two-dimensional. So for example, I can view it as a 2x5 array, and this is a 2D
00:07:36.560 | tensor now, 2x5. And you see what PyTorch has done is that the first row of this tensor is the first
00:07:42.800 | five elements, and the second row is the second five elements. I can also view it as a 5x2,
00:07:48.560 | as an example. And then recall that I can also use -1 in place of one of these numbers,
00:07:55.360 | and PyTorch will calculate what that number must be in order to make the number of elements work
00:08:00.400 | out. So this can be this, or like that. Both will work. Of course, this would not work.
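As a small sketch of those view calls:

```python
import torch

a = torch.arange(10)   # tensor([0, 1, ..., 9]), one-dimensional
a.view(2, 5)           # 2 rows of 5 consecutive elements
a.view(5, 2)           # 5 rows of 2
a.view(2, -1)          # -1 lets PyTorch infer the 5
a.view(-1, 5)          # ...or infer the 2
# a.view(3, 5)         # this would not work: 15 elements requested, only 10 exist
```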
00:08:06.800 | Okay, so this allows it to spread out some of the consecutive values into rows. So that's very
00:08:14.240 | helpful, because what we can do now is, first of all, we're going to create a Torch.tensor
00:08:18.800 | out of the list of floats. And then we're going to view it as whatever it is, but we're going to
00:08:26.560 | stretch it out into rows of 1,000 consecutive elements. So the shape of this now becomes
00:08:32.480 | 200 by 1,000, and each row is 1,000 consecutive elements in this list. So that's very helpful,
00:08:40.320 | because now we can do a mean along the rows, and the shape of this will just be 200.
00:08:46.000 | And so we've taken basically the mean on every row. So plt.plot of that should be something nicer.
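Putting that together, a sketch of the smoothing (assuming loss_i holds 200,000 per-step losses, as in the lecture; written lossi below):

```python
import torch
import matplotlib.pyplot as plt

# lossi: the Python list of per-step losses collected during training
smoothed = torch.tensor(lossi).view(-1, 1000).mean(1)  # 200 rows of 1,000 steps -> 200 means
plt.plot(smoothed)
```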
00:08:52.320 | Much better. So we see that we've basically made a lot of progress. And then here, this is the
00:08:58.720 | learning rate decay. So here we see that the learning rate decay subtracted a ton of energy
00:09:03.680 | out of the system, and allowed us to settle into the local minimum in this optimization.
00:09:09.280 | So this is a much nicer plot. Let me come up and delete the monster, and we're going to be
00:09:15.040 | using this going forward. Now, next up, what I'm bothered by is that you see our forward pass is
00:09:20.640 | a little bit gnarly, and takes way too many lines of code. So in particular, we see that
00:09:25.680 | we've organized some of the layers inside the layers list, but not all of them for no reason.
00:09:31.120 | So in particular, we see that we still have the embedding table special case outside of the layers.
00:09:36.800 | And in addition to that, the viewing operation here is also outside of our layers.
00:09:40.880 | So let's create layers for these, and then we can add those layers to just our list.
00:09:45.360 | So in particular, the two things that we need is here, we have this embedding table,
00:09:51.200 | and we are indexing at the integers inside the batch xb, inside the tensor xb.
00:09:57.520 | So that's an embedding table lookup just done with indexing. And then here we see that we
00:10:03.440 | have this view operation, which if you recall from the previous video, simply rearranges the
00:10:08.320 | character embeddings and stretches them out into a row. And effectively, what that does is the
00:10:15.040 | concatenation operation, basically, except it's free because viewing is very cheap in PyTorch.
00:10:20.480 | And no memory is being copied. We're just re-representing how we view that tensor.
00:10:25.280 | So let's create modules for both of these operations, the embedding operation and the
00:10:32.000 | flattening operation. So I actually wrote the code just to save some time. So we have a module
00:10:39.600 | embedding and a module flatten, and both of them simply do the indexing operation in a forward
00:10:45.200 | pass and the flattening operation here. And this c now will just become a self.weight inside an
00:10:54.800 | embedding module. And I'm calling these layers specifically embedding and flatten because it
00:10:59.920 | turns out that both of them actually exist in PyTorch. So in PyTorch, we have nn.Embedding,
00:11:05.520 | and it also takes the number of embeddings and the dimensionality of the embedding, just like
00:11:09.440 | we have here. But in addition, PyTorch takes in a lot of other keyword arguments that we are not
00:11:14.240 | using for our purposes yet. And for flatten, that also exists in PyTorch, and it also takes
00:11:21.360 | additional keyword arguments that we are not using. So we have a very simple flatten.
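A sketch of those two modules, matching the behavior described (not necessarily the exact notebook code):

```python
class Embedding:

    def __init__(self, num_embeddings, embedding_dim):
        # the former standalone table C now lives inside the module as self.weight
        self.weight = torch.randn((num_embeddings, embedding_dim))

    def __call__(self, IX):
        # an embedding lookup is just indexing into the table
        self.out = self.weight[IX]
        return self.out

    def parameters(self):
        return [self.weight]


class Flatten:

    def __call__(self, x):
        # stretch everything after the batch dimension into one long row (a view, no copy)
        self.out = x.view(x.shape[0], -1)
        return self.out

    def parameters(self):
        return []
```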
00:11:26.560 | But both of them exist in PyTorch; ours are just a bit simpler. And now that we have these,
00:11:31.360 | we can simply take out some of these special cased things. So instead of c, we're just going
00:11:40.000 | to have an embedding and a vocab size and n embed. And then after the embedding, we are going to
00:11:47.120 | flatten. So let's construct those modules. And now I can take out this c. And here, I don't have to
00:11:54.080 | special case it anymore, because now c is the embedding's weight, and it's inside layers.
00:12:00.160 | So this should just work. And then here, our forward pass simplifies substantially,
00:12:07.600 | because we don't need to do these operations outside of the layers explicitly anymore. They're now
00:12:13.600 | inside layers, so we can delete those. But now to kick things off, we want this little x, which
00:12:20.960 | in the beginning is just xb, the tensor of integers specifying the identities of these
00:12:25.760 | characters at the input. And so these characters can now directly feed into the first layer,
00:12:30.720 | and this should just work. So let me come here and insert a break, because I just want to make
00:12:36.000 | sure that the first iteration of this runs and that there's no mistake. So that ran properly.
00:12:40.720 | And basically, we've substantially simplified the forward pass here. Okay, I'm sorry,
00:12:45.520 | I changed my microphone. So hopefully, the audio is a little bit better. Now,
00:12:50.320 | one more thing that I would like to do in order to PyTorchify our code even further
00:12:53.840 | is that right now, we are maintaining all of our modules in a naked list of layers.
00:12:57.520 | And we can also simplify this, because we can introduce the concept of PyTorch containers.
00:13:03.920 | So in torch.nn, which we are basically rebuilding from scratch here, there's a concept of containers.
00:13:08.400 | And these containers are basically a way of organizing layers into lists or dicts and so on.
00:13:15.520 | So in particular, there's a sequential, which maintains a list of layers, and is a module
00:13:21.040 | class in PyTorch. And it basically just passes a given input through all the layers sequentially,
00:13:26.400 | exactly as we are doing here. So let's write our own sequential. I've written a code here.
00:13:31.920 | And basically, the code for sequential is quite straightforward. We pass in a list of layers,
00:13:37.520 | which we keep here. And then given any input in a forward pass, we just call the layers
00:13:42.080 | sequentially and return the result. And in terms of the parameters, it's just all the parameters
00:13:46.240 | of the child modules. So we can run this. And we can again simplify this substantially. Because
00:13:52.480 | we don't maintain this naked list of layers. We now have a notion of a model, which is a module.
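A sketch of that Sequential container:

```python
class Sequential:

    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        # pass the input through every layer in order
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out

    def parameters(self):
        # collect the parameters of all the child modules into one list
        return [p for layer in self.layers for p in layer.parameters()]
```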
00:13:57.920 | And in particular, it is a Sequential of all these layers. And now, parameters are simply
00:14:06.880 | just model.parameters. And so that list comprehension now lives here. And then here
00:14:14.320 | we are doing all the things we used to do. Now here, the code again simplifies substantially.
00:14:20.880 | Because we don't have to do this forwarding here. Instead, we just call the model on the
00:14:25.200 | input data. And the input data here are the integers inside xb. So we can simply do logits,
00:14:31.040 | which are the outputs of our model, are simply the model called on xb. And then the cross entropy
00:14:38.080 | here takes the logits and the targets. So this simplifies substantially. And then this looks
00:14:45.280 | good. So let's just make sure this runs. That looks good. Now here, we actually have some work
00:14:50.960 | to do still here, but I'm going to come back later. For now, there's no more layers list. There's
00:14:55.040 | model.layers, but it's not easy to access attributes of these classes directly. So we'll
00:15:00.880 | come back and fix this later. And then here, of course, this simplifies substantially as well,
00:15:05.680 | because logits are the model called on x. And then these logits come here.
00:15:11.840 | So we can evaluate the train and validation loss, which currently is terrible because we just
00:15:18.400 | initialized the neural net. And then we can also sample from the model. And this simplifies
00:15:22.480 | dramatically as well, because we just want to call the model onto the context and outcome logits.
00:15:29.280 | And then these logits go into Softmax and get the probabilities, etc.
00:15:34.240 | So we can sample from this model. What did I screw up?
00:15:38.880 | Okay, so I fixed the issue and we now get the result that we expect,
00:15:45.520 | which is gibberish because the model is not trained because we reinitialize it from scratch.
00:15:50.560 | The problem was that when I fixed this cell to be model.layers instead of just layers,
00:15:54.960 | I did not actually run the cell. And so our neural net was in a training mode.
00:15:59.360 | And what caused the issue here is the batch norm layer, as batch norm layer often likes to do,
00:16:04.160 | because batch norm was in the training mode. And here we are passing in an input,
00:16:08.880 | which is a batch of just a single example made up of the context.
00:16:12.080 | And so if you are trying to pass in a single example into a batch norm that is in the training
00:16:16.800 | mode, you're going to end up estimating the variance using the input. And the variance of
00:16:21.280 | a single number is not a number, because it is a measure of a spread. So for example,
00:16:26.560 | the variance of just a single number five, you can see is not a number. And so that's what happened.
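For example:

```python
torch.tensor([5.0]).var()   # tensor(nan): the spread of a single value is undefined
```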
00:16:31.760 | And batch norm basically caused an issue. And then that polluted all of the further processing.
00:16:37.040 | So all that we had to do was make sure that this cell runs. And we basically missed the issue:
00:16:45.120 | again, we didn't actually see the issue with the loss. We could have evaluated the loss,
00:16:48.560 | but we got the wrong result because batch norm was in the training mode.
00:16:51.360 | And so we still get a result, it's just the wrong result, because it's using the
00:16:56.080 | sample statistics of the batch. Whereas we want to use the running mean and running variance inside
00:17:01.600 | the batch norm. And so again, an example of introducing a bug inline, because we did not
00:17:08.800 | properly maintain the state of what is training or not. Okay, so I re-run everything. And here's
00:17:13.600 | where we are. As a reminder, we have the training loss of 2.05 and validation 2.10.
00:17:18.000 | Now, because these losses are very similar to each other, we have a sense that we are not
00:17:22.960 | overfitting too much on this task. And we can make additional progress in our performance by scaling
00:17:28.160 | up the size of the neural network and making everything bigger and deeper. Now, currently,
00:17:33.200 | we are using this architecture here, where we are taking in some number of characters,
00:17:36.800 | going into a single hidden layer, and then going to the prediction of the next character.
00:17:41.200 | The problem here is, we don't have a naive way of making this bigger in a productive way. We could,
00:17:47.360 | of course, use our layers, sort of building blocks and materials to introduce additional layers here
00:17:53.360 | and make the network deeper. But it is still the case that we are crushing all of the characters
00:17:57.440 | into a single layer all the way at the beginning. And even if we make this a bigger layer and add
00:18:02.960 | neurons, it's still kind of like silly to squash all that information so fast in a single step.
00:18:09.760 | So what we'd like to do instead is we'd like our network to look a lot more like this in the
00:18:13.520 | WaveNet case. So you see in the WaveNet, when we are trying to make the prediction for the next
00:18:17.840 | character in the sequence, it is a function of the previous characters that feed in. But
00:18:24.400 | all of these different characters are not just crushed into a single layer, and then you have a
00:18:28.640 | sandwich. They are crushed slowly. So in particular, we take two characters and we fuse
00:18:34.640 | them into sort of like a bigram representation. And we do that for all these characters consecutively.
00:18:40.160 | And then we take the bigrams and we fuse those into four character level chunks. And then we
00:18:47.120 | fuse that again. And so we do that in this like tree-like hierarchical manner. So we fuse the
00:18:52.720 | information from the previous context slowly into the network as it gets deeper. And so this is the
00:18:58.560 | kind of architecture that we want to implement. Now, in the WaveNet's case, this is a visualization
00:19:03.200 | of a stack of dilated causal convolution layers. And this makes it sound very scary, but actually
00:19:08.480 | the idea is very simple. And the fact that it's a dilated causal convolution layer is really just
00:19:13.360 | an implementation detail to make everything fast. We're going to see that later. But for now,
00:19:17.680 | let's just keep the basic idea of it, which is this progressive fusion. So we want to make the
00:19:22.240 | network deeper. And at each level, we want to fuse only two consecutive elements, two characters,
00:19:28.000 | then two bigrams, then two fourgrams, and so on. So let's implement this.
00:19:32.720 | Okay, so first up, let me scroll to where we built the dataset. And let's change the block size from
00:19:36.720 | three to eight. So we're going to be taking eight characters of context to predict the ninth
00:19:42.240 | character. So the dataset now looks like this. We have a lot more context feeding in to predict any
00:19:47.440 | next character in a sequence. And these eight characters are going to be processed in this
00:19:51.440 | tree-like structure. Now, if we scroll here, everything here should just be able to work.
00:19:57.440 | So we should be able to redefine the network. You see that the number of parameters has increased
00:20:01.440 | by 10,000. And that's because the block size has grown. So this first linear layer is much,
00:20:06.800 | much bigger. Our linear layer now takes eight characters into this middle layer. So there's a
00:20:12.800 | lot more parameters there. But this should just run. Let me just break right after the very first
00:20:18.720 | iteration. So you see that this runs just fine. It's just that this network doesn't make too much
00:20:23.120 | sense. We're crushing way too much information way too fast. So let's now come in and see how we
00:20:28.640 | could try to implement the hierarchical scheme. Now, before we dive into the detail of the
00:20:33.520 | reimplementation here, I was just curious to actually run it and see where we are in terms
00:20:38.080 | of the baseline performance of just lazily scaling up the context length. So I let it run. We get a
00:20:43.600 | nice loss curve. And then evaluating the loss, we actually see quite a bit of improvement just from
00:20:48.720 | increasing the context length. So I started a little bit of a performance log here. And
00:20:53.200 | previously where we were is we were getting a performance of 2.10 on the validation loss.
00:20:58.880 | And now simply scaling up the context length from three to eight gives us a performance of 2.02.
00:21:04.160 | So quite a bit of an improvement here. And also, when you sample from the model,
00:21:08.400 | you see that the names are definitely improving qualitatively as well. So we could, of course,
00:21:13.840 | spend a lot of time here tuning things and making it even bigger and scaling up the network further,
00:21:19.840 | even with a simple set up here. But let's continue. And let's implement the hierarchical model
00:21:26.400 | and treat this as just a rough baseline performance. But there's a lot of optimization
00:21:31.680 | left on the table in terms of some of the hyperparameters that you're hopefully getting
00:21:35.200 | a sense of now. OK, so let's scroll up now and come back up. And what I've done here is I've
00:21:41.040 | created a bit of a scratch space for us to just look at the forward pass of the neural net and
00:21:46.560 | inspect the shape of the tensors along the way as the neural net forwards. So here I'm just
00:21:53.040 | temporarily for debugging, creating a batch of just, say, four examples, so four random integers.
00:21:59.040 | Then I'm plucking out those rows from our training set. And then I'm passing into the model the
00:22:04.560 | input xb. Now, the shape of xb here, because we have only four examples, is four by eight. And
00:22:10.880 | this eight is now the current block size. So inspecting xb, we just see that we have four
00:22:17.920 | examples. Each one of them is a row of xb. And we have eight characters here. And this integer
00:22:25.040 | tensor just contains the identities of those characters. So the first layer of our neural net
00:22:30.880 | is the embedding layer. So passing xb, this integer tensor, through the embedding layer
00:22:36.080 | creates an output that is four by eight by 10. So our embedding table has, for each character,
00:22:43.120 | a 10-dimensional vector that we are trying to learn. And so what the embedding layer does here
00:22:48.320 | is it plucks out the embedding vector for each one of these integers and organizes it all in a four
00:22:55.360 | by eight by 10 tensor now. So all of these integers are translated into 10-dimensional vectors inside
00:23:02.000 | this three-dimensional tensor now. Now, passing that through the flattened layer, as you recall,
00:23:07.360 | what this does is it views this tensor as just a four by 80 tensor. And what that effectively does
00:23:14.080 | is that all these 10-dimensional embeddings for all these eight characters just end up being
00:23:18.640 | stretched out into a long row. And that looks kind of like a concatenation operation, basically.
00:23:24.560 | So by viewing the tensor differently, we now have a four by 80. And inside this 80,
00:23:29.920 | it's all the 10-dimensional vectors just concatenated next to each other. And the
00:23:36.400 | linear layer, of course, takes 80 and creates 200 channels just via matrix multiplication.
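A sketch of that scratch space, assuming the tensors are named Xtr and Ytr as in the earlier parts:

```python
ix = torch.randint(0, Xtr.shape[0], (4,))  # 4 random example indices
Xb, Yb = Xtr[ix], Ytr[ix]
logits = model(Xb)
print(Xb.shape)                            # torch.Size([4, 8])
for layer in model.layers:
    print(layer.__class__.__name__, ':', tuple(layer.out.shape))
# Embedding : (4, 8, 10)
# Flatten   : (4, 80)
# Linear    : (4, 200)
# ...
```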
00:23:42.160 | So, so far, so good. Now I'd like to show you something surprising.
00:23:46.000 | Let's look at the insides of the linear layer and remind ourselves how it works.
00:23:52.480 | The linear layer here in a forward pass takes the input x, multiplies it with a weight,
00:23:57.440 | and then optionally adds a bias. And the weight here is two-dimensional, as defined here,
00:24:01.680 | and the bias is one-dimensional here. So effectively, in terms of the shapes involved,
00:24:07.040 | what's happening inside this linear layer looks like this right now. And I'm using random numbers
00:24:12.000 | here, but I'm just illustrating the shapes and what happens. Basically, a four by 80 input comes
00:24:18.480 | into the linear layer, gets multiplied by this 80 by 200 weight matrix inside, and there's a
00:24:23.200 | plus 200 bias. And the shape of the whole thing that comes out of the linear layer is four by 200,
00:24:28.560 | as we see here. Now, notice here, by the way, that this here will create a four by 200 tensor,
00:24:35.840 | and then plus 200, there's a broadcasting happening here. But four by 200 broadcasts with 200,
00:24:41.600 | so everything works here. So now the surprising thing that I'd like to show you that you may not
00:24:47.120 | expect is that this input here that is being multiplied doesn't actually have to be two-
00:24:52.080 | dimensional. This matrix multiply operator in PyTorch is quite powerful, and in fact,
00:24:57.280 | you can actually pass in higher-dimensional arrays or tensors, and everything works fine.
00:25:01.680 | So for example, this could be four by five by 80, and the result in that case will become
00:25:05.840 | four by five by 200. You can add as many dimensions as you like on the left here.
00:25:10.480 | And so effectively, what's happening is that the matrix multiplication only works on the last
00:25:16.240 | dimension, and the dimensions before it in the input tensor are left unchanged.
00:25:20.800 | So basically, these dimensions on the left are all treated as just a batch dimension.
00:25:30.240 | So we can have multiple batch dimensions, and then in parallel over all those dimensions,
00:25:36.320 | we are doing the matrix multiplication on the last dimension. So this is quite convenient,
00:25:40.800 | because we can use that in our network now. Because remember that we have these eight
00:25:46.160 | characters coming in, and we don't want to now flatten all of it out into a large
00:25:52.800 | 80-dimensional vector, because we don't want to matrix multiply all 80 numbers by a weight matrix
00:26:00.560 | immediately. Instead, we want to group these like this. So every consecutive two elements,
00:26:09.680 | one and two, and three and four, and five and six, and seven and eight, all of these should be now
00:26:13.040 | basically flattened out and multiplied by a weight matrix. But all of these four groups here,
00:26:20.320 | we'd like to process in parallel. So it's kind of like a batch dimension that we can introduce.
00:26:25.120 | And then we can in parallel basically process all of these bigram groups in the four batch
00:26:33.760 | dimensions of an individual example, and also over the actual batch dimension of the,
00:26:38.880 | you know, four examples in our example here. So let's see how that works. Effectively,
00:26:43.680 | what we want is right now, we take a 4 by 80, and multiply it by 80 by 200
00:26:49.360 | in the linear layer. This is what happens. But instead, what we want is, we don't want 80
00:26:56.480 | characters or 80 numbers to come in. We only want two characters to come in on the very first layer,
00:27:01.920 | and those two characters should be fused. So in other words, we just want 20 to come in,
00:27:08.000 | right? 20 numbers would come in. And here, we don't want a 4 by 80 to feed into the linear layer.
00:27:14.320 | We actually want these groups of two to feed in. So instead of 4 by 80, we want this to be a 4
00:27:20.000 | by 4 by 20. So these are the four groups of two, and each character in a group is a 10-dimensional vector.
00:27:28.720 | So what we want is now, is we need to change the flattened layer. So it doesn't output a 4 by 80,
00:27:34.400 | but it outputs a 4 by 4 by 20, where basically, every two consecutive characters are packed in
00:27:44.320 | on the very last dimension. And then these four is the first batch dimension, and this four is the
00:27:50.000 | second batch dimension, referring to the four groups inside every one of these examples.
00:27:54.400 | And then this will just multiply like this. So this is what we want to get to.
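In shapes, roughly:

```python
# what the current flat layer does:
(torch.randn(4, 80) @ torch.randn(80, 200) + torch.randn(200)).shape     # torch.Size([4, 200])

# what we want instead: matmul acts only on the last dimension, so the
# leading dimensions (4 examples x 4 bigram groups) behave like batch dimensions
(torch.randn(4, 4, 20) @ torch.randn(20, 200) + torch.randn(200)).shape  # torch.Size([4, 4, 200])
```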
00:27:59.280 | So we're going to have to change the linear layer in terms of how many inputs it expects.
00:28:03.280 | It shouldn't expect 80, it should just expect 20 numbers. And we have to change our flattened layer
00:28:08.960 | so it doesn't just fully flatten out this entire example. It needs to create a 4 by 4 by 20 instead
00:28:15.680 | of a 4 by 80. So let's see how this could be implemented. Basically, right now, we have an
00:28:20.800 | input that is a 4 by 8 by 10 that feeds into the flattened layer. And currently, the flattened
00:28:26.480 | layer just stretches it out. So if you remember the implementation of flatten, it takes our x,
00:28:32.800 | and it just views it as whatever the batch dimension is, and then negative 1. So effectively,
00:28:37.840 | what it does right now is it does E.view of 4, negative 1, and the shape of this, of course,
00:28:43.280 | is 4 by 80. So that's what currently happens. And we instead want this to be a 4 by 4 by 20,
00:28:50.560 | where these consecutive 10-dimensional vectors get concatenated.
00:28:53.280 | So you know how in Python, you can take a list of range of 10. So we have numbers from 0 to 9.
00:29:02.880 | And we can index like this to get all the even parts. And we can also index like starting at 1,
00:29:09.040 | and going in steps of 2 to get all the odd parts. So one way to implement this would be as follows.
00:29:16.400 | We can take E, and we can index into it for all the batch elements, and then just even elements
00:29:23.680 | in this dimension. So at indexes 0, 2, 4, and 6. And then all the parts here from this last dimension.
00:29:33.120 | And this gives us the even characters. And then here, this gives us all the odd characters.
00:29:41.440 | And basically, what we want to do is we want to make sure that these get concatenated
00:29:45.280 | in PyTorch. And then we want to concatenate these two tensors along the second dimension.
00:29:51.040 | So this and the shape of it would be 4 by 4 by 20. This is definitely the result we want.
00:29:58.160 | We are explicitly grabbing the even parts and the odd parts. And we're arranging those 4 by 4 by 10
00:30:05.600 | right next to each other and concatenate. So this works. But it turns out that what also works
00:30:11.440 | is you can simply use view again and just request the right shape. And it just so happens that in
00:30:17.440 | this case, those vectors will again end up being arranged exactly the way we want.
00:30:22.560 | So in particular, if we take E, and we just view it as a 4 by 4 by 20, which is what we want,
00:30:28.720 | we can check that this is exactly equal to, let me call this, this is the explicit
00:30:33.760 | concatenation, I suppose. So explicit dot shape is 4 by 4 by 20. If you just view it as 4 by 4 by 20,
00:30:42.080 | you can check that when you compare it to explicit, you get a boolean tensor; this is an element-wise operation.
00:30:48.640 | So we make sure that all of the values are true. So basically, long story short,
00:30:54.160 | we don't need to make an explicit call to concatenate, etc. We can simply take this
00:30:59.200 | input tensor to flatten, and we can just view it in whatever way we want. And in particular,
00:31:06.400 | we don't want to stretch things out with negative one, we want to actually create a three-dimensional
00:31:10.720 | array. And depending on how many vectors that are consecutive, we want to fuse, like for example,
00:31:18.480 | two, then we can just simply ask for this dimension to be 20. And using negative one here,
00:31:25.440 | and PyTorch will figure out how many groups it needs to pack into this additional batch dimension.
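A small sketch of that check:

```python
e = torch.randn(4, 8, 10)                                   # (B, T, C) as in the example
explicit = torch.cat([e[:, ::2, :], e[:, 1::2, :]], dim=2)  # even and odd positions side by side
explicit.shape                                              # torch.Size([4, 4, 20])
(e.view(4, 4, 20) == explicit).all()                        # tensor(True): the view is already arranged this way
```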
00:31:29.840 | So let's now go into flatten and implement this. Okay, so I scrolled up here to flatten.
00:31:35.360 | And what we'd like to do is we'd like to change it now. So let me create a constructor and take
00:31:39.840 | the number of elements that are consecutive that we would like to concatenate now in the last
00:31:44.480 | dimension of the output. So here, we're just going to remember, self.n equals n. And then I want to
00:31:51.360 | be careful here, because PyTorch actually has a Torch.flatten, and its keyword arguments are
00:31:56.560 | different, and they kind of like function differently. So our flatten is going to start
00:32:00.640 | to depart from PyTorch.flatten. So let me call it flatten consecutive, or something like that,
00:32:06.080 | just to make sure that our APIs are about equal. So this basically flattens only some n
00:32:13.920 | consecutive elements and puts them into the last dimension. Now here, the shape of x is b by t by c.
00:32:21.360 | So let me pop those out into variables and recall that in our example down below, b was 4, t was 8,
00:32:29.440 | and c was 10. Now, instead of doing x.view of b by negative one, right, this is what we had before.
00:32:40.960 | We want this to be b by negative one by, and basically here, we want c times n. That's how
00:32:49.520 | many consecutive elements we want. And here, instead of negative one, I don't super love the
00:32:55.920 | use of negative one, because I like to be very explicit so that you get error messages when
00:32:59.840 | things don't go according to your expectation. So what do we expect here? We expect this to become
00:33:04.480 | t divide n, using integer division here. So that's what I expect to happen. And then one more thing
00:33:11.440 | I want to do here is, remember previously, all the way in the beginning, n was 3, and basically
00:33:18.240 | we're concatenating all the three characters that existed there. So we basically concatenated
00:33:25.120 | everything. And so sometimes that can create a spurious dimension of one here. So if it is the
00:33:30.880 | case that x.shape at one is one, then it's kind of like a spurious dimension. So we don't want to
00:33:38.240 | return a three-dimensional tensor with a one here. We just want to return a two-dimensional tensor
00:33:43.520 | exactly as we did before. So in this case, basically, we will just say x equals x.squeeze,
00:33:49.600 | that is a PyTorch function. And squeeze optionally takes a dimension:
00:33:59.280 | it either squeezes out all the dimensions of a tensor that are one, or you can
00:34:05.520 | specify the exact dimension that you want to be squeezed. And again, I like to be as explicit as
00:34:11.520 | possible always. So I expect to squeeze out the first dimension only of this tensor, this
00:34:18.240 | three-dimensional tensor. And if this dimension here is one, then I just want to return b by
00:34:22.960 | c times n. And so self.out will be x, and then we return self.out. So that's the candidate implementation.
00:34:31.040 | And of course, this should be self.n instead of just n. So let's run. And let's come here now
00:34:38.000 | and take it for a spin. So flatten consecutive. And in the beginning, let's just use eight.
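Putting it together, the FlattenConsecutive module looks roughly like this (a sketch, with the self.n fix applied):

```python
class FlattenConsecutive:

    def __init__(self, n):
        # number of consecutive elements to concatenate in the last dimension
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:
            # squeeze out the spurious dimension of 1
            x = x.squeeze(1)
        self.out = x
        return self.out

    def parameters(self):
        return []

# FlattenConsecutive(8) packs all 8 characters at once, i.e. the old Flatten behavior
```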
00:34:47.680 | So this should recover the previous behavior. So flatten consecutive of eight, which is the
00:34:52.960 | current block size. We can do this. That should recover the previous behavior. So we should be
00:35:00.400 | able to run the model. And here we can inspect. I have a little code snippet here where I iterate
00:35:07.680 | over all the layers. I print the name of this class and the shape. And so we see the shapes
00:35:16.960 | as we expect them after every single layer in its output. So now let's try to restructure it
00:35:22.720 | using our flatten consecutive and do it hierarchically. So in particular,
00:35:27.440 | we want to flatten consecutive not block size, but just two. And then we want to process this
00:35:34.400 | with linear. Now the number of inputs to this linear will not be n embed times block size.
00:35:39.600 | It will now only be n embed times two, 20. This goes through the first layer. And now we can,
00:35:47.040 | in principle, just copy paste this. Now the next linear layer should expect n hidden times two.
00:35:52.800 | And the last piece of it should expect n hidden times two again.
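As a sketch, the restructured layer list looks roughly like this (bias=False on the linears that feed into batch norm follows the earlier lectures and is an assumption here):

```python
model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),  # output layer, with its weights scaled down at init as before
])
```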
00:35:59.680 | So this is sort of like the naive version of it. So running this, we now have a much,
00:36:07.280 | much bigger model. And we should be able to basically just forward the model.
00:36:11.760 | And now we can inspect the numbers in between. So 4 by 8 by 10 was flattened consecutively into 4
00:36:21.040 | by 4 by 20. This was projected into 4 by 4 by 200. And then batch norm just worked out of the box. And
00:36:30.160 | we have to verify that batch norm does the correct thing, even though it takes a three-dimensional
00:36:33.600 | input instead of a two-dimensional input. Then we have tanh, which is element-wise. Then we crushed
00:36:39.600 | it again. So we flattened consecutively and ended up with a 4 by 2 by 400 now. Then linear brought
00:36:46.000 | it back down to 200, then batch norm, then tanh. And lastly, we get a 4 by 400. And we see that the flatten
00:36:52.000 | consecutive for the last flatten here squeezed out that dimension of one. So we only
00:36:57.520 | ended up with 4 by 400. And then linear, batch norm, tanh, and the last linear layer to get our logits.
00:37:04.880 | And so the logits end up in the same shape as they were before. But now we actually have a nice
00:37:09.600 | three-layer neural net. And it basically corresponds to-- whoops, sorry. It basically
00:37:14.880 | corresponds exactly to this network now, except only this piece here, because we only have three
00:37:20.080 | layers. Whereas here in this example, there's four layers with a total receptive field size of
00:37:27.440 | 16 characters instead of just eight characters. So the block size here is 16. So this piece of it
00:37:33.920 | is basically implemented here. Now we just have to figure out some good channel numbers to use here.
00:37:41.200 | Now in particular, I changed the number of hidden units to be 68 in this architecture,
00:37:46.240 | because when I use 68, the number of parameters comes out to be 22,000. So that's exactly the
00:37:51.280 | same that we had before. And we have the same amount of capacity at this neural net in terms
00:37:55.760 | of the number of parameters. But the question is whether we are utilizing those parameters in a
00:37:59.600 | more efficient architecture. So what I did then is I got rid of a lot of the debugging cells here,
00:38:05.440 | and I reran the optimization. And scrolling down to the result, we see that we get the identical
00:38:11.360 | performance roughly. So our validation loss now is 2.029, and previously it was 2.027.
00:38:17.600 | So controlling for the number of parameters, changing from the flat to hierarchical is not
00:38:21.200 | giving us anything yet. That said, there are two things to point out. Number one, we didn't really
00:38:27.600 | torture the architecture here very much. This is just my first guess. And there's a bunch of
00:38:32.240 | hyperparameter search that we could do in terms of how we allocate our budget of parameters to
00:38:38.000 | what layers. Number two, we still may have a bug inside the BatchNorm1D layer. So let's take a look
00:38:44.480 | at that, because it runs, but does it do the right thing? So I pulled up the layer inspector that we
00:38:53.840 | have here and printed out the shapes along the way. And currently it looks like the BatchNorm is
00:38:58.240 | receiving an input that is 32 by 4 by 68. And here on the right, I have the current implementation of
00:39:04.880 | BatchNorm that we have right now. Now, this BatchNorm assumed, in the way we wrote it at the time,
00:39:10.400 | that x is two-dimensional. So it was n by d, where n was the batch size. So that's why we only reduced
00:39:17.920 | the mean and the variance over the zeroth dimension. But now x will basically become
00:39:22.000 | three-dimensional. So what's happening inside the BatchNorm layer right now, and how come it's
00:39:25.680 | working at all and not giving any errors? The reason for that is basically because everything
00:39:30.000 | broadcasts properly, but the BatchNorm is not doing what we want it to do. So in particular,
00:39:36.400 | let's basically think through what's happening inside the BatchNorm, looking at what's happening
00:39:41.760 | here. I have the code here. So we're receiving an input of 32 by 4 by 68. And then we are doing
00:39:50.480 | here, x dot mean, here I have e instead of x, but we're doing the mean over zero. And that's
00:39:57.360 | actually given us 1 by 4 by 68. So we're doing the mean only over the very first dimension.
00:40:02.480 | And it's given us a mean and a variance that still maintain this dimension here. So these
00:40:08.240 | means are only taking over 32 numbers in the first dimension. And then when we perform this,
00:40:13.200 | everything broadcasts correctly still. But basically what ends up happening is
00:40:18.480 | when we also look at the running mean,
00:40:22.160 | the shape of it. So I'm looking at model.layers[3], which is the first BatchNorm
00:40:29.280 | layer, and then looking at whatever the running mean became and its shape. The shape of this
00:40:34.720 | running mean now is 1 by 4 by 68, instead of it being just of size 68. Because we have
00:40:43.520 | 68 channels, we expect to have 68 means and variances that we're maintaining. But actually,
00:40:48.800 | we have an array of 4 by 68. And so basically what this is telling us is this BatchNorm is currently
00:40:56.320 | working in parallel over 4 times 68 instead of just 68 channels. So basically, we are maintaining
00:41:08.400 | statistics for every one of these four positions individually and independently. And instead,
00:41:14.160 | what we want to do is we want to treat this 4 as a batch dimension, just like the 0th dimension.
00:41:19.200 | So as far as the BatchNorm is concerned, we don't want to average over 32 numbers,
00:41:26.240 | we want to now average over 32 times 4 numbers for every single one of these 68 channels.
00:41:31.760 | So let me now remove this. It turns out that when you look at the documentation of torch.mean,
00:41:40.000 | in one of its signatures, when we specify the dimension,
00:41:53.120 | we see that the dimension here is not just an int, it can also be a tuple of ints.
00:41:57.520 | So we can reduce over multiple integers at the same time, over multiple dimensions at the same
00:42:03.040 | time. So instead of just reducing over 0, we can pass in a tuple, 0, 1, and here 0, 1 as well.
00:42:09.760 | And then what's going to happen is the output, of course, is going to be the same.
00:42:13.040 | But now what's going to happen is because we reduce over 0 and 1, if we look at the mean's shape,
00:42:20.080 | we see that now we've reduced, we took the mean over both the 0th and the 1st dimension.
00:42:25.760 | So we're just getting 68 numbers and a bunch of spurious dimensions here.
00:42:29.840 | So now this becomes 1 by 1 by 68, and the running mean and the running variance,
00:42:35.920 | analogously, will become 1 by 1 by 68. So even though there are the spurious dimensions,
00:42:40.320 | the correct thing will happen in that we are only maintaining means and variances for 68 channels.
00:42:49.520 | And we're now calculating the mean and variance across 32 times 4 dimensions.
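A quick sketch of that check:

```python
e = torch.randn(32, 4, 68)             # same shape as the input to this BatchNorm layer
emean = e.mean((0, 1), keepdim=True)   # reduce over both batch dimensions
emean.shape                            # torch.Size([1, 1, 68]): one mean per channel
```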
00:42:54.080 | So that's exactly what we want. And let's change the implementation of BatchNorm1D that we have
00:42:59.680 | so that it can take in two-dimensional or three-dimensional inputs and perform accordingly.
00:43:05.120 | So at the end of the day, the fix is relatively straightforward. Basically, the dimension we
00:43:09.040 | want to reduce over is either 0 or the tuple 0 and 1, depending on the dimensionality of x.
00:43:15.280 | So if x.ndim is 2, so it's a two-dimensional tensor, then the dimension we want to reduce over
00:43:20.480 | is just the integer 0. If x.ndim is 3, so it's a three-dimensional tensor, then the dims we're
00:43:26.960 | going to assume are 0 and 1 that we want to reduce over. And then here, we just pass in dim.
00:43:32.880 | And if the dimensionality of x is anything else, we'll now get an error, which is good.
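The relevant part of the forward pass then looks roughly like this sketch:

```python
# inside the BatchNorm1d forward pass, while in training mode
if x.ndim == 2:
    dim = 0          # (N, C): average over the batch dimension only
elif x.ndim == 3:
    dim = (0, 1)     # (N, L, C): treat both leading dimensions as batch dimensions
# any other dimensionality leaves dim undefined and raises an error, which is what we want
xmean = x.mean(dim, keepdim=True)  # per-channel mean
xvar = x.var(dim, keepdim=True)    # per-channel variance
```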
00:43:36.640 | So that should be the fix. Now I want to point out one more thing. We're actually departing
00:43:43.200 | from the API of PyTorch here a little bit. Because when you come to BatchNorm1D in PyTorch,
00:43:47.920 | you can scroll down and you can see that the input to this layer can either be n by c,
00:43:53.600 | where n is the batch size and c is the number of features or channels, or it actually does
00:43:57.920 | accept three-dimensional inputs, but it expects it to be n by c by l, where l is, say, the sequence
00:44:04.000 | length or something like that. So this is a problem because you see how c is nested here
00:44:10.160 | in the middle. And so when it gets three-dimensional inputs, this BatchNorm layer will reduce over 0
00:44:16.480 | and 2 instead of 0 and 1. So basically, PyTorch BatchNorm1D layer assumes that c will always be
00:44:24.160 | the first dimension, whereas we assume here that c is the last dimension and there are some number
00:44:31.120 | of batch dimensions beforehand. And so it expects n by c or n by c by l. We expect n by c or n by l
00:44:41.360 | by c. And so it's a deviation. I think it's okay. I prefer it this way, honestly, so this is the way
00:44:49.920 | that we will keep it for our purposes. So I redefined the layers, reinitialized the neural
00:44:54.160 | nut, and did a single forward pass with a break just for one step. Looking at the shapes along
00:44:59.520 | the way, they're of course identical. All the shapes are the same. But the way we see that
00:45:03.520 | things are actually working as we want them to now is that when we look at the BatchNorm layer,
00:45:07.760 | the running mean shape is now 1 by 1 by 68. So we're only maintaining 68 means for every one
00:45:13.440 | of our channels, and we're treating both the 0th and the first dimension as a batch dimension,
00:45:18.640 | which is exactly what we want. So let me retrain the neural net now. Okay, so I've retrained the
00:45:22.240 | neural net with the bug fix. We get a nice curve. And when we look at the validation performance,
00:45:26.240 | we do actually see a slight improvement. So it went from 2.029 to 2.022. So basically,
00:45:31.920 | the bug inside the BatchNorm was holding us back a little bit, it looks like. And we are getting a
00:45:37.680 | tiny improvement now, but it's not clear if this is statistically significant. And the reason we
00:45:43.360 | slightly expect an improvement is because we're not maintaining so many different means and
00:45:47.200 | variances that are only estimated using 32 numbers, effectively. Now we are estimating them using 32
00:45:53.440 | times 4 numbers. So you just have a lot more numbers that go into any one estimate of the
00:45:58.000 | mean and variance. And it allows things to be a bit more stable and less wiggly inside those
00:46:03.360 | estimates of those statistics. So pretty nice. With this more general architecture in place,
00:46:09.280 | we are now set up to push the performance further by increasing the size of the network.
00:46:13.920 | So for example, I've bumped up the number of embeddings to 24 instead of 10, and also increased
00:46:19.280 | the number of hidden units. But using the exact same architecture, we now have 76,000 parameters.
00:46:24.960 | And the training takes a lot longer, but we do get a nice curve. And then when you actually
00:46:29.600 | evaluate the performance, we are now getting validation performance of 1.993. So we've crossed
00:46:34.640 | over the 2.0 sort of territory, and we're at about 1.99. But we are starting to have to
00:46:40.400 | wait quite a bit longer. And we're a little bit in the dark with respect to the correct setting of
00:46:45.920 | the hyperparameters here and the learning rates and so on, because the experiments are starting
00:46:49.200 | to take longer to train. And so we are missing sort of like an experimental harness on which we
00:46:54.400 | could run a number of experiments and really tune this architecture very well. So I'd like to
00:47:04.160 | conclude now with a few notes. We basically improved our performance from a starting point of 2.1
00:47:04.160 | down to 1.9. But I don't want that to be the focus, because honestly, we're kind of in the dark,
00:47:08.800 | we have no experimental harness, we're just guessing and checking. And this whole thing is
00:47:12.960 | terrible. We're just looking at the training loss. Normally, you want to look at both the
00:47:16.640 | training and the validation loss together. The whole thing looks different if you're actually
00:47:20.960 | trying to squeeze out numbers. That said, we did implement this architecture from the WaveNet paper.
00:47:27.120 | But we did not implement this specific forward pass of it, where you have a more complicated
00:47:33.760 | linear layer, sort of this gated linear layer kind of thing. And there's residual connections
00:47:40.000 | and skip connections and so on. So we did not implement that, we just implemented this structure.
00:47:44.960 | I would like to briefly hint or preview how what we've done here relates to convolutional neural
00:47:49.600 | networks as used in the WaveNet paper. And basically, the use of convolutions is strictly
00:47:54.480 | for efficiency. It doesn't actually change the model we've implemented. So here, for example,
00:48:00.000 | let me look at a specific name to work with an example. So there's a name in our training set,
00:48:05.600 | and it's D'Andrea. And it has seven letters, so that is eight independent examples in our model.
00:48:12.000 | So all these rows here are independent examples of D'Andrea. Now, you can forward, of course,
00:48:17.920 | any one of these rows independently. So I can take my model and call it on any individual index.
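As a sketch of such a call (Xtr is the training tensor from the earlier parts; index 7 here is just an illustrative row of this name):

```python
logits = model(Xtr[[7]])  # note the list index: shape (1, 8) in, (1, vocab_size) out
```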
00:48:25.520 | Notice, by the way, here, I'm being a little bit tricky. The reason for this is that Xtr[7].shape
00:48:31.120 | is just a one-dimensional array of eight. So you can't actually call the model on it,
00:48:37.840 | you're going to get an error, because there's no batch dimension. So when you do Xtr at a
00:48:44.000 | list of seven, then the shape of this becomes one by eight. So I get an extra batch dimension of
00:48:50.800 | one, and then we can forward the model. So that forwards a single example. And you might imagine
00:48:57.920 | that you actually may want to forward all of these eight at the same time. So pre-allocating
00:49:04.640 | some memory and then doing a for loop eight times and forwarding all of those eight here will give
00:49:10.240 | us all the logits in all these different cases. Now, for us with the model as we've implemented
00:49:15.040 | it right now, this is eight independent calls to our model. But what convolutions allow you to do
00:49:20.400 | is they allow you to basically slide this model efficiently over the input sequence. And so
00:49:26.400 | this for loop can be done not outside in Python, but inside of kernels in CUDA. And so this for
00:49:33.200 | loop gets hidden into the convolution. So the convolution basically, you can think of it as
00:49:37.840 | it's a for loop, applying a little linear filter over space of some input sequence. And in our
00:49:44.480 | case, the space we're interested in is one dimensional, and we're interested in sliding
00:49:47.760 | these filters over the input data. So this diagram actually is fairly good as well. Basically,
00:49:55.760 | what we've done is here they are highlighting in black one single sort of like tree of this
00:50:01.200 | calculation. So just calculating the single output example here. And so this is basically
00:50:08.160 | what we've implemented here: we've implemented this single black structure and calculated
00:50:13.680 | a single output, a single example. But what convolutions allow you to do is
00:50:18.960 | they allow you to take this black structure and kind of like slide it over the input sequence
00:50:24.640 | here and calculate all of these orange outputs at the same time. Or here, that corresponds to
00:50:31.520 | calculating all of these outputs at all of the positions of DeAndre at the same time.
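As a rough sketch of the difference (using the lecture's `model` and `Xtr` from the notebook; the row indices, vocabulary size, and the Conv1d stack below are illustrative assumptions, not the model we actually built):

```python
import torch
import torch.nn as nn

# Naive version: eight independent forward passes, one per position of the word.
# Xtr[[i]] keeps the batch dimension, so each call sees a (1, 8) input.
logits = torch.zeros(8, 27)             # assuming a 27-character vocabulary
for i in range(8):
    logits[i] = model(Xtr[[7 + i]])[0]  # hypothetical rows 7..14 holding this word

# Convolutional version (illustrative): Conv1d layers with kernel_size=2 and
# doubling dilation implement the same tree of matrix multiplies, but slide it
# over the sequence and reuse the intermediate activations.
emb_dim, n_hidden, vocab_size = 24, 128, 27
conv_stack = nn.Sequential(
    nn.Conv1d(emb_dim, n_hidden, kernel_size=2, dilation=1), nn.Tanh(),
    nn.Conv1d(n_hidden, n_hidden, kernel_size=2, dilation=2), nn.Tanh(),
    nn.Conv1d(n_hidden, vocab_size, kernel_size=2, dilation=4),
)
x = torch.randn(1, emb_dim, 15)         # embeddings of a left-padded, 15-character sequence
out = conv_stack(x)                     # (1, vocab_size, 8): all eight outputs in one call
```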
00:50:37.440 | And the reason that this is much more efficient is because number one, as I mentioned, the for loop
00:50:43.920 | is inside the CUDA kernels doing the sliding. So that makes it efficient. But number two,
00:50:49.840 | notice the variable reuse here. For example, if we look at this circled node here,
00:50:54.400 | it is the right child of this node, but it's also the left child of the node over here.
00:51:00.240 | And so basically, this node and its value are used twice. And so right now, in this naive way,
00:51:08.000 | we'd have to recalculate it. But here we are allowed to reuse it. So in the convolutional
00:51:13.360 | neural network, you think of these linear layers that we have up above as filters. And we take
00:51:19.040 | these filters, and they're linear filters, and you slide them over the input sequence. And we calculate
00:51:24.000 | the first layer, and then the second layer, and then the third layer, and then the output layer
00:51:28.160 | of the sandwich. And it's all done very efficiently using these convolutions. So we're going to cover
00:51:33.280 | that in a future video. The second thing I hope you took away from this video is you've seen me
00:51:37.520 | basically implement all of these layer Lego building blocks or module building blocks.
00:51:43.680 | And I'm implementing them over here. And we've implemented a number of layers together.
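The common interface these blocks share is small: each layer (and each container) is callable and exposes its parameters, mirroring what torch.nn.Module gives you. A minimal sketch of the pattern (two representative pieces, not the full set from the lecture):

```python
import torch

class Tanh:
    # A parameter-free layer: just applies the nonlinearity and remembers its output.
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out
    def parameters(self):
        return []

class Sequential:
    # A container: chains layers and gathers all of their parameters in one list.
    def __init__(self, layers):
        self.layers = layers
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        self.out = x
        return self.out
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```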
00:51:47.600 | And we've also implemented these containers. And we've overall pytorchified our code quite a
00:51:53.680 | bit more. Now, basically, what we're doing here is we're re-implementing torch.nn, which is the
00:51:59.040 | neural networks library on top of torch.tensor. And it looks very much like this, except it is
00:52:05.520 | much better, because it's in pytorch instead of a janky Jupyter notebook. So I think
00:52:11.520 | going forward, I will consider us as having unlocked torch.nn. We understand roughly
00:52:17.600 | what's in there, how these modules work, how they're nested, and what they're doing on top
00:52:21.840 | of torch.tensor. So hopefully, we'll just switch over and continue and start using torch.nn directly.
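For example, the simple one-hidden-layer network from the start of the lecture could be written almost directly with torch.nn (sizes here are illustrative, and this is a sketch rather than the exact lecture code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, block_size, n_embd, n_hidden = 27, 8, 10, 200   # illustrative sizes

model = nn.Sequential(
    nn.Embedding(vocab_size, n_embd),          # (B, 8) -> (B, 8, 10)
    nn.Flatten(),                              # (B, 8, 10) -> (B, 80)
    nn.Linear(n_embd * block_size, n_hidden),  # (B, 80) -> (B, 200)
    nn.BatchNorm1d(n_hidden),
    nn.Tanh(),
    nn.Linear(n_hidden, vocab_size),           # (B, 200) -> (B, 27)
)

Xb = torch.randint(0, vocab_size, (32, block_size))   # dummy batch of character indices
Yb = torch.randint(0, vocab_size, (32,))
loss = F.cross_entropy(model(Xb), Yb)
```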
00:52:28.080 | The next thing I hope you got a bit of a sense of is what the development process of building
00:52:32.880 | deep neural networks looks like, which I think was relatively representative to some extent.
00:52:37.280 | So number one, we are spending a lot of time in the documentation page of pytorch. And we're
00:52:43.200 | reading through all the layers, looking at the documentation: what are the shapes of the
00:52:47.040 | inputs, what can they be, what does the layer do, and so on. Unfortunately, I have to say the
00:52:52.800 | pytorch documentation is not very good. They spend a ton of time on hardcore engineering of all kinds
00:52:59.200 | of distributed primitives, etc. But as far as I can tell, no one is maintaining the documentation.
00:53:04.240 | It will lie to you, it will be wrong, it will be incomplete, it will be unclear. So unfortunately,
00:53:11.440 | it is what it is, and you just kind of do your best with what they've given us. Number two,
00:53:19.120 | the other thing that I hope you got a sense of is there's a ton of trying to make the shapes work.
00:53:25.840 | And there's a lot of gymnastics around these multi-dimensional arrays. And are they two
00:53:29.120 | dimensional, three dimensional, four dimensional? What layers take what shapes? Is it NCL or NLC?
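To make that concrete, here is the kind of juggling this refers to (a small self-contained example with illustrative shapes): Conv1d and the 3-D form of BatchNorm1d want channels in the second position (NCL), while the embedding hands us channels last (NLC).

```python
import torch
import torch.nn as nn

N, L, C = 32, 8, 24                  # batch, sequence length, channels (illustrative)
x = torch.randn(N, L, C)             # NLC: how the embedding layer hands it to us

conv = nn.Conv1d(C, 64, kernel_size=2)
y = conv(x.permute(0, 2, 1))         # Conv1d wants NCL, so permute to (N, C, L)
y = y.permute(0, 2, 1)               # back to (N, L-1, 64) for layers that want NLC

# Flattening pairs of consecutive positions is a view, not a permute:
z = x.view(N, L // 2, C * 2)         # (32, 8, 24) -> (32, 4, 48)
```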
00:53:35.520 | And you're permuting and viewing, and it just can get pretty messy. And so that brings me to number
00:53:41.120 | three. I very often prototype these layers and implementations in Jupyter Notebooks and make
00:53:45.760 | sure that all the shapes work out. And I'm spending a lot of time basically babysitting the shapes and
00:53:50.880 | making sure everything is correct. And then once I'm satisfied with the functionality in a Jupyter
00:53:55.120 | Notebook, I will take that code and copy-paste it into my repository of actual code that I'm
00:54:00.240 | training with. And so then I'm working with VS Code on the side. So I usually have a Jupyter Notebook
00:54:05.440 | and VS Code. I develop in the Jupyter Notebook, I paste into VS Code, and then I kick off experiments from
00:54:10.560 | the repo, of course, from the code repository. So that's roughly some notes on the development
00:54:16.320 | process of working with neural nets. Lastly, I think this lecture unlocks a lot of potential
00:54:20.800 | further lectures, because number one, we have to convert our neural network to actually use
00:54:25.040 | these dilated causal convolutional layers. So implementing the ConvNet. Number two,
00:54:30.960 | potentially starting to get into what this means: what are residual connections and skip
00:54:35.360 | connections and why are they useful? Number three, as I mentioned, we don't have any experimental
00:54:41.520 | harness. So right now I'm just guessing and checking everything. This is not representative of typical
00:54:45.840 | deep learning workflows. You have to set up your evaluation harness, you can kick off experiments,
00:54:50.800 | you have lots of arguments that your script can take, you're kicking off a lot of experimentation,
00:54:55.360 | you're looking at a lot of plots of training and validation losses, and you're looking at what is
00:54:59.360 | working and what is not working. And you're working at this sort of population level, and you're doing
00:55:03.360 | all these hyperparameter searches. And so we've done none of that so far. So how to set that up
00:55:09.520 | and how to make it good, I think is a whole other topic. And number four, we should probably cover
00:55:15.280 | recurrent neural networks, RNNs, LSTMs, GRUs, and of course, transformers. So many places to go,
00:55:22.560 | and we'll cover that in the future. For now, bye. Sorry, I forgot to say that if you are interested,
00:55:29.120 | I think it is kind of interesting to try to beat this number 1.993. Because I really haven't
00:55:34.560 | tried a lot of experimentation here, and there's quite a bit of low-hanging fruit potentially
00:55:38.160 | to still push this further. So I haven't tried any other ways of allocating these channels in
00:55:43.200 | this neural net. Maybe the number of dimensions for the embedding is all wrong. Maybe it's possible
00:55:49.200 | to actually take the original network, which is one hidden layer, and make it big enough and
00:55:53.360 | actually beat my fancy hierarchical network. It's not obvious. That would be kind of embarrassing
00:55:58.880 | if this did not do better, even once you torture it a little bit. Maybe you can read the WaveNet
00:56:03.680 | paper and try to figure out how some of these layers work and implement them yourselves using
00:56:07.280 | what we have. And of course, you can always tune some of the initialization or some of the
00:56:12.480 | optimization and see if you can improve it that way. So I'd be curious if people can come up with
00:56:17.120 | some ways to beat this. And yeah, that's it for now. Bye.