Building makemore Part 2: MLP

Hi everyone. Today we are continuing our implementation of Makemore. Now, in the last lecture, we implemented the bigram language model, and we implemented it both using counts and also using a super simple neural network that has a single linear layer. Now, this is the Jupyter Notebook that we built out last lecture, and we saw that the way we approached this is that we looked at only the single previous character, and we predicted the distribution for the character that would go next in the sequence.

And we did that by taking counts and normalizing them into probabilities so that each row here sums to 1. Now, this is all well and good if you only have one character of previous context. And this works, and it's approachable. The problem with this model, of course, is that the predictions from this model are not very good because you only take one character of context.

So the model didn't produce very name-like sounding things. Now, the problem with this approach, though, is that if we are to take more context into account when predicting the next character in a sequence, things quickly blow up. And this table, the size of this table, grows, and in fact it grows exponentially with the length of the context.

Because if we only take a single character at a time, that's 27 possibilities of context. But if we take two characters in the past and try to predict the third one, suddenly the number of rows in this matrix, you can look at it that way, is 27 times 27.

So there's 729 possibilities for what could have come in the context. If we take three characters as the context, suddenly we have 20,000 possibilities of context. And so that's just way too many rows of this matrix. It's way too few counts for each possibility. And the whole thing just kind of explodes and doesn't work very well.
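The growth described above can be checked with a few lines of arithmetic. This is only a sketch; the name `context_counts` is made up for illustration, and the vocabulary size of 27 (26 letters plus the dot token) is taken from the lecture.

```python
# Number of distinct contexts (rows of the count table) for a 27-symbol vocabulary.
vocab_size = 27

# Map context length k -> number of possible contexts, which is 27 ** k.
context_counts = {k: vocab_size ** k for k in range(1, 5)}

for k, n in context_counts.items():
    print(f"context length {k}: {n} possible contexts")
```

With three characters of context the table already has 19,683 rows, which is the "20,000 possibilities" mentioned above.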

So that's why today we're going to move on to this bullet point here. And we're going to implement a multilayer perceptron model to predict the next character in a sequence. And this modeling approach that we're going to adopt follows this paper, Bengio et al., 2003. So I have the paper pulled up here.

Now, this isn't the very first paper that proposed the use of multilayer perceptrons or neural networks to predict the next character or token in a sequence. But it's definitely one that was very influential around that time. It is very often cited to stand in for this idea. And I think it's a very nice write-up.

And so this is the paper that we're going to first look at and then implement. Now, this paper has 19 pages, so we don't have time to go into the full detail of this paper. But I invite you to read it. It's very readable, interesting, and has a lot of interesting ideas in it as well.

In the introduction, they described the exact same problem I just described. And then to address it, they proposed the following model. Now, keep in mind that we are building a character-level language model. So we're working on the level of characters. In this paper, they have a vocabulary of 17,000 possible words, and they instead build a word-level language model.

But we're going to still stick with the characters, but we'll take the same modeling approach. Now, what they do is basically they propose to take every one of these words, 17,000 words, and they're going to associate to each word a, say, 30-dimensional feature vector. So every word is now embedded into a 30-dimensional space.

You can think of it that way. So we have 17,000 points or vectors in a 30-dimensional space, and you might imagine that's very crowded. That's a lot of points for a very small space. Now, in the beginning, these words are initialized completely randomly, so they're spread out at random.

But then we're going to tune these embeddings of these words using backpropagation. So during the course of training of this neural network, these points or vectors are going to basically move around in this space. And you might imagine that, for example, words that have very similar meanings or that are indeed synonyms of each other might end up in a very similar part of the space, and conversely, words that mean very different things would go somewhere else in the space.

Now, their modeling approach otherwise is identical to ours. They are using a multilayer neural network to predict the next word, given the previous words, and to train the neural network, they are maximizing the log likelihood of the training data, just like we did. So the modeling approach itself is identical.

Now, here they have a concrete example of this intuition. Why does it work? Basically, suppose that, for example, you are trying to predict a dog was running in a blank. Now, suppose that the exact phrase "a dog was running in a" has never occurred in the training data. And here you are at sort of test time later, when the model is deployed somewhere, and it's trying to make a sentence, and it's saying "a dog was running in a blank." And because it's never encountered this exact phrase in the training set, you're out of distribution, as we say.

Like, you don't have fundamentally any reason to suspect what might come next. But this approach actually allows you to get around that, because maybe you didn't see the exact phrase "a dog was running in a" something, but maybe you've seen similar phrases. Maybe you've seen the phrase "the dog was running in a blank." And maybe your network has learned that "a" and "the" are, like, frequently are interchangeable with each other.

And so maybe it took the embedding for "a" and the embedding for "the," and it actually put them, like, nearby each other in the space. And so you can transfer knowledge through that embedding, and you can generalize in that way. Similarly, the network could know that cats and dogs are animals, and they co-occur in lots of very similar contexts.

So even though you haven't seen this exact phrase, or you haven't seen exactly "walking" or "running," you can, through the embedding space, transfer knowledge, and you can generalize to novel scenarios. So let's now scroll down to the diagram of the neural network. They have a nice diagram here. And in this example, we are taking three previous words, and we are trying to predict the fourth word in a sequence.

Now, these three previous words, as I mentioned, they have a vocabulary of 17,000 possible words. So every one of these basically are the index of the incoming word. And because there are 17,000 words, this is an integer between 0 and 16,999. Now, there's also a lookup table that they call C.

This lookup table is a matrix that is 17,000 by, say, 30. And basically what we're doing here is we're treating this as a lookup table. And so every index is plucking out a row of this embedding matrix so that each index is converted to the 30-dimensional vector that corresponds to the embedding vector for that word.

So here we have the input layer of 30 neurons for three words, making up 90 neurons in total. And here they're saying that this matrix C is shared across all the words. So we're always indexing into the same matrix C over and over for each one of these words.

Next up is the hidden layer of this neural network. The size of this hidden neural layer of this neural net is a hyperparameter. So we use the word hyperparameter when it's kind of like a design choice up to the designer of the neural net. And this can be as large as you'd like or as small as you'd like.

So, for example, the size could be 100. And we are going to go over multiple choices of the size of this hidden layer, and we're going to evaluate how well they work. So say there were 100 neurons here. All of them would be fully connected to the 90 words or 90 numbers that make up these three words.

So this is a fully connected layer. Then there's a tanh nonlinearity. And then there's this output layer. And because there are 17,000 possible words that could come next, this layer has 17,000 neurons, and all of them are fully connected to all of these neurons in the hidden layer. So there's a lot of parameters here because there's a lot of words.

So most computation is here. This is the expensive layer. Now, there are 17,000 logits here. So on top of there, we have the softmax layer, which we've seen in our previous video as well. So every one of these logits is exponentiated, and then everything is normalized to sum to 1 to have a nice probability distribution for the next word in the sequence.

Now, of course, during training, we actually have the label. We have the identity of the next word in the sequence. That word or its index is used to pluck out the probability of that word, and then we are maximizing the probability of that word with respect to the parameters of this neural net.

So the parameters are the weights and biases of this output layer. The weights and biases of this hidden layer, and the embedding lookup table C, and all of that is optimized using backpropagation. And these dashed arrows, ignore those. That represents a variation of a neural net that we are not going to explore in this video.
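To see why the output layer is the expensive one, here is a rough parameter count for the architecture as described. It is a back-of-the-envelope sketch: the hidden size of 100 is the example value used above, not necessarily what the paper used, and the dashed direct connections are ignored.

```python
# Rough parameter count for the word-level model described above.
vocab_size = 17_000   # words in the vocabulary
emb_dim = 30          # embedding dimensions per word
context = 3           # number of previous words fed in
hidden = 100          # assumption: example hidden-layer size from the text

params = {
    "embedding table C": vocab_size * emb_dim,
    "hidden layer (weights + biases)": context * emb_dim * hidden + hidden,
    "output layer (weights + biases)": hidden * vocab_size + vocab_size,
}
total = sum(params.values())

for name, n in params.items():
    print(f"{name}: {n:,} ({100 * n / total:.1f}% of total)")
print(f"total: {total:,}")
```

Under these assumptions the output layer holds roughly three quarters of all parameters.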

So that's the setup, and now let's implement it. Okay, so I started a brand new notebook for this lecture. We are importing PyTorch, and we are importing Matplotlib so we can create figures. Then I am reading all the names into a list of words like I did before, and I'm showing the first eight right here.

Keep in mind that we have 32,000 in total. These are just the first eight. And then here I'm building out the vocabulary of characters and all the mappings from the characters as strings to integers and vice versa. Now, the first thing we want to do is we want to compile the dataset for the neural network, and I had to rewrite this code.

I'll show you in a second what it looks like. So this is the code that I created for the dataset creation. So let me first run it, and then I'll briefly explain how this works. So first we're going to define something called block size, and this is basically the context length of how many characters do we take to predict the next one.

So here in this example, we're taking three characters to predict the fourth one, so we have a block size of three. That's the size of the block that supports the prediction. Then here I'm building out the x and y. The x are the input to the neural net, and the y are the labels for each example inside x.

Then I'm iterating over the first five words. I'm doing the first five just for efficiency while we are developing all the code, but then later we are going to come here and erase this so that we use the entire training set. So here I'm printing the word "Emma," and here I'm basically showing the examples that we can generate, the five examples that we can generate out of the single word "Emma." So when we are given the context of just dot, dot, dot, the first character in a sequence is E.

In this context, the label is M. When the context is this, the label is M, and so forth. So the way I build this out is first I start with a padded context of just zero tokens. Then I iterate over all the characters. I get the character in the sequence, and I basically build out the array y of this current character, and the array x, which stores the current running context.

Then here, see, I print everything, and here I crop the context and enter the new character in the sequence. So this is kind of like a rolling window of context. Now we can change the block size here to, for example, four, and in that case we would be predicting the fifth character given the previous four.

Or it can be five, and then it would look like this. Or it can be, say, ten, and then it would look something like this. We're taking ten characters to predict the eleventh one, and we're always padding with dots. So let me bring this back to three just so that we have what we have here in the paper.

And finally, the data set right now looks as follows. From these five words, we have created a data set of 32 examples, and each input to the neural net is three integers, and we have a label that is also an integer, y. So x looks like this. These are the individual examples.
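The dataset construction described above can be sketched roughly as follows. This is a reconstruction, not the notebook's exact code: the five words are hard-coded here as an assumption (only "emma" is named in the lecture), and the character mapping is built from the lowercase alphabet instead of from the full names file.

```python
import string
import torch

# Assumption: stand-in for the first five names of the dataset.
words = ["emma", "olivia", "ava", "isabella", "sophia"]

# Character <-> integer mappings; index 0 is reserved for the '.' padding/end token.
stoi = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
stoi["."] = 0
itos = {i: ch for ch, i in stoi.items()}

block_size = 3  # context length: how many characters are used to predict the next one

X, Y = [], []
for w in words:
    context = [0] * block_size           # start with a context of all '.' tokens
    for ch in w + ".":
        ix = stoi[ch]
        X.append(context)                # current rolling context is the input
        Y.append(ix)                     # the next character is the label
        print("".join(itos[i] for i in context), "--->", itos[ix])
        context = context[1:] + [ix]     # crop the oldest character, append the new one

X = torch.tensor(X)
Y = torch.tensor(Y)
print(X.shape, Y.shape)
```

Changing `block_size` to 4, 5, or 10 changes only the width of `X`; the number of examples stays the same because each character still produces exactly one example.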

And then y are the labels. So given this, let's now write a neural network that takes these x's and predicts the y's. First, let's build the embedding lookup table C. So we have 27 possible characters, and we're going to embed them in a lower-dimensional space. In the paper, they have 17,000 words, and they embed them in spaces as small-dimensional as 30.

So they cram 17,000 words into 30-dimensional space. In our case, we have only 27 possible characters, so let's cram them in something as small as, to start with, for example, a two-dimensional space. So this lookup table will be random numbers, and we'll have 27 rows, and we'll have two columns.

So each one of 27 characters will have a two-dimensional embedding. So that's our matrix C of embeddings, in the beginning, initialized randomly. Now, before we embed all of the integers inside the input x using this lookup table C, let me actually just try to embed a single individual integer, like, say, 5.

So we get a sense of how this works. Now, one way this works, of course, is we can just take the C, and we can index into row 5, and that gives us a vector, the fifth row of C. And this is one way to do it. The other way that I presented in the previous lecture is actually seemingly different, but actually identical.

So in the previous lecture, what we did is we took these integers, and we used the one-hot encoding to first encode them. So F.one_hot, we want to encode integer 5, and we want to tell it that the number of classes is 27. So that's the 27-dimensional vector of all zeros, except the bit at index 5 is turned on.

Now, this actually doesn't work. The reason is that this input actually must be a torch.tensor. And I'm making some of these errors intentionally, just so you get to see some errors and how to fix them. So this must be a tensor, not an int. Fairly straightforward to fix.

We get a one-hot vector. The fifth dimension is 1, and the shape of this is 27. And now notice that, just as I briefly alluded to in a previous video, if we take this one-hot vector and we multiply it by C, then what would you expect? Well, number one, first you'd expect an error, because expected scalar type long, but found float.

So a little bit confusing, but the problem here is that one-hot, the data type of it, is long. It's a 64-bit integer, but this is a float tensor. And so PyTorch doesn't know how to multiply an int with a float, and that's why we had to explicitly cast this to a float, so that we can multiply.

Now, the output actually here is identical. And it's identical because of the way the matrix multiplication here works. We have the one-hot vector multiplying columns of C, and because of all the zeros, they actually end up masking out everything in C except for the fifth row, which is plucked out.

And so we actually arrive at the same result. And that tells you that here we can interpret this first piece here, this embedding of the integer, we can either think of it as the integer indexing into lookup table C, but equivalently we can also think of this little piece here as a first layer of this bigger neural net.

This layer here has neurons that have no nonlinearity. There's no tanh. They're just linear neurons. And their weight matrix is C. And then we are encoding integers into one-hot and feeding those into a neural net. And this first layer basically embeds them. Those are two equivalent ways of doing the same thing.
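The equivalence between indexing and a one-hot matrix multiply can be checked directly. A minimal sketch, assuming a randomly initialized 27 by 2 table; the seed is arbitrary and only there for reproducibility.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)     # arbitrary seed
C = torch.randn((27, 2), generator=g)    # embedding lookup table, 27 characters x 2 dims

# Way 1: plain indexing plucks out row 5.
by_index = C[5]

# Way 2: one-hot encode 5, cast to float (one_hot returns int64), multiply by C.
one_hot = F.one_hot(torch.tensor(5), num_classes=27).float()
by_matmul = one_hot @ C

print(one_hot.shape)                          # a 27-dimensional vector
print(torch.allclose(by_index, by_matmul))    # both routes give the same embedding
```

Without the `.float()` cast the multiplication raises the dtype error discussed above, since the one-hot vector is a 64-bit integer tensor and `C` is a float tensor.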

We're just going to index because it's much, much faster, and we're going to discard this interpretation of one-hot inputs into neural nets, and we're just going to index integers and use embedding tables. Now, embedding a single integer like 5 is easy enough. We can simply ask PyTorch to retrieve the fifth row of C, or the row index 5 of C.

But how do we simultaneously embed all of these 32 by 3 integers stored in array X? Luckily, PyTorch indexing is fairly flexible and quite powerful. So it doesn't just work to ask for a single element 5 like this. You can actually index using lists. So, for example, we can get the rows 5, 6, and 7, and this will just work like this.

We can index with a list. It doesn't just have to be a list. It can also be actually a tensor of integers, and we can index with that. So this is an integer tensor of 5, 6, 7, and this will just work as well. In fact, we can also, for example, repeat row 7 and retrieve it multiple times, and that same index will just get embedded multiple times here.

So here we are indexing with a one-dimensional tensor of integers, but it turns out that you can also index with multi-dimensional tensors of integers. Here we have a two-dimensional tensor of integers. So we can simply just do C at X, and this just works. And the shape of this is 32 by 3 by 2: the 32 by 3 is the original shape of X, and now for every one of those 32 by 3 integers, we've retrieved the two-dimensional embedding vector here.

So basically we have that as an example. The 13th--or example index 13, the second dimension, is the integer 1, as an example. And so here, if we do C of x, which gives us that array, and then we index into 13 by 2 of that array, then we get the embedding here.

And you can verify that C at 1, which is the integer at that location, is indeed equal to this. You see they're equal. So basically, long story short, PyTorch indexing is awesome, and to embed simultaneously all of the integers in x, we can simply do C of x, and that is our embedding, and that just works.
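The indexing variants walked through above, gathered in one place. This is a sketch: `X` here is a random stand-in with the same shape and dtype as the real dataset tensor, so the specific integers differ from the notebook.

```python
import torch

g = torch.Generator().manual_seed(0)                # arbitrary seed
C = torch.randn((27, 2), generator=g)               # embedding table
X = torch.randint(0, 27, (32, 3), generator=g)      # assumption: stand-in for the real inputs

rows_from_list = C[[5, 6, 7]]                       # index with a Python list -> shape (3, 2)
rows_from_tensor = C[torch.tensor([5, 6, 7, 7])]    # index with a tensor, row 7 repeated -> (4, 2)
emb = C[X]                                          # index with a 2-D tensor -> shape (32, 3, 2)

print(rows_from_list.shape, rows_from_tensor.shape, emb.shape)

# The embedding stored at position [13, 2] is the row of C selected by the integer X[13, 2].
print(torch.equal(emb[13, 2], C[X[13, 2]]))
```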

Now let's construct this layer here, the hidden layer. So we have that w1, as I'll call it, are these weights, which we will initialize randomly. Now the number of inputs to this layer is going to be 3 times 2, right? Because we have two-dimensional embeddings, and we have three of them, so the number of inputs is 6.

And the number of neurons in this layer is a variable up to us. Let's use 100 neurons as an example. And then biases will be also initialized randomly, as an example, and we just need 100 of them. Now the problem with this is we can't simply-- normally we would take the input-- in this case, that's embedding-- and we'd like to multiply it with these weights, and then we would like to add the bias.

This is roughly what we want to do. But the problem here is that these embeddings are stacked up in the dimensions of this input tensor. So this will not work, this matrix multiplication, because this is a shape 32 by 3 by 2, and I can't multiply that by 6 by 100.

So somehow we need to concatenate these inputs here together so that we can do something along these lines, which currently does not work. So how do we transform this 32 by 3 by 2 into a 32 by 6 so that we can actually perform this multiplication over here? I'd like to show you that there are usually many ways of implementing what you'd like to do in Torch, and some of them will be faster, better, shorter, etc.

And that's because Torch is a very large library, and it's got lots and lots of functions. So if we just go to the documentation and click on Torch, you'll see that my slider here is very tiny, and that's because there are so many functions that you can call on these tensors to transform them, create them, multiply them, add them, perform all kinds of different operations on them.

And so this is kind of like the space of possibility, if you will. Now, one of the things that you can do is we can Ctrl+F for concatenate, and we see that there's a function, Torch.cat, short for concatenate. And this concatenates a given sequence of tensors in a given dimension, and these tensors must have the same shape, etc.

So we can use the concatenate operation to, in a naive way, concatenate these three embeddings for each input. So in this case, we have emb with shape 32 by 3 by 2. And really what we want to do is we want to retrieve these three parts and concatenate them. So we want to grab all the examples.

We want to grab first the 0th index and then all of this. So this plucks out the 32x2 embeddings of just the first word here. And so basically we want this guy, we want the one at index 1, and we want the one at index 2. And these are the three pieces individually.

And then we want to treat this as a sequence, and we want to torch.cat on that sequence. So this is the list. torch.cat takes a sequence of tensors, and then we have to tell it along which dimension to concatenate. So in this case, all of these are 32x2, and we want to concatenate not across dimension 0, but across dimension 1.

So passing in 1 gives us the result that the shape of this is 32x6, exactly as we'd like. So that basically took the 32x3x2 tensor and squashed it by concatenating the three embeddings into 32x6. Now, this is kind of ugly because this code would not generalize if we want to later change the block size.

Right now we have three inputs, three words. But what if we had five? Then here we would have to change the code because I'm indexing directly. Well, torch comes to the rescue again because there turns out to be a function called unbind, and it removes a tensor dimension. So it removes a tensor dimension and returns a tuple of all slices along a given dimension without it.

So this is exactly what we need. And basically, when we call torch.unbind of emb and pass in dimension 1, index 1, this gives us a list of tensors exactly equivalent to this. So running this gives us a length of 3, and it's exactly this list. And so we can call torch.cat on it, along dimension 1.

And this works, and this shape is the same. But now it doesn't matter if we have block size 3 or 5 or 10. This will just work. So this is one way to do it. But it turns out that in this case, there's actually a significantly better and more efficient way.
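Both concatenation approaches side by side, as a sketch. The tensor `emb` is random stand-in data with the 32 by 3 by 2 shape from the lecture.

```python
import torch

g = torch.Generator().manual_seed(0)            # arbitrary seed
emb = torch.randn((32, 3, 2), generator=g)      # assumption: stand-in for C[X]

# Hard-coded version: only works for a block size of 3.
manual = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)

# General version: unbind splits along dimension 1 into a tuple of (32, 2) tensors,
# however many there are, and cat joins them back along dimension 1.
pieces = torch.unbind(emb, 1)
general = torch.cat(pieces, dim=1)

print(len(pieces), manual.shape, general.shape)
print(torch.equal(manual, general))
```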

And this gives me an opportunity to hint at some of the internals of torch.Tensor. So let's create an array here of elements from 0 to 17. And the shape of this is just 18. It's a single vector of 18 numbers. It turns out that we can very quickly re-represent this as different sized n-dimensional tensors.

We do this by calling a view. And we can say that actually this is not a single vector of 18. This is a 2 by 9 tensor. Or alternatively, this is a 9 by 2 tensor. Or this is actually a 3 by 3 by 2 tensor. As long as the total number of elements here multiply to be the same, this will just work.

And in PyTorch, this operation, calling that view, is extremely efficient. And the reason for that is that in each tensor, there's something called the underlying storage. And the storage is just the numbers, always as a one-dimensional vector. And this is how this tensor is represented in the computer memory.

It's always a one-dimensional vector. But when we call that view, we are manipulating some of the attributes of that tensor that dictate how this one-dimensional sequence is interpreted to be an n-dimensional tensor. And so what's happening here is that no memory is being changed, copied, moved, or created when we call that view.

The storage is identical. But when you call that view, some of the internal attributes of the view of this tensor are being manipulated and changed. In particular, there's something called storage offset, strides, and shapes. And those are manipulated so that this one-dimensional sequence of bytes is seen as different n-dimensional arrays.
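A small experiment makes the "no memory is copied" claim concrete. This is a sketch using the same 18-element example; `data_ptr`, `stride`, and `storage_offset` are the tensor methods that expose the attributes mentioned above.

```python
import torch

a = torch.arange(18)        # one-dimensional vector of 18 numbers
b = a.view(2, 9)            # same numbers seen as 2 x 9
c = a.view(3, 3, 2)         # same numbers seen as 3 x 3 x 2

# All three tensors point at the same underlying memory.
print(a.data_ptr() == b.data_ptr() == c.data_ptr())

# Only the metadata differs: shape and strides say how to walk the 1-D storage.
print(a.shape, a.stride(), a.storage_offset())
print(b.shape, b.stride(), b.storage_offset())
print(c.shape, c.stride(), c.storage_offset())

# Writing through one view is visible through the others, since nothing was copied.
c[0, 0, 0] = 100
print(a[0].item(), b[0, 0].item())
```

The product of the requested sizes has to equal the number of elements; something like `a.view(4, 5)` raises an error.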

There's a blog post here from Edward Yang called PyTorch Internals, where he goes into some of this with respect to tensor and how the view of a tensor is represented. And this is really just like a logical construct of representing the physical memory. And so this is a pretty good blog post that you can go into.

I might also create an entire video on the internals of torch.Tensor and how this works. For here, we just note that this is an extremely efficient operation. And if I delete this and come back to our emb, we see that the shape of our emb is 32x3x2. But we can simply ask for PyTorch to view this instead as a 32x6.

And the way this gets flattened into a 32x6 array just happens that these two get stacked up in a single row. And so that's basically the concatenation operation that we're after. And you can verify that this actually gives the exact same result as what we had before. So this is an element-wise equals, and you can see that all the elements of these two tensors are the same.

And so we get the exact same result. So long story short, we can actually just come here, and if we just view this as a 32x6 instead, this multiplication will work and give us the hidden states that we're after. So if this is h, then h.shape is now 32 by 100: the 100-dimensional activations for every one of our 32 examples.

And this gives the desired result. Let me do two things here. Number one, let's not use 32. We can, for example, do something like emb.shape[0] so that we don't hard-code these numbers. And this would work for any size of this emb. Or alternatively, we can also do -1. When we do -1, PyTorch will infer what this should be.

Because the number of elements must be the same, and we're saying that this is 6, PyTorch will derive that this must be 32, or whatever else it is if emb is of different size. The other thing I'd like to point out is that when we do the concatenation, this actually is much less efficient because this concatenation would create a whole new tensor with a whole new storage.

So new memory is being created because there's no way to concatenate tensors just by manipulating the view attributes. So this is inefficient and creates all kinds of new memory. So let me delete this now. We don't need this. And here to calculate h, we want to also take torch.tanh of this to get our h.

So these are now numbers between -1 and 1 because of the tanh. And we have that the shape is 32 by 100. And that is basically this hidden layer of activations here for every one of our 32 examples. Now there's one more thing I've glossed over that we have to be very careful with, and that's this plus here.

In particular, we want to make sure that the broadcasting will do what we like. The shape of this is 32 by 100, and b1's shape is 100. So we see that the addition here will broadcast these two. And in particular, we have 32 by 100 broadcasting with 100. So broadcasting will align on the right, create a fake dimension here.

So this will become a 1 by 100 row vector. And then it will copy vertically for every one of these rows of 32 and do an element-wise addition. So in this case, the correct thing will be happening because the same bias vector will be added to all the rows of this matrix.
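Putting the hidden layer together: a sketch with random stand-in tensors, using the names `emb`, `W1`, and `b1` from the lecture. It checks that the view matches the concatenation and that the bias broadcast adds the same vector to every row.

```python
import torch

g = torch.Generator().manual_seed(0)            # arbitrary seed
emb = torch.randn((32, 3, 2), generator=g)      # assumption: stand-in for C[X]
W1 = torch.randn((6, 100), generator=g)         # 3 * 2 inputs, 100 hidden neurons
b1 = torch.randn(100, generator=g)

# View as (N, 6) without hard-coding 32; -1 lets PyTorch infer the first dimension.
flat = emb.view(-1, 6)
same_as_cat = torch.equal(flat, torch.cat(torch.unbind(emb, 1), dim=1))

pre = flat @ W1                                 # (32, 100)
h = torch.tanh(pre + b1)                        # b1 of shape (100,) broadcasts over the rows

# Spell out what broadcasting does: (100,) -> (1, 100) -> copied down all 32 rows.
explicit = pre + b1.view(1, 100).expand(32, 100)
broadcast_ok = torch.allclose(pre + b1, explicit)

print(flat.shape, h.shape, same_as_cat, broadcast_ok)
```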

So that is correct. That's what we'd like. And it's always good practice to just make sure so that you don't shoot yourself in the foot. And finally, let's create the final layer here. So let's create W2 and b2. The input now is 100, and the output number of neurons will be for us 27 because we have 27 possible characters that come next.

So the biases will be 27 as well. So therefore, the logits, which are the outputs of this neural net, are going to be h multiplied by W2 plus b2. logits.shape is 32 by 27, and the logits look good. Now, exactly as we saw in the previous video, we want to take these logits, and we want to first exponentiate them to get our fake counts, and then we want to normalize them into a probability.

So prob is counts divided by counts.sum along dimension 1 with keepdim set to True, exactly as in the previous video. And so prob.shape now is 32 by 27, and you'll see that every row of prob sums to 1, so it's normalized. So that gives us the probabilities.

Now, of course, we have the actual letter that comes next, and that comes from this array y, which we created during the dataset creation. So y is this last piece here, which is the identity of the next character in the sequence that we'd like to now predict. So what we'd like to do now is, just as in the previous video, we'd like to index into the rows of prob, and in each row, we'd like to pluck out the probability assigned to the correct character, as given here.

So first, we have torch.arange of 32, which is kind of like an iterator over numbers from 0 to 31, and then we can index into prob in the following way. Prob at torch.arange of 32, which iterates the rows, and then in each row, we'd like to grab this column, as given by Y.

So this gives the current probabilities as assigned by this neural network with this setting of its weights to the correct character in the sequence. And you can see here that this looks okay for some of these characters, like this is basically 0.2, but it doesn't look very good at all for many other characters.

Like this is 0.0701 probability, and so the network thinks that some of these are extremely unlikely. But of course, we haven't trained the neural network yet. This will improve, and ideally, all of these numbers here, of course, are 1, because then we are correctly predicting the next character. Now, just as in the previous video, we want to take these probabilities, we want to look at the log probability, and then we want to look at the average log probability and the negative of it to create the negative log likelihood loss.

So the loss here is 17, and this is the loss that we'd like to minimize to get the network to predict the correct character in the sequence. Okay, so I rewrote everything here and made it a bit more respectable. So here's our dataset. Here's all the parameters that we defined.

I'm now using a generator to make it reproducible. I clustered all the parameters into a single list of parameters so that, for example, it's easy to count them and see that in total we currently have about 3,400 parameters. And this is the forward pass as we developed it, and we arrive at a single number here, the loss, that is currently expressing how well this neural network works with the current setting of parameters.

Now I would like to make it even more respectable. So in particular, see these lines here where we take the logits and we calculate the loss. We're not actually reinventing the wheel here. This is just classification, and many people use classification, and that's why there is a functional cross_entropy function in PyTorch, F.cross_entropy, to calculate this much more efficiently.

So we could just simply call F.cross_entropy, and we can pass in the logits, and we can pass in the array of targets, Y, and this calculates the exact same loss. So in fact, we can simply put this here and erase these three lines, and we're going to get the exact same result.
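The hand-rolled loss and the built-in one can be compared directly. A sketch on random stand-in logits and targets, so the loss value itself will not match the notebook's.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)                # arbitrary seed
logits = torch.randn((32, 27), generator=g)         # assumption: stand-in for h @ W2 + b2
Y = torch.randint(0, 27, (32,), generator=g)        # assumption: stand-in for the real labels

# Hand-rolled version: exponentiate, normalize each row, pluck out the probability
# of the correct class, then take the mean negative log likelihood.
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)
manual_loss = -prob[torch.arange(32), Y].log().mean()

# Built-in version: same quantity in a single call.
builtin_loss = F.cross_entropy(logits, Y)

print(manual_loss.item(), builtin_loss.item())
print(torch.allclose(manual_loss, builtin_loss))
```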

Now there are actually many good reasons to prefer F.cross_entropy over rolling your own implementation like this. I did this for educational reasons, but you'd never use this in practice. Why is that? Number one, when you use F.cross_entropy, PyTorch will not actually create all these intermediate tensors because these are all new tensors in memory, and all this is fairly inefficient to run like this.

Instead, PyTorch will cluster up all these operations and very often have fused kernels that very efficiently evaluate these expressions that are sort of like clustered mathematical operations. Number two, the backward pass can be made much more efficient, and not just because it's a fused kernel, but also analytically and mathematically, it's often a very much simpler backward pass to implement.

We actually saw this with micrograd. You see here when we implemented tanh, the forward pass of this operation to calculate the tanh was actually a fairly complicated mathematical expression. But because it's a clustered mathematical expression, when we did the backward pass, we didn't individually backward through the exp and the 2 times and the -1 and division, etc.

We just said it's 1 - t^2, and that's a much simpler mathematical expression. And we were able to do this because we're able to reuse calculations and because we are able to mathematically and analytically derive the derivative, and often that expression simplifies mathematically, and so there's much less to implement.
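The simplification can be checked numerically. In this sketch autograd backpropagates through the exp, multiply, subtract, and divide steps individually, and the result is compared with the closed-form derivative 1 - t^2; the input values are arbitrary.

```python
import torch

# Arbitrary test inputs; float64 keeps the comparison tight.
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], dtype=torch.float64, requires_grad=True)

# tanh written out the long way, as in the micrograd implementation:
# t = (e^(2x) - 1) / (e^(2x) + 1)
e = torch.exp(2 * x)
t = (e - 1) / (e + 1)

# Backward through every elementary operation.
t.sum().backward()
autograd_grad = x.grad

# Closed-form derivative: a single simple expression that reuses the forward result.
analytic_grad = 1 - t.detach() ** 2

print(autograd_grad)
print(analytic_grad)
print(torch.allclose(autograd_grad, analytic_grad))
```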

So not only can it be made more efficient because it runs in a fused kernel, but also because the expressions can take a much simpler form mathematically. So that's number one. Number two, under the hood, F.cross_entropy can also be significantly more numerically well-behaved. Let me show you an example of how this works.

Suppose we have logits of -2, 3, -3, 0, and 5, and then we are taking the exponent of them and normalizing to sum to 1. So when logits take on these values, everything is well and good, and we get a nice probability distribution. Now consider what happens when some of these logits take on more extreme values, and that can happen during optimization of the neural network.

Suppose that some of these numbers grow very negative, like say -100, then actually everything will come out fine. We still get probabilities that are well-behaved, and they sum to 1, and everything is great. But because of the way exp works, if you have very positive logits, like say +100 in here, you actually start to run into trouble, and we get NaN (not a number) here.

And the reason for that is that these counts have an inf here. So if you pass in a very negative number to exp, you just get a very small number, very near 0, and that's fine. But if you pass in a very positive number, suddenly we run out of range in our floating-point number that represents these counts.

So basically we're taking e and we're raising it to the power of 100, and that gives us inf, because we've run out of dynamic range on the floating-point number that holds the counts. And so we cannot pass very large logits through this expression. Now let me reset these numbers to something reasonable.

The way PyTorch solved this is that - you see how we have a well-behaved result here? It turns out that because of the normalization here, you can actually offset logits by any arbitrary constant value that you want. So if I add 1 here, you actually get the exact same result.

Or if I add 2, or if I subtract 3, any offset will produce the exact same probabilities. So because negative numbers are okay, but positive numbers can actually overflow this exp, what PyTorch does is it internally calculates the maximum value that occurs in the logits, and it subtracts it.

So in this case, it would subtract 5. And so therefore the greatest number in logits will become 0, and all the other numbers will become some negative numbers. And then the result of this is always well-behaved. So even if we have 100 here, previously, not good. But because PyTorch will subtract 100, this will work.
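The overflow and the max-subtraction fix can be reproduced in a few lines. This is a sketch of the trick as described above, not a copy of PyTorch's internal code; the helper names naive_softmax and stable_softmax are made up for this example.

```python
import torch

def naive_softmax(logits):
    # Direct translation of "exponentiate and normalize".
    counts = logits.exp()
    return counts / counts.sum()

def stable_softmax(logits):
    # Subtract the maximum first, so the largest exponent is exp(0) = 1.
    counts = (logits - logits.max()).exp()
    return counts / counts.sum()

small = torch.tensor([-2.0, 3.0, -3.0, 0.0, 5.0])
# Any constant offset leaves the probabilities unchanged.
same = torch.allclose(naive_softmax(small), naive_softmax(small - 3.0))

extreme = torch.tensor([-2.0, 3.0, -3.0, 0.0, 100.0])
naive = naive_softmax(extreme)    # exp(100) overflows float32 to inf, and inf / inf is nan
stable = stable_softmax(extreme)  # well-behaved: the 100 becomes 0 after the shift

print(same, naive, stable)
```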

And so there's many good reasons to call cross-entropy. Number one, the forward pass can be much more efficient. The backward pass can be much more efficient. And also things can be much more numerically well-behaved. Okay, so let's now set up the training of this neural net. We have the forward pass.

We don't need these. Instead, we have that loss is equal to F.cross_entropy of the logits and the targets. That's the forward pass. Then we need the backward pass. First, we want to set the gradients to be 0. So for p in parameters, we set p.grad to None, which is the same as setting it to 0 in PyTorch.

And then loss.backward() to populate those gradients. Once we have the gradients, we can do the parameter update. So for p in parameters, we want to take p.data and nudge it by negative learning rate times p.grad. And then we want to repeat this a few times. And let's print the loss here as well.

Now, this won't suffice, and it will create an error, because we also have to go for p in parameters and make sure that p.requires_grad is set to True in PyTorch. And this should just work. Okay, so we started off with loss of 17, and we're decreasing it.
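Put together, the loop looks roughly like the sketch below. The shapes follow the network described in this lecture (27 characters, 3 characters of context, 2-dimensional embeddings, 100 hidden neurons), but the data is a random stand-in and the seed is arbitrary, so the exact loss values will differ from the ones quoted.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
# Stand-in data: 32 examples of 3 context characters, with integer targets.
X = torch.randint(0, 27, (32, 3), generator=g)
Y = torch.randint(0, 27, (32,), generator=g)

C = torch.randn((27, 2), generator=g)      # embedding table
W1 = torch.randn((6, 100), generator=g)    # hidden layer
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 27), generator=g)   # output layer
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True                 # without this, loss.backward() raises an error

losses = []
for _ in range(100):
    # forward pass
    emb = C[X]                                  # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)   # (32, 100)
    logits = h @ W2 + b2                        # (32, 27)
    loss = F.cross_entropy(logits, Y)
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update: step against the gradient
    for p in parameters:
        p.data += -0.1 * p.grad
    losses.append(loss.item())

print(losses[0], losses[-1])
```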

Let's run longer. And you see how the loss decreases a lot here. So if we just run for 1,000 times, we get a very, very low loss, and that means that we're making very good predictions. Now, the reason that this is so straightforward right now is because we're only overfitting 32 examples.

So we only have 32 examples of the first five words, and therefore it's very easy to make this neural net fit only these 32 examples because we have 3,400 parameters and only 32 examples. So we're doing what's called overfitting a single batch of the data and getting a very low loss and good predictions.

But that's just because we have so many parameters for so few examples, so it's easy to make this be very low. Now, we're not able to achieve exactly zero, and the reason for that is we can, for example, look at logits, which are being predicted, and we can look at the max along the first dimension.

And in PyTorch, max reports both the actual values that take on the maximum number but also the indices of these. And you'll see that the indices are very close to the labels, but in some cases, they differ. For example, in this very first example, the predicted index is 19, but the label is 5.
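For reference, here is what that call returns on a tiny hand-made tensor (the numbers are made up for illustration):

```python
import torch

logits = torch.tensor([[0.1, 2.0, -1.0],
                       [3.0, 0.5, 0.2]])

# max along dimension 1 returns the largest value in each row and its column index.
values, indices = logits.max(1)

print(values, indices)
```

Comparing indices against the labels Y is then a one-line accuracy check.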

And we're not able to make loss be zero, and fundamentally that's because here, the very first or the zeroth index is the example where dot, dot, dot is supposed to predict E, but you see how dot, dot, dot is also supposed to predict an O, and dot, dot, dot is also supposed to predict an I and an S as well.

And so basically E, O, A, I, or S are all possible outcomes in the training set for the exact same input. So we're not able to completely overfit and make the loss be exactly zero, but we're getting very close in the cases where there's a unique input for a unique output.
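One way to see why zero is out of reach: for an input that appears with several different targets, the best a model can do is predict the empirical frequencies of those targets. The sketch below computes the resulting per-example loss for the all-dots context, assuming it appears once with each of the five first letters mentioned above.

```python
import math
from collections import Counter

# Targets seen for the same "..." context (assumed: one per training word).
targets_for_context = ["e", "o", "a", "i", "s"]

counts = Counter(targets_for_context)
n = len(targets_for_context)

# Average negative log-likelihood when predicting the empirical frequencies.
best_nll = -sum((c / n) * math.log(c / n) for c in counts.values())

print(best_nll)  # equals log(5) for five equally likely targets
```

Those examples alone keep the average loss over the batch above zero, no matter how many parameters the network has.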

In those cases, we do what's called overfit, and we basically get the exact same and the exact correct result. So now all we have to do is we just need to make sure that we read in the full dataset and optimize the neural net. Okay, so let's swing back up where we created the dataset, and we see that here we only use the first five words.

So let me now erase this, and let me erase the print statements, otherwise we'd be printing way too much. And so when we process the full dataset of all the words, we now have 228,000 examples instead of just 32. So let's now scroll back down. The dataset is much larger.

We initialize the weights, the same number of parameters. They all require gradients. And then let's move this print of loss.item() to be here, and let's just see how the optimization goes if we run this. Okay, so we started with a fairly high loss, and then as we're optimizing, the loss is coming down.

But you'll notice that it takes quite a bit of time for every single iteration, so let's actually address that because we're doing way too much work forwarding and backwarding 228,000 examples. In practice, what people usually do is they perform the forward pass, backward pass, and update on mini-batches of the data.

So what we will want to do is we want to randomly select some portion of the dataset, and that's a mini batch, and then only forward, backward, and update on that little mini batch, and then we iterate on those mini batches. So in PyTorch, we can, for example, use torch.randint.

We can generate numbers between 0 and 5 and make 32 of them. I believe the size has to be a tuple in PyTorch. So we can have a tuple, (32,), of numbers between 0 and 5, but actually we want x.shape[0] here as the upper bound. And so this creates integers that index into our dataset, and there's 32 of them.

So if our mini batch size is 32, then we can come here and we can first do mini batch construct. So integers that we want to optimize in this single iteration are in the ix, and then we want to index into x with ix to only grab those rows.

So we're only getting 32 rows of x, and therefore embeddings will again be 32 by 3 by 2, not 200,000 by 3 by 2. And then this ix has to be used not just to index into x, but also to index into y. And now this should be mini batches, and this should be much, much faster.
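A minimal sketch of the mini-batch construction, using a random stand-in dataset of 1,000 examples instead of the real one (X, Y, and the seed are assumptions for illustration):

```python
import torch

g = torch.Generator().manual_seed(0)
X = torch.randint(0, 27, (1000, 3), generator=g)   # stand-in inputs
Y = torch.randint(0, 27, (1000,), generator=g)     # stand-in targets
C = torch.randn((27, 2), generator=g)              # embedding table

# Draw 32 random row indices; the size argument must be a tuple.
ix = torch.randint(0, X.shape[0], (32,), generator=g)

# The same indices select matching rows of inputs and targets.
Xb, Yb = X[ix], Y[ix]
emb = C[Xb]

print(ix.shape, emb.shape)
```

Because randint samples with replacement, a row can occasionally appear twice in a batch, which is harmless for this purpose.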

So it's instant almost. So this way we can run many, many examples nearly instantly and decrease the loss much, much faster. Now because we're only dealing with mini batches, the quality of our gradient is lower, so the direction is not as reliable. It's not the actual gradient direction. But the gradient direction is good enough, even when it's estimating on only 32 examples, that it is useful.

And so it's much better to have an approximate gradient and just make more steps than it is to evaluate the exact gradient and take fewer steps. So that's why in practice this works quite well. So let's now continue the optimization. Let me take out this loss.item() from here and place it over here at the end.

Okay, so we're hovering around 2.5 or so. However, this is only the loss for that mini batch. So let's actually evaluate the loss here for all of x and for all of y, just so we have a full sense of exactly how well the model is doing right now.

So right now we're at about 2.7 on the entire training set. So let's run the optimization for a while. We're at 2.6, 2.57, 2.53. Okay, so one issue, of course, is we don't know if we're stepping too slow or too fast. So this 0.1, I just guessed it.

So one question is, how do you determine this learning rate? And how do we gain confidence that we're stepping in the right sort of speed? So I'll show you one way to determine a reasonable learning rate. It works as follows. Let's reset our parameters to the initial settings. And now let's print in every step, but let's only do 10 steps or so, or maybe 100 steps.

We want to find a very reasonable search range, if you will. So, for example, if this is very low, then we see that the loss is barely decreasing. So that's not -- that's too low, basically. So let's try this one. Okay, so we're decreasing the loss, but not very quickly.

So that's a pretty good low range. Now let's reset it again. And now let's try to find the place at which the loss kind of explodes. So maybe at negative 1, that is, a learning rate of 1 with the minus sign of the update in front. Okay, we see that we're minimizing the loss, but you see how it's kind of unstable. It goes up and down quite a bit.

So negative 1 is probably like a fast learning rate. Let's try negative 10. Okay, so this isn't optimizing. This is not working very well. So negative 10 is way too big. Negative 1 was already kind of big. So, therefore, negative 1 was like somewhat reasonable if I reset. So I'm thinking that the right setting is somewhere between negative 0.001 and negative 1, that is, a learning rate between 0.001 and 1.

So the way we can do this here is we can use torch.linspace. And we want to basically do something like this, between 0.001 and 1. But, oh, number of steps is one more parameter that's required. Let's do 1,000 steps. This creates 1,000 numbers between 0.001 and 1. But it doesn't really make sense to step between these linearly.

So instead, let me create the learning rate exponent, lre. And instead of 0.001, this will be a negative 3, and this will be a 0. And then the actual lrs that we want to search over are going to be 10 to the power of lre. So now what we're doing is we're stepping linearly between the exponents of these learning rates.

This is 0.001, and this is 1, because 10 to the power of 0 is 1. And therefore, we are spaced exponentially in this interval. So these are the candidate learning rates that we want to sort of like search over, roughly. So now what we're going to do is here we are going to run the optimization for 1,000 steps.
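The candidate learning rates can be built in two lines. The names lre and lrs follow the description above; the check at the end only illustrates that neighboring candidates differ by a constant factor rather than a constant step.

```python
import torch

lre = torch.linspace(-3, 0, 1000)   # exponents, linearly spaced from -3 to 0
lrs = 10 ** lre                     # learning rates, exponentially spaced from 0.001 to 1

# Ratio between neighbors at the start and at the end of the range.
ratio_start = (lrs[1] / lrs[0]).item()
ratio_end = (lrs[-1] / lrs[-2]).item()

print(lrs[0].item(), lrs[-1].item(), ratio_start, ratio_end)
```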

And instead of using a fixed number, we are going to use a learning rate indexing into here, lrs[i], and make this i. So basically, let me reset this to be, again, starting from random, creating these learning rates between 0.001 and 1, but exponentially stepped. And here what we're doing is we're iterating 1,000 times.

We're going to use the learning rate that's in the beginning very, very low. In the beginning, it's going to be 0.001, but by the end, it's going to be 1. And we're going to step with that learning rate. And now what we want to do is we want to keep track of the learning rates that we used, and we want to look at the losses that resulted.

And so here, let me track stats. So lri.append(lr) and lossi.append(loss.item()). So again, reset everything and then run. And so basically, we started with a very low learning rate, and we went all the way up to a learning rate of 1. And now what we can do is we can plt.plot, and we can plot the two.

So we can plot the learning rates on the x-axis and the losses we saw on the y-axis. And often, you're going to find that your plot looks something like this, where in the beginning, you had very low learning rates. So basically, anything--barely anything happened. Then we got to, like, a nice spot here.

And then as we increased the learning rate enough, we basically started to be kind of unstable here. So a good learning rate turns out to be somewhere around here. And because we have lri here, we actually may want to track not lr, not the learning rate, but the exponent. So lre[i] is maybe what we want to log.

So let me reset this and redo that calculation. But now on the x-axis, we have the exponent of the learning rate. And so we can see the exponent of the learning rate that is good to use. It would be sort of like roughly in the valley here, because here the learning rates are just way too low.

And then here, we expect relatively good learning rates somewhere here. And then here, things are starting to explode. So somewhere around -1 as the exponent of the learning rate is a pretty good setting. And 10 to the -1 is 0.1. So 0.1 was actually a fairly good learning rate around here.

And that's what we had in the initial setting. But that's roughly how you would determine it. And so here now we can take out the tracking of these. And we can just simply set lr to be 10 to the -1, or in other words 0.1, as it was before. And now we have some confidence that this is actually a fairly good learning rate.
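The whole sweep can be sketched as follows. It uses a random stand-in dataset (an assumption, so the curve will not match the one from the real names data) and leaves the plotting call as a comment to keep the example free of extra dependencies.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
X = torch.randint(0, 27, (2000, 3), generator=g)   # stand-in inputs
Y = torch.randint(0, 27, (2000,), generator=g)     # stand-in targets

C = torch.randn((27, 2), generator=g)
W1 = torch.randn((6, 100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

lre = torch.linspace(-3, 0, 1000)
lrs = 10 ** lre

lri, lossi = [], []
for i in range(1000):
    ix = torch.randint(0, X.shape[0], (32,), generator=g)   # mini-batch
    emb = C[X[ix]]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])
    for p in parameters:
        p.grad = None
    loss.backward()
    lr = lrs[i]                      # learning rate grows from 0.001 to 1 over the run
    for p in parameters:
        p.data += -lr * p.grad
    lri.append(lre[i].item())        # log the exponent, not the rate itself
    lossi.append(loss.item())

# With matplotlib available: plt.plot(lri, lossi) and look for the valley.
```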

And so now what we can do is we can crank up the iterations. We can reset our optimization. And we can run for a pretty long time using this learning rate. Oops, and we don't want to print. It's way too much printing. So let me again reset and run 10,000 steps.

Okay, so we're at 2.48, roughly. Let's run another 10,000 steps. 2.46. And now let's do one learning rate decay. What this means is we're going to take our learning rate and we're going to 10x lower it. And so we're at the late stages of training, potentially, and we may want to go a bit slower.

Let's do one more, actually, at 0.1, just to see if we're making a dent here. Okay, we're still making a dent. And by the way, the bigram loss that we achieved last video was 2.45. So we've already surpassed the bigram model. And once I get a sense that this is actually kind of starting to plateau off, people like to do, as I mentioned, this learning rate decay.

So let's try to decay the loss, the learning rate, I mean. And we're at about 2.3 now. Obviously, this is janky and not exactly how you would train it in production, but this is roughly what you're going through. You first find a decent learning rate using the approach that I showed you.

Then you start with that learning rate and you train for a while. And then at the end, people like to do a learning rate decay, where you decay the learning rate by, say, a factor of 10, and you do a few more steps. And then you get a trained network, roughly speaking.

So we've achieved 2.3 and dramatically improved on the bigram language model using this simple neural net, as described here, using these 3,400 parameters. Now, there's something we have to be careful with. I said that we have a better model because we are achieving a lower loss, 2.3, much lower than 2.45 with the bigram model previously.

Now, that's not exactly true. And the reason that's not true is that this is actually a fairly small model, but these models can get larger and larger if you keep adding neurons and parameters. So you can imagine that we don't potentially have a thousand parameters. We could have 10,000 or 100,000 or millions of parameters.

And as the capacity of the neural network grows, it becomes more and more capable of overfitting your training set. What that means is that the loss on the training set, on the data that you're training on, will become very, very low, as low as zero. But all that the model is doing is memorizing your training set verbatim.

So if you take that model and it looks like it's working really well, but you try to sample from it, you will basically only get examples exactly as they are in the training set. You won't get any new data. In addition to that, if you try to evaluate the loss on some withheld names or other words, you will actually see that the loss on those can be very high.

And so basically, it's not a good model. So the standard in the field is to split up your data set into three splits, as we call them. We have the training split, the dev split or the validation split, and the test split. So training split, dev or validation split, and test split.

And typically, this would be, say, 80% of your data set. This could be 10% and this 10%, roughly. So you have these three splits of the data. Now, these 80% of the data set, the training set, is used to optimize the parameters of the model, just like we're doing here, using gradient descent.

These 10% of the examples, the dev or validation split, they're used for development over all the hyperparameters of your model. So hyperparameters are, for example, the size of this hidden layer, the size of the embedding. So these are 100 and 2 for us, but we could try different things.

The strength of the regularization, which we aren't using yet so far. So there's lots of different hyperparameters and settings that go into defining a neural net. And you can try many different variations of them and see whichever one works best on your validation split. So this is used to train the parameters.

This is used to train the hyperparameters. And test split is used to evaluate, basically, the performance of the model at the end. So we're only evaluating the loss on the test split very, very sparingly and very few times, because every single time you evaluate your test loss and you learn something from it, you are basically starting to also train on the test split.

So you are only allowed to test the loss on the test set very, very few times. Otherwise, you risk overfitting to it as well as you experiment on your model. So let's also split up our training data into train, dev, and test. And then we are going to train on train and only evaluate on test very, very sparingly.

Okay, so here we go. Here is where we took all the words and put them into x and y tensors. So instead, let me create a new cell here, and let me just copy/paste some code here, because I don't think it's that complex, but we're going to try to save a little bit of time.

I'm converting this to be a function now. And this function takes some list of words and builds the arrays x and y for those words only. And then here, I am shuffling up all the words. So these are the input words that we get. We are randomly shuffling them all up.

And then we're going to set n1 to be the index at 80% of the words and n2 to be the index at 90% of the words. So basically, if length of words is about 32,000, n1 is, well, sorry, I should probably run this. n1 is about 25,000, and n2 is about 28,000.

And so here we see that I'm calling build_dataset to build a training set x and y by indexing into up to n1. So we're going to have only about 25,000 training words. And then we're going to have roughly n2 minus n1, about 3,000 validation words or dev words. And we're going to have length of words minus n2, or 3,204 words, here for the test set.
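The split itself is only a few lines. The sketch below uses placeholder strings instead of the real names and assumes 32,033 words in total, a count chosen so that the test split comes out at the 3,204 words quoted above; the seed is arbitrary.

```python
import random

words = [f"name{i}" for i in range(32033)]   # placeholder list standing in for the names

random.seed(42)
random.shuffle(words)

n1 = int(0.8 * len(words))   # end of the training split
n2 = int(0.9 * len(words))   # end of the dev split

train_words = words[:n1]
dev_words = words[n1:n2]
test_words = words[n2:]

print(len(train_words), len(dev_words), len(test_words))
```

Each of the three lists would then be passed to the dataset-building function to get its own x and y tensors.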

So now we have x's and y's for all those three splits. Oh yeah, I'm printing their size here inside the function as well. But here we don't have words, but these are already the individual examples made from those words. So let's now scroll down here. And the data set now for training is more like this.

And then when we reset the network, when we're training, we're only going to be training using x train and y train. So that's the only thing we're training on. Let's see where we are on the single batch. Let's now train maybe a few more steps. Training neural networks can take a while.

Usually you don't do it inline. You launch a bunch of jobs and you wait for them to finish. It can take multiple days and so on. But basically this is a very small network. Okay, so the loss is pretty good. Oh, we accidentally used a learning rate that is way too low.

So let me actually come back. We were still using the decayed learning rate of 0.01, so let's set it back to 0.1. This will train much faster. And then here when we evaluate, let's use the dev set here, x dev and y dev, to evaluate the loss. Okay, and let's now decay the learning rate and only do say 10,000 steps.

And let's evaluate the dev loss once here. Okay, so we're getting about 2.3 on dev. And so the neural network when it was training did not see these dev examples. It hasn't optimized on them. And yet when we evaluate the loss on these dev, we actually get a pretty decent loss.

And so we can also look at what the loss is on all of training set. Oops. And so we see that the training and the dev loss are about equal. So we're not overfitting. This model is not powerful enough to just be purely memorizing the data. And so far we are what's called underfitting because the training loss and the dev or test losses are roughly equal.

So what that typically means is that our network is very tiny, very small. And we expect to make performance improvements by scaling up the size of this neural net. So let's do that now. So let's come over here and let's increase the size of the neural net. The easiest way to do this is we can come here to the hidden layer, which currently is 100 neurons.

And let's just bump this up. So let's do 300 neurons. And then this is also 300 biases. And here we have 300 inputs into the final layer. So let's initialize our neural net. We now have 10,000 parameters instead of 3,000 parameters. And then we're not using this. And then here what I'd like to do is I'd like to actually keep track of that.

Okay, let's just do this. Let's keep stats again. And here when we're keeping track of the loss, let's just also keep track of the steps. And let's just have an i here. And let's train on 30,000. Or rather say, let's try 30,000. And we are at 0.1. And we should be able to run this and optimize the neural net.

And then here basically I want to plt.plot the steps against the loss. So these are the x's and the y's. And this is the loss function and how it's being optimized. Now, you see that there's quite a bit of thickness to this. And that's because we are optimizing over these mini-batches.

And the mini-batches create a little bit of noise in this. Where are we in the dev set? We are at 2.5. So we still haven't optimized this neural net very well. And that's probably because we made it bigger. It might take longer for this neural net to converge. And so let's continue training.

Yeah, let's just continue training. One possibility is that the batch size is so low that we just have way too much noise in the training. And we may want to increase the batch size so that we have a bit more correct gradient. And we're not thrashing too much. And we can actually optimize more properly.

This will now become meaningless because we've reinitialized these. So yeah, this looks not pleasing right now. But there probably is a tiny improvement, but it's so hard to tell. Let's go again. 2.52. Let's try to decrease the learning rate by a factor of two. Okay, we're at 2.32. Let's continue training.

We basically expect to see a lower loss than what we had before. Because now we have a much, much bigger model. And we were underfitting. So we'd expect that increasing the size of the model should help the neural net. 2.32. Okay, so that's not happening too well. Now, one other concern is that even though we've made the tanh layer here, or the hidden layer, much, much bigger, it could be that the bottleneck of the network right now are these embeddings that are two-dimensional.

It can be that we're just cramming way too many characters into just two dimensions. And the neural net is not able to really use that space effectively. And that is sort of like the bottleneck to our network's performance. Okay, 2.23. So just by decreasing the learning rate, I was able to make quite a bit of progress.

Let's run this one more time. And then evaluate the training and the dev loss. Now, one more thing after training that I'd like to do is I'd like to visualize the embedding vectors for these characters before we scale up the embedding size from 2. Because we'd like to make this bottleneck potentially go away.

And once I make this greater than 2, we won't be able to visualize them. So here, okay, we're at 2.23 and 2.24. So we're not improving much more. And maybe the bottleneck now is the character embedding size, which is 2. So here I have a bunch of code that will create a figure.

And then we're going to visualize the embeddings that were trained by the neural net on these characters. Because right now the embedding size is just 2. So we can visualize all the characters with the x and the y coordinates as the two embedding locations for each of these characters.

And so here are the x coordinates and the y coordinates, which are the columns of C. And then for each one, I also include the text of the little character. So here what we see is actually kind of interesting. The network has basically learned to separate out the characters and cluster them a little bit.

So, for example, you see how the vowels, A, E, I, O, U, are clustered up here. So what that's telling us is that the neural net treats these as very similar, right? Because when they feed into the neural net, the embedding for all these characters is very similar. And so the neural net thinks that they're very similar and kind of like interchangeable, if that makes sense.

Then the points that are like really far away are, for example, Q. Q is kind of treated as an exception, and Q has a very special embedding vector, so to speak. Similarly, dot, which is a special character, is all the way out here. And a lot of the other letters are sort of like clustered up here.

And so it's kind of interesting that there's a little bit of structure here after the training. And it's definitely not random, and these embeddings make sense. So we're now going to scale up the embedding size and won't be able to visualize it directly. But we expect that because we're underfitting and we made this layer much bigger and did not sufficiently improve the loss, we're thinking that the constraint to better performance right now could be these embedding vectors.

So let's make them bigger. OK, so let's scroll up here. And now we don't have two-dimensional embeddings. We are going to have, say, 10-dimensional embeddings for each character. Then this layer will receive 3 times 10, so 30 inputs will go into the hidden layer. Let's also make the hidden layer a bit smaller.

So instead of 300, let's just do 200 neurons in that hidden layer. So now the total number of parameters will be slightly bigger at 11,000. And then here we have to be a bit careful because, OK, the learning rate, we set to 0.1. Here we are hardcoding 6. And obviously if you're working in production, you don't want to be hardcoding magic numbers.
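The parameter counts quoted in this lecture can be checked with a bit of arithmetic. The helper below is a hypothetical function written for this check; it assumes 27 characters, a context of 3 characters, one hidden layer, and biases on both layers.

```python
def count_parameters(vocab_size, block_size, emb_dim, hidden):
    embedding = vocab_size * emb_dim              # lookup table C
    w1 = block_size * emb_dim * hidden            # hidden layer weights
    b1 = hidden                                   # hidden layer biases
    w2 = hidden * vocab_size                      # output layer weights
    b2 = vocab_size                               # output layer biases
    return embedding + w1 + b1 + w2 + b2

small = count_parameters(27, 3, 2, 100)     # 2-dim embeddings, 100 hidden neurons
wide = count_parameters(27, 3, 2, 300)      # 2-dim embeddings, 300 hidden neurons
bigger = count_parameters(27, 3, 10, 200)   # 10-dim embeddings, 200 hidden neurons

print(small, wide, bigger)  # 3481 10281 11897
```

The first argument to the hidden layer, block_size times emb_dim, is the number that was hardcoded as 6 and becomes 30 with 10-dimensional embeddings.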

But instead of 6, this should now be 30. And let's run for 50,000 iterations. And let me split out the initialization here outside so that when we run this cell multiple times, it's not going to wipe out our loss. In addition to that, here, instead of logging loss.item(), let's actually log the log10 of the loss.

I believe that's a function of the loss. And I'll show you why in a second. Let's optimize this. Basically I'd like to plot the log loss instead of the loss. Because when you plot the loss, many times it can have this hockey stick appearance and log squashes it in.

So it just kind of looks nicer. So the x-axis is step i, and the y-axis will be the log-loss at step i. And then here this is 30. Ideally we wouldn't be hardcoding these. Okay, let's look at the loss. It's, again, very thick because the mini batch size is very small.

But the total loss over the training set is 2.3, and on the dev set it is 2.38. So so far so good. Let's try to now decrease the learning rate by a factor of 10. And train for another 50,000 iterations. We'd hope that we would be able to beat 2.32.

But, again, we're just kind of, like, doing this very haphazardly. So I don't actually have confidence that our learning rate is set very well, that our learning rate decay, which we just do at random, is set very well. And so the optimization here is kind of suspect, to be honest.

And this is not how you would do it typically in production. In production you would create parameters or hyperparameters out of all these settings, and then you would run lots of experiments and see whichever ones are working well for you. Okay. So we have 2.17 now and 2.2. Okay.

So you see how the training and the validation performance are starting to slightly slowly depart. So maybe we're getting the sense that the neural net is getting good enough or that number of parameters is large enough that we are slowly starting to overfit. Let's maybe run one more iteration of this and see where we get.

But, yeah, basically you would be running lots of experiments and then you are slowly scrutinizing whichever ones give you the best dev performance. And then once you find all the hyperparameters that make your dev performance good, you take that model and you evaluate the test set performance a single time.

And that's the number that you report in your paper or wherever else you want to talk about and brag about your model. So let's then rerun the plot and rerun the train and dev. And because we're getting lower loss now, it is the case that the embedding size of these was holding us back very likely.

Okay. So 2.16, 2.19 is what we're roughly getting. So there's many ways to go from here. We can continue tuning the optimization. We can continue, for example, playing with the size of the neural net. Or we can increase the number of words or characters in our case that we are taking as an input.

So instead of just three characters, we could be taking more characters as an input. And that could further improve the loss. Okay. So I changed the code slightly. So we have here 200,000 steps of the optimization. And in the first 100,000, we're using a learning rate of .1. And then in the next 100,000, we're using a learning rate of .01.
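That two-stage schedule can be written as a small function of the step index. The function name and its default arguments are assumptions for illustration; only the 0.1 and 0.01 values and the 200,000 total steps come from the description above.

```python
def learning_rate(step, total_steps=200_000, base_lr=0.1, decay_factor=10.0):
    # First half of training uses the base rate, second half a 10x smaller one.
    if step < total_steps // 2:
        return base_lr
    return base_lr / decay_factor

first = learning_rate(0)
before_decay = learning_rate(99_999)
after_decay = learning_rate(100_000)

print(first, before_decay, after_decay)
```

Inside the training loop, the update would then use learning_rate(i) in place of a fixed number.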

This is the loss that I achieve. And these are the performance on the training and validation loss. And in particular, the best validation loss I've been able to obtain in the last 30 minutes or so is 2.17. So now I invite you to beat this number. And you have quite a few knobs available to you to, I think, surpass this number.

So number one, you can, of course, change the number of neurons in the hidden layer of this model. You can change the dimensionality of the embedding lookup table. You can change the number of characters that are feeding in as an input, as the context into this model. And then, of course, you can change the details of the optimization.

How long are we running? What is the learning rate? How does it change over time? How does it decay? You can change the batch size, and you may be able to actually achieve a much better convergence speed in terms of how many seconds or minutes it takes to train the model and get your result in terms of really good loss.

And then, of course, I actually invite you to read this paper. It is 19 pages, but at this point, you should actually be able to read a good chunk of this paper and understand pretty good chunks of it. And this paper also has quite a few ideas for improvements that you can play with.

So all of those are knobs available to you, and you should be able to beat this number. I'm leaving that as an exercise to the reader. And that's it for now, and I'll see you next time. Before we wrap up, I also wanted to show how you would sample from the model.

So we're going to generate 20 samples. At first, we begin with all dots, so that's the context. And then until we generate the zeroth character again, we're going to embed the current context using the embedding table C. Now, usually here, the first dimension was the size of the training set.

But here, we're only working with a single example that we're generating. So this is just dimension one, just for simplicity. And so this embedding then gets projected into the hidden state. You get the logits. Now we calculate the probabilities. For that, you can use F.softmax of logits, and that just basically exponentiates logits and makes them sum to one.
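To make "exponentiates and normalizes" concrete, here is a minimal plain-Python sketch of softmax. It is not how PyTorch implements `F.softmax` internally; it only illustrates the idea. Subtracting the largest logit before exponentiating is a standard trick that leaves the result unchanged while keeping `exp` from overflowing.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to one."""
    m = max(logits)                           # shift so the largest exponent is 0
    exps = [math.exp(x - m) for x in logits]  # exponentiate the shifted logits
    total = sum(exps)
    return [e / total for e in exps]          # normalize

# math.exp(1000.0) on its own would raise OverflowError; the shifted version is fine.
probs = softmax([1000.0, 1001.0, 1002.0])
```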

And similar to cross entropy, it is careful that there are no overflows. Once we have the probabilities, we sample from them using torch.multinomial to get our next index. And then we shift the context window to append the index and record it. And then we can just decode all the integers to strings and print them out.
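The sampling loop described above might look roughly like the sketch below. The parameters here are randomly initialized stand-ins so that the snippet runs on its own; in the notebook they would be the trained `C`, `W1`, `b1`, `W2`, `b2`. The layer sizes, the `itos` mapping, the `sample_name` helper and the `max_len` safety cap are assumptions for illustration, not taken from the lecture.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)

# Assumed sizes: 27 characters, 3 characters of context, 10-d embeddings, 200 hidden units.
vocab_size, block_size, emb_dim, n_hidden = 27, 3, 10, 200

# Random stand-ins for the trained parameters.
C = torch.randn((vocab_size, emb_dim), generator=g)
W1 = torch.randn((block_size * emb_dim, n_hidden), generator=g)
b1 = torch.randn(n_hidden, generator=g)
W2 = torch.randn((n_hidden, vocab_size), generator=g)
b2 = torch.randn(vocab_size, generator=g)

# Index 0 is the '.' start/end token, 1..26 are 'a'..'z'.
itos = {0: '.'}
itos.update({i + 1: chr(ord('a') + i) for i in range(26)})

def sample_name(max_len=50):
    out = []
    context = [0] * block_size                 # start with all dots
    for _ in range(max_len):                   # cap added here so untrained weights cannot loop forever
        emb = C[torch.tensor([context])]       # (1, block_size, emb_dim)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)  # hidden state
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)       # probabilities over the next character
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]           # shift the context window
        if ix == 0:                            # sampled the end token
            break
        out.append(ix)
    return ''.join(itos[i] for i in out)

samples = [sample_name() for _ in range(20)]
for s in samples:
    print(s)
```

With random weights the output is gibberish; swapping in the trained parameters is what produces name-like samples.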

And so these are some example samples, and you can see that the model now works much better. So the words here are much more word-like or name-like. So we have things like Ham, Joe's, Lila. It's starting to sound a little bit more name-like. So we're definitely making progress, but we can still improve on this model quite a lot.

Okay, sorry, there's some bonus content. I wanted to mention that I want to make these notebooks more accessible. And so I don't want you to have to install Jupyter Notebooks and Torch and everything else. So I will be sharing a link to a Google Colab. And the Google Colab will look like a notebook in your browser.

And you can just go to a URL, and you'll be able to execute all of the code that you saw in the Google Colab. And so this is me executing the code in this lecture, and I shortened it a little bit. But basically, you're able to train the exact same network and then plot and sample from the model.

And everything is ready for you to tinker with the numbers right there in your browser, no installation necessary. So I just wanted to point that out, and the link to this will be in the video description.
