Building makemore Part 2: MLP
Chapters
0:00 intro
1:48 Bengio et al. 2003 (MLP language model) paper walkthrough
9:03 (re-)building our training dataset
12:19 implementing the embedding lookup table
18:35 implementing the hidden layer + internals of torch.Tensor: storage, views
29:15 implementing the output layer
29:53 implementing the negative log likelihood loss
32:17 summary of the full network
32:49 introducing F.cross_entropy and why
37:56 implementing the training loop, overfitting one batch
41:25 training on the full dataset, minibatches
45:40 finding a good initial learning rate
53:20 splitting up the dataset into train/val/test splits and why
60:49 experiment: larger hidden layer
65:27 visualizing the character embeddings
67:16 experiment: larger embedding size
71:46 summary of our final code, conclusion
73:24 sampling from the model
74:55 Google Colab (new!!) notebook advertisement
00:00:00.000 |
Hi everyone. Today we are continuing our implementation of Makemore. 00:00:05.000 |
Now, in the last lecture, we implemented the bigram language model, 00:00:08.000 |
and we implemented it both using counts and also using a super simple neural network 00:00:15.000 |
Now, this is the Jupyter Notebook that we built out last lecture, 00:00:20.000 |
and we saw that the way we approached this is that we looked at only the single previous character, 00:00:24.000 |
and we predicted the distribution for the character that would go next in the sequence. 00:00:29.000 |
And we did that by taking counts and normalizing them into probabilities 00:00:36.000 |
Now, this is all well and good if you only have one character of previous context. 00:00:44.000 |
The problem with this model, of course, is that the predictions from this model are not very good 00:00:49.000 |
because you only take one character of context. 00:00:51.000 |
So the model didn't produce very name-like sounding things. 00:00:55.000 |
Now, the problem with this approach, though, is that if we are to take more context into account 00:01:01.000 |
when predicting the next character in a sequence, things quickly blow up. 00:01:04.000 |
And this table, the size of this table, grows, and in fact it grows exponentially 00:01:11.000 |
Because if we only take a single character at a time, that's 27 possibilities of context. 00:01:15.000 |
But if we take two characters in the past and try to predict the third one, 00:01:19.000 |
suddenly the number of rows in this matrix, you can look at it that way, is 27 times 27. 00:01:25.000 |
So there's 729 possibilities for what could have come in the context. 00:01:30.000 |
If we take three characters as the context, suddenly we have 20,000 possibilities of context. 00:01:37.000 |
And so that's just way too many rows of this matrix. 00:01:41.000 |
It's way too few counts for each possibility. 00:01:45.000 |
And the whole thing just kind of explodes and doesn't work very well. 00:01:49.000 |
So that's why today we're going to move on to this bullet point here. 00:01:52.000 |
And we're going to implement a multilayer perceptron model 00:01:58.000 |
And this modeling approach that we're going to adopt follows this paper, Bengio et al., 2003. 00:02:06.000 |
Now, this isn't the very first paper that proposed the use of multilayer perceptrons 00:02:11.000 |
or neural networks to predict the next character or token in a sequence. 00:02:14.000 |
But it's definitely one that was very influential around that time. 00:02:18.000 |
It is very often cited to stand in for this idea. 00:02:23.000 |
And so this is the paper that we're going to first look at and then implement. 00:02:27.000 |
Now, this paper has 19 pages, so we don't have time to go into the full detail of this paper. 00:02:33.000 |
It's very readable, interesting, and has a lot of interesting ideas in it as well. 00:02:37.000 |
In the introduction, they described the exact same problem I just described. 00:02:40.000 |
And then to address it, they proposed the following model. 00:02:44.000 |
Now, keep in mind that we are building a character-level language model. 00:02:50.000 |
In this paper, they have a vocabulary of 17,000 possible words, 00:02:54.000 |
and they instead build a word-level language model. 00:02:57.000 |
But we're going to still stick with the characters, but we'll take the same modeling approach. 00:03:01.000 |
Now, what they do is basically they propose to take every one of these words, 17,000 words, 00:03:07.000 |
and they're going to associate to each word a, say, 30-dimensional feature vector. 00:03:13.000 |
So every word is now embedded into a 30-dimensional space. 00:03:19.000 |
So we have 17,000 points or vectors in a 30-dimensional space, 00:03:26.000 |
That's a lot of points for a very small space. 00:03:29.000 |
Now, in the beginning, these words are initialized completely randomly, 00:03:35.000 |
But then we're going to tune these embeddings of these words using backpropagation. 00:03:40.000 |
So during the course of training of this neural network, 00:03:42.000 |
these points or vectors are going to basically move around in this space. 00:03:46.000 |
And you might imagine that, for example, words that have very similar meanings 00:03:49.000 |
or that are indeed synonyms of each other might end up in a very similar part of the space, 00:03:54.000 |
and conversely, words that mean very different things would go somewhere else in the space. 00:03:59.000 |
Now, their modeling approach otherwise is identical to ours. 00:04:03.000 |
They are using a multilayer neural network to predict the next word, given the previous words, 00:04:07.000 |
and to train the neural network, they are maximizing the log likelihood of the training data, 00:04:13.000 |
So the modeling approach itself is identical. 00:04:16.000 |
Now, here they have a concrete example of this intuition. 00:04:20.000 |
Basically, suppose that, for example, you are trying to predict a dog was running in a blank. 00:04:26.000 |
Now, suppose that the exact phrase "a dog was running in a" has never occurred in the training data. 00:04:32.000 |
And here you are at sort of test time later, when the model is deployed somewhere, 00:04:36.000 |
and it's trying to make a sentence, and it's saying "a dog was running in a blank." 00:04:41.000 |
And because it's never encountered this exact phrase in the training set, 00:04:47.000 |
Like, you don't have fundamentally any reason to suspect what might come next. 00:04:54.000 |
But this approach actually allows you to get around that, 00:04:57.000 |
because maybe you didn't see the exact phrase "a dog was running in a" something, 00:05:03.000 |
Maybe you've seen the phrase "the dog was running in a blank." 00:05:06.000 |
And maybe your network has learned that "a" and "the" are, like, 00:05:10.000 |
frequently are interchangeable with each other. 00:05:12.000 |
And so maybe it took the embedding for "a" and the embedding for "the," 00:05:16.000 |
and it actually put them, like, nearby each other in the space. 00:05:19.000 |
And so you can transfer knowledge through that embedding, and you can generalize in that way. 00:05:23.000 |
Similarly, the network could know that cats and dogs are animals, 00:05:27.000 |
and they co-occur in lots of very similar contexts. 00:05:30.000 |
So even though you haven't seen this exact phrase, 00:05:32.000 |
or you haven't seen exactly "walking" or "running," 00:05:35.000 |
you can, through the embedding space, transfer knowledge, 00:05:42.000 |
So let's now scroll down to the diagram of the neural network. 00:05:47.000 |
And in this example, we are taking three previous words, 00:05:51.000 |
and we are trying to predict the fourth word in a sequence. 00:05:56.000 |
Now, these three previous words, as I mentioned, 00:05:59.000 |
they have a vocabulary of 17,000 possible words. 00:06:03.000 |
So every one of these basically are the index of the incoming word. 00:06:09.000 |
And because there are 17,000 words, this is an integer between 0 and 16,999. 00:06:17.000 |
Now, there's also a lookup table that they call C. 00:06:21.000 |
This lookup table is a matrix that is 17,000 by, say, 30. 00:06:26.000 |
And basically what we're doing here is we're treating this as a lookup table. 00:06:29.000 |
And so every index is plucking out a row of this embedding matrix 00:06:35.000 |
so that each index is converted to the 30-dimensional vector 00:06:39.000 |
that corresponds to the embedding vector for that word. 00:06:43.000 |
So here we have the input layer of 30 neurons for each of the three words, making up 90 neurons in total. 00:06:51.000 |
And here they're saying that this matrix C is shared across all the words. 00:06:55.000 |
So we're always indexing into the same matrix C over and over 00:07:02.000 |
Next up is the hidden layer of this neural network. 00:07:05.000 |
The size of this hidden neural layer of this neural net is a hyperparameter. 00:07:09.000 |
So we use the word hyperparameter when it's kind of like a design choice 00:07:14.000 |
And this can be as large as you'd like or as small as you'd like. 00:07:19.000 |
And we are going to go over multiple choices of the size of this hidden layer, 00:07:23.000 |
and we're going to evaluate how well they work. 00:07:28.000 |
All of them would be fully connected to the 90 words 00:07:31.000 |
or 90 numbers that make up these three words. 00:07:42.000 |
And because there are 17,000 possible words that could come next, 00:07:49.000 |
the output layer has 17,000 neurons, and all of them are fully connected to all of these neurons in the hidden layer. 00:07:55.000 |
So there's a lot of parameters here because there's a lot of words. 00:08:05.000 |
So on top of there, we have the softmax layer, 00:08:07.000 |
which we've seen in our previous video as well. 00:08:09.000 |
So every one of these logits is exponentiated, 00:08:12.000 |
and then everything is normalized to sum to 1 00:08:15.000 |
to have a nice probability distribution for the next word in the sequence. 00:08:20.000 |
Now, of course, during training, we actually have the label. 00:08:23.000 |
We have the identity of the next word in the sequence. 00:08:26.000 |
That word or its index is used to pluck out the probability of that word, 00:08:33.000 |
and then we are maximizing the probability of that word 00:08:37.000 |
with respect to the parameters of this neural net. 00:08:40.000 |
So the parameters are the weights and biases of this output layer, the weights and biases of the hidden layer, and the embedding lookup table C, 00:08:49.000 |
and all of that is optimized using backpropagation. 00:08:57.000 |
These dashed arrows represent a variant of the neural net that we are not going to explore in this video. 00:09:00.000 |
So that's the setup, and now let's implement it. 00:09:02.000 |
Okay, so I started a brand new notebook for this lecture. 00:09:07.000 |
We are importing PyTorch, and we are importing Matplotlib so we can create figures. 00:09:10.000 |
Then I am reading all the names into a list of words like I did before, 00:09:23.000 |
And then here I'm building out the vocabulary of characters 00:09:25.000 |
and all the mappings from the characters as strings to integers and vice versa. 00:09:31.000 |
Now, the first thing we want to do is we want to compile the dataset 00:09:34.000 |
for the neural network, and I had to rewrite this code. 00:09:38.000 |
I'll show you in a second what it looks like. 00:09:41.000 |
So this is the code that I created for the dataset creation. 00:09:44.000 |
So let me first run it, and then I'll briefly explain how this works. 00:09:48.000 |
So first we're going to define something called block size, 00:09:51.000 |
and this is basically the context length: how many characters we take to predict the next one. 00:09:57.000 |
So here in this example, we're taking three characters 00:09:59.000 |
to predict the fourth one, so we have a block size of three. 00:10:02.000 |
That's the size of the block that supports the prediction. 00:10:10.000 |
The x are the input to the neural net, and the y are the labels 00:10:17.000 |
Then I'm iterating over the first five words. 00:10:20.000 |
I'm doing the first five just for efficiency while we are developing 00:10:23.000 |
all the code, but then later we are going to come here and erase this so that we use the entire dataset. 00:10:29.000 |
So here I'm printing the word "Emma," and here I'm basically showing 00:10:33.000 |
the examples that we can generate, the five examples that we can generate 00:10:41.000 |
So when we are given the context of just dot, dot, dot, the label is E. 00:10:50.000 |
When the context is dot, dot, E, the label is M, and so forth. 00:10:54.000 |
So the way I build this out is first I start with a padded context of just zero tokens. 00:11:02.000 |
Then I iterate over the characters in the sequence, and I basically build out the array y 00:11:06.000 |
of the current character, and the array x, which stores the current running context. 00:11:12.000 |
Then here, see, I print everything, and here I crop the context and append the new character. 00:11:19.000 |
So this is kind of like a rolling window of context. 00:11:23.000 |
Now we can change the block size here to, for example, four, 00:11:26.000 |
and in that case we would be predicting the fifth character 00:11:30.000 |
Or it can be five, and then it would look like this. 00:11:34.000 |
Or it can be, say, ten, and then it would look something like this. 00:11:38.000 |
We're taking ten characters to predict the eleventh one, 00:11:43.000 |
So let me bring this back to three just so that we have 00:11:50.000 |
And finally, the data set right now looks as follows. 00:11:53.000 |
From these five words, we have created a data set of 32 examples, 00:11:58.000 |
and each input to the neural net is three integers, 00:12:01.000 |
and we have a label that is also an integer, y. 00:12:12.000 |
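(A minimal sketch of this dataset-building loop, assuming `words` is the list of names and `stoi` maps characters to integers with the '.' token at index 0, as in the previous lecture:)

```python
import torch

def build_examples(words, stoi, block_size=3):
    X, Y = [], []
    for w in words:
        context = [0] * block_size           # start with a padded context of '.' tokens
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)                # the current context is the input
            Y.append(ix)                     # the next character is the label
            context = context[1:] + [ix]     # rolling window: crop and append
    return torch.tensor(X), torch.tensor(Y)

# X, Y = build_examples(words[:5], stoi)     # shapes (32, 3) and (32,) for the first five words
```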
So given this, let's now write a neural network that takes these x's 00:12:19.000 |
First, let's build the embedding lookup table C. 00:12:23.000 |
So we have 27 possible characters, and we're going to embed them in a lower-dimensional space. 00:12:31.000 |
In the paper, they have 17,000 words, and they embed them in spaces as small-dimensional as 30. 00:12:36.000 |
So they cram 17,000 words into 30-dimensional space. 00:12:40.000 |
In our case, we have only 27 possible characters, 00:12:43.000 |
so let's cram them in something as small as, to start with, 00:12:51.000 |
and we'll have 27 rows, and we'll have two columns. 00:12:56.000 |
So each one of 27 characters will have a two-dimensional embedding. 00:13:08.000 |
Now, before we embed all of the integers inside the input x 00:13:14.000 |
let me actually just try to embed a single individual integer, 00:13:22.000 |
Now, one way this works, of course, is we can just take the C, 00:13:28.000 |
and that gives us a vector, the fifth row of C. 00:13:34.000 |
The other way that I presented in the previous lecture 00:13:37.000 |
is actually seemingly different, but actually identical. 00:13:40.000 |
So in the previous lecture, what we did is we took these integers, 00:13:43.000 |
and we used the one-hot encoding to first encode them. 00:13:50.000 |
and we want to tell it that the number of classes is 27. 00:13:53.000 |
So that's the 27-dimensional vector of all zeros, except the fifth bit is turned on. 00:14:02.000 |
Now, this doesn't work right away; the reason is that this input actually must be a torch.tensor. 00:14:07.000 |
And I'm making some of these errors intentionally, 00:14:09.000 |
just so you get to see some errors and how to fix them. 00:14:18.000 |
So once we wrap the 5 in torch.tensor, we get a one-hot vector where the fifth dimension is 1, and the shape of this is 27. 00:14:22.000 |
And now notice that, just as I briefly alluded to in a previous video, 00:14:26.000 |
if we take this one-hot vector and we multiply it by C, 00:14:37.000 |
Well, number one, first you'd expect an error, 00:14:41.000 |
because expected scalar type long, but found float. 00:14:46.000 |
So a little bit confusing, but the problem here is that the one-hot vector comes out with a long datatype. 00:14:54.000 |
It's a 64-bit integer, but C is a float tensor. 00:14:57.000 |
And so PyTorch doesn't know how to multiply an int with a float, 00:15:01.000 |
and that's why we had to explicitly cast this to a float, 00:15:11.000 |
And it's identical because of the way the matrix multiplication here works. 00:15:15.000 |
We have the one-hot vector multiplying columns of C, 00:15:20.000 |
and because of all the zeros, they actually end up masking out 00:15:23.000 |
everything in C except for the fifth row, which is plucked out. 00:15:27.000 |
And so we actually arrive at the same result. 00:15:30.000 |
And that tells you that here we can interpret this first piece here, 00:15:34.000 |
this embedding of the integer, we can either think of it as 00:15:40.000 |
but equivalently we can also think of this little piece here 00:15:46.000 |
This layer here has neurons that have no nonlinearity. 00:15:50.000 |
There's no tanh. They're just linear neurons. 00:15:55.000 |
And then we are encoding integers into one-hot 00:16:03.000 |
Those are two equivalent ways of doing the same thing. 00:16:06.000 |
We're just going to index because it's much, much faster, 00:16:08.000 |
and we're going to discard this interpretation of one-hot inputs 00:16:12.000 |
into neural nets, and we're just going to index integers 00:16:17.000 |
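(A small sketch of these two equivalent views, assuming a (27, 2) embedding matrix C:)

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g)

# View 1: indexing plucks out the fifth row of C directly.
emb_index = C[5]

# View 2: a one-hot vector times C masks out everything except row 5.
emb_onehot = F.one_hot(torch.tensor(5), num_classes=27).float() @ C

print(torch.allclose(emb_index, emb_onehot))  # True
```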
Now, embedding a single integer like 5 is easy enough. 00:16:21.000 |
We can simply ask PyTorch to retrieve the fifth row of C, 00:16:28.000 |
But how do we simultaneously embed all of these 32 by 3 integers 00:16:35.000 |
Luckily, PyTorch indexing is fairly flexible and quite powerful. 00:16:38.000 |
So it doesn't just work to ask for a single element 5 like this. 00:16:46.000 |
So, for example, we can get the rows 5, 6, and 7, 00:16:55.000 |
It can also be actually a tensor of integers, 00:17:06.000 |
In fact, we can also, for example, repeat row 7 00:17:11.000 |
and that same index will just get embedded multiple times here. 00:17:16.000 |
So here we are indexing with a one-dimensional tensor of integers, 00:17:25.000 |
Here we have a two-dimensional tensor of integers. 00:17:28.000 |
So we can simply just do C at x, and this just works. 00:17:34.000 |
And the shape of this is 32 by 3, which is the original shape, 00:17:39.000 |
and now for every one of those 32 by 3 integers, 00:17:49.000 |
Take, for example, index [13, 2]: example index 13, the second dimension of the context. 00:17:58.000 |
And so here, if we do C of x, which gives us that array, 00:18:03.000 |
and then we index into [13, 2] of that array, we get exactly the embedding of the integer stored at that position. 00:18:21.000 |
So basically, long story short, PyTorch indexing is awesome, 00:18:25.000 |
and to embed simultaneously all of the integers in x, 00:18:29.000 |
we can simply do C of x, and that is our embedding, 00:18:35.000 |
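(For instance, continuing the sketch above with a (32, 3) integer tensor X of contexts:)

```python
emb = C[X]                                    # shape (32, 3, 2): every integer becomes its 2-D embedding
print(emb.shape)                              # torch.Size([32, 3, 2])
print(torch.equal(emb[13, 2], C[X[13, 2]]))   # True: position [13, 2] holds that character's row of C
```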
Now let's construct this layer here, the hidden layer. 00:18:42.000 |
The weights of this hidden layer, W1, we will initialize randomly. 00:18:52.000 |
The number of inputs is 3 times 2, because the embeddings are two-dimensional and we have three of them, so the number of inputs is 6; the number of neurons in this layer is up to us, and we'll use 100. 00:19:03.000 |
And then biases will be also initialized randomly, 00:19:11.000 |
Now the problem with this is we can't simply take the embeddings, which are 32 by 3 by 2, 00:19:17.000 |
and multiply them with these weights, which are 6 by 100. 00:19:24.000 |
The problem here is that these embeddings 00:19:26.000 |
are stacked up in the dimensions of this input tensor. 00:19:29.000 |
So this matrix multiplication will not work; we somehow have to concatenate the three embeddings into a 32 by 6 tensor, 00:19:41.000 |
so that we can do something along these lines: emb @ W1 + b1. 00:19:54.000 |
I'd like to show you that there are usually many ways 00:19:57.000 |
of implementing what you'd like to do in Torch, 00:20:00.000 |
and some of them will be faster, better, shorter, etc. 00:20:04.000 |
And that's because Torch is a very large library with lots and lots of functions. 00:20:09.000 |
So if we just go to the documentation and click on Torch, you'll see that the page just scrolls and scrolls, 00:20:14.000 |
and that's because there are so many functions that you can apply to tensors: 00:20:18.000 |
to transform them, create them, multiply them, add them, 00:20:21.000 |
perform all kinds of different operations on them. 00:20:36.000 |
So if we look through the documentation, we see that there's a function, torch.cat, 00:20:40.000 |
And this concatenates a given sequence of tensors 00:20:45.000 |
and these tensors must have the same shape, etc. 00:20:50.000 |
So we could use torch.cat, in a naive way, to concatenate these three embeddings. 00:21:00.000 |
And really what we want to do is we want to retrieve the three embeddings, emb[:, 0, :], emb[:, 1, :], and emb[:, 2, :]. 00:21:36.000 |
And then we want to treat this as a sequence and pass it to torch.cat, 00:21:47.000 |
and then we have to tell it along which dimension to concatenate. 00:21:53.000 |
And we want to concatenate not across dimension 0, but across dimension 1. 00:22:01.000 |
And if we do that, we see that the shape of this is 32x6, exactly as we'd like. 00:22:13.000 |
However, this particular code would not generalize if we want to later change the block size. 00:22:28.000 |
Luckily, there turns out to be a function called unbind, 00:22:37.000 |
which returns a tuple of all slices along a given dimension. 00:22:44.000 |
And basically, when we call torch.unbind of emb along dimension 1, 00:22:56.000 |
this gives us a list of tensors exactly equivalent to indexing out the three embeddings manually. 00:23:19.000 |
But now it doesn't matter if we have block size 3 or 5 or 10. 00:23:28.000 |
there's actually a significantly better and more efficient way. 00:23:36.000 |
So let's create an array here of elements from 0 to 17. 00:23:47.000 |
It turns out that we can very quickly re-represent this as a differently sized n-dimensional tensor; 00:24:14.000 |
as long as the total number of elements multiplies out to be the same, this will just work. 00:24:18.000 |
And in PyTorch, this operation, calling .view(), is extremely efficient. 00:24:24.000 |
And the reason for that is that in each tensor, 00:24:27.000 |
there's something called the underlying storage, which is just the numbers stored as a one-dimensional vector in memory. 00:24:44.000 |
When we call .view(), we are manipulating some of the attributes of that tensor 00:24:47.000 |
that dictate how this one-dimensional sequence 00:24:50.000 |
is interpreted to be an n-dimensional tensor. 00:24:55.000 |
So no memory is being changed, copied, moved, or created; 00:25:03.000 |
only some of the internal attributes of the view of this tensor change. 00:25:30.000 |
And this is really just like a logical construct over that one-dimensional storage. 00:25:41.000 |
If you're curious, there's more to read on the internals of torch.Tensor and how this works. 00:25:48.000 |
And if I delete this and come back to our emb, its shape is 32 by 3 by 2. 00:25:56.000 |
But we can simply ask PyTorch to view this as a 32 by 6 tensor instead. 00:26:02.000 |
And the way this gets flattened into a 32x6 array is that the three two-dimensional embeddings simply end up side by side in each row. 00:26:14.000 |
And so that's basically the concatenation operation that we were after. 00:26:31.000 |
So long story short, we can actually just come here and call emb.view with emb.shape[0] and 6, 00:27:02.000 |
using emb.shape[0] so that we don't hard-code these numbers. 00:27:07.000 |
And this would work for any size of this emb. 00:27:12.000 |
Alternatively, when we pass -1, PyTorch will infer what this dimension should be, 00:27:16.000 |
because the number of elements must be the same, so it will derive that this must be 32, 00:27:22.000 |
or whatever else it is if emb is of a different size. 00:27:44.000 |
In contrast, torch.cat would create a whole new tensor, because there's no way to concatenate tensors just by manipulating the view attributes. 00:27:48.000 |
So that approach is less efficient and creates all kinds of new memory. 00:27:57.000 |
And here, to calculate h, we also want to apply torch.tanh to the result. 00:28:14.000 |
And that is basically this hidden layer of activations here 00:28:26.000 |
In particular, we want to make sure that the broadcasting 00:28:36.000 |
So we see that the addition here will broadcast these two. 00:28:39.000 |
And in particular, we have 32 by 100 broadcasting to 100. 00:28:59.000 |
So in this case, the correct thing will be happening 00:29:10.000 |
And it's always good practice to just make sure 00:29:13.000 |
so that you don't shoot yourself in the foot. 00:29:15.000 |
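(A sketch of this hidden layer, assuming emb is the (32, 3, 2) embedding tensor and g the generator from the sketches above:)

```python
import torch

W1 = torch.randn((6, 100), generator=g)    # 3 characters * 2-D embeddings = 6 inputs, 100 hidden neurons
b1 = torch.randn(100, generator=g)

# .view() reinterprets the same underlying storage as (32, 6); no memory is copied.
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (32, 6) @ (6, 100) -> (32, 100); b1 broadcasts across the rows
print(h.shape)                             # torch.Size([32, 100])
```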
And finally, let's create the final layer here. 00:29:25.000 |
The input to this layer is the 100 hidden activations, and the output number of neurons will be for us 27, 00:29:28.000 |
because we have 27 possible characters that come next. 00:29:35.000 |
So therefore, the logits, which are the outputs of this neural net, will be 32 by 27. 00:29:53.000 |
Now, exactly as we saw in the previous video, we want to take these logits and exponentiate them to get our fake counts, 00:30:00.000 |
and then we want to normalize them into a probability. 00:30:10.000 |
We normalize along the first dimension with keepdim set to True, exactly as in the previous video. 00:30:20.000 |
And you'll see that every row of prob sums to 1, 00:30:29.000 |
Now, of course, we have the actual letter that comes next, 00:30:34.000 |
which we created during the dataset creation. 00:30:40.000 |
That's the array Y, the identity of the next character in the sequence that we'd like to now predict. 00:30:45.000 |
So what we'd like to do now is, just as in the previous video, pluck out 00:30:52.000 |
the probability assigned to the correct character in each row. 00:31:00.000 |
For that we can use torch.arange(32), which is kind of like an iterator over numbers from 0 to 31, 00:31:06.000 |
and then we can index into prob in the following way: 00:31:09.000 |
prob[torch.arange(32), Y], where torch.arange(32) iterates the rows, 00:31:14.000 |
and then in each row, we grab the column given by Y. 00:31:30.000 |
Right now these probabilities are not great; for some of these characters they look decent, like this one is basically 0.2, 00:31:33.000 |
but it doesn't look very good at all for many other characters. 00:31:40.000 |
and so the network thinks that some of these are extremely unlikely. 00:31:43.000 |
But of course, we haven't trained the neural network yet. 00:31:46.000 |
This will improve, and ideally, all of these numbers here, of course, are 1, 00:31:51.000 |
because then we are correctly predicting the next character. 00:31:54.000 |
Now, just as in the previous video, we want to take these probabilities, 00:31:59.000 |
and then we want to look at the average log probability 00:32:02.000 |
and the negative of it to create the negative log likelihood loss. 00:32:10.000 |
and this is the loss that we'd like to minimize 00:32:12.000 |
to get the network to predict the correct character in the sequence. 00:32:16.000 |
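(A sketch of the output layer and this hand-rolled loss, for educational purposes only, continuing from the tensors above with Y as the labels:)

```python
import torch

W2 = torch.randn((100, 27), generator=g)
b2 = torch.randn(27, generator=g)

logits = h @ W2 + b2                          # (32, 27)
counts = logits.exp()                         # fake counts, as in the bigram model
prob = counts / counts.sum(1, keepdim=True)   # every row sums to 1

# Pluck out the probability assigned to the correct next character in each row,
# take the log, average, and negate: the negative log likelihood loss.
loss = -prob[torch.arange(32), Y].log().mean()
print(loss)
```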
Okay, so I rewrote everything here and made it a bit more respectable. 00:32:24.000 |
I'm now using a generator to make it reproducible. 00:32:27.000 |
I clustered all the parameters into a single list of parameters 00:32:30.000 |
so that, for example, it's easy to count them 00:32:33.000 |
and see that in total we currently have about 3,400 parameters. 00:32:36.000 |
And this is the forward pass as we developed it, 00:32:39.000 |
and we arrive at a single number here, the loss, 00:32:42.000 |
that is currently expressing how well this neural network works 00:32:48.000 |
Now I would like to make it even more respectable. 00:32:53.000 |
In particular, I'm looking at this last part of the forward pass, where we take the logits and we calculate the loss. 00:32:57.000 |
We're not actually reinventing the wheel here. 00:33:00.000 |
This is just classification, and many people use classification, 00:33:04.000 |
and that's why there is an F.cross_entropy function in PyTorch (torch.nn.functional.cross_entropy), 00:33:13.000 |
and we can pass in the logits, and we can pass in the array of targets, y, 00:33:22.000 |
So in fact, we can simply put this here and erase these three lines, 00:33:27.000 |
and we're going to get the exact same result. 00:33:30.000 |
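(In other words, those three lines collapse into one call; a sketch:)

```python
import torch.nn.functional as F

loss = F.cross_entropy(logits, Y)   # same value as the manual counts/prob/log version above
```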
Now there are actually many good reasons to prefer F.cross_entropy 00:33:33.000 |
over rolling your own implementation like this. 00:33:36.000 |
I did this for educational reasons, but you'd never use this in practice. 00:33:43.000 |
Number one, when you use F.cross_entropy, PyTorch will not actually create all these intermediate tensors, 00:33:49.000 |
because these are all new tensors in memory, and all this is fairly inefficient to run like this. 00:33:52.000 |
Instead, PyTorch will cluster up all these operations and very often use fused kernels 00:33:58.000 |
that very efficiently evaluate these expressions 00:34:01.000 |
that are sort of like clustered mathematical operations. 00:34:04.000 |
Number two, the backward pass can be made much more efficient, 00:34:13.000 |
and not just because it's a fused kernel, but also because analytically it's often a very much simpler backward pass to implement. 00:34:22.000 |
We saw this with micrograd: the forward pass of this operation to calculate the tanh 00:34:25.000 |
was actually a fairly complicated mathematical expression. 00:34:28.000 |
But because it's a clustered mathematical expression, 00:34:31.000 |
when we did the backward pass, we didn't individually backward 00:34:34.000 |
through the exp and the 2x and the -1 and the division, etc.; we just wrote 1 - t**2, 00:34:40.000 |
and that's a much simpler mathematical expression. 00:34:43.000 |
And we were able to do this because we're able to reuse calculations 00:34:46.000 |
and because we are able to mathematically and analytically 00:34:49.000 |
derive the derivative, and often that expression simplifies mathematically, so there's much less to implement. 00:35:00.000 |
So not only can the backward pass be more efficient because it runs in a fused kernel, but also because the expressions can take a much simpler form mathematically. 00:35:09.000 |
Number three, F.cross_entropy can also be significantly more numerically well-behaved. 00:35:15.000 |
Let me show you an example of how this works. 00:35:19.000 |
Suppose we have a logit of -2, 3, -3, 0, and 5, 00:35:24.000 |
and then we are taking the exponent of it and normalizing it to sum to 1. 00:35:27.000 |
So when logits take on this value, everything is well and good, 00:35:33.000 |
Now consider what happens when some of these logits take on more extreme values, 00:35:37.000 |
and that can happen during optimization of the neural network. 00:35:40.000 |
Suppose that some of these numbers grow very negative, 00:35:43.000 |
like say -100, then actually everything will come out fine. 00:35:47.000 |
We still get probabilities that are well-behaved, 00:35:56.000 |
But if you have very positive logits, like say +100 in here, 00:36:00.000 |
you actually start to run into trouble, and we get not a number here. 00:36:04.000 |
And the reason for that is that these counts have an inf here. 00:36:10.000 |
So if you pass in a very negative number to exp, 00:36:13.000 |
you just get a very small number, very near 0, and that's fine. 00:36:21.000 |
But if you pass in a very positive number, suddenly we run out of range in our floating-point number 00:36:28.000 |
So basically we're taking e and we're raising it to the power of 100, 00:36:32.000 |
and that gives us inf, because we've run out of dynamic range 00:36:38.000 |
And so we cannot pass very large logits through this expression. 00:36:44.000 |
Now let me reset these numbers to something reasonable. 00:36:50.000 |
you see how we have a well-behaved result here? 00:36:53.000 |
It turns out that because of the normalization here, 00:36:56.000 |
you can actually offset logits by any arbitrary constant value that you want. 00:37:00.000 |
So if I add 1 here, you actually get the exact same result. 00:37:08.000 |
Any offset will produce the exact same probabilities. 00:37:15.000 |
Now, because very negative logits are safe but very positive ones can actually overflow this exp, 00:37:18.000 |
what PyTorch does is it internally calculates the maximum value 00:37:21.000 |
that occurs in the logits, and it subtracts it. 00:37:27.000 |
And so therefore the greatest number in logits will become 0, 00:37:30.000 |
and all the other numbers will become some negative numbers. 00:37:33.000 |
And then the result of this is always well-behaved. 00:37:36.000 |
So even if we have 100 here, previously, not good. 00:37:40.000 |
But because PyTorch will subtract 100, this will work. 00:37:44.000 |
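(A quick illustration of the overflow and of the max-subtraction trick; these particular numbers are just made up for the demo:)

```python
import torch

logits = torch.tensor([-2.0, 3.0, -3.0, 0.0, 100.0])

counts = logits.exp()                  # exp(100) overflows float32 to inf
print(counts / counts.sum())           # contains nan

stable = logits - logits.max()         # what PyTorch does internally: subtract the max logit
counts = stable.exp()                  # the largest logit becomes 0, the rest are negative
print(counts / counts.sum())           # well-behaved probabilities, same distribution
```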
And so there's many good reasons to call cross-entropy. 00:37:48.000 |
Number one, the forward pass can be much more efficient. 00:37:51.000 |
The backward pass can be much more efficient. 00:37:53.000 |
And also things can be much more numerically well-behaved. 00:37:56.000 |
Okay, so let's now set up the training of this neural net. 00:38:05.000 |
We have the forward pass as before, except we don't need the manual loss lines anymore; instead, we have that loss is equal to F.cross_entropy(logits, Y). 00:38:15.000 |
Then for the backward pass: for p in parameters, we want to make sure that p.grad is None, 00:38:18.000 |
which is the same as setting it to 0 in PyTorch. 00:38:21.000 |
And then loss.backward to populate those gradients. 00:38:24.000 |
Once we have the gradients, we can do the parameter update. 00:38:27.000 |
So for p in parameters, we want to take p.data, 00:38:31.000 |
and we want to nudge it by negative learning rate times p.grad. 00:38:49.000 |
Now, this won't suffice, and it will create an error, 00:38:51.000 |
because we also have to go, for p in parameters, 00:38:54.000 |
and make sure that p.requires_grad is set to True in PyTorch. 00:39:04.000 |
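(Putting it together, a sketch of this overfitting loop, assuming parameters = [C, W1, b1, W2, b2] and the 32-example X, Y from before:)

```python
import torch
import torch.nn.functional as F

for p in parameters:
    p.requires_grad = True

for _ in range(1000):
    # forward pass
    emb = C[X]                                   # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)    # (32, 100)
    logits = h @ W2 + b2                         # (32, 27)
    loss = F.cross_entropy(logits, Y)

    # backward pass
    for p in parameters:
        p.grad = None                            # same as zeroing the gradients
    loss.backward()

    # update
    for p in parameters:
        p.data += -0.1 * p.grad                  # learning rate of 0.1

print(loss.item())
```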
Okay, so we started off with loss of 17, and we're decreasing it. 00:39:10.000 |
And you see how the loss decreases a lot here. 00:39:13.000 |
So if we just run for 1,000 times, we get a very, very low loss, 00:39:21.000 |
and that means that we're making very good predictions. 00:39:23.000 |
Now, the reason that this is so straightforward right now 00:39:27.000 |
is because we're only overfitting 32 examples. 00:39:32.000 |
So we only have 32 examples of the first five words, 00:39:36.000 |
and therefore it's very easy to make this neural net fit only these 32 examples 00:39:41.000 |
because we have 3,400 parameters and only 32 examples. 00:39:46.000 |
So we're doing what's called overfitting a single batch of the data 00:39:50.000 |
and getting a very low loss and good predictions. 00:39:54.000 |
But that's just because we have so many parameters for so few examples, so it's easy to make the loss very low. 00:40:02.000 |
Still, we can't get the loss all the way down to zero, and to see the reason for that we can, for example, look at the logits, 00:40:08.000 |
and we can look at the max along the first dimension. 00:40:13.000 |
And in PyTorch, max reports both the actual values 00:40:17.000 |
that take on the maximum number but also the indices of these. 00:40:22.000 |
And you'll see that the indices are very close to the labels, 00:40:31.000 |
But not everywhere; in this very first example, for instance, the predicted index is 19, but the label is 5. 00:40:40.000 |
And we can't fix cases like that, because the very first or the zeroth index is the example 00:40:43.000 |
where dot, dot, dot is supposed to predict E, 00:40:45.000 |
but you see how dot, dot, dot is also supposed to predict an O, 00:40:49.000 |
and dot, dot, dot is also supposed to predict an I and an S as well. 00:40:53.000 |
And so basically E, O, A, I, or S are all possible outcomes in the training set for the exact same input. 00:41:00.000 |
So we're not able to completely overfit and make the loss be exactly zero, 00:41:09.000 |
but we get very close in the cases where there's a unique input for a unique output. 00:41:15.000 |
In those cases, we basically get the exact correct result. 00:41:19.000 |
So now all we have to do is we just need to make sure 00:41:22.000 |
that we read in the full dataset and optimize the neural net. 00:41:25.000 |
Okay, so let's swing back up where we created the dataset, 00:41:29.000 |
and we see that here we only use the first five words. 00:41:32.000 |
So let me now erase this, and let me erase the print statements, 00:41:38.000 |
And so when we process the full dataset of all the words, 00:41:41.000 |
we now have 228,000 examples instead of just 32. 00:41:49.000 |
We initialize the weights, the same number of parameters. 00:41:54.000 |
And then let's push this print of loss.item() to be here, 00:41:58.000 |
and let's just see how the optimization goes if we run this. 00:42:06.000 |
and then as we're optimizing, the loss is coming down. 00:42:12.000 |
But you'll notice that it takes quite a bit of time 00:42:14.000 |
for every single iteration, so let's actually address that 00:42:17.000 |
because we're doing way too much work forwarding 00:42:22.000 |
In practice, what people usually do is they perform forward 00:42:26.000 |
and backward pass and update on many batches of the data. 00:42:29.000 |
So what we will want to do is we want to randomly select 00:42:32.000 |
some portion of the dataset, and that's a mini batch, 00:42:35.000 |
and then only forward, backward, and update on that little mini batch, 00:42:42.000 |
So in PyTorch, we can, for example, use torch.randint. 00:42:45.000 |
We can generate numbers between 0 and 5 and make 32 of them. 00:42:52.000 |
I believe the size has to be a tuple in PyTorch. 00:42:57.000 |
So we can have a tuple, 32 of numbers between 0 and 5, 00:43:05.000 |
And so this creates integers that index into our dataset, 00:43:11.000 |
So if our mini batch size is 32, then we can come here and generate 32 random integers between 0 and X.shape[0]. 00:43:20.000 |
So integers that we want to optimize in this single iteration 00:43:25.000 |
are in the ix, and then we want to index into x with ix 00:43:34.000 |
So we're only getting 32 rows of x, and therefore embeddings 00:43:37.000 |
will again be 32 by 3 by 2, not 200,000 by 3 by 2. 00:43:43.000 |
And then this ix has to be used not just to index into x, but also to index into y. 00:43:50.000 |
And now this should be mini batches, and this should be much, much faster. 00:43:58.000 |
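(A sketch of the minibatch version, assuming X and Y now hold the full dataset and the same parameters as above:)

```python
for _ in range(10000):
    # sample a minibatch of 32 example indices
    ix = torch.randint(0, X.shape[0], (32,))

    # forward pass on just those 32 examples
    emb = C[X[ix]]                               # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])        # the labels are indexed with the same ix

    # backward pass and update, as before
    for p in parameters:
        p.grad = None
    loss.backward()
    for p in parameters:
        p.data += -0.1 * p.grad
```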
So this way we can run many, many examples nearly instantly 00:44:06.000 |
Now because we're only dealing with mini batches, 00:44:08.000 |
the quality of our gradient is lower, so the direction is not as reliable. 00:44:17.000 |
But the gradient direction is good enough, even when it's estimated on only 32 examples, that it is useful. 00:44:22.000 |
And so it's much better to have an approximate gradient 00:44:25.000 |
and just make more steps than it is to evaluate the exact gradient and take fewer steps. 00:44:31.000 |
So that's why in practice this works quite well. 00:44:50.000 |
However, this is only the loss for that mini batch. 00:44:52.000 |
So let's actually evaluate the loss here for all of x and for all of y, 00:45:00.000 |
just so we have a full sense of exactly how well the model is doing right now. 00:45:05.000 |
So right now we're at about 2.7 on the entire training set. 00:45:22.000 |
Okay, so one issue, of course, is we don't know if we're stepping too slow or too fast. 00:45:31.000 |
So one question is, how do you determine this learning rate? 00:45:35.000 |
And how do we gain confidence that we're stepping in the right sort of speed? 00:45:40.000 |
So I'll show you one way to determine a reasonable learning rate. 00:45:44.000 |
Let's reset our parameters to the initial settings. 00:45:51.000 |
And now let's print in every step, but let's only do 10 steps or so, 00:46:01.000 |
We want to find a very reasonable search range, if you will. 00:46:05.000 |
So, for example, if this is very low, then we see that the loss is barely decreasing. 00:46:18.000 |
Okay, so we're decreasing the loss, but not very quickly. 00:46:25.000 |
And now let's try to find the place at which the loss kind of explodes. 00:46:33.000 |
Okay, we see that we're minimizing the loss, but you see how it's kind of unstable. 00:46:40.000 |
So negative 1 is probably like a fast learning rate. 00:46:55.000 |
So, therefore, negative 1 was like somewhat reasonable if I reset. 00:47:00.000 |
So I'm thinking that the right learning rate is somewhere between negative 0.001 and negative 1. 00:47:08.000 |
So the way we can do this here is we can use torch.linspace. 00:47:13.000 |
And we want to basically do something like this, between 0.001 and 1. 00:47:16.000 |
But -- oh, number of steps is one more parameter that's required. 00:47:24.000 |
This creates 1,000 numbers between 0.001 and 1. 00:47:29.000 |
But it doesn't really make sense to step between these linearly. 00:47:32.000 |
So instead, let me create learning rate exponent. 00:47:36.000 |
And instead of 0.001, this will be a negative 3, and this will be a 0. 00:47:41.000 |
And then the actual LRs that we want to search over are going to be 10 to the power of LRE. 00:47:48.000 |
So now what we're doing is we're stepping linearly between the exponents of these learning rates. 00:47:52.000 |
This is 0.001, and this is 1, because 10 to the power of 0 is 1. 00:47:58.000 |
And therefore, we are spaced exponentially in this interval. 00:48:02.000 |
So these are the candidate learning rates that we want to sort of like search over, roughly. 00:48:08.000 |
So now what we're going to do is here we are going to run the optimization for 1,000 steps. 00:48:15.000 |
And instead of using a fixed number, we are going to use learning rate indexing into here, LRs of i, and make this i. 00:48:26.000 |
So basically, let me reset this to be, again, starting from random, 00:48:31.000 |
creating these learning rates between 0.001 and 1, but exponentially stepped. 00:48:40.000 |
And here what we're doing is we're iterating 1,000 times. 00:48:44.000 |
We're going to use the learning rate that's in the beginning very, very low. 00:48:49.000 |
In the beginning, it's going to be 0.001, but by the end, it's going to be 1. 00:48:54.000 |
And we're going to step with that learning rate. 00:48:57.000 |
And now what we want to do is we want to keep track of the learning rates that we used, 00:49:06.000 |
and we want to look at the losses that resulted. 00:49:30.000 |
And so basically, we started with a very low learning rate, and we went all the way up to a learning rate of -1. 00:49:36.000 |
And now what we can do is we can plt.plot, and we can plot the two. 00:49:40.000 |
So we can plot the learning rates on the x-axis and the losses we saw on the y-axis. 00:49:46.000 |
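(A sketch of this learning-rate sweep, stepping linearly in the exponent and tracking the losses, under the same assumptions as the training loops above:)

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

lre = torch.linspace(-3, 0, 1000)   # exponents from -3 to 0
lrs = 10 ** lre                     # learning rates from 0.001 to 1, exponentially spaced

lri, lossi = [], []
for i in range(1000):
    ix = torch.randint(0, X.shape[0], (32,))
    emb = C[X[ix]]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])

    for p in parameters:
        p.grad = None
    loss.backward()
    for p in parameters:
        p.data += -lrs[i] * p.grad

    lri.append(lre[i].item())       # track the exponent rather than the raw learning rate
    lossi.append(loss.item())

plt.plot(lri, lossi)                # the valley before the explosion marks a reasonable exponent
```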
And often, you're going to find that your plot looks something like this, 00:49:50.000 |
where in the beginning, you had very low learning rates. 00:49:53.000 |
So basically, anything--barely anything happened. 00:50:00.000 |
And then as we increased the learning rate enough, we basically started to be kind of unstable here. 00:50:06.000 |
So a good learning rate turns out to be somewhere around here. 00:50:10.000 |
And because we have LRI here, we actually may want to do not LR--not the learning rate, but the exponent. 00:50:22.000 |
So that would be the LRE at i is maybe what we want to log. 00:50:26.000 |
So let me reset this and redo that calculation. 00:50:30.000 |
But now on the x-axis, we have the exponent of the learning rate. 00:50:35.000 |
And so we can see the exponent of the learning rate that is good to use. 00:50:38.000 |
It would be sort of like roughly in the valley here, because here the learning rates are just way too low. 00:50:43.000 |
And then here, we expect relatively good learning rates somewhere here. 00:50:47.000 |
And then here, things are starting to explode. 00:50:49.000 |
So somewhere around -1 as the exponent of the learning rate is a pretty good setting. 00:50:57.000 |
So 0.1 was actually a fairly good learning rate around here. 00:51:02.000 |
And that's what we had in the initial setting. 00:51:05.000 |
But that's roughly how you would determine it. 00:51:08.000 |
And so here now we can take out the tracking of these. 00:51:13.000 |
And we can just simply set LR to be 10 to the -1, or basically otherwise 0.1, as it was before. 00:51:21.000 |
And now we have some confidence that this is actually a fairly good learning rate. 00:51:24.000 |
And so now what we can do is we can crank up the iterations. 00:51:31.000 |
And we can run for a pretty long time using this learning rate. 00:52:02.000 |
At this point people often apply a learning rate decay; what this means is we're going to take our learning rate and we're going to 10x lower it. 00:52:07.000 |
And so we're at the late stages of training, potentially, and we may want to go a bit slower. 00:52:12.000 |
Let's do one more, actually, at 0.1, just to see if we're making a dent here. 00:52:20.000 |
And by the way, the bigram loss that we achieved last video was 2.45. 00:52:29.000 |
And once I get a sense that this is actually kind of starting to plateau off, people like to do, as I mentioned, this learning rate decay. 00:52:35.000 |
So let's try to decay the loss, the learning rate, I mean. 00:52:46.000 |
Obviously, this is janky and not exactly how you would train it in production, but this is roughly what you're going through. 00:52:52.000 |
You first find a decent learning rate using the approach that I showed you. 00:52:55.000 |
Then you start with that learning rate and you train for a while. 00:52:59.000 |
And then at the end, people like to do a learning rate decay, where you decay the learning rate by, say, a factor of 10, and you do a few more steps. 00:53:05.000 |
And then you get a trained network, roughly speaking. 00:53:08.000 |
So we've achieved 2.3 and dramatically improved on the bigram language model using this simple neural net, as described here, using these 3,400 parameters. 00:53:20.000 |
Now, there's something we have to be careful with. 00:53:22.000 |
I said that we have a better model because we are achieving a lower loss, 2.3, much lower than 2.45 with the bigram model previously. 00:53:32.000 |
That's not exactly the right way to judge it, and the reason is that this is actually a fairly small model, but these models can get larger and larger if you keep adding neurons and parameters. 00:53:43.000 |
So you can imagine a setting where we don't have just a few thousand parameters. 00:53:46.000 |
We could have 10,000 or 100,000 or millions of parameters. 00:53:49.000 |
And as the capacity of the neural network grows, it becomes more and more capable of overfitting your training set. 00:53:56.000 |
What that means is that the loss on the training set, on the data that you're training on, will become very, very low, as low as zero. 00:54:04.000 |
But all that the model is doing is memorizing your training set verbatim. 00:54:08.000 |
So if you take that model and it looks like it's working really well, but you try to sample from it, you will basically only get examples exactly as they are in the training set. 00:54:19.000 |
In addition to that, if you try to evaluate the loss on some withheld names or other words, you will actually see that the loss on those can be very high. 00:54:31.000 |
So the standard in the field is to split up your data set into three splits, as we call them. 00:54:36.000 |
We have the training split, the dev split or the validation split, and the test split. 00:54:42.000 |
So training split, dev or validation split, and test split. 00:54:51.000 |
And typically, the training split would be, say, 80% of your data set, the dev split 10%, and the test split 10%. 00:55:01.000 |
Now, these 80% of the data set, the training set, is used to optimize the parameters of the model, just like we're doing here, using gradient descent. 00:55:10.000 |
These 10% of the examples, the dev or validation split, they're used for development over all the hyperparameters of your model. 00:55:19.000 |
So hyperparameters are, for example, the size of this hidden layer, the size of the embedding. 00:55:24.000 |
So this is 100 or a 2 for us, but we could try different things. 00:55:28.000 |
The strength of the regularization, which we aren't using yet so far. 00:55:32.000 |
So there's lots of different hyperparameters and settings that go into defining a neural net. 00:55:36.000 |
And you can try many different variations of them and see whichever one works best on your validation split. 00:55:49.000 |
And test split is used to evaluate, basically, the performance of the model at the end. 00:55:54.000 |
So we're only evaluating the loss on the test split very, very sparingly and very few times, 00:55:59.000 |
because every single time you evaluate your test loss and you learn something from it, 00:56:04.000 |
you are basically starting to also train on the test split. 00:56:08.000 |
So you are only allowed to test the loss on the test set very, very few times. 00:56:14.000 |
Otherwise, you risk overfitting to it as well as you experiment on your model. 00:56:19.000 |
So let's also split up our training data into train, dev, and test. 00:56:24.000 |
And then we are going to train on train and only evaluate on test very, very sparingly. 00:56:31.000 |
Here is where we took all the words and put them into x and y tensors. 00:56:36.000 |
So instead, let me create a new cell here, and let me just copy/paste some code here, 00:56:41.000 |
because I don't think it's that complex, but we're going to try to save a little bit of time. 00:56:50.000 |
And this function takes some list of words and builds the arrays x and y for those words only. 00:56:56.000 |
And then here, I am shuffling up all the words. 00:57:05.000 |
And then we're going to set n1 to be the number of examples that is 80% of the words, and n2 to be 90% of the words. 00:57:16.000 |
So basically, if length of words is 32,000, n1 is--well, sorry, I should probably run this. 00:57:28.000 |
And so here we see that I'm calling build_dataset to build the training set x and y from the words up to n1. 00:57:36.000 |
So we're going to have only 25,000 training words. 00:57:40.000 |
And then we're going to have roughly n2 minus n1, 3,000 validation examples or dev examples. 00:57:50.000 |
And we're going to have length of words basically minus n2 or 3,204 examples here for the test set. 00:58:03.000 |
So now we have x's and y's for all those three splits. 00:58:13.000 |
Oh yeah, I'm printing their size here inside the function as well. 00:58:19.000 |
But here we don't have words, but these are already the individual examples made from those words. 00:58:27.000 |
And the data set now for training is more like this. 00:58:33.000 |
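(A sketch of this split, assuming build_dataset is the function just described and words, stoi, block_size are as before:)

```python
import random
import torch

def build_dataset(words):
    X, Y = [], []
    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]
    X, Y = torch.tensor(X), torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y

random.seed(42)
random.shuffle(words)
n1 = int(0.8 * len(words))    # first 80% of the words for training
n2 = int(0.9 * len(words))    # next 10% for dev/validation, last 10% for test

Xtr,  Ytr  = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte,  Yte  = build_dataset(words[n2:])
```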
And then when we reset the network, when we're training, we're only going to be training using x train, 00:59:11.000 |
You launch a bunch of jobs and you wait for them to finish. 00:59:24.000 |
Oh, we accidentally used a learning rate that is way too low. 00:59:37.000 |
And then here when we evaluate, let's use the dev set here, x dev and y dev to evaluate the loss. 00:59:47.000 |
Okay, and let's now decay the learning rate and only do say 10,000 examples. 01:00:01.000 |
And so the neural network when it was training did not see these dev examples. 01:00:07.000 |
And yet when we evaluate the loss on these dev, we actually get a pretty decent loss. 01:00:12.000 |
And so we can also look at what the loss is on all of the training set. 01:00:20.000 |
And so we see that the training and the dev loss are about equal. 01:00:26.000 |
This model is not powerful enough to just be purely memorizing the data. 01:00:30.000 |
And so far we are what's called underfitting because the training loss and the dev or test losses are roughly equal. 01:00:37.000 |
So what that typically means is that our network is very tiny, very small. 01:00:42.000 |
And we expect to make performance improvements by scaling up the size of this neural net. 01:00:48.000 |
So let's come over here and let's increase the size of the neural net. 01:00:52.000 |
The easiest way to do this is we can come here to the hidden layer, which currently is 100 neurons, and let's increase it to 300 neurons. 01:01:03.000 |
And then here we also have 300 inputs into the final layer. 01:01:10.000 |
We now have 10,000 parameters instead of 3,000 parameters. 01:01:18.000 |
And then here what I'd like to do is I'd like to actually keep track of that. 01:01:31.000 |
And here when we're keeping track of the loss, let's just also keep track of the steps. 01:01:51.000 |
And we should be able to run this and optimize the neural net. 01:01:57.000 |
And then here basically I want to plt.plot the steps against the loss. 01:02:11.000 |
And this is the loss function and how it's being optimized. 01:02:16.000 |
Now, you see that there's quite a bit of thickness to this. 01:02:19.000 |
And that's because we are optimizing over these mini-batches. 01:02:22.000 |
And the mini-batches create a little bit of noise in this. 01:02:29.000 |
So we still haven't optimized this neural net very well. 01:02:32.000 |
And that's probably because we made it bigger. 01:02:34.000 |
It might take longer for this neural net to converge. 01:02:46.000 |
One possibility is that the batch size is so low that we just have way too much noise in the training. 01:02:52.000 |
And we may want to increase the batch size so that we have a bit more correct gradient. 01:03:08.000 |
This will now become meaningless because we've reinitialized these. 01:03:16.000 |
But there probably is a tiny improvement, but it's so hard to tell. 01:03:25.000 |
Let's try to decrease the learning rate by a factor of two. 01:04:05.000 |
We basically expect to see a lower loss than what we had before. 01:04:08.000 |
Because now we have a much, much bigger model. 01:04:12.000 |
So we'd expect that increasing the size of the model should help the neural net. 01:04:19.000 |
Now, one other concern is that even though we've made the tanh hidden layer here much bigger, 01:04:25.000 |
it could be that the bottleneck of the network right now are these embeddings that are two-dimensional. 01:04:30.000 |
It can be that we're just cramming way too many characters into just two dimensions. 01:04:34.000 |
And the neural net is not able to really use that space effectively. 01:04:38.000 |
And that is sort of like the bottleneck to our network's performance. 01:04:44.000 |
So just by decreasing the learning rate, I was able to make quite a bit of progress. 01:04:52.000 |
And then evaluate the training and the dev loss. 01:04:57.000 |
Now, one more thing after training that I'd like to do is I'd like to visualize the embedding vectors for these characters 01:05:06.000 |
before we scale up the embedding size from 2. 01:05:10.000 |
Because we'd like to make this bottleneck potentially go away. 01:05:14.000 |
And once I make this greater than 2, we won't be able to visualize them. 01:05:24.000 |
And maybe the bottleneck now is the character embedding size, which is 2. 01:05:28.000 |
So here I have a bunch of code that will create a figure. 01:05:31.000 |
And then we're going to visualize the embeddings that were trained by the neural net on these characters. 01:05:38.000 |
Because right now the embedding size is just 2. 01:05:40.000 |
So we can visualize all the characters with the x and the y coordinates as the two embedding locations for each of these characters. 01:05:47.000 |
And so here are the x coordinates and the y coordinates, which are the columns of C. 01:05:53.000 |
And then for each one, I also include the text of the little character. 01:05:58.000 |
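(A sketch of this visualization, assuming C is the trained (27, 2) embedding table and itos maps integers back to characters:)

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
plt.scatter(C[:, 0].data, C[:, 1].data, s=200)
for i in range(C.shape[0]):
    # place each character's label at its two embedding coordinates
    plt.text(C[i, 0].item(), C[i, 1].item(), itos[i], ha="center", va="center", color="white")
plt.grid('minor')
plt.show()
```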
So here what we see is actually kind of interesting. 01:06:02.000 |
The network has basically learned to separate out the characters and cluster them a little bit. 01:06:07.000 |
So, for example, you see how the vowels, A, E, I, O, U, are clustered up here. 01:06:12.000 |
So what that's telling us is that the neural net treats these as very similar, right? 01:06:16.000 |
Because when they feed into the neural net, the embedding for all these characters is very similar. 01:06:22.000 |
And so the neural net thinks that they're very similar and kind of like interchangeable, if that makes sense. 01:06:29.000 |
Then the points that are like really far away are, for example, Q. 01:06:33.000 |
Q is kind of treated as an exception, and Q has a very special embedding vector, so to speak. 01:06:38.000 |
Similarly, dot, which is a special character, is all the way out here. 01:06:42.000 |
And a lot of the other letters are sort of like clustered up here. 01:06:46.000 |
And so it's kind of interesting that there's a little bit of structure here after the training. 01:06:51.000 |
And it's definitely not random, and these embeddings make sense. 01:06:55.000 |
So we're now going to scale up the embedding size and won't be able to visualize it directly. 01:07:00.000 |
But we expect that because we're underfitting and we made this layer much bigger and did not sufficiently improve the loss, 01:07:08.000 |
we're thinking that the constraint to better performance right now could be these embedding vectors. 01:07:18.000 |
And now we don't have two-dimensional embeddings. 01:07:20.000 |
We are going to have, say, 10-dimensional embeddings for each word. 01:07:25.000 |
Then this layer will receive 3 times 10, so 30 inputs will go into the hidden layer. 01:07:36.000 |
Let's also make the hidden layer a bit smaller. 01:07:37.000 |
So instead of 300, let's just do 200 neurons in that hidden layer. 01:07:42.000 |
So now the total number of parameters will be slightly bigger, at 11,000. 01:07:47.000 |
And then here we have to be a bit careful because, OK, the learning rate we set to 0.1, but we are also hardcoding a 6 here that now has to become 30. 01:07:55.000 |
And obviously, if you're working in production, you don't want to be hardcoding magic numbers like this. 01:08:07.000 |
And let me split out the initialization here outside so that when we run this cell multiple times, it's not going to wipe out our loss. 01:08:17.000 |
In addition to that, here, instead of logging loss.item(), let's actually log loss.log10().item(). 01:08:37.000 |
Basically I'd like to plot the log loss instead of the loss. 01:08:40.000 |
Because when you plot the loss, many times it can have this hockey stick appearance and log squashes it in. 01:08:49.000 |
So the x-axis is step i, and the y-axis will be the loss i. 01:09:12.000 |
It's, again, very thick because the mini batch size is very small. 01:09:15.000 |
But the total loss over the training set is 2.3, and the test -- or the dev set is 2.38 as well. 01:09:23.000 |
Let's try to now decrease the learning rate by a factor of 10. 01:09:35.000 |
We'd hope that we would be able to beat 2.32. 01:09:43.000 |
But, again, we're just kind of, like, doing this very haphazardly. 01:09:46.000 |
So I don't actually have confidence that our learning rate is set very well, that our learning rate decay, which we just do at random, is set very well. 01:09:55.000 |
And so the optimization here is kind of suspect, to be honest. 01:09:59.000 |
And this is not how you would do it typically in production. 01:10:01.000 |
In production you would create parameters or hyperparameters out of all these settings, and then you would run lots of experiments and see whichever ones are working well for you. 01:10:17.000 |
So you see how the training and the validation performance are starting to slightly slowly depart. 01:10:23.000 |
So maybe we're getting the sense that the neural net is getting good enough or that number of parameters is large enough that we are slowly starting to overfit. 01:10:34.000 |
Let's maybe run one more iteration of this and see where we get. 01:10:41.000 |
But, yeah, basically you would be running lots of experiments and then you are slowly scrutinizing whichever ones give you the best dev performance. 01:10:48.000 |
And then once you find all the hyperparameters that make your dev performance good, you take that model and you evaluate the test set performance a single time. 01:10:57.000 |
And that's the number that you report in your paper or wherever else you want to talk about and brag about your model. 01:11:05.000 |
So let's then rerun the plot and rerun the train and dev. 01:11:11.000 |
And because we're getting lower loss now, it is the case that the embedding size of these was holding us back very likely. 01:11:30.000 |
We can continue, for example, playing with the size of the neural net. 01:11:33.000 |
Or we can increase the number of words or characters in our case that we are taking as an input. 01:11:39.000 |
So instead of just three characters, we could be taking more characters as an input. 01:11:48.000 |
So we have here 200,000 steps of the optimization. 01:11:51.000 |
And in the first 100,000, we're using a learning rate of .1. 01:11:54.000 |
And then in the next 100,000, we're using a learning rate of .01. 01:12:00.000 |
And these are the performance on the training and validation loss. 01:12:03.000 |
And in particular, the best validation loss I've been able to obtain in the last 30 minutes or so is 2.17. 01:12:12.000 |
And you have quite a few knobs available to you to, I think, surpass this number. 01:12:16.000 |
So number one, you can, of course, change the number of neurons in the hidden layer of this model. 01:12:21.000 |
You can change the dimensionality of the embedding lookup table. 01:12:25.000 |
You can change the number of characters that are feeding in as an input, as the context into this model. 01:12:31.000 |
And then, of course, you can change the details of the optimization. 01:12:41.000 |
You can change the batch size, and you may be able to actually achieve a much better convergence speed in terms of how many seconds or minutes it takes to train the model and get your result in terms of really good loss. 01:12:55.000 |
And then, of course, I actually invite you to read this paper. 01:12:58.000 |
It is 19 pages, but at this point, you should actually be able to read a good chunk of this paper and understand pretty good chunks of it. 01:13:06.000 |
And this paper also has quite a few ideas for improvements that you can play with. 01:13:11.000 |
So all of those are knobs available to you, and you should be able to beat this number. 01:13:15.000 |
I'm leaving that as an exercise to the reader. 01:13:17.000 |
And that's it for now, and I'll see you next time. 01:13:24.000 |
Before we wrap up, I also wanted to show how you would sample from the model. 01:13:31.000 |
At first, we begin with all dots, so that's the context. 01:13:35.000 |
And then until we generate the zeroth character again, we're going to embed the current context using the embedding table C. 01:13:46.000 |
Now, usually here, the first dimension was the size of the training set. 01:13:50.000 |
But here, we're only working with a single example that we're generating. 01:13:53.000 |
So this is just dimension one, just for simplicity. 01:13:58.000 |
And so this embedding then gets projected into the hidden state, which gives us the logits, and from those we need the probabilities. 01:14:05.000 |
For that, you can use F.softmax of the logits, and that just basically exponentiates the logits and makes them sum to one. 01:14:13.000 |
And similar to cross entropy, it is careful that there's no overflows. 01:14:18.000 |
Once we have the probabilities, we sample from them using torch.multinomial to get our next index. 01:14:23.000 |
And then we shift the context window to append the index and record it. 01:14:28.000 |
And then we can just decode all the integers to strings and print them out. 01:14:33.000 |
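(A sketch of this sampling loop, assuming the trained C, W1, b1, W2, b2, block_size = 3, and the itos mapping:)

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    out = []
    context = [0] * block_size                      # start with all '.' tokens
    while True:
        emb = C[torch.tensor([context])]            # (1, block_size, d): a single example
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)            # exponentiates and normalizes, without overflow
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]                # shift the rolling context window
        out.append(ix)
        if ix == 0:                                 # index 0 is the '.' end token
            break
    print(''.join(itos[i] for i in out))
```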
And so these are some example samples, and you can see that the model now works much better. 01:14:37.000 |
So the words here are much more word-like or name-like. 01:14:48.000 |
It's starting to sound a little bit more name-like. 01:14:51.000 |
So we're definitely making progress, but we can still improve on this model quite a lot. 01:14:57.000 |
I wanted to mention that I want to make these notebooks more accessible. 01:15:01.000 |
And so I don't want you to have to install Jupyter Notebooks and Torch and everything else. 01:15:05.000 |
So I will be sharing a link to a Google Colab. 01:15:09.000 |
And the Google Colab will look like a notebook in your browser. 01:15:13.000 |
And you can just go to a URL, and you'll be able to execute all of the code that you saw in the Google Colab. 01:15:19.000 |
And so this is me executing the code in this lecture, and I shortened it a little bit. 01:15:24.000 |
But basically, you're able to train the exact same network and then plot and sample from the model. 01:15:29.000 |
And everything is ready for you to tinker with the numbers right there in your browser, no installation necessary. 01:15:35.000 |
So I just wanted to point that out, and the link to this will be in the video description.