The spelled-out intro to language modeling: building makemore
Chapters
0:00 intro
3:03 reading and exploring the dataset
6:24 exploring the bigrams in the dataset
9:24 counting bigrams in a python dictionary
12:45 counting bigrams in a 2D torch tensor ("training the model")
18:19 visualizing the bigram tensor
20:54 deleting spurious (S) and (E) tokens in favor of a single . token
24:02 sampling from the model
36:17 efficiency! vectorized normalization of the rows, tensor broadcasting
50:14 loss function (the negative log likelihood of the data under our model)
60:50 model smoothing with fake counts
62:57 PART 2: the neural network approach: intro
65:26 creating the bigram dataset for the neural net
70:01 feeding integers into neural nets? one-hot encodings
73:53 the "neural net": one linear layer of neurons implemented with matrix multiplication
78:46 transforming neural net outputs into probabilities: the softmax
86:17 summary, preview to next steps, reference to micrograd
95:49 vectorized loss
98:36 backward and update, in PyTorch
102:55 putting everything together
107:49 note 1: one-hot encoding really just selects a row of the next Linear layer's weight matrix
110:18 note 2: model smoothing as regularization loss
114:31 sampling from the neural net
116:16 conclusion
00:00:02.480 |
And next up what I'd like to do is I'd like to build out Makemore. 00:00:06.320 |
Like micrograd before it, Makemore is a repository that I have on my GitHub web page. 00:00:12.760 |
But just like with micrograd, I'm going to build it out step by step, and I'm going to 00:00:17.920 |
spell everything out. So we're going to build it out slowly and together. 00:00:22.000 |
Makemore, as the name suggests, makes more of things that you give it. 00:00:32.800 |
And when you look at names.txt, you'll find that it's a very large dataset of names. 00:00:41.800 |
In fact, I believe there are 32,000 names that I've sort of found randomly on a government 00:00:48.120 |
website. And if you train Makemore on this dataset, it will learn to make more of things like 00:00:54.360 |
it. And in particular, in this case, that will mean more things that sound name-like, but are actually unique names. 00:01:02.560 |
And maybe if you have a baby and you're trying to assign a name, maybe you're looking for 00:01:05.680 |
a cool new sounding unique name, Makemore might help you. 00:01:09.780 |
So here are some example generations from the neural network once we train it on our 00:01:16.440 |
So here's some example unique names that it will generate. 00:01:25.840 |
And so all these sort of sound name-like, but they're not, of course, names. 00:01:30.900 |
So under the hood, Makemore is a character-level language model. 00:01:34.980 |
So what that means is that it is treating every single line here as an example. 00:01:39.880 |
And within each example, it's treating them all as sequences of individual characters. 00:01:50.800 |
And that's the level on which we are building out Makemore. 00:01:54.160 |
And what it means to be a character-level language model then is that it's just sort 00:01:58.440 |
of modeling those sequences of characters, and it knows how to predict the next character in the sequence. 00:02:03.840 |
Now we're actually going to implement a large number of character-level language models 00:02:08.160 |
in terms of the neural networks that are involved in predicting the next character in a sequence. 00:02:12.520 |
So very simple bigram and bag-of-word models, multilayered perceptrons, recurrent neural 00:02:17.720 |
networks, all the way to modern transformers. 00:02:19.840 |
In fact, the transformer that we will build will be basically equivalent to GPT-2. 00:02:30.840 |
And by the end of the series, you will actually understand how that works on the level of the individual characters. 00:02:36.000 |
Now, to give you a sense of the extensions here, after characters, we will probably spend 00:02:41.280 |
some time on the word level so that we can generate documents of words, not just little 00:02:45.640 |
segments of characters, but entire, much larger documents. 00:02:51.280 |
And then we're probably going to go into images and image-text networks, such as DALL-E, Stable Diffusion, and so on. 00:02:58.360 |
But for now, we have to start here, character-level language modeling. 00:03:03.660 |
So like before, we are starting with a completely blank Jupyter Notebook page. 00:03:07.240 |
The first thing is I would like to basically load up the dataset, names.txt. 00:03:12.040 |
So we're going to open up names.txt for reading, and we're going to read in everything into 00:03:20.060 |
a massive string. And then because it's a massive string, we'd only like the individual words, and we'd like to put them 00:03:24.760 |
in a list. So let's call splitlines on that string to get all of our words as a Python list of strings. 00:03:32.320 |
So basically, we can look at, for example, the first 10 words, and we have that it's a list of strings. 00:03:41.800 |
And if we look at the top of the page here, that is indeed what we see. 00:03:49.980 |
This list actually makes me feel that this is probably sorted by frequency. 00:03:58.560 |
Now we'd like to actually learn a little bit more about this dataset. 00:04:06.760 |
And then, for example, what is the shortest word? 00:04:13.880 |
So min of len of w for w in words tells us the shortest word will be length 2, and max of len of w for w in words gives us the length of the longest word. 00:04:24.900 |
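As a rough sketch (not the exact notebook cell), the loading and inspection described here would look something like this, assuming names.txt sits next to the notebook:

```python
# Load the dataset into a list of name strings (assumes names.txt is in the working directory).
words = open('names.txt', 'r').read().splitlines()

print(words[:10])                   # the first 10 names
print(len(words))                   # total number of names in the dataset
print(min(len(w) for w in words))   # length of the shortest name
print(max(len(w) for w in words))   # length of the longest name
```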
So let's now think through our very first language model. 00:04:27.600 |
As I mentioned, a character-level language model is predicting the next character in 00:04:31.280 |
a sequence given already some concrete sequence of characters before it. 00:04:36.800 |
Now what we have to realize here is that every single word here, like Isabella, is actually 00:04:41.560 |
quite a few examples packed into that single word. 00:04:45.880 |
Because what is the existence of a word like Isabella in the dataset telling us, really? 00:04:50.020 |
It's saying that the character i is a very likely character to come first in a name, that s is likely to come after i, that a is likely to come after is, and so on. 00:05:07.880 |
The character b is very likely to come after isa. 00:05:10.920 |
And so on, all the way to a following isabell. 00:05:14.680 |
And then there's one more example actually packed in here. 00:05:17.540 |
And that is that after there's Isabella, the word is very likely to end. 00:05:24.020 |
So that's one more sort of explicit piece of information that we have here, and that we have to take into account. 00:05:29.900 |
And so there's a lot packed into a single individual word in terms of the statistical 00:05:34.280 |
structure of what's likely to follow in these character sequences. 00:05:38.340 |
And then of course, we don't have just an individual word. 00:05:42.360 |
And so there's a lot of structure here to model. 00:05:45.080 |
Now in the beginning, what I'd like to start with is building a bigram language model. 00:05:51.520 |
Now in a bigram language model, we're always working with just two characters at a time. 00:05:56.920 |
So we're only looking at one character that we are given, and we're trying to predict the next character in the sequence. 00:06:04.080 |
So: what characters are likely to follow r, what characters are likely to follow a, and so on. 00:06:10.360 |
And we're just modeling that kind of a little local structure. 00:06:13.160 |
And we're forgetting the fact that we may have a lot more information. 00:06:17.000 |
We're always just looking at the previous character to predict the next one. 00:06:20.440 |
So it's a very simple and weak language model, but I think it's a great place to start. 00:06:24.360 |
So now let's begin by looking at these bigrams in our dataset and what they look like. 00:06:28.120 |
And these bigrams again are just two characters in a row. 00:06:31.160 |
So for w in words, each w here is an individual word, a string. 00:06:36.600 |
We want to iterate this word with consecutive characters. 00:06:43.880 |
So two characters at a time, sliding it through the word. 00:06:46.960 |
Now an interesting, nice, cute way to do this in Python, by the way, is doing something like this: 00:06:53.120 |
for character one, character two in zip of w and w at 1 colon -- that is, zip(w, w[1:]). 00:07:07.200 |
And I'm going to show you in a second how this works. 00:07:10.240 |
But for now, basically as an example, let's just do the very first word alone, Emma. 00:07:15.560 |
You see how we have emma here, and this will just print em, mm, ma. 00:07:21.200 |
And the reason this works is because w is the string emma, and w at 1 colon is the string 00:07:27.260 |
mma, and zip takes two iterators, pairs them up, and then creates an iterator over the tuples of their corresponding entries. 00:07:37.600 |
And if any one of them is shorter than the other, then it will just halt and return. 00:07:43.920 |
So basically that's why we get em, mm, ma; then, because this iterator, the 00:07:52.040 |
second one here, runs out of elements, zip just ends. 00:08:00.000 |
So these are the consecutive elements in the first word. 00:08:03.320 |
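A minimal sketch of the zip trick just described:

```python
w = 'emma'
# zip pairs each character with its successor and stops when the shorter iterable (w[1:]) runs out.
for ch1, ch2 in zip(w, w[1:]):
    print(ch1, ch2)   # prints: e m / m m / m a
```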
Now we have to be careful, because we actually have more information here than just these three bigrams. 00:08:08.080 |
As I mentioned, we know that e is very likely to come first. 00:08:12.080 |
And we know that a in this case is coming last. 00:08:15.840 |
So what I'm going to do this is basically we're going to create a special array here, 00:08:21.120 |
our characters, and we're going to hallucinate a special start token here. 00:08:32.680 |
So this is a list of one element plus w and then plus a special end character. 00:08:41.520 |
And the reason I'm wrapping the list of w here is because w is a string, Emma. 00:08:46.240 |
List of w will just have the individual characters in the list. 00:08:51.080 |
And then doing this again now, but not iterating over w, but over these characters, will give us something like this. 00:09:00.360 |
So this is a bigram of the start character and e, and this is a bigram of e and m, and so on. 00:09:09.320 |
And now we can look at, for example, what this looks like for Olivia or Ava. 00:09:13.600 |
And indeed, we can actually potentially do this for the entire dataset, but we won't print all of that because it would be too much. 00:09:20.800 |
But these are the individual character bigrams and we can print them. 00:09:25.160 |
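Sketching what that looks like with the hallucinated start and end tokens (shown here for just the first few words):

```python
for w in words[:3]:
    # Wrap each word with a special start and end token, as described above.
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        print(ch1, ch2)
```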
Now in order to learn the statistics about which characters are likely to follow other 00:09:29.400 |
characters, the simplest way in the bigram language models is to simply do it by counting. 00:09:34.680 |
So we're basically just going to count how often any one of these combinations occurs in the training set. 00:09:42.040 |
So we're going to need some kind of a dictionary that's going to maintain some counts for every one of these bigrams. 00:09:47.540 |
So let's use a dictionary B, and this will map these bigrams to counts. 00:09:52.920 |
So bigram is a tuple of character one, character two. 00:09:56.440 |
And then B at bigram will be B dot get of bigram, which is basically the same as B at bigram. 00:10:04.960 |
But in the case that the bigram is not in the dictionary B, we would like to by default get a count of zero, and then we add one to it. 00:10:13.320 |
So this will basically add up all the bigrams and count how often they occur. 00:10:18.540 |
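The counting loop being described might look roughly like this:

```python
# Count how often every bigram occurs across the entire dataset.
b = {}
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)
        b[bigram] = b.get(bigram, 0) + 1   # default to 0 if the bigram hasn't been seen yet
```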
Let's get rid of the printing, or rather let's keep the printing, and let's just inspect what B is in this case. 00:10:27.400 |
And we see that many bigrams occur just a single time. 00:10:38.160 |
This one, the bigram of a and the end token, occurred three times: all of Emma, Olivia, and Ava end with a, so that's why it occurred three times. 00:10:59.000 |
Let's just run and now B will have the statistics of the entire dataset. 00:11:04.360 |
So these are the counts across all the words of the individual bigrams. 00:11:08.680 |
And we could, for example, look at some of the most common ones and least common ones. 00:11:13.640 |
The simplest way I like to do this in Python is to call B dot items, which returns the key-value tuples. 00:11:25.560 |
In this case, the keys are the character bigrams and the values are the counts. 00:11:31.160 |
And so then what we want to do is we want to do sorted of this. 00:11:38.720 |
But by default, sort is on the first item of a tuple. 00:11:45.800 |
But we want to sort by the values, which are the second element of a tuple that is the 00:11:50.860 |
So we want to use the key equals lambda that takes the key value and returns the key value 00:12:00.080 |
at one, not at zero, but at one, which is the count. 00:12:04.160 |
So we want to sort by the count of these elements. 00:12:12.860 |
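A sketch of that sort, using a negated count so the most frequent bigrams come first:

```python
# Sort the (bigram, count) pairs by count, most frequent first.
sorted(b.items(), key=lambda kv: -kv[1])
```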
So here what we have is that, for example, the bigram of q and r occurs only a single time. 00:12:20.920 |
And when we sort this the other way around, we're going to see the most likely bigrams. 00:12:26.520 |
So we see that N was very often an ending character, many, many times. 00:12:31.960 |
And apparently n almost always follows an a, and that's a very likely combination as well. 00:12:38.880 |
So this is kind of the individual counts that we achieve over the entire dataset. 00:12:45.160 |
Now it's actually going to be significantly more convenient for us to keep this information 00:12:49.520 |
in a two-dimensional array instead of a Python dictionary. 00:12:53.800 |
So we're going to store this information in a 2D array, and the rows are going to be the 00:13:01.160 |
first character of the bigram, and the columns are going to be the second character. 00:13:05.280 |
And each entry in this two-dimensional array will tell us how often that second character follows the first character in the dataset. 00:13:12.920 |
So in particular, the array representation, or rather the library, that we're going to use is PyTorch. 00:13:18.440 |
And PyTorch is a deep learning neural network framework, but part of it is also this torch.tensor, 00:13:25.440 |
which allows us to create multidimensional arrays and manipulate them very efficiently. 00:13:30.000 |
So let's import PyTorch, which you can do by import torch. 00:13:37.740 |
So let's create an array of zeros with torch.zeros, and we give it a size for this array -- say, 3 by 5. 00:13:51.860 |
And by default, you'll notice a.dtype, which is short for datatype, is float32. 00:13:56.820 |
So these are single precision floating point numbers. 00:13:59.820 |
Because we are going to represent counts, let's actually use dtype as torch.int32. 00:14:10.340 |
So now you see that we have integer data inside this tensor. 00:14:14.820 |
Now tensors allow us to really manipulate all the individual entries and do it very 00:14:20.960 |
So for example, if we want to change this particular entry, we have to index into the tensor. 00:14:25.980 |
And in particular, because everything is zero-indexed, this here is row index 1, 00:14:33.000 |
and column index 0, 1, 2, 3 -- so column index 3, and we can do a at 1, 3 plus equals 1. 00:14:54.140 |
And also we can, for example, say a[0, 0] is 5, and then a will have a 5 over here. 00:15:03.480 |
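A small sketch of the tensor manipulation described here (the 3-by-5 shape is just an example size):

```python
import torch

a = torch.zeros((3, 5), dtype=torch.int32)   # 3x5 array of integer zeros
a[1, 3] += 1    # increment the entry at row index 1, column index 3
a[0, 0] = 5     # directly assign an entry
print(a)
```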
Now of course the array that we are interested in is much, much bigger. 00:15:06.420 |
So for our purposes, we have 26 letters of the alphabet, and then we have two special tokens, so 28 characters in total. 00:15:19.500 |
And let's call the array capital N, because it's going to represent the counts. 00:15:27.020 |
So that's an array that starts at all zeros, and it's 28 by 28. 00:15:34.860 |
But instead of having a dictionary b, which we're going to erase, we now have an n. 00:15:41.260 |
Now the problem here is that we have these characters, which are strings, but we have 00:15:44.580 |
to now basically index into an array, and we have to index using integers. 00:15:51.580 |
So we need some kind of a lookup table from characters to integers. 00:15:58.300 |
And the way we're going to do this is we're going to take all the words, which is a list 00:16:01.620 |
of strings, we're going to concatenate all of it into a massive string, so this is just 00:16:06.060 |
simply the entire dataset as a single string. 00:16:09.500 |
We're going to pass this to the set constructor, which takes this massive string and throws 00:16:15.320 |
out duplicates, because sets do not allow duplicates. 00:16:19.080 |
So set of this will just be the set of all the lowercase characters. 00:16:28.860 |
And now we actually don't want a set, we want a list. 00:16:32.900 |
But we don't want a list sorted in some weird arbitrary way, we want it to be sorted from a to z. 00:16:45.780 |
Now what we want is this lookup table, as I mentioned. 00:16:47.940 |
So let's create a special s to i, I will call it. 00:16:55.740 |
And this will be s mapped to i, for i, s in enumerate of these characters. 00:17:04.460 |
So enumerate basically gives us an iterator over the integer index and the actual element of the list. 00:17:12.100 |
And then we are mapping the character to the integer. 00:17:15.380 |
So s to i is a mapping from A to 0, B to 1, etc., all the way from Z to 25. 00:17:24.340 |
And that's going to be useful here, but we actually also have to specifically set that 00:17:27.760 |
s to i at S will be 26, and s to i at E will be 27, because z was 25. 00:17:37.780 |
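A sketch of that lookup table, with the two special tokens appended after z:

```python
# Character -> integer lookup table.
chars = sorted(list(set(''.join(words))))    # the 26 lowercase letters, sorted a..z
stoi = {s: i for i, s in enumerate(chars)}   # 'a' -> 0, ..., 'z' -> 25
stoi['<S>'] = 26
stoi['<E>'] = 27
```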
And now we can come here and we can map both character 1 and character 2 to their integers. 00:17:42.960 |
So ix1 will be s to i of character 1, and ix2 will be s to i of character 2. 00:17:49.620 |
And now we should be able to do this line, but using our array. 00:17:57.540 |
This is the two-dimensional array indexing I've shown you before. 00:18:00.820 |
And then it's just N at ix1, ix2 plus equals 1, because everything starts at 0. 00:18:06.380 |
So this should work and give us a large 28 by 28 array of all these counts. 00:18:14.480 |
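Putting the pieces together, the counting cell at this stage would look roughly like this (it reuses stoi from above):

```python
# Accumulate bigram counts into a 28x28 tensor.
N = torch.zeros((28, 28), dtype=torch.int32)
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1
```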
So if we print n, this is the array, but of course it looks ugly. 00:18:19.960 |
So let's erase this ugly mess and let's try to visualize it a bit more nicely. 00:18:25.000 |
So for that, we're going to use a library called matplotlib. 00:18:31.200 |
So we can do things like plt.imshow of the count array. 00:18:36.280 |
So this is the 28 by 28 array, and this is the structure. 00:18:41.140 |
But even this, I would say, is still pretty ugly. 00:18:44.040 |
So we're going to try to create a much nicer visualization of it, and I wrote a bunch of 00:18:49.880 |
The first thing we're going to need is to invert this mapping here, s to i. 00:18:56.760 |
So s to i is a mapping from s to i, and in i to s, we're going to reverse this dictionary. 00:19:03.200 |
So we iterate over all the items and just reverse the mapping. 00:19:06.760 |
So i to s maps inversely from 0 to a, 1 to b, etc. 00:19:14.400 |
And then here's the code that I came up with to try to make this a little bit nicer. 00:19:19.320 |
We create a figure, we plot n, and then we visualize a bunch of things later. 00:19:27.480 |
Let me just run it so you get a sense of what this is. 00:19:32.240 |
So you see here that we have the array spaced out, and every one of these is basically b 00:19:40.040 |
follows g zero times, b follows h 41 times, so a follows j 175 times. 00:19:48.160 |
And so what you can see that I'm doing here is first I show that entire array, and then 00:19:53.240 |
I iterate over all the individual little cells here, and I create a character string here, 00:19:59.440 |
which is the inverse mapping i to s of the integer i and the integer j. 00:20:04.680 |
So those are the bigrams in a character representation. 00:20:08.800 |
And then I plot just the bigram text, and then I plot the number of times that this bigram occurs. 00:20:16.240 |
Now the reason that there's a dot item here is because when you index into these arrays, 00:20:21.120 |
these are torch tensors, you see that we still get a tensor back. 00:20:26.160 |
So the type of this thing -- you'd think it would be just an integer, 149, but it's actually a torch tensor. 00:20:32.160 |
And so if you do dot item, then it will pop out that individual integer. 00:20:42.600 |
And these are just some options to make it look nice. 00:20:46.840 |
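The plotting code itself isn't spelled out in the transcript; here is a sketch that matches the description (the figure size and the 'Blues' colormap are my choices), reusing N and stoi from above:

```python
import matplotlib.pyplot as plt

itos = {i: s for s, i in stoi.items()}   # inverse mapping: integer -> character

plt.figure(figsize=(16, 16))
plt.imshow(N, cmap='Blues')
for i in range(28):
    for j in range(28):
        chstr = itos[i] + itos[j]   # the bigram as a two-character string
        plt.text(j, i, chstr, ha='center', va='bottom', color='gray')
        plt.text(j, i, N[i, j].item(), ha='center', va='top', color='gray')   # .item() pops out the Python int
plt.axis('off')
```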
We have all these counts, and we see that some of them occur often, and some of them do not occur often at all. 00:20:54.120 |
Now if you scrutinize this carefully, you will notice that we're not actually being very clever. 00:20:58.840 |
That's because when you come over here, you'll notice that, for example, we have an entire row of zeros here. 00:21:04.800 |
And that's because the end character is never possibly going to be the first character of 00:21:09.080 |
a bigram, because we're always placing these end tokens at the end of the bigram. 00:21:14.360 |
Similarly, we have entire columns of zeros here, because the s character will never possibly 00:21:21.040 |
be the second element of a bigram, because we always start with s and we end with e, 00:21:27.840 |
So we have an entire column of zeros, an entire row of zeros. 00:21:31.960 |
And in this little two by two matrix here as well, the only entry that could possibly be non-zero 00:21:38.580 |
is the one for the end token following the start token, and that can only be non-zero if we have a word that has no letters. 00:21:43.240 |
So in that case, there are no letters in the word at all: 00:21:44.760 |
it's an empty word, and e directly follows s. 00:21:51.820 |
And not only that, but the s and the e are getting very crowded here. 00:21:55.600 |
I was using these brackets because there's convention in natural language processing 00:21:59.440 |
to use these kinds of brackets to denote special tokens, but we're going to use something else. 00:22:08.380 |
We're not actually going to have two special tokens. 00:22:13.140 |
So we're going to have an N array of 27 by 27 instead. 00:22:19.160 |
Instead of having two special tokens, we will just have one, and I will call it a dot. 00:22:30.600 |
Now one more thing that I would like to do is make this special token have position zero, 00:22:36.480 |
and I would like to offset all the other letters by one. 00:22:42.800 |
So we need a plus one here so that the first character, which is a, will start at one. 00:22:50.040 |
So s to i will now be a starts at one and dot is zero. 00:22:56.120 |
And i to s, of course, we're not changing, because i to s just creates a reverse mapping from whatever s to i is. 00:23:18.920 |
And then here we don't go up to 28, we go up to 27. 00:23:33.640 |
This top-left entry, dot followed by dot, is at zero because we don't have any empty words. 00:23:36.680 |
And this row here now is just very simply the counts for all the first letters. 00:23:43.680 |
So j starts a word, h starts a word, i starts a word, et cetera. 00:23:49.720 |
And then these are all the ending characters. 00:23:53.200 |
And in between we have the structure of what characters follow each other. 00:23:57.200 |
So this is the counts array of our entire dataset. 00:24:01.840 |
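After switching to the single '.' token, the preprocessing sketch becomes:

```python
# Single special token '.' at index 0; the letters a..z occupy indices 1..26.
chars = sorted(list(set(''.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1
```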
So this array actually has all the information necessary for us to sample from this bigram character-level language model. 00:24:09.960 |
And roughly speaking, what we're going to do is we're just going to start following 00:24:13.640 |
these probabilities and these counts, and we're going to start sampling from the model. 00:24:19.000 |
So in the beginning, of course, we start with the dot, the start token dot. 00:24:24.760 |
So to sample the first character of a name, we're looking at this row here. 00:24:30.720 |
So we see that we have the counts, and those counts are essentially telling us how likely 00:24:35.720 |
any one of these characters is to start a word. 00:24:39.740 |
So if we take this n and we grab the first row, we can do that by using just indexing 00:24:47.340 |
as zero and then using this notation, colon, for the rest of that row. 00:24:53.860 |
So n zero colon is indexing into the zeroth row and then it's grabbing all the columns. 00:25:02.140 |
And so this will give us a one dimensional array of the first row. 00:25:06.320 |
So 0, 4410, 1306, 1542, etc. 00:25:20.180 |
And the other way that you can do this is you don't actually have to give the colon at all; just indexing N at zero gives you the same row. 00:25:28.300 |
Now these are the counts, and what we'd like to do is basically sample from this distribution. 00:25:34.700 |
Since these are the raw counts, we actually have to convert this to probabilities. 00:25:43.100 |
So we'll take n of zero and we'll actually convert this to float first. 00:25:48.860 |
OK, so these integers are converted to float, floating point numbers. 00:25:54.300 |
And the reason we're creating floats is because we're about to normalize these counts. 00:25:59.060 |
So to create a probability distribution here, we want to divide. 00:26:04.020 |
We basically want to do p equals p divided by p dot sum. 00:26:13.900 |
So of course, because we divided by the sum, the sum of p now is one. 00:26:18.980 |
So this is a nice proper probability distribution. 00:26:22.220 |
And this is giving us the probability for any single character to be the first character 00:26:28.140 |
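In code, that first-row distribution is roughly:

```python
# Probability distribution over the first character of a name (row 0 of the count matrix).
p = N[0].float()
p = p / p.sum()
print(p.sum())   # tensor(1.) -- a proper probability distribution
```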
So now we can try to sample from this distribution. 00:26:30.900 |
To sample from these distributions, we're going to use torch.multinomial, which I've pulled up here in the documentation. 00:26:36.400 |
So torch.multinomial returns samples from the multinomial probability distribution, 00:26:43.460 |
which is a complicated way of saying you give me probabilities and I will give you integers, 00:26:48.220 |
which are sampled according to the probability distribution. 00:26:53.420 |
And to make everything deterministic, we're going to use a generator object in PyTorch. 00:27:01.160 |
So when you run this on your computer, you're going to get the exact same results that I'm getting here. 00:27:12.980 |
Here's the deterministic way of creating a torch generator object, seeding it with some 00:27:19.180 |
number that we can agree on, so that seeds a generator, gives us an object g. 00:27:24.940 |
And then we can pass that g to a function that creates random numbers here, torch.rand, 00:27:34.940 |
and it's using this generator object as a source of randomness. 00:27:46.700 |
This gives us numbers between zero and one that are random according to this generator object. 00:27:51.500 |
And whenever I run it again, I'm always going to get the same result because I keep using 00:27:55.500 |
the same generator object, which I'm seeding here. 00:27:59.000 |
And then if I divide to normalize, I'm going to get a nice probability distribution that sums to one. 00:28:07.780 |
And then we can use torch.multinomial to draw samples from it. 00:28:13.900 |
torch.multinomial will take the torch tensor of probability distributions. 00:28:21.220 |
Then we can ask for a number of samples, let's say 20. 00:28:25.140 |
Replacement equals true means that when we draw an element, we can draw it and then we 00:28:30.980 |
can put it back into the list of eligible indices to draw again. 00:28:36.100 |
And we have to specify replacement as true because by default, for some reason, it's false. 00:28:42.260 |
You know, it's just something to be careful with. 00:28:47.580 |
So we're going to always get deterministic results, the same results. 00:28:51.480 |
So if I run these two, we're going to get a bunch of samples from this distribution. 00:28:57.260 |
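A sketch of the little multinomial experiment described here (2147483647 is just an arbitrary seed so that runs are reproducible):

```python
g = torch.Generator().manual_seed(2147483647)   # deterministic source of randomness
p_demo = torch.rand(3, generator=g)
p_demo = p_demo / p_demo.sum()                  # normalize three random numbers into a distribution
samples = torch.multinomial(p_demo, num_samples=20, replacement=True, generator=g)
print(p_demo)
print(samples)
```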
Now you'll notice here that the probability for the first element in this tensor is 60%. 00:29:04.700 |
So in these 20 samples, we'd expect 60% of them to be zero. 00:29:14.500 |
And because the element at index two has only about a 10% probability, very few of these samples should be twos. 00:29:22.380 |
And indeed, we only have a small number of twos. 00:29:29.220 |
And the more we sample, the more these numbers should roughly have the distribution here. 00:29:36.080 |
So we should have lots of zeros, roughly half as many ones, and about three times fewer twos than ones. 00:29:51.940 |
So you see that we have very few twos, we have some ones, and most of them are zero. 00:29:59.060 |
For us here, we are interested in this row, we've created this p here, and now we can sample from it. 00:30:09.840 |
So if we use the same seed, and then we sample from this distribution, let's just get one 00:30:15.800 |
sample, then we see that the sample is, say, 13. 00:30:28.940 |
We again have to use .item to pop out that integer. 00:30:37.640 |
And of course, we can map the I2S of IX to figure out exactly which character we're sampling 00:30:46.720 |
We're sampling M. So we're saying that the first character is M in our generation. 00:30:53.380 |
And just looking at the row here, m was drawn, and we can see that m actually starts a large number of words. 00:31:04.880 |
A bit less than 10% of the words start with m, so this is actually a fairly likely character to draw. 00:31:15.440 |
So that would be the first character of our word. 00:31:17.240 |
And now we can continue to sample more characters, because now we know that M is already sampled. 00:31:24.940 |
So now to draw the next character, we will come back here, and we will look for the row 00:31:30.920 |
that starts with M. So you see M, and we have a row here. 00:31:36.900 |
So we see that m followed by the end token is 516, ma is this many, mb is this many, etc. 00:31:43.920 |
So these are the counts in that row, and that's how we are going to sample the next character. 00:31:48.840 |
So I think we are ready to actually just write out the loop, because I think you're starting to get a sense of how this works. 00:31:54.720 |
We always begin at index 0, because that's the start token. 00:32:02.560 |
And then, while true, we're going to grab the row corresponding to the index that we're currently on. 00:32:10.440 |
So that's the N array at ix, converted to float and normalized -- that is our p. 00:32:32.960 |
We pass in the generator object, which we initialize up here, and we draw a single sample from this distribution. 00:32:41.040 |
And then this is going to tell us what index is going to be next. 00:32:46.760 |
If the index sampled is 0, then that's the end token, so we break out of the loop; otherwise we keep going. 00:33:14.880 |
So, for example, we started with m, the next step was o, then r, and then the dot. 00:33:37.360 |
And instead of printing, we're going to append. 00:33:44.640 |
And then here, let's just print it at the end. 00:33:47.080 |
So let's just join up all the outs into a single string, and we print that -- which gives us mor followed by the dot. 00:33:51.040 |
Now, we're always getting the same result because of the generator. 00:33:55.360 |
So if we want to do this a few times, we can go for i in range 10. 00:34:00.840 |
We can sample 10 names, and we can just do that 10 times. 00:34:05.960 |
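The full sampling loop being described, as a sketch (it reuses N and itos from the earlier cells):

```python
g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0                       # start at the '.' token
    while True:
        p = N[ix].float()
        p = p / p.sum()          # normalize this row on the fly (made more efficient below)
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        if ix == 0:              # we sampled the end token
            break
        out.append(itos[ix])
    print(''.join(out))
```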
And these are the names that we're getting out. 00:34:14.480 |
I'll be honest with you, this doesn't look right. 00:34:16.720 |
So it took me a few minutes to convince myself that it actually is right. 00:34:20.680 |
The reason these samples are so terrible is that the bigram language model is really just that weak. 00:34:29.560 |
And you can see that they're kind of like — they're name-like a little bit, like 00:34:33.200 |
Ianu, O'Reilly, et cetera, but they're just totally messed up. 00:34:38.920 |
And I mean, the reason that this is so bad — like, we're generating H as a name, 00:34:43.240 |
but you have to think through it from the model's eyes. 00:34:46.680 |
It doesn't know that this H is the very first H. 00:34:49.520 |
All it knows is that H was previously, and now how likely is H the last character? 00:34:55.040 |
Well, it's somewhat likely, and so it just makes it last character. 00:34:59.240 |
It doesn't know that there were other things before it or there were not other things before 00:35:04.200 |
And so that's why it's generating all these, like, nonsense names. 00:35:08.480 |
Another way to do this is to convince yourself that this is actually doing something reasonable, 00:35:14.600 |
even though it's so terrible, is to look at these little p's here, which have 27 elements. 00:35:26.600 |
Instead of p having any structure whatsoever, how about if p was just torch.ones(27)? 00:35:37.440 |
By default, this is a float 32, so this is fine. 00:35:43.120 |
So what I'm doing here is using the uniform distribution, which will make everything equally likely. 00:35:55.080 |
So this is what you get from a model that is completely untrained, where everything is equally likely, 00:36:02.560 |
and then if we have a trained model, which is trained on just bigrams, this is what we get instead. 00:36:12.000 |
It's just that the bigram model is so terrible, and we have to do better. 00:36:16.720 |
Now next, I would like to fix an inefficiency that we have going on here, because what we're 00:36:20.760 |
doing here is we're always fetching a row of n from the counts matrix up ahead. 00:36:28.080 |
We're converting to float and we're dividing. 00:36:30.080 |
And we're doing this every single iteration of this loop. 00:36:32.960 |
And we just keep renormalizing these rows over and over again. 00:36:37.580 |
So what I'd like to do is actually prepare a matrix, capital P, that will just have the probabilities in it. 00:36:44.160 |
So in other words, it's going to be the same as the capital N matrix here of counts. 00:36:48.180 |
But every single row will have the row of probabilities that is normalized to 1, indicating 00:36:53.440 |
the probability distribution for the next character given the character before it, as defined by which row we are in. 00:37:01.760 |
So basically what we'd like to do is we'd like to just do it up front here. 00:37:05.280 |
And then we would like to just use that row here. 00:37:08.360 |
So here, we would like to just do p equals capital P at ix instead. 00:37:15.120 |
The other reason I want to do this is not just efficiency, but also that I would like us to practice these n-dimensional tensors. 00:37:21.360 |
And I'd like us to practice their manipulation, and especially something that's called broadcasting, 00:37:27.160 |
We're actually going to have to become very good at these tensor manipulations, because 00:37:31.000 |
if we're going to build out all the way to transformers, we're going to be doing some 00:37:34.040 |
pretty complicated array operations for efficiency. 00:37:37.600 |
And we need to really understand that and be very good at it. 00:37:42.460 |
So intuitively, what we want to do is we first want to grab the floating point copy of N. 00:37:51.160 |
And then we want to divide all the rows so that they sum to 1. 00:37:55.880 |
So we'd like to do something like this: P equals P divided by P dot sum. 00:38:00.840 |
But now we have to be careful, because P dot sum, called with no arguments, 00:38:10.920 |
sums up all of the counts of this entire matrix and gives us a single scalar, and that's not what we want. 00:38:23.600 |
We want to simultaneously and in parallel divide all the rows by their respective sums. 00:38:30.800 |
So what we have to do now is we have to go into documentation for torch dot sum. 00:38:36.100 |
And we can scroll down here to a definition that is relevant to us, which is where we 00:38:40.120 |
don't only provide an input array that we want to sum, but we also provide the dimension along which we want to sum. 00:38:46.720 |
And in particular, we want to sum up over rows. 00:38:52.560 |
Now one more argument that I want you to pay attention to here is keepdim, which is false by default. 00:38:58.080 |
If keepdim is true, then the output tensor is of the same size as the input, except of course 00:39:03.100 |
the dimension along which you summed, which will become just 1. 00:39:07.600 |
But if you pass in keepdim as false, then this dimension is squeezed out. 00:39:14.580 |
And so torch dot sum not only does the sum and collapses dimension to be of size 1, but 00:39:19.120 |
in addition, it does what's called a squeeze, where it squeezes out that dimension. 00:39:24.780 |
So basically what we want here is we instead want to do p dot sum of some axis. 00:39:31.040 |
And in particular, notice that p dot shape is 27 by 27. 00:39:35.820 |
So when we sum up across axis 0, then we would be taking the 0th dimension and we would be summing across it. 00:39:42.860 |
So when keepdim is true, then this thing will not only give us the counts along the 00:39:50.500 |
columns, but notice that basically the shape of this is 1 by 27. 00:39:57.700 |
And the reason we get a row vector here, again, is because we pass in 0th dimension. 00:40:06.060 |
And so basically we've done the sum this way, vertically, and arrived at just a single 1 by 27 vector of counts. 00:40:15.540 |
What happens when you take out keepdim is that we just get 27. 00:40:19.740 |
So it squeezes out that dimension, and we just get a one-dimensional vector of size 27. 00:40:28.780 |
Now we don't actually want the 1 by 27 row vector, because that gives us the counts, or the sums, along the columns. 00:40:39.660 |
We actually want to sum the other way, along dimension 1. 00:40:42.940 |
And you'll see that the shape of this is 27 by 1. 00:40:54.100 |
And that's because what's happened here is that we're summing horizontally, and this 27 by 27 matrix collapses into a 27 by 1 column vector. 00:41:03.740 |
And you'll notice, by the way, that the actual numbers of these counts are identical. 00:41:10.820 |
And that's because this special array of counts here comes from bigram statistics. 00:41:15.020 |
And actually, it just so happens by chance, or because of the way this array is constructed, 00:41:20.340 |
that the sums along the columns, or along the rows, horizontally or vertically, is identical. 00:41:26.380 |
But actually what we want to do in this case is we want to sum across the rows, horizontally. 00:41:32.420 |
So what we want here is p.sum of 1 with keepdim true, a 27 by 1 column vector. 00:41:39.740 |
And now what we want to do is we want to divide by that. 00:41:46.400 |
Is it possible to take what's a p.shape you see here, 27 by 27, is it possible to take 00:41:53.500 |
a 27 by 27 array and divide it by what is a 27 by 1 array? 00:42:04.140 |
And whether or not you can perform this operation is determined by what's called broadcasting 00:42:08.480 |
So if you just search broadcasting semantics in Torch, you'll notice that there's a special 00:42:13.100 |
definition for what's called broadcasting, which determines whether or not these two arrays can 00:42:20.380 |
be combined in a binary operation like division. 00:42:24.120 |
So the first condition is each tensor has at least one dimension, which is the case 00:42:28.780 |
And then when iterating over the dimension sizes, starting at the trailing dimension, 00:42:32.580 |
the dimension sizes must either be equal, one of them must be one, or one of them does not exist. 00:42:40.460 |
We need to align the two arrays and their shapes, which is very easy because both of 00:42:45.340 |
these shapes have two elements, so they're aligned. 00:42:48.300 |
Then we iterate over from the right and going to the left. 00:42:52.560 |
Each dimension must be either equal, one of them is a one, or one of them does not exist. 00:42:58.140 |
So in this case, they're not equal, but one of them is a one. 00:43:06.100 |
So all the dimensions are fine, and therefore this operation is broadcastable. 00:43:12.100 |
So that means that this operation is allowed. 00:43:14.740 |
And what happens when you divide a 27 by 27 array by a 27 by 1 array? 00:43:20.040 |
What it does is that it takes this dimension of one, and it stretches it out: it copies it to match the other shape. 00:43:29.220 |
So in our case, it takes this column vector, which is 27 by 1, and it copies it 27 times horizontally to make a 27 by 27 array. 00:43:41.680 |
And so it copies those counts, and then it does an element-wise division, which is what 00:43:47.900 |
we want, because we want to divide every entry in a row by that row's count. 00:43:55.000 |
So this actually, we expect, will normalize every single row. 00:43:59.960 |
And we can check that this is true by taking the first row, for example, and taking its sum. 00:44:05.860 |
We expect this to be one because it's now normalized, 00:44:10.580 |
and if we actually correctly normalized all the rows, we expect to get one for every other row as well. 00:44:23.160 |
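So the whole preparation collapses into a couple of lines, with broadcasting doing the per-row work:

```python
# Normalize every row of the count matrix in one shot using broadcasting.
P = N.float()
P = P / P.sum(1, keepdim=True)   # (27, 27) / (27, 1): each row is divided by its own sum
print(P[0].sum())                # tensor(1.) -- the first row is now a proper distribution
```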
So now I would like to scare you a little bit. 00:44:25.620 |
I basically encourage you very strongly to read through the broadcasting 00:44:29.320 |
semantics, and I encourage you to treat this with respect. 00:44:32.960 |
And it's not something to play fast and loose with. 00:44:35.340 |
It's something to really respect, really understand, and look up maybe some tutorials for broadcasting 00:44:39.580 |
and practice it and be careful with it because you can very quickly run into bugs. 00:44:47.380 |
You see how here we have p.sum of 1 with keepdim as true. 00:44:53.200 |
Let me take out this line just so we have the n, and then we can see the counts. 00:44:58.640 |
We can see that this is all the counts across all the rows, and it's a 27 by 1 column vector, as expected. 00:45:07.280 |
Now suppose that I tried to do the following, but I erase the keepdim as true here. 00:45:17.360 |
And remember, according to the documentation, it gets rid of this dimension of one -- it squeezes it out. 00:45:23.080 |
So basically we just get all the same counts, the same result, except the shape of it is 00:45:27.520 |
not 27 by 1, it's just 27 -- the 1 disappears. 00:45:34.340 |
So you'd think that this division would still work. 00:45:40.280 |
First of all, can we even write this and is it even expected to run? 00:45:46.160 |
Let's determine if this result is broadcastable. 00:46:00.460 |
So now rules of broadcasting number one, align all the dimensions on the right, done. 00:46:06.440 |
Now iteration over all the dimensions starting from the right, going to the left. 00:46:13.200 |
The dimensions must be equal, one of them must be one, or one of them does not exist. 00:46:20.160 |
So internally what broadcasting will do is it will create a one here, 00:46:24.560 |
and then we see that one of them is a one, so this will get copied, and this will run -- it broadcasts. 00:46:31.960 |
Okay, so you'd expect this to work, because this broadcasts and we can do the division. 00:46:43.880 |
Now if I run this, you'd expect it to work, but it doesn't. 00:46:49.280 |
You silently get a wrong result, because this is actually a bug. 00:47:03.040 |
In both cases, we are computing the correct counts: 00:47:06.440 |
we are summing up across the rows, but only keepdim is saving us and making it work correctly. 00:47:11.720 |
So in this case, I'd like to encourage you to potentially like pause this video at this 00:47:15.320 |
point and try to think about why this is buggy and why the keep dim was necessary here. 00:47:21.320 |
Okay, so the reason for this is what I was trying to hint at earlier when I was sort of 00:47:27.360 |
giving you a bit of a hint on how this works. 00:47:29.720 |
This vector of 27, internally inside the broadcasting, becomes a 1 by 27, and a 1 by 27 is a row vector. 00:47:39.840 |
And now we are dividing a 27 by 27 by a 1 by 27, and torch will replicate this dimension. 00:47:46.100 |
So basically it will take this row vector and it will copy it vertically 27 times, 00:47:56.380 |
so the 27 by 27 aligns exactly, and it element-wise divides. 00:48:00.520 |
And so basically what's happening here is we're actually normalizing the columns instead of normalizing the rows. 00:48:09.640 |
So you can check that what's happening here is that P at zero, which is the first row, no longer sums to one; 00:48:18.800 |
it is the first column, as an example, that sums to one. 00:48:23.840 |
So to summarize, where does the issue come from? 00:48:26.640 |
The issue comes from the silent adding of a dimension here, because in broadcasting 00:48:30.840 |
rules, you align on the right and go from right to left. 00:48:33.960 |
And if dimension doesn't exist, you create it. 00:48:39.520 |
We did the counts across the rows and we got the counts on the right here as a column vector. 00:48:45.800 |
But because keepdim was not true, this dimension was discarded, and now we just have a vector of size 27. 00:48:51.680 |
And because of broadcasting the way it works, this vector of 27 suddenly becomes a row vector. 00:48:57.240 |
And then this row vector gets replicated vertically. 00:49:00.120 |
And so at every single point we are dividing by the count in the opposite direction. 00:49:11.640 |
This needs to be keepdim equals true in this case. 00:49:14.380 |
So then we have that P at zero is normalized. 00:49:20.120 |
And conversely, the first column you'd expect to potentially not be normalized. 00:49:29.600 |
And hopefully this helps to scare you, that you should have respect for broadcasting, 00:49:34.400 |
be careful, check your work, and understand how it works under the hood and make sure 00:49:39.200 |
that it's broadcasting in the direction that you like. 00:49:41.160 |
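A quick way to see the difference is to just look at the shapes involved:

```python
print(P.sum(1, keepdim=True).shape)   # torch.Size([27, 1]): a column vector, so rows get normalized
print(P.sum(1).shape)                 # torch.Size([27]):    broadcast as (1, 27), so columns get normalized instead
```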
Otherwise, you're going to introduce very subtle bugs, very hard to find bugs. 00:49:48.440 |
One more note: we don't actually want to write P equals P divided by the sum here, because that creates a completely new tensor that 00:49:53.040 |
we then store into P. We prefer to use in-place operations if possible. 00:49:58.040 |
So P divide-equals the sum would be an in-place operation, and it has the potential to be faster. 00:50:08.280 |
And let's also just do fewer, just so I'm not wasting space. 00:50:14.280 |
Okay, so we're actually in a pretty good spot now. 00:50:17.160 |
We trained a bigram language model, and we trained it really just by counting how frequently 00:50:22.900 |
any pairing occurs and then normalizing so that we get a nice probability distribution. 00:50:28.120 |
So really these elements of this array P are really the parameters of our bigram language 00:50:33.080 |
model, giving us and summarizing the statistics of these bigrams. 00:50:37.140 |
So we trained the model, and then we know how to sample from the model. 00:50:40.320 |
We just iteratively sample the next character and feed it in each time and get the next 00:50:46.680 |
Now, what I'd like to do is I'd like to somehow evaluate the quality of this model. 00:50:51.320 |
We'd like to somehow summarize the quality of this model into a single number. 00:50:55.400 |
How good is it at predicting the training set? 00:50:59.280 |
And as an example, so in the training set, we can evaluate now the training loss. 00:51:04.480 |
And this training loss is telling us about sort of the quality of this model in a single 00:51:12.120 |
So let's try to think through the quality of the model and how we would evaluate it. 00:51:16.160 |
Basically, what we're going to do is copy-paste this code that we previously used for iterating over the bigrams. 00:51:26.160 |
We're going to use f-strings, and I'm going to print character one followed by character two. 00:51:32.480 |
And then I don't want to do it for all the words, just do the first three words. 00:51:36.240 |
So here we have Emma, Olivia, and Ava bigrams. 00:51:40.360 |
Now what we'd like to do is basically look at the probability that the model assigns to every one of these bigrams. 00:51:48.400 |
So in other words, we can look at the probability, which is summarized in the matrix P at ix1, ix2, 00:51:54.160 |
and then we can print it here as the probability. 00:52:00.920 |
And because these printed probabilities have way too many digits, let me use colon .4f in the f-string to truncate them. 00:52:10.400 |
We're looking at the probabilities that the model assigns to every one of these bigrams 00:52:15.440 |
And so we can see some of them are 4%, 3%, et cetera. 00:52:18.880 |
Just to have a measuring stick in our mind, by the way, we have 27 possible characters 00:52:24.960 |
And if everything was equally likely, then you'd expect all these probabilities to be about 4%, that is, 1 out of 27. 00:52:32.640 |
So anything above 4% means that we've learned something useful from these bigram statistics. 00:52:37.240 |
And you see that roughly some of these are 4%, but some of them are as high as 40%, 35%, and so on. 00:52:44.360 |
So you see that the model actually assigns a pretty high probability to whatever character actually comes next in the training set. 00:52:50.960 |
Especially if you have a very good model, you'd expect that these probabilities should 00:52:53.840 |
be near 1, because that means that your model is correctly predicting what's going to come 00:52:58.120 |
next, especially on the training set where you trained your model. 00:53:03.000 |
So now we'd like to think about how can we summarize these probabilities into a single 00:53:07.900 |
number that measures the quality of this model? 00:53:11.920 |
Now when you look at the literature into maximum likelihood estimation and statistical modeling 00:53:16.160 |
and so on, you'll see that what's typically used here is something called the likelihood. 00:53:21.720 |
And the likelihood is the product of all of these probabilities. 00:53:26.120 |
And so the product of all of these probabilities is the likelihood. 00:53:29.460 |
And it's really telling us about the probability of the entire dataset assigned by the model that we've trained. 00:53:39.620 |
So the product of these should be as high as possible when you are training the model and when the model is good. 00:53:46.520 |
Your product of these probabilities should be very high. 00:53:50.560 |
Now, the product of these probabilities is an unwieldy thing to work with, because you can see that all of them are between zero and one. 00:53:56.480 |
So the product of these probabilities will be a very tiny number. 00:54:01.100 |
So for convenience, what people work with usually is not the likelihood, but they work 00:54:11.040 |
To get the log likelihood, we just have to take the log of the probability. 00:54:15.240 |
And so for the log of the probability, here I have plotted the log of x for x from 0 to 1. 00:54:19.900 |
The log is, as you see here, a monotonic transformation of the probability, where if you pass in 1, you get 0. 00:54:29.020 |
So probability 1 gets you a log probability of 0. 00:54:32.440 |
And then as you go to lower and lower probabilities, the log will grow more and more negative, all the way down to negative infinity at 0. 00:54:42.080 |
So here we have a log prob, which is really just torch.log of the probability. 00:54:47.080 |
Let's print it out to get a sense of what that looks like. 00:54:56.800 |
So as you can see, when we plug in probabilities that are relatively high, some of our higher numbers, we get log probabilities close to zero. 00:55:03.660 |
And then if we plug in very bad probabilities, we get more and more negative numbers. 00:55:09.760 |
And the reason we work with this is, to a large extent, convenience, right? 00:55:15.560 |
Because we have mathematically that if you have some product, a times b times c, of all these probabilities -- 00:55:21.360 |
the likelihood is the product of all of these probabilities -- 00:55:25.620 |
then the log of that is just log of a plus log of b plus log of c. 00:55:35.260 |
If you remember your logs from your high school or undergrad and so on. 00:55:39.960 |
So we have that basically, the likelihood is the product of probabilities. 00:55:43.420 |
The log likelihood is just the sum of the logs of the individual probabilities. 00:55:54.860 |
And then log likelihood here, we can just accumulate simply. 00:56:21.420 |
Now, how high can the log likelihood get? 00:56:30.020 |
So when all the probabilities are 1, log likelihood will be 0. 00:56:33.020 |
And then when all the probabilities are lower, this will grow more and more negative. 00:56:37.580 |
Now we don't actually like this because what we'd like is a loss function. 00:56:41.340 |
And a loss function has the semantics that low is good. 00:56:50.540 |
And that's what gives us something called the negative log likelihood. 00:56:56.300 |
Negative log likelihood is just negative of the log likelihood. 00:57:04.060 |
These are F strings, by the way, if you'd like to look this up. 00:57:09.540 |
So negative log likelihood now is just negative of it. 00:57:12.240 |
And so the negative log likelihood is a very nice loss function, because the lowest it can get is zero, 00:57:19.980 |
and the higher it is, the worse off the predictions are that you're making. 00:57:24.900 |
And then one more modification to this that sometimes people do is that for convenience, 00:57:29.300 |
they actually like to normalize by -- they like to make it an average instead of a sum. 00:57:34.620 |
And so here, let's just keep some counts as well. 00:57:43.020 |
And then here, we can have sort of like a normalized log likelihood. 00:57:50.660 |
If we just normalize it by the count, then we will sort of get the average log likelihood. 00:57:55.940 |
So this would be usually our loss function here. 00:58:02.480 |
So our loss function for the training set assigned by the model is 2.4. 00:58:13.740 |
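The evaluation loop being described, as a sketch (reusing P and stoi from above; note the f-string with '=' that prints both the expression and its value):

```python
# Average negative log likelihood of the training data under the bigram model.
log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1, ix2 = stoi[ch1], stoi[ch2]
        prob = P[ix1, ix2]
        log_likelihood += torch.log(prob)
        n += 1

nll = -log_likelihood
print(f'{nll/n=}')   # average negative log likelihood: lower is better
```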
And the job of our training is to find the parameters that minimize the negative log likelihood loss. 00:58:25.140 |
Okay, so to summarize, I actually wrote it out here. 00:58:28.240 |
So our goal is to maximize the likelihood, which is the product of all the probabilities the model assigns to the data. 00:58:35.420 |
And we want to maximize this likelihood with respect to the model parameters. 00:58:39.420 |
And in our case, the model parameters here are defined in the table. 00:58:43.660 |
These numbers, the probabilities, are the model parameters, sort of, in our bigram language model. 00:58:49.580 |
But you have to keep in mind that here we are storing everything in a table format, 00:58:54.820 |
But what's coming up as a brief preview is that these numbers will not be kept explicitly, 00:59:00.340 |
but these numbers will be calculated by a neural network. 00:59:04.340 |
And we want to change and tune the parameters of these neural networks. 00:59:08.260 |
We want to change these parameters to maximize the likelihood, the product of the probabilities. 00:59:13.500 |
Now maximizing the likelihood is equivalent to maximizing the log likelihood, because log is a monotonic function. 00:59:22.260 |
You can look at the log as just a monotonic transformation that doesn't change where the maximum is. 00:59:29.700 |
And so the optimization problem here and here are actually equivalent, because this is just a monotonic rescaling. 00:59:37.240 |
And so these are two identical optimization problems. 00:59:42.460 |
Maximizing the log likelihood is equivalent to minimizing the negative log likelihood. 00:59:46.420 |
And then in practice, people actually minimize the average negative log likelihood to get a nicer, normalized number. 00:59:53.180 |
And then this summarizes the quality of your model. 00:59:56.460 |
And we'd like to minimize it and make it as small as possible. 01:00:02.600 |
And the lower it is, the better off your model is, because it's assigning high probabilities to your data. 01:00:09.600 |
Now let's estimate this loss over the entire training set, just to make sure that we get a similar number. 01:00:24.700 |
Now what I'd like to show you is that you can actually evaluate the probability for 01:00:28.520 |
So for example, if we just test a single word, andrej, and bring back the print statement, 01:00:36.040 |
then you see that andrej is actually kind of an unlikely word. 01:00:39.000 |
Like, on average, each of its bigrams costs about 3 in negative log probability. 01:00:44.760 |
And roughly that's because EJ apparently is very uncommon as an example. 01:00:52.440 |
When I take andrej and I append q and I test the probability of andrejq, we actually get infinity for the loss. 01:01:03.240 |
And that's because jq has a 0% probability according to our model. 01:01:07.840 |
So the log of zero will be negative infinity, and we get an infinite loss. 01:01:15.980 |
That's not great, because we plugged in a string that could be like a somewhat reasonable name. 01:01:19.400 |
But basically what this is saying is that this model is exactly 0% likely to ever predict this name. 01:01:30.000 |
And really the reason for that is that j is followed by q zero times in our dataset, so the count and therefore the probability is zero. 01:01:42.440 |
So it's actually kind of gross and people don't like this too much. 01:01:45.420 |
To fix this, there's a very simple fix that people like to do to sort of smooth out the model a little bit. 01:01:52.340 |
And roughly what's happening is that we will add some fake counts. 01:01:56.440 |
So imagine adding a count of one to everything. 01:02:01.160 |
So we add a count of one like this, and then we recalculate the probabilities. 01:02:10.260 |
You can add five and that will give you a smoother model. 01:02:13.140 |
And the more you add here, the more uniform model you're going to have. 01:02:17.940 |
And the less you add, the more peaked model you're going to have, of course. 01:02:25.940 |
And that will ensure that there will be no zeros in our probability matrix P. 01:02:31.120 |
And so this will of course change the generations a little bit. 01:02:33.800 |
In this case, it didn't, but in principle it could. 01:02:36.780 |
But what that's going to do now is that nothing will be infinity unlikely. 01:02:41.340 |
So now our model will predict some other probability. 01:02:44.940 |
And we see that JQ now has a very small probability. 01:02:47.860 |
So the model still finds it very surprising that this was a word or a bigram, but we don't get negative infinity anymore. 01:02:53.020 |
So it's kind of a nice fix that people like to apply sometimes, and it's called model smoothing. 01:02:57.340 |
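As a sketch, smoothing is a one-character change to the probability matrix:

```python
# Model smoothing: add a fake count of 1 to every bigram so no probability is exactly zero.
P = (N + 1).float()
P /= P.sum(1, keepdim=True)   # in-place row normalization, as before
```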
Okay, so we've now trained a respectable bigram character level language model. 01:03:01.660 |
And we saw that we both sort of trained the model by looking at the counts of all the 01:03:06.780 |
bigrams and normalizing the rows to get probability distributions. 01:03:11.660 |
We saw that we can also then use those parameters of this model to perform sampling of new words. 01:03:19.700 |
So we sample new names according to those distributions. 01:03:22.420 |
And we also saw that we can evaluate the quality of this model. 01:03:25.640 |
And the quality of this model is summarized in a single number, which is the negative 01:03:30.140 |
And the lower this number is, the better the model is, because it is giving high probabilities 01:03:35.620 |
to the actual next characters in all the bigrams in our training set. 01:03:40.340 |
So that's all well and good, but we've arrived at this model explicitly by doing something that felt sensible. 01:03:46.260 |
We were just performing counts, and then we were normalizing those counts. 01:03:51.180 |
Now what I would like to do is I would like to take an alternative approach. 01:03:54.260 |
We will end up in a very, very similar position, but the approach will look very different, 01:03:58.460 |
because I would like to cast the problem of bigram character level language modeling into 01:04:04.380 |
And in the neural network framework, we're going to approach things slightly differently, but again end up in a very similar position. 01:04:12.100 |
Now our neural network is going to be still a bigram character level language model. 01:04:17.420 |
So it receives a single character as an input. 01:04:20.540 |
Then there's a neural network with some weights, or some parameters, w. 01:04:24.220 |
And it's going to output the probability distribution over the next character in a sequence. 01:04:29.300 |
And it's going to make guesses as to what is likely to follow this character that was 01:04:36.140 |
And then in addition to that, we're going to be able to evaluate any setting of the 01:04:39.900 |
parameters of the neural net, because we have the loss function, the negative log likelihood. 01:04:45.160 |
So we're going to take a look at these probability distributions, and we're going to use the 01:04:48.260 |
labels, which are basically just the identity of the next character in that bigram, the second character. 01:04:54.880 |
So knowing what second character actually comes next in the bigram allows us to then 01:04:59.140 |
look at how high of a probability the model assigns to that character. 01:05:04.100 |
And then we of course want the probability to be very high. 01:05:07.240 |
And that is another way of saying that the loss is low. 01:05:10.960 |
So we're going to use gradient based optimization then to tune the parameters of this network, 01:05:15.660 |
because we have the loss function and we're going to minimize it. 01:05:18.600 |
So we're going to tune the weights so that the neural net is correctly predicting the next character. 01:05:25.700 |
The first thing I want to do is I want to compile the training set of this neural network. 01:05:29.660 |
So create the training set of all the bigrams. 01:05:37.940 |
And here I'm going to copy/paste this code, because this code iterates over all the bigrams. 01:05:47.580 |
So here we start with the words, we iterate over all the bigrams, and previously, as you recall, we did the counts, but we're not going to do that here. 01:05:57.060 |
Now this training set will be made up of two lists. 01:06:02.220 |
We have the inputs and the targets, the labels. 01:06:13.340 |
And so we're given the first character of the bigram, and then we're trying to predict the second one. 01:06:19.520 |
So here we'll do xs.append(ix1) and ys.append(ix2). 01:06:27.840 |
And then here, we actually don't want lists of integers. 01:06:33.760 |
So xs is torch.tensor of xs, and ys is torch.tensor of ys. 01:06:41.640 |
And then we don't actually want to take all the words just yet, because I want everything to be simple and manageable. 01:06:47.120 |
So let's just do the first word, which is Emma. 01:06:51.400 |
And then it's clear what these xs and ys would be. 01:06:55.600 |
Here let me print ch1, ch2, just so you see what's going on here. 01:07:01.760 |
So the bigrams of these characters are: .e, em, mm, ma, a. 01:07:08.960 |
So this single word, as I mentioned, has one, two, three, four, five examples for our neural network. 01:07:19.480 |
When the input to the neural network is integer 0, the desired label is integer 5, which corresponds to e. 01:07:27.920 |
When the input to the neural network is 5, we want its weights to be arranged so that 13 (m) gets a high probability. 01:07:35.280 |
When 13 is put in, we want 13 to have a high probability. 01:07:39.360 |
When 13 is put in, we also want 1 to have a high probability. 01:07:43.840 |
When 1 is input, we want 0 to have a very high probability. 01:07:47.600 |
So there are five separate input examples to a neural net in this dataset. 01:07:55.120 |
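A minimal sketch of building this tiny training set for just the first word; the stoi mapping is reconstructed inline here so the example is self-contained:

```python
import torch

# Character-to-index mapping: '.' is 0, 'a'..'z' are 1..26, as set up earlier.
stoi = {s: i + 1 for i, s in enumerate('abcdefghijklmnopqrstuvwxyz')}
stoi['.'] = 0

xs, ys = [], []
for w in ['emma']:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])   # input character index
        ys.append(stoi[ch2])   # target (next) character index

xs = torch.tensor(xs)   # tensor([ 0,  5, 13, 13,  1])
ys = torch.tensor(ys)   # tensor([ 5, 13, 13,  1,  0])
```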
I wanted to add a tangent of a note of caution: be careful with a lot of the APIs of some of these frameworks. 01:08:01.600 |
You saw me silently use torch.tensor with a lowercase t, and the output looked right. 01:08:08.040 |
But you should be aware that there's actually two ways of constructing a tensor. 01:08:11.960 |
There's torch.tensor with a lowercase t, and there's also a torch.Tensor class with a capital T, which you can also use. 01:08:20.240 |
If you use torch.Tensor with the capital T, you get xs and ys as well. 01:08:29.080 |
There are threads on what is the difference between these two. 01:08:31.840 |
And unfortunately, the docs are just not clear on the difference. 01:08:36.200 |
And when you look at the docs of the lowercase tensor, it just says it constructs a tensor with no autograd history, which doesn't clarify much. 01:08:46.840 |
So the actual difference, as far as I can tell, is explained eventually in a random forum thread. 01:08:51.800 |
And really it comes down to, I believe, the following: 01:08:58.840 |
torch.tensor with a lowercase t infers the dtype, the data type, automatically, while torch.Tensor with a capital T just returns a float tensor. 01:09:04.480 |
I would recommend sticking to the lowercase torch.tensor. 01:09:07.960 |
So indeed, we see that when I construct this with the capital T Tensor, the data type of xs is float32. 01:09:18.360 |
But with the lowercase torch.tensor, you see that xs.dtype is now an integer, int64. 01:09:26.920 |
So it's advised that you use the lowercase t, and you can read more about it if you like in that thread. 01:09:34.440 |
But basically, I'm pointing out some of these things because I want to caution you, and 01:09:39.480 |
I want you to get used to reading a lot of documentation and reading through a lot of examples and discussions. 01:09:47.320 |
And some of this stuff is unfortunately not easy and not very well documented, so you have to work through it. 01:09:52.880 |
What we want here is integers, because that's what makes sense. 01:09:58.280 |
And so lowercase tensor is what we are using. 01:10:01.240 |
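A quick sketch of that difference, if you want to see it for yourself:

```python
import torch

# torch.tensor (lowercase) infers the dtype from the data;
# torch.Tensor (capital T) always gives you float32.
print(torch.tensor([0, 5, 13, 13, 1]).dtype)   # torch.int64
print(torch.Tensor([0, 5, 13, 13, 1]).dtype)   # torch.float32
```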
Okay, now we want to think through how we're going to feed in these examples into a neural network. 01:10:06.160 |
Now, it's not quite as straightforward as plugging it in, because these examples right now are just integers. 01:10:14.800 |
It gives us the index of the character, and you can't just plug an integer index into a neural net. 01:10:20.180 |
These neural nets are sort of made up of these neurons, and these neurons have weights. 01:10:27.060 |
And as you saw in micrograd, these weights act multiplicatively on the inputs, wx + b, 01:10:34.440 |
And so it doesn't really make sense to make an input neuron take on integer values that 01:10:37.900 |
you feed in and then multiply with weights. 01:10:41.880 |
So instead, a common way of encoding integers is what's called one-hot encoding. 01:10:47.160 |
In one-hot encoding, we take an integer like 13, and we create a vector that is all zeros, 01:10:53.760 |
except for the 13th dimension, which we turn to a 1. 01:10:57.640 |
And then that vector can feed into a neural net. 01:11:01.280 |
Now conveniently, PyTorch actually has something called the one_hot function inside torch.nn.functional; it takes a tensor of integers. 01:11:19.440 |
And it also takes a number of classes, which is how large you want your vector to be. 01:11:27.880 |
So here, let's import torch.nn.functional as F. 01:11:34.300 |
And then let's do F.one_hot, and we feed in the integers that we want to encode. 01:11:40.160 |
So we can actually feed in the entire tensor of xs, and we also want to pass in num_classes=27. 01:11:49.540 |
Otherwise, one_hot would have guessed the number of classes from the largest index, 13, and would give us vectors of the wrong size. 01:12:02.240 |
And then we see that xenc.shape is 5 by 27. 01:12:07.360 |
And we can also visualize it with plt.imshow(xenc) to make it a little bit more clear. 01:12:15.640 |
So we see that we've encoded all the five examples into vectors. 01:12:22.860 |
And each row here is now an example into a neural net. 01:12:26.480 |
And we see that the appropriate bit is turned on as a 1, and everything else is 0. 01:12:32.080 |
So here, for example, the 0th bit is turned on, the 5th bit is turned on, the 13th bits 01:12:39.080 |
are turned on for both of these examples, and the first bit here is turned on. 01:12:44.800 |
So that's how we can encode integers into vectors. 01:12:49.520 |
And then these vectors can feed into neural nets. 01:12:52.120 |
One more issue to be careful with here, by the way, is let's look at the data type of xenc. 01:12:57.800 |
We always want to be careful with data types. 01:12:58.800 |
What would you expect xenc's data type to be? 01:13:03.000 |
When we're plugging numbers into neural nets, we don't want them to be integers. 01:13:06.340 |
We want them to be floating-point numbers that can take on various values. 01:13:10.620 |
But the Dtype here is actually a 64-bit integer. 01:13:14.480 |
And the reason for that, I suspect, is that one_hot received a 64-bit integer here, and it kept the same data type on the output. 01:13:22.120 |
And when you look at the signature of one_hot, it doesn't even take a dtype argument for the desired data type of the output tensor. 01:13:28.640 |
So we can't do what we can do in a lot of other torch functions, where we'd pass something like dtype=torch.float32. 01:13:38.120 |
So instead, we're going to want to cast this to float like this, 01:13:45.200 |
so that everything looks the same, but the dtype is now float32. 01:13:56.240 |
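Putting the one-hot encoding and the float cast together, roughly:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])           # the five input indices for "emma"

# One-hot encode, passing num_classes explicitly so the vectors are 27 wide
# even if this batch happens not to contain the largest possible index.
xenc = F.one_hot(xs, num_classes=27).float()   # cast to float so it can feed a neural net
print(xenc.shape)                               # torch.Size([5, 27])
print(xenc.dtype)                               # torch.float32
```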
This neuron will look at these input vectors. 01:14:00.320 |
And as you remember from micrograd, these neurons basically perform a very simple function, wx + b, where wx is a dot product. 01:14:12.260 |
Let's first define the weights of this neuron, basically. 01:14:14.640 |
What are the initial weights at initialization for this neuron? 01:14:21.840 |
torch.randn fills a tensor with random numbers drawn from a normal distribution. 01:14:29.480 |
And a normal distribution has a probability density function like this. 01:14:34.600 |
And so most of the numbers drawn from this distribution will be around zero, but some 01:14:39.560 |
of them will be as high as almost three and so on. 01:14:42.120 |
And very few numbers will be above three in magnitude. 01:15:03.360 |
And these weights are then multiplied by the inputs. 01:15:08.840 |
So now to perform this multiplication, we can take xenc and we can multiply it by W, using the @ operator. 01:15:15.080 |
@ is the matrix multiplication operator in PyTorch. 01:15:26.080 |
We took xenc, which is 5 by 27, and we multiplied it by a 27 by 1 matrix. 01:15:33.840 |
And in matrix multiplication, you see that the output becomes 5 by 1, because the inner 27 dimensions multiply and sum out. 01:15:45.100 |
So basically what we're seeing here out of this operation is the five activations of this one neuron on these five inputs. 01:16:00.680 |
We didn't feed in just a single input to the single neuron. 01:16:03.600 |
We fed in simultaneously all the five inputs into the same neuron. 01:16:08.320 |
And in parallel, PyTorch has evaluated the wx plus b. 01:16:15.600 |
It has evaluated w times x for all of them independently. 01:16:20.800 |
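A small sketch of this single-neuron step (W here is freshly random, so the exact numbers will differ from the video):

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
xenc = F.one_hot(xs, num_classes=27).float()   # (5, 27)

# One "neuron": a 27x1 weight column, initialized from a standard normal.
W = torch.randn((27, 1))

# Matrix multiplication evaluates w . x for all five inputs in parallel.
out = xenc @ W
print(out.shape)   # torch.Size([5, 1]) -- one activation per input example
```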
Now instead of a single neuron, though, I would like to have 27 neurons. 01:16:24.280 |
And I'll show you in a second why I want 27 neurons. 01:16:27.820 |
So instead of having just a 1 here, which indicates the presence of a single neuron, let's make W have 27 columns, one per neuron. 01:16:34.940 |
And then when W is 27 by 27, this will in parallel evaluate all the 27 neurons on all 01:16:43.720 |
the five inputs, giving us a much, much bigger result. 01:16:49.740 |
So now what we've done is 5 by 27 multiplied 27 by 27. 01:16:57.960 |
So we can see that the shape of this is 5 by 27. 01:17:07.400 |
It's telling us for every one of 27 neurons that we created, what is the firing rate of 01:17:15.160 |
those neurons on every one of those five examples? 01:17:19.800 |
So the element (3, 13), for example, is giving us the firing rate of the 13th neuron looking at the third input. 01:17:32.120 |
And the way this was achieved is by a dot product between the third input row and the 13th column of W. 01:17:45.800 |
So using matrix multiplication, we can very efficiently evaluate the dot product between 01:17:51.980 |
lots of input examples in a batch and lots of neurons, where all of those neurons have their own weights. 01:18:01.320 |
And in matrix multiplication, we're just doing those dot products in parallel. 01:18:05.720 |
Just to show you that this is the case, we can take xenc and we can take its third row. 01:18:10.600 |
And we can take W and take its 13th column. 01:18:17.600 |
And then we can element-wise multiply xenc[3] with W[:, 13] and sum that up, and we get the same number. 01:18:35.520 |
So you see that this is just being done efficiently by the matrix multiplication operation for 01:18:40.680 |
all the input examples and for all the output neurons of this first layer. 01:18:45.760 |
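A sketch of that check, with a random W standing in for the weights in the notebook:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
xenc = F.one_hot(xs, num_classes=27).float()   # (5, 27)
W = torch.randn((27, 27))                      # 27 neurons, each with 27 weights

logits = xenc @ W                              # (5, 27): every neuron on every example

# The (3, 13) entry is the dot product of the 3rd input row with the 13th column of W.
manual = (xenc[3] * W[:, 13]).sum()
print(torch.allclose(logits[3, 13], manual))   # True
```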
Okay, so we fed our 27-dimensional inputs into a first layer of a neural net that has 27 neurons. 01:18:53.040 |
So we have 27 inputs and now we have 27 neurons. 01:18:59.860 |
They don't have a bias and they don't have a nonlinearity like tanh. 01:19:03.360 |
We're going to leave them to be a linear layer. 01:19:06.740 |
In addition to that, we're not going to have any other layers. 01:19:10.260 |
It's just going to be the dumbest, smallest, simplest neural net, which is just a single linear layer. 01:19:16.680 |
And now I'd like to explain what I want those 27 outputs to be. 01:19:20.680 |
Intuitively, what we're trying to produce here for every single input example is we're 01:19:24.880 |
trying to produce some kind of a probability distribution for the next character in a sequence. 01:19:31.760 |
But we have to come up with precise semantics for exactly how we're going to interpret these 27 numbers. 01:19:39.880 |
Now intuitively, you see here that some of these numbers are negative and some of them are positive. 01:19:45.760 |
And that's because these are coming out of a neural net layer initialized with these random, normally distributed weights. 01:19:54.560 |
But what we want is we want something like we had here. 01:19:57.460 |
Each row there told us the counts, and then we normalized the counts to get probabilities. 01:20:03.720 |
And we want something similar to come out of a neural net. 01:20:06.700 |
But what we just have right now is just some negative and positive numbers. 01:20:10.720 |
Now we want those numbers to somehow represent the probabilities for the next character. 01:20:15.560 |
But you see that probabilities, they have a special structure. 01:20:20.120 |
They're positive numbers and they sum to one. 01:20:23.060 |
And so that doesn't just come out of a neural net. 01:20:25.980 |
And they can't be counts either, because counts are positive integers, whereas these outputs are negative, positive, and fractional. 01:20:32.960 |
So counts are also not really a good thing to output from a neural net. 01:20:36.920 |
So instead, what the neural net is going to output and how we are going to interpret the 01:20:40.880 |
27 numbers is that these 27 numbers are giving us log counts, basically. 01:20:50.660 |
So instead of giving us counts directly, like in this table, they're giving us log counts. 01:20:56.300 |
And to get the counts, we're going to take the log counts and we're going to exponentiate them. 01:21:07.360 |
The exponential function takes numbers that are negative or positive. 01:21:13.120 |
And then if you plug in negative numbers, you're going to get e to the x, which is always below 1 but above 0. 01:21:23.680 |
And if you plug in numbers greater than 0, you're getting numbers greater than 1, all the way growing to infinity. 01:21:33.480 |
So basically, we're going to take these numbers here and instead of them being positive and 01:21:44.640 |
negative and all over the place, we're going to interpret them as log counts. 01:21:48.980 |
And then we're going to element-wise exponentiate these numbers. 01:21:53.480 |
Exponentiating them now gives us something like this. 01:21:56.640 |
And you see that these numbers now, because they went through an exponent, all the negative 01:21:59.940 |
numbers turned into numbers below 1, like 0.338. 01:22:04.460 |
And all the originally positive numbers turned into even more positive numbers, greater than 1. 01:22:11.000 |
So like for example, 7 is some positive number over here. 01:22:21.300 |
But the exponentiated outputs here basically give us something that we can use and interpret as counts. 01:22:31.140 |
So you see these counts here, 112, 751, 1, etc. 01:22:36.560 |
The neural net is kind of now predicting counts. 01:22:47.040 |
And they can now take on various values depending on the settings of W. 01:22:56.400 |
We're going to interpret these to be the log counts. 01:23:01.420 |
Another word for these, that is often used, is "logits". 01:23:13.680 |
And this is equivalent to the n matrix, sort of the n array that we used previously. 01:23:24.260 |
And each row here are the counts for the next character, sort of. 01:23:34.520 |
And now the probabilities are just the counts normalized. 01:23:39.860 |
And so I'm not going to scroll all over the place; basically, just like before, 01:23:47.740 |
we take counts and divide by counts.sum along dimension 1, keeping the dimension so broadcasting works. 01:23:56.400 |
And this is how we normalize the rows of our counts matrix to get our probabilities. 01:24:08.240 |
And these are the counts that we have currently. 01:24:10.880 |
And now when I show the probabilities, you see that every row here, of course, will sum to 1. 01:24:27.640 |
And so really what we've achieved is, for every one of our five examples, we now have a row of 27 probabilities. 01:24:35.360 |
And because of the transformations here, we made sure that the outputs of this neural 01:24:39.480 |
net are now probabilities, or at least numbers that we can interpret as probabilities. 01:24:48.260 |
So we took the inputs through a linear layer to get logits, and then we interpret those to be log counts. 01:24:50.880 |
We exponentiate to get something that looks like counts. 01:24:54.200 |
And then we normalize those counts to get a probability distribution. 01:24:57.700 |
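In code, that interpretation is roughly the following (random logits stand in here for xenc @ W):

```python
import torch

logits = torch.randn((5, 27))                  # stand-in for xenc @ W

counts = logits.exp()                          # interpret logits as log counts; exp gives "counts"
probs = counts / counts.sum(1, keepdim=True)   # normalize each row into a distribution

print(probs.shape)      # torch.Size([5, 27])
print(probs[0].sum())   # tensor(1.) -- each row sums to one
```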
And all of these are differentiable operations. 01:25:00.560 |
So what we've done now is we are taking inputs. 01:25:03.400 |
We have differentiable operations that we can back propagate through. 01:25:07.120 |
And we're getting out probability distributions. 01:25:10.040 |
So for example, for the zeroth example that we fed in, the input was the integer 0, 01:25:20.880 |
and it basically corresponded to feeding in the dot character, the start token. 01:25:30.560 |
And the way we fed the dot into the neural net is that we first got its index, then one-hot encoded it and multiplied it by W. 01:25:38.680 |
And out came this distribution of probabilities. 01:25:48.920 |
And we're going to interpret this as the neural net's assignment for how likely every one 01:25:54.400 |
of these characters, the 27 characters, are to come next. 01:25:59.980 |
And as we tune the weights W, we're going to be, of course, getting different probabilities out. 01:26:07.400 |
And so now the question is just: can we optimize and find a good W such that the probabilities coming out are pretty good? 01:26:14.720 |
And the way we measure pretty good is by the loss function. 01:26:16.960 |
OK, so I organized everything into a single summary so that hopefully it's a bit clearer: here are our inputs xs, 01:26:26.600 |
And we have some labels for the correct next character in a sequence. 01:26:33.000 |
Here I'm using torch generators now so that you see the same numbers that I see. 01:26:48.800 |
Then here we're going to plug in all the input examples, x's, into a neural net. 01:26:55.200 |
First, we have to encode all of the inputs into one-hot representations. 01:27:12.480 |
We then multiply this in the first layer of a neural net to get logits. 01:27:17.440 |
Exponentiate the logits to get fake counts, sort of. 01:27:20.920 |
And normalize these counts to get probabilities. 01:27:24.560 |
So these last two lines here, by the way, are called the softmax, which I pulled up here. 01:27:32.600 |
Softmax is a very often used layer in a neural net that takes these z's, which are logits, 01:27:39.640 |
exponentiates them, and divides and normalizes. 01:27:43.320 |
It's a way of taking outputs of a neural net layer. 01:27:46.540 |
And these outputs can be positive or negative. 01:27:52.600 |
It outputs something that always sums to 1 and are positive numbers, just like probabilities. 01:27:59.080 |
So this is kind of like a normalization function, if you want to think of it that way. 01:28:02.400 |
And you can put it on top of any other linear layer inside a neural net. 01:28:05.840 |
And it basically makes a neural net output probabilities. 01:28:13.600 |
So this is the forward pass, and that's how we made a neural net output probability. 01:28:18.080 |
Now, you'll notice that this entire forward pass is made up of differentiable operations. 01:28:28.200 |
Everything here, we can backpropagate through. 01:28:30.400 |
And we saw some of the backpropagation in micrograd. 01:28:36.700 |
All that's happening here is just multiply and add. 01:28:38.960 |
And we know how to backpropagate through them. 01:28:41.000 |
Exponentiation, we know how to backpropagate through. 01:28:44.160 |
And then here, we are summing, and sum is easy to backpropagate through as well, and so is division. 01:28:52.200 |
So everything here is a differentiable operation, and we can backpropagate through. 01:28:57.080 |
Now, we achieve these probabilities, which are 5 by 27. 01:29:01.880 |
For every single example, we have a vector of probabilities that sum to 1. 01:29:06.560 |
And then here, I wrote a bunch of stuff to sort of like break down the examples. 01:29:11.680 |
So we have five examples making up Emma, right? 01:29:20.240 |
So bigram example 1 is the dot followed by e: the input is dot, and e is the correct next character. 01:29:36.160 |
We get probabilities from the neural net that are 27 numbers. 01:29:41.540 |
And then the label is 5, because E actually comes after dot. 01:29:48.140 |
And then we use this label 5 to index into the probability distribution here. 01:30:04.320 |
So that's basically the probability assigned by the neural net to the actual correct character. 01:30:08.960 |
You see that the network currently thinks that this next character, that E following 01:30:12.960 |
dot, is only 1% likely, which is, of course, not very good, right? 01:30:19.980 |
And the network thinks that this is currently very, very unlikely. 01:30:22.720 |
But that's just because we didn't get very lucky in generating a good setting of W. 01:30:27.140 |
So right now, this network thinks this is unlikely. 01:30:32.180 |
So the log likelihood, then, is very negative. 01:30:36.100 |
And the negative log likelihood is very positive. 01:30:39.660 |
And so 4 is a very high negative log likelihood. 01:30:43.420 |
And that means we're going to have a high loss. 01:30:46.980 |
The loss is just the average negative log likelihood. 01:30:53.860 |
And you see here that also the network thought that M following E is very unlikely, 1%. 01:31:04.500 |
And for A following M, it actually thought it was 7% likely. 01:31:08.020 |
So just by chance, this one actually has a pretty good probability and therefore a pretty low negative log likelihood. 01:31:15.580 |
And finally here, it thought this was 1% likely. 01:31:18.580 |
So overall, our average negative log likelihood, which is the loss, the total loss that summarizes 01:31:24.740 |
basically how well this network currently works, at least on this one word, not on the 01:31:29.220 |
full data set, just the one word, is 3.76, which is actually a fairly high loss. 01:31:41.480 |
We can actually come here and we can change our W. We can resample it. 01:31:45.860 |
So let me just add one to have a different seed. 01:31:48.980 |
And then we get a different W. And then we can rerun this. 01:31:53.100 |
And with this different seed, with this different setting of W's, we now get 3.37. 01:32:00.940 |
And it's better because the probabilities just happen to come out higher for the characters that actually come next. 01:32:09.100 |
And so you can imagine actually just resampling this. 01:32:20.180 |
Okay, this was a terrible setting because we have a very high loss. 01:32:28.020 |
What I'm doing here, which is just guess and check of randomly assigning parameters 01:32:33.460 |
and seeing if the network is good, that is amateur hour. 01:32:39.300 |
The way you optimize a neural net is you start with some random guess, and we're going to 01:32:42.740 |
commit to this one, even though it's not very good. 01:32:45.380 |
But now the big deal is we have a loss function. 01:32:48.500 |
So this loss is made up only of differentiable operations. 01:32:54.500 |
And we can minimize the loss by tuning the Ws, by computing the gradients of the loss with respect to these W parameters. 01:33:05.320 |
And so then we can tune W to minimize the loss and find a good setting of W using gradient-based 01:33:13.260 |
Now things are actually going to look almost identical to what we had with micrograd. 01:33:17.300 |
So here I pulled up the lecture from micrograd, the notebook that's from this repository. 01:33:24.100 |
And when I scroll all the way to the end where we left off with micrograd, we had something very similar. 01:33:31.020 |
In this case, we had four input examples inside Xs. 01:33:37.920 |
Just like here, we have our Xs now, but we have five of them. 01:33:44.380 |
But we're going to convert our integers to one-hot vectors, and our vectors will be 27-dimensional. 01:33:52.280 |
And then here, what we did is first we did a forward pass, where we ran the neural net on all of the inputs to get predictions. 01:34:00.500 |
Our neural net at the time, this n(x), was a multilayer perceptron. 01:34:05.340 |
Our neural net here is going to look different, because it's just a single linear layer followed by a softmax. 01:34:16.100 |
And the loss here was the mean squared error. 01:34:18.600 |
So we simply subtracted the prediction from the ground truth and squared it and summed it all up. 01:34:24.380 |
And loss was the single number that summarized the quality of the neural net. 01:34:28.700 |
And when loss is low, like almost zero, that means the neural net is predicting correctly. 01:34:36.440 |
So we had a single number that summarized the performance of the neural net. 01:34:42.240 |
And everything here was differentiable and was stored in a massive compute graph. 01:34:47.040 |
And then we iterated over all the parameters. 01:34:49.320 |
We made sure that the gradients are set to zero. 01:34:54.360 |
And loss.backward initiated backpropagation at the final output node of loss. 01:35:03.720 |
So we started backpropagation and we went all the way back. 01:35:06.600 |
And we made sure that we populated all the parameters dot grad. 01:35:11.000 |
So dot grad started at zero, but backpropagation filled it in. 01:35:14.600 |
And then in the update, we iterated over all the parameters. 01:35:17.680 |
And we simply did a parameter update where every single element of our parameters was 01:35:23.800 |
nudged in the opposite direction of the gradient. 01:35:27.860 |
And so we're going to do the exact same thing here. 01:35:31.960 |
So I'm going to pull this up on the side here. 01:35:40.080 |
And we're actually going to do the exact same thing. 01:35:49.200 |
So now we have to evaluate the loss, but we're not using the mean squared error. 01:35:52.600 |
We're using the negative log likelihood because we are doing classification. 01:36:02.640 |
Now the way we calculate it is just this average negative log likelihood. 01:36:07.380 |
Now this probs here has a shape of five by 27. 01:36:13.660 |
And so we basically want to pluck out the probabilities at the correct positions. 01:36:19.720 |
So in particular, because the labels are stored here in the array y's, basically what we're 01:36:25.120 |
after is, for the first example, we're looking at the probability at index 5, right? 01:36:31.120 |
For the second example, at the second row, or row index one, we are interested in the probability at index 13. 01:36:47.640 |
And at the last row, which is four, we want zero. 01:36:51.480 |
So these are the probabilities we're interested in, right? 01:36:54.280 |
And you can see that they're not amazing as we saw above. 01:36:58.880 |
So these are the probabilities we want, but we want a more efficient way to access them. 01:37:04.760 |
Not just listing them out in a tuple like this. 01:37:07.280 |
So it turns out that the way to do this in PyTorch, one of the ways at least, is we can 01:37:11.340 |
basically pass in all of these integers as index tensors. 01:37:22.280 |
So these ones, you see how they're just 0, 1, 2, 3, 4. 01:37:27.320 |
We can actually create that using torch.arange(5), which gives 0, 1, 2, 3, 4. 01:37:35.280 |
So we index the rows here with torch.arange(5), and we index the columns with ys. 01:37:40.600 |
And you see that that gives us exactly these numbers. 01:37:49.200 |
So that plucks out the probabilities that the neural network assigns to the correct next character in each bigram. 01:37:56.460 |
Now we take those probabilities, and we actually look at the log probability. 01:38:00.720 |
So we call .log() on them, and then we want to just average that up. 01:38:06.840 |
So take the mean of all of that, and then negate it, and that's the negative average log likelihood that we want. 01:38:14.400 |
So the loss here is 3.7 something, and you see that this loss, 3.76, 3.76 is exactly 01:38:22.080 |
as we've obtained before, but this is a vectorized form of that expression. 01:38:29.720 |
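A self-contained sketch of that vectorized expression, with a random stand-in for probs:

```python
import torch

# ys holds the five target indices for "emma"; probs would normally come from
# the forward pass, but a random normalized stand-in keeps this runnable.
ys = torch.tensor([5, 13, 13, 1, 0])
probs = torch.rand((5, 27))
probs = probs / probs.sum(1, keepdim=True)

# Pluck out the probability assigned to the correct next character for every
# example, take the log, average, and negate.
loss = -probs[torch.arange(5), ys].log().mean()
print(loss)
```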
And this same loss we can consider as part of the forward pass, and we've now achieved the loss. 01:38:35.960 |
Okay, so we made our way all the way to loss. 01:38:44.540 |
So backward pass: we want to first make sure that all the gradients are reset, so they're at zero. 01:38:52.600 |
Now in PyTorch, you can set the gradients to be zero, but you can also just set it to 01:38:57.400 |
none, and setting it to none is more efficient. 01:39:00.440 |
And PyTorch will interpret None as a lack of a gradient, which is the same as it being zero. 01:39:06.000 |
So this is a way to set the gradient to zero. 01:39:15.080 |
Before we do loss.backward, we need one more thing. 01:39:17.080 |
If you remember from micrograd, PyTorch actually requires that we pass in requires_grad=True, 01:39:23.480 |
so that we tell PyTorch that we are interested in calculating gradients for this leaf tensor. 01:39:33.720 |
So let me recalculate with that, and then set to none and loss.backward. 01:39:40.920 |
Now something magical happened when loss.backward was run, because PyTorch, just like micrograd, 01:39:47.280 |
when we did the forward pass here, it keeps track of all the operations under the hood. 01:39:55.080 |
Just like the graphs we produced in micrograd, those graphs exist inside PyTorch. 01:40:01.020 |
And so it knows all the dependencies and all the mathematical operations of everything. 01:40:05.260 |
And when you then calculate the loss, we can call a dot backward on it. 01:40:09.860 |
And dot backward then fills in the gradients of all the intermediates all the way back 01:40:16.320 |
to W's, which are the parameters of our neural net. 01:40:20.180 |
So now we can do W.grad, and we see that it has structure. 01:40:29.400 |
And these gradients have the same shape as W; W.shape is 27 by 27, and so is W.grad. 01:40:40.920 |
And every element of W.grad is telling us the influence of that weight on the loss function. 01:40:48.960 |
So for example, consider this number all the way here, the gradient of the loss with respect to the (0, 0) element of W. 01:40:55.960 |
Because the gradient is positive, it's telling us that this weight has a positive influence on the loss. 01:41:01.720 |
Slightly nudging it, taking W[0, 0] and adding a small h to it, would increase 01:41:11.240 |
the loss mildly, because this gradient is positive. 01:41:18.820 |
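Roughly, the forward-and-backward sequence looks like this; the shapes match the lecture, but the random initialization here is just an illustrative stand-in:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])

# requires_grad=True tells PyTorch to track operations on W for backprop.
W = torch.randn((27, 27), requires_grad=True)

# forward pass
xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
loss = -probs[torch.arange(5), ys].log().mean()

# backward pass
W.grad = None        # reset the gradient (None is treated as zero and is cheaper)
loss.backward()      # backprop fills in W.grad
print(W.grad.shape)  # torch.Size([27, 27]) -- one gradient entry per weight
```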
So that's telling us about the gradient information, and we can use this gradient information to update the parameters of this network. 01:41:28.500 |
It's going to be very similar to what we had in micrograd. 01:41:30.960 |
We don't need a loop over all the parameters, because we only have one parameter tensor, and that is W. 01:41:42.360 |
We can actually copy the update almost exactly: W.data += -0.1 * W.grad. 01:41:59.060 |
And because the tensor is updated, we would expect that now the loss should decrease. 01:42:04.520 |
So here, if I print loss.item, it was 3.76, right? 01:42:16.180 |
So if I recalculate forward pass, loss now should be slightly lower. 01:42:26.040 |
And then we can again set grad to none and backward, update. 01:42:35.080 |
So if we recalculate the forward pass, we expect a lower loss again, 3.72. 01:42:48.780 |
And when we achieve a low loss, that will mean that the network is assigning high probabilities to the correct next characters. 01:42:55.040 |
Okay, so I rearranged everything and I put it all together from scratch. 01:42:59.640 |
So here is where we construct our data set of bigrams. 01:43:03.480 |
You see that we are still iterating only over the first word, Emma. 01:43:09.120 |
I added a number that counts the number of elements in X's so that we explicitly see 01:43:14.800 |
that number of examples is five, because currently we're just working with Emma. 01:43:20.840 |
And here I added a loop of exactly what we had before. 01:43:23.920 |
So we had 10 iterations of gradient descent: forward pass, backward pass, and an update. 01:43:29.180 |
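A sketch of those two cells combined, restricted to the five "emma" bigrams so it stays self-contained (the real run uses all 228,000 bigrams, where a larger learning rate, even 50, works as mentioned below):

```python
import torch
import torch.nn.functional as F

# dataset: the five bigrams of ".emma."
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])
num = xs.nelement()   # number of examples

# initialization
W = torch.randn((27, 27), requires_grad=True)

# gradient descent
for k in range(10):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(num), ys].log().mean()
    print(loss.item())

    # backward pass
    W.grad = None
    loss.backward()

    # update
    W.data += -0.1 * W.grad
```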
And so running these two cells, initialization and gradient descent, gives us some improvement on the loss; and now let's bring in the full dataset of all the words. 01:43:42.000 |
And there's not five, but 228,000 bigrams now. 01:43:46.520 |
However, this should require no modification whatsoever. 01:43:49.780 |
Everything should just run, because all the code we wrote doesn't care if there are five bigrams or 228,000 bigrams. 01:44:00.080 |
But now we are optimizing over the entire training set of all the bigrams. 01:44:04.920 |
And you see now that the loss is decreasing very slightly. 01:44:07.700 |
So actually, we can probably afford a larger learning rate. 01:44:11.080 |
We can probably afford an even larger learning rate. 01:44:20.980 |
Even 50 seems to work on this very, very simple example. 01:44:24.140 |
So let me reinitialize, and let's run 100 iterations. 01:44:36.440 |
We seem to be coming up to some pretty good losses here. 01:44:44.760 |
What is the number that we expect, by the way, in the loss? 01:44:47.440 |
We expect to get something around what we had originally, actually. 01:44:52.200 |
So all the way back, if you remember, in the beginning of this video, when we optimized 01:44:57.480 |
just by counting, our loss was roughly 2.47 after we added smoothing. 01:45:03.720 |
But before smoothing, we had a loss of roughly 2.45. 01:45:09.920 |
And so that's actually roughly the vicinity of what we expect to achieve. 01:45:16.060 |
And here, we are achieving roughly the same result, but with gradient-based optimization. 01:45:26.500 |
And that makes sense, because fundamentally, we're not taking in any additional information. 01:45:30.080 |
We're still just taking in the previous character and trying to predict the next one. 01:45:33.840 |
But instead of doing it explicitly by counting and normalizing, we are doing it with gradient-based 01:45:40.360 |
And it just so happens that the explicit approach happens to very well optimize the loss function 01:45:45.620 |
without any need for gradient-based optimization, because the setup for bigram language models is so simple. 01:45:53.040 |
We can just afford to estimate those probabilities directly and maintain them in a table. 01:45:59.080 |
But the gradient-based approach is significantly more flexible. 01:46:03.080 |
So we've actually gained a lot, because what we can do now is we can expand this approach and make it much more complex. 01:46:13.000 |
So currently, we're just taking a single character and feeding it into a neural net. 01:46:17.980 |
But we're about to iterate on this substantially. 01:46:20.600 |
We're going to be taking multiple previous characters. 01:46:23.520 |
And we're going to be feeding them into increasingly more complex neural nets. 01:46:27.640 |
But fundamentally, the output of the neural net will always just be logits. 01:46:32.880 |
And those logits will go through the exact same transformation. 01:46:35.760 |
We're going to take them through a softmax, calculate the loss function and the negative 01:46:39.920 |
log likelihood, and do gradient-based optimization. 01:46:43.780 |
And so actually, as we complexify the neural nets and work all the way up to transformers, 01:46:49.920 |
none of this will really fundamentally change. 01:46:53.760 |
The only thing that will change is the way we do the forward pass, where we take in some 01:46:58.200 |
previous characters and calculate logits for the next character in the sequence. 01:47:05.440 |
And we'll use the same machinery to optimize it. 01:47:09.160 |
And it's not obvious how we would have extended this bigram approach to the case where there are many more previous characters as input. 01:47:19.440 |
Because eventually, these tables would get way too large, because there's way too many 01:47:23.120 |
combinations of what previous characters could be. 01:47:28.000 |
If you only have one previous character, we can just keep everything in a table that counts. 01:47:32.240 |
But if you have the last 10 characters as input, we can't actually keep everything in a table anymore. 01:47:37.580 |
So this is fundamentally an unscalable approach. 01:47:40.020 |
And the neural network approach is significantly more scalable. 01:47:43.320 |
And it's something that we can actually improve on over time. So I wanted to make a couple of notes here. 01:47:51.360 |
Number one, I want you to notice that this xenc here is made up of one-hot vectors. 01:47:59.160 |
And then those one-hot vectors are multiplied by this W matrix. 01:48:03.400 |
And we think of this as multiple neurons being forwarded in a fully connected manner. 01:48:08.880 |
But actually what's happening here is that, for example, if you have a one-hot vector 01:48:13.600 |
here that has a 1 at, say, the fifth dimension, then because of the way the matrix multiplication 01:48:19.800 |
works, multiplying that one-hot vector with W actually ends up plucking out the fifth 01:48:25.560 |
row of W. Logits would become just the fifth row of W, and that's because of the way the matrix multiplication works. 01:48:40.760 |
But that's actually exactly what happened before. 01:48:43.540 |
Because remember, all the way up here, we have a bigram. 01:48:47.720 |
We took the first character, and then that first character indexed into a row of this matrix here. 01:48:55.160 |
And that row gave us the probability distribution for the next character. 01:48:58.820 |
So the first character was used as a lookup into a matrix here to get the probability distribution for the next character. 01:49:06.120 |
Well, that's actually exactly what's happening here. 01:49:08.800 |
Because we're taking the index, we're encoding it as one-hot, and multiplying it by W. So 01:49:13.680 |
Logits literally becomes the appropriate row of W. And that gets, just as before, exponentiated 01:49:23.360 |
to create the counts, and then normalized and becomes probability. 01:49:27.640 |
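You can verify that row-plucking behaviour yourself with a couple of lines (W here is a random stand-in):

```python
import torch
import torch.nn.functional as F

W = torch.randn((27, 27))

# Multiplying a one-hot vector by W just selects the corresponding row of W.
ix = 5
onehot = F.one_hot(torch.tensor([ix]), num_classes=27).float()
print(torch.allclose(onehot @ W, W[ix]))   # True
```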
So this W here is literally the same as this array here. 01:49:35.320 |
But W, remember, is the log counts, not the counts. 01:49:39.080 |
So it's more precise to say that W exponentiated, W.exp, is this array. 01:49:46.440 |
But this array was filled in by counting, by basically populating the counts of all the bigrams. 01:49:54.480 |
So in the gradient-based framework, we instead initialize it randomly, and then we let the loss guide us to this array. 01:50:03.400 |
So this array exactly here is basically the array W at the end of optimization, except 01:50:10.560 |
we arrived at it piece by piece by following the loss. 01:50:15.240 |
And that's why we also obtain the same loss function at the end. 01:50:18.160 |
And the second note is, if I come here, remember the smoothing where we added fake counts to 01:50:23.840 |
our counts in order to smooth out and make more uniform the distributions of these probabilities. 01:50:31.280 |
And that prevented us from assigning zero probability to any one bigram. 01:50:37.360 |
Now if I increase the count here, what's happening to the probability? 01:50:43.040 |
As I increase the fake count, the probability becomes more and more uniform, because these added counts dominate. 01:50:51.680 |
So if I'm adding plus a million to every single number here, you can see how the row and its 01:50:57.480 |
probability then, when we divide, is just going to become closer and closer to exactly uniform. 01:51:05.400 |
It turns out that the gradient-based framework has an equivalent to smoothing. 01:51:10.960 |
In particular, think through these W's here, which we initialized randomly. 01:51:18.760 |
We could also think about initializing W's to be zero. 01:51:22.320 |
If all the entries of W are zero, then you'll see that logits will become all zero, and 01:51:29.080 |
then exponentiating those logits gives all ones, and then the probabilities turn out to be exactly uniform. 01:51:35.980 |
So basically when the W's are all equal to each other, or especially all zero, then the probabilities come out completely uniform. 01:51:44.640 |
So trying to incentivize W to be near zero is basically equivalent to label smoothing. 01:51:53.120 |
And the more you incentivize that in the loss function, the smoother the distribution you're going to achieve. 01:51:59.140 |
So this brings us to something that's called regularization, where we can actually augment 01:52:03.520 |
the loss function to have a small component that we call a regularization loss. 01:52:07.960 |
In particular, what we're going to do is we can take W, and we can, for example, square 01:52:12.800 |
all of its entries, and then we can take all the entries of W and we can sum them. 01:52:23.880 |
And because we're squaring, there will be no signs anymore. 01:52:28.680 |
Negatives and positives all get squashed to be positive numbers. 01:52:31.760 |
And then the way this works is you achieve zero loss if W is exactly zero, but if W has non-zero entries, you accumulate loss. 01:52:41.420 |
And so we can actually take this and we can add it on here. 01:52:45.080 |
So we can do something like loss plus (W**2).sum(), or actually, instead of a sum, 01:52:53.640 |
let's take a mean, because otherwise the sum gets too large. 01:52:57.720 |
So mean is like a little bit more manageable. 01:53:01.560 |
And then we have a regularization loss here, with a small strength, like say 0.01 times this, or something like that. 01:53:12.280 |
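As a sketch, the regularized loss is just the data loss plus that extra term; stand-in tensors are used here so the snippet runs on its own:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])
W = torch.randn((27, 27), requires_grad=True)

xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)

# Average NLL plus a regularization term that pulls W toward zero;
# its strength (0.01 here) plays the same role as the size of the fake counts in smoothing.
loss = -probs[torch.arange(5), ys].log().mean() + 0.01 * (W**2).mean()
```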
And now this optimization actually has two components. 01:53:15.480 |
Not only is it trying to make all the probabilities work out, but in addition to that, there's 01:53:19.440 |
an additional component that simultaneously tries to make all the W's be zero, because non-zero W's add to the loss. 01:53:26.400 |
And so minimizing this, the only way to achieve that is for W to be zero. 01:53:30.840 |
And so you can think of this as adding a spring force, or a gravity force, that pushes W towards zero. 01:53:38.000 |
So W wants to be zero and the probabilities want to be uniform, but they also simultaneously 01:53:42.720 |
want to match up your probabilities as indicated by the data. 01:53:47.600 |
And so the strength of this regularization is exactly analogous to the number of fake counts we were adding in smoothing. 01:53:57.520 |
Adding a lot more counts here corresponds to increasing this number, because the more 01:54:05.200 |
you increase it, the more this part of the loss function dominates this part, and the 01:54:09.840 |
more these weights will be unable to grow, because as they grow, they accumulate way too much loss. 01:54:18.600 |
And so if this is strong enough, then we are not able to overcome the force of this loss, 01:54:25.040 |
and we will never be able to fit the data; basically, everything will be uniform predictions. 01:54:30.880 |
Okay, and lastly, before we wrap up, I wanted to show you how you would sample from this neural net model. 01:54:37.160 |
And I copy-pasted the sampling code from before, where remember that we sampled five times, 01:54:44.560 |
and all we did is we started at zero, we grabbed the current ix row of p, and that was our 01:54:50.800 |
probability row, from which we sampled the next index, and we kept going until we sampled the special end token. 01:55:03.960 |
I still have the p in memory, so this is fine. 01:55:07.720 |
Now this p doesn't come from a row of the matrix P; instead, it comes from the neural net. 01:55:15.280 |
First we take ix, and we encode it into a one-hot row, xenc. 01:55:22.640 |
This xenc multiplies our W, which really just plucks out the row of W corresponding to ix. 01:55:30.660 |
And that gives us our logits; then we exponentiate those logits to get 01:55:35.600 |
counts, and then normalize to get the distribution, and then we can sample from the distribution. 01:55:41.560 |
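A sketch of that sampling loop; W here is a random stand-in for the trained weights, and the itos mapping is rebuilt inline (the real one comes from the dataset earlier in the video):

```python
import torch
import torch.nn.functional as F

# index-to-character mapping: 0 is '.', 1..26 are 'a'..'z'
itos = {i + 1: s for i, s in enumerate('abcdefghijklmnopqrstuvwxyz')}
itos[0] = '.'

W = torch.randn((27, 27))   # stand-in for the trained 27x27 weight matrix
g = torch.Generator().manual_seed(2147483647)

for _ in range(5):
    out = []
    ix = 0                                               # start at the '.' token
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W                                # effectively plucks row ix of W
        counts = logits.exp()
        p = counts / counts.sum(1, keepdim=True)         # (1, 27) distribution over next char
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        if ix == 0:                                      # sampled the end token
            break
        out.append(itos[ix])
    print(''.join(out))
```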
So if I run this, it's kind of anticlimactic or climactic, depending how you look at it, but we get the exact same result. 01:55:52.480 |
And that's because this is the identical model. 01:55:54.880 |
Not only does it achieve the same loss, but as I mentioned, these are identical models, 01:56:00.040 |
and this w is the log counts of what we've estimated before. 01:56:04.440 |
But we came to this answer in a very different way, and it's got a very different interpretation. 01:56:09.680 |
But fundamentally this is basically the same model, and it gives the same samples here and the same loss. 01:56:15.640 |
Okay, so we've actually covered a lot of ground. 01:56:18.200 |
We introduced the bigram character-level language model. 01:56:22.160 |
We saw how we can train the model, how we can sample from the model, and how we can 01:56:26.000 |
evaluate the quality of the model using the negative log likelihood loss. 01:56:30.440 |
And then we actually trained the model in two completely different ways that actually give the same result. 01:56:36.480 |
In the first way, we just counted up the frequency of all the bigrams and normalized. 01:56:41.680 |
In the second way, we used the negative log likelihood loss as a guide to optimizing the 01:56:48.920 |
counts matrix, or the counts array, so that the loss is minimized in a gradient-based framework. 01:56:56.200 |
And we saw that both of them give the same result, and that's it. 01:57:01.600 |
Now the second one of these, the gradient-based framework, is much more flexible. 01:57:05.160 |
And right now our neural network is super simple. 01:57:07.760 |
We're taking a single previous character, and we're taking it through a single linear layer to calculate the logits. 01:57:16.080 |
So in the follow-up videos, we're going to be taking more and more of these characters, 01:57:20.800 |
and we're going to be feeding them into a neural net. 01:57:23.140 |
But this neural net will still output the exact same thing: the neural net will output logits. 01:57:28.320 |
And these logits will still be normalized in the exact same way, and all the loss and 01:57:31.600 |
everything else in the gradient-based framework, everything stays identical. 01:57:35.640 |
It's just that this neural net will now complexify all the way to transformers. 01:57:40.680 |
So that's going to be pretty awesome, and I'm looking forward to it.