Back to Index

Stanford CS224N - NLP w/ DL | Winter 2021 | Lecture 5 - Recurrent Neural Networks (RNNs)


Chapters

0:00 Introduction
2:45 Dependency parsers
5:42 Neural dependency parser
8:14 Distributed representations
11:19 Softmax classifier
14:51 Multiclass classifier
17:15 Dependency parser model
18:57 What's happened since 2014
20:06 Context
22:21 Graph-based parsers
24:35 Regularization
27:59 L2 Regularization
30:18 Dropout
34:30 Backward Pass
36:21 Nonlinearity
43:27 Parameter initialization
45:51 Training
47:39 Learning Rate
50:00 Language Models
52:58 N-gram Language Models
54:16 Markov Model
55:37 Four-gram Language Model
58:43 Sparsity Problems

Transcript

So we're now starting in week three with lecture five. So unfortunately, in the last class, I guess I really got behind and went a bit slowly. I guess I must just enjoy talking about natural languages too much. And so I never really got to the punchline of showing how you could do good things with the neural dependency parser.

So today for the first piece, I'll in some sense be finishing the content of last time and talk about neural dependency parsing, which also gives us the opportunity to introduce a simple feed-forward neural net classifier. That will then lead into a little bit of background content that you need to know about neural networks, because the fact of the matter is there is a bunch of stuff you need to know about neural networks.

And then after both of those things, I'll get into what's really meant to be the topic of today's lecture, which is looking at language modeling and recurrent neural networks. Those two things are important topics that we'll then be talking about for the whole of next week as well.

So there's a couple of reminders before we get underway. The first is that you should have handed in assignment two before you joined this class. And in turn, assignment three is out today. And it's an assignment where you're going to build essentially the neural dependency parser that I'm just about to present in PyTorch.

So part of the role of this assignment is actually to get you up to speed with PyTorch. So this assignment is highly scaffolded with lots of comments and hints about what to do. And so the hope is that by the time you come to the end of it, you'll feel fairly familiar and comfortable with PyTorch.

Don't forget there was also a tutorial on PyTorch last week. If you didn't catch that at the time, you might want to go back and look at the video. Another thing to mention about the assignments is that assignment three is the last assignment where our great team of TAs are happy to look at your code and sort out your bugs for you.

So maybe take advantage of that, but not too much. But starting with assignment four, for assignments four, five and the final project, the TAs are very happy to help in general, but it's just not going to be their job to actually sort out bugs for you. You should be looking at your code yourself, and discussing ideas, concepts, and reasons why things might not work with them.

Okay, so if you remember where we were last time, I'd introduced this idea of transition-based dependency parsers and that these were an efficient linear time method for giving the syntactic structure of natural language text. And that they worked pretty well before neural nets came along and took over NLP again, but they had some disadvantages.

And their biggest disadvantage is that like most machine learning models of that time, they worked with indicator features. So that means that you are specifying some condition and then checking whether it was true of a configuration. So something like: the word on the top of the stack is 'good' and its part of speech is adjective, or the next word coming up is a personal pronoun. Those are conditions that would be features in a conventional transition-based dependency parser.

And so what are the problems with doing that? Well, one problem is that those features are very sparse. A second problem is the features are incomplete. Well, what I mean by that is depending on what words and configurations occurred in the training data, there are certain features that will exist because you sort of saw a certain word preceding a verb and certain features that just won't exist because that word never occurred before a verb in the training data.

But perhaps the biggest problem, and the opportunity for doing better with a neural dependency parser, is that in a symbolic dependency parser, computing all these features turns out to actually be pretty expensive. Although the actual transition system that I showed last time is fast and efficient to run, you still have to compute all of these features.

And what you found was that about 95% of the parsing time of one of these models was spent just computing all of the features of every configuration. So that suggests that perhaps we can do better with a neural approach where we're going to learn a dense and compact feature representation.

And so that's what I wanna go through now. So this time, we're still gonna have exactly the same kind of configuration of a stack and a buffer and running exactly the same transition sequence, except this time, rather than representing the configuration of the stack and the buffer by having several million symbolic features, we're instead going to summarize this configuration as a dense vector of dimensionality, perhaps approximately a thousand.

And our neural approach is going to learn this dense compact feature representation. And so quite explicitly, what I'm gonna show you now briefly and what you're going to implement is essentially the neural dependency parser that was developed by Danqi Chen in 2014. And to skip to the advertisement right at the beginning as to how this works so well, these are the kind of results that you got from it, using the measures that I introduced last time: the unlabeled attachment score, whether you attach dependencies correctly to the right word, and the labeled attachment score, whether you also get the type of grammatical relation of that dependency correct.

And so essentially, this Chen and Manning parser gave a neural version of something like a transition-based dependency parser such as MaltParser, shown in yellow. And the interesting thing was that taking advantage of a neural classifier in ways that I'm about to explain, that could produce something that was about 2% more accurate than the symbolic dependency parser.

And because of the fact that it's not doing all of the symbolic feature computation, despite the fact that you might think at first that there's a lot of real number math and matrix vector multiplies in a neural dependency parser, it actually ran noticeably faster than the symbolic dependency parser because it didn't have all of the feature computation.

The other major approach to dependency parsing that I'm also showing here, and I'll get back to at the end, is what's referred to as graph-based dependency parsing. And so that's a different approach to dependency parsing. And so these are two symbolic graph-based dependency parsers and in the pre-neural world, they were somewhat more accurate than the transition-based parsers as you could see, but on the other hand, they were close to two orders of magnitude slower.

And so essentially with the Chen and Manning parser, we were able to provide something that was basically as accurate as the best graph-based dependency parsers, which were the best dependency parsers, while operating about two orders of magnitude more quickly. So how did we do it? It was actually a very straightforward implementation, which is part of what makes it great for assignment three.

But this is how we did it and we got wins. So the first win, which is what we've already talked about extensively starting in week one, is to make use of distributed representations. So we represent each word as a word embedding, and you've had a lot of experience with that already.

And so that means when words weren't seen in a particular configuration, we still know what they're like, because we will have seen similar words in similar configurations. But we don't stop only with word embeddings. The other things that are central to our dependency parser are the parts of speech of words and the dependency labels.

And so what we decided to do is that although those are much smaller sets, so the dependency labels are about 40 in number and the parts of speech are of around that order of magnitude, sometimes less, sometimes more, that even within those sets of categories, there are ones that are very strongly related.

So we also adopted distributed representations for them. So for example, there might be parts of speech for singular nouns and plural nouns, and basically most of the time they behave similarly. And there are adjectival modifiers and numerical modifiers, so these are just numbers like three, four, five, and again, a lot of the time they behave the same: you have both three cows and brown cows.

Okay, so everything is going to be represented in the distributed representation. So at that point, we have exactly the same kind of configuration where we have our stack, our buffer, and we've started to build some arcs. And so the classification decisions of the next transition are going to be made out of a few elements of this configuration.

So we're looking at the top thing on the stack, the thing that's second on the stack, the first word on the buffer. And then we actually added in some additional features: to the extent that we've already built arcs for words on the stack, we can look at the dependents on the left and right of those words on the stack that are already in the set of arcs.

And so for each of those things, there is a word, there is a part of speech, and for some of them, there is a dependency where it's already connected up to something else. So for example, the left corner of S2 here has an nsubj dependency back to the second thing on the stack.

So we can take these elements of the configuration and can look up the embedding of each one. So we have word embeddings, part of speech embeddings, and dependency embeddings, and just concatenate them all together, kind of like we did before with the window classifier, and that will give us a neural representation of the configuration.
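
As a rough PyTorch sketch of that lookup-and-concatenate step (the vocabulary sizes, dimensions, and indices here are illustrative placeholders, not the values from the paper):

```python
import torch
import torch.nn as nn

# Embedding tables for words, part-of-speech tags, and dependency labels;
# sizes here are illustrative placeholders.
word_emb = nn.Embedding(50_000, 50)
pos_emb = nn.Embedding(45, 50)
dep_emb = nn.Embedding(40, 50)

# Indices of the selected configuration elements (e.g. top of stack, second on
# stack, first buffer word, and some left/right dependents); hypothetical values.
word_ids = torch.tensor([103, 5, 20871, 7, 42, 9])
pos_ids = torch.tensor([3, 17, 8, 3, 12, 6])
dep_ids = torch.tensor([0, 0, 0, 11, 23, 0])

# Look everything up and concatenate into one dense configuration vector.
x = torch.cat([word_emb(word_ids).flatten(),
               pos_emb(pos_ids).flatten(),
               dep_emb(dep_ids).flatten()])   # shape: (900,)
```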

Now, there's a second reason why we can hope to win by using a deep learning classifier to predict the next transition. And we haven't really said much about that yet. So I just wanted to detour and say a little bit more about that. So the simplest kind of classifier that's close to what we've been talking about in neural models is a softmax classifier.

So if we have d-dimensional vectors x, and y is an element of a set of C classes to assign things to, then we can build a softmax classifier using the softmax distribution that we've seen before, where we decide the classes based on having a weight matrix that's C by d, and we train the values of this W weight matrix on supervised data to minimize our negative log likelihood loss that we've seen before, a loss that's also commonly referred to as cross-entropy loss, a term that you'll see in PyTorch among other places.
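
Written out, the classifier being described is, for a C by d weight matrix W and a d-dimensional input x:

p(y \mid x) = \mathrm{softmax}(Wx)_y = \frac{\exp(W_{y\cdot}\, x)}{\sum_{c=1}^{C} \exp(W_{c\cdot}\, x)}, \qquad J(W) = \frac{1}{N} \sum_{i=1}^{N} -\log p(y_i \mid x_i)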

So that is a straightforward machine learning classifier. And if you've done 229, you've seen softmax classifiers. But a simple softmax classifier like this shares a limitation with most traditional machine learning classifiers, models like naive Bayes, support vector machines, and logistic regression: at the end of the day, they're not very powerful classifiers.

They're classifiers that only give linear decision boundaries. And so this can be quite limiting. So if you have a difficult problem, like the one I'm indicating in the picture in the bottom left, well, there's just no way you can divide the green points from the red points by simply drawing a straight line.

So you're going to have a quite imperfect classifier. So the second big win of neural classifiers is that they can be much more powerful because they can provide nonlinear classification. So rather than only being able to do something like in the left picture, we can come up with classifiers that do something like in the right picture and therefore can separate the green and the red points.

As an aside, these pictures I've taken from Andrej Karpathy's ConvNetJS software, which is a kind of a fun little tool to play around with if you've got a bit of spare time. And there's something subtle going on here, because our more powerful neural net classifiers, at the end of the day, still have a softmax layer at the top of them.

And so this softmax layer is indeed still a linear classifier. But what they have below that is other layers of neural net. And so effectively what happens is that the classification decisions are linear as far as the top softmax is concerned, but nonlinear in the original representation space.

So precisely what a neural net can do is warp the space around and move the representation of data points to provide something that at the end of the day can be classified by a linear classifier. And so that's what a simple feed forward neural network multiclass classifier does. So it starts with an input representation.

So this is some dense representation of the input. It puts it through a hidden layer h with a matrix multiply followed by a nonlinearity. So that matrix multiply can transform the space and map things around. And so then the output of that, we can then put into a softmax layer and get out softmax probabilities from which we make our classification decisions.

And to the extent that our probabilities don't assign one to the correct class, we then get some log loss or cross-entropy error, which we backpropagate towards the parameters and embeddings of our model. And as learning goes on via backpropagation, we increasingly well learn the parameters of this hidden layer of the model, which learn to re-represent the input.

They move the inputs around in an intermediate hidden vector space, so it can be easily classified with what at the end of the day is the linear softmax. So this is basically the whole of a simple feed-forward neural network multi-class classifier. And if we had something like a visual signal, we'd just feed real numbers straight in here and we'd be done.

But normally with human language material, we actually effectively have one more layer that we're feeding in before that. 'Cause really below this dense input layer, we actually have one hot vectors for what words or parts of speech were involved. And then we're doing a lookup process, which you can think of as one more matrix multiply to convert the one hot features into our dense input layer.

Okay, in my picture here, the one other thing that's different is I've introduced a different non-linearity in the hidden layer, which is a rectified linear unit. And that's what we'll be using now in neural dependency parsers. It looks like the picture in the bottom right, and I'll come back to that in a few minutes.

That's one of the extra neural net things to talk about. Okay, so our neural net dependency parser model architecture is essentially exactly that, but applied to the configuration of our transition-based dependency parser. So based on our transition-based dependency parser configuration, we construct an input layer embedding by looking up the various elements as I discussed previously, and then we feed it through this hidden layer to the softmax layer to get probabilities out of which we can choose what the next action is.
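
A minimal PyTorch sketch of that architecture, with placeholder layer sizes and with the transitions collapsed to three classes for illustration:

```python
import torch
import torch.nn as nn

# A minimal sketch of the feed-forward classifier just described; the layer
# sizes and number of transition classes are placeholders, not the paper's exact values.
class ParserClassifier(nn.Module):
    def __init__(self, input_dim=900, hidden_dim=200, num_transitions=3):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)         # transforms the input space
        self.relu = nn.ReLU()                                  # the rectified linear unit
        self.output = nn.Linear(hidden_dim, num_transitions)   # scores for shift / left-arc / right-arc

    def forward(self, x):
        return self.output(self.relu(self.hidden(x)))          # raw scores (logits)

model = ParserClassifier()
loss_fn = nn.CrossEntropyLoss()                                # softmax + negative log likelihood
logits = model(torch.randn(32, 900))                           # a batch of 32 configuration vectors
loss = loss_fn(logits, torch.randint(0, 3, (32,)))             # gold transitions for the batch
```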

And it's no more complicated than that. But what we found is that just simply using, in some sense, the simplest kind of feed-forward neural classifier could provide a very accurate dependency parser that determines the structure of sentences, supporting meaning interpretation in the kind of way that I suggested last time.

Indeed, despite the fact that it was a quite simple architecture in 2014, this was the first successful neural dependency parser. And the dense representations especially, but also partly the non-linearity of the classifier gave us this good result that it could both outperform symbolic parsers in terms of accuracy, and it could outperform them in terms of speed.

So that was 2014. Just quickly here are a couple more slides on what's happened since then. So lots of people got excited by the success of this neural dependency parser, and a number of people, particularly at Google, then set about building a bigger, fancier, transition-based neural dependency parser. So they explored bigger, deeper networks.

There's no reason to only have one hidden layer. You can have two hidden layers. You can do beam search, which I briefly mentioned last time. Another thing that I'm not gonna talk about now is adding conditional random field style inference over decision sequences. And that then led in 2016 to a model that they called Parsey McParseface, which is hard to say with a straight face, which was then about 2.5 to 3% more accurate than the model that we had produced, but still in basically the same family: a transition-based parser with a neural net classifier to choose the next transition.

The alternative to transition-based parsers is graph-based dependency parsers. And for a graph-based dependency parser, what you're doing is effectively considering every pair of words, and considering a word as a dependent of root, and you're coming up with a score as to how likely it is that 'big' is a dependent of root, or how likely 'big' is to be a dependent of 'cat'.

And similarly for every other word, for the word sat, how likely is it to be a dependent of root or a dependent of the, et cetera. And well, to do that well, you need to know more than just what the two words involved are. And so what you want to do is understand the context.

So you want to have an understanding of the context of big, what's to the left of it, what's to the right of it to understand how you might hook it up into the dependency representations of the sentence. And so while there'd been previous work in graph-based dependency parsing, like the MST parser I showed on the earlier results slide, it seemed appealing that we could come up with a much better representation of context using neural nets that look at context.

And how we do that is actually what I'll be talking about in the end part of the lecture. And so at Stanford, we became interested in trying to work out how to come up with a better graph-based dependency parser using context. Sorry, I forgot this. This was showing that if we can score each pairwise dependency, we can simply choose the best one.

So we can say probably 'big' is a dependent of 'cat', and to a first approximation, we're gonna want to choose, for each word, the head that it seems most likely to be a dependent of. But we wanna do that with some constraints, because we wanna get out something that is a tree with a single root, as I discussed last time.

And you can do that by making use of a minimum spanning tree algorithm that uses the scores of how likely different dependencies are. Okay, so then in 2017, another student, Tim Dozat, and I then worked on saying, well, can we now also build a much better neural graph-based dependency parser?

And we developed a novel method for neural scoring of dependencies in a graph-based model, which I'm not gonna get into the details of right now, but that also had a very nice result, 'cause getting back to graph-based parsing, we could then build a graph-based parser that performed about a percent better than the best of the Google transition-based neural dependency parsers.

But I should point out that this is a mixed win, 'cause although its accuracy is better, these graph-based parsers are O(n²) in runtime rather than linear time. So kind of like the early results I showed, they don't operate nearly as quickly when you're wanting to parse large amounts of text with complex long sentences.

Okay, so that's everything you need to know about dependency parsers and to do assignment three. So grab it this evening and start to work. But I did want to sort of, before going on to the next topic, just mention a few more things about neural networks. Since some of you know this well already, some of you have seen less of it, but there just are a bunch of things you have to be aware of for building neural networks.

Now, again, for assignment three, essentially we give you everything. And if you follow the recipe, your parser should work well. But what you should minimally do is actually look carefully at some of the things that this parser does, which is questions like, how do we initialize our matrices of our neural network?

What kind of optimizers do we use? And things like that. 'Cause these are all important decisions. And so I wanted to say just a few words about that. Okay, so the first thing that we haven't discussed at all is the concept of regularization. So when we're building these neural nets, we're now building models with a huge number of parameters.

So essentially just about all neural net models that work well, actually their full loss function is a regularized loss function. So for this loss function here of J, well, this part here is the part that we've seen before where we're using a softmax classifier and then taking a negative log likelihood loss, which we're then averaging over the different examples.

But actually we then stick on the end of it, this regularization term. And so this regularization term sums the square of every parameter in the model. And so what that effectively says is you only wanna make parameters non-zero if they're really useful, right? So to the extent that the parameters don't help much, you're just being penalized here by making them non-zero.

But to the extent that the parameters do help, you'll gain in your estimation of likelihood and therefore it's okay for them to be non-zero. In particular, notice that this penalty is assessed only once per parameter. It's not being assessed separately for each example. Okay, and having this kind of regularization is essential to build neural net models that generalize well.
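
Written out, the regularized loss being described is the average cross-entropy over the N training examples plus the L2 penalty, assessed once over all parameters:

J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{f_{y_i}}}{\sum_{c=1}^{C} e^{f_c}} \;+\; \lambda \sum_{k} \theta_k^2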

So the classic problem is referred to as overfitting. And what overfitting means is that if you have a particular training dataset and you start training your model, your error will go down, because you'll shift the parameters so they better predict the correct answer for the data points in your training data.

And you can keep on doing that and it will keep on reducing your error rate. But if you then look at your partially trained classifier and say, how well does this classifier classify independent data, different test data that you weren't training the model on, what you will find is up until a certain point, you'll get better at classifying independent test examples as well.

And after that, commonly what will happen is you'll actually start to get worse at classifying independent test examples, even though you're continuing to get better at predicting the training examples. And so this was then referred to as overfitting the training examples: you're fiddling the parameters of the model so that they're really good at predicting the training examples, which isn't useful for predicting on independent examples that you come to at runtime.

Okay, that classic view of regularization is sort of actually outmoded and wrong for modern neural networks. So the right way to think of it for the kind of modern big neural networks that we build is that overfitting on the training data isn't a problem, but nevertheless, you need regularization to make sure that your models generalize well to independent test data.

So what you would like is for your graph not to look like this example with test error starting to head up. You'd like to have it at worst case flat line and best case still be gradually dropping. It'll always be higher than the training error, but it's not actually showing a failure to generalize.

So when we train big neural nets these days, our big neural nets always overfit on the training data. They hugely overfit on the training data. In fact, in many circumstances, our neural nets have so many parameters that you can continue to train them on the training data until the error on the training data is zero.

They get every single example right because they can just memorize enough stuff about it to predict the right answer. But in general, providing the models are regularized well, those models will still also generalize well and predict well on independent data. And so for part of what we wanna do for that is to work out how much to regularize.

And so this lambda parameter here is the strength of regularization. So if you're making that lambda number big, you're getting more regularization. And if you make it small, you're getting less. And you don't wanna have it be too big or else you won't fit the data well. And you don't want it to be too small or else you have the problem that you don't generalize well.

Okay, so this is classic L2 regularization and it's a starting point. But our big neural nets are sufficiently complex and have sufficiently many parameters that essentially L2 regularization doesn't cut it. So the next thing that you should know about and is a very standard good feature for building neural nets is a technique called dropout.

So dropout is generally introduced as a sort of a slightly funny process that you do when training to avoid feature co-adaptation. So in dropout, what you do is at the time that you're training your model, that for each instance or for each batch in your training, then for each neuron in the model, you drop 50% of its inputs.

You just treat them as zero. And so that you can do by sort of zeroing out elements of the layers. And then at test time, you don't drop any of the model weights, you keep them all, but you halve all the model weights, because you're now keeping twice as many inputs as you'd used at training time.

And so effectively that little recipe prevents what's called feature co-adaptation. So you can't have features that are only useful in the presence of particular other features because the model can't guarantee which features are gonna be present for different examples because different features are being randomly dropped all of the time.

And so effectively dropout gives you a kind of a middle ground between naive Bayes and a logistic regression model. In a naive Bayes model, all the weights are set independently; in a logistic regression model, all the weights are set in the context of all the others. And here you are aware of other weights, but they can randomly disappear from you.

It's also related to ensemble models like model bagging because you're using different subsets of the features every time. But after all of those explanations, there's actually another way of thinking about dropout which was actually developed here at Stanford. This is a paper by Percy Liang and students, which is to argue that really what dropout gives you is a strong regularizer that isn't a uniform regularizer like L2 that regularizes everything with an L2 loss, but can learn a feature dependent regularization.
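
In PyTorch, dropout looks like the sketch below; note that PyTorch rescales the kept activations at training time ("inverted dropout") rather than halving the weights at test time, which comes to the same thing:

```python
import torch
import torch.nn as nn

# A minimal sketch of dropout: inputs are randomly zeroed only in training mode;
# in eval mode the dropout layer is a no-op.
layer = nn.Sequential(nn.Linear(100, 100), nn.ReLU(), nn.Dropout(p=0.5))

x = torch.randn(4, 100)
layer.train()
out_train = layer(x)   # roughly half the activations are zeroed (and the rest scaled up)
layer.eval()
out_test = layer(x)    # nothing is dropped at test time
```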

And so that dropout has just emerged as in general, the best way to do regularization for neural nets. I think you've already seen and heard this one, but just to have it on my slides once. If you want to have your neural networks go fast, it's really essential that you make use of vectors, matrices, tensors, and you don't do things with for loops.

So here's a teeny example where I'm using timeit, which is a useful thing that you can use too, to see how fast your neural nets run with different ways of writing them. And so when I'm doing these dot products here, I can either do the dot product in a for loop against each word vector, or I can do the dot product with a single word vector matrix.

And if I do it in a for loop, doing each loop takes me almost a second. Whereas if I do it with a matrix multiply, it takes me an order of magnitude less time. So you should always be looking to use vectors and matrices, not for loops. And this is a speed up of about 10 times when you're doing things on a CPU.
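
The comparison being described looks roughly like this (the sizes are made up, and the exact timings will depend on your machine):

```python
import numpy as np
import timeit

# A rough reconstruction of the comparison: dot products in a Python for loop
# versus one matrix-vector multiply.
d, n = 300, 100_000
W = np.random.rand(n, d)      # e.g. a matrix of word vectors
x = np.random.rand(d)

def with_loop():
    return [W[i].dot(x) for i in range(n)]

def with_matmul():
    return W.dot(x)

print(timeit.timeit(with_loop, number=10))    # much slower
print(timeit.timeit(with_matmul, number=10))  # roughly an order of magnitude (or more) faster on CPU
```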

Heading forward, we're going to be using GPUs and they only further exaggerate the advantages of using vectors and matrices where you'll commonly get two orders of magnitude speed up by doing things that way. Yeah, so for the backward pass, you are running a backward pass as before on the dropped out examples, right?

So for the things that were dropped out, no gradient is going through them because they weren't present, they're not affecting things. So in a particular batch, you're only training weights for the things that aren't dropped out. But then since you, for each successive batch, you drop out different things that over a bunch of batches, you're then training all of the weights of the model.

And so a feature-dependent regularizer means that different features can be regularized by different amounts to maximize performance. So back in this model, every feature was just sort of being penalized by taking lambda times its squared value. So that is uniform regularization, whereas the end result of this dropout-style training is that you end up with some features being regularized much more strongly and some other features being regularized less strongly.

And how much they are regularized depends on how much they're being used. So you're regularizing more strongly the features that are being used less, but I'm not gonna get into the details of how you can understand that perspective. That's outside of the scope of what I'm gonna get through right now.

So the final bit is I just wanted to give a little bit of perspective on nonlinearities in our neural nets. So the first thing to remember is you have to have nonlinearities. So if you're building a multi-layer neural net and you've just got a first hidden layer h1 = W1 x + b1, and then you put that through h2 = W2 h1 + b2, and then through h3 = W3 h2 + b3, well, those are the different hidden layers.

Multiple linear transformations compose, so they can just be collapsed down into a single linear transformation. So you don't get any power as a data representation by having multiple linear layers.
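
To see why, compose three linear layers with no nonlinearity in between and you just get another linear layer:

W_3 (W_2 (W_1 x + b_1) + b_2) + b_3 = (W_3 W_2 W_1)\, x + (W_3 W_2 b_1 + W_3 b_2 + b_3) = W' x + b'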

There's a slightly longer story there 'cause you actually do get some interesting learning effects but I'm not gonna talk about that now. But standardly, we have to have some kind of nonlinearity to do something interesting in a deep neural network. Okay, so there's a starting point is the most classic nonlinearity is the logistic often just called the sigmoid nonlinearity because of its S shape which we've seen before in previous lectures.

So this will take any real number and map it on to the range of zero one. And that was sort of basically what people used in sort of 1980s neural nets. Now, one disadvantage of this nonlinearity is that it's moving everything into the positive space because the output is always between zero and one.

So people then decided that for many purposes, it was useful to have this variant sigmoid shape of hyperbolic tan, which is then being shown in the second picture. Now, logistic and hyperbolic tan, they sound like they're very different things but actually, as you maybe remember from a math class, hyperbolic tan can be represented in terms of exponentials as well.

And if you do a bit of math, which possibly we might make you do on an assignment, it's actually the case that hyperbolic tan is just a rescaled and shifted version of the logistic. So it's really exactly the same curve, just squeezed a bit, so it now goes symmetrically between minus one and one.
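
Concretely, the relationship is:

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2 \cdot \mathrm{logistic}(2x) - 1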

Well, these kinds of transcendental functions like hyperbolic tan, they're kind of slow and expensive to compute, right? Even on our fast computers, calculating exponentials is a bit slow. So something people became interested in was, well, could we do things with a much simpler non-linearity? So what if we used a so-called hard tanh?

So the hard tanh, up to some point it just flatlines at minus one, then it is y equals x up until one, and then it just flatlines again. And that seems a slightly weird thing to use, because if your input is over on the left or over on the right, you're sort of not getting any discrimination: different inputs are giving the same output.

But somewhat surprisingly, I mean, I was surprised when people started doing this, these kinds of models proved to be very successful. And so that then led into what's proven to be kind of the most successful and generally widely used non-linearity in a lot of recent deep learning work, which was what was being used in the dependency parser model I showed is what's called the rectified linear unit or ReLU.

So a ReLU is kind of the simplest kind of non-linearity that you can imagine. So if the value of X is negative, its value is zero. So effectively it's just dead, it's not doing anything in the computation. And if its value of X is greater than zero, then it's just simply Y equals X, the value is being passed through.

And at first sight, this might seem really, really weird and how could this be useful as a non-linearity? But if you sort of think a bit about how you can approximate things with piecewise linear functions very accurately, you might kind of start to see how you could use this to do accurate function approximation with piecewise linear functions.

And that's what ReLU units have been found to do extremely, extremely successfully. So logistic and tanh are still used in various places. You use logistic when you want a probability output. We'll see tanh again very soon when we get to recurrent neural networks. But they're no longer the default when making deep networks; in a lot of places, the first thing you should think about trying is ReLU non-linearities.

And so in particular, part of why they're good is that ReLU networks train very quickly, because you get this very straightforward gradient backflow: provided you're on the right-hand side of it, you just get constant gradient backflow from the slope of one. And so they train very quickly.

The somewhat surprising fact is that sort of almost the simplest non-linearity imaginable is still enough to have a very good neural network, but it just is. People have played around with variants of that. So people have then played around with leaky ReLUs where rather than the left-hand side just going completely to zero, it goes slightly negative on a much shallower slope.

And then there's been a parametric ReLU where you have an extra parameter where you learn the slope of the negative part. Another thing that's been used recently is this swish non-linearity, which looks almost like a ReLU, but it sort of curves down just a little bit there and starts to go up.
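
For reference, here is how those non-linearities appear in PyTorch; SiLU is PyTorch's name for the swish-style function, and the values here are purely illustrative:

```python
import torch
import torch.nn as nn

# The non-linearities just mentioned, applied to a few sample values.
x = torch.linspace(-3, 3, 7)
print(nn.ReLU()(x))              # max(0, x)
print(nn.LeakyReLU(0.01)(x))     # small fixed slope on the negative side
print(nn.PReLU()(x))             # negative-side slope is a learned parameter
print(nn.SiLU()(x))              # x * sigmoid(x), a swish-like curve
```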

I mean, I think it's fair to say that, you know, none of these have really proven themselves vastly superior. There are papers saying I can get better results by using one of these and maybe you can, but you know, it's not night and day and a vast majority of work that you see around is still just using ReLUs in many places.

Okay, a couple more things. Parameter initialization. So in almost all cases, you must, must, must initialize the matrices of your neural nets with small random values. Neural nets just don't work if you start the matrices off as zero 'cause effectively then everything is symmetric, nothing can specialize in different ways and you then get sort of, you just don't have an ability for a neural net to learn.

You sort of get this defective solution. So standardly you're using some methods such as drawing random numbers uniformly between minus R and R for a small value R and just filling in all the parameters with that. Exception is with bias weights. It's fine to set bias weights to zero and in some sense that's better.

In terms of choosing what the R value is, essentially for traditional neural nets, what we want to set that R range for is so that the numbers in our neural network stay of a reasonable size. They don't get too big and they don't get too small. And whether they kind of blow up or not depends on how many connections there are in the neural networks.

I'm looking at the fan-in and fan-out of connections in the neural network. And so a very common initialization that you'll see in PyTorch is what's called Xavier initialization, named after the person who suggested it. And it's working out a value of R based on this fan-in and fan-out of the layers, but you can just sort of ask for it, say initialize with this initialization, and it will.
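
In PyTorch that looks like the following sketch; the layer sizes are placeholders:

```python
import torch.nn as nn

# Ask for Xavier (Glorot) initialization, which picks the range from the
# layer's fan-in and fan-out; biases are simply set to zero.
layer = nn.Linear(200, 100)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)
```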

This is another area where there have been some subsequent developments. So around week five, we'll start talking about layer normalization. And if you're using layer normalization, then it sort of doesn't matter the same how you initialize the weights. So finally, we have to train our models. And I've briefly introduced the idea of stochastic gradient descent.

And the good news is that most of the time, training your networks with stochastic gradient descent works just fine; use it and you will get good results. However, often that requires choosing a suitable learning rate, which is my final slide of tips on the next slide. But there's been an enormous amount of work on optimization of neural networks, and people have come up with a whole series of more sophisticated optimizers.

And I'm not gonna get into the details of optimization in this class, but the very loose idea is that these optimizers are adaptive in that they can kind of keep track of how much slope there was, how much gradient there is for different parameters. And therefore based on that, make decisions as to how much to adjust the weights when doing the gradient update rather than adjusting it by a constant amount.

And so in that family of methods, there are methods that include AdaGrad, RMSProp, Adam, and then variants of Adam, including SparseAdam, AdamW, et cetera. The one called Adam is a pretty good place to start, and a lot of the time that's a good one to use. And again, from the perspective of PyTorch, when you're initializing an optimizer, you can just say, please use Adam, and you don't actually need to know much more about it than that.
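
A minimal sketch of doing exactly that in PyTorch; the model, learning rate, and weight decay (the L2 regularization strength from earlier) are just illustrative starting points:

```python
import torch
import torch.nn as nn

# Pick the Adam optimizer for a toy model.
model = nn.Linear(100, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

loss = model(torch.randn(8, 100)).pow(2).mean()   # a stand-in loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```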

If you are using simple stochastic gradient descent, you have to choose a learning rate. So that was the eta value that you multiplied the gradient by for how much to adjust the weights. And so I talked about that slightly: you didn't want it to be too big or your model could diverge or bounce around.

You didn't want it to be too small or else training could take place exceedingly slowly and you'll miss the assignment deadline. How big it should be depends on all sorts of details of the model. And so you sort of wanna try out some different order of magnitude numbers to see what numbers seem to work well for training stably but reasonably quickly.

Something around 10 to the minus three or 10 to the minus four isn't a crazy place to start. In principle, you can do fine just using a constant learning rate in SGD. In practice, people generally find they can get better results by decreasing the learning rate as you train.

So a very common recipe is that you halve the learning rate after every k epochs, where an epoch means that you've made a pass through the entire set of training data. So perhaps something like every three epochs, you halve the learning rate. And a final little note there in purple is, when you make a pass through the data, you don't wanna go through the data items in the same order each time, 'cause that leads to a sort of patterning of the training examples, and the model will fall into the periodicity of those patterns.
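
A minimal PyTorch sketch of that recipe, halving the learning rate every three epochs and reshuffling the data on each pass; the model, data, and epoch count are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

dataset = TensorDataset(torch.randn(1000, 100), torch.randn(1000, 10))
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # reshuffles every epoch

for epoch in range(12):
    for x, y in loader:
        loss = F.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # every 3 epochs, the learning rate is multiplied by 0.5
```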

So it's best to shuffle the data before each pass through it. Okay, there are more sophisticated ways to set learning rates. And I won't really get into those now. Fancier optimizers like Adam also have a learning rate. So you still have to choose a learning rate value, but it's effectively, it's an initial learning rate, which typically the optimizer shrinks as it runs.

And so you commonly want to have the number it starts off with be on the larger side, because it'll be shrinking as it goes. Okay, so that's all by way of introduction. And I'm now ready to start on language models and RNNs. So what is language modeling? I mean, as two words of English, language modeling could mean just about anything, but in the natural language processing literature, language modeling has a very precise technical definition, which you should know.

So language modeling is the task of predicting the word that comes next. So if you have some context, like 'the students opened their', you want to be able to predict what words will come next. Is it their books, their laptops, their exams, their minds? And so in particular, what you want to be doing is being able to give a probability that different words will occur in this context.

So a language model is a probability distribution over next words given a preceding context. And a system that does that is called a language model. So as a result of that, you can also think of a language model as a system that assigns a probability score to a piece of text.

So if we have a piece of text, then we can just work out its probability according to a language model. So the probability of a sequence of tokens, we can decompose via the chain rule, probability of the first times probability of the second given the first, et cetera, et cetera.
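
Written out, that chain rule decomposition is:

P(x_1, \ldots, x_T) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2) \cdots = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})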

And then we can work that out using what our language model provides as a product of each probability of predicting the next word. Okay, language models are really the cornerstone of human language technology. Everything that you do with computers that involves human language, you are using language models. So when you're using your phone and it's suggesting whether well or badly, what the next word that you probably want to type is, that's a language model working to try and predict the likely next words.

When the same thing happens in a Google Doc and it's suggesting a next word or a next few words, that's a language model. The main reason why the one in Google Docs works much better than the one on your phone is that for the keyboard phone models, they have to be very compact so they can run quickly and not much memory.

So they're sort of only mediocre language models where something like Google Docs can do a much better language modeling job. Query completion, same thing, there's a language model. And so then the question is, well, how do we build language models? And so I briefly wanted to first again, give the traditional answer since you should have at least some understanding of how NLP was done without a neural network.

And the traditional answer that powered speech recognition and other applications for at least two decades, three decades really, was what were called N-gram language models. And these were very simple, but still quite effective idea. So we want to give probabilities of next words. So what we're gonna work with is what are referred to as N-grams.

And so an N-gram is just a chunk of N consecutive words, which are usually referred to as unigrams, bigrams, trigrams, and then four-grams and five-grams. A horrible set of names, which would offend any humanist, but that's what people normally say. And so effectively what we do is just collect statistics about how often different N-grams occur in a large amount of text, and then use those to build a probability model.

So the first thing we do is what's referred to as making a Markov assumption. So these are also referred to as Markov models. And we decide that the word in position T plus one only depends on the preceding N minus one words. So if we want to predict T plus one given the entire preceding text, we actually throw away the early words and just use the preceding N minus one words as context.

Well, once we've made that simplification, we can then just use the definition of conditional probability and say, all that conditional probability is the probability of N words divided by the preceding N minus one words. And so we have the probability of an N-gram over the probability of an N minus one gram.

And so then how do we get these N-gram and N minus one gram probabilities? We simply take a larger amount of text in some language and we count how often the different N-grams occur. And so our crude statistical approximation starts off as the count of the N-gram over the count of the N minus one gram.
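
Written out, the estimate being described for an n-gram model is:

P(w_{t+1} \mid w_{t-n+2}, \ldots, w_t) \approx \frac{\mathrm{count}(w_{t-n+2}, \ldots, w_t, w_{t+1})}{\mathrm{count}(w_{t-n+2}, \ldots, w_t)}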

So here's an example of that. Suppose we are learning a four-gram language model. Okay, so we throw away all words apart from the last three words, and they're our conditioning context. We use the counts from some large training corpus, and we see how often did 'students opened their books' occur?

How often did 'students opened their minds' occur? And then for each of those counts, we divide through by the count of how often 'students opened their' occurred. And that gives us our probability estimates. So for example, if in the corpus 'students opened their' occurred a thousand times, and 'students opened their books' occurred 400 times, we'd get a probability estimate of 0.4 for books.

If exams occurred a hundred times, we'd get 0.1 for exams. And we sort of see here already the disadvantage of having made the Markov assumption and having gotten rid of all of this earlier context, which would have been useful for helping us to predict. The one other point that I'll just mention, since I can confuse myself on it, is what the N counts in an N-gram language model.

So for a four-gram language model, it's called a four-gram language model because in its estimation, you're using four grams in the numerator and trigrams in the denominator. So you use the size of the numerator. So that terminology is different to the terminology that's used in Markov models. So when people talk about the order of a Markov model, that refers to the amount of context you're using.

So this would correspond to a third order Markov model. Yeah, so someone said, "Is this similar to a Naive Bayes model?" Sort of. Naive Bayes models, you also estimate the probabilities just by counting. So they're related and there are sort of in some sense, two differences. The first difference or specialization is that Naive Bayes models work out probabilities of words independent of their neighbors.

So in one part, a Naive Bayes language model is a unigram language model. So you're just using the counts of individual words. But the other part of a Naive Bayes model is that you're learning a different set of unigram counts for every class for your classifier. So effectively, a Naive Bayes model gives you class-specific unigram language models.
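
As a toy sketch, the estimation just described amounts to nothing more than counting (the corpus and counts here are hypothetical):

```python
from collections import Counter

# A hypothetical tokenized training corpus.
corpus = "as the proctor started the clock the students opened their books".split()

# Count all 4-grams and 3-grams in the corpus.
four_grams = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))
tri_grams = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 2))

def prob(word, context):
    # context is a tuple of 3 words; note the division blows up for unseen
    # contexts, which is exactly the sparsity problem discussed next.
    return four_grams[context + (word,)] / tri_grams[context]

print(prob("books", ("students", "opened", "their")))  # 1.0 on this tiny corpus;
# with the lecture's hypothetical counts (400 / 1000) it would be 0.4
```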

Okay, I gave this as a simple statistical model for estimating your probabilities with an N-gram model. You can't actually get away with just doing that, because you have sparsity problems. So often it'll be the case that for many words, 'students opened their books' or 'students opened their backpacks' just never occurred in the training data.

If you think about it, if you have something like 10 to the fifth different words, and you want a sequence of four words, with 10 to the fifth possibilities for each, there are something like 10 to the 20th different combinations. So unless you're seeing a truly astronomical amount of data, most four-word sequences you've never seen.

So then your numerator will be zero and your probability estimate will be zero. And so that's bad. And so the commonest way of solving that is just to add a little delta to every count, and then everything is non-zero, and that's called smoothing. But sometimes it's worse than that, 'cause sometimes you won't even have seen 'students opened their', and that's more problematic, 'cause that means our denominator is zero.

And so the division will be ill-defined and we can't usefully calculate any probabilities in a context that we've never seen. And so the standard solution to that is to shorten the context, and that's called backoff. So we condition only on 'opened their', or if we still haven't seen 'opened their', we'll condition only on 'their', or we could just forget all conditioning and actually use a unigram model for our probabilities.

Yeah. And so as you increase the order N of the N-gram language model, these sparsity problems become worse and worse. So in the early days, people normally worked with trigram models. As it became easier to collect billions of words of text, people commonly moved to five-gram models. But every time you go up an order of conditioning, you effectively need to be collecting orders of magnitude more data because of the size of the vocabularies of human languages.

There's also a problem that these models are huge. So you basically have to be storing counts of all of these word sequences so you can work out these probabilities. And that's actually had a big effect in terms of what technology is available. So in the 2000s, up till about 2014, Google Translate was already using probabilistic models that included language models of the N-gram sort.

But the only way they could possibly be run is in the cloud because you needed to have these huge tables of probabilities. But now we have neural nets and you can have Google Translate just actually run on your phone. And that's possible because neural net models can be massively more compact than these old N-gram language models.

Yeah. But nevertheless, before we get onto the neural models, let's just sort of look at the example of how these work. So it's trivial to train an N-gram language model 'cause you really just count how often word sequences occur in a corpus and you're ready to go. So these models can be trained in seconds.

That's really good. That's not like sitting around for training neural networks. So if I train on my laptop a small language model on, you know, about 1.7 million words as a trigram model, I can then ask it to generate text. If I give it a couple of words, 'today the', I can then get it to sort of suggest a word that might come next.

And the way I do that is the language model knows the probability distribution of things that can come next. Now, it's a kind of a crude probability distribution, 'cause effectively over this relatively small corpus, there were things that occurred once, Italian and Emirate. There were things that occurred twice, price.

There were things that occurred four times, company and bank. It's sort of fairly crude and rough, but I nevertheless get probability estimates. I can then say, okay, based on this, let's take this probability distribution and then we'll just sample the next word. So the two most likely words to sample are company or bank, but we're rolling the dice and we might get any of the words that could come next.

So maybe I sample price. Now I'll condition on price, the price, and look up the probability distribution of what comes next. The most likely thing is of. And so again, I'll sample, and maybe this time I'll pick up of. And then I will now condition on price of, and I will look up the probability distribution of words following that.

And I get this probability distribution and I'll sample randomly some word from it. And maybe this time I'll sample a rare, but possible one like gold. And I can keep on going and I'll get out something like this. Today, the price of gold per tonne while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an IMF demand to rebuild depleted European stocks, SEP 30 N primary 76 cents a share.
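
As a toy sketch of the sampling loop just described, using trigram counts from a hypothetical corpus:

```python
import random
from collections import Counter, defaultdict

# A hypothetical tokenized corpus; a real one would be ~1.7 million words.
corpus = "today the price of gold rose while the price of oil fell".split()

# Trigram counts keyed by the two-word context.
nxt = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    nxt[(a, b)][c] += 1

words = ["today", "the"]
for _ in range(10):
    dist = nxt.get(tuple(words[-2:]))
    if not dist:
        break  # unseen context; a real model would back off or smooth
    # Sample the next word in proportion to its count in this context.
    words.append(random.choices(list(dist), weights=list(dist.values()))[0])
print(" ".join(words))
```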

So what just a simple trigram model can produce over not very much text is actually already kind of interesting. Like it's actually surprisingly grammatical, right? There are whole pieces of it, 'while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an IMF demand', right?

It's really actually pretty good grammatical text. So it's sort of amazing that these simple N-gram models actually can model a lot of human language. On the other hand, it's not a very good piece of text. It's completely incoherent and makes no sense. And so to actually be able to generate text that seems like it makes sense, we're going to need a considerably better language model.

And that's precisely what neural language models have allowed us to build, as we'll see later. Okay, so how can we build a neural language model? And so first of all, we're gonna do a simple one and then we'll see where we get, but moving on to recurrent neural nets might still take us into next time.

So we're gonna have an input sequence of words, and we want a probability distribution over the next word. Well, the simplest thing that we could try is to say, well, kind of the only tool we have so far is a window-based classifier. So what we'd done previously, either for our named entity recognizer in lecture three or what I just showed you for the dependency parser, is we have some context window, we put it through a neural net and we predict something as a classifier.

So before we were predicting a location, but maybe instead we could reuse exactly the same technology and say, we're going to have a window-based classifier. So we're discarding the further away words, just like in an N-gram language model, but we'll feed this fixed window into a neural net. So we concatenate the word embeddings, we put it through a hidden layer, and then we have a softmax classifier over our vocabulary.

And so now, rather than predicting something like location or left-arc in the dependency parser, we're going to have a softmax over the entire vocabulary, sort of like we did with the skip-gram negative sampling model in the first two lectures. And so we're going to see this as predicting what word comes next, whether it's laptops, minds, books, et cetera.
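
A minimal PyTorch sketch of that fixed-window language model; the vocabulary size, window size, and dimensions are illustrative placeholders:

```python
import torch
import torch.nn as nn

class WindowLM(nn.Module):
    def __init__(self, vocab_size=10_000, window=4, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)   # softmax over the whole vocabulary

    def forward(self, window_ids):                     # window_ids: (batch, window)
        e = self.emb(window_ids).flatten(1)            # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                             # logits over next words

model = WindowLM()
logits = model(torch.randint(0, 10_000, (8, 4)))       # a batch of 8 four-word contexts
probs = torch.softmax(logits, dim=-1)                  # distribution over the next word
```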

Okay, so this is a fairly simple fixed window neural net classifier, but this is essentially a famous early model in the use of neural nets for NLP applications. So first the 2000 conference paper and then a somewhat later journal paper, Yoshua Bengio and colleagues introduced precisely this model as the neural probabilistic language model.

And they were already able to show that this could give interesting good results for language modeling. And so it wasn't a great solution for neural language modeling, but it still had value. So it didn't solve the problem of allowing us to have bigger context to predict what words are going to come next.

It's in that way limited exactly like an N-gram language model is, but it does have all the advantages of distributed representations. So rather than having these counts for word sequences that are very sparse and very crude, we can use distributed representations of words, which then make predictions that semantically similar words should give similar probability distribution.

So the idea of that is, if we'd used some other word here, like maybe 'the pupils opened their', well, maybe in our training data we'd seen sentences about students, but we'd never seen sentences about pupils. An N-gram language model then would sort of have no idea what probabilities to use, whereas a neural language model can say, well, pupils is kind of similar to students.

Therefore I can predict similarly to what I would have predicted for students. Okay, so there's now no sparsity problem. We don't need to store billions of N-gram counts. We simply need to store our word vectors and our W and U matrices. But we still have the remaining problem that our fixed window is too small.

We can try and make the window larger. If we do that, the W matrix gets bigger, but that also points out another problem with this model. Not only can the window never be large enough, but W is just a trained matrix. And so therefore we're learning completely different weights for each position of context: the word minus one position, the word minus two, the word minus three, and the word minus four. So there's no sharing in the model as to how it treats words in different positions, even though in some sense they will contribute semantic components that are at least somewhat position independent.

So again, if you sort of think back to either a naive Bayes model or what we saw with the word2vec model at the beginning, the word2vec model or a naive Bayes model completely ignores word order. So it has one set of parameters regardless of what position things occur in.

That doesn't work well for language modeling, 'cause word order is really important in language modeling. If the last word is 'the', that's a really good predictor of there being an adjective or noun following, whereas if the word four back is 'the', it doesn't give you the same information. So you do wanna somewhat make use of word order, but this model is at the opposite extreme, where each position is being modeled completely independently.

So what we'd like to have is a neural architecture that can process an arbitrary amount of context and have more sharing of the parameters, while still being sensitive to proximity. And so that's the idea of recurrent neural networks. And I'll say about five minutes' worth about these today, and then next time we'll return and do more about recurrent neural networks.

So for the recurrent neural network, rather than having a single hidden layer inside our classifier here that we compute each time, for the recurrent neural network, we have the hidden layer, which often is referred to as the hidden state, but we maintain it over time and we feed it back into itself.

So that's what the word recurrent is meaning, that you're sort of feeding the hidden layer back into itself. So what we do is based on the first word, we compute a hidden representation kind of like before, which can be used to predict the next word. But then for when we want to predict what comes after the second word, we not only feed in the second word, we feed in the hidden layer from the previous word to have it help predict the hidden layer above the second word.

And so formally the way we're doing that is we're taking the hidden layer above the first word, multiplying it by a matrix W, and then that's going to be going in together with X2 to generate the next hidden step. And so we keep on doing that at each time step, so that we are kind of repeating a pattern of creating a next hidden layer based on the next input word and the previous hidden state by updating it, by multiplying it by a matrix W.

Okay, so on my slide here, I've still only got four words of context because it's nice for my slide, but in principle there could be, any number of words of context now. Okay, so what we're doing is, so that we start off by having input vectors, which can be our word vectors that we've looked up for each word.

So, sorry, yeah, so we can have the one hot vectors for word identity. We look up our word embeddings, so then we've got word embeddings for each word. And then we want to compute hidden states. So we need to start from somewhere, H0 is the initial hidden state and H0 is normally taken as a zero vector.

So this is actually just initialized as zeros. And so for working out the first hidden state, we calculate it based on the first word embedding by multiplying this embedding by a matrix WE. And that gives us the first hidden state. But then, as we go on, we want to apply the same formula over again.

So we have just two parameter matrices in the recurrent neural network, one matrix for multiplying input embeddings and one matrix for updating the hidden state of the network. And so for the second word, from its word embedding, we multiply it by the WE matrix. We take the previous time steps hidden state and multiply it by the WH matrix.

And we use the two of those to generate the new hidden state, and precisely how we generate the new hidden state is shown in this equation on the left. So we take the previous hidden state, multiply it by WH. We take the input embedding, multiply it by WE.

We sum those two. We add on a learn bias weight, and then we put that through a non-linearity. And although on this slide, that non-linearity is written as sigma, by far the most common non-linearity to use here actually is a tanh non-linearity. And so this is the core equation for a simple recurrent neural network.

And for each successive time step, we're just gonna keep on applying that to work out hidden states. And then from those hidden states, we can use them just like in our window classifier to predict what would be the next word. So at any position, we can take this hidden vector, put it through a softmax layer, which is multiplying by U matrix and adding on another bias, and then making a softmax distribution out of that.
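
Written out, with h_0 initialized to zeros, the recurrence and the output distribution just described are:

e_t = E\, x_t, \qquad h_t = \tanh\!\left(W_h\, h_{t-1} + W_e\, e_t + b_1\right), \qquad \hat{y}_t = \mathrm{softmax}\!\left(U\, h_t + b_2\right)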

And that will then give us a probability distribution over next words. And that's it, right? This is the entire math of a simple recurrent neural network. And next time I'll come back and say more about them, but this is the entirety of what you need to know, in some sense, for the forward computation of a simple recurrent neural network.
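
A minimal sketch of that forward computation, written out by hand rather than using PyTorch's built-in RNN module; the sizes are placeholders:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10_000, 100, 200
E = nn.Embedding(vocab_size, emb_dim)               # word embedding lookup
W_e = nn.Linear(emb_dim, hidden_dim)                # W_e e_t + b_1 (bias folded in here)
W_h = nn.Linear(hidden_dim, hidden_dim, bias=False) # W_h h_{t-1}
U = nn.Linear(hidden_dim, vocab_size)               # U h_t + b_2

tokens = torch.randint(0, vocab_size, (5,))         # a 5-word input sequence
h = torch.zeros(hidden_dim)                         # h_0 is initialized to zeros
for e_t in E(tokens):                               # the unavoidable loop over time steps
    h = torch.tanh(W_h(h) + W_e(e_t))               # the core recurrence
    probs = torch.softmax(U(h), dim=-1)             # distribution over the next word
```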

So the advantages we have now are that it can process a text input of any length. In theory at least, it can use information from any number of steps back. We'll talk more about how well that actually works in practice. The model size is fixed. It doesn't matter how much past context there is.

All we have is our WH and WE parameters. And at each time step, we use exactly the same weights to update our hidden state. So there's a symmetry in how different inputs are processed in producing our predictions. RNNs in practice though, or the simple RNNs in practice aren't perfect.

So a disadvantage is that they're actually kind of slow, 'cause with this recurrent computation, in some sense we are stuck with having a for loop on the outside. So we can do vector and matrix multiplies on the inside here, but really we have to do, for time step equals one to n, calculate the successive hidden states.

And so that's not a perfect neural net architecture and we'll discuss alternatives to that later. And although in theory, this model can access information any number of steps back, in practice we find that it's pretty imperfect at doing that and that will then lead to more advanced forms of recurrent neural network that I'll talk about next time that are able to more effectively access past context.

Okay, I think I'll stop there for the day.