Stanford CS224N - NLP w/ DL | Winter 2021 | Lecture 5 - Recurrent Neural Networks (RNNs)
Chapters
0:00 Introduction
2:45 Dependency parsers
5:42 Neural dependency parser
8:14 Distributed representations
11:19 Softmax classifier
14:51 Multiclass classifier
17:15 Dependency parser model
18:57 What's happened since 2014
20:06 Context
22:21 Graph-based parsers
24:35 Regularization
27:59 L2 Regularization
30:18 Dropout
34:30 Backward Pass
36:21 Nonlinearity
43:27 Parameter initialization
45:51 Training
47:39 Learning Rate
50:00 Language Models
52:58 N-gram Language Models
54:16 Markov Model
55:37 Four-gram Language Model
58:43 Sparsity Problems
00:00:00.000 |
So we're now starting in week three with lecture five. 00:00:15.320 |
I guess I really got behind and went a bit slowly. 00:00:30.200 |
I'll in some sense be finishing the content of last time 00:00:38.620 |
to introduce a simple feed forward neural net classifier. 00:00:46.040 |
of just background things that you need to know 00:00:50.360 |
because the fact of the matter is there is a bunch of stuff 00:00:57.380 |
I'll get into what's really meant to be the topic 00:00:59.880 |
of today's lecture, which is looking at language modeling 00:01:06.160 |
And that's then going to lead into those two things 00:01:10.000 |
are important topics that we'll then be talking about 00:01:16.920 |
So there's a couple of reminders before we get underway. 00:01:19.640 |
The first is that you should have handed in assignment two 00:01:28.920 |
And it's an assignment where you're going to build 00:01:40.960 |
is actually to get you up to speed with PyTorch. 00:01:48.040 |
with lots of comments and hints about what to do. 00:01:54.120 |
to the end of it, you'll feel fairly familiar 00:02:04.560 |
you might want to go back and look at the video. 00:02:07.040 |
Another thing to mention about the assignments 00:02:11.000 |
is that assignment three is the last assignment 00:02:14.680 |
where our great team of TAs are happy to look at your code 00:02:25.320 |
But starting on assignment four for assignments four, five 00:02:41.720 |
and reasons why things might not work with them. 00:02:44.120 |
Okay, so if you remember where we were last time, 00:02:53.880 |
and that these were an efficient linear time method 00:03:05.320 |
before neural nets came along and took over NLP again, 00:03:13.160 |
is that like most machine learning models of that time, 00:03:19.380 |
So that means that you are specifying some condition 00:03:24.000 |
and then checking whether it was true of a configuration. 00:03:27.200 |
So something like the word on the top of the stack is good 00:03:32.640 |
or the next word coming up is a personal pronoun, 00:03:36.800 |
that those are conditions that would be features 00:03:39.480 |
in a conventional transition-based dependency parser. 00:03:44.480 |
And so what are the problems with doing that? 00:03:48.540 |
Well, one problem is that those features are very sparse. 00:03:53.540 |
A second problem is the features are incomplete. 00:03:57.440 |
Well, what I mean by that is depending on what words 00:04:02.440 |
and configurations occurred in the training data, 00:04:09.040 |
because you sort of saw a certain word preceding a verb 00:04:18.580 |
But perhaps the biggest problem and opportunity 00:04:22.180 |
for doing better with the neural dependency parser 00:04:25.140 |
is that it turns out that in a symbolic dependency parser, 00:04:31.500 |
just turns out to actually be pretty expensive. 00:04:36.100 |
that I showed last time is fast and efficient to run, 00:04:40.660 |
you actually have to compute all of these features. 00:04:43.980 |
And what you found was that about 95% of the parsing time 00:04:48.380 |
of one of these models was spent just computing 00:04:54.340 |
So that suggests that perhaps we can do better 00:04:58.940 |
with a neural approach where we're going to learn 00:05:10.420 |
the same kind of configuration of a stack and a buffer 00:05:15.020 |
and running exactly the same transition sequence, 00:05:18.700 |
except this time, rather than representing the configuration 00:05:27.100 |
we're instead going to summarize this configuration 00:05:42.740 |
And so quite explicitly, what I'm gonna show you now briefly 00:05:58.060 |
And to skip to the advertisement right at the beginning 00:06:04.860 |
these are the kind of results that you got from it 00:06:08.000 |
using the measures that I introduced at the last time, 00:06:12.860 |
whether you attach dependencies correctly to the right word 00:06:19.340 |
as to whether you also get the type of grammatical relation 00:06:24.220 |
And so essentially, this Chen and Manning parser 00:06:37.520 |
And the interesting thing was that taking advantage 00:06:41.260 |
of a neural classifier in ways that I'm about to explain, 00:06:58.400 |
despite the fact that you might think at first 00:07:02.540 |
and matrix vector multiplies in a neural dependency parser, 00:07:11.580 |
because it didn't have all of the feature computation. 00:07:16.100 |
The other major approach to dependency parsing 00:07:22.220 |
is what's referred to as graph-based dependency parsing. 00:07:25.740 |
And so that's a different approach to dependency parsing. 00:07:29.060 |
And so these are two symbolic graph-based dependency parsers 00:07:38.700 |
than the transition-based parsers as you could see, 00:07:43.420 |
they were close to two orders of magnitude slower. 00:07:46.980 |
And so essentially with the Chen and Manning parser, 00:07:58.300 |
while operating about two orders of magnitude more quickly. 00:08:04.160 |
It was actually a very straightforward implementation, 00:08:18.960 |
which is what we've already talked about extensively 00:08:23.140 |
is to make use of distributed representations. 00:08:26.280 |
So we represent each word as a word embedding, 00:08:29.580 |
and you've had a lot of experience with that already. 00:08:41.260 |
will have seen similar words in the correct configuration. 00:08:50.460 |
The other things that are central to our dependency parser 00:08:53.860 |
are the parts of speech of words and the dependency labels. 00:09:01.040 |
is that although those are much smaller sets, 00:09:04.660 |
so the dependency labels are about 40 in number 00:09:16.620 |
there are ones that are very strongly related. 00:09:19.320 |
So we also adopted distributed representations for them. 00:09:29.100 |
And basically most of the time they behave similarly 00:09:32.180 |
and there are adjectival modifiers and numerical modifiers. 00:09:36.380 |
So these are just numbers like three, four, five. 00:09:38.820 |
And again, a lot of the time they behave the same 00:09:41.700 |
that you have both three cows and brown cows. 00:09:46.140 |
Okay, so everything is going to be represented 00:09:54.660 |
we have exactly the same kind of configuration 00:10:03.780 |
And so the classification decisions of the next transition 00:10:12.940 |
So we're looking at the top thing on the stack, 00:10:19.460 |
And then we actually added in some additional features 00:10:24.140 |
that we've already built arcs for words on the stack, 00:10:30.060 |
on the left and right of those words that are on the stack 00:10:47.280 |
where it's already connected up to something else. 00:10:58.580 |
So we can take these elements of the configuration 00:11:05.780 |
So we have word embeddings, part of speech embeddings, 00:11:11.680 |
kind of like we did before with the window classifier, 00:11:14.880 |
and that will give us a neural representation 00:11:18.420 |
Now, there's a second reason why we can hope to win 00:11:27.680 |
And we haven't really said much about that yet. 00:11:37.320 |
that's close to what we've been talking about 00:11:54.240 |
oh, sorry, Y is an element of a set of C classes 00:12:03.680 |
using the softmax distribution that we've seen before, 00:12:09.100 |
based on having a weight matrix that's C by D, 00:12:28.240 |
a term that you'll see in PyTorch among other places. 00:12:32.400 |
So that is a straightforward machine learning classifier. 00:12:36.120 |
And if you've done 229, you've seen softmax classifiers. 00:12:45.160 |
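To make the setup concrete, here is a minimal PyTorch sketch (not the lecture's code; the sizes are illustrative) of a softmax classifier with a C x d weight matrix trained with cross-entropy loss:

import torch
import torch.nn as nn

C, d = 5, 100                    # number of classes and input dimension (illustrative)
W = nn.Linear(d, C)              # weight matrix of shape C x d (plus a bias)
loss_fn = nn.CrossEntropyLoss()  # log-softmax followed by negative log likelihood

x = torch.randn(32, d)           # a batch of 32 dense input vectors
y = torch.randint(0, C, (32,))   # gold class labels
logits = W(x)                    # shape (32, C)
loss = loss_fn(logits, y)        # average negative log p(y|x) over the batch
loss.backward()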
shares with most traditional machine learning classifiers. 00:12:51.680 |
support vector machines, logistic regression, 00:13:11.240 |
well, there's just no way you can divide the green points 00:13:14.920 |
from the red points by simply drawing a straight line. 00:13:18.440 |
So you're going to have a quite imperfect classifier. 00:13:28.680 |
because they can provide nonlinear classification. 00:13:32.160 |
So rather than only being able to do something like 00:13:40.560 |
and therefore can separate the green and the red points. 00:13:50.600 |
which is a kind of a fun little tool to play around with 00:13:55.480 |
And so there's something subtle going on here 00:13:59.920 |
is because our more powerful neural net classifiers, 00:14:06.520 |
what they have at the top of them is a softmax layer. 00:14:10.720 |
And so this softmax layer is indeed a linear classifier, 00:14:18.380 |
But what they have below that is other layers of neural net. 00:14:26.540 |
is that the classification decisions are linear 00:14:32.600 |
but nonlinear in the original representation space. 00:14:44.880 |
to provide something that at the end of the day 00:15:00.860 |
So these are some dense representations of the input. 00:15:08.160 |
with a matrix multiply followed by nonlinearity. 00:15:13.160 |
So that matrix multiply can transform the space 00:15:25.840 |
from which we make our classification decisions. 00:15:35.320 |
we then get some log loss or cross entropy error, 00:15:38.520 |
which we back propagate towards the parameters 00:15:44.280 |
And as the learning goes on via back propagation, 00:16:04.680 |
with what at the end of the day is the linear softmax. 00:16:15.960 |
And if we had something like a visual signal, 00:16:38.880 |
for what words or parts of speech were involved. 00:16:44.040 |
which you can think of as one more matrix multiply 00:16:47.180 |
to convert the one hot features into our dense input layer. 00:16:58.400 |
in the hidden layer, which is a rectified linear unit. 00:17:06.200 |
It looks like the picture in the bottom right, 00:17:11.120 |
That's one of the extra neural net things to talk about. 00:17:14.460 |
Okay, so our neural net dependency parser model architecture 00:17:39.360 |
by looking up the various elements as I discussed previously, 00:17:43.560 |
and then we feed it through this hidden layer 00:17:52.280 |
out of which we can choose what the next action is. 00:18:13.600 |
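A minimal sketch of that architecture (assumed sizes and feature counts in the spirit of Chen and Manning 2014, not their exact code): look up word, part-of-speech, and dependency-label embeddings for the chosen stack/buffer elements, concatenate them, apply one ReLU hidden layer, and score the possible transitions with a softmax.

import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    def __init__(self, n_words, n_pos, n_labels,
                 n_word_feats=18, n_pos_feats=18, n_label_feats=12,
                 emb_dim=50, hidden_dim=200, n_transitions=80):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim)
        self.pos_emb = nn.Embedding(n_pos, emb_dim)
        self.label_emb = nn.Embedding(n_labels, emb_dim)
        in_dim = (n_word_feats + n_pos_feats + n_label_feats) * emb_dim
        self.hidden = nn.Linear(in_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_transitions)   # shift / left-arc / right-arc (with labels)

    def forward(self, word_ids, pos_ids, label_ids):
        # each *_ids tensor has shape (batch, n_*_feats) and indexes the
        # selected elements of the current stack/buffer configuration
        x = torch.cat([self.word_emb(word_ids),
                       self.pos_emb(pos_ids),
                       self.label_emb(label_ids)], dim=1).flatten(start_dim=1)
        h = torch.relu(self.hidden(x))                    # ReLU hidden layer
        return self.out(h)                                # feed to softmax/cross-entropy to pick the next transition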
could provide a very accurate dependency parser 00:18:30.600 |
that it was a quite simple architecture in 2014, 00:18:34.760 |
this was the first successful neural dependency parser. 00:18:42.720 |
but also partly the non-linearity of the classifier 00:18:47.540 |
that it could both outperform symbolic parsers 00:18:53.120 |
and it could outperform them in terms of speed. 00:19:08.880 |
by the success of this neural dependency parser, 00:19:11.860 |
and a number of people, particularly at Google, 00:19:23.320 |
There's no reason to only have one hidden layer. 00:19:27.600 |
You can do beam search that I briefly mentioned last time. 00:19:31.560 |
Another thing that I'm not gonna talk about now 00:19:33.600 |
is adding conditional random field style inference 00:19:42.520 |
for a model that they called Parsey McParseface, 00:19:59.440 |
of transition-based parser with a neural net classifier 00:20:22.100 |
and considering a word as a dependent of root, 00:20:27.440 |
as to how likely it is that big is a dependent of root 00:20:52.200 |
And so what you want to do is understand the context. 00:20:56.480 |
So you want to have an understanding of the context of big, 00:21:04.520 |
into the dependency representations of the sentence. 00:21:13.640 |
like the MST parser I showed on the earlier results slide, 00:21:26.760 |
And how we do that is actually what I'll be talking about 00:21:37.220 |
how to come up with a better graph-based dependency parser 00:21:49.100 |
So we can say probably big is a dependent of cat 00:22:25.420 |
another student, Tim Dozat, and I then worked on saying, 00:22:38.540 |
for neural scoring of dependency parsers 00:22:44.500 |
which I'm not gonna get into the details of right now, 00:23:03.460 |
But I should point out that this is a mixed win 00:23:10.420 |
these graph-based parsers are just n squared 00:23:21.300 |
when you're wanting to parse large amounts of text 00:23:31.940 |
about dependency parsers and to do assignment three. 00:23:42.360 |
just mention a few more things about neural networks. 00:23:53.620 |
you have to be aware of for building neural networks. 00:24:01.740 |
And if you follow the recipe, your parser should work well. 00:24:06.280 |
But what you should minimally do is actually look carefully 00:24:17.780 |
how do we initialize our matrices of our neural network? 00:24:31.160 |
And so I wanted to say just a few words about that. 00:24:34.300 |
Okay, so the first thing that we haven't discussed at all 00:24:45.260 |
we're now building models with a huge number of parameters. 00:24:50.260 |
So essentially just about all neural net models 00:25:08.260 |
well, this part here is the part that we've seen before 00:25:16.540 |
and then taking a negative log likelihood loss, 00:25:18.980 |
which we're then averaging over the different examples. 00:25:27.700 |
And so this regularization term sums the square 00:25:46.600 |
So to the extent that the parameters don't help much, 00:25:49.720 |
you're just being penalized here by making them non-zero. 00:25:54.220 |
But to the extent that the parameters do help, 00:25:59.500 |
and therefore it's okay for them to be non-zero. 00:26:08.300 |
It's not being assessed separately for each example. 00:26:21.500 |
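Written out, the regularized objective being described has this standard form, where f = Wx_i are the class scores for example i, N is the number of training examples, and lambda is the regularization strength:

J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log\frac{e^{f_{y_i}}}{\sum_{c=1}^{C} e^{f_c}} \;+\; \lambda \sum_{k} \theta_k^2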
So the classic problem is referred to as overfitting. 00:26:27.140 |
is that if you have a particular training dataset 00:26:45.200 |
and it will keep on reducing your error rate. 00:26:49.460 |
But if you then look at your partially trained classifier 00:27:00.140 |
different test data that you weren't training the model on, 00:27:04.160 |
what you will find is up until a certain point, 00:27:31.800 |
that you're fiddling the parameters of the model 00:27:39.960 |
that can then predict on independent examples 00:28:01.880 |
for the kind of modern big neural networks that we build 00:28:06.560 |
is that overfitting on the training data isn't a problem, 00:28:17.000 |
to make sure that your models generalize well 00:28:32.240 |
You'd like to have it at worst case flat line 00:28:40.160 |
It'll always be higher than the training error, 00:28:43.080 |
but it's not actually showing a failure to generalize. 00:28:52.800 |
our big neural nets always overfit on the training data. 00:29:06.700 |
that you can continue to train them on the training data 00:29:10.020 |
until the error on the training data is zero. 00:29:14.740 |
because they can just memorize enough stuff about it 00:29:19.260 |
But in general, providing the models are regularized well, 00:29:47.440 |
And if you make it small, you're getting less. 00:29:56.300 |
or else you have the problem that you don't generalize well. 00:30:06.920 |
But our big neural nets are sufficiently complex 00:30:11.680 |
that essentially L2 regularization doesn't cut it. 00:30:22.040 |
for building neural nets is a technique called dropout. 00:30:35.180 |
that you do when training to avoid feature co-adaptation. 00:30:47.820 |
that for each instance or for each batch in your training, 00:31:03.400 |
And so that you can do by sort of zeroing out elements 00:31:17.200 |
you keep them all, but actually you halve all the model weights 00:31:20.940 |
because you're now keeping twice as many things 00:31:30.600 |
prevents what's called feature co-adaptation. 00:31:33.840 |
So you can't have features that are only useful 00:31:45.800 |
which features are gonna be present for different examples 00:31:48.760 |
because different features are being randomly dropped 00:32:05.440 |
all the weights are set in the context of all the others. 00:32:13.680 |
It's also related to ensemble models like model bagging 00:32:26.120 |
there's actually another way of thinking about dropout 00:32:29.800 |
which was actually developed here at Stanford. 00:32:34.840 |
which is to argue that really what dropout gives you 00:32:39.120 |
is a strong regularizer that isn't a uniform regularizer 00:32:43.560 |
like L2 that regularizes everything with an L2 loss, 00:32:47.040 |
but can learn a feature dependent regularization. 00:32:50.160 |
And so that dropout has just emerged as in general, 00:32:53.480 |
the best way to do regularization for neural nets. 00:32:57.920 |
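A minimal sketch of (inverted) dropout as it is usually implemented, assuming a drop probability p: during training each activation is zeroed with probability p and the survivors are rescaled, while at test time nothing is dropped. For p = 0.5 this is equivalent to the halve-the-weights-at-test-time description above.

import torch

def dropout(h, p=0.5, training=True):
    """Randomly zero elements of h with probability p during training."""
    if not training or p == 0.0:
        return h
    mask = (torch.rand_like(h) > p).float()  # 1 = keep, 0 = drop, resampled on every call
    return h * mask / (1.0 - p)              # rescale so expected activations are unchanged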
I think you've already seen and heard this one, 00:33:04.280 |
If you want to have your neural networks go fast, 00:33:09.080 |
it's really essential that you make use of vectors, 00:33:12.640 |
matrices, tensors, and you don't do things with for loops. 00:33:16.840 |
So here's a teeny example where I'm using timeit, 00:33:36.980 |
I can either do the dot product in a for loop 00:34:06.080 |
So you should always be looking to use vectors 00:34:16.360 |
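A small example in the spirit of the timing comparison described, with made-up sizes: the same set of dot products done in a Python for loop versus as a single matrix-vector multiply.

from timeit import default_timer as timer
import numpy as np

W = np.random.rand(500, 300)          # e.g. 500 word vectors of dimension 300
b = np.random.rand(300)

start = timer()
for _ in range(1000):
    out = np.zeros(500)
    for i in range(500):
        out[i] = W[i].dot(b)          # one dot product at a time, in a for loop
print("for loop:   %.3f s" % (timer() - start))

start = timer()
for _ in range(1000):
    out = W.dot(b)                    # all 500 dot products as one matrix-vector multiply
print("vectorized: %.3f s" % (timer() - start))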
Heading forward, we're going to be using GPUs 00:34:19.880 |
and they only further exaggerate the advantages 00:34:25.220 |
where you'll commonly get two orders of magnitude speed up 00:34:59.680 |
But then since you, for each successive batch, 00:35:06.940 |
you're then training all of the weights of the model. 00:35:32.200 |
every feature was just sort of being penalized 00:35:44.120 |
where the end result of this dropout style training 00:35:54.300 |
and some other features being regularized less strongly. 00:36:05.240 |
So you're regularizing more features that are being used less 00:36:09.800 |
but I'm not gonna get into the details 00:36:21.520 |
So the final bit is I just wanted to give a little bit 00:36:24.840 |
of perspective on nonlinearities in our neural nets. 00:36:35.760 |
So if you're building a multi-layer neural net 00:36:47.720 |
well, I guess they're different hidden layers. 00:36:53.240 |
They should be hidden one, hidden two, hidden three. 00:36:59.040 |
That multiple linear transformations composed 00:37:07.660 |
So you don't get any power as a data representation 00:37:19.740 |
'cause you actually do get some interesting learning effects 00:37:25.080 |
But standardly, we have to have some kind of nonlinearity 00:37:30.080 |
to do something interesting in a deep neural network. 00:37:39.340 |
The most classic nonlinearity is the logistic, 00:37:48.620 |
which we've seen before in previous lectures. 00:37:57.680 |
And that was sort of basically what people used 00:38:08.860 |
is that it's moving everything into the positive space 00:38:14.580 |
because the output is always between zero and one. 00:38:17.480 |
So people then decided that for many purposes, 00:38:21.260 |
it was useful to have this variant sigmoid shape 00:38:27.060 |
which is then being shown in the second picture. 00:38:34.740 |
they sound like they're very different things 00:38:37.040 |
but actually, as you maybe remember from a math class, 00:38:46.140 |
which possibly we might make you do on an assignment, 00:38:52.020 |
is just a rescaled and shifted version of the logistic. 00:38:55.860 |
So it's really exactly the same curve, just squeezed a bit. 00:38:59.400 |
So it goes now symmetrically between minus one and one. 00:39:02.880 |
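Concretely, the relationship just mentioned between tanh and the logistic is:

\tanh(z) = 2\,\mathrm{logistic}(2z) - 1, \qquad \mathrm{logistic}(z) = \frac{1}{1 + e^{-z}}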
Well, these kinds of transcendental functions 00:39:11.000 |
they're kind of slow and expensive to compute, right? 00:39:18.260 |
So something people became interested in was, 00:39:21.540 |
well, could we do things with much simpler non-linearity? 00:39:31.260 |
up to some point it just flat lines at minus one, 00:39:54.300 |
you're sort of not getting any discrimination 00:40:02.880 |
I mean, I was surprised when people started doing this, 00:40:06.380 |
these kinds of models proved to be very successful. 00:40:11.660 |
And so that then led into what's proven to be kind of 00:40:15.660 |
the most successful and generally widely used non-linearity 00:40:28.220 |
is what's called the rectified linear unit or ReLU. 00:40:31.860 |
So a ReLU is kind of the simplest kind of non-linearity 00:40:36.500 |
So if the value of X is negative, its value is zero. 00:40:55.460 |
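Written as a formula, the rectified linear unit is:

\mathrm{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}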
And at first sight, this might seem really, really weird 00:41:00.640 |
and how could this be useful as a non-linearity? 00:41:08.780 |
with piecewise linear functions very accurately, 00:41:12.000 |
you might kind of start to see how you could use this 00:41:27.500 |
So logistic and tanh are still used in various places. 00:41:31.860 |
You use logistic when you want a probability output. 00:41:40.060 |
But they're no longer the default when making deep networks 00:41:44.780 |
the first thing you should think about trying 00:42:00.740 |
because you get this sort of very straightforward 00:42:08.260 |
you then just gain this sort of constant gradient backflow 00:42:22.300 |
is still enough to have a very good neural network, 00:42:27.940 |
People have played around with variants of that. 00:42:30.820 |
So people have then played around with leaky ReLUs 00:42:39.580 |
it goes slightly negative on a much shallower slope. 00:42:49.980 |
where you learn the slope of the negative part. 00:43:01.300 |
but it sort of curves down just a little bit there 00:43:06.180 |
I mean, I think it's fair to say that, you know, 00:43:09.060 |
none of these have really proven themselves vastly superior. 00:43:12.580 |
There are papers saying I can get better results 00:43:20.020 |
and a vast majority of work that you see around 00:43:41.700 |
of your neural nets with small random values. 00:43:51.660 |
'cause effectively then everything is symmetric, 00:44:02.100 |
you just don't have an ability for a neural net to learn. 00:44:20.980 |
and just filling in all the parameters with that. 00:44:49.700 |
They don't get too big and they don't get too small. 00:45:23.940 |
based on this fan in and fan out of the layers, 00:45:30.940 |
say initialize with this initialization and it will. 00:45:42.220 |
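A minimal sketch of that kind of initialization in PyTorch (Xavier/Glorot uniform, scaled by the fan-in and fan-out of the layer; the layer sizes are illustrative):

import math
import torch
import torch.nn as nn

layer = nn.Linear(200, 100)

# using the built-in initializer
nn.init.xavier_uniform_(layer.weight)

# or by hand: uniform on [-a, a] with a = sqrt(6 / (fan_in + fan_out))
fan_in, fan_out = layer.in_features, layer.out_features
a = math.sqrt(6.0 / (fan_in + fan_out))
with torch.no_grad():
    layer.weight.uniform_(-a, a)
    layer.bias.zero_()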
we'll start talking about layer normalization. 00:46:07.220 |
with stochastic gradient descent works just fine, 00:46:21.340 |
which is my final slide of tips on the next slide. 00:46:42.220 |
but the very loose idea is that these optimizers 00:46:46.340 |
are adaptive in that they can kind of keep track 00:46:52.500 |
how much gradient there is for different parameters. 00:46:58.020 |
make decisions as to how much to adjust the weights 00:47:03.860 |
rather than adjusting it by a constant amount. 00:47:09.260 |
there are methods that include AdaGrad, RMSProp, Adam, 00:47:20.460 |
The one called Adam is a pretty good place to start. 00:47:24.660 |
And a lot of the time that's a good one to use. 00:47:38.660 |
If you are using simple stochastic gradient descent, 00:47:47.100 |
So that was the eta value that you multiplied 00:47:50.980 |
the gradient by for how much to adjust the weights. 00:47:59.580 |
or your model could diverge or bounce around. 00:48:04.940 |
or else training could take place exceedingly slowly 00:48:11.780 |
How big it should be depends on all sorts of details 00:48:30.740 |
or 10 to the minus four isn't a crazy place to start. 00:48:42.220 |
they can get better results by decreasing learning rates 00:48:47.220 |
So a very common recipe is that you halve the learning rate 00:49:13.980 |
'cause that leads you just kind of have a sort of patterning 00:49:21.940 |
will sort of fall into that periodicity of those patterns. 00:49:38.420 |
Fancier optimizers like Adam also have a learning rate. 00:49:42.100 |
So you still have to choose a learning rate value, 00:49:45.180 |
but it's effectively, it's an initial learning rate, 00:49:47.900 |
which typically the optimizer shrinks as it runs. 00:50:04.060 |
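A minimal sketch of the optimizer and learning-rate advice here (the stand-in model, the initial rate of 1e-3, and the choice to halve every 3 epochs are all placeholder choices):

import torch
import torch.nn as nn

model = nn.Linear(100, 5)                                  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam with an initial learning rate
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

for epoch in range(12):
    # ... loop over shuffled training batches: compute loss, loss.backward(),
    #     optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()                                       # halves the learning rate every 3 epochs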
And I'm now ready to start on language models and RNNs. 00:50:12.020 |
language modeling could mean just about anything, 00:50:15.220 |
but in the natural language processing literature, 00:50:18.860 |
language modeling has a very precise technical definition, 00:50:30.860 |
So if you have some context, like 'the students opened their', 00:50:37.180 |
you want to be able to predict what words will come next. 00:50:40.740 |
Is it their books, their laptops, their exams, their minds? 00:50:45.740 |
And so in particular, what you want to be doing 00:50:54.100 |
that different words will occur in this context. 00:50:58.460 |
So a language model is a probability distribution 00:51:07.460 |
And a system that does that is called a language model. 00:51:36.860 |
probability of the first times probability of the second 00:51:50.980 |
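Written out, the decomposition being described is the chain rule: the probability of a whole text x^{(1)}, ..., x^{(T)} is the product of the language model's next-word probabilities:

P(x^{(1)}, \ldots, x^{(T)}) = P(x^{(1)})\, P(x^{(2)} \mid x^{(1)}) \cdots = \prod_{t=1}^{T} P\big(x^{(t)} \mid x^{(1)}, \ldots, x^{(t-1)}\big)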
Okay, language models are really the cornerstone 00:52:14.740 |
what the next word that you probably want to type is, 00:52:26.140 |
and it's suggesting a next word or a next few words, 00:52:46.060 |
So they're sort of only mediocre language models 00:52:53.460 |
Query completion, same thing, there's a language model. 00:53:11.300 |
since you should have at least some understanding 00:53:13.660 |
of how NLP was done without a neural network. 00:53:17.740 |
And the traditional answer that powered speech recognition 00:53:22.340 |
and other applications for at least two decades, 00:53:30.420 |
And this was a very simple, but still quite effective, idea. 00:53:34.940 |
So we want to give probabilities of next words. 00:53:44.780 |
And so an N-gram is just a chunk of N consecutive words, 00:53:49.580 |
which are usually referred to as unigrams, bigrams, 00:53:52.340 |
trigrams, and then four-grams and five-grams. 00:53:55.980 |
A horrible set of names, which would offend any humanist, 00:54:02.740 |
And so effectively what we do is just collect statistics 00:54:13.180 |
and then use those to build a probability model. 00:54:18.940 |
is what's referred to as making a Markov assumption. 00:54:21.900 |
So these are also referred to as Markov models. 00:54:24.820 |
And we decide that the word in position T plus one 00:54:29.380 |
only depends on the preceding N minus one words. 00:54:45.100 |
and just use the preceding N minus one words as context. 00:55:18.140 |
We simply take a larger amount of text in some language 00:55:22.460 |
and we count how often the different N-grams occur. 00:55:38.980 |
Suppose we are learning a four-gram language model. 00:55:54.020 |
we use the counts from some large training corpus 00:55:57.700 |
and we see how often did 'students opened their books' occur? 00:56:01.820 |
How often did 'students opened their minds' occur? 00:56:15.980 |
So for example, if in the corpus 'students opened their' occurred 1,000 times, 00:56:22.060 |
and 'students opened their books' occurred 400 times, 00:56:25.660 |
we'd get a probability estimate of 0.4 for books. 00:56:33.540 |
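Written out for this four-gram example, the estimate is just a ratio of counts:

P(w \mid \text{students opened their}) \approx \frac{\mathrm{count}(\text{students opened their } w)}{\mathrm{count}(\text{students opened their})}, \qquad \text{e.g. } P(\text{books} \mid \text{students opened their}) \approx \frac{400}{1000} = 0.4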
And we sort of see here already the disadvantage 00:56:39.580 |
and have gotten rid of all of this earlier context, 00:56:43.620 |
which would have been useful for helping us to predict. 00:57:14.940 |
So that terminology is different to the terminology 00:57:21.020 |
So when people talk about the order of a Markov model, 00:57:24.700 |
that refers to the amount of context you're using. 00:57:28.220 |
So this would correspond to a third order Markov model. 00:57:43.380 |
you also estimate the probabilities just by counting. 00:57:49.980 |
and there are sort of in some sense, two differences. 00:58:07.620 |
So in one part that a Naive Bayes language model 00:58:14.260 |
So you're just using the counts of individual words. 00:58:20.340 |
is you're learning a different set of unigram counts 00:58:35.380 |
is you've got class specific unigram language models. 00:58:40.380 |
Okay, I gave this as a simple statistical model 00:58:48.140 |
for estimating your probabilities with an N-gram model. 00:58:52.140 |
You can't actually get away with just doing that 00:58:57.380 |
So, often it'll be the case that for many words, 00:59:15.380 |
and you want to have then a sequence of four words 00:59:18.380 |
for our problem and there are 10 to the fifth choices for each, 00:59:20.820 |
there's sort of 10 to the 20th different combinations. 00:59:43.100 |
and then everything is non-zero and that's called smoothing. 00:59:51.820 |
'students opened their', and that's more problematic 01:00:01.740 |
and we can't usefully calculate any probabilities 01:00:09.580 |
is to shorten the context and that's called back off. 01:00:25.220 |
and actually use a unigram model for our probabilities. 01:00:38.740 |
these sparsity problems become worse and worse. 01:00:46.300 |
As it became easier to collect billions of words of text, 01:00:53.420 |
But every time you go up an order of conditioning, 01:01:03.340 |
because of the size of the vocabularies of human languages. 01:01:07.060 |
There's also a problem that these models are huge. 01:01:27.620 |
So in the 2000s decade up till about whenever it was, 01:01:32.620 |
2014, that there was already Google Translate 01:01:59.780 |
And that's possible because neural net models 01:02:13.420 |
But nevertheless, before we get onto the neural models, 01:02:16.860 |
let's just sort of look at the example of how these work. 01:02:21.860 |
So it's trivial to train an N-gram language model 01:02:29.900 |
word sequences occur in a corpus and you're ready to go. 01:02:36.540 |
That's not like sitting around for training neural networks. 01:02:45.500 |
you know, about 1.7 million words as a trigram model, 01:03:08.820 |
Now there's a kind of a crude probability distribution. 01:03:16.740 |
there were things that occurred once, Italian and Emirate. 01:03:22.540 |
There were things that occurred four times, company and bank. 01:03:28.580 |
but I nevertheless get probability estimates. 01:03:44.380 |
a company or bank, but we're rolling the dice 01:03:48.140 |
and we might get any of the words that had come next. 01:04:12.980 |
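A minimal sketch (not the lecture's code) of exactly this procedure: count trigrams in a toy corpus, turn the counts into conditional probabilities, and generate text by repeatedly sampling the next word given the previous two. The toy corpus here is just a placeholder.

import random
from collections import Counter, defaultdict

corpus = "today the price of gold per ton , while production of shoe lasts".split()

counts = defaultdict(Counter)                    # counts[(w1, w2)][w3] = count of trigram w1 w2 w3
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1

def sample_next(w1, w2):
    dist = counts[(w1, w2)]
    if not dist:                                 # unseen context: a real model would smooth or back off
        return None
    words, weights = zip(*dist.items())          # P(w3 | w1 w2) is proportional to count(w1 w2 w3)
    return random.choices(words, weights=weights)[0]

text = ["today", "the"]
for _ in range(10):
    nxt = sample_next(text[-2], text[-1])
    if nxt is None:
        break
    text.append(nxt)
print(" ".join(text))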
and I will look up the probability distribution 01:04:36.580 |
while production of shoe lasts and shoe industry, 01:04:50.220 |
So what just a simple trigram model can produce 01:04:59.180 |
Like it's actually surprisingly grammatical, right? 01:05:03.740 |
while production of shoe lasts and shoe industry, 01:05:11.540 |
It's really actually pretty good grammatical text. 01:05:14.420 |
So it's sort of amazing that these simple N-gram models 01:05:21.940 |
On the other hand, it's not a very good piece of text. 01:05:25.060 |
It's completely incoherent and makes no sense. 01:05:35.500 |
we're going to need a considerably better language model. 01:05:38.900 |
And that's precisely what neural language models 01:05:45.060 |
Okay, so how can we build a neural language model? 01:05:50.380 |
And so first of all, we're gonna do a simple one 01:06:05.540 |
and we want a probability distribution over the next word. 01:06:10.140 |
Well, the simplest thing that we could try is to say, 01:06:19.740 |
So what we can say, what we'd done previously 01:06:24.140 |
either for our named entity recognized in lecture three 01:06:28.660 |
or what I just showed you for the dependency parser 01:06:48.460 |
we're going to have a window-based classifier. 01:06:57.820 |
but we'll feed this fixed window into a neural net. 01:07:07.660 |
and then we have a softmax classifier over our vocabulary. 01:07:12.660 |
And so now rather than predicting something like location 01:07:21.900 |
we're going to have a softmax over the entire vocabulary 01:07:25.780 |
sort of like we did with the skip-gram negative sampling 01:07:37.820 |
whether it produces laptops, minds, books, et cetera. 01:07:55.540 |
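A minimal sketch of the fixed-window neural language model being described (in the spirit of Bengio et al.; the sizes are illustrative): concatenate the embeddings of the last four words, apply a hidden layer, and softmax over the whole vocabulary.

import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, window=4, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)       # a score for every word in the vocabulary

    def forward(self, window_ids):                         # window_ids: (batch, window) of word indices
        e = self.emb(window_ids).flatten(start_dim=1)      # concatenated window embeddings
        h = torch.relu(self.hidden(e))
        return self.out(h)                                 # softmax / cross-entropy over the next word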
in the use of neural nets for NLP applications. 01:08:05.940 |
Yoshua Bengio and colleagues introduced precisely 01:08:09.740 |
this model as the neural probabilistic language model. 01:08:17.140 |
that this could give interesting good results 01:08:25.300 |
for neural language modeling, but it still had value. 01:08:29.220 |
So it didn't solve the problem of allowing us 01:08:47.900 |
So rather than having these counts for word sequences 01:08:56.620 |
we can use distributed representations of words, 01:09:05.260 |
should give similar probability distribution. 01:09:09.020 |
So the idea of that is if we'd use some other word here, 01:09:24.580 |
An N-gram language model then would sort of have no idea 01:09:45.460 |
We don't need to store billions of N-gram counts. 01:10:11.380 |
but that also points out another problem with this model. 01:10:15.580 |
Not only can the window never be large enough, 01:10:23.180 |
And so therefore we're learning completely different weights 01:10:29.540 |
the word minus one position, the word minus two, 01:10:32.340 |
the word minus three, and the word minus four, 01:10:37.940 |
as to how it treats words in different positions, 01:10:48.260 |
that are at least somewhat position independent. 01:10:54.140 |
if you sort of think back to either a naive Bayes model 01:10:57.300 |
or what we saw with the word2vec model at the beginning, 01:11:13.820 |
'cause word order is really important in language modeling. 01:11:16.900 |
If the last word is the, that's a really good predictor 01:11:20.260 |
of there being an adjective or noun following 01:11:28.180 |
So you do wanna somewhat make use of word order, 01:11:41.660 |
So what we'd like to have is a neural architecture 01:11:46.100 |
that can process an arbitrary amount of context 01:11:58.500 |
And so that's the idea of recurrent neural networks. 01:12:02.340 |
And I'll spend about five minutes on these today, 01:12:19.340 |
inside our classifier here that we compute each time, 01:12:29.100 |
which often is referred to as the hidden state, 01:12:38.180 |
So that's what the word recurrent is meaning, 01:12:50.300 |
we compute a hidden representation kind of like before, 01:13:07.500 |
we feed in the hidden layer from the previous word 01:13:20.700 |
is we're taking the hidden layer above the first word, 01:13:28.500 |
and then that's going to be going in together with X2 01:13:35.660 |
And so we keep on doing that at each time step, 01:13:48.580 |
and the previous hidden state by updating it, 01:14:08.100 |
so that we start off by having input vectors, 01:14:21.580 |
so we can have the one hot vectors for word identity. 01:14:26.780 |
so then we've got word embeddings for each word. 01:14:42.580 |
So this is actually just initialized as zeros. 01:14:45.740 |
And so for working out the first hidden state, 01:14:49.420 |
we calculate it based on the first word embedding 01:14:54.420 |
by multiplying this embedding by a matrix WE. 01:15:05.340 |
we want to apply the same formula over again. 01:15:20.460 |
and one matrix for updating the hidden state of the network. 01:15:24.860 |
And so for the second word, from its word embedding, 01:15:41.300 |
And we use the two of those to generate the new hidden state 01:15:45.820 |
and precisely how we generate the new hidden state 01:15:48.820 |
is then shown in this equation on the left. 01:15:52.500 |
So we take the previous hidden state, multiply it by WH. 01:15:56.620 |
We take the input embedding, multiply it by WE. 01:16:06.860 |
and then we put that through a non-linearity. 01:16:15.260 |
by far the most common non-linearity to use here 01:16:38.660 |
we can use them just like in our window classifier 01:16:45.500 |
So at any position, we can take this hidden vector, 01:16:56.260 |
and then making a softmax distribution out of that. 01:16:59.020 |
And that will then give us a probability distribution 01:17:05.900 |
This is the entire math of a simple recurrent neural network. 01:17:10.900 |
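A minimal sketch of that forward computation (illustrative sizes, not the lecture's code): the same W_h and W_e matrices are reused at every time step, h_0 is a vector of zeros, tanh is used as the nonlinearity (the usual choice here), and each hidden state is pushed through an output layer and a softmax to get the next-word distribution.

import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.W_e = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.W_h = nn.Linear(hidden_dim, hidden_dim)         # its bias plays the role of b_1
        self.U = nn.Linear(hidden_dim, vocab_size)           # output layer (its bias is b_2)
        self.hidden_dim = hidden_dim

    def forward(self, word_ids):                             # word_ids: (batch, seq_len)
        batch, seq_len = word_ids.shape
        h = torch.zeros(batch, self.hidden_dim)              # h_0 initialized to zeros
        e = self.emb(word_ids)                               # (batch, seq_len, emb_dim)
        logits = []
        for t in range(seq_len):                             # the for loop over time steps
            h = torch.tanh(self.W_h(h) + self.W_e(e[:, t]))  # h_t = tanh(W_h h_{t-1} + W_e e_t + b_1)
            logits.append(self.U(h))                         # unnormalized next-word scores at step t
        return torch.stack(logits, dim=1)                    # (batch, seq_len, vocab_size); softmax gives P(next word)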
And next time I'll come back and say more about them, 01:17:16.380 |
but this is the entirety of what you need to know 01:17:20.540 |
in some sense for the computation of the forward model 01:17:28.380 |
is it can process a text input of any length. 01:17:45.700 |
It doesn't matter how much of a past context there is. 01:17:56.580 |
we use exactly the same weights to update our hidden state. 01:18:00.380 |
So there's a symmetry in how different inputs are processed 01:18:10.180 |
or the simple RNNs in practice aren't perfect. 01:18:13.860 |
So a disadvantage is that they're actually kind of slow 01:18:23.300 |
with having to have on the outside a for loop. 01:18:26.100 |
So we can do vector matrix multiplies on the inside here, 01:18:30.500 |
but really we have to do for time step equals one to N, 01:18:39.340 |
And so that's not a perfect neural net architecture 01:18:42.700 |
and we'll discuss alternatives to that later. 01:18:49.220 |
this model can access information any number of steps back, 01:18:53.140 |
in practice we find that it's pretty imperfect at doing that 01:18:57.780 |
and that will then lead to more advanced forms 01:19:01.220 |
of recurrent neural network that I'll talk about next time 01:19:05.060 |
that are able to more effectively access past context.