back to index

Stanford CS224N - NLP w/ DL | Winter 2021 | Lecture 5 - Recurrent Neural networks (RNNs)


Chapters

0:00 Introduction
2:45 Dependency parsers
5:42 Neural dependency parser
8:14 Distributed representations
11:19 Softmax classifier
14:51 Multiclass classifier
17:15 Dependency parser model
18:57 What's happened since 2014
20:06 Context
22:21 Graph-based parsers
24:35 Regularization
27:59 L2 Regularization
30:18 Dropout
34:30 Backward Pass
36:21 Nonlinearity
43:27 Parameter initialization
45:51 Training
47:39 Learning Rate
50:00 Language Models
52:58 N-gram Language Models
54:16 Markov Model
55:37 Four-gram Language Model
58:43 Sparsity Problems

Whisper Transcript | Transcript Only Page

00:00:00.000 | So we're now starting in week three with lecture five.
00:00:05.000 | So unfortunately, in the last class,
00:00:15.320 | I guess I really got behind and went a bit slowly.
00:00:18.400 | I guess I must just enjoy talking
00:00:20.640 | about natural languages too much.
00:00:22.640 | And so I never really got to the punchline
00:00:24.940 | of showing how you could do good things
00:00:26.720 | with the neural dependency parser.
00:00:28.600 | So today for the first piece,
00:00:30.200 | I'll in some sense be finishing the content of last time
00:00:33.640 | and talk about neural dependency parsing,
00:00:36.080 | which also gives us the opportunity
00:00:38.620 | to introduce a simple feed forward neural net classifier.
00:00:43.620 | That will then lead into a little bit
00:00:46.040 | of just background things that you need to know
00:00:48.640 | about neural networks content,
00:00:50.360 | because the fact of the matter is there is a bunch of stuff
00:00:52.660 | you need to know about neural networks.
00:00:55.160 | And then after both of those things,
00:00:57.380 | I'll get into what's really meant to be the topic
00:00:59.880 | of today's lecture, which is looking at language modeling
00:01:03.920 | and recurrent neural networks.
00:01:06.160 | And that's then going to lead into those two things
00:01:10.000 | are important topics that we'll then be talking about
00:01:14.020 | really for the whole of next week as well.
00:01:16.920 | So there's a couple of reminders before we get underway.
00:01:19.640 | The first is that you should have handed in assignment two
00:01:23.200 | before you joined this class.
00:01:25.860 | And in turn, assignment three is out today.
00:01:28.920 | And it's an assignment where you're going to build
00:01:33.060 | essentially the neural dependency parser
00:01:36.080 | that I'm just about to present in PyTorch.
00:01:38.840 | So part of the role of this assignment
00:01:40.960 | is actually to get you up to speed with PyTorch.
00:01:44.120 | So this assignment is highly scaffolded
00:01:48.040 | with lots of comments and hints about what to do.
00:01:51.120 | And so the hope is that by the time you come
00:01:54.120 | to the end of it, you'll feel fairly familiar
00:01:57.280 | and comfortable with PyTorch.
00:01:59.640 | Don't forget there was also a tutorial
00:02:01.280 | on PyTorch last week.
00:02:02.760 | If you didn't catch that at the time,
00:02:04.560 | you might want to go back and look at the video.
00:02:07.040 | Another thing to mention about the assignments
00:02:11.000 | is that assignment three is the last assignment
00:02:14.680 | where our great team of TAs are happy to look at your code
00:02:19.120 | and sort out your bugs for you.
00:02:21.560 | So maybe take advantage of that,
00:02:23.680 | but not too much.
00:02:25.320 | But starting on assignment four for assignments four, five
00:02:28.240 | and the final project,
00:02:30.040 | the TAs are very happy to help in general,
00:02:32.760 | but it's just not going to be their job
00:02:34.720 | to be actually sorting out bugs for you.
00:02:37.760 | You should be looking at your code
00:02:39.600 | and discussing ideas and concepts
00:02:41.720 | and reasons why things might not work with them.
00:02:44.120 | Okay, so if you remember where we were last time,
00:02:49.120 | I'd introduced this idea
00:02:50.640 | of transition-based dependency parsers
00:02:53.880 | and that these were an efficient linear time method
00:02:58.120 | for giving the syntactic structure
00:03:00.240 | of natural language text.
00:03:02.320 | And that they worked pretty well
00:03:05.320 | before neural nets came along and took over NLP again,
00:03:08.760 | but they had some disadvantages.
00:03:11.120 | And their biggest disadvantage
00:03:13.160 | is that like most machine learning models of that time,
00:03:16.820 | they worked with indicator features.
00:03:19.380 | So that means that you are specifying some condition
00:03:24.000 | and then checking whether it was true of a configuration.
00:03:27.200 | So something like the word on the top of the stack is good
00:03:30.360 | and its part of speech is adjective,
00:03:32.640 | or the next word coming up is a personal pronoun,
00:03:36.800 | that those are conditions that would be features
00:03:39.480 | in a conventional transition-based dependency parser.
00:03:44.480 | And so what are the problems with doing that?
00:03:48.540 | Well, one problem is that those features are very sparse.
00:03:53.540 | A second problem is the features are incomplete.
00:03:57.440 | Well, what I mean by that is depending on what words
00:04:02.440 | and configurations occurred in the training data,
00:04:05.980 | there are certain features that will exist
00:04:09.040 | because you sort of saw a certain word preceding a verb
00:04:12.520 | and certain features that just won't exist
00:04:14.440 | because that word never occurred
00:04:16.320 | before a verb in the training data.
00:04:18.580 | But perhaps the biggest problem and opportunity
00:04:22.180 | for doing better with the neural dependency parser
00:04:25.140 | is that it turns out that in a symbolic dependency parser,
00:04:29.880 | computing all these features
00:04:31.500 | just turns out to actually be pretty expensive.
00:04:34.000 | That although the actual transition system
00:04:36.100 | that I showed last time is fast and efficient to run,
00:04:40.660 | you actually have to compute all of these features.
00:04:43.980 | And what you found was that about 95% of the parsing time
00:04:48.380 | of one of these models was spent just computing
00:04:51.800 | all of the features of every configuration.
00:04:54.340 | So that suggests that perhaps we can do better
00:04:58.940 | with a neural approach where we're going to learn
00:05:01.300 | a dense and compact feature representation.
00:05:04.100 | And so that's what I wanna go through now.
00:05:06.980 | So this time, we're still gonna have exactly
00:05:10.420 | the same kind of configuration of a stack and a buffer
00:05:15.020 | and running exactly the same transition sequence,
00:05:18.700 | except this time, rather than representing the configuration
00:05:22.500 | of the stack and the buffer
00:05:23.860 | by having several million symbolic features,
00:05:27.100 | we're instead going to summarize this configuration
00:05:30.560 | as a dense vector of dimensionality,
00:05:34.020 | perhaps approximately a thousand.
00:05:35.880 | And our neural approach is going to learn
00:05:39.540 | this dense compact feature representation.
00:05:42.740 | And so quite explicitly, what I'm gonna show you now briefly
00:05:47.580 | and what you're going to implement
00:05:51.180 | is essentially the neural dependency parser
00:05:54.020 | that was developed by Danqi Chen in 2014.
00:05:58.060 | And to skip to the advertisement right at the beginning
00:06:01.940 | as to how this works so well,
00:06:04.860 | these are the kind of results that you got from it
00:06:08.000 | using the measures that I introduced at the last time,
00:06:10.980 | the unlabeled attachment score,
00:06:12.860 | whether you attach dependencies correctly to the right word
00:06:17.020 | and the labeled attachment score
00:06:19.340 | as to whether you also get the type of grammatical relation
00:06:22.540 | of that dependency correct.
00:06:24.220 | And so essentially, this Chen and Manning parser
00:06:29.020 | gave a neural version of something
00:06:32.580 | like a transition-based dependency parser
00:06:35.340 | like MaltParser, shown in yellow.
00:06:37.520 | And the interesting thing was that taking advantage
00:06:41.260 | of a neural classifier in ways that I'm about to explain,
00:06:46.260 | that that could produce something
00:06:48.020 | that was about 2% more accurate
00:06:50.660 | than the symbolic dependency parser.
00:06:53.200 | And because of the fact that it's not doing
00:06:55.460 | all of the symbolic feature computation,
00:06:58.400 | despite the fact that you might think at first
00:07:00.760 | that there's a lot of real number math
00:07:02.540 | and matrix vector multiplies in a neural dependency parser,
00:07:07.140 | it actually ran noticeably faster
00:07:09.700 | than the symbolic dependency parser
00:07:11.580 | because it didn't have all of the feature computation.
00:07:16.100 | The other major approach to dependency parsing
00:07:18.300 | that I'm also showing here,
00:07:19.920 | and I'll get back to at the end,
00:07:22.220 | is what's referred to as graph-based dependency parsing.
00:07:25.740 | And so that's a different approach to dependency parsing.
00:07:29.060 | And so these are two symbolic graph-based dependency parsers
00:07:33.700 | and in the pre-neural world,
00:07:36.340 | they were somewhat more accurate
00:07:38.700 | than the transition-based parsers as you could see,
00:07:41.900 | but on the other hand,
00:07:43.420 | they were close to two orders of magnitude slower.
00:07:46.980 | And so essentially with the Chen and Manning parser,
00:07:49.780 | we were able to provide something
00:07:51.700 | that was basically as accurate
00:07:53.380 | as the best graph-based dependency parsers,
00:07:55.940 | which were the best dependency parsers
00:07:58.300 | while operating about two orders of magnitude more quickly.
00:08:01.780 | So how did we do it?
00:08:04.160 | It was actually a very straightforward implementation,
00:08:09.160 | which is part of what makes it great
00:08:11.800 | to do for assignment three.
00:08:13.960 | But this is how we did it and we got wins.
00:08:17.660 | So the first win,
00:08:18.960 | which is what we've already talked about extensively
00:08:21.700 | starting in week one,
00:08:23.140 | is to make use of distributed representations.
00:08:26.280 | So we represent each word as a word embedding,
00:08:29.580 | and you've had a lot of experience with that already.
00:08:33.140 | And so that means when words weren't seen
00:08:36.580 | in a particular configuration,
00:08:38.620 | we still know what they're like
00:08:40.180 | because they'll be,
00:08:41.260 | will have seen similar words in the correct configuration.
00:08:45.180 | But we don't stop only with word embeddings.
00:08:50.460 | The other things that are central to our dependency parser
00:08:53.860 | are the parts of speech of words and the dependency labels.
00:08:58.260 | And so what we decided to do
00:09:01.040 | is that although those are much smaller sets,
00:09:04.660 | so the dependency labels are about 40 in number
00:09:07.760 | and the parts of speech are of around
00:09:09.960 | that order of magnitude,
00:09:11.080 | sometimes less, sometimes more,
00:09:13.260 | that even within those sets of categories,
00:09:16.620 | there are ones that are very strongly related.
00:09:19.320 | So we also adopted distributed representations for them.
00:09:24.320 | So for example,
00:09:25.260 | there might be parts of speech
00:09:26.780 | for singular nouns and plural nouns.
00:09:29.100 | And basically most of the time they behave similarly
00:09:32.180 | and there are adjectival modifiers and numerical modifiers.
00:09:36.380 | So these are just numbers like three, four, five.
00:09:38.820 | And again, a lot of the time they behave the same
00:09:41.700 | that you have both three cows and brown cows.
00:09:46.140 | Okay, so everything is going to be represented
00:09:50.920 | in the distributed representation.
00:09:53.300 | So at that point,
00:09:54.660 | we have exactly the same kind of configuration
00:09:58.200 | where we have our stack, our buffer,
00:10:01.100 | and we've started to build some arcs.
00:10:03.780 | And so the classification decisions of the next transition
00:10:08.780 | are going to be made out of a few elements
00:10:11.680 | of this configuration.
00:10:12.940 | So we're looking at the top thing on the stack,
00:10:15.620 | the thing that's second on the stack,
00:10:17.580 | the first word on the buffer.
00:10:19.460 | And then we actually added in some additional features
00:10:22.580 | that depend on the extent to which
00:10:24.140 | we've already built arcs for words on the stack,
00:10:27.460 | so that we can be looking at the dependents
00:10:30.060 | on the left and right of those words that are on the stack
00:10:33.720 | that are already in the sets of arcs.
00:10:36.120 | And so for each of those things,
00:10:39.080 | there is a word, there is a part of speech,
00:10:43.040 | and for some of them, there is a dependency
00:10:47.280 | where it's already connected up to something else.
00:10:50.920 | So for example, the left corner of S2 here
00:10:54.160 | has an in-sub dependency
00:10:56.240 | back to the second thing on the stack.
00:10:58.580 | So we can take these elements of the configuration
00:11:02.320 | and can look up the embedding of each one.
00:11:05.780 | So we have word embeddings, part of speech embeddings,
00:11:07.920 | and dependency embeddings,
00:11:09.360 | and just concatenate them all together,
00:11:11.680 | kind of like we did before with the window classifier,
00:11:14.880 | and that will give us a neural representation
00:11:17.120 | of the configuration.
00:11:18.420 | Now, there's a second reason why we can hope to win
00:11:23.200 | by using a deep learning classifier
00:11:25.720 | to predict the next transition.
00:11:27.680 | And we haven't really said much about that yet.
00:11:29.840 | So I just wanted to detour
00:11:31.920 | and say a little bit more about that.
00:11:34.560 | So the simplest kind of classifier
00:11:37.320 | that's close to what we've been talking about
00:11:41.360 | in neural models is a softmax classifier.
00:11:44.640 | So that if we have D dimensional vectors X,
00:11:48.480 | and we have Y classes to assign things to,
00:11:54.240 | oh, sorry, Y is an element of a set of C classes
00:11:59.240 | to assign things to,
00:12:01.280 | then we can build a softmax classifier
00:12:03.680 | using the softmax distribution that we've seen before,
00:12:07.120 | where we decide the classes
00:12:09.100 | based on having a weight matrix that's C by D,
00:12:13.680 | and we train on supervised data,
00:12:16.920 | the values of this W weight matrix
00:12:19.400 | to minimize our negative log likelihood loss
00:12:23.360 | that we've seen before,
00:12:24.880 | a loss that's also commonly referred
00:12:26.800 | to as cross entropy loss,
00:12:28.240 | a term that you'll see in PyTorch among other places.
00:12:32.400 | So that is a straightforward machine learning classifier.
00:12:36.120 | And if you've done 229, you've seen softmax classifiers.
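To make that concrete, here is a minimal sketch of such a softmax classifier in PyTorch (not code from the lecture; the dimensions and batch size are made up): a single C-by-d linear layer trained with cross-entropy loss.

```python
import torch
import torch.nn as nn

d, C = 1000, 3                    # input dimension and number of classes (illustrative)
W = nn.Linear(d, C, bias=False)   # the C x d weight matrix of the softmax classifier
loss_fn = nn.CrossEntropyLoss()   # softmax + negative log likelihood, i.e. cross-entropy loss

x = torch.randn(32, d)            # a batch of 32 dense input vectors
y = torch.randint(0, C, (32,))    # gold class labels
loss = loss_fn(W(x), y)           # loss of predicting y from softmax(Wx)
loss.backward()                   # gradients w.r.t. W, ready for a gradient update
```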
00:12:41.120 | But a simple softmax classifier like this
00:12:45.160 | shares a weakness with most traditional machine learning classifiers,
00:12:49.080 | models that include naive Bayes models,
00:12:51.680 | support vector machines, and logistic regression:
00:12:54.960 | at the end of the day,
00:12:56.820 | they're not very powerful classifiers.
00:12:59.480 | They're classifiers that only
00:13:01.040 | give linear decision boundaries.
00:13:03.280 | And so this can be quite limiting.
00:13:05.160 | So if you have a difficult problem,
00:13:07.500 | like the one I'm indicating in the picture
00:13:10.000 | in the bottom left,
00:13:11.240 | well, there's just no way you can divide the green points
00:13:14.920 | from the red points by simply drawing a straight line.
00:13:18.440 | So you're going to have a quite imperfect classifier.
00:13:21.560 | So the second big win of neural classifiers
00:13:25.760 | is that they can be much more powerful
00:13:28.680 | because they can provide nonlinear classification.
00:13:32.160 | So rather than only being able to do something like
00:13:34.440 | in the left picture,
00:13:36.120 | we can come up with classifiers
00:13:38.200 | that do something like in the right picture
00:13:40.560 | and therefore can separate the green and the red points.
00:13:45.340 | As an aside, these pictures I've taken
00:13:47.520 | from Andrej Karpathy's ConvNet JS software,
00:13:50.600 | which is a kind of a fun little tool to play around with
00:13:53.720 | if you've got a bit of spare time.
00:13:55.480 | And so there's something subtle going on here,
00:13:59.920 | which is that, for our more powerful neural net classifiers,
00:14:04.920 | at the end of the day,
00:14:06.520 | what they have at the top of them is a softmax layer.
00:14:10.720 | And so this softmax layer is indeed a linear classifier,
00:14:15.720 | and it's still a linear classifier.
00:14:18.380 | But what they have below that is other layers of neural net.
00:14:23.080 | And so effectively what happens
00:14:26.540 | is that the classification decisions are linear
00:14:29.760 | as far as the top softmax is concerned,
00:14:32.600 | but nonlinear in the original representation space.
00:14:36.600 | So precisely what a neural net can do
00:14:38.920 | is warp the space around
00:14:41.840 | and move the representation of data points
00:14:44.880 | to provide something that at the end of the day
00:14:47.720 | can be classified by a linear classifier.
00:14:50.520 | And so that's what a simple feed forward
00:14:54.140 | neural network multiclass classifier does.
00:14:57.520 | So it starts with an input representation.
00:15:00.860 | So this is some dense representation of the input.
00:15:05.400 | It puts it through a hidden layer H
00:15:08.160 | with a matrix multiply followed by nonlinearity.
00:15:13.160 | So that matrix multiply can transform the space
00:15:16.440 | and map things around.
00:15:18.200 | And so then the output of that,
00:15:20.200 | we can then put into a softmax layer
00:15:22.980 | and get out softmax probabilities
00:15:25.840 | from which we make our classification decisions.
00:15:29.060 | And to the extent that our probabilities
00:15:32.960 | don't assign one to the correct class,
00:15:35.320 | we then get some log loss or cross entropy error,
00:15:38.520 | which we back propagate towards the parameters
00:15:41.720 | and embeddings of our model.
00:15:44.280 | And as the learning that goes on via back propagation,
00:15:48.760 | we increasingly well learn parameters
00:15:52.240 | of this hidden layer of the model,
00:15:54.440 | which learn to re-represent the input.
00:15:57.360 | They move the inputs around
00:15:59.400 | in an intermediate hidden vector space.
00:16:02.560 | So it can be easily classified
00:16:04.680 | with what at the end of the day is the linear softmax.
00:16:07.980 | So this is basically the whole
00:16:11.880 | of a simple feed forward neural network,
00:16:14.240 | multi-class classifier.
00:16:15.960 | And if we had something like a visual signal,
00:16:20.200 | we just sort of feed straight in here,
00:16:22.800 | real numbers and we'd be done.
00:16:24.840 | But normally with human language material,
00:16:27.920 | we actually effectively have one more layer
00:16:31.200 | that we're feeding in before that.
00:16:33.060 | 'Cause really below this dense input layer,
00:16:36.480 | we actually have one hot vectors
00:16:38.880 | for what words or parts of speech were involved.
00:16:41.720 | And then we're doing a lookup process,
00:16:44.040 | which you can think of as one more matrix multiply
00:16:47.180 | to convert the one hot features into our dense input layer.
00:16:51.080 | Okay, in my picture here,
00:16:53.760 | the one other thing that's different
00:16:55.560 | is I've introduced a different non-linearity
00:16:58.400 | in the hidden layer, which is a rectified linear unit.
00:17:02.400 | And that's what we'll be using now
00:17:04.080 | in our neural dependency parsers.
00:17:06.200 | It looks like the picture in the bottom right,
00:17:08.240 | and I'll come back to that in a few minutes.
00:17:11.120 | That's one of the extra neural net things to talk about.
00:17:14.460 | Okay, so our neural net dependency parser model architecture
00:17:20.720 | is essentially exactly that,
00:17:24.000 | but applied to the configuration
00:17:27.400 | of our transition-based dependency parser.
00:17:30.720 | So based on our transition-based
00:17:33.920 | dependency parser configuration,
00:17:36.360 | we construct an input layer embedding
00:17:39.360 | by looking up the various elements as I discussed previously,
00:17:43.560 | and then we feed it through this hidden layer
00:17:48.120 | to the softmax layer to get probabilities
00:17:52.280 | out of which we can choose what the next action is.
00:17:56.120 | And it's no more complicated than that.
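As a rough sketch of the architecture just described (this is not the exact assignment code; the feature count, 50-dimensional embeddings, 200 hidden units, and three transition classes are illustrative assumptions), the classifier might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class ParserClassifier(nn.Module):
    """Sketch of a Chen-and-Manning-style transition classifier (sizes are illustrative)."""
    def __init__(self, n_words, n_pos, n_deps, n_feats=48, emb_dim=50,
                 hidden_dim=200, n_transitions=3):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim)   # word embeddings
        self.pos_emb = nn.Embedding(n_pos, emb_dim)      # part-of-speech embeddings
        self.dep_emb = nn.Embedding(n_deps, emb_dim)     # dependency-label embeddings
        self.hidden = nn.Linear(n_feats * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_transitions)  # scores for shift / left-arc / right-arc

    def forward(self, word_ids, pos_ids, dep_ids):
        # word_ids, pos_ids and dep_ids index the configuration's feature slots
        # (n_feats slots in total); look each one up and concatenate into one dense vector.
        x = torch.cat([self.word_emb(word_ids).flatten(1),
                       self.pos_emb(pos_ids).flatten(1),
                       self.dep_emb(dep_ids).flatten(1)], dim=1)
        h = torch.relu(self.hidden(x))   # hidden layer with a ReLU non-linearity
        return self.out(h)               # logits; softmax/cross-entropy is applied in the loss
```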
00:17:58.840 | But what we found is that just simply,
00:18:06.720 | in some sense using the simplest kind
00:18:09.240 | of feed-forward neural classifier
00:18:13.600 | could provide a very accurate dependency parser
00:18:18.600 | that determines the structure of sentences,
00:18:22.960 | supporting meaning interpretation,
00:18:24.840 | the kind of way that I suggested last time.
00:18:28.080 | Indeed, despite the fact
00:18:30.600 | that it was a quite simple architecture in 2014,
00:18:34.760 | this was the first successful neural dependency parser.
00:18:38.840 | And the dense representations especially,
00:18:42.720 | but also partly the non-linearity of the classifier
00:18:46.200 | gave us this good result
00:18:47.540 | that it could both outperform symbolic parsers
00:18:51.480 | in terms of accuracy,
00:18:53.120 | and it could outperform them in terms of speed.
00:18:56.080 | So that was 2014.
00:19:00.720 | Just quickly here are a couple more slides
00:19:03.480 | on what's happened since then.
00:19:06.820 | So lots of people got excited
00:19:08.880 | by the success of this neural dependency parser,
00:19:11.860 | and a number of people, particularly at Google,
00:19:14.520 | then set about building a bigger, fancier,
00:19:18.680 | transition-based neural dependency parser.
00:19:21.040 | So they explored bigger, deeper networks.
00:19:23.320 | There's no reason to only have one hidden layer.
00:19:25.760 | You can have two hidden layers.
00:19:27.600 | You can do beam search that I briefly mentioned last time.
00:19:31.560 | Another thing that I'm not gonna talk about now
00:19:33.600 | is adding conditional random field style inference
00:19:37.660 | over decision sequences.
00:19:39.520 | And that then led in 2016
00:19:42.520 | for a model that they called Parsey-McParse face,
00:19:47.240 | which is hard to say with a straight face,
00:19:49.860 | which was then about 2.5, 3% more accurate
00:19:54.860 | than the model that we had produced,
00:19:56.660 | but still in basically the same family
00:19:59.440 | of transition-based parser with a neural net classifier
00:20:03.520 | to choose the next transition.
00:20:05.260 | The alternative to transition-based parsers
00:20:11.520 | is graph-based dependency parsers.
00:20:14.160 | And for a graph-based dependency parser,
00:20:17.400 | what you're doing is effectively considering
00:20:20.080 | every pair of words
00:20:22.100 | and considering a word as a dependent of root,
00:20:24.860 | and you're coming up with a score
00:20:27.440 | as to how likely it is that big is a dependent of root
00:20:32.440 | or how likely big is to be a dependent of cat.
00:20:35.920 | And similarly for every other word,
00:20:37.520 | for the word sat,
00:20:39.580 | how likely is it to be a dependent of root
00:20:43.020 | or a dependent of the, et cetera.
00:20:45.620 | And well, to do that well,
00:20:47.860 | you need to know more
00:20:48.920 | than just what the two words involved are.
00:20:52.200 | And so what you want to do is understand the context.
00:20:56.480 | So you want to have an understanding of the context of big,
00:21:00.040 | what's to the left of it,
00:21:01.040 | what's to the right of it
00:21:02.320 | to understand how you might hook it up
00:21:04.520 | into the dependency representations of the sentence.
00:21:08.100 | And so while there'd been previous work
00:21:11.520 | in graph-based dependency parsing,
00:21:13.640 | like the MST parser I showed on the earlier results slide,
00:21:18.240 | it seemed appealing that we could come up
00:21:21.200 | with a much better representation of context
00:21:24.100 | using neural nets that look at context.
00:21:26.760 | And how we do that is actually what I'll be talking about
00:21:29.160 | in the end part of the lecture.
00:21:31.060 | And so at Stanford,
00:21:33.720 | we became interested in trying to work out
00:21:37.220 | how to come up with a better graph-based dependency parser
00:21:40.260 | using context.
00:21:41.860 | Sorry, I forgot this.
00:21:42.700 | This was showing that
00:21:44.420 | if we can score each pairwise dependency,
00:21:46.820 | we can simply choose the best one.
00:21:49.100 | So we can say probably big is a dependent of cat
00:21:53.580 | and to a first approximation,
00:21:55.860 | we're gonna want to choose for each word
00:21:58.380 | that it is a dependent of the word
00:22:01.880 | that seems most likely to be a dependent.
00:22:04.380 | But we wanna do that with some constraints
00:22:06.440 | because we wanna get out something
00:22:08.140 | that is a tree with a single root,
00:22:10.420 | as I discussed last time.
00:22:12.180 | And you can do that by making use
00:22:14.260 | of a minimum spanning tree algorithm
00:22:16.640 | that uses the scores
00:22:18.520 | of how likely different dependencies are.
00:22:22.020 | Okay, so then in 2017,
00:22:25.420 | another student, Tim Dozat, and I then worked on saying,
00:22:29.780 | well, can we now also build a much better
00:22:33.500 | neural graph-based dependency parser?
00:22:36.220 | And we developed a novel method
00:22:38.540 | for neurally scoring dependencies
00:22:43.140 | in a graph-based model,
00:22:44.500 | which I'm not gonna get into the details of right now,
00:22:48.180 | but that also had a very nice result
00:22:50.380 | 'cause getting back to graph-based parsing,
00:22:53.620 | we could then build a graph-based parser
00:22:55.860 | that performed about a percent better
00:22:58.300 | than the best of the Google transition-based
00:23:01.180 | neural dependency parsers.
00:23:03.460 | But I should point out that this is a mixed win
00:23:06.980 | 'cause although its accuracy is better,
00:23:10.420 | these graph-based parsers are just n squared
00:23:13.540 | in performance rather than linear time.
00:23:15.860 | So kind of like the early results I showed,
00:23:18.980 | they don't operate nearly as quickly
00:23:21.300 | when you're wanting to parse large amounts of text
00:23:25.460 | with complex long sentences.
00:23:28.500 | Okay, so that's everything you need to know
00:23:31.940 | about dependency parsers and to do assignment three.
00:23:35.220 | So grab it this evening and start to work.
00:23:37.700 | But I did want to sort of,
00:23:40.600 | before going on to the next topic,
00:23:42.360 | just mention a few more things about neural networks.
00:23:46.820 | Since some of you know this well already,
00:23:49.940 | some of you have seen less of it,
00:23:51.780 | but there just are a bunch of things
00:23:53.620 | you have to be aware of for building neural networks.
00:23:57.220 | Now, again, for assignment three,
00:23:59.900 | essentially we give you everything.
00:24:01.740 | And if you follow the recipe, your parser should work well.
00:24:06.280 | But what you should minimally do is actually look carefully
00:24:11.280 | at some of the things that this parser does,
00:24:16.160 | which is questions like,
00:24:17.780 | how do we initialize our matrices of our neural network?
00:24:22.780 | What kind of optimizers do we use?
00:24:26.180 | And things like that.
00:24:27.400 | 'Cause these are all important decisions.
00:24:31.160 | And so I wanted to say just a few words about that.
00:24:34.300 | Okay, so the first thing that we haven't discussed at all
00:24:39.140 | is the concept of regularization.
00:24:42.380 | So when we're building these neural nets,
00:24:45.260 | we're now building models with a huge number of parameters.
00:24:50.260 | So essentially just about all neural net models
00:24:54.880 | that work well,
00:24:57.100 | actually their full loss function
00:25:00.900 | is a regularized loss function.
00:25:03.260 | So for this loss function here of J,
00:25:08.260 | well, this part here is the part that we've seen before
00:25:12.540 | where we're using a softmax classifier
00:25:16.540 | and then taking a negative log likelihood loss,
00:25:18.980 | which we're then averaging over the different examples.
00:25:21.640 | But actually we then stick on the end of it,
00:25:25.380 | this regularization term.
00:25:27.700 | And so this regularization term sums the square
00:25:32.320 | of every parameter in the model.
00:25:34.960 | And so what that effectively says is
00:25:39.260 | you only wanna make parameters non-zero
00:25:44.060 | if they're really useful, right?
00:25:46.600 | So to the extent that the parameters don't help much,
00:25:49.720 | you're just being penalized here by making them non-zero.
00:25:54.220 | But to the extent that the parameters do help,
00:25:56.780 | you'll gain in your estimation of likelihood
00:25:59.500 | and therefore it's okay for them to be non-zero.
00:26:02.560 | In particular, notice that this penalty
00:26:05.680 | is assessed only once per parameter.
00:26:08.300 | It's not being assessed separately for each example.
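Written out (my rendering of the loss being described, with N training examples and f_c the score for class c), the regularized objective is:

```latex
J(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{f_{y_i}}}{\sum_{c=1}^{C} e^{f_c}}
\;+\; \lambda \sum_{k} \theta_k^{2}
```

In PyTorch, an L2-style penalty of this kind is usually applied through the optimizer's weight_decay argument rather than added to the loss by hand.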
00:26:11.960 | Okay, and having this kind of regularization
00:26:16.060 | is essential to build neural net models
00:26:19.700 | that generalize well.
00:26:21.500 | So the classic problem is referred to as overfitting.
00:26:25.220 | And what overfitting means
00:26:27.140 | is that if you have a particular training dataset
00:26:29.940 | and you start training your model,
00:26:32.660 | your error will go down
00:26:35.000 | because you'll shift the parameters
00:26:36.500 | so they better predict the correct answer
00:26:41.260 | for data points in the training data.
00:26:43.260 | And you can keep on doing that
00:26:45.200 | and it will keep on reducing your error rate.
00:26:49.460 | But if you then look at your partially trained classifier
00:26:53.740 | and say, how well does this classifier
00:26:57.140 | classify independent data,
00:27:00.140 | different test data that you weren't training the model on,
00:27:04.160 | what you will find is up until a certain point,
00:27:07.200 | you'll get better at classifying
00:27:10.120 | independent test examples as well.
00:27:12.680 | And after that, commonly what will happen
00:27:16.280 | is you'll actually start to get worse
00:27:18.640 | at classifying independent test examples,
00:27:22.120 | even though you're continuing to get better
00:27:24.120 | at predicting the training examples.
00:27:26.200 | And so this was then referred to
00:27:27.960 | as you're overfitting the training examples,
00:27:31.800 | that you're fiddling the parameters of the model
00:27:34.120 | so that they're really good
00:27:35.160 | at predicting the training examples,
00:27:37.240 | in ways that aren't useful
00:27:39.960 | for then predicting on independent examples
00:27:44.440 | that you come to at runtime.
00:27:47.440 | Okay, that classic view of regularization
00:27:52.200 | is sort of actually outmoded and wrong
00:27:55.680 | for modern neural networks.
00:27:58.920 | So the right way to think of it
00:28:01.880 | for the kind of modern big neural networks that we build
00:28:06.560 | is that overfitting on the training data isn't a problem,
00:28:11.560 | but nevertheless, you need regularization
00:28:17.000 | to make sure that your models generalize well
00:28:20.920 | to independent test data.
00:28:23.520 | So what you would like is for your graph
00:28:26.200 | not to look like this example
00:28:29.900 | with test error starting to head up.
00:28:32.240 | You'd like to have it at worst case flat line
00:28:37.240 | and best case still be gradually dropping.
00:28:40.160 | It'll always be higher than the training error,
00:28:43.080 | but it's not actually showing a failure to generalize.
00:28:47.840 | So when we train big neural nets these days,
00:28:52.800 | our big neural nets always overfit on the training data.
00:28:59.280 | They hugely overfit on the training data.
00:29:02.100 | In fact, in many circumstances,
00:29:04.020 | our neural nets have so many parameters
00:29:06.700 | that you can continue to train them on the training data
00:29:10.020 | until the error on the training data is zero.
00:29:12.700 | They get every single example right
00:29:14.740 | because they can just memorize enough stuff about it
00:29:17.460 | to predict the right answer.
00:29:19.260 | But in general, providing the models are regularized well,
00:29:23.740 | those models will still also generalize well
00:29:27.020 | and predict well on independent data.
00:29:29.340 | And so for part of what we wanna do for that
00:29:33.740 | is to work out how much to regularize.
00:29:36.740 | And so this lambda parameter here
00:29:39.280 | is the strength of regularization.
00:29:41.880 | So if you're making that lambda number big,
00:29:44.640 | you're getting more regularization.
00:29:47.440 | And if you make it small, you're getting less.
00:29:49.960 | And you don't wanna have it be too big
00:29:51.760 | or else you won't fit the data well.
00:29:53.640 | And you don't want it to be too small
00:29:56.300 | or else you have the problem that you don't generalize well.
00:30:01.300 | Okay, so this is classic L2 regularization
00:30:05.160 | and it's a starting point.
00:30:06.920 | But our big neural nets are sufficiently complex
00:30:09.600 | and have sufficiently many parameters
00:30:11.680 | that essentially L2 regularization doesn't cut it.
00:30:15.800 | So the next thing that you should know about
00:30:18.520 | and is a very standard good feature
00:30:22.040 | for building neural nets is a technique called dropout.
00:30:25.980 | So dropout is generally introduced
00:30:30.980 | as a sort of a slightly funny process
00:30:35.180 | that you do when training to avoid feature co-adaptation.
00:30:40.180 | So in dropout, what you do is at the time
00:30:44.820 | that you're training your model,
00:30:47.820 | that for each instance or for each batch in your training,
00:30:55.140 | then for each neuron in the model,
00:30:58.480 | you drop 50% of its inputs.
00:31:01.120 | You just treat them as zero.
00:31:03.400 | And so that you can do by sort of zeroing out elements
00:31:07.280 | of the sort of layers.
00:31:10.460 | And then at test time,
00:31:14.840 | you don't drop any of the model weights,
00:31:17.200 | you keep them all, but actually you halve the model weights
00:31:20.940 | because you're now keeping twice as many things
00:31:23.560 | as you'd used at training time.
00:31:25.320 | And so effectively that little recipe
00:31:30.600 | prevents what's called feature co-adaptation.
00:31:33.840 | So you can't have features that are only useful
00:31:38.840 | in the presence of particular other features
00:31:43.300 | because the model can't guarantee
00:31:45.800 | which features are gonna be present for different examples
00:31:48.760 | because different features are being randomly dropped
00:31:51.320 | all of the time.
00:31:52.740 | And so effectively dropout gives you a kind
00:31:56.040 | of a middle ground between naive Bayes
00:31:58.160 | and a logistic regression model.
00:32:00.000 | In a naive Bayes models,
00:32:01.680 | all the weights are set independently
00:32:03.520 | and a logistic regression model,
00:32:05.440 | all the weights are set in the context of all the others.
00:32:08.360 | And here you are aware of other weights,
00:32:10.800 | but they can randomly disappear from you.
00:32:13.680 | It's also related to ensemble models like model bagging
00:32:17.280 | because you're using different subsets
00:32:19.000 | of the features every time.
00:32:22.200 | But after all of those explanations,
00:32:26.120 | there's actually another way of thinking about dropout
00:32:29.800 | which was actually developed here at Stanford.
00:32:31.760 | This is a paper by Percy Liang and students,
00:32:34.840 | which is to argue that really what dropout gives you
00:32:39.120 | is a strong regularizer that isn't a uniform regularizer
00:32:43.560 | like L2 that regularizes everything with an L2 loss,
00:32:47.040 | but can learn a feature dependent regularization.
00:32:50.160 | And so that dropout has just emerged as in general,
00:32:53.480 | the best way to do regularization for neural nets.
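As a small hedged sketch of what that looks like in PyTorch: nn.Dropout zeroes inputs at training time and rescales the survivors by 1/(1-p) (so-called inverted dropout), which is equivalent to the keep-everything-but-halve-the-weights-at-test-time recipe described above.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)    # drop each input element with probability 0.5
h = torch.ones(4)

drop.train()                # training mode: randomly zeroes elements and
print(drop(h))              # rescales survivors by 1/(1-p), e.g. tensor([2., 0., 2., 2.])

drop.eval()                 # test mode: nothing is dropped, nothing is rescaled
print(drop(h))              # tensor([1., 1., 1., 1.])
```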
00:32:57.920 | I think you've already seen and heard this one,
00:33:00.420 | but just to have it on my slides once.
00:33:04.280 | If you want to have your neural networks go fast,
00:33:09.080 | it's really essential that you make use of vectors,
00:33:12.640 | matrices, tensors, and you don't do things with for loops.
00:33:16.840 | So here's a teeny example where I'm using timeit,
00:33:21.840 | which is a useful thing that you can use too
00:33:24.040 | to see how fast your neural nets run
00:33:26.240 | and different ways of writing that.
00:33:28.840 | And so when I'm doing this,
00:33:31.500 | doing these dot products here,
00:33:36.980 | I can either do the dot product in a for loop
00:33:41.980 | against each word vector,
00:33:44.780 | or I can do the dot product
00:33:46.680 | with a single word vector matrix.
00:33:49.580 | And if I do it in a for loop,
00:33:51.860 | doing each loop takes me almost a second.
00:33:57.960 | Whereas if I do it with a matrix multiply,
00:34:02.960 | it takes me an order of magnitude less time.
00:34:06.080 | So you should always be looking to use vectors
00:34:08.460 | and matrices, not for loops.
00:34:10.520 | And this is a speed up of about 10 times
00:34:13.640 | when you're doing things on a CPU.
00:34:16.360 | Heading forward, we're going to be using GPUs
00:34:19.880 | and they only further exaggerate the advantages
00:34:23.180 | of using vectors and matrices
00:34:25.220 | where you'll commonly get two orders of magnitude speed up
00:34:28.720 | by doing things that way.
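Here is a small reconstruction of that kind of comparison (not the exact code from the slide; the vocabulary size and dimension are made up, and the absolute timings will vary by machine):

```python
import timeit
import numpy as np

V, d = 50_000, 300            # vocabulary size and embedding dimension (illustrative)
W = np.random.rand(V, d)      # word-vector matrix
x = np.random.rand(d)

def loop_dots():
    return [W[i].dot(x) for i in range(V)]   # one dot product per word, in a Python for loop

def matmul():
    return W.dot(x)                          # the same computation as one matrix-vector multiply

print(timeit.timeit(loop_dots, number=10))
print(timeit.timeit(matmul, number=10))      # typically an order of magnitude faster on a CPU
```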
00:34:30.880 | Yeah, so for the backward pass,
00:34:34.160 | you are running a backward pass as before
00:34:39.040 | on the dropped out examples, right?
00:34:42.800 | So for the things that were dropped out,
00:34:45.680 | no gradient is going through them
00:34:47.680 | because they weren't present,
00:34:49.200 | they're not affecting things.
00:34:51.640 | So in a particular batch,
00:34:55.320 | you're only training weights
00:34:57.800 | for the things that aren't dropped out.
00:34:59.680 | But then since you, for each successive batch,
00:35:03.280 | you drop out different things
00:35:05.180 | that over a bunch of batches,
00:35:06.940 | you're then training all of the weights of the model.
00:35:11.280 | And so feature dependent regularizer
00:35:14.680 | is meaning that how much a feature,
00:35:19.680 | the different features can be regularized
00:35:23.680 | different amounts to maximize performance.
00:35:28.680 | So back in this model,
00:35:32.200 | every feature was just sort of being penalized
00:35:36.920 | by taking lambda times that squared value.
00:35:39.720 | So this is sort of uniform regularization
00:35:44.120 | whereas the end result of this dropout-style training
00:35:48.280 | is that you end up with some features
00:35:51.960 | being regularized much more strongly
00:35:54.300 | and some other features being regularized less strongly.
00:35:59.300 | And how much they are regularized
00:36:03.280 | depends on how much they're being used.
00:36:05.240 | So you're regularizing more the features that are being used less,
00:36:09.800 | but I'm not gonna get into the details
00:36:13.040 | of how you can understand that perspective.
00:36:16.240 | That's outside of the context
00:36:19.080 | of what I'm gonna get through right now.
00:36:21.520 | So the final bit is I just wanted to give a little bit
00:36:24.840 | of perspective on nonlinearities in our neural nets.
00:36:29.840 | So the first thing to remember
00:36:33.340 | is you have to have nonlinearities.
00:36:35.760 | So if you're building a multi-layer neural net
00:36:39.240 | and you've just got W1X plus B1
00:36:42.400 | then you put it through W2X plus B2
00:36:45.920 | and then put through W3X,
00:36:47.720 | well, I guess they're different hidden layers.
00:36:52.240 | Sorry, I shouldn't have said X.
00:36:53.240 | They should be hidden one, hidden two, hidden three.
00:36:55.360 | W3, hidden three plus B3.
00:36:59.040 | That multiple linear transformations composed
00:37:03.820 | so they can be just collapsed down
00:37:05.380 | into a single linear transformation.
00:37:07.660 | So you don't get any power as a data representation
00:37:12.660 | by having multiple linear layers.
00:37:17.840 | There's a slightly longer story there
00:37:19.740 | 'cause you actually do get some interesting learning effects
00:37:22.380 | but I'm not gonna talk about that now.
00:37:25.080 | But standardly, we have to have some kind of nonlinearity
00:37:30.080 | to do something interesting in a deep neural network.
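To see the collapse concretely, here is a quick numerical check (with made-up sizes) that two stacked linear layers equal a single linear layer:

```python
import torch

x = torch.randn(5)
W1, b1 = torch.randn(4, 5), torch.randn(4)
W2, b2 = torch.randn(3, 4), torch.randn(3)

two_layers = W2 @ (W1 @ x + b1) + b2            # two affine maps applied in sequence...
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)      # ...equal one affine map with W = W2 W1
print(torch.allclose(two_layers, one_layer))    # True
```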
00:37:35.820 | Okay, so as a starting point,
00:37:39.340 | the most classic nonlinearity is the logistic,
00:37:43.420 | often just called the sigmoid nonlinearity
00:37:46.420 | because of its S shape
00:37:48.620 | which we've seen before in previous lectures.
00:37:51.300 | So this will take any real number
00:37:54.140 | and map it on to the range of zero one.
00:37:57.680 | And that was sort of basically what people used
00:38:02.140 | in sort of 1980s neural nets.
00:38:04.900 | Now, one disadvantage of this nonlinearity
00:38:08.860 | is that it's moving everything into the positive space
00:38:14.580 | because the output is always between zero and one.
00:38:17.480 | So people then decided that for many purposes,
00:38:21.260 | it was useful to have this variant sigmoid shape
00:38:24.980 | of hyperbolic tan,
00:38:27.060 | which is then being shown in the second picture.
00:38:29.880 | Now, logistic and hyperbolic tan,
00:38:34.740 | they sound like they're very different things
00:38:37.040 | but actually, as you maybe remember from a math class,
00:38:40.340 | hyperbolic tan can be represented
00:38:42.540 | in terms of exponentials as well.
00:38:44.780 | And if you do a bit of math,
00:38:46.140 | which possibly we might make you do on an assignment,
00:38:49.620 | it's actually the case that a hyperbolic tan
00:38:52.020 | is just a rescaled and shifted version of the logistics.
00:38:55.860 | So it's really exactly the same curve, just squeezed a bit.
00:38:59.400 | So it goes now symmetrically between minus one and one.
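The bit of math being alluded to is the identity relating the two curves:

```latex
\tanh(x) \;=\; \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \;=\; 2\,\sigma(2x) - 1,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```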
00:39:02.880 | Well, these kinds of transcendental functions
00:39:08.700 | like hyperbolic tan,
00:39:11.000 | they're kind of slow and expensive to compute, right?
00:39:13.740 | Even on our fast computers,
00:39:15.640 | calculating exponentials is a bit slow.
00:39:18.260 | So something people became interested in was,
00:39:21.540 | well, could we do things with much simpler non-linearity?
00:39:25.380 | So what if we used a so-called hard tanh?
00:39:29.340 | So the hard tanh,
00:39:31.260 | up to some point it just flat lines at minus one,
00:39:37.140 | then it is Y equals X up until one,
00:39:42.140 | and then it just flat lines again.
00:39:44.700 | And that seems a slightly weird thing to use
00:39:49.480 | because if your input is over on the left
00:39:52.900 | or over on the right,
00:39:54.300 | you're sort of not getting any discrimination,
00:39:57.660 | with different inputs there giving the same output.
00:40:01.040 | But somewhat surprisingly,
00:40:02.880 | I mean, I was surprised when people started doing this,
00:40:06.380 | these kinds of models proved to be very successful.
00:40:11.660 | And so that then led into what's proven to be kind of
00:40:15.660 | the most successful and generally widely used non-linearity
00:40:19.620 | in a lot of recent deep learning work,
00:40:21.740 | which was what was being used
00:40:24.660 | in the dependency parser model I showed
00:40:28.220 | is what's called the rectified linear unit or ReLU.
00:40:31.860 | So a ReLU is kind of the simplest kind of non-linearity
00:40:35.140 | that you can imagine.
00:40:36.500 | So if the value of X is negative, its value is zero.
00:40:40.600 | So effectively it's just dead,
00:40:42.460 | it's not doing anything in the computation.
00:40:45.020 | And if its value of X is greater than zero,
00:40:50.020 | then it's just simply Y equals X,
00:40:53.000 | the value is being passed through.
00:40:55.460 | And at first sight, this might seem really, really weird
00:41:00.640 | and how could this be useful as a non-linearity?
00:41:04.200 | But if you sort of think a bit
00:41:05.900 | about how you can approximate things
00:41:08.780 | with piecewise linear functions very accurately,
00:41:12.000 | you might kind of start to see how you could use this
00:41:15.180 | to do accurate function approximation
00:41:18.540 | with piecewise linear functions.
00:41:20.780 | And that's what ReLU units have been found
00:41:23.760 | to do extremely, extremely successfully.
00:41:27.500 | So logistic and tanh are still used in various places.
00:41:31.860 | You use logistic when you want a probability output.
00:41:34.880 | We'll see tanh again very soon
00:41:37.460 | when we get to recurrent neural networks.
00:41:40.060 | But they're no longer the default when making deep networks
00:41:43.420 | that in a lot of places,
00:41:44.780 | the first thing you should think about trying
00:41:47.060 | is ReLU non-linearities.
00:41:49.180 | And so in particular,
00:41:51.180 | that part of why they're good
00:41:55.940 | is that ReLU networks train very quickly
00:42:00.740 | because you get this sort of very straightforward
00:42:03.340 | gradient backflow: provided you're
00:42:05.380 | on the right-hand side of it,
00:42:08.260 | you just get this sort of constant gradient backflow
00:42:11.700 | from the slope of one.
00:42:13.100 | And so they train very quickly.
00:42:15.120 | The somewhat surprising fact is that sort of
00:42:19.300 | almost the simplest non-linearity imaginable
00:42:22.300 | is still enough to have a very good neural network,
00:42:26.300 | but it just is.
00:42:27.940 | People have played around with variants of that.
00:42:30.820 | So people have then played around with leaky ReLUs
00:42:34.100 | where rather than the left-hand side
00:42:37.620 | just going completely to zero,
00:42:39.580 | it goes slightly negative on a much shallower slope.
00:42:44.580 | And then there's been a parametric ReLU
00:42:48.060 | where you have an extra parameter
00:42:49.980 | where you learn the slope of the negative part.
00:42:52.480 | Another thing that's been used recently
00:42:56.340 | is this swish non-linearity,
00:42:58.180 | which looks almost like a ReLU,
00:43:01.300 | but it sort of curves down just a little bit there
00:43:04.780 | and starts to go up.
00:43:06.180 | I mean, I think it's fair to say that, you know,
00:43:09.060 | none of these have really proven themselves vastly superior.
00:43:12.580 | There are papers saying I can get better results
00:43:15.020 | by using one of these and maybe you can,
00:43:17.780 | but you know, it's not night and day
00:43:20.020 | and a vast majority of work that you see around
00:43:22.860 | is still just using ReLUs in many places.
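For reference, the non-linearities just mentioned can each be written in a line or two (the 0.01 leak is a common default rather than a value from the lecture, and this swish form is the simple sigmoid-weighted version):

```python
import torch

def relu(x):
    return torch.clamp(x, min=0.0)             # 0 for x < 0, x otherwise

def leaky_relu(x, slope=0.01):
    return torch.where(x > 0, x, slope * x)    # small negative slope instead of a hard zero

def swish(x):
    return x * torch.sigmoid(x)                # dips slightly below zero, then grows like x
```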
00:43:26.120 | Okay, a couple more things.
00:43:31.260 | Parameter initialization.
00:43:33.580 | So in almost all cases,
00:43:36.700 | you must, must, must initialize the matrices
00:43:41.700 | of your neural nets with small random values.
00:43:46.660 | Neural nets just don't work
00:43:48.620 | if you start the matrices off as zero
00:43:51.660 | 'cause effectively then everything is symmetric,
00:43:56.060 | nothing can specialize in different ways
00:43:59.900 | and you then get sort of,
00:44:02.100 | you just don't have an ability for a neural net to learn.
00:44:07.060 | You sort of get this defective solution.
00:44:09.940 | So standardly you're using some methods
00:44:13.820 | such as drawing random numbers uniformly
00:44:17.380 | between minus R and R for a small value R
00:44:20.980 | and just filling in all the parameters with that.
00:44:24.340 | Exception is with bias weights.
00:44:26.820 | It's fine to set bias weights to zero
00:44:29.140 | and in some sense that's better.
00:44:30.980 | In terms of choosing what the R value is,
00:44:37.300 | essentially for traditional neural nets,
00:44:40.940 | what we want to set that R range for
00:44:44.220 | is so that the numbers in our neural network
00:44:47.460 | stay of a reasonable size.
00:44:49.700 | They don't get too big and they don't get too small.
00:44:53.540 | And whether they kind of blow up or not
00:44:56.940 | depends on how many connections there are
00:44:59.940 | in the neural networks.
00:45:01.140 | I'm looking at the fan in and fan out
00:45:03.860 | of connections in the neural network.
00:45:06.860 | And so a very common initialization
00:45:11.860 | that you'll see in PyTorch
00:45:13.940 | is what's called Xavier initialization,
00:45:16.780 | named after the person who suggested that.
00:45:20.580 | And it's working out a value of R
00:45:23.940 | based on this fan in and fan out of the layers,
00:45:28.940 | but you can just sort of ask for it,
00:45:30.940 | say initialize with this initialization and it will.
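In PyTorch that looks something like the following (the layer sizes are illustrative):

```python
import torch.nn as nn

layer = nn.Linear(200, 50)              # illustrative fan-in and fan-out
nn.init.xavier_uniform_(layer.weight)   # R is computed from the layer's fan-in and fan-out
nn.init.zeros_(layer.bias)              # bias weights can simply start at zero
```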
00:45:34.820 | This is another area where there have been
00:45:36.700 | some subsequent developments.
00:45:39.060 | So around week five,
00:45:42.220 | we'll start talking about layer normalization.
00:45:44.540 | And if you're using layer normalization,
00:45:46.740 | then it sort of doesn't matter the same
00:45:48.380 | how you initialize the weights.
00:45:52.140 | So finally, we have to train our models.
00:45:56.220 | And I've briefly introduced the idea
00:45:58.940 | of stochastic gradient descent.
00:46:01.060 | And the good news is that most of the time
00:46:05.220 | training your networks
00:46:07.220 | with stochastic gradient descent works just fine;
00:46:11.620 | use it and you will get good results.
00:46:14.580 | However, often that requires
00:46:19.500 | choosing a suitable learning rate,
00:46:21.340 | which is my final slide of tips on the next slide.
00:46:25.220 | But there's been an enormous amount of work
00:46:28.060 | on optimization of neural networks
00:46:31.460 | and people have come up with a whole series
00:46:34.300 | of more sophisticated optimizers.
00:46:38.060 | And I'm not gonna get into the details
00:46:40.260 | of optimization in this class,
00:46:42.220 | but the very loose idea is that these optimizers
00:46:46.340 | are adaptive in that they can kind of keep track
00:46:50.140 | of how much slope there was,
00:46:52.500 | how much gradient there is for different parameters.
00:46:56.140 | And therefore based on that,
00:46:58.020 | make decisions as to how much to adjust the weights
00:47:01.820 | when doing the gradient update
00:47:03.860 | rather than adjusting it by a constant amount.
00:47:06.460 | And so in that family of methods,
00:47:09.260 | there are methods that include AdaGrad, RMSProp, Adam,
00:47:13.940 | and then variants of Adam,
00:47:15.700 | including SparseAdam, AdamW, et cetera.
00:47:20.460 | The one called Adam is a pretty good place to start.
00:47:24.660 | And a lot of the time that's a good one to use.
00:47:27.340 | And again, from the perspective of PyTorch,
00:47:29.980 | when you're initializing an optimizer,
00:47:32.420 | you can just say, please use Adam
00:47:34.620 | and you don't actually need to know
00:47:36.740 | much more about it than that.
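A minimal sketch of that in PyTorch (the model and learning rate here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(200, 3)                                   # stand-in for your real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam with a typical initial learning rate

loss = model(torch.randn(8, 200)).sum()                     # dummy forward pass and loss
optimizer.zero_grad()
loss.backward()
optimizer.step()                                            # one adaptive parameter update
```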
00:47:38.660 | If you are using simple stochastic gradient descent,
00:47:44.700 | you have to choose a learning rate.
00:47:47.100 | So that was the eta value that you multiplied
00:47:50.980 | the gradient by for how much to adjust the weights.
00:47:54.500 | And so I talked about that slightly,
00:47:57.020 | how you didn't want it to be too big
00:47:59.580 | or your model could diverge or bounce around.
00:48:02.740 | You didn't want it to be too small
00:48:04.940 | or else training could take place exceedingly slowly
00:48:09.780 | and you'll miss the assignment deadline.
00:48:11.780 | How big it should be depends on all sorts of details
00:48:16.300 | of the model.
00:48:17.220 | And so you sort of wanna try out
00:48:18.820 | some different order of magnitude numbers
00:48:22.940 | to see what numbers seem to work well
00:48:25.820 | for training stably but reasonably quickly.
00:48:28.940 | Something around 10 to the minus three
00:48:30.740 | or 10 to the minus four isn't a crazy place to start.
00:48:34.580 | In principle, you can do fine
00:48:36.220 | just using a constant learning rate in SGD.
00:48:39.500 | In practice, people generally find
00:48:42.220 | they can get better results by decreasing learning rates
00:48:45.980 | as you train.
00:48:47.220 | So a very common recipe is that you halve the learning rate
00:48:51.660 | after every K epochs, where an epoch means
00:48:55.060 | that you've made a pass through
00:48:56.660 | the entire set of training data.
00:48:59.060 | So perhaps something like every three epochs
00:49:01.700 | you halve the learning rate.
00:49:03.620 | And a final little note there in purple
00:49:07.540 | is when you make a pass through the data,
00:49:09.300 | you don't wanna go through the data items
00:49:11.620 | in the same order each time
00:49:13.980 | 'cause that leads to just kind of a sort of patterning
00:49:18.620 | of the training examples, and the model
00:49:21.940 | will sort of fall into the periodicity of those patterns.
00:49:24.980 | So it's best to shuffle the data
00:49:27.380 | before each pass through it.
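A sketch of that recipe in PyTorch, assuming a simple classification setup (the data, model, and the choice of halving every 3 epochs are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(100, 10), torch.randint(0, 3, (100,)))
loader = DataLoader(data, batch_size=16, shuffle=True)      # reshuffles the examples every epoch

model = torch.nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)   # halve lr every 3 epochs

for epoch in range(9):
    for x, y in loader:                                     # one epoch = one pass over the data
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                        # decay the learning rate at the epoch boundary
```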
00:49:29.020 | Okay, there are more sophisticated ways
00:49:32.260 | to set learning rates.
00:49:34.900 | And I won't really get into those now.
00:49:38.420 | Fancier optimizers like Adam also have a learning rate.
00:49:42.100 | So you still have to choose a learning rate value,
00:49:45.180 | but it's effectively, it's an initial learning rate,
00:49:47.900 | which typically the optimizer shrinks as it runs.
00:49:51.060 | And so you commonly want to have the number
00:49:53.660 | it starts off with be on the larger side
00:49:56.180 | because it'll be shrinking as it goes.
00:49:58.940 | Okay, so that's all by way of introduction.
00:50:04.060 | And I'm now ready to start on language models and RNNs.
00:50:08.020 | So what is language modeling?
00:50:10.260 | I mean, as two words of English,
00:50:12.020 | language modeling could mean just about anything,
00:50:15.220 | but in the natural language processing literature,
00:50:18.860 | language modeling has a very precise technical definition,
00:50:23.060 | which you should know.
00:50:24.380 | So language modeling is the task
00:50:27.500 | of predicting the word that comes next.
00:50:30.860 | So if you have some context, like "the students opened their",
00:50:37.180 | you want to be able to predict what words will come next.
00:50:40.740 | Is it their books, their laptops, their exams, their minds?
00:50:45.740 | And so in particular, what you want to be doing
00:50:51.060 | is being able to give a probability
00:50:54.100 | that different words will occur in this context.
00:50:58.460 | So a language model is a probability distribution
00:51:02.980 | over next words given a preceding context.
00:51:07.460 | And a system that does that is called a language model.
00:51:13.740 | So as a result of that,
00:51:17.780 | you can also think of a language model
00:51:19.820 | as a system that assigns a probability score
00:51:23.020 | to a piece of text.
00:51:24.460 | So if we have a piece of text,
00:51:26.660 | then we can just work out its probability
00:51:29.020 | according to a language model.
00:51:30.820 | So the probability of a sequence of tokens,
00:51:33.580 | we can decompose via the chain rule,
00:51:36.860 | probability of the first times probability of the second
00:51:39.940 | given the first, et cetera, et cetera.
00:51:42.020 | And then we can work that out
00:51:44.340 | using what our language model provides
00:51:46.980 | as a product of each probability
00:51:49.340 | of predicting the next word.
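Written out (using the x^(t) notation for the t-th word), a language model gives a distribution over next words,

$$P\big(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\big),$$

and the probability it assigns to a whole piece of text decomposes by the chain rule as

$$P\big(x^{(1)}, \ldots, x^{(T)}\big) = P\big(x^{(1)}\big)\, P\big(x^{(2)} \mid x^{(1)}\big) \cdots P\big(x^{(T)} \mid x^{(T-1)}, \ldots, x^{(1)}\big) = \prod_{t=1}^{T} P\big(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)}\big).$$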
00:51:50.980 | Okay, language models are really the cornerstone
00:51:57.500 | of human language technology.
00:52:00.620 | Everything that you do with computers
00:52:04.100 | that involves human language,
00:52:06.380 | you are using language models.
00:52:09.300 | So when you're using your phone
00:52:11.780 | and it's suggesting, whether well or badly,
00:52:14.740 | what the next word you probably want to type is,
00:52:17.580 | that's a language model working
00:52:19.900 | to try and predict the likely next words.
00:52:23.300 | When the same thing happens in a Google Doc
00:52:26.140 | and it's suggesting a next word or a next few words,
00:52:29.460 | that's a language model.
00:52:31.140 | The main reason why the one in Google Docs
00:52:35.580 | works much better than the one on your phone
00:52:38.060 | is that the phone keyboard models
00:52:40.660 | have to be very compact
00:52:42.940 | so they can run quickly and not use much memory.
00:52:46.060 | So they're sort of only mediocre language models
00:52:49.020 | whereas something like Google Docs
00:52:50.780 | can do a much better language modeling job.
00:52:53.460 | Query completion, same thing, there's a language model.
00:52:59.260 | And so then the question is,
00:53:00.860 | well, how do we build language models?
00:53:04.980 | And so I briefly wanted to first again,
00:53:08.660 | give the traditional answer
00:53:11.300 | since you should have at least some understanding
00:53:13.660 | of how NLP was done without a neural network.
00:53:17.740 | And the traditional answer that powered speech recognition
00:53:22.340 | and other applications for at least two decades,
00:53:25.540 | three decades really,
00:53:27.420 | was what were called N-gram language models.
00:53:30.420 | And these were very simple, but still quite effective idea.
00:53:34.940 | So we want to give probabilities of next words.
00:53:38.100 | So what we're gonna work with
00:53:42.500 | is what are referred to as N-grams.
00:53:44.780 | And so an N-gram is just a chunk of N consecutive words,
00:53:49.580 | which are usually referred to as unigrams, bigrams,
00:53:52.340 | trigrams, and then four-grams and five-grams.
00:53:55.980 | A horrible set of names, which would offend any humanist,
00:54:00.700 | but that's what people normally say.
00:54:02.740 | And so effectively what we do is just collect statistics
00:54:07.980 | about how often different N-grams occur
00:54:11.420 | in a large amount of text,
00:54:13.180 | and then use those to build a probability model.
00:54:16.060 | So the first thing we do
00:54:18.940 | is what's referred to as making a Markov assumption.
00:54:21.900 | So these are also referred to as Markov models.
00:54:24.820 | And we decide that the word in position T plus one
00:54:29.380 | only depends on the preceding N minus one words.
00:54:33.380 | So if we want to predict the word at position T plus one
00:54:38.820 | given the entire preceding text,
00:54:41.980 | we actually throw away the early words
00:54:45.100 | and just use the preceding N minus one words as context.
00:54:49.540 | Well, once we've made that simplification,
00:54:52.260 | we can then just use the definition
00:54:54.100 | of conditional probability and say,
00:54:56.380 | all that conditional probability
00:55:01.940 | is the probability of the N words
00:55:06.940 | divided by the probability of the preceding N minus one words.
00:55:06.940 | And so we have the probability of an N-gram
00:55:09.500 | over the probability of an N minus one gram.
00:55:12.700 | And so then how do we get these N-gram
00:55:15.900 | and N minus one gram probabilities?
00:55:18.140 | We simply take a larger amount of text in some language
00:55:22.460 | and we count how often the different N-grams occur.
00:55:26.100 | And so our crude statistical approximation
00:55:30.660 | starts off as the count of the N-gram
00:55:33.740 | over the count of the N minus one gram.
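In equation form, the Markov assumption and the counting estimate just described are

$$P\big(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\big) \approx P\big(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(t-n+2)}\big) = \frac{P\big(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)}\big)}{P\big(x^{(t)}, \ldots, x^{(t-n+2)}\big)} \approx \frac{\operatorname{count}\big(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)}\big)}{\operatorname{count}\big(x^{(t)}, \ldots, x^{(t-n+2)}\big)}.$$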
00:55:37.340 | So here's an example of that.
00:55:38.980 | Suppose we are learning a four-gram language model.
00:55:42.020 | Okay, so we throw away all words
00:55:44.540 | apart from the last three words,
00:55:47.020 | and they're our conditioning.
00:55:49.020 | We look in some large,
00:55:54.020 | we use the counts from some large training corpus
00:55:57.700 | and we see how often did 'students opened their books' occur?
00:56:01.820 | How often did 'students opened their minds' occur?
00:56:04.900 | And then for each of those counts,
00:56:07.020 | we divide through by the count
00:56:08.740 | of how often 'students opened their' occurred.
00:56:11.420 | And that gives us our probability estimates.
00:56:15.980 | So for example, if in the corpus 'students opened their'
00:56:19.500 | occurred a thousand times,
00:56:22.060 | and 'students opened their books' occurred 400 times,
00:56:25.660 | we'd get a probability estimate of 0.4 for books.
00:56:29.140 | If exams occurred a hundred times,
00:56:30.740 | we'd get 0.1 for exams.
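Worked out, that example is

$$P(\text{books} \mid \text{students opened their}) = \frac{400}{1000} = 0.4, \qquad P(\text{exams} \mid \text{students opened their}) = \frac{100}{1000} = 0.1.$$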
00:56:33.540 | And we sort of see here already the disadvantage
00:56:37.260 | of having made the Markov assumption
00:56:39.580 | and have gotten rid of all of this earlier context,
00:56:43.620 | which would have been useful for helping us to predict.
00:56:46.460 | The one other point that I'll just mention,
00:56:52.140 | which I confuse myself on sometimes,
00:56:54.140 | is the naming of the N-gram language model.
00:56:58.300 | So for a four-gram language model,
00:57:00.940 | it's called a four-gram language model
00:57:03.340 | because in its estimation,
00:57:05.460 | you're using four grams in the numerator
00:57:08.340 | and trigrams in the denominator.
00:57:11.460 | So you use the size of the numerator.
00:57:14.940 | So that terminology is different to the terminology
00:57:18.900 | that's used in Markov models.
00:57:21.020 | So when people talk about the order of a Markov model,
00:57:24.700 | that refers to the amount of context you're using.
00:57:28.220 | So this would correspond to a third order Markov model.
00:57:32.460 | Yeah, so someone said,
00:57:36.260 | "Is this similar to a Naive Bayes model?"
00:57:40.780 | Sort of.
00:57:42.260 | Naive Bayes models,
00:57:43.380 | you also estimate the probabilities just by counting.
00:57:46.860 | So they're related
00:57:49.980 | and there are sort of in some sense, two differences.
00:57:54.740 | The first difference or specialization
00:57:59.340 | is that Naive Bayes models
00:58:02.300 | work out probabilities of words
00:58:05.980 | independent of their neighbors.
00:58:07.620 | So in that respect, a Naive Bayes language model
00:58:12.100 | is a unigram language model.
00:58:14.260 | So you're just using the counts of individual words.
00:58:17.340 | But the other part of a Naive Bayes model
00:58:20.340 | is you're learning a different set of unigram counts
00:58:24.500 | for every class for your classifier.
00:58:27.820 | And so effectively,
00:58:33.020 | with a Naive Bayes model
00:58:35.380 | you've got class-specific unigram language models.
00:58:40.380 | Okay, I gave this as a simple statistical model
00:58:48.140 | for estimating your probabilities with an N-gram model.
00:58:52.140 | You can't actually get away with just doing that
00:58:54.940 | because you have sparsity problems.
00:58:57.380 | So, often it'll be the case that for many words,
00:59:01.460 | 'students opened their books'
00:59:03.140 | or students opened their backpacks
00:59:07.180 | just never occurred in the training data.
00:59:09.620 | If you think about it,
00:59:10.700 | if you have something like 10 to the fifth
00:59:13.740 | different words,
00:59:15.380 | and you want probabilities for a sequence of four words,
00:59:18.380 | with 10 to the fifth choices for each,
00:59:20.820 | there are sort of 10 to the 20th different combinations.
00:59:24.260 | So unless you're seeing
00:59:25.620 | and it's truly astronomical amount of data,
00:59:28.780 | most four word sequences you've never seen.
00:59:31.700 | So then your numerator will be zero
00:59:34.380 | and your probability estimate will be zero.
00:59:36.700 | And so that's bad.
00:59:38.060 | And so the commonest way of solving that
00:59:40.220 | is just to add a little Delta to every count
00:59:43.100 | and then everything is non-zero and that's called smoothing.
00:59:46.580 | But well, sometimes it's worse than that
00:59:49.820 | 'cause sometimes you won't even have seen
00:59:51.820 | 'students opened their', and that's more problematic
00:59:55.020 | 'cause that means our denominator is zero.
00:59:58.740 | And so the division will be ill-defined
01:00:01.740 | and we can't usefully calculate any probabilities
01:00:05.020 | in a context that we've never seen.
01:00:07.340 | And so the standard solution to that
01:00:09.580 | is to shorten the context and that's called back off.
01:00:13.420 | So we condition only on 'opened their',
01:00:16.580 | or if we still haven't seen 'opened their',
01:00:20.100 | we'll condition only on 'their',
01:00:22.460 | or we could just forget all conditioning
01:00:25.220 | and actually use a unigram model for our probabilities.
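A minimal sketch of those two fixes, add-delta smoothing and backing off to shorter contexts (the `counts` dictionary, `vocab` list, and delta value are assumptions for illustration; real systems use more carefully weighted backoff schemes):

```python
# Sketch only. Assumption: `counts` maps word tuples of any length to their
# corpus counts, and `vocab` is the list of words in the vocabulary.
def smoothed_prob(word, context, counts, vocab, delta=0.01):
    """Add-delta smoothed estimate of P(word | context)."""
    num = counts.get(context + (word,), 0) + delta
    den = counts.get(context, 0) + delta * len(vocab)
    return num / den

def backoff_prob(word, context, counts, vocab):
    """If the full context was never seen, back off to a shorter context."""
    while context and counts.get(context, 0) == 0:
        context = context[1:]  # drop the earliest conditioning word
    return smoothed_prob(word, context, counts, vocab)
```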
01:00:28.540 | Yeah.
01:00:31.500 | And so as you increase the order N
01:00:37.100 | of the N-gram language model,
01:00:38.740 | these sparsity problems become worse and worse.
01:00:42.180 | So in the early days,
01:00:43.660 | people normally worked with trigram models.
01:00:46.300 | As it became easier to collect billions of words of text,
01:00:50.540 | people commonly moved to five-gram models.
01:00:53.420 | But every time you go up an order of conditioning,
01:00:58.420 | you effectively need to be collecting
01:01:01.580 | orders of magnitude more data
01:01:03.340 | because of the size of the vocabularies of human languages.
01:01:07.060 | There's also a problem that these models are huge.
01:01:13.700 | So you basically have to be storing counts
01:01:17.460 | of all of these words sequences
01:01:19.500 | so you can work out these probabilities.
01:01:22.100 | And I mean, that's actually had a big effect
01:01:24.340 | in terms of what technology is available.
01:01:27.620 | So in the 2000s decade, up till about,
01:01:32.620 | whenever it was, 2014, Google Translate was already
01:01:37.620 | using probabilistic models
01:01:40.940 | that included language models
01:01:43.220 | of the N-gram sort.
01:01:45.260 | But the only way they could possibly be run
01:01:47.980 | is in the cloud
01:01:49.620 | because you needed to have these huge tables
01:01:52.980 | of probabilities.
01:01:54.660 | But now we have neural nets
01:01:55.980 | and you can have Google Translate
01:01:57.820 | just actually run on your phone.
01:01:59.780 | And that's possible because neural net models
01:02:03.620 | can be massively more compact
01:02:05.700 | than these old N-gram language models.
01:02:08.380 | Yeah.
01:02:13.420 | But nevertheless, before we get onto the neural models,
01:02:16.860 | let's just sort of look at the example of how these work.
01:02:21.860 | So it's trivial to train an N-gram language model
01:02:27.860 | 'cause you really just count how often
01:02:29.900 | word sequences occur in a corpus and you're ready to go.
01:02:33.260 | So these models can be trained in seconds.
01:02:35.460 | That's really good.
01:02:36.540 | That's not like sitting around for training neural networks.
01:02:39.540 | So if I train on my laptop,
01:02:42.780 | a small language model on,
01:02:45.500 | you know, about 1.7 million words as a trigram model,
01:02:50.500 | I can then ask it to generate text.
01:02:53.420 | If I give it a couple of words, say 'today the',
01:02:55.900 | I can then get it to sort of suggest a word
01:02:59.380 | that might come next.
01:03:00.860 | And the way I do that is the language model
01:03:03.780 | knows the probability distribution
01:03:06.180 | of things that can come next.
01:03:08.820 | Now there's a kind of a crude probability distribution.
01:03:12.260 | I mean, 'cause effectively
01:03:14.220 | over this relatively small corpus,
01:03:16.740 | there were things that occurred once, Italian and Emirate.
01:03:20.220 | There are things that occurred twice, price.
01:03:22.540 | There were things that occurred four times, company and bank.
01:03:26.420 | It's sort of fairly crude and rough,
01:03:28.580 | but I nevertheless get probability estimates.
01:03:31.460 | I can then say, okay, based on this,
01:03:36.380 | let's take this probability distribution
01:03:39.660 | and then we'll just sample the next word.
01:03:41.980 | So the two most likely words to sample,
01:03:44.380 | a company or bank, but we're rolling the dice
01:03:48.140 | and we might get any of the words that had come next.
01:03:50.740 | So maybe I sample price.
01:03:53.860 | Now I'll condition on price, so 'the price',
01:03:57.740 | and look up the probability distribution
01:04:00.500 | of what comes next.
01:04:01.940 | The most likely thing is of.
01:04:03.940 | And so again, I'll sample,
01:04:05.700 | and maybe this time I'll pick up of.
01:04:09.300 | And then I will now condition on price of,
01:04:12.980 | and I will look up the probability distribution
01:04:16.140 | of words following that.
01:04:18.140 | And I get this probability distribution
01:04:21.060 | and I'll sample randomly some word from it.
01:04:24.380 | And maybe this time I'll sample a rare,
01:04:27.460 | but possible one like gold.
01:04:29.740 | And I can keep on going
01:04:31.820 | and I'll get out something like this.
01:04:34.220 | Today, the price of gold per tonne
01:04:36.580 | while production of shoe lasts and shoe industry,
01:04:39.660 | the bank intervened just after it considered
01:04:41.860 | and rejected an IMF demand
01:04:43.820 | to rebuild depleted European stocks,
01:04:46.500 | SEP 30 N primary 76 cents a share.
01:04:50.220 | So what just a simple trigram model can produce
01:04:54.500 | over not very much text
01:04:56.620 | is actually already kind of interesting.
01:04:59.180 | Like it's actually surprisingly grammatical, right?
01:05:02.220 | There are whole pieces of it
01:05:03.740 | while production of shoe lasts and shoe industry,
01:05:07.020 | the bank intervened just after it considered
01:05:09.620 | and rejected an IMF demand, right?
01:05:11.540 | It's really actually pretty good grammatical text.
01:05:14.420 | So it's sort of amazing that these simple N-gram models
01:05:17.980 | actually can model a lot of human language.
01:05:21.940 | On the other hand, it's not a very good piece of text.
01:05:25.060 | It's completely incoherent and makes no sense.
01:05:29.180 | And so to actually be able to generate text
01:05:33.060 | that seems like it makes sense,
01:05:35.500 | we're going to need a considerably better language model.
01:05:38.900 | And that's precisely what neural language models
01:05:42.540 | have allowed us to build as we'll see later.
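For concreteness, here is a small sketch of the counting-and-sampling process just described, in Python (the toy corpus and names are made up for illustration):

```python
import random
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def generate(counts, w1, w2, length=20):
    """Repeatedly sample the next word from the conditional distribution."""
    out = [w1, w2]
    for _ in range(length):
        dist = counts.get((out[-2], out[-1]))
        if not dist:
            break
        words, freqs = zip(*dist.items())
        out.append(random.choices(words, weights=freqs)[0])  # roll the dice
    return " ".join(out)

corpus = "today the price of gold fell while the price of oil rose today".split()
model = train_trigram(corpus)
print(generate(model, "today", "the"))
```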
01:05:45.060 | Okay, so how can we build a neural language model?
01:05:50.380 | And so first of all, we're gonna do a simple one
01:05:54.060 | and then we'll see where we get,
01:05:56.300 | but moving on to recurrent neural nets
01:05:58.500 | might still take us into next time.
01:06:02.420 | So we're gonna have an input sequence of words
01:06:05.540 | and we want a probability distribution over the next word.
01:06:10.140 | Well, the simplest thing that we could try is to say,
01:06:13.700 | well, kind of the only tool we have so far
01:06:17.420 | is a window-based classifier.
01:06:19.740 | So what we can say, what we'd done previously
01:06:24.140 | either for our named entity recognized in lecture three
01:06:28.660 | or what I just showed you for the dependency parser
01:06:31.340 | is we have some context window,
01:06:33.900 | we put it through a neural net
01:06:35.660 | and we predict something as a classifier.
01:06:38.380 | So before we were predicting a location,
01:06:41.900 | but maybe instead we could reuse exactly
01:06:45.580 | the same technology and say,
01:06:48.460 | we're going to have a window-based classifier.
01:06:51.060 | So we're discarding the further away words
01:06:53.620 | just like in an N-gram language model,
01:06:57.820 | but we'll feed this fixed window into a neural net.
01:07:02.100 | So we concatenate the word embeddings,
01:07:05.540 | we put it through a hidden layer
01:07:07.660 | and then we have a softmax classifier over our vocabulary.
01:07:12.660 | And so now rather than predicting something like location
01:07:18.540 | or left arc in the dependency parser,
01:07:21.900 | we're going to have a softmax over the entire vocabulary
01:07:25.780 | sort of like we did with the skip-gram negative sampling
01:07:29.380 | model in the first two lectures.
01:07:31.900 | And so we're going to see this choice
01:07:35.100 | as predicting what word comes next,
01:07:37.820 | whether it produces laptops, minds, books, et cetera.
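A sketch of that fixed-window neural classifier in PyTorch (the dimensions, window size, and names here are illustrative assumptions, not the exact published model): concatenate the window's embeddings, pass them through one hidden layer, and score every word in the vocabulary.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, window=4, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.W = nn.Linear(window * emb_dim, hidden)  # separate weights per window position
        self.U = nn.Linear(hidden, vocab_size)

    def forward(self, window_ids):                    # window_ids: (batch, window)
        e = self.emb(window_ids).flatten(1)           # concatenate the window's embeddings
        h = torch.tanh(self.W(e))                     # hidden layer
        return self.U(h)                              # logits; softmax applied by the loss

lm = FixedWindowLM(vocab_size=10_000)
logits = lm(torch.randint(0, 10_000, (8, 4)))         # scores over the next word
```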
01:07:42.100 | Okay, so this is a fairly simple
01:07:46.540 | fixed window neural net classifier,
01:07:49.100 | but this is essentially a famous early model
01:07:55.540 | in the use of neural nets for NLP applications.
01:08:00.100 | So first the 2000 conference paper
01:08:02.580 | and then a somewhat later journal paper,
01:08:05.940 | Yoshua Bengio and colleagues introduced precisely
01:08:09.740 | this model as the neural probabilistic language model.
01:08:14.180 | And they were already able to show
01:08:17.140 | that this could give interesting good results
01:08:20.220 | for language modeling.
01:08:21.780 | And so it wasn't a great solution
01:08:25.300 | for neural language modeling, but it still had value.
01:08:29.220 | So it didn't solve the problem of allowing us
01:08:32.380 | to have bigger context to predict
01:08:34.780 | what words are going to come next.
01:08:36.980 | It's in that way limited exactly
01:08:40.660 | like an N-gram language model is,
01:08:43.260 | but it does have all the advantages
01:08:45.540 | of distributed representations.
01:08:47.900 | So rather than having these counts for word sequences
01:08:52.700 | that are very sparse and very crude,
01:08:56.620 | we can use distributed representations of words,
01:09:00.820 | which then make predictions
01:09:03.100 | that semantically similar words
01:09:05.260 | should give similar probability distribution.
01:09:09.020 | So the idea of that is if we'd use some other word here,
01:09:13.340 | like maybe 'the pupils opened their',
01:09:16.980 | well, maybe in our training data,
01:09:19.180 | we'd seen sentences about students,
01:09:21.860 | but we'd never seen sentences about pupils.
01:09:24.580 | An N-gram language model then would sort of have no idea
01:09:27.700 | what probabilities to use,
01:09:29.660 | whereas a neural language model can say,
01:09:32.900 | well, pupils is kind of similar to students.
01:09:35.500 | Therefore I can predict similarly
01:09:37.300 | to what I would have predicted for students.
01:09:39.500 | Okay, so there's now no sparsity problem.
01:09:45.460 | We don't need to store billions of N-gram counts.
01:09:50.460 | We simply need to store our word vectors
01:09:56.220 | and our W and U matrices,
01:09:59.340 | but we still have the remaining problems
01:10:01.660 | that our fixed window is too small.
01:10:04.700 | We can try and make the window larger.
01:10:07.660 | If we do that, the W matrix gets bigger,
01:10:11.380 | but that also points out another problem with this model.
01:10:15.580 | Not only can the window never be large enough,
01:10:18.900 | but W is just a trained matrix.
01:10:23.180 | And so therefore we're learning completely different weights
01:10:27.500 | for each position of context,
01:10:29.540 | the word minus one position, the word minus two,
01:10:32.340 | the word minus three, and the word minus four,
01:10:35.260 | so that there's no sharing in the model
01:10:37.940 | as to how it treats words in different positions,
01:10:42.940 | even though in some sense
01:10:45.460 | they will contribute semantic components
01:10:48.260 | that are at least somewhat position independent.
01:10:52.580 | So again, for those of,
01:10:54.140 | if you sort of think back to either a naive Bayes model
01:10:57.300 | or what we saw with the word2vec model at the beginning,
01:11:01.740 | the word2vec model or a naive Bayes model
01:11:04.180 | completely ignores word order.
01:11:06.500 | So it has one set of parameters
01:11:08.540 | regardless of what position things occur in.
01:11:11.300 | That doesn't work well for language modeling
01:11:13.820 | 'cause word order is really important in language modeling.
01:11:16.900 | If the last word is 'the', that's a really good predictor
01:11:20.260 | of there being an adjective or noun following,
01:11:22.660 | whereas if the word four back is 'the',
01:11:24.940 | it doesn't give you the same information.
01:11:28.180 | So you do wanna somewhat make use of word order,
01:11:33.180 | but this model is at the opposite extreme
01:11:36.940 | that each position is being modeled
01:11:39.100 | completely independently.
01:11:41.660 | So what we'd like to have is a neural architecture
01:11:46.100 | that can process an arbitrary amount of context
01:11:51.100 | and have more sharing of the parameters
01:11:54.460 | while still be sensitive to proximity.
01:11:58.500 | And so that's the idea of recurrent neural networks.
01:12:02.340 | And I'll say about five minutes about these today,
01:12:06.380 | and then next time we'll return
01:12:07.900 | and do more about recurrent neural networks.
01:12:11.900 | So for the recurrent neural network,
01:12:14.940 | rather than having a single hidden layer
01:12:19.340 | inside our classifier here that we compute each time,
01:12:24.340 | for the recurrent neural network,
01:12:26.980 | we have the hidden layer,
01:12:29.100 | which often is referred to as the hidden state,
01:12:32.420 | but we maintain it over time
01:12:35.300 | and we feed it back into itself.
01:12:38.180 | So that's what the word recurrent is meaning,
01:12:40.620 | that you're sort of feeding the hidden layer
01:12:43.540 | back into itself.
01:12:45.300 | So what we do is based on the first word,
01:12:50.300 | we compute a hidden representation kind of like before,
01:12:55.180 | which can be used to predict the next word.
01:12:58.820 | But then for when we want to predict
01:13:02.500 | what comes after the second word,
01:13:04.900 | we not only feed in the second word,
01:13:07.500 | we feed in the hidden layer from the previous word
01:13:12.500 | to have it help predict the hidden layer
01:13:16.700 | above the second word.
01:13:18.420 | And so formally the way we're doing that
01:13:20.700 | is we're taking the hidden layer above the first word,
01:13:24.860 | multiplying it by a matrix W,
01:13:28.500 | and then that's going to be going in together with X2
01:13:32.860 | to generate the next hidden step.
01:13:35.660 | And so we keep on doing that at each time step,
01:13:38.900 | so that we are kind of repeating a pattern
01:13:42.100 | of creating a next hidden layer
01:13:44.780 | based on the next input word
01:13:48.580 | and the previous hidden state by updating it,
01:13:52.020 | by multiplying it by a matrix W.
01:13:54.540 | Okay, so on my slide here,
01:13:55.980 | I've still only got four words of context
01:13:58.100 | because it's nice for my slide,
01:13:59.940 | but in principle there could be,
01:14:02.660 | any number of words of context now.
01:14:04.900 | Okay, so what we're doing is,
01:14:08.100 | so that we start off by having input vectors,
01:14:13.100 | which can be our word vectors
01:14:15.980 | that we've looked up for each word.
01:14:18.020 | So, sorry, yeah,
01:14:21.580 | so we can have the one hot vectors for word identity.
01:14:25.220 | We look up our word embeddings,
01:14:26.780 | so then we've got word embeddings for each word.
01:14:30.020 | And then we want to compute hidden states.
01:14:33.620 | So we need to start from somewhere,
01:14:36.100 | H0 is the initial hidden state
01:14:38.820 | and H0 is normally taken as a zero vector.
01:14:42.580 | So this is actually just initialized as zeros.
01:14:45.740 | And so for working out the first hidden state,
01:14:49.420 | we calculate it based on the first word embedding
01:14:54.420 | by multiplying this embedding by a matrix WE.
01:14:59.340 | And that gives us the first hidden state.
01:15:02.540 | But then, as we go on,
01:15:05.340 | we want to apply the same formula over again.
01:15:09.740 | So we have just two parameter matrices
01:15:14.140 | in the recurrent neural network,
01:15:17.020 | one matrix for multiplying input embeddings
01:15:20.460 | and one matrix for updating the hidden state of the network.
01:15:24.860 | And so for the second word, from its word embedding,
01:15:29.380 | we multiply it by the WE matrix.
01:15:34.060 | We take the previous time step's hidden state
01:15:37.020 | and multiply it by the WH matrix.
01:15:41.300 | And we use the two of those to generate the new hidden state
01:15:45.820 | and precisely how we generate the new hidden state
01:15:48.820 | is then shown in this equation on the left.
01:15:52.500 | So we take the previous hidden state, multiply it by WH.
01:15:56.620 | We take the input embedding, multiply it by WE.
01:16:00.900 | We sum those two.
01:16:02.620 | We add on a learn bias weight,
01:16:06.860 | and then we put that through a non-linearity.
01:16:11.060 | And although on this slide,
01:16:12.780 | that non-linearity is written as sigma,
01:16:15.260 | by far the most common non-linearity to use here
01:16:18.740 | actually is a tanh non-linearity.
01:16:22.380 | And so this is the core equation
01:16:26.100 | for a simple recurrent neural network.
01:16:28.980 | And for each successive time step,
01:16:31.340 | we're just gonna keep on applying that
01:16:33.860 | to work out hidden states.
01:16:36.180 | And then from those hidden states,
01:16:38.660 | we can use them just like in our window classifier
01:16:42.940 | to predict what would be the next word.
01:16:45.500 | So at any position, we can take this hidden vector,
01:16:50.740 | put it through a softmax layer,
01:16:52.580 | which is multiplying by U matrix
01:16:54.620 | and adding on another bias,
01:16:56.260 | and then making a softmax distribution out of that.
01:16:59.020 | And that will then give us a probability distribution
01:17:01.940 | over next words.
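Putting that together in symbols, matching the description above (with h^(0) the zero vector and e^(t) the embedding of the t-th word):

$$h^{(t)} = \tanh\!\big(W_h\, h^{(t-1)} + W_e\, e^{(t)} + b_1\big), \qquad h^{(0)} = \mathbf{0},$$

$$\hat{y}^{(t)} = \operatorname{softmax}\!\big(U\, h^{(t)} + b_2\big).$$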
01:17:04.140 | What we saw here, right?
01:17:05.900 | This is the entire math of a simple recurrent neural network.
01:17:10.900 | And next time I'll come back and say more about them,
01:17:16.380 | but this is the entirety of what you need to know
01:17:20.540 | in some sense for the computation of the forward model
01:17:24.140 | of a simple recurrent neural network.
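As a sketch, here is what that forward computation might look like in PyTorch, written with the explicit loop over time steps (the dimensions and names are assumptions; in practice you would typically use the built-in nn.RNN or a fancier cell):

```python
import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.W_e = nn.Linear(emb_dim, hidden)             # input-to-hidden (carries b1)
        self.W_h = nn.Linear(hidden, hidden, bias=False)  # hidden-to-hidden
        self.U = nn.Linear(hidden, vocab_size)            # hidden-to-output (carries b2)

    def forward(self, token_ids):                         # token_ids: (batch, seq_len)
        batch, seq_len = token_ids.shape
        h = torch.zeros(batch, self.W_h.in_features, device=token_ids.device)  # h0 = 0
        logits = []
        for t in range(seq_len):                          # the for-loop over time steps
            e_t = self.emb(token_ids[:, t])
            h = torch.tanh(self.W_h(h) + self.W_e(e_t))   # same W_h, W_e at every step
            logits.append(self.U(h))                      # scores over the next word
        return torch.stack(logits, dim=1)                 # (batch, seq_len, vocab_size)

lm = SimpleRNNLM(vocab_size=10_000)
scores = lm(torch.randint(0, 10_000, (8, 12)))            # batch of 8 sequences, length 12
```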
01:17:26.220 | So the advantages we have now
01:18:28.380 | is that it can process a text input of any length.
01:17:33.140 | In theory, at least it can use information
01:17:37.140 | from any number of steps back.
01:17:39.420 | We'll talk more about in practice
01:17:41.060 | how well that actually works.
01:17:42.860 | The model size is fixed.
01:17:45.700 | It doesn't matter how much of a past context there is.
01:17:49.820 | All we have is our WH and WE parameters.
01:17:54.020 | And at each time step,
01:17:56.580 | we use exactly the same weights to update our hidden state.
01:18:00.380 | So there's a symmetry in how different inputs are processed
01:18:05.340 | in producing our predictions.
01:18:08.140 | RNNs in practice though,
01:18:10.180 | or the simple RNNs in practice aren't perfect.
01:18:13.860 | So a disadvantage is that they're actually kind of slow
01:18:18.220 | 'cause with this recurrent computation,
01:18:21.300 | in some sense we are sort of stuck
01:18:23.300 | with having to have on the outside a for loop.
01:18:26.100 | So we can do vector matrix multiplies on the inside here,
01:18:30.500 | but really we have to do for time step equals one to N,
01:18:35.500 | calculate the successive hidden states.
01:18:39.340 | And so that's not a perfect neural net architecture
01:18:42.700 | and we'll discuss alternatives to that later.
01:18:47.580 | And although in theory,
01:18:49.220 | this model can access information any number of steps back,
01:18:53.140 | in practice we find that it's pretty imperfect at doing that
01:18:57.780 | and that will then lead to more advanced forms
01:19:01.220 | of recurrent neural network that I'll talk about next time
01:19:05.060 | that are able to more effectively access past context.
01:19:10.060 | Okay, I think I'll stop there for the day.