back to index

Stanford CS224N - NLP w/ DL | Winter 2021 | Lecture 5 - Recurrent Neural networks (RNNs)


Chapters

0:00 Introduction
2:45 Dependency parsers
5:42 Neural dependency parser
8:14 Distributed representations
11:19 Softmax classifier
14:51 Multiclass classifier
17:15 Dependency parser model
18:57 What's happened since 2014
20:06 Context
22:21 Graph-based parsers
24:35 Regularization
27:59 L2 Regularization
30:18 Dropout
34:30 Backward Pass
36:21 Nonlinearity
43:27 Parameter initialization
45:51 Training
47:39 Learning Rate
50:00 Language Models
52:58 N-gram Language Models
54:16 Markov Model
55:37 Four-gram Language Model
58:43 Sparsity Problems

Whisper Transcript | Transcript Only Page

00:00:00.000 | So we're now starting in week three with lecture five.
00:00:05.000 | So unfortunately, in the last class,
00:00:15.320 | I guess I really got behind and went a bit slowly.
00:00:18.400 | I guess I must just enjoy talking
00:00:20.640 | about natural languages too much.
00:00:22.640 | And so I never really got to the punchline
00:00:24.940 | of showing how you could do good things
00:00:26.720 | with the neural dependency parser.
00:00:28.600 | So today for the first piece,
00:00:30.200 | I'll in some sense be finishing the content of last time
00:00:33.640 | and talk about neural dependency parsing,
00:00:36.080 | which also gives us the opportunity
00:00:38.620 | to introduce a simple feed forward neural net classifier.
00:00:43.620 | That will then lead into a little bit
00:00:46.040 | of just background things that you need to know
00:00:48.640 | about neural networks content,
00:00:50.360 | because the fact of the matter is there is a bunch of stuff
00:00:52.660 | you need to know about neural networks.
00:00:55.160 | And then after both of those things,
00:00:57.380 | I'll get into what's really meant to be the topic
00:00:59.880 | of today's lecture, which is looking at language modeling
00:01:03.920 | and recurrent neural networks.
00:01:06.160 | And that's then going to lead into those two things
00:01:10.000 | are important topics that we'll then be talking about
00:01:14.020 | really for the whole of next week as well.
00:01:16.920 | So there's a couple of reminders before we get underway.
00:01:19.640 | The first is that you should have handed in assignment two
00:01:23.200 | before you joined this class.
00:01:25.860 | And in turn, assignment three is out today.
00:01:28.920 | And it's an assignment where you're going to build
00:01:33.060 | essentially the neural dependency parser
00:01:36.080 | that I'm just about to present in PyTorch.
00:01:38.840 | So part of the role of this assignment
00:01:40.960 | is actually to get you up to speed with PyTorch.
00:01:44.120 | So this assignment is highly scaffolded
00:01:48.040 | with lots of comments and hints about what to do.
00:01:51.120 | And so the hope is that by the time you come
00:01:54.120 | to the end of it, you'll feel fairly familiar
00:01:57.280 | and comfortable with PyTorch.
00:01:59.640 | Don't forget there was also a tutorial
00:02:01.280 | on PyTorch last week.
00:02:02.760 | If you didn't catch that at the time,
00:02:04.560 | you might want to go back and look at the video.
00:02:07.040 | Another thing to mention about the assignments
00:02:11.000 | is that assignment three is the last assignment
00:02:14.680 | where our great team of TAs are happy to look at your code
00:02:19.120 | and sort out your bugs for you.
00:02:21.560 | So maybe take advantage of that,
00:02:23.680 | but not too much.
00:02:25.320 | But starting on assignment four for assignments four, five
00:02:28.240 | and the final project,
00:02:30.040 | the TAs are very happy to help in general,
00:02:32.760 | but it's just not going to be their job
00:02:34.720 | to be actually sorting out bugs for you.
00:02:37.760 | You should be looking at your code
00:02:39.600 | and discussing ideas and concepts
00:02:41.720 | and reasons why things might not work with them.
00:02:44.120 | Okay, so if you remember where we were last time,
00:02:49.120 | I'd introduced this idea
00:02:50.640 | of transition-based dependency parsers
00:02:53.880 | and that these were an efficient linear time method
00:02:58.120 | for giving the syntactic structure
00:03:00.240 | of natural language text.
00:03:02.320 | And that they worked pretty well
00:03:05.320 | before neural nets came along and took over NLP again,
00:03:08.760 | but they had some disadvantages.
00:03:11.120 | And their biggest disadvantage
00:03:13.160 | is that like most machine learning models of that time,
00:03:16.820 | they worked with indicator features.
00:03:19.380 | So that means that you are specifying some condition
00:03:24.000 | and then checking whether it was true of a configuration.
00:03:27.200 | So something like the word on the top of the stack is good
00:03:30.360 | and its part of speech is adjective,
00:03:32.640 | or the next word coming up is a personal pronoun,
00:03:36.800 | that those are conditions that would be features
00:03:39.480 | in a conventional transition-based dependency parser.
00:03:44.480 | And so what are the problems with doing that?
00:03:48.540 | Well, one problem is that those features are very sparse.
00:03:53.540 | A second problem is the features are incomplete.
00:03:57.440 | Well, what I mean by that is depending on what words
00:04:02.440 | and configurations occurred in the training data,
00:04:05.980 | there are certain features that will exist
00:04:09.040 | because you sort of saw a certain word preceding a verb
00:04:12.520 | and certain features that just won't exist
00:04:14.440 | because that word never occurred
00:04:16.320 | before a verb in the training data.
00:04:18.580 | But perhaps the biggest problem and opportunity
00:04:22.180 | for doing better with the neural dependency parser
00:04:25.140 | is that it turns out that in a symbolic dependency parser,
00:04:29.880 | computing all these features
00:04:31.500 | just turns out to actually be pretty expensive.
00:04:34.000 | That although the actual transition system
00:04:36.100 | that I showed last time is fast and efficient to run,
00:04:40.660 | you actually have to compute all of these features.
00:04:43.980 | And what you found was that about 95% of the parsing time
00:04:48.380 | of one of these models was spent just computing
00:04:51.800 | all of the features of every configuration.
00:04:54.340 | So that suggests that perhaps we can do better
00:04:58.940 | with a neural approach where we're going to learn
00:05:01.300 | a dense and compact feature representation.
00:05:04.100 | And so that's what I wanna go through now.
00:05:06.980 | So this time, we're still gonna have exactly
00:05:10.420 | the same kind of configuration of a stack and a buffer
00:05:15.020 | and running exactly the same transition sequence,
00:05:18.700 | except this time, rather than representing the configuration
00:05:22.500 | of the stack and the buffer
00:05:23.860 | by having several million symbolic features,
00:05:27.100 | we're instead going to summarize this configuration
00:05:30.560 | as a dense vector of dimensionality,
00:05:34.020 | perhaps approximately a thousand.
00:05:35.880 | And our neural approach is going to learn
00:05:39.540 | this dense compact feature representation.
00:05:42.740 | And so quite explicitly, what I'm gonna show you now briefly
00:05:47.580 | and what you're going to implement
00:05:51.180 | is essentially the neural dependency parser
00:05:54.020 | that was developed by Danqi Chen in 2014.
00:05:58.060 | And to skip to the advertisement right at the beginning
00:06:01.940 | as to how this works so well,
00:06:04.860 | these are the kind of results that you got from it
00:06:08.000 | using the measures that I introduced at the last time,
00:06:10.980 | the unlabeled attachment score,
00:06:12.860 | whether you attach dependencies correctly to the right word
00:06:17.020 | and the labeled attachment score
00:06:19.340 | as to whether you also get the type of grammatical relation
00:06:22.540 | of that dependency correct.
00:06:24.220 | And so essentially, this Chen and Manning parser
00:06:29.020 | gave a neural version of something
00:06:32.580 | like a transition-based dependency parser
00:06:35.340 | like MaltParser, shown in yellow.
00:06:37.520 | And the interesting thing was that taking advantage
00:06:41.260 | of a neural classifier in ways that I'm about to explain,
00:06:46.260 | that that could produce something
00:06:48.020 | that was about 2% more accurate
00:06:50.660 | than the symbolic dependency parser.
00:06:53.200 | And because of the fact that it's not doing
00:06:55.460 | all of the symbolic feature computation,
00:06:58.400 | despite the fact that you might think at first
00:07:00.760 | that there's a lot of real number math
00:07:02.540 | and matrix vector multiplies in a neural dependency parser,
00:07:07.140 | it actually ran noticeably faster
00:07:09.700 | than the symbolic dependency parser
00:07:11.580 | because it didn't have all of the feature computation.
00:07:16.100 | The other major approach to dependency parsing
00:07:18.300 | that I'm also showing here,
00:07:19.920 | and I'll get back to at the end,
00:07:22.220 | is what's referred to as graph-based dependency parsing.
00:07:25.740 | And so that's a different approach to dependency parsing.
00:07:29.060 | And so these are two symbolic graph-based dependency parsers
00:07:33.700 | and in the pre-neural world,
00:07:36.340 | they were somewhat more accurate
00:07:38.700 | than the transition-based parsers as you could see,
00:07:41.900 | but on the other hand,
00:07:43.420 | they were close to two orders of magnitude slower.
00:07:46.980 | And so essentially with the Chen and Manning parser,
00:07:49.780 | we were able to provide something
00:07:51.700 | that was basically as accurate
00:07:53.380 | as the best graph-based dependency parsers,
00:07:55.940 | which were the best dependency parsers
00:07:58.300 | while operating about two orders of magnitude more quickly.
00:08:01.780 | So how did we do it?
00:08:04.160 | It was actually a very straightforward implementation,
00:08:09.160 | which is part of what makes it great
00:08:11.800 | to do for assignment three.
00:08:13.960 | But this is how we did it and we got wins.
00:08:17.660 | So the first win,
00:08:18.960 | which is what we've already talked about extensively
00:08:21.700 | starting in week one,
00:08:23.140 | is to make use of distributed representations.
00:08:26.280 | So we represent each word as a word embedding,
00:08:29.580 | and you've had a lot of experience with that already.
00:08:33.140 | And so that means when words weren't seen
00:08:36.580 | in a particular configuration,
00:08:38.620 | we still know what they're like
00:08:40.180 | because they'll be,
00:08:41.260 | will have seen similar words in the correct configuration.
00:08:45.180 | But we don't stop only with word embeddings.
00:08:50.460 | The other things that are central to our dependency parser
00:08:53.860 | are the parts of speech of words and the dependency labels.
00:08:58.260 | And so what we decided to do
00:09:01.040 | is that although those are much smaller sets,
00:09:04.660 | so the dependency labels are about 40 in number
00:09:07.760 | and the parts of speech are of around
00:09:09.960 | that order of magnitude,
00:09:11.080 | sometimes less, sometimes more,
00:09:13.260 | that even within those sets of categories,
00:09:16.620 | there are ones that are very strongly related.
00:09:19.320 | So we also adopted distributed representations for them.
00:09:24.320 | So for example,
00:09:25.260 | there might be parts of speech
00:09:26.780 | for singular nouns and plural nouns.
00:09:29.100 | And basically most of the time they behave similarly
00:09:32.180 | and there are adjectival modifiers and numerical modifiers.
00:09:36.380 | So these are just numbers like three, four, five.
00:09:38.820 | And again, a lot of the time they behave the same
00:09:41.700 | that you have both three cows and brown cows.
00:09:46.140 | Okay, so everything is going to be represented
00:09:50.920 | in the distributed representation.
00:09:53.300 | So at that point,
00:09:54.660 | we have exactly the same kind of configuration
00:09:58.200 | where we have our stack, our buffer,
00:10:01.100 | and we've started to build some arcs.
00:10:03.780 | And so the classification decisions of the next transition
00:10:08.780 | are going to be made out of a few elements
00:10:11.680 | of this configuration.
00:10:12.940 | So we're looking at the top thing on the stack,
00:10:15.620 | the thing that's second on the stack,
00:10:17.580 | the first word on the buffer.
00:10:19.460 | And then we actually added in some additional features
00:10:22.580 | that depend on the extent to which
00:10:24.140 | we've already built arcs for words on the stack,
00:10:27.460 | so that we can be looking at the dependents
00:10:30.060 | on the left and right of those words that are on the stack
00:10:33.720 | that are already in the sets of arcs.
00:10:36.120 | And so for each of those things,
00:10:39.080 | there is a word, there is a part of speech,
00:10:43.040 | and for some of them, there is a dependency
00:10:47.280 | where it's already connected up to something else.
00:10:50.920 | So for example, the left corner of S2 here
00:10:54.160 | has an in-sub dependency
00:10:56.240 | back to the second thing on the stack.
00:10:58.580 | So we can take these elements of the configuration
00:11:02.320 | and can look up the embedding of each one.
00:11:05.780 | So we have word embeddings, part of speech embeddings,
00:11:07.920 | and dependency embeddings,
00:11:09.360 | and just concatenate them all together,
00:11:11.680 | kind of like we did before with the window classifier,
00:11:14.880 | and that will give us a neural representation
00:11:17.120 | of the configuration.
00:11:18.420 | Now, there's a second reason why we can hope to win
00:11:23.200 | by using a deep learning classifier
00:11:25.720 | to predict the next transition.
00:11:27.680 | And we haven't really said much about that yet.
00:11:29.840 | So I just wanted to detour
00:11:31.920 | and say a little bit more about that.
00:11:34.560 | So the simplest kind of classifier
00:11:37.320 | that's close to what we've been talking about
00:11:41.360 | in neural models is a softmax classifier.
00:11:44.640 | So that if we have D dimensional vectors X,
00:11:48.480 | and we have Y classes to assign things to,
00:11:54.240 | oh, sorry, Y is an element of a set of C classes
00:11:59.240 | to assign things to,
00:12:01.280 | then we can build a softmax classifier
00:12:03.680 | using the softmax distribution that we've seen before,
00:12:07.120 | where we decide the classes
00:12:09.100 | based on having a weight matrix that's C by D,
00:12:13.680 | and we train on supervised data,
00:12:16.920 | the values of this W weight matrix
00:12:19.400 | to minimize our negative log likelihood loss
00:12:23.360 | that we've seen before,
00:12:24.880 | a loss that's also commonly referred
00:12:26.800 | to as cross entropy loss,
00:12:28.240 | a term that you'll see in PyTorch among other places.
00:12:32.400 | So that is a straightforward machine learning classifier.
00:12:36.120 | And if you've done 229, you've seen softmax classifiers.
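To make that concrete, here is a minimal sketch of such a softmax classifier in PyTorch (not code from the lecture; the dimensions and batch size are made up): a single C-by-d linear layer trained with cross-entropy loss.

```python
import torch
import torch.nn as nn

d, C = 1000, 3                    # input dimension and number of classes (illustrative)
W = nn.Linear(d, C, bias=False)   # the C x d weight matrix of the softmax classifier
loss_fn = nn.CrossEntropyLoss()   # softmax + negative log likelihood, i.e. cross-entropy loss

x = torch.randn(32, d)            # a batch of 32 dense input vectors
y = torch.randint(0, C, (32,))    # gold class labels
loss = loss_fn(W(x), y)           # loss of predicting y from softmax(Wx)
loss.backward()                   # gradients w.r.t. W, ready for a gradient update
```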
00:12:41.120 | But a simple softmax classifier like this
00:12:45.160 | shares a weakness with most traditional machine learning classifiers,
00:12:49.080 | models that include naive Bayes models,
00:12:51.680 | support vector machines, and logistic regression:
00:12:54.960 | at the end of the day,
00:12:56.820 | they're not very powerful classifiers.
00:12:59.480 | They're classifiers that only
00:13:01.040 | give linear decision boundaries.
00:13:03.280 | And so this can be quite limiting.
00:13:05.160 | So if you have a difficult problem,
00:13:07.500 | like the one I'm indicating in the picture
00:13:10.000 | in the bottom left,
00:13:11.240 | well, there's just no way you can divide the green points
00:13:14.920 | from the red points by simply drawing a straight line.
00:13:18.440 | So you're going to have a quite imperfect classifier.
00:13:21.560 | So the second big win of neural classifiers
00:13:25.760 | is that they can be much more powerful
00:13:28.680 | because they can provide nonlinear classification.
00:13:32.160 | So rather than only being able to do something like
00:13:34.440 | in the left picture,
00:13:36.120 | we can come up with classifiers
00:13:38.200 | that do something like in the right picture
00:13:40.560 | and therefore can separate the green and the red points.
00:13:45.340 | As an aside, these pictures I've taken
00:13:47.520 | from Andrej Karpathy's ConvNet JS software,
00:13:50.600 | which is a kind of a fun little tool to play around with
00:13:53.720 | if you've got a bit of spare time.
00:13:55.480 | And so there's something subtle going on here,
00:13:59.920 | which is that, for our more powerful neural net classifiers,
00:14:04.920 | at the end of the day,
00:14:06.520 | what they have at the top of them is a softmax layer.
00:14:10.720 | And so this softmax layer is indeed a linear classifier,
00:14:15.720 | and it's still a linear classifier.
00:14:18.380 | But what they have below that is other layers of neural net.
00:14:23.080 | And so effectively what happens
00:14:26.540 | is that the classification decisions are linear
00:14:29.760 | as far as the top softmax is concerned,
00:14:32.600 | but nonlinear in the original representation space.
00:14:36.600 | So precisely what a neural net can do
00:14:38.920 | is warp the space around
00:14:41.840 | and move the representation of data points
00:14:44.880 | to provide something that at the end of the day
00:14:47.720 | can be classified by a linear classifier.
00:14:50.520 | And so that's what a simple feed forward
00:14:54.140 | neural network multiclass classifier does.
00:14:57.520 | So it starts with an input representation.
00:15:00.860 | So this is some dense representation of the input.
00:15:05.400 | It puts it through a hidden layer H
00:15:08.160 | with a matrix multiply followed by nonlinearity.
00:15:13.160 | So that matrix multiply can transform the space
00:15:16.440 | and map things around.
00:15:18.200 | And so then the output of that,
00:15:20.200 | we can then put into a softmax layer
00:15:22.980 | and get out softmax probabilities
00:15:25.840 | from which we make our classification decisions.
00:15:29.060 | And to the extent that our probabilities
00:15:32.960 | don't assign one to the correct class,
00:15:35.320 | we then get some log loss or cross entropy error,
00:15:38.520 | which we back propagate towards the parameters
00:15:41.720 | and embeddings of our model.
00:15:44.280 | And as the learning that goes on via back propagation,
00:15:48.760 | we increasingly well learn parameters
00:15:52.240 | of this hidden layer of the model,
00:15:54.440 | which learn to re-represent the input.
00:15:57.360 | They move the inputs around
00:15:59.400 | in an intermediate hidden vector space.
00:16:02.560 | So it can be easily classified
00:16:04.680 | with what at the end of the day is the linear softmax.
00:16:07.980 | So this is basically the whole
00:16:11.880 | of a simple feed forward neural network,
00:16:14.240 | multi-class classifier.
00:16:15.960 | And if we had something like a visual signal,
00:16:20.200 | we just sort of feed straight in here,
00:16:22.800 | real numbers and we'd be done.
00:16:24.840 | But normally with human language material,
00:16:27.920 | we actually effectively have one more layer
00:16:31.200 | that we're feeding in before that.
00:16:33.060 | 'Cause really below this dense input layer,
00:16:36.480 | we actually have one hot vectors
00:16:38.880 | for what words or parts of speech were involved.
00:16:41.720 | And then we're doing a lookup process,
00:16:44.040 | which you can think of as one more matrix multiply
00:16:47.180 | to convert the one hot features into our dense input layer.
00:16:51.080 | Okay, in my picture here,
00:16:53.760 | the one other thing that's different
00:16:55.560 | is I've introduced a different non-linearity
00:16:58.400 | in the hidden layer, which is a rectified linear unit.
00:17:02.400 | And that's what we'll be using now
00:17:04.080 | in our neural dependency parsers.
00:17:06.200 | It looks like the picture in the bottom right,
00:17:08.240 | and I'll come back to that in a few minutes.
00:17:11.120 | That's one of the extra neural net things to talk about.
00:17:14.460 | Okay, so our neural net dependency parser model architecture
00:17:20.720 | is essentially exactly that,
00:17:24.000 | but applied to the configuration
00:17:27.400 | of our transition-based dependency parser.
00:17:30.720 | So based on our transition-based
00:17:33.920 | dependency parser configuration,
00:17:36.360 | we construct an input layer embedding
00:17:39.360 | by looking up the various elements as I discussed previously,
00:17:43.560 | and then we feed it through this hidden layer
00:17:48.120 | to the softmax layer to get probabilities
00:17:52.280 | out of which we can choose what the next action is.
00:17:56.120 | And it's no more complicated than that.
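As a rough sketch of the architecture just described (this is not the exact assignment code; the feature count, 50-dimensional embeddings, 200 hidden units, and three transition classes are illustrative assumptions), the classifier might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class ParserClassifier(nn.Module):
    """Sketch of a Chen-and-Manning-style transition classifier (sizes are illustrative)."""
    def __init__(self, n_words, n_pos, n_deps, n_feats=48, emb_dim=50,
                 hidden_dim=200, n_transitions=3):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim)   # word embeddings
        self.pos_emb = nn.Embedding(n_pos, emb_dim)      # part-of-speech embeddings
        self.dep_emb = nn.Embedding(n_deps, emb_dim)     # dependency-label embeddings
        self.hidden = nn.Linear(n_feats * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_transitions)  # scores for shift / left-arc / right-arc

    def forward(self, word_ids, pos_ids, dep_ids):
        # word_ids, pos_ids and dep_ids index the configuration's feature slots
        # (n_feats slots in total); look each one up and concatenate into one dense vector.
        x = torch.cat([self.word_emb(word_ids).flatten(1),
                       self.pos_emb(pos_ids).flatten(1),
                       self.dep_emb(dep_ids).flatten(1)], dim=1)
        h = torch.relu(self.hidden(x))   # hidden layer with a ReLU non-linearity
        return self.out(h)               # logits; softmax/cross-entropy is applied in the loss
```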
00:17:58.840 | But what we found is that just simply,
00:18:06.720 | in some sense using the simplest kind
00:18:09.240 | of feed-forward neural classifier
00:18:13.600 | could provide a very accurate dependency parser
00:18:18.600 | that determines the structure of sentences,
00:18:22.960 | supporting meaning interpretation,
00:18:24.840 | the kind of way that I suggested last time.
00:18:28.080 | Indeed, despite the fact
00:18:30.600 | that it was a quite simple architecture in 2014,
00:18:34.760 | this was the first successful neural dependency parser.
00:18:38.840 | And the dense representations especially,
00:18:42.720 | but also partly the non-linearity of the classifier
00:18:46.200 | gave us this good result
00:18:47.540 | that it could both outperform symbolic parsers
00:18:51.480 | in terms of accuracy,
00:18:53.120 | and it could outperform them in terms of speed.
00:18:56.080 | So that was 2014.
00:19:00.720 | Just quickly here are a couple more slides
00:19:03.480 | on what's happened since then.
00:19:06.820 | So lots of people got excited
00:19:08.880 | by the success of this neural dependency parser,
00:19:11.860 | and a number of people, particularly at Google,
00:19:14.520 | then set about building a bigger, fancier,
00:19:18.680 | transition-based neural dependency parser.
00:19:21.040 | So they explored bigger, deeper networks.
00:19:23.320 | There's no reason to only have one hidden layer.
00:19:25.760 | You can have two hidden layers.
00:19:27.600 | You can do beam search that I briefly mentioned last time.
00:19:31.560 | Another thing that I'm not gonna talk about now
00:19:33.600 | is adding conditional random field style inference
00:19:37.660 | over decision sequences.
00:19:39.520 | And that then led in 2016
00:19:42.520 | for a model that they called Parsey-McParse face,
00:19:47.240 | which is hard to say with a straight face,
00:19:49.860 | which was then about 2.5, 3% more accurate
00:19:54.860 | than the model that we had produced,
00:19:56.660 | but still in basically the same family
00:19:59.440 | of transition-based parser with a neural net classifier
00:20:03.520 | to choose the next transition.
00:20:05.260 | The alternative to transition-based parsers
00:20:11.520 | is graph-based dependency parsers.
00:20:14.160 | And for a graph-based dependency parser,
00:20:17.400 | what you're doing is effectively considering
00:20:20.080 | every pair of words
00:20:22.100 | and considering a word as a dependent of root,
00:20:24.860 | and you're coming up with a score
00:20:27.440 | as to how likely it is that big is a dependent of root
00:20:32.440 | or how likely big is to be a dependent of cat.
00:20:35.920 | And similarly for every other word,
00:20:37.520 | for the word sat,
00:20:39.580 | how likely is it to be a dependent of root
00:20:43.020 | or a dependent of the, et cetera.
00:20:45.620 | And well, to do that well,
00:20:47.860 | you need to know more
00:20:48.920 | than just what the two words involved are.
00:20:52.200 | And so what you want to do is understand the context.
00:20:56.480 | So you want to have an understanding of the context of big,
00:21:00.040 | what's to the left of it,
00:21:01.040 | what's to the right of it
00:21:02.320 | to understand how you might hook it up
00:21:04.520 | into the dependency representations of the sentence.
00:21:08.100 | And so while there'd been previous work
00:21:11.520 | in graph-based dependency parsing,
00:21:13.640 | like the MST parser I showed on the earlier results slide,
00:21:18.240 | it seemed appealing that we could come up
00:21:21.200 | with a much better representation of context
00:21:24.100 | using neural nets that look at context.
00:21:26.760 | And how we do that is actually what I'll be talking about
00:21:29.160 | in the end part of the lecture.
00:21:31.060 | And so at Stanford,
00:21:33.720 | we became interested in trying to work out
00:21:37.220 | how to come up with a better graph-based dependency parser
00:21:40.260 | using context.
00:21:41.860 | Sorry, I forgot this.
00:21:42.700 | This was showing that
00:21:44.420 | if we can score each pairwise dependency,
00:21:46.820 | we can simply choose the best one.
00:21:49.100 | So we can say probably big is a dependent of cat
00:21:53.580 | and to a first approximation,
00:21:55.860 | we're gonna want to choose for each word
00:21:58.380 | that it is a dependent of the word
00:22:01.880 | that seems most likely to be a dependent.
00:22:04.380 | But we wanna do that with some constraints
00:22:06.440 | because we wanna get out something
00:22:08.140 | that is a tree with a single root,
00:22:10.420 | as I discussed last time.
00:22:12.180 | And you can do that by making use
00:22:14.260 | of a minimum spanning tree algorithm
00:22:16.640 | that uses the scores
00:22:18.520 | of how likely different dependencies are.
00:22:22.020 | Okay, so then in 2017,
00:22:25.420 | another student, Tim Dozat, and I then worked on saying,
00:22:29.780 | well, can we now also build a much better
00:22:33.500 | neural graph-based dependency parser?
00:22:36.220 | And we developed a novel method
00:22:38.540 | for neurally scoring dependencies
00:22:43.140 | in a graph-based model,
00:22:44.500 | which I'm not gonna get into the details of right now,
00:22:48.180 | but that also had a very nice result
00:22:50.380 | 'cause getting back to graph-based parsing,
00:22:53.620 | we could then build a graph-based parser
00:22:55.860 | that performed about a percent better
00:22:58.300 | than the best of the Google transition-based
00:23:01.180 | neural dependency parsers.
00:23:03.460 | But I should point out that this is a mixed win
00:23:06.980 | 'cause although its accuracy is better,
00:23:10.420 | these graph-based parsers are just n squared
00:23:13.540 | in performance rather than linear time.
00:23:15.860 | So kind of like the early results I showed,
00:23:18.980 | they don't operate nearly as quickly
00:23:21.300 | when you're wanting to parse large amounts of text
00:23:25.460 | with complex long sentences.
00:23:28.500 | Okay, so that's everything you need to know
00:23:31.940 | about dependency parsers and to do assignment three.
00:23:35.220 | So grab it this evening and start to work.
00:23:37.700 | But I did want to sort of,
00:23:40.600 | before going on to the next topic,
00:23:42.360 | just mention a few more things about neural networks.
00:23:46.820 | Since some of you know this well already,
00:23:49.940 | some of you have seen less of it,
00:23:51.780 | but there just are a bunch of things
00:23:53.620 | you have to be aware of for building neural networks.
00:23:57.220 | Now, again, for assignment three,
00:23:59.900 | essentially we give you everything.
00:24:01.740 | And if you follow the recipe, your parser should work well.
00:24:06.280 | But what you should minimally do is actually look carefully
00:24:11.280 | at some of the things that this parser does,
00:24:16.160 | which is questions like,
00:24:17.780 | how do we initialize our matrices of our neural network?
00:24:22.780 | What kind of optimizers do we use?
00:24:26.180 | And things like that.
00:24:27.400 | 'Cause these are all important decisions.
00:24:31.160 | And so I wanted to say just a few words about that.
00:24:34.300 | Okay, so the first thing that we haven't discussed at all
00:24:39.140 | is the concept of regularization.
00:24:42.380 | So when we're building these neural nets,
00:24:45.260 | we're now building models with a huge number of parameters.
00:24:50.260 | So essentially just about all neural net models
00:24:54.880 | that work well,
00:24:57.100 | actually their full loss function
00:25:00.900 | is a regularized loss function.
00:25:03.260 | So for this loss function here of J,
00:25:08.260 | well, this part here is the part that we've seen before
00:25:12.540 | where we're using a softmax classifier
00:25:16.540 | and then taking a negative log likelihood loss,
00:25:18.980 | which we're then averaging over the different examples.
00:25:21.640 | But actually we then stick on the end of it,
00:25:25.380 | this regularization term.
00:25:27.700 | And so this regularization term sums the square
00:25:32.320 | of every parameter in the model.
00:25:34.960 | And so what that effectively says is
00:25:39.260 | you only wanna make parameters non-zero
00:25:44.060 | if they're really useful, right?
00:25:46.600 | So to the extent that the parameters don't help much,
00:25:49.720 | you're just being penalized here by making them non-zero.
00:25:54.220 | But to the extent that the parameters do help,
00:25:56.780 | you'll gain in your estimation of likelihood
00:25:59.500 | and therefore it's okay for them to be non-zero.
00:26:02.560 | In particular, notice that this penalty
00:26:05.680 | is assessed only once per parameter.
00:26:08.300 | It's not being assessed separately for each example.
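Written out (my rendering of the loss being described, with N training examples and f_c the score for class c), the regularized objective is:

```latex
J(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{f_{y_i}}}{\sum_{c=1}^{C} e^{f_c}}
\;+\; \lambda \sum_{k} \theta_k^{2}
```

In PyTorch, an L2-style penalty of this kind is usually applied through the optimizer's weight_decay argument rather than added to the loss by hand.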
00:26:11.960 | Okay, and having this kind of regularization
00:26:16.060 | is essential to build neural net models
00:26:19.700 | that generalize well.
00:26:21.500 | So the classic problem is referred to as overfitting.
00:26:25.220 | And what overfitting means
00:26:27.140 | is that if you have a particular training dataset
00:26:29.940 | and you start training your model,
00:26:32.660 | your error will go down
00:26:35.000 | because you'll shift the parameters
00:26:36.500 | so they better predict the correct answer
00:26:41.260 | for data points in the training data.
00:26:43.260 | And you can keep on doing that
00:26:45.200 | and it will keep on reducing your error rate.
00:26:49.460 | But if you then look at your partially trained classifier
00:26:53.740 | and say, how well does this classifier
00:26:57.140 | classify independent data,
00:27:00.140 | different test data that you weren't training the model on,
00:27:04.160 | what you will find is up until a certain point,
00:27:07.200 | you'll get better at classifying
00:27:10.120 | independent test examples as well.
00:27:12.680 | And after that, commonly what will happen
00:27:16.280 | is you'll actually start to get worse
00:27:18.640 | at classifying independent test examples,
00:27:22.120 | even though you're continuing to get better
00:27:24.120 | at predicting the training examples.
00:27:26.200 | And so this was then referred to
00:27:27.960 | as you're overfitting the training examples,
00:27:31.800 | that you're fiddling the parameters of the model
00:27:34.120 | so that they're really good
00:27:35.160 | at predicting the training examples,
00:27:37.240 | in ways that aren't useful
00:27:39.960 | for then predicting on independent examples
00:27:44.440 | that you come to at runtime.
00:27:47.440 | Okay, that classic view of regularization
00:27:52.200 | is sort of actually outmoded and wrong
00:27:55.680 | for modern neural networks.
00:27:58.920 | So the right way to think of it
00:28:01.880 | for the kind of modern big neural networks that we build
00:28:06.560 | is that overfitting on the training data isn't a problem,
00:28:11.560 | but nevertheless, you need regularization
00:28:17.000 | to make sure that your models generalize well
00:28:20.920 | to independent test data.
00:28:23.520 | So what you would like is for your graph
00:28:26.200 | not to look like this example
00:28:29.900 | with test error starting to head up.
00:28:32.240 | You'd like to have it at worst case flat line
00:28:37.240 | and best case still be gradually dropping.
00:28:40.160 | It'll always be higher than the training error,
00:28:43.080 | but it's not actually showing a failure to generalize.
00:28:47.840 | So when we train big neural nets these days,
00:28:52.800 | our big neural nets always overfit on the training data.
00:28:59.280 | They hugely overfit on the training data.
00:29:02.100 | In fact, in many circumstances,
00:29:04.020 | our neural nets have so many parameters
00:29:06.700 | that you can continue to train them on the training data
00:29:10.020 | until the error on the training data is zero.
00:29:12.700 | They get every single example right
00:29:14.740 | because they can just memorize enough stuff about it
00:29:17.460 | to predict the right answer.
00:29:19.260 | But in general, providing the models are regularized well,
00:29:23.740 | those models will still also generalize well
00:29:27.020 | and predict well on independent data.
00:29:29.340 | And so for part of what we wanna do for that
00:29:33.740 | is to work out how much to regularize.
00:29:36.740 | And so this lambda parameter here
00:29:39.280 | is the strength of regularization.
00:29:41.880 | So if you're making that lambda number big,
00:29:44.640 | you're getting more regularization.
00:29:47.440 | And if you make it small, you're getting less.
00:29:49.960 | And you don't wanna have it be too big
00:29:51.760 | or else you won't fit the data well.
00:29:53.640 | And you don't want it to be too small
00:29:56.300 | or else you have the problem that you don't generalize well.
00:30:01.300 | Okay, so this is classic L2 regularization
00:30:05.160 | and it's a starting point.
00:30:06.920 | But our big neural nets are sufficiently complex
00:30:09.600 | and have sufficiently many parameters
00:30:11.680 | that essentially L2 regularization doesn't cut it.
00:30:15.800 | So the next thing that you should know about
00:30:18.520 | and is a very standard good feature
00:30:22.040 | for building neural nets is a technique called dropout.
00:30:25.980 | So dropout is generally introduced
00:30:30.980 | as a sort of a slightly funny process
00:30:35.180 | that you do when training to avoid feature co-adaptation.
00:30:40.180 | So in dropout, what you do is at the time
00:30:44.820 | that you're training your model,
00:30:47.820 | that for each instance or for each batch in your training,
00:30:55.140 | then for each neuron in the model,
00:30:58.480 | you drop 50% of its inputs.
00:31:01.120 | You just treat them as zero.
00:31:03.400 | And so that you can do by sort of zeroing out elements
00:31:07.280 | of the sort of layers.
00:31:10.460 | And then at test time,
00:31:14.840 | you don't drop any of the model weights,
00:31:17.200 | you keep them all, but actually you halve the model weights
00:31:20.940 | because you're now keeping twice as many things
00:31:23.560 | as you'd used at training time.
00:31:25.320 | And so effectively that little recipe
00:31:30.600 | prevents what's called feature co-adaptation.
00:31:33.840 | So you can't have features that are only useful
00:31:38.840 | in the presence of particular other features
00:31:43.300 | because the model can't guarantee
00:31:45.800 | which features are gonna be present for different examples
00:31:48.760 | because different features are being randomly dropped
00:31:51.320 | all of the time.
00:31:52.740 | And so effectively dropout gives you a kind
00:31:56.040 | of a middle ground between naive Bayes
00:31:58.160 | and a logistic regression model.
00:32:00.000 | In a naive Bayes models,
00:32:01.680 | all the weights are set independently
00:32:03.520 | and a logistic regression model,
00:32:05.440 | all the weights are set in the context of all the others.
00:32:08.360 | And here you are aware of other weights,
00:32:10.800 | but they can randomly disappear from you.
00:32:13.680 | It's also related to ensemble models like model bagging
00:32:17.280 | because you're using different subsets
00:32:19.000 | of the features every time.
00:32:22.200 | But after all of those explanations,
00:32:26.120 | there's actually another way of thinking about dropout
00:32:29.800 | which was actually developed here at Stanford.
00:32:31.760 | This is a paper by Percy Liang and students,
00:32:34.840 | which is to argue that really what dropout gives you
00:32:39.120 | is a strong regularizer that isn't a uniform regularizer
00:32:43.560 | like L2 that regularizes everything with an L2 loss,
00:32:47.040 | but can learn a feature dependent regularization.
00:32:50.160 | And so that dropout has just emerged as in general,
00:32:53.480 | the best way to do regularization for neural nets.
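As a small hedged sketch of what that looks like in PyTorch: nn.Dropout zeroes inputs at training time and rescales the survivors by 1/(1-p) (so-called inverted dropout), which is equivalent to the keep-everything-but-halve-the-weights-at-test-time recipe described above.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)    # drop each input element with probability 0.5
h = torch.ones(4)

drop.train()                # training mode: randomly zeroes elements and
print(drop(h))              # rescales survivors by 1/(1-p), e.g. tensor([2., 0., 2., 2.])

drop.eval()                 # test mode: nothing is dropped, nothing is rescaled
print(drop(h))              # tensor([1., 1., 1., 1.])
```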
00:32:57.920 | I think you've already seen and heard this one,
00:33:00.420 | but just to have it on my slides once.
00:33:04.280 | If you want to have your neural networks go fast,
00:33:09.080 | it's really essential that you make use of vectors,
00:33:12.640 | matrices, tensors, and you don't do things with for loops.
00:33:16.840 | So here's a teeny example where I'm using timeit,
00:33:21.840 | which is a useful thing that you can use too
00:33:24.040 | to see how fast your neural nets run
00:33:26.240 | and different ways of writing that.
00:33:28.840 | And so when I'm doing this,
00:33:31.500 | doing these dot products here,
00:33:36.980 | I can either do the dot product in a for loop
00:33:41.980 | against each word vector,
00:33:44.780 | or I can do the dot product
00:33:46.680 | with a single word vector matrix.
00:33:49.580 | And if I do it in a for loop,
00:33:51.860 | doing each loop takes me almost a second.
00:33:57.960 | Whereas if I do it with a matrix multiply,
00:34:02.960 | it takes me an order of magnitude less time.
00:34:06.080 | So you should always be looking to use vectors
00:34:08.460 | and matrices, not for loops.
00:34:10.520 | And this is a speed up of about 10 times
00:34:13.640 | when you're doing things on a CPU.
00:34:16.360 | Heading forward, we're going to be using GPUs
00:34:19.880 | and they only further exaggerate the advantages
00:34:23.180 | of using vectors and matrices
00:34:25.220 | where you'll commonly get two orders of magnitude speed up
00:34:28.720 | by doing things that way.
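Here is a small reconstruction of that kind of comparison (not the exact code from the slide; the vocabulary size and dimension are made up, and the absolute timings will vary by machine):

```python
import timeit
import numpy as np

V, d = 50_000, 300            # vocabulary size and embedding dimension (illustrative)
W = np.random.rand(V, d)      # word-vector matrix
x = np.random.rand(d)

def loop_dots():
    return [W[i].dot(x) for i in range(V)]   # one dot product per word, in a Python for loop

def matmul():
    return W.dot(x)                          # the same computation as one matrix-vector multiply

print(timeit.timeit(loop_dots, number=10))
print(timeit.timeit(matmul, number=10))      # typically an order of magnitude faster on a CPU
```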
00:34:30.880 | Yeah, so for the backward pass,
00:34:34.160 | you are running a backward pass as before
00:34:39.040 | on the dropped out examples, right?
00:34:42.800 | So for the things that were dropped out,
00:34:45.680 | no gradient is going through them
00:34:47.680 | because they weren't present,
00:34:49.200 | they're not affecting things.
00:34:51.640 | So in a particular batch,
00:34:55.320 | you're only training weights
00:34:57.800 | for the things that aren't dropped out.
00:34:59.680 | But then since you, for each successive batch,
00:35:03.280 | you drop out different things
00:35:05.180 | that over a bunch of batches,
00:35:06.940 | you're then training all of the weights of the model.
00:35:11.280 | And so feature dependent regularizer
00:35:14.680 | is meaning that how much a feature,
00:35:19.680 | the different features can be regularized
00:35:23.680 | different amounts to maximize performance.
00:35:28.680 | So back in this model,
00:35:32.200 | every feature was just sort of being penalized
00:35:36.920 | by taking lambda times that squared value.
00:35:39.720 | So this is sort of uniform regularization
00:35:44.120 | whereas the end result of this dropout-style training
00:35:48.280 | is that you end up with some features
00:35:51.960 | being regularized much more strongly
00:35:54.300 | and some other features being regularized less strongly.
00:35:59.300 | And how much they are regularized
00:36:03.280 | depends on how much they're being used.
00:36:05.240 | So you're regularizing more the features that are being used less,
00:36:09.800 | but I'm not gonna get into the details
00:36:13.040 | of how you can understand that perspective.
00:36:16.240 | That's outside of the context
00:36:19.080 | of what I'm gonna get through right now.
00:36:21.520 | So the final bit is I just wanted to give a little bit
00:36:24.840 | of perspective on nonlinearities in our neural nets.
00:36:29.840 | So the first thing to remember
00:36:33.340 | is you have to have nonlinearities.
00:36:35.760 | So if you're building a multi-layer neural net
00:36:39.240 | and you've just got W1X plus B1
00:36:42.400 | then you put it through W2X plus B2
00:36:45.920 | and then put through W3X,
00:36:47.720 | well, I guess they're different hidden layers.
00:36:52.240 | Sorry, I shouldn't have said X.
00:36:53.240 | They should be hidden one, hidden two, hidden three.
00:36:55.360 | W3, hidden three plus B3.
00:36:59.040 | That multiple linear transformations composed
00:37:03.820 | so they can be just collapsed down
00:37:05.380 | into a single linear transformation.
00:37:07.660 | So you don't get any power as a data representation
00:37:12.660 | by having multiple linear layers.
00:37:17.840 | There's a slightly longer story there
00:37:19.740 | 'cause you actually do get some interesting learning effects
00:37:22.380 | but I'm not gonna talk about that now.
00:37:25.080 | But standardly, we have to have some kind of nonlinearity
00:37:30.080 | to do something interesting in a deep neural network.
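To see the collapse concretely, here is a quick numerical check (with made-up sizes) that two stacked linear layers equal a single linear layer:

```python
import torch

x = torch.randn(5)
W1, b1 = torch.randn(4, 5), torch.randn(4)
W2, b2 = torch.randn(3, 4), torch.randn(3)

two_layers = W2 @ (W1 @ x + b1) + b2            # two affine maps applied in sequence...
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)      # ...equal one affine map with W = W2 W1
print(torch.allclose(two_layers, one_layer))    # True
```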
00:37:35.820 | Okay, so as a starting point,
00:37:39.340 | the most classic nonlinearity is the logistic,
00:37:43.420 | often just called the sigmoid nonlinearity
00:37:46.420 | because of its S shape
00:37:48.620 | which we've seen before in previous lectures.
00:37:51.300 | So this will take any real number
00:37:54.140 | and map it on to the range of zero one.
00:37:57.680 | And that was sort of basically what people used
00:38:02.140 | in sort of 1980s neural nets.
00:38:04.900 | Now, one disadvantage of this nonlinearity
00:38:08.860 | is that it's moving everything into the positive space
00:38:14.580 | because the output is always between zero and one.
00:38:17.480 | So people then decided that for many purposes,
00:38:21.260 | it was useful to have this variant sigmoid shape
00:38:24.980 | of hyperbolic tan,
00:38:27.060 | which is then being shown in the second picture.
00:38:29.880 | Now, logistic and hyperbolic tan,
00:38:34.740 | they sound like they're very different things
00:38:37.040 | but actually, as you maybe remember from a math class,
00:38:40.340 | hyperbolic tan can be represented
00:38:42.540 | in terms of exponentials as well.
00:38:44.780 | And if you do a bit of math,
00:38:46.140 | which possibly we might make you do on an assignment,
00:38:49.620 | it's actually the case that a hyperbolic tan
00:38:52.020 | is just a rescaled and shifted version of the logistics.
00:38:55.860 | So it's really exactly the same curve, just squeezed a bit.
00:38:59.400 | So it goes now symmetrically between minus one and one.
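The bit of math being alluded to is the identity relating the two curves:

```latex
\tanh(x) \;=\; \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \;=\; 2\,\sigma(2x) - 1,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```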
00:39:02.880 | Well, these kinds of transcendental functions
00:39:08.700 | like hyperbolic tan,
00:39:11.000 | they're kind of slow and expensive to compute, right?
00:39:13.740 | Even on our fast computers,
00:39:15.640 | calculating exponentials is a bit slow.
00:39:18.260 | So something people became interested in was,
00:39:21.540 | well, could we do things with much simpler non-linearity?
00:39:25.380 | So what if we used a so-called hard tanh?
00:39:29.340 | So the hard tanh,
00:39:31.260 | up to some point it just flat lines at minus one,
00:39:37.140 | then it is Y equals X up until one,
00:39:42.140 | and then it just flat lines again.
00:39:44.700 | And that seems a slightly weird thing to use
00:39:49.480 | because if your input is over on the left
00:39:52.900 | or over on the right,
00:39:54.300 | you're sort of not getting any discrimination,
00:39:57.660 | with different inputs there giving the same output.
00:40:01.040 | But somewhat surprisingly,
00:40:02.880 | I mean, I was surprised when people started doing this,
00:40:06.380 | these kinds of models proved to be very successful.
00:40:11.660 | And so that then led into what's proven to be kind of
00:40:15.660 | the most successful and generally widely used non-linearity
00:40:19.620 | in a lot of recent deep learning work,
00:40:21.740 | which was what was being used
00:40:24.660 | in the dependency parser model I showed
00:40:28.220 | is what's called the rectified linear unit or ReLU.
00:40:31.860 | So a ReLU is kind of the simplest kind of non-linearity
00:40:35.140 | that you can imagine.
00:40:36.500 | So if the value of X is negative, its value is zero.
00:40:40.600 | So effectively it's just dead,
00:40:42.460 | it's not doing anything in the computation.
00:40:45.020 | And if its value of X is greater than zero,
00:40:50.020 | then it's just simply Y equals X,
00:40:53.000 | the value is being passed through.
00:40:55.460 | And at first sight, this might seem really, really weird
00:41:00.640 | and how could this be useful as a non-linearity?
00:41:04.200 | But if you sort of think a bit
00:41:05.900 | about how you can approximate things
00:41:08.780 | with piecewise linear functions very accurately,
00:41:12.000 | you might kind of start to see how you could use this
00:41:15.180 | to do accurate function approximation
00:41:18.540 | with piecewise linear functions.
00:41:20.780 | And that's what ReLU units have been found
00:41:23.760 | to do extremely, extremely successfully.
00:41:27.500 | So logistic and tanh are still used in various places.
00:41:31.860 | You use logistic when you want a probability output.
00:41:34.880 | We'll see tanh again very soon
00:41:37.460 | when we get to recurrent neural networks.
00:41:40.060 | But they're no longer the default when making deep networks
00:41:43.420 | that in a lot of places,
00:41:44.780 | the first thing you should think about trying
00:41:47.060 | is ReLU non-linearities.
00:41:49.180 | And so in particular,
00:41:51.180 | that part of why they're good
00:41:55.940 | is that ReLU networks train very quickly
00:42:00.740 | because you get this sort of very straightforward
00:42:03.340 | gradient backflow: provided you're
00:42:05.380 | on the right-hand side of it,
00:42:08.260 | you just get this sort of constant gradient backflow
00:42:11.700 | from the slope of one.
00:42:13.100 | And so they train very quickly.
00:42:15.120 | The somewhat surprising fact is that sort of
00:42:19.300 | almost the simplest non-linearity imaginable
00:42:22.300 | is still enough to have a very good neural network,
00:42:26.300 | but it just is.
00:42:27.940 | People have played around with variants of that.
00:42:30.820 | So people have then played around with leaky ReLUs
00:42:34.100 | where rather than the left-hand side
00:42:37.620 | just going completely to zero,
00:42:39.580 | it goes slightly negative on a much shallower slope.
00:42:44.580 | And then there's been a parametric ReLU
00:42:48.060 | where you have an extra parameter
00:42:49.980 | where you learn the slope of the negative part.
00:42:52.480 | Another thing that's been used recently
00:42:56.340 | is this swish non-linearity,
00:42:58.180 | which looks almost like a ReLU,
00:43:01.300 | but it sort of curves down just a little bit there
00:43:04.780 | and starts to go up.
00:43:06.180 | I mean, I think it's fair to say that, you know,
00:43:09.060 | none of these have really proven themselves vastly superior.
00:43:12.580 | There are papers saying I can get better results
00:43:15.020 | by using one of these and maybe you can,
00:43:17.780 | but you know, it's not night and day
00:43:20.020 | and a vast majority of work that you see around
00:43:22.860 | is still just using ReLUs in many places.
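For reference, the non-linearities just mentioned can each be written in a line or two (the 0.01 leak is a common default rather than a value from the lecture, and this swish form is the simple sigmoid-weighted version):

```python
import torch

def relu(x):
    return torch.clamp(x, min=0.0)             # 0 for x < 0, x otherwise

def leaky_relu(x, slope=0.01):
    return torch.where(x > 0, x, slope * x)    # small negative slope instead of a hard zero

def swish(x):
    return x * torch.sigmoid(x)                # dips slightly below zero, then grows like x
```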
00:43:26.120 | Okay, a couple more things.
00:43:31.260 | Parameter initialization.
00:43:33.580 | So in almost all cases,
00:43:36.700 | you must, must, must initialize the matrices
00:43:41.700 | of your neural nets with small random values.
00:43:46.660 | Neural nets just don't work
00:43:48.620 | if you start the matrices off as zero
00:43:51.660 | 'cause effectively then everything is symmetric,
00:43:56.060 | nothing can specialize in different ways
00:43:59.900 | and you then get sort of,
00:44:02.100 | you just don't have an ability for a neural net to learn.
00:44:07.060 | You sort of get this defective solution.
00:44:09.940 | So standardly you're using some methods
00:44:13.820 | such as drawing random numbers uniformly
00:44:17.380 | between minus R and R for a small value R
00:44:20.980 | and just filling in all the parameters with that.
00:44:24.340 | Exception is with bias weights.
00:44:26.820 | It's fine to set bias weights to zero
00:44:29.140 | and in some sense that's better.
00:44:30.980 | In terms of choosing what the R value is,
00:44:37.300 | essentially for traditional neural nets,
00:44:40.940 | what we want to set that R range for
00:44:44.220 | is so that the numbers in our neural network
00:44:47.460 | stay of a reasonable size.
00:44:49.700 | They don't get too big and they don't get too small.
00:44:53.540 | And whether they kind of blow up or not
00:44:56.940 | depends on how many connections there are
00:44:59.940 | in the neural networks.
00:45:01.140 | I'm looking at the fan in and fan out
00:45:03.860 | of connections in the neural network.
00:45:06.860 | And so a very common initialization
00:45:11.860 | that you'll see in PyTorch
00:45:13.940 | is what's called Xavier initialization,
00:45:16.780 | named after the person who suggested that.
00:45:20.580 | And it's working out a value of R
00:45:23.940 | based on this fan in and fan out of the layers,
00:45:28.940 | but you can just sort of ask for it,
00:45:30.940 | say initialize with this initialization and it will.
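In PyTorch that looks something like the following (the layer sizes are illustrative):

```python
import torch.nn as nn

layer = nn.Linear(200, 50)              # illustrative fan-in and fan-out
nn.init.xavier_uniform_(layer.weight)   # R is computed from the layer's fan-in and fan-out
nn.init.zeros_(layer.bias)              # bias weights can simply start at zero
```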
00:45:34.820 | This is another area where there have been
00:45:36.700 | some subsequent developments.
00:45:39.060 | So around week five,
00:45:42.220 | we'll start talking about layer normalization.
00:45:44.540 | And if you're using layer normalization,
00:45:46.740 | then it sort of doesn't matter the same
00:45:48.380 | how you initialize the weights.
00:45:52.140 | So finally, we have to train our models.
00:45:56.220 | And I've briefly introduced the idea
00:45:58.940 | of stochastic gradient descent.
00:46:01.060 | And the good news is that most of the time
00:46:05.220 | training your networks
00:46:07.220 | with stochastic gradient descent works just fine;
00:46:11.620 | use it and you will get good results.
00:46:14.580 | However, often that requires
00:46:19.500 | choosing a suitable learning rate,
00:46:21.340 | which is my final slide of tips on the next slide.
00:46:25.220 | But there's been an enormous amount of work
00:46:28.060 | on optimization of neural networks
00:46:31.460 | and people have come up with a whole series
00:46:34.300 | of more sophisticated optimizers.
00:46:38.060 | And I'm not gonna get into the details
00:46:40.260 | of optimization in this class,
00:46:42.220 | but the very loose idea is that these optimizers
00:46:46.340 | are adaptive in that they can kind of keep track
00:46:50.140 | of how much slope there was,
00:46:52.500 | how much gradient there is for different parameters.
00:46:56.140 | And therefore based on that,
00:46:58.020 | make decisions as to how much to adjust the weights
00:47:01.820 | when doing the gradient update
00:47:03.860 | rather than adjusting it by a constant amount.
00:47:06.460 | And so in that family of methods,
00:47:09.260 | there are methods that include AdaGrad, RMSProp, Adam,
00:47:13.940 | and then variants of Adam,
00:47:15.700 | including SparseAdam, AdamW, et cetera.
00:47:20.460 | The one called Adam is a pretty good place to start.
00:47:24.660 | And a lot of the time that's a good one to use.
00:47:27.340 | And again, from the perspective of PyTorch,
00:47:29.980 | when you're initializing an optimizer,
00:47:32.420 | you can just say, please use Adam
00:47:34.620 | and you don't actually need to know
00:47:36.740 | much more about it than that.
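A minimal sketch of that in PyTorch (the model and learning rate here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(200, 3)                                   # stand-in for your real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam with a typical initial learning rate

loss = model(torch.randn(8, 200)).sum()                     # dummy forward pass and loss
optimizer.zero_grad()
loss.backward()
optimizer.step()                                            # one adaptive parameter update
```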
00:47:38.660 | If you are using simple stochastic gradient descent,
00:47:44.700 | you have to choose a learning rate.
00:47:47.100 | So that was the eta value that you multiplied
00:47:50.980 | the gradient by for how much to adjust the weights.
00:47:54.500 | And so I talked about that slightly,
00:47:57.020 | how you didn't want it to be too big
00:47:59.580 | or your model could diverge or bounce around.
00:48:02.740 | You didn't want it to be too small
00:48:04.940 | or else training could take place exceedingly slowly
00:48:09.780 | and you'll miss the assignment deadline.
00:48:11.780 | How big it should be depends on all sorts of details
00:48:16.300 | of the model.
00:48:17.220 | And so you sort of wanna try out
00:48:18.820 | some different order of magnitude numbers
00:48:22.940 | to see what numbers seem to work well
00:48:25.820 | for training stably but reasonably quickly.
00:48:28.940 | Something around 10 to the minus three
00:48:30.740 | or 10 to the minus four isn't a crazy place to start.
00:48:34.580 | In principle, you can do fine
00:48:36.220 | just using a constant learning rate in SGD.
00:48:39.500 | In practice, people generally find
00:48:42.220 | they can get better results by decreasing learning rates
00:48:45.980 | as you train.
00:48:47.220 | So a very common recipe is that you halve the learning rate
00:48:51.660 | after every K epochs, where an epoch means
00:48:55.060 | that you've made a pass through
00:48:56.660 | the entire set of training data.
00:48:59.060 | So perhaps something like every three epochs
00:49:01.700 | you halve the learning rate.
00:49:03.620 | And a final little note there in purple
00:49:07.540 | is when you make a pass through the data,
00:49:09.300 | you don't wanna go through the data items
00:49:11.620 | in the same order each time
00:49:13.980 | 'cause that leads to just kind of a sort of patterning
00:49:18.620 | of the training examples, and the model
00:49:21.940 | will sort of fall into the periodicity of those patterns.
00:49:24.980 | So it's best to shuffle the data
00:49:27.380 | before each pass through it.
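A sketch of that recipe in PyTorch, assuming a simple classification setup (the data, model, and the choice of halving every 3 epochs are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(100, 10), torch.randint(0, 3, (100,)))
loader = DataLoader(data, batch_size=16, shuffle=True)      # reshuffles the examples every epoch

model = torch.nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)   # halve lr every 3 epochs

for epoch in range(9):
    for x, y in loader:                                     # one epoch = one pass over the data
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                        # decay the learning rate at the epoch boundary
```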
00:49:29.020 | Okay, there are more sophisticated ways
00:49:32.260 | to set learning rates.
00:49:34.900 | And I won't really get into those now.
00:49:38.420 | Fancier optimizers like Adam also have a learning rate.
00:49:42.100 | So you still have to choose a learning rate value,
00:49:45.180 | but it's effectively, it's an initial learning rate,
00:49:47.900 | which typically the optimizer shrinks as it runs.
00:49:51.060 | And so you commonly want to have the number
00:49:53.660 | it starts off with be on the larger side
00:49:56.180 | because it'll be shrinking as it goes.
00:49:58.940 | Okay, so that's all by way of introduction.
00:50:04.060 | And I'm now ready to start on language models and RNNs.
00:50:08.020 | So what is language modeling?
00:50:10.260 | I mean, as two words of English,
00:50:12.020 | language modeling could mean just about anything,
00:50:15.220 | but in the natural language processing literature,
00:50:18.860 | language modeling has a very precise technical definition,
00:50:23.060 | which you should know.
00:50:24.380 | So language modeling is the task
00:50:27.500 | of predicting the word that comes next.
00:50:30.860 | So if you have some context, like "the students opened their",
00:50:37.180 | you want to be able to predict what words will come next.
00:50:40.740 | Is it their books, their laptops, their exams, their minds?
00:50:45.740 | And so in particular, what you want to be doing
00:50:51.060 | is being able to give a probability
00:50:54.100 | that different words will occur in this context.
00:50:58.460 | So a language model is a probability distribution
00:51:02.980 | over next words given a preceding context.
00:51:07.460 | And a system that does that is called a language model.
00:51:13.740 | So as a result of that,
00:51:17.780 | you can also think of a language model
00:51:19.820 | as a system that assigns a probability score
00:51:23.020 | to a piece of text.
00:51:24.460 | So if we have a piece of text,
00:51:26.660 | then we can just work out its probability
00:51:29.020 | according to a language model.
00:51:30.820 | So the probability of a sequence of tokens,
00:51:33.580 | we can decompose via the chain rule,
00:51:36.860 | probability of the first times probability of the second
00:51:39.940 | given the first, et cetera, et cetera.
00:51:42.020 | And then we can work that out
00:51:44.340 | using what our language model provides
00:51:46.980 | as a product of each probability
00:51:49.340 | of predicting the next word.
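Written out (using the x^(t) notation for the t-th word), a language model gives a distribution over next words,

$$P\big(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\big),$$

and the probability it assigns to a whole piece of text decomposes by the chain rule as

$$P\big(x^{(1)}, \ldots, x^{(T)}\big) = P\big(x^{(1)}\big)\, P\big(x^{(2)} \mid x^{(1)}\big) \cdots P\big(x^{(T)} \mid x^{(T-1)}, \ldots, x^{(1)}\big) = \prod_{t=1}^{T} P\big(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)}\big).$$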
00:51:50.980 | Okay, language models are really the cornerstone
00:51:57.500 | of human language technology.
00:52:00.620 | Everything that you do with computers
00:52:04.100 | that involves human language,
00:52:06.380 | you are using language models.
00:52:09.300 | So when you're using your phone
00:52:11.780 | and it's suggesting, whether well or badly,
00:52:14.740 | what the next word you probably want to type is,
00:52:17.580 | that's a language model working
00:52:19.900 | to try and predict the likely next words.
00:52:23.300 | When the same thing happens in a Google Doc
00:52:26.140 | and it's suggesting a next word or a next few words,
00:52:29.460 | that's a language model.
00:52:31.140 | The main reason why the one in Google Docs
00:52:35.580 | works much better than the one on your phone
00:52:38.060 | is that the phone keyboard models
00:52:40.660 | have to be very compact
00:52:42.940 | so they can run quickly and not use much memory.
00:52:46.060 | So they're sort of only mediocre language models
00:52:49.020 | whereas something like Google Docs
00:52:50.780 | can do a much better language modeling job.
00:52:53.460 | Query completion, same thing, there's a language model.
00:52:59.260 | And so then the question is,
00:53:00.860 | well, how do we build language models?
00:53:04.980 | And so I briefly wanted to first again,
00:53:08.660 | give the traditional answer
00:53:11.300 | since you should have at least some understanding
00:53:13.660 | of how NLP was done without a neural network.
00:53:17.740 | And the traditional answer that powered speech recognition
00:53:22.340 | and other applications for at least two decades,
00:53:25.540 | three decades really,
00:53:27.420 | was what were called N-gram language models.
00:53:30.420 | And these were very simple, but still quite effective idea.
00:53:34.940 | So we want to give probabilities of next words.
00:53:38.100 | So what we're gonna work with
00:53:42.500 | is what are referred to as N-grams.
00:53:44.780 | And so an N-gram is just a chunk of N consecutive words,
00:53:49.580 | which are usually referred to as unigrams, bigrams,
00:53:52.340 | trigrams, and then four-grams and five-grams.
00:53:55.980 | A horrible set of names, which would offend any humanist,
00:54:00.700 | but that's what people normally say.
00:54:02.740 | And so effectively what we do is just collect statistics
00:54:07.980 | about how often different N-grams occur
00:54:11.420 | in a large amount of text,
00:54:13.180 | and then use those to build a probability model.
00:54:16.060 | So the first thing we do
00:54:18.940 | is what's referred to as making a Markov assumption.
00:54:21.900 | So these are also referred to as Markov models.
00:54:24.820 | And we decide that the word in position T plus one
00:54:29.380 | only depends on the preceding N minus one words.
00:54:33.380 | So if we want to predict the word at position T plus one
00:54:38.820 | given the entire preceding text,
00:54:41.980 | we actually throw away the early words
00:54:45.100 | and just use the preceding N minus one words as context.
00:54:49.540 | Well, once we've made that simplification,
00:54:52.260 | we can then just use the definition
00:54:54.100 | of conditional probability and say,
00:54:56.380 | all that conditional probability
00:55:01.940 | is the probability of the N words
00:55:06.940 | divided by the probability of the preceding N minus one words.
00:55:06.940 | And so we have the probability of an N-gram
00:55:09.500 | over the probability of an N minus one gram.
00:55:12.700 | And so then how do we get these N-gram
00:55:15.900 | and N minus one gram probabilities?
00:55:18.140 | We simply take a larger amount of text in some language
00:55:22.460 | and we count how often the different N-grams occur.
00:55:26.100 | And so our crude statistical approximation
00:55:30.660 | starts off as the count of the N-gram
00:55:33.740 | over the count of the N minus one gram.
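In equation form, the Markov assumption and the counting estimate just described are

$$P\big(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\big) \approx P\big(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(t-n+2)}\big) = \frac{P\big(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)}\big)}{P\big(x^{(t)}, \ldots, x^{(t-n+2)}\big)} \approx \frac{\operatorname{count}\big(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)}\big)}{\operatorname{count}\big(x^{(t)}, \ldots, x^{(t-n+2)}\big)}.$$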
00:55:37.340 | So here's an example of that.
00:55:38.980 | Suppose we are learning a four-gram language model.
00:55:42.020 | Okay, so we throw away all words
00:55:44.540 | apart from the last three words,
00:55:47.020 | and they're our conditioning.
00:55:49.020 | We look in some large,
00:55:54.020 | we use the counts from some large training corpus
00:55:57.700 | and we see how often did 'students opened their books' occur?
00:56:01.820 | How often did 'students opened their minds' occur?
00:56:04.900 | And then for each of those counts,
00:56:07.020 | we divide through by the count
00:56:08.740 | of how often 'students opened their' occurred.
00:56:11.420 | And that gives us our probability estimates.
00:56:15.980 | So for example, if in the corpus 'students opened their'
00:56:19.500 | occurred a thousand times,
00:56:22.060 | and 'students opened their books' occurred 400 times,
00:56:25.660 | we'd get a probability estimate of 0.4 for books.
00:56:29.140 | If exams occurred a hundred times,
00:56:30.740 | we'd get 0.1 for exams.
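Worked out, that example is

$$P(\text{books} \mid \text{students opened their}) = \frac{400}{1000} = 0.4, \qquad P(\text{exams} \mid \text{students opened their}) = \frac{100}{1000} = 0.1.$$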
00:56:33.540 | And we sort of see here already the disadvantage
00:56:37.260 | of having made the Markov assumption
00:56:39.580 | and have gotten rid of all of this earlier context,
00:56:43.620 | which would have been useful for helping us to predict.
00:56:46.460 | The one other point that I'll just mention,
00:56:52.140 | which I confuse myself on sometimes,
00:56:54.140 | is the naming of the N-gram language model.
00:56:58.300 | So for a four-gram language model,
00:57:00.940 | it's called a four-gram language model
00:57:03.340 | because in its estimation,
00:57:05.460 | you're using four grams in the numerator
00:57:08.340 | and trigrams in the denominator.
00:57:11.460 | So you use the size of the numerator.
00:57:14.940 | So that terminology is different to the terminology
00:57:18.900 | that's used in Markov models.
00:57:21.020 | So when people talk about the order of a Markov model,
00:57:24.700 | that refers to the amount of context you're using.
00:57:28.220 | So this would correspond to a third order Markov model.
00:57:32.460 | Yeah, so someone said,
00:57:36.260 | "Is this similar to a Naive Bayes model?"
00:57:40.780 | Sort of.
00:57:42.260 | Naive Bayes models,
00:57:43.380 | you also estimate the probabilities just by counting.
00:57:46.860 | So they're related
00:57:49.980 | and there are sort of in some sense, two differences.
00:57:54.740 | The first difference or specialization
00:57:59.340 | is that Naive Bayes models
00:58:02.300 | work out probabilities of words
00:58:05.980 | independent of their neighbors.
00:58:07.620 | So in that respect, a Naive Bayes language model
00:58:12.100 | is a unigram language model.
00:58:14.260 | So you're just using the counts of individual words.
00:58:17.340 | But the other part of a Naive Bayes model
00:58:20.340 | is you're learning a different set of unigram counts
00:58:24.500 | for every class for your classifier.
00:58:27.820 | And so effectively,
00:58:33.020 | with a Naive Bayes model
00:58:35.380 | you've got class-specific unigram language models.
00:58:40.380 | Okay, I gave this as a simple statistical model
00:58:48.140 | for estimating your probabilities with an N-gram model.
00:58:52.140 | You can't actually get away with just doing that
00:58:54.940 | because you have sparsity problems.
00:58:57.380 | So, often it'll be the case that for many words,
00:59:01.460 | 'students opened their books'
00:59:03.140 | or students opened their backpacks
00:59:07.180 | just never occurred in the training data.
00:59:09.620 | If you think about it,
00:59:10.700 | if you have something like 10 to the fifth
00:59:13.740 | different words,
00:59:15.380 | and you want probabilities for a sequence of four words,
00:59:18.380 | with 10 to the fifth choices for each,
00:59:20.820 | there are sort of 10 to the 20th different combinations.
00:59:24.260 | So unless you're seeing
00:59:25.620 | and it's truly astronomical amount of data,
00:59:28.780 | most four word sequences you've never seen.
00:59:31.700 | So then your numerator will be zero
00:59:34.380 | and your probability estimate will be zero.
00:59:36.700 | And so that's bad.
00:59:38.060 | And so the commonest way of solving that
00:59:40.220 | is just to add a little Delta to every count
00:59:43.100 | and then everything is non-zero and that's called smoothing.
00:59:46.580 | But well, sometimes it's worse than that
00:59:49.820 | 'cause sometimes you won't even have seen
00:59:51.820 | 'students opened their', and that's more problematic
00:59:55.020 | 'cause that means our denominator is zero.
00:59:58.740 | And so the division will be ill-defined
01:00:01.740 | and we can't usefully calculate any probabilities
01:00:05.020 | in a context that we've never seen.
01:00:07.340 | And so the standard solution to that
01:00:09.580 | is to shorten the context and that's called back off.
01:00:13.420 | So we condition only on 'opened their',
01:00:16.580 | or if we still haven't seen 'opened their',
01:00:20.100 | we'll condition only on 'their',
01:00:22.460 | or we could just forget all conditioning
01:00:25.220 | and actually use a unigram model for our probabilities.
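A minimal sketch of those two fixes, add-delta smoothing and backing off to shorter contexts (the `counts` dictionary, `vocab` list, and delta value are assumptions for illustration; real systems use more carefully weighted backoff schemes):

```python
# Sketch only. Assumption: `counts` maps word tuples of any length to their
# corpus counts, and `vocab` is the list of words in the vocabulary.
def smoothed_prob(word, context, counts, vocab, delta=0.01):
    """Add-delta smoothed estimate of P(word | context)."""
    num = counts.get(context + (word,), 0) + delta
    den = counts.get(context, 0) + delta * len(vocab)
    return num / den

def backoff_prob(word, context, counts, vocab):
    """If the full context was never seen, back off to a shorter context."""
    while context and counts.get(context, 0) == 0:
        context = context[1:]  # drop the earliest conditioning word
    return smoothed_prob(word, context, counts, vocab)
```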
01:00:28.540 | Yeah.
01:00:31.500 | And so as you increase the order N
01:00:37.100 | of the N-gram language model,
01:00:38.740 | these sparsity problems become worse and worse.
01:00:42.180 | So in the early days,
01:00:43.660 | people normally worked with trigram models.
01:00:46.300 | As it became easier to collect billions of words of text,
01:00:50.540 | people commonly moved to five-gram models.
01:00:53.420 | But every time you go up an order of conditioning,
01:00:58.420 | you effectively need to be collecting
01:01:01.580 | orders of magnitude more data
01:01:03.340 | because of the size of the vocabularies of human languages.
01:01:07.060 | There's also a problem that these models are huge.
01:01:13.700 | So you basically have to be storing counts
01:01:17.460 | of all of these words sequences
01:01:19.500 | so you can work out these probabilities.
01:01:22.100 | And I mean, that's actually had a big effect
01:01:24.340 | in terms of what technology is available.
01:01:27.620 | So in the 2000s decade, up till about,
01:01:32.620 | whenever it was, 2014, Google Translate was already
01:01:37.620 | using probabilistic models
01:01:40.940 | that included language models
01:01:43.220 | of the N-gram sort.
01:01:45.260 | But the only way they could possibly be run
01:01:47.980 | is in the cloud
01:01:49.620 | because you needed to have these huge tables
01:01:52.980 | of probabilities.
01:01:54.660 | But now we have neural nets
01:01:55.980 | and you can have Google Translate
01:01:57.820 | just actually run on your phone.
01:01:59.780 | And that's possible because neural net models
01:02:03.620 | can be massively more compact
01:02:05.700 | than these old N-gram language models.
01:02:08.380 | Yeah.
01:02:13.420 | But nevertheless, before we get onto the neural models,
01:02:16.860 | let's just sort of look at the example of how these work.
01:02:21.860 | So it's trivial to train an N-gram language model
01:02:27.860 | 'cause you really just count how often
01:02:29.900 | word sequences occur in a corpus and you're ready to go.
01:02:33.260 | So these models can be trained in seconds.
01:02:35.460 | That's really good.
01:02:36.540 | That's not like sitting around for training neural networks.
01:02:39.540 | So if I train on my laptop,
01:02:42.780 | a small language model on,
01:02:45.500 | you know, about 1.7 million words as a trigram model,
01:02:50.500 | I can then ask it to generate text.
01:02:53.420 | If I give it a couple of words, say 'today the',
01:02:55.900 | I can then get it to sort of suggest a word
01:02:59.380 | that might come next.
01:03:00.860 | And the way I do that is the language model
01:03:03.780 | knows the probability distribution
01:03:06.180 | of things that can come next.
01:03:08.820 | Now there's a kind of a crude probability distribution.
01:03:12.260 | I mean, 'cause effectively
01:03:14.220 | over this relatively small corpus,
01:03:16.740 | there were things that occurred once, Italian and Emirate.
01:03:20.220 | There are things that occurred twice, price.
01:03:22.540 | There were things that occurred four times, company and bank.
01:03:26.420 | It's sort of fairly crude and rough,
01:03:28.580 | but I nevertheless get probability estimates.
01:03:31.460 | I can then say, okay, based on this,
01:03:36.380 | let's take this probability distribution
01:03:39.660 | and then we'll just sample the next word.
01:03:41.980 | So the two most likely words to sample,
01:03:44.380 | a company or bank, but we're rolling the dice
01:03:48.140 | and we might get any of the words that had come next.
01:03:50.740 | So maybe I sample price.
01:03:53.860 | Now I'll condition on price, so 'the price',
01:03:57.740 | and look up the probability distribution
01:04:00.500 | of what comes next.
01:04:01.940 | The most likely thing is of.
01:04:03.940 | And so again, I'll sample,
01:04:05.700 | and maybe this time I'll pick up of.
01:04:09.300 | And then I will now condition on price of,
01:04:12.980 | and I will look up the probability distribution
01:04:16.140 | of words following that.
01:04:18.140 | And I get this probability distribution
01:04:21.060 | and I'll sample randomly some word from it.
01:04:24.380 | And maybe this time I'll sample a rare,
01:04:27.460 | but possible one like gold.
01:04:29.740 | And I can keep on going
01:04:31.820 | and I'll get out something like this.
01:04:34.220 | Today, the price of gold per tonne
01:04:36.580 | while production of shoe lasts and shoe industry,
01:04:39.660 | the bank intervened just after it considered
01:04:41.860 | and rejected an IMF demand
01:04:43.820 | to rebuild depleted European stocks,
01:04:46.500 | SEP 30 N primary 76 cents a share.
01:04:50.220 | So what just a simple trigram model can produce
01:04:54.500 | over not very much text
01:04:56.620 | is actually already kind of interesting.
01:04:59.180 | Like it's actually surprisingly grammatical, right?
01:05:02.220 | There are whole pieces of it
01:05:03.740 | while production of shoe lasts and shoe industry,
01:05:07.020 | the bank intervened just after it considered
01:05:09.620 | and rejected an IMF demand, right?
01:05:11.540 | It's really actually pretty good grammatical text.
01:05:14.420 | So it's sort of amazing that these simple N-gram models
01:05:17.980 | actually can model a lot of human language.
01:05:21.940 | On the other hand, it's not a very good piece of text.
01:05:25.060 | It's completely incoherent and makes no sense.
01:05:29.180 | And so to actually be able to generate text
01:05:33.060 | that seems like it makes sense,
01:05:35.500 | we're going to need a considerably better language model.
01:05:38.900 | And that's precisely what neural language models
01:05:42.540 | have allowed us to build as we'll see later.
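For concreteness, here is a small sketch of the counting-and-sampling process just described, in Python (the toy corpus and names are made up for illustration):

```python
import random
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def generate(counts, w1, w2, length=20):
    """Repeatedly sample the next word from the conditional distribution."""
    out = [w1, w2]
    for _ in range(length):
        dist = counts.get((out[-2], out[-1]))
        if not dist:
            break
        words, freqs = zip(*dist.items())
        out.append(random.choices(words, weights=freqs)[0])  # roll the dice
    return " ".join(out)

corpus = "today the price of gold fell while the price of oil rose today".split()
model = train_trigram(corpus)
print(generate(model, "today", "the"))
```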
01:05:45.060 | Okay, so how can we build a neural language model?
01:05:50.380 | And so first of all, we're gonna do a simple one
01:05:54.060 | and then we'll see where we get,
01:05:56.300 | but moving on to recurrent neural nets
01:05:58.500 | might still take us into next time.
01:06:02.420 | So we're gonna have an input sequence of words
01:06:05.540 | and we want a probability distribution over the next word.
01:06:10.140 | Well, the simplest thing that we could try is to say,
01:06:13.700 | well, kind of the only tool we have so far
01:06:17.420 | is a window-based classifier.
01:06:19.740 | So what we can say, what we'd done previously
01:06:24.140 | either for our named entity recognized in lecture three
01:06:28.660 | or what I just showed you for the dependency parser
01:06:31.340 | is we have some context window,
01:06:33.900 | we put it through a neural net
01:06:35.660 | and we predict something as a classifier.
01:06:38.380 | So before we were predicting a location,
01:06:41.900 | but maybe instead we could reuse exactly
01:06:45.580 | the same technology and say,
01:06:48.460 | we're going to have a window-based classifier.
01:06:51.060 | So we're discarding the further away words
01:06:53.620 | just like in an N-gram language model,
01:06:57.820 | but we'll feed this fixed window into a neural net.
01:07:02.100 | So we concatenate the word embeddings,
01:07:05.540 | we put it through a hidden layer
01:07:07.660 | and then we have a softmax classifier over our vocabulary.
01:07:12.660 | And so now rather than predicting something like location
01:07:18.540 | or left arc in the dependency parser,
01:07:21.900 | we're going to have a softmax over the entire vocabulary
01:07:25.780 | sort of like we did with the skip-gram negative sampling
01:07:29.380 | model in the first two lectures.
01:07:31.900 | And so we're going to see this choice
01:07:35.100 | as predicting what word comes next,
01:07:37.820 | whether it produces laptops, minds, books, et cetera.
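A sketch of that fixed-window neural classifier in PyTorch (the dimensions, window size, and names here are illustrative assumptions, not the exact published model): concatenate the window's embeddings, pass them through one hidden layer, and score every word in the vocabulary.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, window=4, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.W = nn.Linear(window * emb_dim, hidden)  # separate weights per window position
        self.U = nn.Linear(hidden, vocab_size)

    def forward(self, window_ids):                    # window_ids: (batch, window)
        e = self.emb(window_ids).flatten(1)           # concatenate the window's embeddings
        h = torch.tanh(self.W(e))                     # hidden layer
        return self.U(h)                              # logits; softmax applied by the loss

lm = FixedWindowLM(vocab_size=10_000)
logits = lm(torch.randint(0, 10_000, (8, 4)))         # scores over the next word
```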
01:07:42.100 | Okay, so this is a fairly simple
01:07:46.540 | fixed window neural net classifier,
01:07:49.100 | but this is essentially a famous early model
01:07:55.540 | in the use of neural nets for NLP applications.
01:08:00.100 | So first the 2000 conference paper
01:08:02.580 | and then a somewhat later journal paper,
01:08:05.940 | Yoshua Bengio and colleagues introduced precisely
01:08:09.740 | this model as the neural probabilistic language model.
01:08:14.180 | And they were already able to show
01:08:17.140 | that this could give interesting good results
01:08:20.220 | for language modeling.
01:08:21.780 | And so it wasn't a great solution
01:08:25.300 | for neural language modeling, but it still had value.
01:08:29.220 | So it didn't solve the problem of allowing us
01:08:32.380 | to have bigger context to predict
01:08:34.780 | what words are going to come next.
01:08:36.980 | It's in that way limited exactly
01:08:40.660 | like an N-gram language model is,
01:08:43.260 | but it does have all the advantages
01:08:45.540 | of distributed representations.
01:08:47.900 | So rather than having these counts for word sequences
01:08:52.700 | that are very sparse and very crude,
01:08:56.620 | we can use distributed representations of words,
01:09:00.820 | which then make predictions
01:09:03.100 | that semantically similar words
01:09:05.260 | should give similar probability distribution.
01:09:09.020 | So the idea of that is if we'd use some other word here,
01:09:13.340 | like maybe 'the pupils opened their',
01:09:16.980 | well, maybe in our training data,
01:09:19.180 | we'd seen sentences about students,
01:09:21.860 | but we'd never seen sentences about pupils.
01:09:24.580 | An N-gram language model then would sort of have no idea
01:09:27.700 | what probabilities to use,
01:09:29.660 | whereas a neural language model can say,
01:09:32.900 | well, pupils is kind of similar to students.
01:09:35.500 | Therefore I can predict similarly
01:09:37.300 | to what I would have predicted for students.
01:09:39.500 | Okay, so there's now no sparsity problem.
01:09:45.460 | We don't need to store billions of N-gram counts.
01:09:50.460 | We simply need to store our word vectors
01:09:56.220 | and our W and U matrices,
01:09:59.340 | but we still have the remaining problems
01:10:01.660 | that our fixed window is too small.
01:10:04.700 | We can try and make the window larger.
01:10:07.660 | If we do that, the W matrix gets bigger,
01:10:11.380 | but that also points out another problem with this model.
01:10:15.580 | Not only can the window never be large enough,
01:10:18.900 | but W is just a trained matrix.
01:10:23.180 | And so therefore we're learning completely different weights
01:10:27.500 | for each position of context,
01:10:29.540 | the word minus one position, the word minus two,
01:10:32.340 | the word minus three, and the word minus four,
01:10:35.260 | so that there's no sharing in the model
01:10:37.940 | as to how it treats words in different positions,
01:10:42.940 | even though in some sense
01:10:45.460 | they will contribute semantic components
01:10:48.260 | that are at least somewhat position independent.
01:10:52.580 | So again, for those of,
01:10:54.140 | if you sort of think back to either a naive Bayes model
01:10:57.300 | or what we saw with the word2vec model at the beginning,
01:11:01.740 | the word2vec model or a naive Bayes model
01:11:04.180 | completely ignores word order.
01:11:06.500 | So it has one set of parameters
01:11:08.540 | regardless of what position things occur in.
01:11:11.300 | That doesn't work well for language modeling
01:11:13.820 | 'cause word order is really important in language modeling.
01:11:16.900 | If the last word is 'the', that's a really good predictor
01:11:20.260 | of there being an adjective or noun following,
01:11:22.660 | whereas if the word four back is 'the',
01:11:24.940 | it doesn't give you the same information.
01:11:28.180 | So you do wanna somewhat make use of word order,
01:11:33.180 | but this model is at the opposite extreme
01:11:36.940 | that each position is being modeled
01:11:39.100 | completely independently.
01:11:41.660 | So what we'd like to have is a neural architecture
01:11:46.100 | that can process an arbitrary amount of context
01:11:51.100 | and have more sharing of the parameters
01:11:54.460 | while still be sensitive to proximity.
01:11:58.500 | And so that's the idea of recurrent neural networks.
01:12:02.340 | And I'll say about five minutes about these today,
01:12:06.380 | and then next time we'll return
01:12:07.900 | and do more about recurrent neural networks.
01:12:11.900 | So for the recurrent neural network,
01:12:14.940 | rather than having a single hidden layer
01:12:19.340 | inside our classifier here that we compute each time,
01:12:24.340 | for the recurrent neural network,
01:12:26.980 | we have the hidden layer,
01:12:29.100 | which often is referred to as the hidden state,
01:12:32.420 | but we maintain it over time
01:12:35.300 | and we feed it back into itself.
01:12:38.180 | So that's what the word recurrent is meaning,
01:12:40.620 | that you're sort of feeding the hidden layer
01:12:43.540 | back into itself.
01:12:45.300 | So what we do is based on the first word,
01:12:50.300 | we compute a hidden representation kind of like before,
01:12:55.180 | which can be used to predict the next word.
01:12:58.820 | But then for when we want to predict
01:13:02.500 | what comes after the second word,
01:13:04.900 | we not only feed in the second word,
01:13:07.500 | we feed in the hidden layer from the previous word
01:13:12.500 | to have it help predict the hidden layer
01:13:16.700 | above the second word.
01:13:18.420 | And so formally the way we're doing that
01:13:20.700 | is we're taking the hidden layer above the first word,
01:13:24.860 | multiplying it by a matrix W,
01:13:28.500 | and then that's going to be going in together with X2
01:13:32.860 | to generate the next hidden step.
01:13:35.660 | And so we keep on doing that at each time step,
01:13:38.900 | so that we are kind of repeating a pattern
01:13:42.100 | of creating a next hidden layer
01:13:44.780 | based on the next input word
01:13:48.580 | and the previous hidden state by updating it,
01:13:52.020 | by multiplying it by a matrix W.
01:13:54.540 | Okay, so on my slide here,
01:13:55.980 | I've still only got four words of context
01:13:58.100 | because it's nice for my slide,
01:13:59.940 | but in principle there could be,
01:14:02.660 | any number of words of context now.
01:14:04.900 | Okay, so what we're doing is,
01:14:08.100 | so that we start off by having input vectors,
01:14:13.100 | which can be our word vectors
01:14:15.980 | that we've looked up for each word.
01:14:18.020 | So, sorry, yeah,
01:14:21.580 | so we can have the one hot vectors for word identity.
01:14:25.220 | We look up our word embeddings,
01:14:26.780 | so then we've got word embeddings for each word.
01:14:30.020 | And then we want to compute hidden states.
01:14:33.620 | So we need to start from somewhere,
01:14:36.100 | H0 is the initial hidden state
01:14:38.820 | and H0 is normally taken as a zero vector.
01:14:42.580 | So this is actually just initialized as zeros.
01:14:45.740 | And so for working out the first hidden state,
01:14:49.420 | we calculate it based on the first word embedding
01:14:54.420 | by multiplying this embedding by a matrix WE.
01:14:59.340 | And that gives us the first hidden state.
01:15:02.540 | But then, as we go on,
01:15:05.340 | we want to apply the same formula over again.
01:15:09.740 | So we have just two parameter matrices
01:15:14.140 | in the recurrent neural network,
01:15:17.020 | one matrix for multiplying input embeddings
01:15:20.460 | and one matrix for updating the hidden state of the network.
01:15:24.860 | And so for the second word, from its word embedding,
01:15:29.380 | we multiply it by the WE matrix.
01:15:34.060 | We take the previous time step's hidden state
01:15:37.020 | and multiply it by the WH matrix.
01:15:41.300 | And we use the two of those to generate the new hidden state
01:15:45.820 | and precisely how we generate the new hidden state
01:15:48.820 | is then shown in this equation on the left.
01:15:52.500 | So we take the previous hidden state, multiply it by WH.
01:15:56.620 | We take the input embedding, multiply it by WE.
01:16:00.900 | We sum those two.
01:16:02.620 | We add on a learn bias weight,
01:16:06.860 | and then we put that through a non-linearity.
01:16:11.060 | And although on this slide,
01:16:12.780 | that non-linearity is written as sigma,
01:16:15.260 | by far the most common non-linearity to use here
01:16:18.740 | actually is a tanh non-linearity.
01:16:22.380 | And so this is the core equation
01:16:26.100 | for a simple recurrent neural network.
01:16:28.980 | And for each successive time step,
01:16:31.340 | we're just gonna keep on applying that
01:16:33.860 | to work out hidden states.
01:16:36.180 | And then from those hidden states,
01:16:38.660 | we can use them just like in our window classifier
01:16:42.940 | to predict what would be the next word.
01:16:45.500 | So at any position, we can take this hidden vector,
01:16:50.740 | put it through a softmax layer,
01:16:52.580 | which is multiplying by U matrix
01:16:54.620 | and adding on another bias,
01:16:56.260 | and then making a softmax distribution out of that.
01:16:59.020 | And that will then give us a probability distribution
01:17:01.940 | over next words.
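Putting that together in symbols, matching the description above (with h^(0) the zero vector and e^(t) the embedding of the t-th word):

$$h^{(t)} = \tanh\!\big(W_h\, h^{(t-1)} + W_e\, e^{(t)} + b_1\big), \qquad h^{(0)} = \mathbf{0},$$

$$\hat{y}^{(t)} = \operatorname{softmax}\!\big(U\, h^{(t)} + b_2\big).$$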
01:17:04.140 | What we saw here, right?
01:17:05.900 | This is the entire math of a simple recurrent neural network.
01:17:10.900 | And next time I'll come back and say more about them,
01:17:16.380 | but this is the entirety of what you need to know
01:17:20.540 | in some sense for the computation of the forward model
01:17:24.140 | of a simple recurrent neural network.
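As a sketch, here is what that forward computation might look like in PyTorch, written with the explicit loop over time steps (the dimensions and names are assumptions; in practice you would typically use the built-in nn.RNN or a fancier cell):

```python
import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.W_e = nn.Linear(emb_dim, hidden)             # input-to-hidden (carries b1)
        self.W_h = nn.Linear(hidden, hidden, bias=False)  # hidden-to-hidden
        self.U = nn.Linear(hidden, vocab_size)            # hidden-to-output (carries b2)

    def forward(self, token_ids):                         # token_ids: (batch, seq_len)
        batch, seq_len = token_ids.shape
        h = torch.zeros(batch, self.W_h.in_features, device=token_ids.device)  # h0 = 0
        logits = []
        for t in range(seq_len):                          # the for-loop over time steps
            e_t = self.emb(token_ids[:, t])
            h = torch.tanh(self.W_h(h) + self.W_e(e_t))   # same W_h, W_e at every step
            logits.append(self.U(h))                      # scores over the next word
        return torch.stack(logits, dim=1)                 # (batch, seq_len, vocab_size)

lm = SimpleRNNLM(vocab_size=10_000)
scores = lm(torch.randint(0, 10_000, (8, 12)))            # batch of 8 sequences, length 12
```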
01:17:26.220 | So the advantages we have now
01:18:28.380 | is that it can process a text input of any length.
01:17:33.140 | In theory, at least it can use information
01:17:37.140 | from any number of steps back.
01:17:39.420 | We'll talk more about in practice
01:17:41.060 | how well that actually works.
01:17:42.860 | The model size is fixed.
01:17:45.700 | It doesn't matter how much of a past context there is.
01:17:49.820 | All we have is our WH and WE parameters.
01:17:54.020 | And at each time step,
01:17:56.580 | we use exactly the same weights to update our hidden state.
01:18:00.380 | So there's a symmetry in how different inputs are processed
01:18:05.340 | in producing our predictions.
01:18:08.140 | RNNs in practice though,
01:18:10.180 | or the simple RNNs in practice aren't perfect.
01:18:13.860 | So a disadvantage is that they're actually kind of slow
01:18:18.220 | 'cause with this recurrent computation,
01:18:21.300 | in some sense we are sort of stuck
01:18:23.300 | with having to have on the outside a for loop.
01:18:26.100 | So we can do vector matrix multiplies on the inside here,
01:18:30.500 | but really we have to do for time step equals one to N,
01:18:35.500 | calculate the successive hidden states.
01:18:39.340 | And so that's not a perfect neural net architecture
01:18:42.700 | and we'll discuss alternatives to that later.
01:18:47.580 | And although in theory,
01:18:49.220 | this model can access information any number of steps back,
01:18:53.140 | in practice we find that it's pretty imperfect at doing that
01:18:57.780 | and that will then lead to more advanced forms
01:19:01.220 | of recurrent neural network that I'll talk about next time
01:19:05.060 | that are able to more effectively access past context.
01:19:10.060 | Okay, I think I'll stop there for the day.