Stanford CS224N - NLP w/ DL | Winter 2021 | Lecture 4 - Syntactic Structure and Dependency Parsing
Chapters
0:00 Introduction
2:55 Meaning
7:18 Review
13:35 Dependency grammar
20:15 Why are we interested in syntactic structure
22:43 Examples of natural language ambiguities
27:52 Whales from space
28:22 Prepositional phrases
34:23 Coordination scope ambiguities
36:28 Newspaper headlines
37:43 Modifiers
41:46 Semantic Interpretation
43:12 What is a dependency grammar
45:12 History of dependency grammars
50:15 Universal dependencies tree banks
52:57 Why are tree banks good
57:05 Sources of information
00:00:00.000 |
Okay, so for today, we're actually going to take a bit of a change of pace from what we've been doing so far. 00:00:15.480 |
And we're going to focus much more on linguistics and natural language processing. 00:00:21.800 |
And so in particular, we're going to start looking at the topic of dependency parsing. 00:00:32.120 |
And so this is the plan of what to go through today. 00:00:34.820 |
So I'm going to start out by going through some ideas that have been used for the syntactic 00:00:39.460 |
structure of languages, constituency and dependency, and introduce those. 00:00:44.840 |
And then focusing in more on dependency structure, I'm then going to look at dependency grammars 00:00:55.680 |
And then having done that, we're then going to move back into thinking about how to build dependency parsers. 00:01:02.600 |
And so I'm going to introduce the idea of transition based dependency parsing. 00:01:07.400 |
And then in particular, having developed that idea, I'm going to talk about a way to build 00:01:13.080 |
a simple but highly effective neural dependency parser. 00:01:17.640 |
And so this simple, highly effective neural dependency parser is essentially what we'll 00:01:22.240 |
be asking you to build in the third assignment. 00:01:26.120 |
So in some sense, we're getting a little bit ahead of ourselves here, because in week two 00:01:30.040 |
of the class, we teach you how to do both assignments two and three. 00:01:35.660 |
But all of this material will come in really useful. 00:01:39.840 |
Before I get underway, just a couple of announcements. 00:01:44.760 |
So again, for assignment two, you don't yet need to use the PyTorch framework. 00:01:53.080 |
But now's a good time to work on getting PyTorch installed for your Python programming. 00:02:00.420 |
Assignment three is in part also an introduction to using PyTorch. 00:02:04.480 |
It's got a lot of scaffolding included in the assignment. 00:02:08.020 |
But beyond that, this Friday, we've got a PyTorch tutorial, and I thoroughly encourage you to go along to that. 00:02:18.880 |
And in the second half, the first day of week four, we have an explicit class that partly 00:02:26.480 |
focuses on the final projects and what the choices are for those. 00:02:30.720 |
But it's never too early to start thinking about the final project and what kind of things you might want to do. 00:02:40.720 |
There are sort of resources on the course pages about what different TAs know about. 00:02:45.640 |
I've also talked to a number of people about final projects, but clearly I can't talk to everybody. 00:02:51.040 |
So I encourage you to also be thinking about what you want to do for final projects. 00:02:55.320 |
Okay, so what I wanted to do today was introduce how people think about the structure of sentences. 00:03:07.160 |
And put structure on top of them to explain how human language conveys meaning. 00:03:13.700 |
And so our starting point for meaning, and essentially what we've dealt with so far with word vectors, is words. 00:03:23.160 |
And words are obviously an important part of the meaning of human languages. 00:03:29.160 |
But for words in human languages, there's more that we can do with them in thinking 00:03:39.400 |
So in particular, the first most basic way that we think about words when we are thinking 00:03:45.560 |
about how sentences are structured is we give to them what's called a part of speech. 00:03:51.120 |
We can say that cat is a noun, by is a preposition, door is another noun, cuddly is an adjective. 00:04:02.760 |
And then for the word the, it gets a different part of speech; if you saw any 00:04:08.720 |
parts of speech in school, you were probably told it was an article. 00:04:14.240 |
Sometimes that is just put into the class of adjectives. 00:04:18.280 |
In modern linguistics and what you'll see in the resources that we use, words like the are called determiners. 00:04:25.980 |
And the idea is that there's a bunch of words that includes a and the, but also other words like 00:04:31.240 |
this and that, or even every, which are words that occur at the beginning of something like 00:04:41.760 |
the cuddly cat, which have a determinative function of sort of picking out which cats you're talking about. 00:04:53.520 |
But it's not the case that when we want to communicate with language, that we just have 00:04:58.320 |
this word salad where we say a bunch of words. 00:05:02.520 |
We just say, you know, whatever, leaking, kitchen, tap, and let the other person put it together. 00:05:10.160 |
We put words together in a particular way to express meanings. 00:05:14.300 |
And so therefore, languages have larger units of putting meaning together. 00:05:20.080 |
And the question is how we represent and think about those. 00:05:24.640 |
Now in modern work, in particular, in modern United States linguistics, or even what you 00:05:34.880 |
see in computer science classes, when thinking about formal languages, the most common way 00:05:40.720 |
to approach this is with the idea of context-free grammars, which you see at least a little 00:05:46.760 |
bit of in 103, if you've done 103, what a linguist would often refer to as phrase structure grammars. 00:05:54.640 |
And the idea there is to say, well, there are bigger units in languages that we refer to as phrases. 00:06:02.040 |
So something like the cuddly cat is a cat with some other words modifying it, so it's a noun phrase. 00:06:12.120 |
But then we have ways in which phrases can get larger by building things inside phrases. 00:06:22.240 |
So the door here is also a noun phrase, but then we can build something bigger around it. 00:06:37.000 |
So we can then make something like the cuddly cat by the door, and then the door is a noun phrase inside the prepositional phrase by the door. 00:06:50.840 |
But then when you put it all together, the whole of this thing becomes a bigger noun phrase. 00:06:57.040 |
And so it's working with these ideas of nested phrases, what in context-free grammar terms are called non-terminals. 00:07:08.440 |
So noun phrase and prepositional phrase would be non-terminals in the context-free grammar. 00:07:14.180 |
We can build up a bigger structure of human languages. 00:07:18.280 |
So let's just do that for a little bit to review what happens here. 00:07:22.860 |
So we start off saying, okay, you can say the cat and the dog. 00:07:31.160 |
And so we want a rule that can explain those. 00:07:33.700 |
So we could say a noun phrase goes to a determiner followed by a noun. 00:07:38.320 |
And then somewhere over the side, we'd have a lexicon. 00:07:42.560 |
And in our lexicon, we'd say that dog is a noun and cat is a noun, and a is a determiner and the is a determiner. 00:07:55.560 |
So then we notice you can do a bit more than that. 00:07:59.720 |
So you can say things like the large cat, a barking dog. 00:08:06.320 |
So that suggests we can have an adjective after the determiner in the noun phrase. 00:08:14.840 |
And then there's the noun and that can explain some things we can say. 00:08:18.960 |
But we can also say the cat by the door or a barking dog in a crate. 00:08:25.600 |
And so we can also put a prepositional phrase at the end, and that's optional, but you can have one. 00:08:35.200 |
For the example I gave, like a barking dog on the table. 00:08:44.400 |
So then we'll keep on and say, well, actually you can use multiple adjectives. 00:08:51.120 |
So you can say a large barking dog or a large barking cuddly cat. 00:09:00.440 |
So we have any number of adjectives, which we can represent with a star, what's referred to as the Kleene star. 00:09:12.720 |
For by the door, I have to have a rule for producing by the door. 00:09:16.200 |
So I also need a rule that's a prepositional phrase goes to a preposition followed by a noun phrase. 00:09:22.800 |
And so then I also have to have prepositions and that can be in or on or by. 00:09:32.240 |
And you can make other sentences, of course, with this as well, like the large crate on 00:09:38.040 |
the table or something like that, or the large crate on the large table. 00:09:44.800 |
So I chug along and then, well, I could have something like talk to the cat. 00:09:52.120 |
So talk is a verb and to still looks like a preposition. 00:09:58.240 |
So I need to be able to make up something with that as well. 00:10:06.080 |
So what I can do is say, I can also have a rule for a verb phrase that goes to a verb. 00:10:13.840 |
And then after that, for something like talk to the cat, that it can take a prepositional phrase. 00:10:21.560 |
And then I can say that the verb goes to talk or walked. 00:10:31.960 |
Then I can parse that, and then I can cover those sentences. 00:10:37.680 |
So that's, that's the end of what I have here. 00:10:41.360 |
So in this sort of a way, I'm handwriting a grammar. 00:10:47.040 |
So here is now I have this grammar and a lexicon. 00:10:52.920 |
And for the examples that I've written, I've written down here, this grammar and this lexicon 00:11:02.080 |
is sufficient to parse these sorts of fragments, showing the expansions that I just wrote down. 00:11:09.560 |
I mean, of course, there's a lot more to English than what you see here. 00:11:15.120 |
So if I have something like, you know, the cat walked behind the dog, then I need something more. 00:11:27.920 |
So it seems then I need a rule that says I can have a sentence that goes to a noun phrase followed by a verb phrase. 00:11:42.240 |
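To make this concrete, here's a small sketch (my own illustration, not from the lecture) of roughly the toy grammar and lexicon built up above, written as plain Python data structures. The rule and category names just follow the discussion; nothing here is a standard library API.

```python
# A rough sketch of the toy grammar built up in the lecture.
# Categories: S, NP, VP, PP, Det, Adj, N, P, V.
# "Adj*" means zero or more adjectives; "(PP)" means an optional PP.
grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "Adj*", "N", "(PP)"]],   # the (large*) cat (by the door)
    "VP": [["V", "(PP)"]],                  # talk (to the cat)
    "PP": [["P", "NP"]],                    # by the door
}

lexicon = {
    "Det": ["a", "the"],
    "Adj": ["large", "barking", "cuddly"],
    "N":   ["cat", "dog", "door", "crate", "table"],
    "P":   ["in", "on", "by", "to", "behind"],
    "V":   ["talk", "walked"],
}

# e.g. "the cuddly cat by the door" expands as
# NP -> Det Adj N PP -> the cuddly cat (P NP) -> the cuddly cat by the door
```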
Let's see, one question that Ruth Ann asked was about what do the brackets mean? 00:11:50.560 |
And is the first NP different from the second? 00:11:55.880 |
So for this notation on the brackets here, I mean, this is actually a common notation 00:12:05.840 |
It's sort of in some sense a little bit different to traditional computer science notation, 00:12:12.560 |
since the star is used in both to mean zero or more of something. 00:12:18.000 |
So you could have zero, one, two, three, four, five adjectives. 00:12:21.960 |
Somehow it's usual in linguistics that when you're using the star, you also put parentheses 00:12:30.600 |
So sort of parentheses and star are used together to mean any number of something. 00:12:36.320 |
When it's parentheses just by themselves, that's then meaning zero or one. 00:12:42.200 |
And then for are these two noun phrases different? 00:12:51.360 |
And so in our grammar, we can have multiple rules that expand noun phrase in different ways. 00:12:58.800 |
But actually in my example here, my second rule, because I wrote it quite generally, covers the first rule as well. 00:13:09.120 |
So actually at that point, I can cross out this first rule because I don't actually need it anymore. 00:13:15.160 |
But in general, you have a choice between writing multiple rules for noun phrase goes 00:13:22.440 |
to categories, which effectively gives you a disjunction, or working out, with notation like the parentheses and the star, how to collapse them into one rule. 00:13:36.700 |
So that was what gets referred to in natural language processing as constituency grammars, 00:13:44.040 |
where the standard form of constituency grammar is a context-free grammar of the sort that 00:13:50.200 |
I trust you saw at least a teeny bit of either in CS 103 or something like a programming languages class. 00:14:00.200 |
There are other forms of grammars that also pick out constituency. 00:14:04.420 |
There are things like tree adjoining grammars, but I'm not going to really talk about any of those. 00:14:09.320 |
What I actually want to present is a somewhat different way of looking at grammar, which 00:14:14.600 |
is referred to as dependency grammar, which puts a dependency structure over sentences. 00:14:23.120 |
Now actually, it's not that these two ways of looking at grammar have nothing to do with each other. 00:14:29.920 |
I mean, there's a whole formal theory about the relationships between different kinds 00:14:35.760 |
of grammars, and you can very precisely state relationships and isomorphisms between different kinds of grammars. 00:14:45.920 |
But on the surface, these two kinds of grammars look sort of different and emphasize different things. 00:14:53.480 |
And for reasons of their sort of closeness to picking out relationships in sentences 00:15:02.280 |
and their ease of use, it turns out that in modern natural language processing, starting, 00:15:09.840 |
I guess, around 2000, so really in the last 20 years, NLP people have really swung behind dependency grammars. 00:15:19.400 |
So if you look around now where people are using grammars in NLP, by far the most common 00:15:25.120 |
thing that's being used is dependency grammars. 00:15:28.480 |
So I'm going to teach us today a bit about those, and what we're going to build in 00:15:34.240 |
assignment three is, using supervised learning, a neural dependency parser. 00:15:41.600 |
So the idea of dependency grammar is that when we have a sentence, what we're going 00:15:47.560 |
to do is we're going to say for each word, what other words modify it. 00:15:55.880 |
So what we're going to do is when we say the large crate, we're going to say, okay, well, 00:16:01.920 |
large is modifying crate and the is modifying crate. 00:16:12.920 |
And so I'm showing a modification, a dependency or an attachment relationship by drawing an 00:16:21.720 |
arrow from the head to what's referred to in dependency grammar as the dependent. 00:16:28.320 |
The thing that modifies, further specifies or attaches to the head. 00:16:40.400 |
Well, another dependency is that, well, look in the large crate, that's where you're looking. 00:16:51.960 |
So you're going to want to have the large crate, in look in the large crate, as being a dependent of look. 00:16:58.480 |
And so that's also going to be a dependency relationship here. 00:17:03.600 |
And then there's one final bit that might seem a little bit confusing to people. 00:17:10.640 |
And that's actually when we have these prepositions. 00:17:14.200 |
There are two ways that you can think that this might work. 00:17:18.300 |
So if it was something like look in the crate, it seems like that is a dependent of crate. 00:17:29.860 |
But you could think that you want to say look in and it's in the crate and give this dependency 00:17:36.980 |
relationship with the sort of preposition, as sort of thinking of it as the head of the prepositional phrase. 00:17:46.740 |
And that's a possible strategy in the dependency grammar. 00:17:50.900 |
But what I'm going to show you today and what you're going to use in the assignment is dependency 00:17:59.420 |
grammars that follow the representation of universal dependencies. 00:18:04.260 |
And universal dependencies is a framework which actually I was involved in creating, 00:18:09.340 |
which was set up to try and give a common dependency grammar over many different human languages. 00:18:16.180 |
And in the design decisions that were made in the context of designing universal dependencies, 00:18:24.820 |
what we decided was that, for what in some languages is expressed by prepositions, lots of other 00:18:32.980 |
languages make much more use of case markings. 00:18:36.380 |
So if you've seen something like German, you've seen more case markings like genitive and dative. 00:18:44.820 |
And in other languages like Latin or Finnish, lots of Native American languages, you have 00:18:52.180 |
many more case markings again, which cover most of the role of prepositions. 00:18:57.860 |
So in universal dependencies, essentially in the crate is treated like a case-marked noun. 00:19:05.700 |
And so what we say is that the in is also a dependent of crate. 00:19:14.940 |
So in the structure we adopt, in is a dependent of crate. 00:19:27.380 |
And then we have these prepositional phrases in the kitchen, by the door, and we want to work out what they modify. 00:19:35.780 |
Well, in the kitchen is modifying crate, right? 00:19:42.460 |
So we're going to say that it's, this piece is a dependent of crate. 00:19:51.360 |
Then what about by the door? Well, it's not really meaning that it's a kitchen by the door. 00:20:01.280 |
And so what we're going to have is the crate also has door as a dependent. 00:20:06.300 |
And so that gives us our full dependency structure of this sentence. 00:20:17.200 |
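As a small illustration (my own sketch, not the lecture's code), the dependency structure just described for "Look in the large crate in the kitchen by the door" can be written down as (head, relation, dependent) triples, using Universal Dependencies-style relation names.

```python
# Dependency arcs for: "Look in the large crate in the kitchen by the door"
# Each arc is (head, relation, dependent); ROOT is the fake root of the sentence.
arcs = [
    ("ROOT",    "root", "Look"),
    ("crate",   "case", "in"),      # the first "in" marks crate, like a case marker
    ("crate",   "det",  "the"),
    ("crate",   "amod", "large"),
    ("Look",    "obl",  "crate"),   # where you're looking
    ("kitchen", "case", "in"),
    ("kitchen", "det",  "the"),
    ("crate",   "nmod", "kitchen"), # in the kitchen modifies crate
    ("door",    "case", "by"),
    ("door",    "det",  "the"),
    ("crate",   "nmod", "door"),    # by the door also modifies crate
]
```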
And so that's a teeny introduction to syntactic structure. 00:20:22.640 |
I'm going to say a bit more about it and give a few more examples. 00:20:27.940 |
But let me just for a moment sort of say a little bit about, well, why are we interested in sentence structure. 00:20:34.400 |
Why do we need to know the structure of sentences? 00:20:38.200 |
And this gets into how do human languages work? 00:20:42.720 |
So human languages can communicate very complex ideas. 00:20:49.320 |
I mean, in fact, you know, anything that humans know how to communicate to one another, they communicate in language. 00:20:59.000 |
So we can structure and communicate very complex ideas. 00:21:03.000 |
But we can't communicate a really complex idea by one word. 00:21:09.520 |
We can't just, you know, choose a word like, you know, empathy and say it with a lot of 00:21:15.560 |
meaning and say empathy and the other person is meant to understand everything about what you mean. 00:21:22.160 |
So we can compose a complex meaning that explains things by putting words together 00:21:29.520 |
And the syntax of a language allows us to put words together into bigger units where 00:21:36.440 |
we can build up and convey to other people a complex meaning. 00:21:41.440 |
And so then the listener doesn't get this syntactic structure. 00:21:47.160 |
The syntactic structure of the sentence is hidden from the listener. 00:21:51.520 |
All the listener gets is a sequence of words, one after another, bang, bang, bang. 00:21:57.680 |
So the listener has to be able to do what I was just trying to do in this example, that 00:22:04.000 |
as the sequence of words comes in, that the listener works out which words modify which 00:22:11.500 |
other words and therefore can construct the structure of the sentence and hence the meaning of the sentence. 00:22:20.800 |
And so in the same way, if we want to build clever neural net models that can understand 00:22:27.400 |
the meaning of sentences, those clever neural net models also have to understand what is 00:22:33.680 |
the structure of the sentence so that they can interpret the language correctly. 00:22:39.160 |
And we'll go through some examples and see more of that. 00:22:44.560 |
So the fundamental point that we're going to sort of spend a bit more time on is that 00:22:51.680 |
these choices of how you build up the structure of a sentence change the interpretation of the sentence. 00:22:59.880 |
And a human listener or equally a natural language understanding program has to make 00:23:08.040 |
in a sort of probabilistic fashion choices as to which words modify, i.e. depend upon 00:23:15.280 |
which other words so that they're coming up with the interpretation of the sentence that 00:23:21.160 |
they think was intended by the person who said it. 00:23:26.000 |
So to get a sense of this and how sentence structure is interesting and difficult, what 00:23:34.920 |
I'm going to go through now is a few examples of different ambiguities that you find in natural language. 00:23:43.280 |
And I've got some funny examples from newspaper headlines, but these are all real natural 00:23:50.400 |
language ambiguities that you find throughout natural language. 00:23:55.800 |
Well, at this point I should say this is where I'm being guilty of talking about language in general while only showing English examples. 00:24:04.760 |
Some of these ambiguities you find in lots of other languages as well, but which ambiguities 00:24:11.040 |
of syntactic structure you get partly depends on the details of the language. 00:24:17.320 |
So different languages have different syntactic constructions, different word orders, different 00:24:22.760 |
amounts of words having different forms of words like case markings. 00:24:29.240 |
And so depending on those details, there might be different ambiguities. 00:24:34.500 |
So here's one ambiguity, which is one of the commonest ambiguities in English: the headline San Jose cops kill man with knife. 00:24:47.960 |
Either it's the San Jose cops who are killing a man and they're killing a man with a knife. 00:24:57.440 |
And so that corresponds to a dependency structure where the San Jose cops are the subject of 00:25:04.780 |
killing, the man is the object of killing, and then the knife is then the instrument 00:25:16.540 |
So that the knife is an oblique modifier for the instrument of killing. 00:25:22.280 |
And so that's one possible structure for this sentence. 00:25:29.620 |
So what it actually probably was, was that it was a man with a knife and the San Jose cops killed him. 00:25:38.700 |
So that corresponds to the knife then being a noun modifier of the man and then kill is 00:25:52.100 |
So the man is the object of killing and the cops are still the subject. 00:25:57.420 |
And so whenever you have a prepositional phrase like this, that's coming further on in a sentence, 00:26:08.980 |
It could be either interpreted as modifying a noun phrase that comes before it, or it 00:26:15.540 |
can be interpreted as modifying a verb that comes before it. 00:26:20.420 |
So systematically in English, you get these prepositional phrase attachment ambiguities 00:26:29.460 |
But to give two further observations on that, the first observation is you encounter sentences 00:26:41.620 |
with prepositional phrase attachment ambiguities every time you read a newspaper article, every 00:26:48.740 |
time you talk to somebody, but most of the time you never notice them. 00:26:54.100 |
And that's because our human brains are incredibly good at considering the possible interpretations 00:27:01.300 |
and going with the one that makes sense according to context. 00:27:07.700 |
The second comment, as I said, different human languages expose different ambiguities. 00:27:14.680 |
So for example, this is an ambiguity that you normally don't get in Chinese because 00:27:20.060 |
in Chinese prepositional phrases, modifying a verb are normally placed before the verb. 00:27:27.460 |
And so there you don't standardly get this ambiguity, but there are different other ambiguities 00:27:39.080 |
So this ambiguity you find everywhere because prepositional phrases are really common at 00:27:47.000 |
So here's another one, scientists count whales from space. 00:27:51.380 |
So that gives us these two possible interpretations that there are whales from space and scientists are counting them. 00:28:00.460 |
And then the other one is how the scientists are counting the whales is that they're counting 00:28:07.100 |
them from space and they're using satellites to count the whales, which is the correct interpretation 00:28:14.580 |
that the newspaper hopes that you're getting. 00:28:21.500 |
And this problem gets much, much more complex because many sentences in English have whole sequences of prepositional phrases. 00:28:33.680 |
So here's the kind of boring sentence that you find in the financial news. 00:28:39.360 |
The board approved its acquisition by Royal Trust Co. Ltd. of Toronto for $27 a share at its monthly meeting. 00:28:47.500 |
And while if you look at the structure of this sentence, what we find is, you know, 00:28:51.980 |
here's a verb, then here's the object noun phrase. 00:29:03.860 |
Well, we find a prepositional phrase, another prepositional phrase, another prepositional phrase, and another prepositional phrase. 00:29:12.040 |
And how to attach each of these is then ambiguous. 00:29:16.220 |
So the basic rule of how you can attach them is you can attach them to things to the left, 00:29:23.700 |
providing you don't create crossing attachments. 00:29:27.380 |
So in principle, by Royal Trust Co. Ltd. could be attached to either approved or acquisition. 00:29:34.940 |
But in this case, by Royal Trust Co. Ltd., it's the acquirer, so it modifies the acquisition. 00:29:51.380 |
So of Toronto could be modifying Royal Trust Co. Ltd. 00:29:55.660 |
It could be modifying the acquisition or it can be modifying the approved. 00:30:00.780 |
And in this case, the of Toronto is telling you more about the company. 00:30:05.520 |
And so it's a modifier of Royal Trust Co. Ltd. 00:30:09.740 |
Okay, so then the next one is for $27 a share. 00:30:14.780 |
And that could be modifying Toronto, Royal Trust Co. Ltd., the acquisition or the approving. 00:30:21.300 |
And well, in this case, that's talking about the price of the acquisition. 00:30:31.740 |
And this is now a prepositional phrase that's modifying the acquisition. 00:30:38.120 |
And then at the end, at its monthly meeting, well, that's where the approval is happening 00:30:47.920 |
So rather than any of these preceding four noun phrases, at its monthly meeting is modifying the verb approved. 00:31:06.100 |
But as I think maybe you can see that, you know, none of these dependencies cross each 00:31:11.220 |
other and they connect at different places ambiguously. 00:31:16.260 |
So because we can chain these prepositions like this and attach them at different places 00:31:21.540 |
like this, human language sentences are actually extremely ambiguous. 00:31:29.680 |
So the number, if you have a sentence with K prepositional phrases at the end of it, 00:31:38.420 |
where here we have K equals four, the number of parses this sentence has, the number of 00:31:44.660 |
different ways you can make these attachments is given by the Catalan numbers. 00:31:49.540 |
So the Catalan numbers are an exponentially growing series which arises in many tree-like 00:31:57.140 |
So if you're doing something like triangulations of a polygon, you get Catalan numbers. 00:32:01.900 |
If you're doing triangulation and graphical models in CS228, you get Catalan numbers. 00:32:08.500 |
But we don't need to worry about the details here. 00:32:10.840 |
The central point is this is an exponential series. 00:32:13.980 |
And so you're getting an exponential number of parses in terms of the number of prepositional phrases. 00:32:21.580 |
And so in general, you know, the number of parses human languages have is exponential 00:32:30.740 |
in the length of the sentence. Because if you're then trying to enumerate all the parses, you might fear that you really have an intractable problem. 00:32:38.820 |
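Just to make the growth concrete, here's a tiny sketch (not from the lecture) that computes the Catalan numbers with the standard recurrence; the number of possible attachments grows along this series as you add prepositional phrases.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def catalan(n: int) -> int:
    """Catalan numbers: C(0) = 1, C(n) = sum over i of C(i) * C(n-1-i)."""
    if n == 0:
        return 1
    return sum(catalan(i) * catalan(n - 1 - i) for i in range(n))

print([catalan(k) for k in range(1, 9)])
# [1, 2, 5, 14, 42, 132, 429, 1430]
# The attachment count follows this exponentially growing series
# as the chain of prepositional phrases gets longer.
```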
The thing to notice about structures like these prepositional phrase attachment ambiguities 00:32:45.940 |
is that there's nothing that resolves these ambiguities in terms of the structure of the language itself. 00:32:53.700 |
So if you've done something like looked at the kind of grammars that are used in compilers, 00:33:01.440 |
that the grammars used in compilers for programming languages are mainly made to be unambiguous. 00:33:08.780 |
And to the extent that there are any ambiguities, there are default rules that are used to say, 00:33:17.180 |
choose this one particular parse tree for your piece of a programming language. 00:33:29.540 |
Whereas human languages just allow the ambiguity, and the listening human is just meant to be smart enough to figure out what was intended. 00:33:35.300 |
So the analogy would be that, you know, in programming languages, when you're working 00:33:44.300 |
out what does an else clause modify, well, you've got the answer that you can either 00:33:53.220 |
look at the curly braces to work out what the else clause modifies, or if you're using 00:33:59.020 |
Python, you look at the indentation and it tells you what the else clause modifies. 00:34:04.500 |
Whereas by contrast, for human languages, you just write down else something. 00:34:18.260 |
The human being will just figure out what the else clause is meant to pair up with. 00:34:24.340 |
Lots of other forms of ambiguities in human languages. 00:34:32.020 |
Another one that's very common over all sorts of languages is coordination scope ambiguities. 00:34:40.580 |
Shuttle veteran and longtime NASA executive Fred Gregory appointed to board. 00:34:55.740 |
There's a shuttle veteran and there's a longtime NASA executive Fred Gregory. 00:35:01.140 |
And they were both appointed to the board, two people. 00:35:06.220 |
And the other possibility is there's someone named Fred Gregory who's a shuttle veteran 00:35:15.300 |
and longtime NASA executive, and they're appointed to the board, one person. 00:35:21.860 |
And these two interpretations, again, correspond to having different parse structures. 00:35:28.860 |
So in one structure, we've got a coordination of the shuttle veteran and the longtime NASA executive. 00:35:42.620 |
In one case, these are coordinated, and then Fred Gregory specifies the name of the NASA executive. 00:35:52.300 |
So it's then specifying who that executive is, where in the other one, the shuttle veteran 00:36:00.460 |
and longtime NASA executive all together is then something that is a modifier of Fred Gregory. 00:36:11.260 |
Okay, so in one case, this whole unit is what modifies Fred Gregory. 00:36:17.660 |
In the other one up here, just longtime NASA executive modifies Fred Gregory, and then 00:36:24.380 |
that's conjoined together with the shuttle veteran. 00:36:28.020 |
And so that also gives different interpretations. 00:36:32.220 |
So this is a slightly reduced example. I mean, newspaper headlines tend to be 00:36:40.020 |
more ambiguous than many other pieces of text because they're written in this shortened form. 00:36:47.980 |
And this is an especially shortened form, where explicit words have actually been left out. 00:36:54.860 |
But this headline says, "Doctor, no heart, cognitive issues." 00:37:01.740 |
And this was after, I guess, one of Trump, it was after Trump's first physical. 00:37:06.420 |
And while this is an ambiguity, because there are two ways that you can read this, you can 00:37:10.660 |
either read this as saying, "Doctor, no heart and cognitive issues," which gives you one interpretation. 00:37:20.780 |
Instead of that, the way we should read it is that it's "heart or cognitive," and so 00:37:29.260 |
it's then saying, "No heart or cognitive issues." 00:37:34.180 |
And we have a different, narrower scope of the coordination, and then we get a different interpretation. 00:37:45.740 |
I want to give a couple more examples of different kinds of ambiguities. 00:37:50.660 |
Another one you see quite a bit is when you have modifiers that are adjectives and adverbs, 00:37:57.140 |
that there are different ways that you can have things modifying other things. 00:38:01.860 |
This example is a little bit not safe for work, but here goes: students get first hand job experience. 00:38:11.700 |
So this is an ambiguous sentence, and again, we can think of it as a syntactic ambiguity 00:38:17.820 |
in terms of which things modify which other things. 00:38:24.260 |
So the nice, polite way to render this sentence is that "first" is modifying "hand," so we've 00:38:33.940 |
got "first-hand," it's "job experience," so "job" is a compound noun modifying "experience," 00:38:41.540 |
and it's "first-hand experience," so "first-hand" is then modifying "experience," and then "get" 00:38:49.900 |
is the object of -- sorry, "first-hand job experience" is the object of "get," and "students" is the subject. 00:39:01.740 |
But if you have a smuttier mind, you can interpret this a different way, and in the alternative 00:39:08.340 |
interpretation, you then have "hand" going together with "job," and the "first" is then 00:39:20.820 |
a modifier of "experience," and "job" is still a modifier of "experience," and so then you 00:39:27.460 |
get this different parse structure and different interpretation there. 00:39:35.540 |
In a way, this example is similar to the previous one. 00:39:38.940 |
It's sort of having modifier pieces that can modify different things, but rather than it 00:39:45.740 |
just being with individual adjectives or individual adverbs, it's then much larger units, such 00:39:54.160 |
as verb phrases, can often have attachment ambiguities. 00:39:58.140 |
So this sentence headline is "Mutilated body washes up on Rio beach to be used for Olympics beach volleyball." 00:40:06.900 |
So we have this big verb phrase here of "to be used for Olympics beach volleyball," and 00:40:13.140 |
then again we have this attachment decision that we could either say that that big verb 00:40:21.900 |
phrase is modifying, i.e., is attached to the Rio beach, or we could say, "No, no, the 00:40:33.020 |
'to be used for Olympics beach volleyball,' that is modifying the mutilated body, and 00:40:42.340 |
it's a body that's to be used for the Olympics beach volleyball," which gives the funny reading. 00:40:49.420 |
Yeah, so I hope that's given you at least a little bit of a sense of how human language 00:40:55.620 |
syntactic structure is complex, ambiguous, and to work out the intended interpretations, 00:41:03.620 |
you need to know something about that structure. 00:41:07.660 |
In terms of how much you need to understand, I mean, you know, this isn't a linguistics class. 00:41:14.820 |
If you'd like to learn more about human language structure, you can go off and do a syntax 00:41:19.180 |
class, but, you know, we're not really going to spend a lot of time working through language 00:41:25.660 |
structure, but there will be some questions on this in the assignment, and so we're expecting 00:41:31.180 |
that you can be at the level that you can have sort of some intuitions as to which words 00:41:37.220 |
and phrases are modifying other words and phrases, and therefore you could choose between different possible parses. 00:41:46.180 |
Okay, I've spent quite a bit of time on that, so better keep going. 00:41:53.060 |
Okay, so the general idea is that knowing this sort of syntactic structure of a sentence helps us work out its meaning. 00:42:04.340 |
I mean, as well as just generally saying we can understand language, it's also used in 00:42:08.900 |
many cases for simple practical forms of semantic extraction, so people such as in biomedical 00:42:15.300 |
informatics often want to get out particular relations such as protein-protein interactions, 00:42:22.660 |
The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB, 00:42:31.460 |
and commonly that people can get out those kind of relationships by looking at patterns 00:42:37.380 |
of dependency relations with particular verbs, so for the interacts verb, if you have a pattern 00:42:43.420 |
of something being the subject and something else being the noun modifier of interacts, 00:42:49.780 |
well that's an interaction relationship, but it gets a bit more complicated than that, 00:42:54.060 |
as in this example, because often there are conjunctions, so you also want to have another 00:42:58.740 |
pattern where you have also interactions between the subject and the noun modifier's conjuncts, 00:43:07.420 |
which will allow us to also find the KaiA and KaiB examples. 00:43:13.420 |
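Here's a rough sketch (my own illustration, not code from the lecture) of how such patterns over dependency triples might be used to pull out interaction pairs; the relation names follow Universal Dependencies, and the specific arcs are just an assumed parse of the example sentence.

```python
# Arcs are (head, relation, dependent) for:
# "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB"
arcs = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "nmod",  "SasA"),
    ("SasA",      "conj",  "KaiA"),
    ("SasA",      "conj",  "KaiB"),
]

def interaction_pairs(arcs):
    """Pattern: the subject of 'interacts' interacts with its nmod,
    and with anything conjoined to that nmod."""
    pairs = []
    subjects = [d for h, r, d in arcs if h == "interacts" and r == "nsubj"]
    nmods = [d for h, r, d in arcs if h == "interacts" and r == "nmod"]
    for subj in subjects:
        for nm in nmods:
            pairs.append((subj, nm))
            # follow conj arcs off the nmod to catch "KaiA and KaiB"
            pairs += [(subj, d) for h, r, d in arcs if h == nm and r == "conj"]
    return pairs

print(interaction_pairs(arcs))
# [('KaiC', 'SasA'), ('KaiC', 'KaiA'), ('KaiC', 'KaiB')]
```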
Okay, so I've sort of given an informal tour of dependency grammar, so let me just try and quickly 00:43:22.460 |
say a little bit more formally what a dependency grammar is. 00:43:27.140 |
So in dependency syntax, what we say is that the syntactic structure of a sentence consists 00:43:35.940 |
of relations between pairs of words, and it's a binary asymmetric relation, i.e. we draw 00:43:43.780 |
arrows between pairs of words, which we call dependencies. 00:43:48.980 |
Now normally dependency grammars then type those arrows 00:43:55.660 |
to express what kind of relation that there is, and so that they have some kind of taxonomy 00:44:01.340 |
of grammatical relations, so we might have a subject grammatical relation, a verbal auxiliary 00:44:06.580 |
grammatical relation, an oblique modifier grammatical relation, we have some kind of inventory of those relations. 00:44:16.820 |
And we refer to the arrows going between the head, here's the head here, and something 00:44:24.540 |
that is a dependent of it, so the subject of a verb is a dependent of the verb, or when 00:44:31.420 |
you have a noun modifier like our sort of cuddly cat, we say that cuddly is a dependent 00:44:41.980 |
of cat, and so cat is the head of cuddly cat. 00:44:46.540 |
And so normally dependencies, like in these examples, form a tree, which formally means 00:44:55.580 |
it's not just any graph with arrows, we have a graph which is connected, acyclic, and has 00:45:03.460 |
a single root, so here's the root of the graph, and so that gives us a dependency tree analysis. 00:45:13.660 |
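A small sketch (not from the lecture) of what those tree conditions look like if you represent a parse as an array of head indices, with 0 standing for the root: exactly one word has head 0, every word has exactly one head, and following heads from any word reaches the root without cycles.

```python
def is_dependency_tree(heads):
    """heads[i] is the head of word i+1 (words are 1-indexed); head 0 is the root."""
    n = len(heads)
    if sum(1 for h in heads if h == 0) != 1:      # single root
        return False
    for i in range(1, n + 1):                     # connected and acyclic:
        seen, node = set(), i                     # walking up from every word
        while node != 0:                          # must reach the root
            if node in seen or not (0 <= heads[node - 1] <= n):
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

# "I ate fish": ate is the root, I and fish both depend on ate.
print(is_dependency_tree([2, 0, 2]))  # True
print(is_dependency_tree([2, 3, 2]))  # False: no word attaches to the root
```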
Dependency grammar has a really, really long history, so the famous first linguist 00:45:20.740 |
was Panini, who wrote about the structure of Sanskrit, and mainly he worked on the sound 00:45:29.580 |
system of Sanskrit and how sounds change in various contexts, which is what linguists 00:45:33.820 |
call phonology, and the different forms of Sanskrit words, Sanskrit has rich morphology 00:45:40.860 |
of inflecting nouns and verbs for different cases and forms, but he also worked a little 00:45:46.900 |
on the syntactic structure of Sanskrit sentences, and essentially what he proposed was a dependency grammar. 00:45:57.340 |
And it turns out that sort of for most of recorded history, when people have then gone 00:46:04.380 |
on and tried to put structures over human sentences, what they have used is dependency 00:46:11.460 |
grammars, so there was a lot of work in the first millennium by Arabic grammarians trying 00:46:17.860 |
to work out the grammar structure of sentences, and effectively what they used was akin to 00:46:25.660 |
what I've just presented as a dependency grammar. 00:46:29.060 |
So compared to, you know, 2500 years of history, the ideas of having context-free grammars 00:46:37.140 |
and having constituency grammars is actually a really, really recent invention, so it was 00:46:41.820 |
really sort of in the middle of the 20th century that the ideas of constituency grammar and 00:46:48.220 |
context-free grammars were developed, first by Wells in the 40s, and then by Noam Chomsky 00:46:53.900 |
in the early 50s, leading to things like the Chomsky hierarchy that you might see in CS103 00:47:04.500 |
So for modern work on dependency grammar, using kind of the terminology and notation 00:47:11.100 |
that I've just introduced, that's normally attributed to Lucien Tesnière, who was a French 00:47:16.300 |
linguist in around the sort of middle of the 20th century as well. 00:47:22.980 |
Dependency grammar was widely used in the 20th century in a number of places. 00:47:28.820 |
I mean, in particular, it tends to be sort of much more natural and easier to think about 00:47:34.580 |
for languages that have a lot of different case markings on nouns, like nominative, accusative, 00:47:41.260 |
genitive, dative, instrumental kind of cases, like you get in a language like Latin or Russian, 00:47:47.380 |
and a lot of those languages have much freer word order than English. 00:47:51.460 |
So, you know, in English, the subject has to be before the 00:47:56.100 |
verb and the object has to be after the verb, but lots of other languages have much freer 00:48:00.780 |
word order and instead use different forms of nouns to show you what's the subject or object. 00:48:08.820 |
And dependency grammars can often seem much more natural for those kinds of languages. 00:48:15.420 |
Dependency grammars were also prominent in the very beginnings of computational linguistics. 00:48:19.500 |
So one of the first people working in computational linguistics in the US was David Hayes. 00:48:27.080 |
So the Professional Society for Computational Linguistics is called the Association for 00:48:30.940 |
Computational Linguistics, and he was actually one of the founders of the Association for 00:48:37.400 |
And he published in the early 1960s an early, perhaps the first, dependency parser. 00:48:48.260 |
Yeah, a little teeny note, just in case you see other things. 00:48:54.280 |
When you have these arrows, you can draw them in either direction. 00:48:58.820 |
You can either draw arrows from the head or to the dependent or from the dependent to 00:49:05.500 |
And actually, different people have done one and the other, right? 00:49:09.620 |
So the way Tesnière drew them was to draw them from the head to the dependent, and we're following that convention. 00:49:15.940 |
But, you know, if you're looking at something that somebody else has written with dependency 00:49:21.500 |
arrows, the first thing you have to work out is whether the arrows point from the head to the dependent or the other way around. 00:49:28.020 |
Now, one other thing here is that a sentence is seen as having the overall head word of 00:49:37.420 |
the sentence, which every other word of the sentence hangs off. 00:49:42.060 |
It's a common convention to add this sort of fake root to every sentence that then points 00:49:48.700 |
to the head word of the whole sentence here completed. 00:49:53.380 |
That just tends to make the algorithmic stuff easier, because then you can say that every 00:49:58.620 |
word of the sentence is dependent on precisely one other node, where what you can be dependent 00:50:05.980 |
on is either another word on the sentence or the fake root of the sentence. 00:50:11.060 |
And when we build our parsers, we will introduce that fake root. 00:50:18.740 |
So that's sort of dependency grammars and dependency structure. 00:50:26.000 |
I now want to get us back to natural language processing and starting to build parsers for 00:50:37.220 |
But before doing that, I just want to say, yeah, where do we get our data from? 00:50:44.020 |
And that's actually an interesting story in some sense. 00:50:49.420 |
So the answer to that is, well, what we do is get human beings, commonly linguists or 00:51:00.180 |
other people who are actually interested in the structure of human sentences, and we get 00:51:06.100 |
them to sit around and hand parse sentences and give them dependency structures, and we 00:51:13.420 |
collect a lot of those parses, and we call that a tree bank. 00:51:20.660 |
And so this is something that really only started happening in the late '80s and took off in the '90s. 00:51:30.820 |
Until then, no one had attempted to build tree banks. 00:51:34.300 |
Lots of people had attempted to build parsers, and it seemed like, well, if you want to build 00:51:40.060 |
a parser, the efficient way to do it is to start writing a grammar. 00:51:45.180 |
So you start writing some grammar rules, and you start writing a lexicon with words and 00:51:50.140 |
parts of speech, and you sit around working on your grammar. 00:51:54.980 |
When I was a PhD student, one of my first summer jobs was spending the summer handwriting a grammar. 00:52:02.100 |
And it sort of seems like writing a grammar is more efficient because you're writing this 00:52:06.900 |
one general thing that tells you the structure of a human language. 00:52:11.920 |
But there's just been this massive sea change, partly driven by the adoption of machine learning 00:52:16.640 |
techniques, where it's now seen as axiomatic that the way to make progress is to have annotated 00:52:25.400 |
data, namely here a tree bank that shows you the structure of sentences. 00:52:31.700 |
And so what I'm showing here is a teeny extract from a universal dependencies tree bank. 00:52:38.160 |
And so that's why I mentioned earlier that this has been this effort to try and have 00:52:42.740 |
a common dependency grammar representation that you can apply to lots of different human 00:52:49.300 |
And so you can go over to this URL and see that there's about 60 different languages 00:52:53.720 |
at the moment which have universal dependencies tree banks. 00:53:01.500 |
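To give a feel for what a Universal Dependencies treebank looks like on disk, here's a hedged sketch of the CoNLL-U style layout it uses: one word per line, tab-separated columns, with a head index (0 for the root) and a relation label. The example sentence and the trimming down to five columns are my own simplification.

```python
# Simplified treebank entry for "I ate fish" in a CoNLL-U-like layout:
# columns shown: index, word form, universal POS tag, head index, relation.
sample = "1\tI\tPRON\t2\tnsubj\n2\tate\tVERB\t0\troot\n3\tfish\tNOUN\t2\tobj"

for line in sample.splitlines():
    idx, form, upos, head, rel = line.split("\t")
    print(f"{form} <-{rel}- head {head}")
# I <-nsubj- head 2
# ate <-root- head 0
# fish <-obj- head 2
```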
I mean, it sort of seems like it's bad news if you have to have people sitting around hand-parsing sentences. 00:53:11.100 |
It seems a lot slower and actually a lot less useful than having somebody writing a grammar 00:53:19.580 |
which just has a much bigger multiplier factor in the utility of their effort. 00:53:27.980 |
It turns out that although that initial feeling seems sort of valid, that in practice there's 00:53:35.420 |
just a lot more you can do with the tree bank. 00:53:41.660 |
You know, one reason is that tree banks are highly reusable. 00:53:46.580 |
So typically when people have written grammars, they've written grammars for, you know, one 00:53:52.300 |
particular parser and the only thing it was ever used in is that one particular parser. 00:53:59.540 |
But when you build a tree bank, that's just a useful data resource and people use it for all sorts of purposes. 00:54:09.260 |
So the well-known tree banks have been used by hundreds and hundreds of people. 00:54:14.540 |
And although all tree banks were initially built for the purposes of, hey, let's help 00:54:20.340 |
natural language processing systems, it turns out that people have actually been able to use them for many other purposes. 00:54:28.220 |
So for example, these days, psycholinguists commonly use tree banks to get various kinds 00:54:34.220 |
of statistics about data for thinking about psycholinguistic models. 00:54:40.060 |
Linguists use tree banks for looking at patterns of different syntactic constructions that 00:54:45.140 |
occur that there's just been a lot of reuse of this data for all kinds of purposes. 00:54:52.300 |
But they have other advantages that I mentioned here. 00:54:55.380 |
You know, when people are just sitting around saying, oh, what sentences are good, they 00:54:59.820 |
tend to only think of the core of language, whereas lots of weird things happen in language. 00:55:05.780 |
And so if you actually just have some sentences and you have to go off and parse them, then 00:55:11.460 |
you actually have to deal with the totality of language. 00:55:15.540 |
Since you're parsing actual sentences, you get statistics. 00:55:19.620 |
So you naturally get the kind of statistics that are useful to machine learning systems 00:55:25.300 |
by constructing a tree bank, whereas you don't get them for free if you hand write a grammar. 00:55:30.980 |
But then a final way, which is perhaps the most important of all, is if you actually 00:55:38.980 |
want to be able to do science of building systems, you need a way to evaluate these systems. 00:55:49.100 |
I mean, it seems hard to believe now, but, you know, back in the '80s and '90s, when people 00:55:58.820 |
built NLP parsers, it was literally the case that the way they were evaluated was you said 00:56:05.940 |
to your friend, I built this parser, type in a sentence on the terminal and see what it does. 00:56:17.420 |
Whereas what we'd like to know is, well, as I showed you earlier, English sentences can 00:56:25.740 |
Can this system choose the right parses for particular sentences and therefore have the 00:56:33.220 |
basis of interpreting them as a human being would? 00:56:37.500 |
And well, we can only systematically do that evaluation if we have a whole bunch of sentences 00:56:43.020 |
that have been hand parsed by humans with their correct interpretations. 00:56:48.220 |
So the rise of tree banks turned parser building into an empirical science where people could 00:56:54.940 |
then compete rigorously on the basis of, look, my parser has 2% higher accuracy than your 00:57:01.980 |
parser in choosing the correct parses for sentences. 00:57:07.820 |
So well, how do we build a parser once we've got dependencies? 00:57:13.020 |
So there's sort of a bunch of sources of information that you could hope to use. 00:57:18.420 |
So one source of information is looking at the words on either end of the dependency. 00:57:26.020 |
So discussing issues, that seems a reasonable thing to say. 00:57:31.160 |
And so it's likely that issues could be the object of discussing. 00:57:38.300 |
Whereas if it was some other word, right, if you were thinking of making, you know, 00:57:46.060 |
outstanding the object of discussing, discussing outstanding, that doesn't sound right. 00:57:53.380 |
The second source of information is distance. 00:57:56.480 |
So most dependencies are relatively short distance. 00:58:02.180 |
Some of them are long distance dependencies, but they're relatively rare. 00:58:06.620 |
The vast majority of dependencies are nearby. 00:58:11.180 |
Another source of information is the intervening material. 00:58:14.800 |
So there are certain things that dependencies rarely span. 00:58:22.020 |
So clauses and sentences are normally organized around verbs. 00:58:28.360 |
And so dependencies rarely span across intervening verbs. 00:58:33.980 |
We can also use punctuation in written language, things like commas, which can give some indication 00:58:41.140 |
And so punctuation may also indicate bad places to have long distance dependencies over. 00:58:49.240 |
And there's one final source of information, which is what's referred to as valency, which 00:58:55.940 |
is, for a head, what kinds of dependents does it usually have around it? 00:59:02.080 |
So if you have a noun, there are things that you just know about what kinds of dependents it's likely to have. 00:59:10.920 |
So it's common that it will have a determiner to the left, the cat. 00:59:16.680 |
On the other hand, it's not going to be the case that there's a determiner to the right, 00:59:28.620 |
On the left, you're also likely to have an adjectival modifier. 00:59:35.900 |
But again, it's not so likely you're going to have the adjectival modifier over on the right. 00:59:44.160 |
So there are sort of facts about what things different kinds of words take on the left and the right. 00:59:53.040 |
And that's also a useful source of information. 00:59:56.660 |
Okay, so what do we need to do using that information to build a parser? 01:00:03.700 |
Well, effectively, what we do is have a sentence. 01:00:06.740 |
I'll give a talk tomorrow on neural networks. 01:00:09.580 |
But what we have to do is say for every word in that sentence, we have to choose some other 01:00:15.060 |
word that it's a dependent of where one possibility is it's a dependent of root. 01:00:22.500 |
So we're giving it a structure where we're saying, okay, for this word, I've decided 01:00:31.660 |
And then for this word, it's also dependent on networks. 01:00:49.880 |
So only one word is a dependent of root, and we don't have any cycles, so we have a tree. 01:00:56.740 |
So we don't want to say that word A is dependent on word B and word B is dependent on word A. 01:01:03.180 |
And then this one final issue, which is whether arrows can cross or not. 01:01:12.600 |
So in this particular sentence, we actually have these crossing dependencies, you can 01:01:17.700 |
see there, I'll give a talk tomorrow on neural networks. 01:01:22.780 |
And this is the correct dependency parse for this sentence. 01:01:26.840 |
Because what we have here is that it's a talk, and it's a talk on neural networks. 01:01:31.860 |
So the on neural networks modifies the talk, which leads to these crossing dependencies. 01:01:31.860 |
I could have said, I'll give a talk on neural networks tomorrow, and then on neural networks would be right next to talk and nothing would cross. 01:01:48.560 |
So most of the time, in languages, dependencies are projective, the things stay together. 01:01:57.140 |
So the dependencies have a kind of a nesting structure of the kind that you also see in context-free grammars. 01:02:04.580 |
But most languages have at least a few phenomena where you end up with this ability for 01:02:11.300 |
phrases to be split apart, which lead to non projective dependencies. 01:02:16.740 |
So in particular, one of them in English is that you can take modifying phrases and clauses 01:02:23.460 |
like the on neural networks here and shift them right towards the end of the sentence 01:02:28.380 |
and get I'll give a talk tomorrow on neural networks. 01:02:32.340 |
And that then leads to non projective sentences. 01:02:37.860 |
So a parse is projective if there are no crossing dependency arcs when the words are laid out 01:02:43.460 |
in their linear order with all arcs above the words. 01:02:48.080 |
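Here's a small sketch (my own, not the lecture's) of that definition as a check over head indices: a parse is non-projective exactly when two arcs cross, i.e. one arc's endpoints strictly interleave with the other's.

```python
def is_projective(heads):
    """heads[i] is the head of word i+1; 0 is the fake root.
    Two arcs cross if their endpoints strictly interleave."""
    arcs = [(min(i + 1, h), max(i + 1, h)) for i, h in enumerate(heads)]
    for a1, b1 in arcs:
        for a2, b2 in arcs:
            if a1 < a2 < b1 < b2:   # a2 falls inside the first arc, b2 outside
                return False
    return True

# "I 'll give a talk tomorrow on neural networks"
# The heads below are a rough guess at the parse, just for illustration:
# talk(5) heads networks(9), while give(3) heads tomorrow(6), so those arcs cross.
heads = [3, 3, 0, 5, 3, 3, 9, 9, 5]
print(is_projective(heads))  # False: arcs (5,9) and (3,6) cross
```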
And if you have a dependency parse that corresponds to a context free grammar tree, it actually 01:02:53.560 |
has to be projective because context free grammars necessarily have this sort of nested structure. 01:03:03.200 |
But dependency grammars normally allow non projective structures to account for displaced constituents. 01:03:10.440 |
And you can't easily get the semantics of certain constructions right without these non projective dependencies. 01:03:18.100 |
So here's another example in English with question formation with what's called preposition stranding. 01:03:26.200 |
So the sentence is, who did Bill buy the coffee from yesterday? 01:03:34.120 |
It's less natural in English, but I could have said, from whom did Bill buy the coffee yesterday? 01:03:44.240 |
In many languages of the world, that's the only way you could have said it. 01:03:48.600 |
And when you do that, from whom is kept together and you have a projective parse for the sentence. 01:03:56.920 |
But English allows and indeed much prefers you to do what is referred to as preposition 01:04:02.960 |
stranding where you move the who, but you just leave the preposition behind. 01:04:09.320 |
And so you get, who did Bill buy the coffee from yesterday? 01:04:13.660 |
And so then we're ending up with this non projective dependency structure as I've shown here. 01:04:20.200 |
Okay, I'll come back to non projectivity in a little bit. 01:04:26.720 |
How do we go about building dependency parsers? 01:04:30.520 |
Well, there are a whole bunch of ways that you can build dependency parsers. 01:04:37.120 |
Very quickly, I'll just say a few names and I'll tell you about one of them. 01:04:40.920 |
So you can use dynamic programming methods to build dependency parsers. 01:04:44.920 |
So I showed earlier that you can have an exponential number of parses for a sentence. 01:04:50.500 |
And that sounds like really bad news for building a system. 01:04:53.240 |
Well, it turns out that you can be clever and you can work out a way to dynamic program, 01:04:59.000 |
finding that exponential number of parses, and then you can have an O(n^3) algorithm. 01:05:07.500 |
You can use graph algorithms and I'll say a bit about that later, but that may spill over to a later lecture. 01:05:14.600 |
So you can see, since we're wanting to kind of connect up all the words into a tree using 01:05:22.360 |
graph edges, that you could think of doing that using a minimum spanning tree algorithm 01:05:27.880 |
of the sort that you hopefully saw in CS 161. 01:05:35.560 |
Constraint satisfaction ideas that you might have seen in CS 221 have been used for dependency parsing. 01:05:43.320 |
But the way I'm going to show now is transition based parsing, or what's sometimes referred to as deterministic dependency parsing. 01:05:51.720 |
And the idea of this is one's going to use a transition system. 01:06:00.560 |
If you've seen shift reduce parsing in something like a compilers class or formal languages 01:06:06.560 |
class, then shift and reduce are transition steps. 01:06:10.880 |
And so we use a transition system to guide the construction of a parse. 01:06:26.240 |
So this was an idea that was made prominent by Joakim Nivre, who's a Swedish computational 01:06:35.520 |
linguist who introduced this idea of greedy transition based parsing. 01:06:42.140 |
So his idea is, well, what we're going to do for dependency parsing is we're going to 01:06:49.000 |
be able to parse sentences by having a set of transitions which are kind of like shift 01:06:57.080 |
And it's going to just work left to right, bottom up, and parse a sentence. 01:07:03.200 |
So we're going to say we have a stack sigma, a buffer beta of the words that we have to process. 01:07:11.320 |
And we're going to build up a set of dependency arcs by using actions, which are shift and reduce actions. 01:07:18.640 |
And putting those together, this will give us the ability to put parse structures over sentences. 01:07:29.360 |
And this is a little bit hairy when you first see it. 01:07:35.040 |
And this kind of transition based dependency parser is what we'll use in assignment three. 01:07:44.480 |
So what we have, so this is our transition system. 01:07:47.360 |
We have a starting point where we start with a stack that just has the root symbol on it 01:07:53.880 |
and a buffer that has the sentence that we're about to parse. 01:07:59.080 |
And so far, we haven't built any dependency arcs. 01:08:04.200 |
And so at each point in time, we can choose one of three actions. 01:08:09.080 |
We can shift, which moves the next word onto the stack. 01:08:16.960 |
We can then do actions that are the reduce actions. 01:08:22.480 |
So there are two reduce actions to make it a dependency grammar. 01:08:26.840 |
We can either do a left arc reduce or a right arc reduce. 01:08:31.920 |
So when we do either of those, we take the top two items on the stack and we make one of them a dependent of the other. 01:08:42.360 |
So we can either say, okay, let's make WI a dependent of WJ, or else we can say, okay, let's make WJ a dependent of WI. 01:08:52.160 |
And so the result of when we do that is the one that's the dependent disappears from the stack. 01:09:00.720 |
And so in the stacks over here, there's one less item. 01:09:04.160 |
But then we add a dependency arc to our arc set so that we say that we've got either a 01:09:10.680 |
dependency from J to I or a dependency from I to J. 01:09:15.600 |
And commonly when we do this, we actually also specify what grammatical relation connects 01:09:22.080 |
the two, such as subject, object, or noun modifier. 01:09:35.320 |
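Here's a minimal sketch in Python (my own, not the assignment's starter code) of that arc-standard transition system: a stack, a buffer, an arc set, and the three actions. Relation labels are left out to keep it short.

```python
class ArcStandardParser:
    """Minimal arc-standard transition system: shift, left-arc, right-arc."""

    def __init__(self, words):
        self.stack = ["ROOT"]           # sigma: starts with just the fake root
        self.buffer = list(words)       # beta: the words left to process
        self.arcs = []                  # (head, dependent) pairs built so far

    def shift(self):
        self.stack.append(self.buffer.pop(0))

    def left_arc(self):                 # second-from-top becomes dependent of top
        dep = self.stack.pop(-2)
        self.arcs.append((self.stack[-1], dep))

    def right_arc(self):                # top becomes dependent of second-from-top
        dep = self.stack.pop()
        self.arcs.append((self.stack[-1], dep))

    def is_done(self):                  # buffer empty, only root left on the stack
        return not self.buffer and self.stack == ["ROOT"]
```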
So this is how a simple transition-based dependency parser, what's referred to as an arc standard 01:09:42.720 |
transition-based dependency parser, would parse up I ate fish. 01:09:47.440 |
So remember, these are the different operations that we can apply. 01:09:51.040 |
So to start off with, we have root on the stack and the sentence in the buffer, and nothing in our arc set yet. 01:09:59.380 |
So we have to choose one of the three actions. 01:10:02.600 |
And when there's only one thing on the stack, the only thing we can do is shift, so we shift I onto the stack. 01:10:15.040 |
And at this point, we have a choice because we could immediately reduce. 01:10:19.900 |
So we could say, okay, let's just make I a dependent of root, and we'd get back to a stack of size one. 01:10:28.100 |
But that would be the wrong thing to do because I isn't the head of the sentence. 01:10:34.320 |
So what we should instead do is shift again and get I ate on the stack and fish still in the buffer. 01:10:42.320 |
Well, at that point, we keep on parsing a bit further. 01:10:46.200 |
And so now what we can do is say, well, wait a minute: I is a dependent of ate, so we can do a left arc reduce, and I disappears from the stack. 01:11:02.560 |
But we add to the set of arcs that we've built the fact that I is the subject of ate. 01:11:09.600 |
Well, after that, we could reduce again, because there are still two things on the stack. 01:11:18.000 |
But the right thing to do next would be to shift fish onto the stack. 01:11:24.160 |
And then at that point, we can do a right arc reduce, saying that fish is the object 01:11:32.420 |
of ate, and add a new dependency to our dependency set. 01:11:37.720 |
And then we can one more time do a right arc reduce to say that ate is the root of the 01:11:45.500 |
whole sentence and add in that extra root relation with our pseudo-root. 01:11:51.180 |
And at that point, we've reached the end condition. 01:11:54.420 |
So the end condition was the buffer was empty. 01:11:57.300 |
And there's one thing, the root on the stack. 01:12:03.220 |
So this little transition machine does the parsing up of the sentence. 01:12:10.200 |
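As a sanity check on that walkthrough, here is roughly how the same sequence of actions looks if you drive the small sketch above by hand. The relation names nsubj, obj, and root follow Universal Dependencies conventions and are my choice for the example.

```python
config = ArcStandardConfig(["I", "ate", "fish"])

config.shift()             # stack: ROOT, I          buffer: ate fish
config.shift()             # stack: ROOT, I, ate     buffer: fish
config.left_arc("nsubj")   # I is the subject of ate;    stack: ROOT, ate
config.shift()             # stack: ROOT, ate, fish  buffer: (empty)
config.right_arc("obj")    # fish is the object of ate;  stack: ROOT, ate
config.right_arc("root")   # ate is the root of the sentence; stack: ROOT

print(config.arcs)
# [('ate', 'nsubj', 'I'), ('ate', 'obj', 'fish'), ('ROOT', 'root', 'ate')]
print(config.is_finished())  # True
```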
But there's one thing that's still left to explain here, which is how we choose the next action. 01:12:18.220 |
So as soon as you have two things or more on the stack, you've always got a choice of what to do next. 01:12:25.220 |
You could keep shifting, at least if there are still things in the buffer. 01:12:28.460 |
Or you can do a left arc, or you can do a right arc. 01:12:35.260 |
And one answer to that is to say, well, you don't know what choice is correct. 01:12:39.540 |
And that's why parsing is hard and sentences are ambiguous. 01:12:48.860 |
And well, if you naively explore all of them, then you do an exponential amount of work, 01:12:56.420 |
and doing some kind of search over those choices is essentially what people had done before. 01:13:10.540 |
But in the early 2000s, Joakim Nivre's essential observation was, but wait a minute. 01:13:21.260 |
So why don't I try and train a classifier which predicts what the next action I should 01:13:29.300 |
take is given this stack and buffer configuration? 01:13:33.900 |
Because if I can write a machine learning classifier which can nearly always correctly 01:13:41.940 |
predict the next action given a stack and buffer, then I'm in a really good position. 01:13:50.020 |
Because then I can build what's referred to as a greedy dependency parser which just goes, 01:14:02.420 |
Run classifier, choose next action, run classifier, choose next action, run classifier, choose 01:14:07.940 |
next action, so that the amount of work that we're doing becomes linear in the length of 01:14:14.500 |
the sentence rather than it being cubic in the length of the sentence using dynamic programming 01:14:22.180 |
or exponential in the length of the sentence if you don't use dynamic programming. 01:14:27.200 |
So at each step we predict the next action using some discriminative classifier. 01:14:34.680 |
So starting off, he was using things like support vector machines, but it can be anything at 01:14:40.080 |
all, like a softmax classifier, which is closer to our neural networks. 01:14:44.440 |
And there are, for what I presented, three classes, if you're just thinking of the 01:14:50.000 |
two reduce actions and the shift. Or, if you're also assigning a relation and you 01:14:55.280 |
have a set of R relations, like 20 relations, then that's the sort of 2R plus 1, or 41, moves that you can choose between. 01:15:04.720 |
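As a small aside on where that count comes from: one shift move plus a left-arc and a right-arc move per relation gives 2R + 1 moves. Here is a tiny, hypothetical enumeration; the relation names are placeholders, not a real relation inventory.

```python
# A made-up inventory of 20 relations (only the first few have realistic names).
relations = ["nsubj", "obj", "nmod", "amod", "det"] + [f"rel{i}" for i in range(15)]

moves = ["SHIFT"]
for rel in relations:
    moves.append(f"LEFT-ARC({rel})")
    moves.append(f"RIGHT-ARC({rel})")

print(len(relations), len(moves))  # 20 41  (i.e., 2R + 1 moves)
```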
And the features are effectively the configurations I was showing before: what's the word on top of the stack, 01:15:15.000 |
what's that word's part of speech, et cetera. 01:15:17.520 |
And so in the simplest way of doing this, you're now doing no search at all. 01:15:22.000 |
You just sort of take each configuration in turn, decide the most likely next move, and make it. 01:15:29.060 |
And that's a greedy dependency parser, which is widely used. 01:15:33.280 |
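Putting those pieces together, a greedy transition-based parser is essentially the following loop; this is a sketch, not the lecture's or the assignment's actual code. Here predict_action is a hypothetical stand-in for whatever trained classifier you like (an SVM, a softmax classifier, or later a neural network), and the configuration class is the one sketched earlier.

```python
def greedy_parse(words, predict_action):
    """Greedy arc-standard parsing: one classifier call per transition.
    A sentence of n words needs n shifts and n arc actions, so the total
    work is linear in the sentence length."""
    config = ArcStandardConfig(words)
    while not config.is_finished():
        # The classifier looks at the current stack/buffer configuration and
        # returns "SHIFT", ("LEFT-ARC", relation), or ("RIGHT-ARC", relation).
        # For simplicity we assume it only ever proposes legal actions.
        action = predict_action(config)
        if action == "SHIFT":
            config.shift()
        elif action[0] == "LEFT-ARC":
            config.left_arc(action[1])
        else:
            config.right_arc(action[1])
    return config.arcs
```

In assignment three, the classifier filling the predict_action role is a neural network, but the control loop around it looks much like this.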
You can do better if you want to do a lot more work. 01:15:36.560 |
So you can do what's called a beam search where you maintain a number of fairly good 01:15:42.640 |
parse prefixes at each step and you can extend them out further. 01:15:48.800 |
And then you can evaluate later on which of those seems to be the best. 01:15:53.520 |
And so beam search is one technique to improve dependency parsing by doing a lot of work. 01:16:00.920 |
And it turns out that although these greedy transition-based parsers are a fraction worse 01:16:09.440 |
than the best possible ways known to parse sentences, they actually work very accurately. 01:16:19.000 |
And they have this wonderful advantage that they give you linear time parsing in terms of the length of the sentence. 01:16:29.080 |
And so if you want to do a huge amount of parsing, they're just a fantastic thing to 01:16:34.440 |
use because you've then got an algorithm that scales to the size of the web. 01:16:44.240 |
So I'm kind of a little bit behind, so I guess I'm not going to get through all of 01:16:48.360 |
these slides today and we'll have to finish out the final slides tomorrow. 01:16:52.520 |
But just to push a teeny bit further, I'll just say a couple more things about what 01:16:59.920 |
Nivre did for dependency parsing, and then I'll sort of introduce the neural form of it. 01:17:07.720 |
So conventionally, you had this sort of stack and buffer configuration and you wanted to predict the next action. 01:17:16.200 |
And so the way that was done was by using symbolic features of this configuration. 01:17:24.040 |
And what kind of symbolic features did you use? 01:17:28.160 |
You'd use these indicator features that picked out a small subset, normally one to three elements, of the configuration. 01:17:36.800 |
So you'd have a feature that could be something like the thing on the top of the stack is 01:17:42.240 |
the word good, which is an adjective, or it could be the thing on the top of the stack 01:17:46.800 |
is an adjective and the thing that's first in the buffer is a noun, or it could just 01:17:52.520 |
be looking at one thing and saying the first thing in the buffer is a verb. 01:17:59.760 |
And because these features commonly involved words and commonly involved conjunctions of 01:18:06.440 |
several conditions, you had a lot of features. 01:18:11.480 |
And having mentions of words and conjunctions of conditions definitely helped to make these 01:18:20.600 |
parsers accurate. But nevertheless, because you had all of these sort of one-zero symbolic features, you had very large, sparse feature representations. 01:18:28.840 |
So commonly these parsers were built using something like a million to 10 million different features. 01:18:38.700 |
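To give a flavor of what those indicator features look like, here is a tiny, hypothetical feature extractor. The feature name strings and the assumption that each stack/buffer element carries a part-of-speech tag are mine; real feature sets of this kind combined many more templates, which is how you end up with millions of features.

```python
def symbolic_features(stack, buffer):
    """Return indicator features (as strings) for one parser configuration.
    Each feature looks at one to three elements of the stack and buffer;
    stack and buffer hold (word, pos_tag) pairs."""
    feats = []
    if stack:
        s1_word, s1_pos = stack[-1]
        feats.append(f"s1.word={s1_word}")   # e.g. top of stack is the word "good"
        feats.append(f"s1.pos={s1_pos}")     # e.g. top of stack is an adjective
    if buffer:
        b1_word, b1_pos = buffer[0]
        feats.append(f"b1.pos={b1_pos}")     # e.g. first thing in the buffer is a noun
        if stack:
            # A conjunction of conditions: top of stack is an adjective AND
            # the first thing in the buffer is a noun.
            feats.append(f"s1.pos={s1_pos}&b1.pos={b1_pos}")
    return feats

print(symbolic_features([("ROOT", "ROOT"), ("good", "JJ")], [("idea", "NN")]))
# ['s1.word=good', 's1.pos=JJ', 'b1.pos=NN', 's1.pos=JJ&b1.pos=NN']
```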
And I mentioned already the importance of evaluation. 01:18:41.720 |
Let me just sort of quickly say how these parsers were evaluated. 01:18:47.440 |
So to evaluate a parser on a particular sentence, that sentence, like the rest of our test set, was hand-parsed. 01:18:58.960 |
So we have gold dependencies of what the human thought were right. 01:19:03.120 |
And so we can write those dependencies down as statements of saying the first word is 01:19:11.720 |
the dependent of the second word via a subject dependency. 01:19:16.460 |
And then the parser is also going to make similar claims as to what's a dependent on what. 01:19:23.260 |
And so there are two common metrics that are used. 01:19:26.980 |
One is just: are you getting these dependency facts right? 01:19:33.660 |
And so that's referred to as the unlabeled accuracy score, where we're just sort of measuring, 01:19:39.980 |
of all of the dependencies in the gold sentence, how many of them we got right. 01:19:45.860 |
And remember, we have one dependency per word in the sentence. 01:19:53.700 |
So if we got, say, four of the five dependencies right, that's our unlabeled accuracy score of 80%. 01:19:57.260 |
And a slightly more rigorous evaluation is to say, well, no, we're also going to label 01:20:04.180 |
them and we're going to say that this is the subject. 01:20:17.340 |
And you also need to get the grammatical relation label right. 01:20:21.480 |
And so that's then referred to as labeled accuracy score. 01:20:25.220 |
And although I got those two right for that... 01:20:30.780 |
I guess according to this example, actually, this one is wrong. 01:20:40.060 |
So I only got two of the dependencies correct, in the sense that I both got what depends 01:20:48.660 |
on what and the relation label right. And so my labeled accuracy score is only 40%. 01:20:53.900 |
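Here is a short sketch of how those two scores could be computed, assuming the gold and predicted parses are given as a (head, relation) pair for each word; the numbers in the made-up example are chosen to reproduce the 80% unlabeled and 40% labeled scores from the discussion, not the actual sentence on the slide.

```python
def accuracy_scores(gold, predicted):
    """gold and predicted map each word index to a (head_index, relation) pair.
    Unlabeled score: fraction of words whose head is correct.
    Labeled score: fraction whose head AND relation label are both correct."""
    total = len(gold)
    unlabeled = sum(1 for w in gold if predicted[w][0] == gold[w][0])
    labeled = sum(1 for w in gold if predicted[w] == gold[w])
    return unlabeled / total, labeled / total

# Five words: four heads are right (80%), but only two arcs also carry the right label (40%).
gold = {1: (2, "nsubj"), 2: (0, "root"), 3: (5, "det"), 4: (5, "amod"), 5: (2, "obj")}
pred = {1: (2, "nsubj"), 2: (0, "root"), 3: (5, "nmod"), 4: (5, "nsubj"), 5: (3, "obj")}
print(accuracy_scores(gold, pred))  # (0.8, 0.4)
```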
So I'll stop there now for the introduction for dependency parsing. 01:20:58.540 |
I still have an IOU, which is how we can then bring neural nets into this picture and how 01:21:05.580 |
they can be used to improve dependency parsing. 01:21:09.300 |
So I'll do that at the start of next time before then proceeding further into neural