Back to Index

Stanford CS224N - NLP w/ DL | Winter 2021 | Lecture 4 - Syntactic Structure and Dependency Parsing


Chapters

0:00 Introduction
2:55 Meaning
7:18 Review
13:35 Dependency grammar
20:15 Why are we interested in syntactic structure
22:43 Examples of natural language ambiguities
27:52 Whales from space
28:22 Prepositional phrases
34:23 Coordination scope ambiguities
36:28 Newspaper headlines
37:43 Modifiers
41:46 Semantic Interpretation
43:12 What is a dependency grammar
45:12 History of dependency grammars
50:15 Universal dependencies tree banks
52:57 Why are tree banks good
57:05 Sources of information

Transcript

Okay, so for today, we're actually going to take a bit of a change of pace from what the last couple of lectures have been about. And we're going to focus much more on linguistics and natural language processing. And so in particular, we're going to start looking at the topic of dependency parsing.

And so this is the plan of what to go through today. So I'm going to start out by going through some ideas that have been used for the syntactic structure of languages of constituency and dependency and introduce those. And then focusing in more on dependency structure, I'm then going to look at dependency grammars and dependency tree banks.

And then having done that, we're then going to move back into thinking about how to build natural language processing systems. And so I'm going to introduce the idea of transition based dependency parsing. And then in particular, having developed that idea, I'm going to talk about a way to build a simple but highly effective neural dependency parser.

And so this simple, highly effective neural dependency parser is essentially what we'll be asking you to build in the third assignment. So in some sense, we're getting a little bit ahead of ourselves here, because in week two of the class, we teach you how to do both assignments two and three.

But all of this material will come in really useful. Before I get underway, just a couple of announcements. So again, for assignment two, you don't yet need to use the PyTorch framework. But now's a good time to work on getting PyTorch installed for your Python programming. Assignment three is in part also an introduction to using PyTorch.

It's got a lot of scaffolding included in the assignment. But beyond that, this Friday, we've got a PyTorch tutorial, and thoroughly encourage you to come along to that as well. Look for it under the Zoom tab. And in the second half, the first day of week four, we have an explicit class that partly focuses on the final projects and what the choices are for those.

But it's never too late to start thinking about the final project and what kind of things you want to do for the final project. Also do come meet with people. There are sort of resources on the course pages about what different TAs know about. I've also talked to a number of people about final projects, but clearly I can't talk to everybody.

So I encourage you to also be thinking about what you want to do for final projects. Okay, so what I wanted to do today was introduce how people think about the structure of sentences. And put structure on top of them to explain how human language conveys meaning. And so our starting point for meaning and essentially what we've dealt with with word vectors up until now is we have words.

And words are obviously an important part of the meaning of human languages. But for words in human languages, there's more that we can do with them in thinking about how to structure sentences. So in particular, the first most basic way that we think about words when we are thinking about how sentences are structured is we give to them what's called a part of speech.

We can say that cat is a noun, by is a preposition, door is another noun, cuddly is an adjective. And then the word the is given a different part of speech: if you saw any parts of speech in school, you were probably told it was an article.

Sometimes that is just put into the class of adjectives. In modern linguistics, and in what you'll see in the resources that we use, words like the are referred to as determiners. And the idea is that there's a bunch of words, including a and the, but also other words like this and that, or even every, which are words that occur at the beginning of something like the cuddly cat, and which have a determinative function of sort of picking out which cats they're referring to.

And so we refer to those as determiners. But it's not the case that when we want to communicate with language, that we just have this word salad where we say a bunch of words. We just say, you know, whatever, leaking, kitchen, tap, and let the other person put it together.

We put words together in a particular way to express meanings. And so therefore, languages have larger units of putting meaning together. And the question is how we represent and think about those. Now in modern work, in particular, in modern United States linguistics, or even what you see in computer science classes, when thinking about formal languages, the most common way to approach this is with the idea of context-free grammars, which you see at least a little bit of in 103, if you've done 103, what a linguist would often refer to as phrase structure grammars.

And the idea there is to say, well, there are bigger units in languages that we refer to as phrases. So something like the cuddly cat is a cat with some other words modifying it. And so we'll refer to that as a noun phrase. But then we have ways in which phrases can get larger by building things inside phrases.

So the door here is also a noun phrase, but then we can build something bigger around that with a preposition. So this is a preposition. And then we have a prepositional phrase. And in general, we can keep going. So we can then make something like the cuddly cat by the door, and then the door is a noun phrase.

The cuddly cat is a noun phrase. By the door is a prepositional phrase. But then when you put it all together, the whole of this thing becomes a bigger noun phrase. And so it's working with these ideas of nested phrases, what in context-free grammar terms you would refer to as non-terminals.

So noun phrase and prepositional phrase would be non-terminals in the context-free grammar. We can build up a bigger structure of human languages. So let's just do that for a little bit to review what happens here. So we start off saying, okay, you can say the cat and the dog.

And so those are noun phrases. And so we want a rule that can explain those. So we could say a noun phrase goes to a determiner followed by a noun. And then somewhere over to the side, we'd have a lexicon. And in our lexicon, we'd say that dog is a noun and cat is a noun and a is a determiner and the is a determiner.

Okay. So then we notice you can do a bit more than that. So you can say things like the large cat, a barking dog. So that suggests that in a noun phrase, after the determiner, there can optionally be an adjective. And then there's the noun, and that can explain some things we can say.

But we can also say the cat by the door or a barking dog in a crate. And so we can also put a prepositional phrase at the end and that's optional, but you can combine it together with an adjective. For the example I gave, like a barking dog on the table.

And so this grammar can handle that. So then we'll keep on and say, well, actually you can use multiple adjectives. So you can say a large barking dog, or a large barking cuddly cat. No, maybe not. Well, sentences like that. So we have any number of adjectives, which we can represent with a star, what's referred to as the Kleene star.

So that's good. Oh, but I forgot a bit actually. For by the door, I have to have a rule for producing by the door. So I also need a rule that's a prepositional phrase goes to a preposition followed by a noun phrase. And so then I also have to have prepositions and that can be in or on or by.

Okay. And you can make other sentences, of course, with this as well, like the large crate on the table or something like that, or the large crate on the large table. Okay. So I chug along and then, well, I could have something like talk to the cat. And so now I need more stuff.

So talk is a verb and to still looks like a preposition. So I need to be able to make up something with that as well. Okay. So what I can do is say, I can also have a rule for a verb phrase that goes to a verb. And then after that, for something like talk to the cat, it can take a prepositional phrase after it.

And then I can say that the verb goes to talk or walked. Okay. Then I can parse, and then I can cover those sentences. Okay. So that's the end of what I have here. So in this sort of a way, I'm handwriting a grammar. So here now I have this grammar and a lexicon.

And for the examples that I've written down here, this grammar and this lexicon is sufficient to parse these sorts of fragments, showing the expansions that I just wrote down. I mean, of course, there's a lot more to English than what you see here. Right? So if I have something like, you know, the cat walked behind the dog, then I need some more grammar rules.

So it seems then I need a rule that says I can have a sentence that goes to a noun phrase followed by a verb phrase. And I can keep on doing things of this sort. Let's see, one question that Ruth Ann asked was about what do the brackets mean?

And is the first NP different from the second? So for this notation on the brackets here, I mean, this is actually a common notation that's used in linguistics. It's sort of in some sense a little bit different to traditional computer science notation, since the star is used in both to mean zero or more of something.

So you could have zero, one, two, three, four, five adjectives. Somehow it's usual in linguistics that when you're using the star, you also put parentheses around it to mean it's optional. So sort of parentheses and star are used together to mean any number of something. When it's parentheses just by themselves, that's then meaning zero or one.

And then for are these two noun phrases different? No, they're both noun phrase rules. And so in our grammar, we can have multiple rules that expand noun phrase in different ways. But actually in my example here, my second rule, because I wrote it quite generally, it actually covers the first rule as well.

So actually at that point, I can cross out this first rule because I don't actually need it in my grammar. But in general, you have a choice between writing multiple rules for noun phrase goes to categories, which effectively gives you a disjunction or working out by various syntactic conventions how to compress them together.

Okay. So that was what gets referred to in natural language processing as constituency grammars, where the standard form of constituency grammar is a context-free grammar of the sort that I trust you saw at least a teeny bit of either in CS 103 or something like a programming languages compilers formal languages class.
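As a rough illustration (not a transcription of the lecture slide), the kind of handwritten grammar and lexicon described above might look like this in NLTK's CFG notation; the exact rules and word list here are a reconstruction and are only meant to show the shape of a constituency grammar.

```python
# A sketch of the handwritten toy grammar above, using NLTK's CFG support.
# The rule set and lexicon are reconstructed for illustration only.
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N | Det Adj N | NP PP
    PP  -> P NP
    VP  -> V PP | V NP
    Det -> 'the' | 'a'
    Adj -> 'cuddly' | 'large' | 'barking'
    N   -> 'cat' | 'dog' | 'door' | 'crate' | 'table'
    P   -> 'by' | 'in' | 'on' | 'to'
    V   -> 'talk' | 'walked'
""")

# A chart parser handles the recursive NP -> NP PP rule without trouble.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cuddly cat walked to the door".split()):
    tree.pretty_print()
```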

There are other forms of grammars that also pick out constituency. There are things like tree adjoining grammars, but I'm not going to really talk about any of those now. What I actually want to present is a somewhat different way of looking at grammar, which is referred to as dependency grammar, which puts a dependency structure over sentences.

Now actually, it's not that these two ways of looking at grammar have nothing to do with each other. I mean, there's a whole formal theory about the relationships between different kinds of grammars, and you can very precisely state relationships and isomorphisms between different grammars of different kinds. But on the surface, these two kinds of grammars look sort of different and emphasize different things.

And for reasons of their sort of closeness to picking out relationships and sentences and their ease of use, it turns out that in modern natural language processing, starting, I guess, around 2000, so really in the last 20 years, NLP people have really swung behind dependency grammars. So if you look around now where people are using grammars in NLP, by far the most common thing that's being used is dependency grammars.

So I'm going to teach us today a bit about those and for what we're going to build in assignment three is building, using supervised learning, a neural dependency parser. So the idea of dependency grammar is that when we have a sentence, what we're going to do is we're going to say for each word, what other words modify it.

So what we're going to do is when we say the large crate, we're going to say, okay, well, large is modifying crate and the is modifying crate. In the kitchen, the is modifying kitchen. In by the door, the is modifying door. And so I'm showing a modification, a dependency or an attachment relationship by drawing an arrow from the head to what's referred to in dependency grammar as the dependent.

The thing that modifies, further specifies or attaches to the head. Okay. So that's the start of this. Well, another dependency is that, for look in the large crate, where you're looking is in the large crate. So you're going to want to have in the large crate as being a dependent of look.

And so that's also going to be a dependency relationship here. And then there's one final bit that might seem a little bit confusing to people. And that's actually when we have these prepositions. There are two ways that you can think that this might work. So if it was something like look in the crate, it seems like that is a dependent of crate.

But you could think that you want to say look in and it's in the crate and give this dependency relationship with the sort of preposition as sort of thinking of it as the head of what was before our prepositional phrase. And that's a possible strategy in the dependency grammar.

But what I'm going to show you today and what you're going to use in the assignment is dependency grammars that follow the representation of universal dependencies. And universal dependencies is a framework which actually I was involved in creating, which was set up to try and give a common dependency grammar over many different human languages.

And in the design decisions that were made in the context of designing universal dependencies, what we decided was that, where in some languages you use prepositions, lots of other languages make much more use of case markings. So if you've seen something like German, you've seen more case markings like genitive and dative cases.

And in other languages like Latin or Finnish, lots of Native American languages, you have many more case markings again, which cover most of the role of prepositions. So in universal dependencies, essentially in the crate is treated like a case marked noun. And so what we say is that the in is also a dependent of crate.

And then you're looking in the crate. So in the structure we adopt, in is a dependent of crate. This in is a dependent of kitchen. This by is a dependent of door. And then we have these prepositional phrases in the kitchen, by the door, and we want to work out, well, what they modify.

Well, in the kitchen is modifying crate, right? Because it's a crate in the kitchen. So we're going to say that it's, this piece is a dependent of crate. And then, well, what about by the door? Well, it's not really meaning that it's a kitchen by the door. And it's not meaning to look by the door.

Again, it's a crate by the door. And so what we're going to have is the crate also has door as a dependent. And so that gives us our full dependency structure of this sentence. Okay. And so that's a teeny introduction to syntactic structure. I'm going to say a bit more about it and give a few more examples.
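To make that concrete, here is the full set of dependencies just described for "Look in the large crate in the kitchen by the door", written as (head, dependent, relation) triples. The relation names loosely follow Universal Dependencies conventions and are included only for illustration.

```python
# The dependency structure just described, as (head, dependent, relation) triples.
# Relation names roughly follow Universal Dependencies (obl, nmod, case, det, amod).
arcs = [
    ("look",    "crate",   "obl"),    # in the large crate is a dependent of look
    ("crate",   "in",      "case"),   # prepositions are treated like case markers
    ("crate",   "large",   "amod"),
    ("crate",   "the",     "det"),
    ("crate",   "kitchen", "nmod"),   # in the kitchen modifies crate
    ("kitchen", "in",      "case"),
    ("kitchen", "the",     "det"),
    ("crate",   "door",    "nmod"),   # by the door also modifies crate
    ("door",    "by",      "case"),
    ("door",    "the",     "det"),
]
```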

But let me just for a moment sort of say a little bit about, well, why are we interested in syntactic structure? Why do we need to know the structure of sentences? And this gets into how does human languages work? So human languages can communicate very complex ideas. I mean, in fact, you know, anything that humans know how to communicate to one another, they communicate pretty much by using words.

So we can structure and communicate very complex ideas. But we can't communicate a really complex idea by one word. We can't just, you know, choose a word like, you know, empathy and say it with a lot of meaning and say empathy and the other person is meant to understand everything about what that means.

Right? So we can compose a complex meaning that explains things by putting words together into bigger units. And the syntax of a language allows us to put words together into bigger units where we can build up and convey to other people a complex meaning. And so then the listener doesn't get this syntactic structure.

Right? The syntactic structure of the sentence is hidden from the listener. All the listener gets is a sequence of words, one after another, bang, bang, bang. So the listener has to be able to do what I was just trying to do in this example, that as the sequence of words comes in, that the listener works out which words modify which other words and therefore can construct the structure of the sentence and hence the meaning of the sentence.

And so in the same way, if we want to build clever neural net models that can understand the meaning of sentences, those clever neural net models also have to understand what is the structure of the sentence so that they can interpret the language correctly. And we'll go through some examples and see more of that.

Okay. So the fundamental point that we're going to sort of spend a bit more time on is that these choices of how you build up the structure of a language change the interpretation of the language. And a human listener or equally a natural language understanding program has to make in a sort of probabilistic fashion choices as to which words modify, i.e.

depend upon which other words so that they're coming up with the interpretation of the sentence that they think was intended by the person who said it. Okay. So to get a sense of this and how sentence structure is interesting and difficult, what I'm going to go through now is a few examples of different ambiguities that you find in natural language.

And I've got some funny examples from newspaper headlines, but these are all real natural language ambiguities that you find throughout natural language. Well, at this point I should say this is where I'm being guilty of saying natural language but meaning English. Some of these ambiguities you find in lots of other languages as well, but which ambiguities of syntactic structure you get partly depends on the details of the language.

So different languages have different syntactic constructions, different word orders, different amounts of words having different forms of words like case markings. And so depending on those details, there might be different ambiguities. So here's one ambiguity, which is one of the commonest ambiguities in English. So San Jose cops kill man with knife.

So this sentence has two meanings. Either it's the San Jose cops who are killing a man and they're killing a man with a knife. And so that corresponds to a dependency structure where the San Jose cops are the subject of killing, the man is the object of killing, and then the knife is then the instrument with which they're doing the killing.

So that the knife is an oblique modifier for the instrument of killing. And so that's one possible structure for this sentence. But it's probably not the right one. So what it actually probably was, was that it was a man with a knife and the San Jose cops killed the man.

So that corresponds to the knife then being a noun modifier of the man and then kill is still killing the man. So the man is the object of killing and the cops are still the subject. And so whenever you have a prepositional phrase like this, that's coming further on in a sentence, there's a choice of how to interpret it.

It could be either interpreted as modifying a noun phrase that comes before it, or it can be interpreted as modifying a verb that comes before it. So systematically in English, you get these prepositional phrase attachment ambiguities throughout all of our sentences. But to give two further observations on that, the first observation is you encounter sentences with prepositional phrase attachment ambiguities every time you read a newspaper article, every time you talk to somebody, but most of the time you never notice them.

And that's because our human brains are incredibly good at considering the possible interpretations and going with the one that makes sense according to context. The second comment, as I said, different human languages expose different ambiguities. So for example, this is an ambiguity that you normally don't get in Chinese because in Chinese prepositional phrases, modifying a verb are normally placed before the verb.

And so there you don't standardly get this ambiguity, but there are different other ambiguities that you find commonly in Chinese sentences. So this ambiguity you find everywhere because prepositional phrases are really common at the right ends of sentences. So here's another one, scientists count whales from space. So that gives us these two possible interpretations that there are whales from space and scientists are counting them.

And then the other one is that how the scientists are counting the whales is that they're counting them from space, and they're using satellites to count the whales, which is the correct interpretation that the newspaper hopes that you're getting. And this problem gets much, much more complex because many sentences in English have prepositional phrases all over the place.

So here's the kind of boring sentence that you find in the financial news. The board approved its acquisition by Royal Trust Co. Ltd. of Toronto for $27 a share at its monthly meeting. And while if you look at the structure of this sentence, what we find is, you know, here's a verb, then here's the object noun phrase.

So we've got the object noun phrase here. And then after that, what do we find? Well, we find a prepositional phrase, another prepositional phrase, another prepositional phrase, and another prepositional phrase. And how to attach each of these is then ambiguous. So the basic rule of how you can attach them is you can attach them to things to the left, providing you don't create crossing attachments.

So in principle, by Royal Trust Co. Ltd. could be attached to either approved or acquisition. But in this case, by Royal Trust Co. Ltd. it's the acquirer. So it's a modifier of the acquisition. Okay, so then we have of Toronto. So of Toronto could be modifying Royal Trust Co.

Ltd. It could be modifying the acquisition or it can be modifying the approved. And in this case, the of Toronto is telling you more about the company. And so it's a modifier of Royal Trust Co. Ltd. Okay, so then the next one is for $27 a share. And that could be modifying Toronto, Royal Trust Co.

Ltd., the acquisition or the approving. And well, in this case, that's talking about the price of the acquisition. So this one jumps back. And this is now a prepositional phrase that's modifying the acquisition. And then at the end, at its monthly meeting, well, that's where the approval is happening by the board.

So rather than any of these preceding four noun phrases, at its monthly meeting is modifying the approval. And so it attaches right back there. And this example is kind of too big. And so I couldn't fit it in one line. But as I think maybe you can see that, you know, none of these dependencies cross each other and they connect at different places ambiguously.

So because we can chain these prepositions like this and attach them at different places like this, human language sentences are actually extremely ambiguous. So the number, if you have a sentence with K prepositional phrases at the end of it, where here we have K equals four, the number of parses this sentence has, the number of different ways you can make these attachments is given by the Catalan numbers.

So the Catalan numbers are an exponentially growing series which arises in many tree-like contexts. So if you're doing something like triangulations of a polygon, you get Catalan numbers. If you're doing triangulation of graphical models in CS228, you get Catalan numbers. But we don't need to worry about the details here.

The central point is this is an exponential series. And so you're getting an exponential number of parses in terms of the number of prepositional phrases. And so in general, you know, the number of parses human languages have is exponential in their length, which is kind of bad news. Because if you're then trying to enumerate all the parses, you might fear that you really have to do a ton of work.
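As a quick check on that claim, here is a tiny computation of the Catalan numbers; by the count given in the lecture, four trailing prepositional phrases already allow 14 different attachment patterns.

```python
from math import comb

def catalan(k: int) -> int:
    # k-th Catalan number: C_k = binom(2k, k) / (k + 1)
    return comb(2 * k, k) // (k + 1)

# Attachment possibilities for k trailing prepositional phrases, per the lecture's claim:
for k in range(1, 8):
    print(k, catalan(k))   # 1, 2, 5, 14, 42, 132, 429 -- exponential growth
```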

The thing to notice about structures like these prepositional phrase attachment ambiguities is that there's nothing that resolves these ambiguities in terms of the structure of the sentence. So if you've done something like looked at the kind of grammars that are used in compilers, that the grammars used in compilers for programming languages are mainly made to be unambiguous.

And to the extent that there are any ambiguities, there are default rules that are used to say, choose this one particular parse tree for your piece of a programming language. And human languages just aren't like that. They're globally ambiguous. And the listening human is just meant to be smart enough to figure out what was intended.

So the analogy would be that, you know, in programming languages, when you're working out what does an else clause modify, well, you've got the answer that you can either look at the curly braces to work out what the else clause modifies, or if you're using Python, you look at the indentation and it tells you what the else clause modifies.

Where by contrast, for human languages, it would be just write down else something. Doesn't matter how you do it. You don't need parentheses. You don't need indentation. The human being will just figure out what the else clause is meant to pair up with. Okay. Lots of other forms of ambiguities in human languages.

So let's look at a few others. Another one that's very common over all sorts of languages is coordination scope ambiguities. So here's a sentence. Shuttle veteran and longtime NASA executive Fred Gregory appointed to board. Well, this is an ambiguous sentence. There are two possible readings of this. One reading is that there are two people.

There's a shuttle veteran and there's a longtime NASA executive Fred Gregory. And they were both appointed to the board, two people. And the other possibility is there's someone named Fred Gregory who's a shuttle veteran and longtime NASA executive, and they're appointed to the board, one person. And these two interpretations, again, correspond to having different parse structures.

So in one structure, we've got a coordination of the shuttle veteran and the longtime NASA executive Fred Gregory coordinated together. In one case, these are coordinated, and then Fred Gregory specifies the name of the NASA executive. So it's then specifying who that executive is, where in the other one, the shuttle veteran and longtime NASA executive all together is then something that is a modifier of Fred Gregory.

Okay, so in one, this is the unit that modifies Fred Gregory. In the other one up here, just longtime NASA executive modifies Fred Gregory, and then that's conjoined together with the shuttle veteran. And so that also gives different interpretations. So here's a slightly reduced example. I mean, newspaper headlines tend to be more ambiguous than many other pieces of text because they're written in this shortened form to get things to fit.

And this is an especially shortened form, where the explicit conjunction is actually left out. But this headline says, "Doctor: no heart, cognitive issues." And this was, I guess, after Trump's first physical. And this is an ambiguity, because there are two ways that you can read this. You can either read this as saying, "Doctor: no heart, and cognitive issues," which gives you one interpretation.

Instead of that, the way we should read it is that it's "heart or cognitive," and so it's then saying, "No heart or cognitive issues." And we have a different, narrower scope of the coordination, and then we get a different reading. Okay. I want to give a couple more examples of different kinds of ambiguities.

Another one you see quite a bit is when you have modifiers that are adjectives and adverbs, that there are different ways that you can have things modifying other things. This example is a little bit not safe for work, but here goes. "Students get first-hand job experience." So this is an ambiguous sentence, and again, we can think of it as a syntactic ambiguity in terms of which things modify which other things.

So the nice, polite way to render this sentence is that "first" is modifying "hand," so we've got "first-hand," it's "job experience," so "job" is a compound noun modifying "experience," and it's "first-hand experience," so "first-hand" is then modifying "experience," and then "get" is the object of -- sorry, "first-hand job experience" is the object of "get," and "students" are the subject of "get." But if you have a smuttier mind, you can interpret this a different way, and in the alternative interpretation, you then have "hand" going together with "job," and the "first" is then a modifier of "experience," and "job" is still a modifier of "experience," and so then you get this different parse structure and different interpretation there.

Okay, one more example. In a way, this example is similar to the previous one. It's sort of having modifier pieces that can modify different things, but rather than it just being with individual adjectives or individual adverbs, it's then much larger units, such as verb phrases, can often have attachment ambiguities.

So this sentence headline is "Mutilated body washes up on Rio beach to be used for Olympics beach volleyball." So we have this big verb phrase here of "to be used for Olympics beach volleyball," and then again we have this attachment decision that we could either say that that big verb phrase is modifying, i.e., is attached to the Rio beach, or we could say, "No, no, the 'to be used for Olympics beach volleyball,' that is modifying the mutilated body, and it's a body that's to be used for the Olympics beach volleyball," which gives the funny reading.

Yeah, so I hope that's given you at least a little bit of a sense of how human language syntactic structure is complex, ambiguous, and to work out the intended interpretations, you need to know something about that structure. In terms of how much you need to understand, I mean, you know, this isn't a linguistics class.

If you'd like to learn more about human language structure, you can go off and do a syntax class, but, you know, we're not really going to spend a lot of time working through language structure, but there will be some questions on this in the assignment, and so we're expecting that you can be at the level that you can have sort of some intuitions as to which words and phrases are modifying other words and phrases, and therefore you could choose between two dependency analyses which one's correct.

Okay, I've spent quite a bit of time on that, so better keep going. Okay, so the general idea is that knowing this sort of syntactic structure of a sentence can help us with semantic interpretation. I mean, as well as just generally saying we can understand language, it's also used in many cases for simple practical forms of semantic extraction, so people such as in biomedical informatics often want to get out particular relations such as protein-protein interactions, and well, here's a sentence.

The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB. And commonly people can get out those kinds of relationships by looking at patterns of dependency relations with particular verbs. So for the interacts verb, if you have a pattern of something being the subject and something else being the noun modifier of interacts, well, that's an interaction relationship. But it gets a bit more complicated than that, as in this example, because often there are conjunctions, so you also want to have another pattern where you also have interactions between the subject and the noun modifier's conjuncts, which will allow us to also find the KaiA and KaiB examples.
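Here is an illustrative sketch, not the actual system used in biomedical informatics, of how those two dependency patterns could be applied to the arcs of that sentence to pull out interaction pairs; the arcs and relation names are simplified.

```python
# Extracting protein interaction pairs from dependency triples using the two
# patterns described: subject + nominal modifier of "interacts", and anything
# conjoined with that modifier. Simplified, illustrative arcs only.
arcs = [
    ("interacts", "KaiC", "nsubj"),
    ("interacts", "SasA", "nmod"),
    ("SasA",      "KaiA", "conj"),
    ("SasA",      "KaiB", "conj"),
]

def interactions(arcs):
    subjects = [d for h, d, r in arcs if h == "interacts" and r == "nsubj"]
    mods     = [d for h, d, r in arcs if h == "interacts" and r == "nmod"]
    # second pattern: partners conjoined with a nominal modifier also count
    mods    += [d for h, d, r in arcs if r == "conj" and h in mods]
    return [(s, m) for s in subjects for m in mods]

print(interactions(arcs))   # [('KaiC', 'SasA'), ('KaiC', 'KaiA'), ('KaiC', 'KaiB')]
```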

Okay, so I've sort of given an informal tour of dependency grammar to just try and quickly say a little bit more about formally what a dependency grammar is. So in dependency syntax, what we say is that the syntactic structure of a sentence consists of relations between pairs of words, and it's a binary asymmetric relation, i.e.

we draw arrows between pairs of words, which we call dependencies. Now normally dependency grammars then type those arrows to express what kind of grammatical relation there is, so they have some kind of taxonomy of grammatical relations: we might have a subject grammatical relation, a verbal auxiliary grammatical relation, an oblique modifier grammatical relation, some kind of typology of grammatical relations.

And we refer to the arrows going between the head, here's the head here, and something that is a dependent of it, so the subject of a verb is a dependent of the verb, or when you have a noun modifier like our sort of cuddly cat, we say that cuddly is a dependent of cat, and so cat is the head of cuddly cat.

And so normally dependencies, like in these examples, form a tree, which formally means it's not just any graph with arrows: we have a graph which is connected, acyclic, and has a single root, so here's the root of the graph, and so that gives us a dependency tree analysis.
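As a small sketch of those constraints, here is one common way of representing a dependency parse, as an array of head indices with 0 standing for a fake root (a convention introduced properly a bit later), together with a check that the result really is a tree. The representation is standard; the code itself is only illustrative.

```python
# A dependency parse as a list of head indices: heads[i] is the head of word i+1,
# and 0 stands for the fake ROOT. Checks the tree constraints described above.
def is_valid_tree(heads):
    n = len(heads)
    if sum(1 for h in heads if h == 0) != 1:
        return False                      # exactly one word hangs off ROOT
    for i in range(1, n + 1):
        seen, h = {i}, heads[i - 1]
        while h != 0:
            if h in seen:                 # found a cycle
                return False
            seen.add(h)
            h = heads[h - 1]
    return True                           # acyclic and every word reaches ROOT

# "the cuddly cat": the and cuddly both depend on cat; cat hangs off ROOT
print(is_valid_tree([3, 3, 0]))   # True
```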

Dependency grammar has a really, really long history, so the famous first linguist was Panini, who wrote about the structure of Sanskrit, and mainly he worked on the sound system of Sanskrit and how sounds change in various contexts, which is what linguists call phonology, and the different forms of Sanskrit words, Sanskrit has rich morphology of inflecting nouns and verbs for different cases and forms, but he also worked a little on the syntactic structure of Sanskrit sentences, and essentially what he proposed was a dependency grammar over Sanskrit sentences.

And it turns out that sort of for most of recorded history, when people have then gone on and tried to put structures over human sentences, what they have used is dependency grammars, so there was a lot of work in the first millennium by Arabic grammarians trying to work out the grammar structure of sentences, and effectively what they used was akin to what I've just presented as a dependency grammar.

So compared to, you know, 2500 years of history, the ideas of having context-free grammars and having constituency grammars is actually a really, really recent invention, so it was really sort of in the middle of the 20th century that the ideas of constituency grammar and context-free grammars were developed, first by Wells in the 40s, and then by Noam Chomsky in the early 50s, leading to things like the Chomsky hierarchy that you might see in CS103 or a formal languages class.

So for modern work on dependency grammar, using kind of the terminology and notation that I've just introduced, that's normally attributed to Lucien Tesnière, who was a French linguist in around the sort of middle of the 20th century as well. Dependency grammar was widely used in the 20th century in a number of places.

I mean, in particular, it tends to be sort of much more natural and easier to think about for languages that have a lot of different case markings on nouns, like nominative, accusative, genitive, dative, instrumental kind of cases, like you get in a language like Latin or Russian, and a lot of those languages have much freer word order than English.

So the subject or object of, you know, in English, the subject has to be before the verb and the object has to be after the verb, but lots of other languages have much freer word order and instead use different forms of nouns to show you what's the subject or the object of the sentence.

And dependency grammars can often seem much more natural for those kinds of languages. Dependency grammars were also prominent in the very beginnings of computational linguistics. So one of the first people working in computational linguistics in the US was David Hays. The professional society for computational linguistics is called the Association for Computational Linguistics, and he was actually one of the founders of the Association for Computational Linguistics.

And he published, in the early 1960s, an early, perhaps the first, dependency grammar, sorry, dependency parser. Okay. Yeah, a little teeny note, just in case you see other things. When you have these arrows, you can draw them in either direction. You can either draw arrows from the head to the dependent or from the dependent to the head.

And actually, different people have done one and the other, right? So the way Tesnière drew them was to draw them from the head to the dependent, and we're following that convention. But, you know, if you're looking at something that somebody else has written with dependency arrows, the first thing you have to work out is, are they drawing the arrows from the heads or from the dependents?

Now, one other thing here is that a sentence is seen as having an overall head word, which every other word of the sentence hangs off. It's a common convention to add this sort of fake root to every sentence that then points to the head word of the whole sentence, here completed.

That just tends to make the algorithmic stuff easier, because then you can say that every word of the sentence is dependent on precisely one other node, where what you can be dependent on is either another word on the sentence or the fake root of the sentence. And when we build our parsers, we will introduce that fake root.

Okay. So that's sort of dependency grammars and dependency structure. I now want to get us back to natural language processing and starting to build parsers for dependency grammars. But before doing that, I just want to say, yeah, where do we get our data from? And that's actually an interesting story in some sense.

So the answer to that is, well, what we do is get human beings, commonly linguists or other people who are actually interested in the structure of human sentences, and we get them to sit around and hand parse sentences and give them dependency structures, and we collect a lot of those parses, and we call that a tree bank.

And so this is something that really only started happening in the late '80s and took off in a bigger way in the '90s. Until then, no one had attempted to build tree banks. Lots of people had attempted to build parsers, and it seemed like, well, if you want to build a parser, the efficient way to do it is to start writing a grammar.

So you start writing some grammar rules, and you start writing a lexicon with words and parts of speech, and you sit around working on your grammar. When I was a PhD student, one of my first summer jobs was spending the summer handwriting a grammar. And it sort of seems like writing a grammar is more efficient because you're writing this one general thing that tells you the structure of a human language.

But there's just been this massive sea change, partly driven by the adoption of machine learning techniques, where it's now seen as axiomatic that the way to make progress is to have annotated data, namely here a tree bank that shows you the structure of sentences. And so what I'm showing here is a teeny extract from a universal dependencies tree bank.

And so that's why I mentioned earlier that this has been this effort to try and have a common dependency grammar representation that you can apply to lots of different human languages. And so you can go over to this URL and see that there's about 60 different languages at the moment which have universal dependencies tree banks.

So why are tree banks good? I mean, it sort of seems like it's bad news if you have to have people sitting around for weeks and months hand parsing sentences. It seems a lot slower and actually a lot less useful than having somebody writing a grammar which just has a much bigger multiplier factor in the utility of their effort.

It turns out that although that initial feeling seems sort of valid, that in practice there's just a lot more you can do with the tree bank. So why are tree banks great? You know, one reason is that tree banks are highly reusable. So typically when people have written grammars, they've written grammars for, you know, one particular parser and the only thing it was ever used in is that one particular parser.

But when you build a tree bank, that's just a useful data resource and people use it for all kinds of things. So the well-known tree banks have been used by hundreds and hundreds of people. And although all tree banks were initially built for the purposes of, hey, let's help natural language processing systems, it turns out that people have actually been able to do lots of other things with tree banks.

So for example, these days, psycholinguists commonly use tree banks to get various kinds of statistics about data for thinking about psycholinguistic models. Linguists use tree banks for looking at patterns of different syntactic constructions that occur that there's just been a lot of reuse of this data for all kinds of purposes.

But they have other advantages that I mentioned here. You know, when people are just sitting around saying, oh, what sentences are good, they tend to only think of the core of language, whereas lots of weird things happen in language. And so if you actually just have some sentences and you have to go off and parse them, then you actually have to deal with the totality of language.

Since you're parsing actual sentences, you get statistics. So you naturally get the kind of statistics that are useful to machine learning systems by constructing a tree bank where you don't get them for free if you hand write a grammar. But then a final way, which is perhaps the most important of all, is if you actually want to be able to do science of building systems, you need a way to evaluate these NLP systems.

I mean, it seems hard to believe now, but, you know, back in the 90s, 80s, when people built NLP parsers, it was literally the case that the way they were evaluated was you said to your friend, I built this parser, type in a sentence on the terminal and see what it gives you back.

It's pretty good, hey? And that was just the way business was done. Whereas what we'd like to know is, well, as I showed you earlier, English sentences commonly can have lots of different parses. Can this system choose the right parses for particular sentences and therefore have the basis of interpreting them as a human being would?

Well, we can only systematically do that evaluation if we have a whole bunch of sentences that have been hand parsed by humans with their correct interpretations. So the rise of tree banks turned parser building into an empirical science where people could then compete rigorously on the basis of, look, my parser has 2% higher accuracy than your parser in choosing the correct parses for sentences.

Okay. So well, how do we build a parser once we've got dependencies? So there's sort of a bunch of sources of information that you could hope to use. So one source of information is looking at the words on either end of the dependency. So discussing issues, that seems a reasonable thing to say.

And so it's likely that issues could be the object of discussing. Whereas if it was some other word, right, if you were thinking of making, you know, outstanding the object of discussing, discussing outstanding, that doesn't sound right. So that wouldn't be so good. The second source of information is distance.

So most dependencies are relatively short distance. Some of them aren't. Some of them are long distance dependencies, but they're relatively rare. The vast majority of dependencies are nearby. Another source of information is the intervening material. So there are certain things that dependencies rarely span. So clauses and sentences are normally organized around verbs.

And so dependencies rarely span across intervening verbs. We can also use punctuation in written language, things like commas, which can give some indication of the structure. And so punctuation may also indicate bad places to have long distance dependencies over. And there's one final source of information, which is what's referred to as valency, which is, for a head, what kinds of dependents does it usually have around it?

So if you have a noun, there are things that you just know about what kinds of dependents nouns normally have. So it's common that it will have a determiner to the left, the cat. On the other hand, it's not going to be the case that there's a determiner to the right, cat, the.

That's just not what you get in English. On the left, you're also likely to have an adjectival modifier. That's where we had cuddly. But again, it's not so likely you're going to have the adjectival modifier over on the right for cuddly. So there are sort of facts about what things different kinds of words take on the left and the right.

And so that's the valency of the heads. And that's also a useful source of information. Okay, so what do we need to do using that information to build a parser? Well, effectively, what we do is have a sentence. I'll give a talk tomorrow on neural networks. But what we have to do is say for every word in that sentence, we have to choose some other word that it's a dependent of where one possibility is it's a dependent of root.

So we're giving it a structure where we're saying, okay, for this word, I've decided that it's a dependent on networks. And then for this word, it's also dependent on networks. And for this word, it's a dependent on give. So we're choosing one for each word. And there are usually a few constraints.

So only one word is a dependent of root, we have a tree. We don't want cycles. So we don't want to say that word A is dependent on word B and word B is dependent on word A. And then this one final issue, which is whether arrows can cross or not.

So in this particular sentence, we actually have these crossing dependencies, you can see there, I'll give a talk tomorrow on neural networks. And this is the correct dependency parse for this sentence. Because what we have here is that it's a talk, and it's a talk on neural networks. So the on neural networks modifies the talk, but which leads to these crossing dependencies.

I didn't have to say it like that. I could have said, I'll give a talk on neural networks tomorrow, and then on neural networks would be next to the talk. So most of the time, in languages, dependencies are projective, the things stay together. So the dependencies have a kind of a nesting structure of the kind that you also see in context free grammars.

But most languages have at least a few phenomena where you end up with this ability for phrases to be split apart, which leads to non-projective dependencies. So in particular, one of them in English is that you can take modifying phrases and clauses like the on neural networks here and shift them right towards the end of the sentence and get I'll give a talk tomorrow on neural networks. And that then leads to non-projective sentences.

And that then leads to non projective sentences. So a parse is projective if there are no crossing dependency arcs when the words are laid out in their linear order with all arcs above the words. And if you have a dependency parse that corresponds to a context free grammar tree, it actually has to be projective because context free grammars necessarily have this sort of nested tree structure following the linear order.

But dependency grammars normally allow non projective structures to account for displaced constituents. And you can't easily get the semantics of certain constructions right without these non projective dependencies. So here's another example in English with question formation with what's called preposition stranding. So the sentence is, who did Bill buy the coffee from yesterday?

There's another way I could have said this. It's less natural in English, but I could have said, from who did Bill buy the coffee yesterday? In many languages of the world, that's the only way you could have said it. And when you do that, from who is kept together and you have a projective parse for the sentence.

But English allows and indeed much prefers you to do what is referred to as preposition stranding where you move the who, but you just leave the preposition behind. And so you get, who did Bill buy the coffee from yesterday? And so then we're ending up with this non projective dependency structure as I've shown there.

Okay, I'll come back to non projectivity in a little bit. How do we go about building dependency parsers? Well, there are a whole bunch of ways that you can build dependency parsers. Very quickly, I'll just say a few names and I'll tell you about one of them. So you can use dynamic programming methods to build dependency parsers.

So I showed earlier that you can have an exponential number of parses for a sentence. And that sounds like really bad news for building a system. Well, it turns out that you can be clever and you can work out a way to dynamic program over that exponential number of parses, and then you can have an O(n³) algorithm.

So you could do that. You can use graph algorithms and I'll say a bit about that later, but that may spill into next time. So you can see, since we're wanting to kind of connect up all the words into a tree using graph edges, that you could think of doing that using a minimum spanning tree algorithm of the sort that you hopefully saw in CS 161.

And so that idea has been used for parsing. Constraint satisfaction ideas that you might have seen in CS 221 have been used for dependency parsing. But the way I'm going to show now is transition based parsing, or sometimes referred to as deterministic dependency parsing. And the idea of this is one's going to use a transition system.

So that's like shift-reduce parsing. If you've seen shift-reduce parsing in something like a compilers class or formal languages class, shift and reduce are transition steps. And so we use a transition system to guide the construction of parses. And so let me just explain about that. So let's see.

So this was an idea that was made prominent by Joakim Nivre, who's a Swedish computational linguist who introduced this idea of greedy transition based parsing. So his idea is, well, what we're going to do for dependency parsing is we're going to be able to parse sentences by having a set of transitions which are kind of like shift reduce parser.

And it's going to just work left to right, bottom up, and parse a sentence. So we're going to say we have a stack sigma, a buffer beta of the words that we have to process. And we're going to build up a set of dependency arcs by using actions, which are shift and reduce actions.

And putting those together, this will give us the ability to put parse structures over sentences. And let me go through the details of this. And this is a little bit hairy when you first see it. It's not so complex, really. And this kind of transition based dependency parser is what we'll use in assignment three.

So what we have, so this is our transition system. We have a starting point where we start with a stack that just has the root symbol on it and a buffer that has the sentence that we're about to parse. And so far, we haven't built any dependency arcs. And so at each point in time, we can choose one of three actions.

We can shift, which moves the next word onto the stack. We can then do actions that are the reduce actions. So there are two reduce actions to make it a dependency grammar. We can either do a left arc reduce or a right arc reduce. So when we do either of those, we take the top two items on the stack and we make one of them a dependent of the other one.

So we can either say, okay, let's make WI a dependent of WJ or else we can say, okay, let's make WJ a dependent of WI. And so the result of when we do that is the one that's the dependent disappears from the stack. And so in the stacks over here, there's one less item.

But then we add a dependency arc to our arc set so that we say that we've got either a dependency from J to I or a dependency from I to J. And commonly when we do this, we actually also specify what grammatical relation connects the two, such as subject, object, noun, modifier.

And so we also have here a relation. That's probably still very abstract. So let's go through an example. So this is how a simple transition-based dependency parser, what's referred to as an arc-standard transition-based dependency parser, would parse up I ate fish. So remember, these are the different operations that we can apply.

So to start off with, we have root on the stack and the sentence in the buffer, and we have no dependency arcs constructed. So we have to choose one of the three actions. And when there's only one thing on the stack, the only thing we can do is shift.

So we shift. And now the stack looks like this. So now we have to take another action. And at this point, we have a choice because we could immediately reduce. So we could say, okay, let's just make I a dependent of root and we'd get a stack size of one again.

But that would be the wrong thing to do because I isn't the head of the sentence. So what we should instead do is shift again and get I ate on the stack and fish still in the buffer. Well, at that point, we keep on parsing a bit further. And so now what we can do is say, well, wait a minute.

Now I is a dependent of ate. And so we can do a left arc reduce. And so I disappears from the stack. So here's our new stack. But we add to the set of arcs that I is the subject of ate. Okay. Well, after that, we could reduce again because there's still two things on the stack.

But that'd be the wrong thing to do. The right thing to do next would be to shift fish onto the stack. And then at that point, we can do a right arc reduce saying that fish is the object of ate and add a new dependency to our dependency set.

And then we can one more time do a right arc reduce to say that ate is the root of the whole sentence and add in that extra root relation with our pseudo root. And at that point, we've reached the end condition. So the end condition was that the buffer was empty.

And there's one thing, the root, on the stack. And at that point, we can finish. So this little transition machine does the parsing up of the sentence.
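Here is a minimal sketch of that arc-standard machine in code, run on the oracle action sequence for I ate fish, with the relation labels omitted; the function name and the action encoding are just for illustration, not the assignment's interface.

```python
# A minimal arc-standard transition machine.
# Actions: "S" = shift, "LA" = left-arc reduce, "RA" = right-arc reduce.
def arc_standard(words, actions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for a in actions:
        if a == "S":                       # shift: move next buffer word onto the stack
            stack.append(buffer.pop(0))
        elif a == "LA":                    # left-arc: second-top becomes dependent of top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif a == "RA":                    # right-arc: top becomes dependent of second-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# Oracle sequence for "I ate fish", as walked through above.
print(arc_standard(["I", "ate", "fish"], ["S", "S", "LA", "S", "RA", "RA"]))
# [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```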

But there's one thing that's left to explain still here, which is how do you choose the next action? As soon as you have two or more things on the stack, what you do next, you've always got a choice. You could keep shifting, at least if there's still things on the buffer. Or you can do a left arc or you can do a right arc. And how do you know what choice is correct? And one answer to that is to say, well, you don't know what choice is correct.

And that's why parsing is hard and sentences are ambiguous. You can do any of those things. You have to explore all of them. And well, if you naively explore all of them, then you do an exponential amount of work to parse the sentence. So in the early 2000s, Joachim Nivre's -- and that's essentially what people had done in the '80s and '90s, is explore every path.

But in the early 2000s, Joachim Nivre's essential observation was, but wait a minute. We know about machine learning now. So why don't I try and train a classifier which predicts what the next action I should take is given this stack and buffer configuration? Because if I can write a machine learning classifier which can nearly always correctly predict the next action given a stack and buffer, then I'm in a really good position.

Because then I can build what's referred to as a greedy dependency parser which just goes, bang, bang, bang, word at a time. Okay. Here's the next thing. Run classifier, choose next action, run classifier, choose next action, run classifier, choose next action, so that the amount of work that we're doing becomes linear in the length of the sentence rather than it being cubic in the length of the sentence using dynamic programming or exponential in the length of the sentence if you don't use dynamic programming.
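In rough pseudocode form, that greedy loop looks something like the sketch below; extract_features, classifier, and apply_action are placeholders standing in for whatever feature extractor and trained classifier you use, not any particular library's API.

```python
# A hedged sketch of greedy transition-based parsing: run the classifier once per
# step, apply its chosen action, never backtrack. Linear in sentence length.
def greedy_parse(words, classifier, extract_features, apply_action):
    stack, buffer, arcs = ["ROOT"], list(words), []
    while buffer or len(stack) > 1:
        feats = extract_features(stack, buffer)
        action = classifier.predict(feats)          # shift, left-arc, or right-arc (+ label)
        stack, buffer, arcs = apply_action(action, stack, buffer, arcs)
    return arcs
```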

So at each step we predict the next action using some discriminative classifier. So starting off he was using things like support vector machines, but it can be anything at all, like a softmax classifier that's closer to our neural networks. And there are, for what I presented, three classes if you're just thinking of the two reduce actions and the shift; or, if you're also assigning a relation and you have a set of R relations, say 20 relations, then that's 2R + 1, so 41, moves that you could decide on at each point.

And the features are effectively the configurations I was showing before. What's the top of the stack word? What part of speech is it? What's the first word in the buffer? What's that word's part of speech, et cetera? And so in the simplest way of doing this, you're now doing no search at all.

You just sort of take each configuration in turn, decide the most likely next move, and you make it. And that's a greedy dependency parser, which is widely used. You can do better if you want to do a lot more work. So you can do what's called a beam search where you maintain a number of fairly good parse prefixes at each step and you can extend them out further.

And then you can evaluate later on which of those seems to be the best. And so beam search is one technique to improve dependency parsing by doing a lot of work. And it turns out that although these greedy transition-based parsers are a fraction worse than the best possible ways known to parse sentences, they actually work very accurately, almost as well.

And they have this wonderful advantage that they give you linear time parsing in terms of the length of your sentences and text. And so if you want to do a huge amount of parsing, they're just a fantastic thing to use because you've then got an algorithm that scales to the size of the web.

Okay. So I'm kind of a little bit behind, so I guess I'm not going to get through all of these slides today and we'll have to finish out the final slides tomorrow. But just to push a teeny bit further, I'll just say a couple more things on what Nivre did for dependency parsing, and then I'll sort of introduce the neural form of that in the next class.

So conventionally, you had this sort of stack and buffer configuration and you wanted to build a machine learning classifier. And so the way that was done was by using symbolic features of this configuration. And what kind of symbolic features did you use? You used these indicator features that picked out a small subset, normally one to three elements, of the configuration.

So you'd have a feature that could be something like the thing on the top of the stack is the word good, which is an adjective, or it could be the thing on the top of the stack is an adjective and the thing that's first in the buffer is a noun, or it could just be looking at one thing and saying the first thing in the buffer is a verb.

So you'd have all of these features. And because these features commonly involved words and commonly involved conjunctions of several conditions, you had a lot of features. And having mentions of words and conjunctions of conditions definitely helped to make these parsers work better. But nevertheless, because you had all of these sort of one-zero symbolic features, you had a ton of such features.
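A hedged sketch of what such indicator feature templates might look like when extracted from a configuration; the template names and the exact choice of elements here are hypothetical, just to show why the feature space blows up into the millions.

```python
# Illustrative indicator feature templates over a (stack, buffer) configuration.
# Template names (s1.w, s1.t, b1.t, ...) are made up, not a standard.
def indicator_features(stack, buffer, pos):
    feats = []
    if stack:
        s1 = stack[-1]
        feats.append(f"s1.w={s1}")                    # word on top of the stack
        feats.append(f"s1.t={pos[s1]}")               # its part of speech
        feats.append(f"s1.w={s1}&s1.t={pos[s1]}")     # conjunction of the two
    if buffer:
        b1 = buffer[0]
        feats.append(f"b1.t={pos[b1]}")               # POS of first word in buffer
        if stack:
            feats.append(f"s1.t={pos[stack[-1]]}&b1.t={pos[b1]}")
    return feats

print(indicator_features(["ROOT", "good"], ["movie"], {"good": "JJ", "movie": "NN"}))
```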

So commonly these parsers were built using something like a million to 10 million different features of sentences. And I mentioned already the importance of evaluation. Let me just sort of quickly say how these parsers were evaluated. So to evaluate a parser for a particular sentence, it was hand parsed, our test set was hand parsed in the tree bank.

So we have gold dependencies of what the human thought were right. And so we can write those dependencies down as statements of saying the first word is the dependent of the second word via a subject dependency. And then the parser is also going to make similar claims as to what's a dependent on what.

And so there are two common metrics that are used. One is just, are you getting these dependency facts right? So both of these dependency facts match. And so that's referred to as the unlabeled accuracy score, where we're just sort of measuring accuracy over all of the dependencies in the gold sentence.

And remember, we have one dependency per word in the sentence. So here we have five. How many of them are correct? And that's our unlabeled accuracy score of 80%. And a slightly more rigorous evaluation is to say, well, no, we're also going to label them and we're going to say that this is the subject.

That's actually called the root. This one's the object. So these dependencies have labels. And you also need to get the grammatical relation label right. And so that's then referred to as labeled accuracy score. And although I got those two right for that as... I guess according to this example, actually, this is wrong.

It looks like I got... Oh, no, this is wrong there. Sorry, that one's wrong there. Okay. So I only got two of the dependencies correct in the sense that I both got what depends on what and the label correct. And so my labeled accuracy score is only 40%. Okay.
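As a tiny illustration of those two metrics, here is a sketch that computes them for a made-up five-word example chosen to mirror the 80% and 40% numbers just mentioned; the actual sentence and labels from the slide are not reproduced here.

```python
# Unlabeled (UAS) and labeled (LAS) accuracy over per-word (head, relation) pairs.
# The gold/predicted values are made up to reproduce the 80% / 40% example above.
def uas_las(gold, pred):
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las

gold = [(2, "nsubj"), (0, "root"), (4, "det"),  (2, "obj"),  (2, "obl")]
pred = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "nmod"), (4, "obl")]
print(uas_las(gold, pred))   # (0.8, 0.4): 4/5 heads right, only 2/5 also correctly labeled
```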

So I'll stop there now for the introduction for dependency parsing. I still have an IOU, which is how we can then bring neural nets into this picture and how they can be used to improve dependency parsing. So I'll do that at the start of next time before then proceeding further into neural language models.

Thank you.