Stanford CS224N - NLP w/ DL | Winter 2021 | Lecture 4 - Syntactic Structure and Dependency Parsing
Chapters
0:00 Introduction
2:55 Meaning
7:18 Review
13:35 Dependency grammar
20:15 Why are we interested in syntactic structure
22:43 Examples of natural language ambiguities
27:52 Whales from space
28:22 Prepositional phrases
34:23 Coordination scope ambiguities
36:28 Newspaper headlines
37:43 Modifiers
41:46 Semantic Interpretation
43:12 What is a dependency grammar
45:12 History of dependency grammars
50:15 Universal dependencies tree banks
52:57 Why are tree banks good
57:05 Sources of information
00:00:00.000 |
Okay, so for today, we're actually going to take a bit of a change of pace from what we've been doing so far. 00:00:15.480 |
And we're going to focus much more on linguistics and natural language processing. 00:00:21.800 |
And so in particular, we're going to start looking at the topic of dependency parsing. 00:00:32.120 |
And so this is the plan of what to go through today. 00:00:34.820 |
So I'm going to start out by going through some ideas that have been used for the syntactic 00:00:39.460 |
structure of languages, constituency and dependency, and introduce those. 00:00:44.840 |
And then focusing in more on dependency structure, I'm then going to look at dependency grammars 00:00:55.680 |
And then having done that, we're then going to move back into thinking about how to build dependency parsers. 00:01:02.600 |
And so I'm going to introduce the idea of transition based dependency parsing. 00:01:07.400 |
And then in particular, having developed that idea, I'm going to talk about a way to build 00:01:13.080 |
a simple but highly effective neural dependency parser. 00:01:17.640 |
And so this simple, highly effective neural dependency parser is essentially what we'll 00:01:22.240 |
be asking you to build in the third assignment. 00:01:26.120 |
So in some sense, we're getting a little bit ahead of ourselves here, because in week two 00:01:30.040 |
of the class, we teach you how to do both assignments two and three. 00:01:35.660 |
But all of this material will come in really useful. 00:01:39.840 |
Before I get underway, just a couple of announcements. 00:01:44.760 |
So again, for assignment two, you don't yet need to use the PyTorch framework. 00:01:53.080 |
But now's a good time to work on getting PyTorch installed for your Python programming. 00:02:00.420 |
Assignment three is in part also an introduction to using PyTorch. 00:02:04.480 |
It's got a lot of scaffolding included in the assignment. 00:02:08.020 |
But beyond that, this Friday, we've got a PyTorch tutorial, and I thoroughly encourage you to go along to that. 00:02:18.880 |
And in the second half, the first day of week four, we have an explicit class that partly 00:02:26.480 |
focuses on the final projects and what the choices are for those. 00:02:30.720 |
But it's never too early to start thinking about the final project and what kind of things you might want to do. 00:02:40.720 |
There are sort of resources on the course pages about what different TAs know about. 00:02:45.640 |
I've also talked to a number of people about final projects, but clearly I can't talk to everybody. 00:02:51.040 |
So I encourage you to also be thinking about what you want to do for final projects. 00:02:55.320 |
Okay, so what I wanted to do today was introduce how people think about the structure of sentences. 00:03:07.160 |
And put structure on top of them to explain how human language conveys meaning. 00:03:13.700 |
And so our starting point for meaning, and essentially what we've dealt with so far with word vectors, is words. 00:03:23.160 |
And words are obviously an important part of the meaning of human languages. 00:03:29.160 |
But for words in human languages, there's more that we can do with them in thinking 00:03:39.400 |
So in particular, the first most basic way that we think about words when we are thinking 00:03:45.560 |
about how sentences are structured is we give to them what's called a part of speech. 00:03:51.120 |
We can say that cat is a noun, by is a preposition, door is another noun, cuddly is an adjective. 00:04:02.760 |
And then for the word the, it gets a different part of speech; if you saw any 00:04:08.720 |
parts of speech in school, you were probably told it was an article. 00:04:14.240 |
Sometimes that is just put into the class of adjectives. 00:04:18.280 |
In modern linguistics and what you'll see in the resources that we use, words like the are called determiners. 00:04:25.980 |
And the idea is that there's a bunch of words that includes a and the, but also other words like 00:04:31.240 |
this and that, or even every, which are words that occur at the beginning of something like 00:04:41.760 |
the cuddly cat, which have a determinative function of sort of picking out which cats you're talking about. 00:04:53.520 |
But it's not the case that when we want to communicate with language, that we just have 00:04:58.320 |
this word salad where we say a bunch of words. 00:05:02.520 |
We just say, you know, whatever, leaking, kitchen, tap, and let the other person put it together. 00:05:10.160 |
We put words together in a particular way to express meanings. 00:05:14.300 |
And so therefore, languages have larger units of putting meaning together. 00:05:20.080 |
And the question is how we represent and think about those. 00:05:24.640 |
Now in modern work, in particular, in modern United States linguistics, or even what you 00:05:34.880 |
see in computer science classes, when thinking about formal languages, the most common way 00:05:40.720 |
to approach this is with the idea of context-free grammars, which you see at least a little 00:05:46.760 |
bit of in 103, if you've done 103, what a linguist would often refer to as phrase structure grammars. 00:05:54.640 |
And the idea there is to say, well, there are bigger units in languages that we refer to as phrases. 00:06:02.040 |
So something like the cuddly cat is a cat with some other words modifying it, so it's a noun phrase. 00:06:12.120 |
But then we have ways in which phrases can get larger by building things inside phrases. 00:06:22.240 |
So the door here is also a noun phrase, but then we can build something bigger around it. 00:06:37.000 |
So we can then make something like the cuddly cat by the door, and then the door is a noun phrase inside the prepositional phrase by the door. 00:06:50.840 |
But then when you put it all together, the whole of this thing becomes a bigger noun phrase. 00:06:57.040 |
And so it's working with these ideas of nested phrases, what in context-free grammar terms are called non-terminals. 00:07:08.440 |
So noun phrase and prepositional phrase would be non-terminals in the context-free grammar. 00:07:14.180 |
We can build up a bigger structure of human languages. 00:07:18.280 |
So let's just do that for a little bit to review what happens here. 00:07:22.860 |
So we start off saying, okay, you can say the cat and the dog. 00:07:31.160 |
And so we want a rule that can explain those. 00:07:33.700 |
So we could say a noun phrase goes to a determiner followed by a noun. 00:07:38.320 |
And then somewhere over the side, we'd have a lexicon. 00:07:42.560 |
And in our lexicon, we'd say that dog is a noun and cat is a noun, and a is a determiner and the is a determiner. 00:07:55.560 |
So then we notice you can do a bit more than that. 00:07:59.720 |
So you can say things like the large cat, a barking dog. 00:08:06.320 |
So that suggests we can have an adjective after the determiner in the noun phrase. 00:08:14.840 |
And then there's the noun and that can explain some things we can say. 00:08:18.960 |
But we can also say the cat by the door or a barking dog in a crate. 00:08:25.600 |
And so we can also put a prepositional phrase at the end, and that's optional, but you can have one. 00:08:35.200 |
For the example I gave, like a barking dog on the table. 00:08:44.400 |
So then we'll keep on and say, well, actually you can use multiple adjectives. 00:08:51.120 |
So you can say a large barking dog or a large barking cuddly cat. 00:09:00.440 |
So we have any number of adjectives, which we can represent with a star, what's referred to as the Kleene star. 00:09:12.720 |
For by the door, I have to have a rule for producing by the door. 00:09:16.200 |
So I also need a rule that's a prepositional phrase goes to a preposition followed by a noun phrase. 00:09:22.800 |
And so then I also have to have prepositions and that can be in or on or by. 00:09:32.240 |
And you can make other sentences, of course, with this as well, like the large crate on 00:09:38.040 |
the table or something like that, or the large crate on the large table. 00:09:44.800 |
So I chug along and then, well, I could have something like talk to the cat. 00:09:52.120 |
So talk is a verb and to still looks like a preposition. 00:09:58.240 |
So I need to be able to make up something with that as well. 00:10:06.080 |
So what I can do is say, I can also have a rule for a verb phrase that goes to a verb. 00:10:13.840 |
And then after that, for something like talk to the cat, that it can take a prepositional phrase. 00:10:21.560 |
And then I can say that the verb goes to talk or walked. 00:10:31.960 |
Then I can parse that, and then I can cover those sentences. 00:10:37.680 |
So that's, that's the end of what I have here. 00:10:41.360 |
So in this sort of a way, I'm handwriting a grammar. 00:10:47.040 |
So here is now I have this grammar and a lexicon. 00:10:52.920 |
And for the examples that I've written, I've written down here, this grammar and this lexicon 00:11:02.080 |
is sufficient to parse these sorts of fragments, showing the expansions that I just wrote down. 00:11:09.560 |
I mean, of course, there's a lot more to English than what you see here. 00:11:15.120 |
So if I have something like, you know, the cat walked behind the dog, then I need something more. 00:11:27.920 |
So it seems then I need a rule that says I can have a sentence that goes to a noun phrase followed by a verb phrase. 00:11:42.240 |
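To make this concrete, here's a small sketch (my own illustration, not from the lecture) of roughly the toy grammar and lexicon built up above, written as plain Python data structures. The rule and category names just follow the discussion; nothing here is a standard library API.

```python
# A rough sketch of the toy grammar built up in the lecture.
# Categories: S, NP, VP, PP, Det, Adj, N, P, V.
# "Adj*" means zero or more adjectives; "(PP)" means an optional PP.
grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "Adj*", "N", "(PP)"]],   # the (large*) cat (by the door)
    "VP": [["V", "(PP)"]],                  # talk (to the cat)
    "PP": [["P", "NP"]],                    # by the door
}

lexicon = {
    "Det": ["a", "the"],
    "Adj": ["large", "barking", "cuddly"],
    "N":   ["cat", "dog", "door", "crate", "table"],
    "P":   ["in", "on", "by", "to", "behind"],
    "V":   ["talk", "walked"],
}

# e.g. "the cuddly cat by the door" expands as
# NP -> Det Adj N PP -> the cuddly cat (P NP) -> the cuddly cat by the door
```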
Let's see, one question that Ruth Ann asked was about what do the brackets mean? 00:11:50.560 |
And is the first NP different from the second? 00:11:55.880 |
So for this notation on the brackets here, I mean, this is actually a common notation 00:12:05.840 |
It's sort of in some sense a little bit different to traditional computer science notation, 00:12:12.560 |
since the star is used in both to mean zero or more of something. 00:12:18.000 |
So you could have zero, one, two, three, four, five adjectives. 00:12:21.960 |
Somehow it's usual in linguistics that when you're using the star, you also put parentheses 00:12:30.600 |
So sort of parentheses and star are used together to mean any number of something. 00:12:36.320 |
When it's parentheses just by themselves, that's then meaning zero or one. 00:12:42.200 |
And then for are these two noun phrases different? 00:12:51.360 |
And so in our grammar, we can have multiple rules that expand noun phrase in different ways. 00:12:58.800 |
But actually in my example here, my second rule, because I wrote it quite generally, covers the first rule as well. 00:13:09.120 |
So actually at that point, I can cross out this first rule because I don't actually need it anymore. 00:13:15.160 |
But in general, you have a choice between writing multiple rules for noun phrase goes 00:13:22.440 |
to categories, which effectively gives you a disjunction, or working out, with notation like the parentheses and the star, how to collapse them into one rule. 00:13:36.700 |
So that was what gets referred to in natural language processing as constituency grammars, 00:13:44.040 |
where the standard form of constituency grammar is a context-free grammar of the sort that 00:13:50.200 |
I trust you saw at least a teeny bit of either in CS 103 or something like a programming languages class. 00:14:00.200 |
There are other forms of grammars that also pick out constituency. 00:14:04.420 |
There are things like tree adjoining grammars, but I'm not going to really talk about any of those. 00:14:09.320 |
What I actually want to present is a somewhat different way of looking at grammar, which 00:14:14.600 |
is referred to as dependency grammar, which puts a dependency structure over sentences. 00:14:23.120 |
Now actually, it's not that these two ways of looking at grammar have nothing to do with each other. 00:14:29.920 |
I mean, there's a whole formal theory about the relationships between different kinds 00:14:35.760 |
of grammars, and you can very precisely state relationships and isomorphisms between different kinds of grammars. 00:14:45.920 |
But on the surface, these two kinds of grammars look sort of different and emphasize different things. 00:14:53.480 |
And for reasons of their sort of closeness to picking out relationships in sentences 00:15:02.280 |
and their ease of use, it turns out that in modern natural language processing, starting, 00:15:09.840 |
I guess, around 2000, so really in the last 20 years, NLP people have really swung behind dependency grammars. 00:15:19.400 |
So if you look around now where people are using grammars in NLP, by far the most common 00:15:25.120 |
thing that's being used is dependency grammars. 00:15:28.480 |
So I'm going to teach us today a bit about those, and what we're going to build in 00:15:34.240 |
assignment three is, using supervised learning, a neural dependency parser. 00:15:41.600 |
So the idea of dependency grammar is that when we have a sentence, what we're going 00:15:47.560 |
to do is we're going to say for each word, what other words modify it. 00:15:55.880 |
So what we're going to do is when we say the large crate, we're going to say, okay, well, 00:16:01.920 |
large is modifying crate and the is modifying crate. 00:16:12.920 |
And so I'm showing a modification, a dependency or an attachment relationship by drawing an 00:16:21.720 |
arrow from the head to what's referred to in dependency grammar as the dependent. 00:16:28.320 |
The thing that modifies, further specifies or attaches to the head. 00:16:40.400 |
Well, another dependency is that, well, look in the large crate, that's where you're looking. 00:16:51.960 |
So you're going to want to have the large crate, in look in the large crate, as being a dependent of look. 00:16:58.480 |
And so that's also going to be a dependency relationship here. 00:17:03.600 |
And then there's one final bit that might seem a little bit confusing to people. 00:17:10.640 |
And that's actually when we have these prepositions. 00:17:14.200 |
There are two ways that you can think that this might work. 00:17:18.300 |
So if it was something like look in the crate, it seems like that is a dependent of crate. 00:17:29.860 |
But you could think that you want to say look in and it's in the crate and give this dependency 00:17:36.980 |
relationship with the sort of preposition, as sort of thinking of it as the head of the prepositional phrase. 00:17:46.740 |
And that's a possible strategy in the dependency grammar. 00:17:50.900 |
But what I'm going to show you today and what you're going to use in the assignment is dependency 00:17:59.420 |
grammars that follow the representation of universal dependencies. 00:18:04.260 |
And universal dependencies is a framework which actually I was involved in creating, 00:18:09.340 |
which was set up to try and give a common dependency grammar over many different human languages. 00:18:16.180 |
And in the design decisions that were made in the context of designing universal dependencies, 00:18:24.820 |
what we decided was that, for what in some languages is expressed by prepositions, lots of other 00:18:32.980 |
languages make much more use of case markings. 00:18:36.380 |
So if you've seen something like German, you've seen more case markings like genitive and dative. 00:18:44.820 |
And in other languages like Latin or Finnish, lots of Native American languages, you have 00:18:52.180 |
many more case markings again, which cover most of the role of prepositions. 00:18:57.860 |
So in universal dependencies, essentially in the crate is treated like a case-marked noun. 00:19:05.700 |
And so what we say is that the in is also a dependent of crate. 00:19:14.940 |
So in the structure we adopt, in is a dependent of crate. 00:19:27.380 |
And then we have these prepositional phrases in the kitchen, by the door, and we want to work out what they modify. 00:19:35.780 |
Well, in the kitchen is modifying crate, right? 00:19:42.460 |
So we're going to say that it's, this piece is a dependent of crate. 00:19:51.360 |
Then what about by the door? Well, it's not really meaning that it's a kitchen by the door. 00:20:01.280 |
And so what we're going to have is the crate also has door as a dependent. 00:20:06.300 |
And so that gives us our full dependency structure of this sentence. 00:20:17.200 |
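As a small illustration (my own sketch, not the lecture's code), the dependency structure just described for "Look in the large crate in the kitchen by the door" can be written down as (head, relation, dependent) triples, using Universal Dependencies-style relation names.

```python
# Dependency arcs for: "Look in the large crate in the kitchen by the door"
# Each arc is (head, relation, dependent); ROOT is the fake root of the sentence.
arcs = [
    ("ROOT",    "root", "Look"),
    ("crate",   "case", "in"),      # the first "in" marks crate, like a case marker
    ("crate",   "det",  "the"),
    ("crate",   "amod", "large"),
    ("Look",    "obl",  "crate"),   # where you're looking
    ("kitchen", "case", "in"),
    ("kitchen", "det",  "the"),
    ("crate",   "nmod", "kitchen"), # in the kitchen modifies crate
    ("door",    "case", "by"),
    ("door",    "det",  "the"),
    ("crate",   "nmod", "door"),    # by the door also modifies crate
]
```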
And so that's a teeny introduction to syntactic structure. 00:20:22.640 |
I'm going to say a bit more about it and give a few more examples. 00:20:27.940 |
But let me just for a moment sort of say a little bit about, well, why are we interested in sentence structure. 00:20:34.400 |
Why do we need to know the structure of sentences? 00:20:38.200 |
And this gets into how do human languages work? 00:20:42.720 |
So human languages can communicate very complex ideas. 00:20:49.320 |
I mean, in fact, you know, anything that humans know how to communicate to one another, they communicate in language. 00:20:59.000 |
So we can structure and communicate very complex ideas. 00:21:03.000 |
But we can't communicate a really complex idea by one word. 00:21:09.520 |
We can't just, you know, choose a word like, you know, empathy and say it with a lot of 00:21:15.560 |
meaning and say empathy and the other person is meant to understand everything about what you mean. 00:21:22.160 |
So we can compose a complex meaning that explains things by putting words together 00:21:29.520 |
And the syntax of a language allows us to put words together into bigger units where 00:21:36.440 |
we can build up and convey to other people a complex meaning. 00:21:41.440 |
And so then the listener doesn't get this syntactic structure. 00:21:47.160 |
The syntactic structure of the sentence is hidden from the listener. 00:21:51.520 |
All the listener gets is a sequence of words, one after another, bang, bang, bang. 00:21:57.680 |
So the listener has to be able to do what I was just trying to do in this example, that 00:22:04.000 |
as the sequence of words comes in, that the listener works out which words modify which 00:22:11.500 |
other words and therefore can construct the structure of the sentence and hence the meaning of the sentence. 00:22:20.800 |
And so in the same way, if we want to build clever neural net models that can understand 00:22:27.400 |
the meaning of sentences, those clever neural net models also have to understand what is 00:22:33.680 |
the structure of the sentence so that they can interpret the language correctly. 00:22:39.160 |
And we'll go through some examples and see more of that. 00:22:44.560 |
So the fundamental point that we're going to sort of spend a bit more time on is that 00:22:51.680 |
these choices of how you build up the structure of a sentence change the interpretation of the sentence. 00:22:59.880 |
And a human listener or equally a natural language understanding program has to make 00:23:08.040 |
in a sort of probabilistic fashion choices as to which words modify, i.e. depend upon 00:23:15.280 |
which other words so that they're coming up with the interpretation of the sentence that 00:23:21.160 |
they think was intended by the person who said it. 00:23:26.000 |
So to get a sense of this and how sentence structure is interesting and difficult, what 00:23:34.920 |
I'm going to go through now is a few examples of different ambiguities that you find in natural language. 00:23:43.280 |
And I've got some funny examples from newspaper headlines, but these are all real natural 00:23:50.400 |
language ambiguities that you find throughout natural language. 00:23:55.800 |
Well, at this point I should say this is where I'm being guilty of talking about language in general while only showing English examples. 00:24:04.760 |
Some of these ambiguities you find in lots of other languages as well, but which ambiguities 00:24:11.040 |
of syntactic structure you get partly depends on the details of the language. 00:24:17.320 |
So different languages have different syntactic constructions, different word orders, different 00:24:22.760 |
amounts of words having different forms of words like case markings. 00:24:29.240 |
And so depending on those details, there might be different ambiguities. 00:24:34.500 |
So here's one ambiguity, which is one of the commonest ambiguities in English: the headline San Jose cops kill man with knife. 00:24:47.960 |
Either it's the San Jose cops who are killing a man and they're killing a man with a knife. 00:24:57.440 |
And so that corresponds to a dependency structure where the San Jose cops are the subject of 00:25:04.780 |
killing, the man is the object of killing, and then the knife is then the instrument 00:25:16.540 |
So that the knife is an oblique modifier for the instrument of killing. 00:25:22.280 |
And so that's one possible structure for this sentence. 00:25:29.620 |
So what it actually probably was, was that it was a man with a knife and the San Jose cops killed him. 00:25:38.700 |
So that corresponds to the knife then being a noun modifier of the man and then kill is 00:25:52.100 |
So the man is the object of killing and the cops are still the subject. 00:25:57.420 |
And so whenever you have a prepositional phrase like this, that's coming further on in a sentence, 00:26:08.980 |
It could be either interpreted as modifying a noun phrase that comes before it, or it 00:26:15.540 |
can be interpreted as modifying a verb that comes before it. 00:26:20.420 |
So systematically in English, you get these prepositional phrase attachment ambiguities 00:26:29.460 |
But to give two further observations on that, the first observation is you encounter sentences 00:26:41.620 |
with prepositional phrase attachment ambiguities every time you read a newspaper article, every 00:26:48.740 |
time you talk to somebody, but most of the time you never notice them. 00:26:54.100 |
And that's because our human brains are incredibly good at considering the possible interpretations 00:27:01.300 |
and going with the one that makes sense according to context. 00:27:07.700 |
The second comment, as I said, different human languages expose different ambiguities. 00:27:14.680 |
So for example, this is an ambiguity that you normally don't get in Chinese because 00:27:20.060 |
in Chinese prepositional phrases, modifying a verb are normally placed before the verb. 00:27:27.460 |
And so there you don't standardly get this ambiguity, but there are different other ambiguities 00:27:39.080 |
So this ambiguity you find everywhere because prepositional phrases are really common at 00:27:47.000 |
So here's another one, scientists count whales from space. 00:27:51.380 |
So that gives us these two possible interpretations that there are whales from space and scientists are counting them. 00:28:00.460 |
And then the other one is how the scientists are counting the whales is that they're counting 00:28:07.100 |
them from space and they're using satellites to count the whales, which is the correct interpretation 00:28:14.580 |
that the newspaper hopes that you're getting. 00:28:21.500 |
And this problem gets much, much more complex because many sentences in English have whole sequences of prepositional phrases. 00:28:33.680 |
So here's the kind of boring sentence that you find in the financial news. 00:28:39.360 |
The board approved its acquisition by Royal Trust Co. Ltd. of Toronto for $27 a share at its monthly meeting. 00:28:47.500 |
And while if you look at the structure of this sentence, what we find is, you know, 00:28:51.980 |
here's a verb, then here's the object noun phrase. 00:29:03.860 |
Well, we find a prepositional phrase, another prepositional phrase, another prepositional phrase, and another prepositional phrase. 00:29:12.040 |
And how to attach each of these is then ambiguous. 00:29:16.220 |
So the basic rule of how you can attach them is you can attach them to things to the left, 00:29:23.700 |
providing you don't create crossing attachments. 00:29:27.380 |
So in principle, by Royal Trust Co. Ltd. could be attached to either approved or acquisition. 00:29:34.940 |
But in this case, by Royal Trust Co. Ltd., it's the acquirer, so it modifies the acquisition. 00:29:51.380 |
So of Toronto could be modifying Royal Trust Co. Ltd. 00:29:55.660 |
It could be modifying the acquisition or it can be modifying the approved. 00:30:00.780 |
And in this case, the of Toronto is telling you more about the company. 00:30:05.520 |
And so it's a modifier of Royal Trust Co. Ltd. 00:30:09.740 |
Okay, so then the next one is for $27 a share. 00:30:14.780 |
And that could be modifying Toronto, Royal Trust Co. Ltd., the acquisition or the approving. 00:30:21.300 |
And well, in this case, that's talking about the price of the acquisition. 00:30:31.740 |
And this is now a prepositional phrase that's modifying the acquisition. 00:30:38.120 |
And then at the end, at its monthly meeting, well, that's where the approval is happening 00:30:47.920 |
So rather than any of these preceding four noun phrases, at its monthly meeting is modifying the verb approved. 00:31:06.100 |
But as I think maybe you can see that, you know, none of these dependencies cross each 00:31:11.220 |
other and they connect at different places ambiguously. 00:31:16.260 |
So because we can chain these prepositions like this and attach them at different places 00:31:21.540 |
like this, human language sentences are actually extremely ambiguous. 00:31:29.680 |
So the number, if you have a sentence with K prepositional phrases at the end of it, 00:31:38.420 |
where here we have K equals four, the number of parses this sentence has, the number of 00:31:44.660 |
different ways you can make these attachments is given by the Catalan numbers. 00:31:49.540 |
So the Catalan numbers are an exponentially growing series which arises in many tree-like 00:31:57.140 |
So if you're doing something like triangulations of a polygon, you get Catalan numbers. 00:32:01.900 |
If you're doing triangulation and graphical models in CS228, you get Catalan numbers. 00:32:08.500 |
But we don't need to worry about the details here. 00:32:10.840 |
The central point is this is an exponential series. 00:32:13.980 |
And so you're getting an exponential number of parses in terms of the number of prepositional phrases. 00:32:21.580 |
And so in general, you know, the number of parses human languages have is exponential 00:32:30.740 |
in the length of the sentence. Because if you're then trying to enumerate all the parses, you might fear that you really have an intractable problem. 00:32:38.820 |
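Just to make the growth concrete, here's a tiny sketch (not from the lecture) that computes the Catalan numbers with the standard recurrence; the number of possible attachments grows along this series as you add prepositional phrases.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def catalan(n: int) -> int:
    """Catalan numbers: C(0) = 1, C(n) = sum over i of C(i) * C(n-1-i)."""
    if n == 0:
        return 1
    return sum(catalan(i) * catalan(n - 1 - i) for i in range(n))

print([catalan(k) for k in range(1, 9)])
# [1, 2, 5, 14, 42, 132, 429, 1430]
# The attachment count follows this exponentially growing series
# as the chain of prepositional phrases gets longer.
```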
The thing to notice about structures like these prepositional phrase attachment ambiguities 00:32:45.940 |
is that there's nothing that resolves these ambiguities in terms of the structure of the language itself. 00:32:53.700 |
So if you've done something like looked at the kind of grammars that are used in compilers, 00:33:01.440 |
that the grammars used in compilers for programming languages are mainly made to be unambiguous. 00:33:08.780 |
And to the extent that there are any ambiguities, there are default rules that are used to say, 00:33:17.180 |
choose this one particular parse tree for your piece of a programming language. 00:33:29.540 |
Whereas human languages just allow the ambiguity, and the listening human is just meant to be smart enough to figure out what was intended. 00:33:35.300 |
So the analogy would be that, you know, in programming languages, when you're working 00:33:44.300 |
out what does an else clause modify, well, you've got the answer that you can either 00:33:53.220 |
look at the curly braces to work out what the else clause modifies, or if you're using 00:33:59.020 |
Python, you look at the indentation and it tells you what the else clause modifies. 00:34:04.500 |
Whereas by contrast, for human languages, you just write down else something. 00:34:18.260 |
The human being will just figure out what the else clause is meant to pair up with. 00:34:24.340 |
Lots of other forms of ambiguities in human languages. 00:34:32.020 |
Another one that's very common over all sorts of languages is coordination scope ambiguities. 00:34:40.580 |
Shuttle veteran and longtime NASA executive Fred Gregory appointed to board. 00:34:55.740 |
There's a shuttle veteran and there's a longtime NASA executive Fred Gregory. 00:35:01.140 |
And they were both appointed to the board, two people. 00:35:06.220 |
And the other possibility is there's someone named Fred Gregory who's a shuttle veteran 00:35:15.300 |
and longtime NASA executive, and they're appointed to the board, one person. 00:35:21.860 |
And these two interpretations, again, correspond to having different parse structures. 00:35:28.860 |
So in one structure, we've got a coordination of the shuttle veteran and the longtime NASA executive. 00:35:42.620 |
In one case, these are coordinated, and then Fred Gregory specifies the name of the NASA executive. 00:35:52.300 |
So it's then specifying who that executive is, where in the other one, the shuttle veteran 00:36:00.460 |
and longtime NASA executive all together is then something that is a modifier of Fred Gregory. 00:36:11.260 |
Okay, so in one case, this whole unit is what modifies Fred Gregory. 00:36:17.660 |
In the other one up here, just longtime NASA executive modifies Fred Gregory, and then 00:36:24.380 |
that's conjoined together with the shuttle veteran. 00:36:28.020 |
And so that also gives different interpretations. 00:36:32.220 |
So this is a slightly reduced example. I mean, newspaper headlines tend to be 00:36:40.020 |
more ambiguous than many other pieces of text because they're written in this shortened form. 00:36:47.980 |
And this is an especially shortened form, where explicit words have actually been left out. 00:36:54.860 |
But this headline says, "Doctor, no heart, cognitive issues." 00:37:01.740 |
And this was after, I guess, one of Trump, it was after Trump's first physical. 00:37:06.420 |
And while this is an ambiguity, because there are two ways that you can read this, you can 00:37:10.660 |
either read this as saying, "Doctor, no heart and cognitive issues," which gives you one interpretation. 00:37:20.780 |
Instead of that, the way we should read it is that it's "heart or cognitive," and so 00:37:29.260 |
it's then saying, "No heart or cognitive issues." 00:37:34.180 |
And we have a different, narrower scope of the coordination, and then we get a different interpretation. 00:37:45.740 |
I want to give a couple more examples of different kinds of ambiguities. 00:37:50.660 |
Another one you see quite a bit is when you have modifiers that are adjectives and adverbs, 00:37:57.140 |
that there are different ways that you can have things modifying other things. 00:38:01.860 |
This example is a little bit not safe for work, but here goes: students get first hand job experience. 00:38:11.700 |
So this is an ambiguous sentence, and again, we can think of it as a syntactic ambiguity 00:38:17.820 |
in terms of which things modify which other things. 00:38:24.260 |
So the nice, polite way to render this sentence is that "first" is modifying "hand," so we've 00:38:33.940 |
got "first-hand," it's "job experience," so "job" is a compound noun modifying "experience," 00:38:41.540 |
and it's "first-hand experience," so "first-hand" is then modifying "experience," and then "get" 00:38:49.900 |
is the object of -- sorry, "first-hand job experience" is the object of "get," and "students" is the subject. 00:39:01.740 |
But if you have a smuttier mind, you can interpret this a different way, and in the alternative 00:39:08.340 |
interpretation, you then have "hand" going together with "job," and the "first" is then 00:39:20.820 |
a modifier of "experience," and "job" is still a modifier of "experience," and so then you 00:39:27.460 |
get this different parse structure and different interpretation there. 00:39:35.540 |
In a way, this example is similar to the previous one. 00:39:38.940 |
It's sort of having modifier pieces that can modify different things, but rather than it 00:39:45.740 |
just being with individual adjectives or individual adverbs, it's then much larger units, such 00:39:54.160 |
as verb phrases, can often have attachment ambiguities. 00:39:58.140 |
So this sentence headline is "Mutilated body washes up on Rio beach to be used for Olympics beach volleyball." 00:40:06.900 |
So we have this big verb phrase here of "to be used for Olympics beach volleyball," and 00:40:13.140 |
then again we have this attachment decision that we could either say that that big verb 00:40:21.900 |
phrase is modifying, i.e., is attached to the Rio beach, or we could say, "No, no, the 00:40:33.020 |
'to be used for Olympics beach volleyball,' that is modifying the mutilated body, and 00:40:42.340 |
it's a body that's to be used for the Olympics beach volleyball," which gives the funny reading. 00:40:49.420 |
Yeah, so I hope that's given you at least a little bit of a sense of how human language 00:40:55.620 |
syntactic structure is complex, ambiguous, and to work out the intended interpretations, 00:41:03.620 |
you need to know something about that structure. 00:41:07.660 |
In terms of how much you need to understand, I mean, you know, this isn't a linguistics class. 00:41:14.820 |
If you'd like to learn more about human language structure, you can go off and do a syntax 00:41:19.180 |
class, but, you know, we're not really going to spend a lot of time working through language 00:41:25.660 |
structure, but there will be some questions on this in the assignment, and so we're expecting 00:41:31.180 |
that you can be at the level that you can have sort of some intuitions as to which words 00:41:37.220 |
and phrases are modifying other words and phrases, and therefore you could choose between different possible parses. 00:41:46.180 |
Okay, I've spent quite a bit of time on that, so better keep going. 00:41:53.060 |
Okay, so the general idea is that knowing this sort of syntactic structure of a sentence helps us work out its meaning. 00:42:04.340 |
I mean, as well as just generally saying we can understand language, it's also used in 00:42:08.900 |
many cases for simple practical forms of semantic extraction, so people such as in biomedical 00:42:15.300 |
informatics often want to get out particular relations such as protein-protein interactions, 00:42:22.660 |
The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB, 00:42:31.460 |
and commonly that people can get out those kind of relationships by looking at patterns 00:42:37.380 |
of dependency relations with particular verbs, so for the interacts verb, if you have a pattern 00:42:43.420 |
of something being the subject and something else being the noun modifier of interacts, 00:42:49.780 |
well that's an interaction relationship, but it gets a bit more complicated than that, 00:42:54.060 |
as in this example, because often there are conjunctions, so you also want to have another 00:42:58.740 |
pattern where you have also interactions between the subject and the noun modifier's conjuncts, 00:43:07.420 |
which will allow us to also find the KaiA and KaiB examples. 00:43:13.420 |
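Here's a rough sketch (my own illustration, not code from the lecture) of how such patterns over dependency triples might be used to pull out interaction pairs; the relation names follow Universal Dependencies, and the specific arcs are just an assumed parse of the example sentence.

```python
# Arcs are (head, relation, dependent) for:
# "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB"
arcs = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "nmod",  "SasA"),
    ("SasA",      "conj",  "KaiA"),
    ("SasA",      "conj",  "KaiB"),
]

def interaction_pairs(arcs):
    """Pattern: the subject of 'interacts' interacts with its nmod,
    and with anything conjoined to that nmod."""
    pairs = []
    subjects = [d for h, r, d in arcs if h == "interacts" and r == "nsubj"]
    nmods = [d for h, r, d in arcs if h == "interacts" and r == "nmod"]
    for subj in subjects:
        for nm in nmods:
            pairs.append((subj, nm))
            # follow conj arcs off the nmod to catch "KaiA and KaiB"
            pairs += [(subj, d) for h, r, d in arcs if h == nm and r == "conj"]
    return pairs

print(interaction_pairs(arcs))
# [('KaiC', 'SasA'), ('KaiC', 'KaiA'), ('KaiC', 'KaiB')]
```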
Okay, so I've sort of given an informal tour of dependency grammar, so let me just try and quickly 00:43:22.460 |
say a little bit more formally what a dependency grammar is. 00:43:27.140 |
So in dependency syntax, what we say is that the syntactic structure of a sentence consists 00:43:35.940 |
of relations between pairs of words, and it's a binary asymmetric relation, i.e. we draw 00:43:43.780 |
arrows between pairs of words, which we call dependencies. 00:43:48.980 |
Now normally dependency grammars then type those arrows 00:43:55.660 |
to express what kind of relation that there is, and so that they have some kind of taxonomy 00:44:01.340 |
of grammatical relations, so we might have a subject grammatical relation, a verbal auxiliary 00:44:06.580 |
grammatical relation, an oblique modifier grammatical relation, we have some kind of inventory of those relations. 00:44:16.820 |
And we refer to the arrows going between the head, here's the head here, and something 00:44:24.540 |
that is a dependent of it, so the subject of a verb is a dependent of the verb, or when 00:44:31.420 |
you have a noun modifier like our sort of cuddly cat, we say that cuddly is a dependent 00:44:41.980 |
of cat, and so cat is the head of cuddly cat. 00:44:46.540 |
And so normally dependencies, like in these examples, form a tree, which formally means 00:44:55.580 |
it's not just any graph with arrows, we have a graph which is connected, acyclic, and has 00:45:03.460 |
a single root, so here's the root of the graph, and so that gives us a dependency tree analysis. 00:45:13.660 |
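A small sketch (not from the lecture) of what those tree conditions look like if you represent a parse as an array of head indices, with 0 standing for the root: exactly one word has head 0, every word has exactly one head, and following heads from any word reaches the root without cycles.

```python
def is_dependency_tree(heads):
    """heads[i] is the head of word i+1 (words are 1-indexed); head 0 is the root."""
    n = len(heads)
    if sum(1 for h in heads if h == 0) != 1:      # single root
        return False
    for i in range(1, n + 1):                     # connected and acyclic:
        seen, node = set(), i                     # walking up from every word
        while node != 0:                          # must reach the root
            if node in seen or not (0 <= heads[node - 1] <= n):
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

# "I ate fish": ate is the root, I and fish both depend on ate.
print(is_dependency_tree([2, 0, 2]))  # True
print(is_dependency_tree([2, 3, 2]))  # False: no word attaches to the root
```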
Dependency grammar has a really, really long history, so the famous first linguist 00:45:20.740 |
was Panini, who wrote about the structure of Sanskrit, and mainly he worked on the sound 00:45:29.580 |
system of Sanskrit and how sounds change in various contexts, which is what linguists 00:45:33.820 |
call phonology, and the different forms of Sanskrit words, Sanskrit has rich morphology 00:45:40.860 |
of inflecting nouns and verbs for different cases and forms, but he also worked a little 00:45:46.900 |
on the syntactic structure of Sanskrit sentences, and essentially what he proposed was a dependency grammar. 00:45:57.340 |
And it turns out that sort of for most of recorded history, when people have then gone 00:46:04.380 |
on and tried to put structures over human sentences, what they have used is dependency 00:46:11.460 |
grammars, so there was a lot of work in the first millennium by Arabic grammarians trying 00:46:17.860 |
to work out the grammar structure of sentences, and effectively what they used was akin to 00:46:25.660 |
what I've just presented as a dependency grammar. 00:46:29.060 |
So compared to, you know, 2500 years of history, the ideas of having context-free grammars 00:46:37.140 |
and having constituency grammars is actually a really, really recent invention, so it was 00:46:41.820 |
really sort of in the middle of the 20th century that the ideas of constituency grammar and 00:46:48.220 |
context-free grammars were developed, first by Wells in the 40s, and then by Noam Chomsky 00:46:53.900 |
in the early 50s, leading to things like the Chomsky hierarchy that you might see in CS103 00:47:04.500 |
So for modern work on dependency grammar, using kind of the terminology and notation 00:47:11.100 |
that I've just introduced, that's normally attributed to Lucien Tesnière, who was a French 00:47:16.300 |
linguist in around the sort of middle of the 20th century as well. 00:47:22.980 |
Dependency grammar was widely used in the 20th century in a number of places. 00:47:28.820 |
I mean, in particular, it tends to be sort of much more natural and easier to think about 00:47:34.580 |
for languages that have a lot of different case markings on nouns, like nominative, accusative, 00:47:41.260 |
genitive, dative, instrumental kind of cases, like you get in a language like Latin or Russian, 00:47:47.380 |
and a lot of those languages have much freer word order than English. 00:47:51.460 |
So, you know, in English, the subject has to be before the 00:47:56.100 |
verb and the object has to be after the verb, but lots of other languages have much freer 00:48:00.780 |
word order and instead use different forms of nouns to show you what's the subject or object. 00:48:08.820 |
And dependency grammars can often seem much more natural for those kinds of languages. 00:48:15.420 |
Dependency grammars were also prominent in the very beginnings of computational linguistics. 00:48:19.500 |
So one of the first people working in computational linguistics in the US was David Hayes. 00:48:27.080 |
So the Professional Society for Computational Linguistics is called the Association for 00:48:30.940 |
Computational Linguistics, and he was actually one of the founders of the Association for 00:48:37.400 |
And he published in the early 1960s an early, perhaps the first, dependency parser. 00:48:48.260 |
Yeah, a little teeny note, just in case you see other things. 00:48:54.280 |
When you have these arrows, you can draw them in either direction. 00:48:58.820 |
You can either draw arrows from the head or to the dependent or from the dependent to 00:49:05.500 |
And actually, different people have done one and the other, right? 00:49:09.620 |
So the way Tesnière drew them was to draw them from the head to the dependent, and we're following that convention. 00:49:15.940 |
But, you know, if you're looking at something that somebody else has written with dependency 00:49:21.500 |
arrows, the first thing you have to work out is whether the arrows point from the head to the dependent or the other way around. 00:49:28.020 |
Now, one other thing here is that a sentence is seen as having the overall head word of 00:49:37.420 |
the sentence, which every other word of the sentence hangs off. 00:49:42.060 |
It's a common convention to add this sort of fake root to every sentence that then points 00:49:48.700 |
to the head word of the whole sentence here completed. 00:49:53.380 |
That just tends to make the algorithmic stuff easier, because then you can say that every 00:49:58.620 |
word of the sentence is dependent on precisely one other node, where what you can be dependent 00:50:05.980 |
on is either another word on the sentence or the fake root of the sentence. 00:50:11.060 |
And when we build our parsers, we will introduce that fake root. 00:50:18.740 |
So that's sort of dependency grammars and dependency structure. 00:50:26.000 |
I now want to get us back to natural language processing and starting to build parsers for 00:50:37.220 |
But before doing that, I just want to say, yeah, where do we get our data from? 00:50:44.020 |
And that's actually an interesting story in some sense. 00:50:49.420 |
So the answer to that is, well, what we do is get human beings, commonly linguists or 00:51:00.180 |
other people who are actually interested in the structure of human sentences, and we get 00:51:06.100 |
them to sit around and hand parse sentences and give them dependency structures, and we 00:51:13.420 |
collect a lot of those parses, and we call that a tree bank. 00:51:20.660 |
And so this is something that really only started happening in the late '80s and took off in the '90s. 00:51:30.820 |
Until then, no one had attempted to build tree banks. 00:51:34.300 |
Lots of people had attempted to build parsers, and it seemed like, well, if you want to build 00:51:40.060 |
a parser, the efficient way to do it is to start writing a grammar. 00:51:45.180 |
So you start writing some grammar rules, and you start writing a lexicon with words and 00:51:50.140 |
parts of speech, and you sit around working on your grammar. 00:51:54.980 |
When I was a PhD student, one of my first summer jobs was spending the summer handwriting a grammar. 00:52:02.100 |
And it sort of seems like writing a grammar is more efficient because you're writing this 00:52:06.900 |
one general thing that tells you the structure of a human language. 00:52:11.920 |
But there's just been this massive sea change, partly driven by the adoption of machine learning 00:52:16.640 |
techniques, where it's now seen as axiomatic that the way to make progress is to have annotated 00:52:25.400 |
data, namely here a tree bank that shows you the structure of sentences. 00:52:31.700 |
And so what I'm showing here is a teeny extract from a universal dependencies tree bank. 00:52:38.160 |
And so that's why I mentioned earlier that this has been this effort to try and have 00:52:42.740 |
a common dependency grammar representation that you can apply to lots of different human 00:52:49.300 |
And so you can go over to this URL and see that there's about 60 different languages 00:52:53.720 |
at the moment which have universal dependencies tree banks. 00:53:01.500 |
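To give a feel for what a Universal Dependencies treebank looks like on disk, here's a hedged sketch of the CoNLL-U style layout it uses: one word per line, tab-separated columns, with a head index (0 for the root) and a relation label. The example sentence and the trimming down to five columns are my own simplification.

```python
# Simplified treebank entry for "I ate fish" in a CoNLL-U-like layout:
# columns shown: index, word form, universal POS tag, head index, relation.
sample = "1\tI\tPRON\t2\tnsubj\n2\tate\tVERB\t0\troot\n3\tfish\tNOUN\t2\tobj"

for line in sample.splitlines():
    idx, form, upos, head, rel = line.split("\t")
    print(f"{form} <-{rel}- head {head}")
# I <-nsubj- head 2
# ate <-root- head 0
# fish <-obj- head 2
```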
I mean, it sort of seems like it's bad news if you have to have people sitting around hand-parsing sentences. 00:53:11.100 |
It seems a lot slower and actually a lot less useful than having somebody writing a grammar 00:53:19.580 |
which just has a much bigger multiplier factor in the utility of their effort. 00:53:27.980 |
It turns out that although that initial feeling seems sort of valid, that in practice there's 00:53:35.420 |
just a lot more you can do with the tree bank. 00:53:41.660 |
You know, one reason is that tree banks are highly reusable. 00:53:46.580 |
So typically when people have written grammars, they've written grammars for, you know, one 00:53:52.300 |
particular parser and the only thing it was ever used in is that one particular parser. 00:53:59.540 |
But when you build a tree bank, that's just a useful data resource and people use it for all sorts of purposes. 00:54:09.260 |
So the well-known tree banks have been used by hundreds and hundreds of people. 00:54:14.540 |
And although all tree banks were initially built for the purposes of, hey, let's help 00:54:20.340 |
natural language processing systems, it turns out that people have actually been able to use them for many other purposes. 00:54:28.220 |
So for example, these days, psycholinguists commonly use tree banks to get various kinds 00:54:34.220 |
of statistics about data for thinking about psycholinguistic models. 00:54:40.060 |
Linguists use tree banks for looking at patterns of different syntactic constructions that 00:54:45.140 |
occur that there's just been a lot of reuse of this data for all kinds of purposes. 00:54:52.300 |
But they have other advantages that I mentioned here. 00:54:55.380 |
You know, when people are just sitting around saying, oh, what sentences are good, they 00:54:59.820 |
tend to only think of the core of language, whereas lots of weird things happen in language. 00:55:05.780 |
And so if you actually just have some sentences and you have to go off and parse them, then 00:55:11.460 |
you actually have to deal with the totality of language. 00:55:15.540 |
Since you're parsing actual sentences, you get statistics. 00:55:19.620 |
So you naturally get the kind of statistics that are useful to machine learning systems 00:55:25.300 |
by constructing a tree bank, whereas you don't get them for free if you hand write a grammar. 00:55:30.980 |
But then a final way, which is perhaps the most important of all, is if you actually 00:55:38.980 |
want to be able to do science of building systems, you need a way to evaluate these systems. 00:55:49.100 |
I mean, it seems hard to believe now, but, you know, back in the '80s and '90s, when people 00:55:58.820 |
built NLP parsers, it was literally the case that the way they were evaluated was you said 00:56:05.940 |
to your friend, I built this parser, type in a sentence on the terminal and see what it does. 00:56:17.420 |
Whereas what we'd like to know is, well, as I showed you earlier, English sentences can 00:56:25.740 |
Can this system choose the right parses for particular sentences and therefore have the 00:56:33.220 |
basis of interpreting them as a human being would? 00:56:37.500 |
And well, we can only systematically do that evaluation if we have a whole bunch of sentences 00:56:43.020 |
that have been hand parsed by humans with their correct interpretations. 00:56:48.220 |
So the rise of tree banks turned parser building into an empirical science where people could 00:56:54.940 |
then compete rigorously on the basis of, look, my parser has 2% higher accuracy than your 00:57:01.980 |
parser in choosing the correct parses for sentences. 00:57:07.820 |
So well, how do we build a parser once we've got dependencies? 00:57:13.020 |
So there's sort of a bunch of sources of information that you could hope to use. 00:57:18.420 |
So one source of information is looking at the words on either end of the dependency. 00:57:26.020 |
So discussing issues, that seems a reasonable thing to say. 00:57:31.160 |
And so it's likely that issues could be the object of discussing. 00:57:38.300 |
Whereas if it was some other word, right, if you were thinking of making, you know, 00:57:46.060 |
outstanding the object of discussing, discussing outstanding, that doesn't sound right. 00:57:53.380 |
The second source of information is distance. 00:57:56.480 |
So most dependencies are relatively short distance. 00:58:02.180 |
Some of them are long distance dependencies, but they're relatively rare. 00:58:06.620 |
The vast majority of dependencies are nearby. 00:58:11.180 |
Another source of information is the intervening material. 00:58:14.800 |
So there are certain things that dependencies rarely span. 00:58:22.020 |
So clauses and sentences are normally organized around verbs. 00:58:28.360 |
And so dependencies rarely span across intervening verbs. 00:58:33.980 |
We can also use punctuation in written language, things like commas, which can give some indication 00:58:41.140 |
And so punctuation may also indicate bad places to have long distance dependencies over. 00:58:49.240 |
And there's one final source of information, which is what's referred to as valency, which 00:58:55.940 |
is, for a head, what kinds of dependents does it usually have around it? 00:59:02.080 |
So if you have a noun, there are things that you just know about what kinds of dependents it's likely to have. 00:59:10.920 |
So it's common that it will have a determiner to the left, the cat. 00:59:16.680 |
On the other hand, it's not going to be the case that there's a determiner to the right, 00:59:28.620 |
On the left, you're also likely to have an adjectival modifier. 00:59:35.900 |
But again, it's not so likely you're going to have the adjectival modifier over on the right. 00:59:44.160 |
So there are sort of facts about what things different kinds of words take on the left and the right. 00:59:53.040 |
And that's also a useful source of information. 00:59:56.660 |
Okay, so what do we need to do using that information to build a parser? 01:00:03.700 |
Well, effectively, what we do is have a sentence. 01:00:06.740 |
I'll give a talk tomorrow on neural networks. 01:00:09.580 |
But what we have to do is say for every word in that sentence, we have to choose some other 01:00:15.060 |
word that it's a dependent of where one possibility is it's a dependent of root. 01:00:22.500 |
So we're giving it a structure where we're saying, okay, for this word, I've decided 01:00:31.660 |
And then for this word, it's also dependent on networks. 01:00:49.880 |
So only one word is a dependent of root, and we don't have any cycles, so we have a tree. 01:00:56.740 |
So we don't want to say that word A is dependent on word B and word B is dependent on word A. 01:01:03.180 |
And then this one final issue, which is whether arrows can cross or not. 01:01:12.600 |
So in this particular sentence, we actually have these crossing dependencies, you can 01:01:17.700 |
see there, I'll give a talk tomorrow on neural networks. 01:01:22.780 |
And this is the correct dependency parse for this sentence. 01:01:26.840 |
Because what we have here is that it's a talk, and it's a talk on neural networks. 01:01:31.860 |
So the on neural networks modifies the talk, which leads to these crossing dependencies. 01:01:31.860 |
I could have said, I'll give a talk on neural networks tomorrow, and then on neural networks would be right next to talk and nothing would cross. 01:01:48.560 |
So most of the time, in languages, dependencies are projective, the things stay together. 01:01:57.140 |
So the dependencies have a kind of a nesting structure of the kind that you also see in context-free grammars. 01:02:04.580 |
But most languages have at least a few phenomena where you end up with this ability for 01:02:11.300 |
phrases to be split apart, which lead to non projective dependencies. 01:02:16.740 |
So in particular, one of them in English is that you can take modifying phrases and clauses 01:02:23.460 |
like the on neural networks here and shift them right towards the end of the sentence 01:02:28.380 |
and get I'll give a talk tomorrow on neural networks. 01:02:32.340 |
And that then leads to non projective sentences. 01:02:37.860 |
So a parse is projective if there are no crossing dependency arcs when the words are laid out 01:02:43.460 |
in their linear order with all arcs above the words. 01:02:48.080 |
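Here's a small sketch (my own, not the lecture's) of that definition as a check over head indices: a parse is non-projective exactly when two arcs cross, i.e. one arc's endpoints strictly interleave with the other's.

```python
def is_projective(heads):
    """heads[i] is the head of word i+1; 0 is the fake root.
    Two arcs cross if their endpoints strictly interleave."""
    arcs = [(min(i + 1, h), max(i + 1, h)) for i, h in enumerate(heads)]
    for a1, b1 in arcs:
        for a2, b2 in arcs:
            if a1 < a2 < b1 < b2:   # a2 falls inside the first arc, b2 outside
                return False
    return True

# "I 'll give a talk tomorrow on neural networks"
# The heads below are a rough guess at the parse, just for illustration:
# talk(5) heads networks(9), while give(3) heads tomorrow(6), so those arcs cross.
heads = [3, 3, 0, 5, 3, 3, 9, 9, 5]
print(is_projective(heads))  # False: arcs (5,9) and (3,6) cross
```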
And if you have a dependency parse that corresponds to a context free grammar tree, it actually 01:02:53.560 |
has to be projective because context free grammars necessarily have this sort of nested structure. 01:03:03.200 |
But dependency grammars normally allow non projective structures to account for displaced constituents. 01:03:10.440 |
And you can't easily get the semantics of certain constructions right without these non projective dependencies. 01:03:18.100 |
So here's another example in English with question formation with what's called preposition stranding. 01:03:26.200 |
So the sentence is, who did Bill buy the coffee from yesterday? 01:03:34.120 |
It's less natural in English, but I could have said, from whom did Bill buy the coffee yesterday? 01:03:44.240 |
In many languages of the world, that's the only way you could have said it. 01:03:48.600 |
And when you do that, from whom is kept together and you have a projective parse for the sentence. 01:03:56.920 |
But English allows and indeed much prefers you to do what is referred to as preposition 01:04:02.960 |
stranding where you move the who, but you just leave the preposition behind. 01:04:09.320 |
And so you get, who did Bill buy the coffee from yesterday? 01:04:13.660 |
And so then we're ending up with this non projective dependency structure as I've shown here. 01:04:20.200 |
Okay, I'll come back to non projectivity in a little bit. 01:04:26.720 |
How do we go about building dependency parsers? 01:04:30.520 |
Well, there are a whole bunch of ways that you can build dependency parsers. 01:04:37.120 |
Very quickly, I'll just say a few names and I'll tell you about one of them. 01:04:40.920 |
So you can use dynamic programming methods to build dependency parsers. 01:04:44.920 |
So I showed earlier that you can have an exponential number of parses for a sentence. 01:04:50.500 |
And that sounds like really bad news for building a system. 01:04:53.240 |
Well, it turns out that you can be clever and you can work out a way to dynamic program, 01:04:59.000 |
finding that exponential number of parses, and then you can have an O(n^3) algorithm. 01:05:07.500 |
You can use graph algorithms and I'll say a bit about that later, but that may spill over to a later lecture. 01:05:14.600 |
So you can see, since we're wanting to kind of connect up all the words into a tree using 01:05:22.360 |
graph edges, that you could think of doing that using a minimum spanning tree algorithm 01:05:27.880 |
of the sort that you hopefully saw in CS 161. 01:05:35.560 |
Constraint satisfaction ideas that you might have seen in CS 221 have been used for dependency parsing. 01:05:43.320 |
But the way I'm going to show now is transition based parsing, or what's sometimes referred to as deterministic dependency parsing. 01:05:51.720 |
And the idea of this is one's going to use a transition system. 01:06:00.560 |
If you've seen shift reduce parsing in something like a compilers class or formal languages 01:06:06.560 |
class, then shift and reduce are transition steps. 01:06:10.880 |
And so we use a transition system to guide the construction of a parse. 01:06:26.240 |
So this was an idea that was made prominent by Joakim Nivre, who's a Swedish computational 01:06:35.520 |
linguist who introduced this idea of greedy transition based parsing. 01:06:42.140 |
So his idea is, well, what we're going to do for dependency parsing is we're going to 01:06:49.000 |
be able to parse sentences by having a set of transitions which are kind of like shift 01:06:57.080 |
And it's going to just work left to right, bottom up, and parse a sentence. 01:07:03.200 |
So we're going to say we have a stack sigma, a buffer beta of the words that we have to process. 01:07:11.320 |
And we're going to build up a set of dependency arcs by using actions, which are shift and reduce actions. 01:07:18.640 |
And putting those together, this will give us the ability to put parse structures over sentences. 01:07:29.360 |
And this is a little bit hairy when you first see it. 01:07:35.040 |
And this kind of transition based dependency parser is what we'll use in assignment three. 01:07:44.480 |
So what we have, so this is our transition system. 01:07:47.360 |
We have a starting point where we start with a stack that just has the root symbol on it 01:07:53.880 |
and a buffer that has the sentence that we're about to parse. 01:07:59.080 |
And so far, we haven't built any dependency arcs. 01:08:04.200 |
And so at each point in time, we can choose one of three actions. 01:08:09.080 |
We can shift, which moves the next word onto the stack. 01:08:16.960 |
We can then do actions that are the reduce actions. 01:08:22.480 |
So there are two reduce actions to make it a dependency grammar. 01:08:26.840 |
We can either do a left arc reduce or a right arc reduce. 01:08:31.920 |
So when we do either of those, we take the top two items on the stack and we make one of them a dependent of the other. 01:08:42.360 |
So we can either say, okay, let's make WI a dependent of WJ, or else we can say, okay, let's make WJ a dependent of WI. 01:08:52.160 |
And so the result of when we do that is the one that's the dependent disappears from the stack. 01:09:00.720 |
And so in the stacks over here, there's one less item. 01:09:04.160 |
But then we add a dependency arc to our arc set so that we say that we've got either a 01:09:10.680 |
dependency from J to I or a dependency from I to J. 01:09:15.600 |
And commonly when we do this, we actually also specify what grammatical relation connects 01:09:22.080 |
the two, such as subject, object, or noun modifier. 01:09:35.320 |
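Here's a minimal sketch in Python (my own, not the assignment's starter code) of that arc-standard transition system: a stack, a buffer, an arc set, and the three actions. Relation labels are left out to keep it short.

```python
class ArcStandardParser:
    """Minimal arc-standard transition system: shift, left-arc, right-arc."""

    def __init__(self, words):
        self.stack = ["ROOT"]           # sigma: starts with just the fake root
        self.buffer = list(words)       # beta: the words left to process
        self.arcs = []                  # (head, dependent) pairs built so far

    def shift(self):
        self.stack.append(self.buffer.pop(0))

    def left_arc(self):                 # second-from-top becomes dependent of top
        dep = self.stack.pop(-2)
        self.arcs.append((self.stack[-1], dep))

    def right_arc(self):                # top becomes dependent of second-from-top
        dep = self.stack.pop()
        self.arcs.append((self.stack[-1], dep))

    def is_done(self):                  # buffer empty, only root left on the stack
        return not self.buffer and self.stack == ["ROOT"]
```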
So this is how a simple transition-based dependency parser, what's referred to as an arc standard 01:09:42.720 |
transition-based dependency parser, would parse up I ate fish. 01:09:47.440 |
So remember, these are the different operations that we can apply. 01:09:51.040 |
So to start off with, we have root on the stack and the sentence in the buffer, and nothing in our arc set yet. 01:09:59.380 |
So we have to choose one of the three actions. 01:10:02.600 |
And when there's only one thing on the stack, the only thing we can do is shift, so we shift I onto the stack. 01:10:15.040 |
And at this point, we have a choice because we could immediately reduce. 01:10:19.900 |
So we could say, okay, let's just make I a dependent of root, and we'd get back to a stack of size one. 01:10:28.100 |
But that would be the wrong thing to do because I isn't the head of the sentence. 01:10:34.320 |
So what we should instead do is shift again and get I ate on the stack and fish still in the buffer. 01:10:42.320 |
Well, at that point, we keep on parsing a bit further. 01:10:46.200 |
And so now what we can do is say, well, wait a minute: I is a dependent of ate, so we can do a left arc reduce, and I disappears from the stack. 01:11:02.560 |
But we add to the set of arcs that we've built the fact that I is the subject of ate. 01:11:09.600 |
Well, after that, we could reduce again, because there are still two things on the stack. 01:11:18.000 |
But the right thing to do next would be to shift fish onto the stack. 01:11:24.160 |
And then at that point, we can do a right arc reduce, saying that fish is the object 01:11:32.420 |
of ate, and add a new dependency to our dependency set. 01:11:37.720 |
And then we can one more time do a right arc reduce to say that ate is the root of the 01:11:45.500 |
whole sentence and add in that extra root relation with our pseudo-root. 01:11:51.180 |
And at that point, we've reached the end condition. 01:11:54.420 |
So the end condition was the buffer was empty. 01:11:57.300 |
And there's one thing, the root on the stack. 01:12:03.220 |
So this little transition machine does the parsing up of the sentence. 01:12:10.200 |
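As a sanity check on that walkthrough, here is roughly how the same sequence of actions looks if you drive the small sketch above by hand. The relation names nsubj, obj, and root follow Universal Dependencies conventions and are my choice for the example.

```python
config = ArcStandardConfig(["I", "ate", "fish"])

config.shift()             # stack: ROOT, I          buffer: ate fish
config.shift()             # stack: ROOT, I, ate     buffer: fish
config.left_arc("nsubj")   # I is the subject of ate;    stack: ROOT, ate
config.shift()             # stack: ROOT, ate, fish  buffer: (empty)
config.right_arc("obj")    # fish is the object of ate;  stack: ROOT, ate
config.right_arc("root")   # ate is the root of the sentence; stack: ROOT

print(config.arcs)
# [('ate', 'nsubj', 'I'), ('ate', 'obj', 'fish'), ('ROOT', 'root', 'ate')]
print(config.is_finished())  # True
```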
But there's one thing that's still left to explain here, which is how we choose the next action. 01:12:18.220 |
So as soon as you have two things or more on the stack, you've always got a choice of what to do next. 01:12:25.220 |
You could keep shifting, at least if there are still things in the buffer. 01:12:28.460 |
Or you can do a left arc, or you can do a right arc. 01:12:35.260 |
And one answer to that is to say, well, you don't know what choice is correct. 01:12:39.540 |
And that's why parsing is hard and sentences are ambiguous. 01:12:48.860 |
And well, if you naively explore all of them, then you do an exponential amount of work, 01:12:56.420 |
and doing some kind of search over those choices is essentially what people had done before. 01:13:10.540 |
But in the early 2000s, Joakim Nivre's essential observation was, but wait a minute. 01:13:21.260 |
So why don't I try and train a classifier which predicts what the next action I should 01:13:29.300 |
take is given this stack and buffer configuration? 01:13:33.900 |
Because if I can write a machine learning classifier which can nearly always correctly 01:13:41.940 |
predict the next action given a stack and buffer, then I'm in a really good position. 01:13:50.020 |
Because then I can build what's referred to as a greedy dependency parser which just goes, 01:14:02.420 |
Run classifier, choose next action, run classifier, choose next action, run classifier, choose 01:14:07.940 |
next action, so that the amount of work that we're doing becomes linear in the length of 01:14:14.500 |
the sentence rather than it being cubic in the length of the sentence using dynamic programming 01:14:22.180 |
or exponential in the length of the sentence if you don't use dynamic programming. 01:14:27.200 |
So at each step we predict the next action using some discriminative classifier. 01:14:34.680 |
So starting off, he was using things like support vector machines, but it can be anything at 01:14:40.080 |
all, like a softmax classifier, which is closer to our neural networks. 01:14:44.440 |
And there are, for what I presented, three classes, if you're just thinking of the 01:14:50.000 |
two reduce actions and the shift. Or, if you're also assigning a relation and you 01:14:55.280 |
have a set of R relations, like 20 relations, then that's the sort of 2R plus 1, or 41, moves that you can choose between. 01:15:04.720 |
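As a small aside on where that count comes from: one shift move plus a left-arc and a right-arc move per relation gives 2R + 1 moves. Here is a tiny, hypothetical enumeration; the relation names are placeholders, not a real relation inventory.

```python
# A made-up inventory of 20 relations (only the first few have realistic names).
relations = ["nsubj", "obj", "nmod", "amod", "det"] + [f"rel{i}" for i in range(15)]

moves = ["SHIFT"]
for rel in relations:
    moves.append(f"LEFT-ARC({rel})")
    moves.append(f"RIGHT-ARC({rel})")

print(len(relations), len(moves))  # 20 41  (i.e., 2R + 1 moves)
```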
And the features are effectively the configurations I was showing before: what's the word on top of the stack, 01:15:15.000 |
what's that word's part of speech, et cetera. 01:15:17.520 |
And so in the simplest way of doing this, you're now doing no search at all. 01:15:22.000 |
You just sort of take each configuration in turn, decide the most likely next move, and make it. 01:15:29.060 |
And that's a greedy dependency parser, which is widely used. 01:15:33.280 |
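Putting those pieces together, a greedy transition-based parser is essentially the following loop; this is a sketch, not the lecture's or the assignment's actual code. Here predict_action is a hypothetical stand-in for whatever trained classifier you like (an SVM, a softmax classifier, or later a neural network), and the configuration class is the one sketched earlier.

```python
def greedy_parse(words, predict_action):
    """Greedy arc-standard parsing: one classifier call per transition.
    A sentence of n words needs n shifts and n arc actions, so the total
    work is linear in the sentence length."""
    config = ArcStandardConfig(words)
    while not config.is_finished():
        # The classifier looks at the current stack/buffer configuration and
        # returns "SHIFT", ("LEFT-ARC", relation), or ("RIGHT-ARC", relation).
        # For simplicity we assume it only ever proposes legal actions.
        action = predict_action(config)
        if action == "SHIFT":
            config.shift()
        elif action[0] == "LEFT-ARC":
            config.left_arc(action[1])
        else:
            config.right_arc(action[1])
    return config.arcs
```

In assignment three, the classifier filling the predict_action role is a neural network, but the control loop around it looks much like this.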
You can do better if you want to do a lot more work. 01:15:36.560 |
So you can do what's called a beam search where you maintain a number of fairly good 01:15:42.640 |
parse prefixes at each step and you can extend them out further. 01:15:48.800 |
And then you can evaluate later on which of those seems to be the best. 01:15:53.520 |
And so beam search is one technique to improve dependency parsing by doing a lot of work. 01:16:00.920 |
And it turns out that although these greedy transition-based parsers are a fraction worse 01:16:09.440 |
than the best possible ways known to parse sentences, they actually work very accurately. 01:16:19.000 |
And they have this wonderful advantage that they give you linear time parsing in terms of the length of the sentence. 01:16:29.080 |
And so if you want to do a huge amount of parsing, they're just a fantastic thing to 01:16:34.440 |
use because you've then got an algorithm that scales to the size of the web. 01:16:44.240 |
So I'm kind of a little bit behind, so I guess I'm not going to get through all of 01:16:48.360 |
these slides today and we'll have to finish out the final slides tomorrow. 01:16:52.520 |
But just to push a teeny bit further, I'll just say a couple more things about what 01:16:59.920 |
Nivre did for dependency parsing, and then I'll sort of introduce the neural form of it. 01:17:07.720 |
So conventionally, you had this sort of stack and buffer configuration and you wanted to predict the next action. 01:17:16.200 |
And so the way that was done was by using symbolic features of this configuration. 01:17:24.040 |
And what kind of symbolic features did you use? 01:17:28.160 |
You'd use these indicator features that picked out a small subset, normally one to three elements, of the configuration. 01:17:36.800 |
So you'd have a feature that could be something like the thing on the top of the stack is 01:17:42.240 |
the word good, which is an adjective, or it could be the thing on the top of the stack 01:17:46.800 |
is an adjective and the thing that's first in the buffer is a noun, or it could just 01:17:52.520 |
be looking at one thing and saying the first thing in the buffer is a verb. 01:17:59.760 |
And because these features commonly involved words and commonly involved conjunctions of 01:18:06.440 |
several conditions, you had a lot of features. 01:18:11.480 |
And having mentions of words and conjunctions of conditions definitely helped to make these 01:18:20.600 |
parsers accurate. But nevertheless, because you had all of these sort of one-zero symbolic features, you had very large, sparse feature representations. 01:18:28.840 |
So commonly these parsers were built using something like a million to 10 million different features. 01:18:38.700 |
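To give a flavor of what those indicator features look like, here is a tiny, hypothetical feature extractor. The feature name strings and the assumption that each stack/buffer element carries a part-of-speech tag are mine; real feature sets of this kind combined many more templates, which is how you end up with millions of features.

```python
def symbolic_features(stack, buffer):
    """Return indicator features (as strings) for one parser configuration.
    Each feature looks at one to three elements of the stack and buffer;
    stack and buffer hold (word, pos_tag) pairs."""
    feats = []
    if stack:
        s1_word, s1_pos = stack[-1]
        feats.append(f"s1.word={s1_word}")   # e.g. top of stack is the word "good"
        feats.append(f"s1.pos={s1_pos}")     # e.g. top of stack is an adjective
    if buffer:
        b1_word, b1_pos = buffer[0]
        feats.append(f"b1.pos={b1_pos}")     # e.g. first thing in the buffer is a noun
        if stack:
            # A conjunction of conditions: top of stack is an adjective AND
            # the first thing in the buffer is a noun.
            feats.append(f"s1.pos={s1_pos}&b1.pos={b1_pos}")
    return feats

print(symbolic_features([("ROOT", "ROOT"), ("good", "JJ")], [("idea", "NN")]))
# ['s1.word=good', 's1.pos=JJ', 'b1.pos=NN', 's1.pos=JJ&b1.pos=NN']
```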
And I mentioned already the importance of evaluation. 01:18:41.720 |
Let me just sort of quickly say how these parsers were evaluated. 01:18:47.440 |
So to evaluate a parser on a particular sentence, that sentence, like the rest of our test set, was hand-parsed. 01:18:58.960 |
So we have gold dependencies of what the human thought were right. 01:19:03.120 |
And so we can write those dependencies down as statements of saying the first word is 01:19:11.720 |
the dependent of the second word via a subject dependency. 01:19:16.460 |
And then the parser is also going to make similar claims as to what's a dependent on what. 01:19:23.260 |
And so there are two common metrics that are used. 01:19:26.980 |
One is just: are you getting these dependency facts right? 01:19:33.660 |
And so that's referred to as the unlabeled accuracy score, where we're just sort of measuring, 01:19:39.980 |
of all of the dependencies in the gold sentence, how many of them we got right. 01:19:45.860 |
And remember, we have one dependency per word in the sentence. 01:19:53.700 |
So if we got, say, four of the five dependencies right, that's our unlabeled accuracy score of 80%. 01:19:57.260 |
And a slightly more rigorous evaluation is to say, well, no, we're also going to label 01:20:04.180 |
them and we're going to say that this is the subject. 01:20:17.340 |
And you also need to get the grammatical relation label right. 01:20:21.480 |
And so that's then referred to as labeled accuracy score. 01:20:25.220 |
And although I got those two right for that... 01:20:30.780 |
I guess according to this example, actually, this one is wrong. 01:20:40.060 |
So I only got two of the dependencies correct, in the sense that I both got what depends 01:20:48.660 |
on what and the relation label right. And so my labeled accuracy score is only 40%. 01:20:53.900 |
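Here is a short sketch of how those two scores could be computed, assuming the gold and predicted parses are given as a (head, relation) pair for each word; the numbers in the made-up example are chosen to reproduce the 80% unlabeled and 40% labeled scores from the discussion, not the actual sentence on the slide.

```python
def accuracy_scores(gold, predicted):
    """gold and predicted map each word index to a (head_index, relation) pair.
    Unlabeled score: fraction of words whose head is correct.
    Labeled score: fraction whose head AND relation label are both correct."""
    total = len(gold)
    unlabeled = sum(1 for w in gold if predicted[w][0] == gold[w][0])
    labeled = sum(1 for w in gold if predicted[w] == gold[w])
    return unlabeled / total, labeled / total

# Five words: four heads are right (80%), but only two arcs also carry the right label (40%).
gold = {1: (2, "nsubj"), 2: (0, "root"), 3: (5, "det"), 4: (5, "amod"), 5: (2, "obj")}
pred = {1: (2, "nsubj"), 2: (0, "root"), 3: (5, "nmod"), 4: (5, "nsubj"), 5: (3, "obj")}
print(accuracy_scores(gold, pred))  # (0.8, 0.4)
```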
So I'll stop there now for the introduction for dependency parsing. 01:20:58.540 |
I still have an IOU, which is how we can then bring neural nets into this picture and how 01:21:05.580 |
they can be used to improve dependency parsing. 01:21:09.300 |
So I'll do that at the start of next time before then proceeding further into neural