Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 13 - Coreference Resolution
Chapters
0:00 Introduction
4:25 Lecture Plan
5:58 What is Coreference Resolution
6:23 Example The Star
13:00 Example The Tree
17:20 Machine Translation
20:01 Detecting mentions
22:02 Noun phrases
23:38 How to deal with spurious mentions
24:21 Can we say that it's sunny
32:08 Coreference vs Anaphora
32:53 Complex forms of Anaphora
37:34 Context
39:37 Coreference Models
40:49 Rule-based Coreference
42:41 Hobbs Algorithm
56:17 Coreference Algorithms
00:00:14.080 |
If you're following along the syllabus really closely, we actually did a little bit of a 00:00:21.240 |
And so today it's me and I'm going to talk about coreference resolution, which is another 00:00:27.380 |
chance we get to take a deeper dive into a more linguistic topic. 00:00:31.200 |
I will also show you a couple of new things for deep learning models at the same time. 00:00:36.980 |
And then the lecture that had previously been scheduled at this point, which was going to 00:00:41.880 |
be John on explanation in neural models, is being shifted later down into week nine, I 00:00:53.120 |
So we're getting underway, just a couple of announcements on things. 00:00:57.480 |
Well, first of all, congratulations on surviving assignment five, I hope. 00:01:02.840 |
I know it was a bit of a challenge for some of you, but I hope it was a rewarding state 00:01:07.680 |
of the art learning experience on the latest in neural nets. 00:01:12.040 |
And at any rate, you know, this was a brand new assignment that we used for the first 00:01:17.320 |
So we'll really appreciate later on when we do the second survey, getting your feedback 00:01:22.240 |
We've been busy reading people's final project proposals. 00:01:29.040 |
Our goal is to get them back to you tomorrow. 00:01:32.440 |
But you know, as soon as you've had a good night's sleep after assignment five, now is 00:01:36.720 |
also a great time to get started working on your final projects, because there's just 00:01:44.600 |
And I particularly want to encourage all of you to chat to your mentor regularly, go and 00:01:49.560 |
visit office hours and keep in touch, get advice, just talking through things is a good 00:01:57.240 |
We also plan to be getting back assignment four grades later this week. 00:02:02.920 |
There's sort of the work never stops at this point. 00:02:05.320 |
So the next thing for the final project is the final project milestone. 00:02:11.360 |
So that we handed out the details of that last Friday, and it's due a week from today. 00:02:18.280 |
So the idea of this final project milestone is really to help keep you on track and keep 00:02:25.080 |
things moving towards having a successful final project. 00:02:29.040 |
So our hope is that sort of most of what you write for the final project milestone is material 00:02:34.760 |
you can also include in your final project, except for a few paragraphs of here's exactly 00:02:42.000 |
So the overall hope is that doing this in two parts and having a milestone before the 00:02:47.160 |
final thing, it's just making you make progress and be on track to having a successful final 00:02:53.760 |
Finally, the next class on Thursday is going to be Colin Raffel, and this is going to be 00:03:01.920 |
So he's going to be talking more about the very latest in large pre-trained language 00:03:06.960 |
models, both what some of their successes are, and also what some of the disconcerting, 00:03:12.680 |
not quite so good aspects of those models are. 00:03:16.000 |
So that should be a really good, interesting lecture. 00:03:19.480 |
When we had him come and talk to our NLP seminar, we had several hundred people come along for 00:03:27.680 |
And so for this talk, again, we're asking that you write a reaction paragraph following 00:03:34.320 |
the same instructions as last time about what's in this lecture. 00:03:40.240 |
And someone asked in the questions, well, what about last Thursday's? 00:03:47.320 |
So the distinction here is we're only doing the reaction paragraphs for outside guest 00:03:55.040 |
And although it was great to have Antoine Bosselut for last Thursday's lecture, he's a postdoc 00:04:02.160 |
So we don't count him as an outside guest speaker. 00:04:05.220 |
And so nothing needs to be done for that one. 00:04:07.520 |
So there are three classes for which you need to do it. 00:04:12.440 |
So there was the one before from Danqi Chen, Colin Raffel, which is Thursday, and then 00:04:19.440 |
towards the end of the course, there's Yulia Tsvetkov. 00:04:27.880 |
So in the first part of it, I'm actually going to spend a bit of time talking about what 00:04:33.000 |
coreference is, what different kinds of reference and language are. 00:04:38.320 |
And then I'm going to move on and talk about some of the kind of methods that people have 00:04:45.720 |
Now there's one bug in our course design, which was a lot of years, we've had a whole 00:04:53.960 |
lecture on doing convolutional neural nets for language applications. 00:04:58.380 |
And that slight bug appeared the other day when Danqi Chen talked about the BiDAF model, 00:05:08.460 |
because she sort of slipped in, oh, there's a character CNN representation of words, and 00:05:17.420 |
I mean, actually, for applications in coreference as well, people commonly make use of character 00:05:24.800 |
So I wanted to sort of spend a few minutes sort of doing basics of conv nets for language. 00:05:31.140 |
The sort of reality here is that given that there's no exam week this year, to give people 00:05:37.140 |
more time for final projects, we sort of shortened the content by a week this year. 00:05:42.700 |
And so you're getting a little bit less of that content. 00:05:46.940 |
Then going on from there, say some stuff about a state of the art neural coreference system, 00:05:53.140 |
and right at the end, talk about how coreference is evaluated and what some of the results 00:05:59.900 |
So first of all, what is this coreference resolution term that I've been talking about 00:06:05.660 |
So coreference resolution is meaning to find all the mentions in a piece of text that refer 00:06:23.160 |
So here's part of a short story by Shruti Rao called The Star. 00:06:28.660 |
Now I have to make a confession here, because this is an NLP class, not a literature class, 00:06:36.380 |
I crudely made some cuts to the story to be able to have relevant parts appear on my slide 00:06:43.580 |
in a decent sized font for illustrating coreference. 00:06:47.420 |
So it's not quite the full original text, but it basically is a piece of this story. 00:06:53.980 |
So what we're doing in coreference resolution is we're working out what people are mentioned. 00:07:02.580 |
So here's a mention of a person, Banarja, and here's a mention of another person, Akila. 00:07:15.340 |
And then here's Akila again, and Akila's son. 00:07:23.500 |
Then there's another son here, and then her son, and Akash. 00:07:42.740 |
And then there's a naughty child, Lord Krishna. 00:07:48.540 |
And there's some that are a bit complicated, like the lead role, is that a mention? 00:07:55.140 |
It's sort of more of a functional specification of something in the play. 00:08:05.980 |
But I mean, in general, there are noun phrases that are mentioning things in the world. 00:08:12.260 |
And so then what we want to do for coreference resolution is work out which of these mentions 00:08:20.300 |
are talking about the same real world entity. 00:08:31.260 |
And so Banarja is the same person as her there. 00:08:38.020 |
And then we could read through, she resigned herself. 00:08:48.180 |
She bought him a brown T-shirt and brown trousers. 00:09:50.600 |
And so an interesting thing here is that you can get nested syntactic structure so that 00:10:04.380 |
So that if, you know, overall we have sort of this noun phrase, Akila's son Prajwal, 00:10:10.260 |
which consists of two noun phrases in apposition. 00:10:16.220 |
And then for the noun phrase Akila's son, it sort of breaks down to itself having an 00:10:22.980 |
extra possessive noun phrase in it and then a noun so that you have Akila's and then this 00:10:32.300 |
So that you have these multiple noun phrases. 00:10:36.820 |
And so that you can then be sort of having different parts of this be one person in the 00:10:46.740 |
But this noun phrase here referring to a different person in the coreference. 00:11:02.020 |
So while there's some easy other Prajwals, right, so there's Prajwal here. 00:11:12.700 |
And then you've got some more complicated things. 00:11:15.580 |
So one of the complicated cases here is that we have they went to the same school. 00:11:23.540 |
So that they there is what gets referred to as split antecedents. 00:11:32.460 |
Because the they refers to both Prajwal and Akash. 00:11:41.520 |
And that's an interesting phenomenon that and so I could try and show that somehow I 00:11:50.380 |
And if I get a different color, Akash, we have Akash and her son. 00:11:57.140 |
And then this one sort of both of them at once. 00:12:00.580 |
So human languages have this phenomenon of split antecedents. 00:12:06.960 |
But you know, one of the things that you should notice when we start talking about algorithms 00:12:13.860 |
that people use for doing coreference resolution is that they make some simplified assumptions 00:12:20.780 |
as to how they go about treating the problem. 00:12:24.660 |
And one of the simplifications that most algorithms make is for any noun phrase like this pronoun 00:12:34.180 |
say that's trying to work out what is a coreference with. 00:12:40.760 |
And so actually most NLP algorithms for coreference resolution just cannot get split antecedents 00:12:49.300 |
Any time it occurs in the text, they guess something and they always get it wrong. 00:12:54.080 |
So that's the sort of a bit of a sad state of affairs. 00:13:08.100 |
So moving on from there, we then have this tree. 00:13:15.460 |
So well, in this context of this story, Akash is going to be the tree. 00:13:25.840 |
So you could feel that it was okay to say, well, this tree is also Akash. 00:13:35.080 |
You could also feel that that's a little bit weird and not want to do that. 00:13:39.780 |
And I mean, actually different people's coreference datasets differ in this. 00:13:47.900 |
So really that, you know, we're predicating identity relationship here between Akash and 00:13:56.560 |
So do we regard the tree as the same as Akash or not? 00:14:03.300 |
But then going ahead, we have here's Akash and she bought him. 00:14:20.820 |
So then if we don't regard the tree as the same as Akash, we have a tree here. 00:14:32.860 |
But then note that the next place over here, where we have a mention of a tree, the best 00:14:41.700 |
tree, but that's sort of really a functional description of, you know, of possible trees 00:14:56.420 |
And so it seems like that's not really coreferent. 00:14:59.900 |
But if we go on, there's definitely more mention of a tree. 00:15:06.980 |
So when she has made the tree truly the nicest tree, or well, I'm not sure. 00:15:18.420 |
And maybe this one again is a sort of a functional description that isn't referring to the tree. 00:15:27.940 |
So there's definitely and so maybe this one, though, where it's a tree is referring to 00:15:37.980 |
But what I hope to have illustrated from this is, you know, most of the time when we do 00:15:44.940 |
coreference in NLP, we just make it look sort of like the conceptual phenomenon is, you 00:15:54.260 |
know, kind of obvious that there's a mention of Sarah and then it says she and you say, 00:16:05.500 |
But if you actually start looking at real text, especially when you're looking at something 00:16:11.700 |
like this, that is a piece of literature, the kind of phenomenon you get for coreference 00:16:18.020 |
and overlapping reference, and it varies other phenomena that I'll talk about, you know, 00:16:26.860 |
And it's not, you know, there are a lot of hard cases that you actually have to think 00:16:30.500 |
about as to what things you think about as coreferent or not. 00:16:37.420 |
But basically, we do want to be able to do something with coreference because it's useful 00:16:45.100 |
for a lot of things that we'd like to do in natural language processing. 00:16:49.280 |
So for one task that we've already talked about, question answering, but equally for 00:16:53.960 |
other tasks such as summarization, information extraction, if you're doing something like 00:17:00.360 |
reading through a piece of text, and you've got a sentence like he was born in 1961, you 00:17:07.760 |
really want to know who he refers to, to know if this is a good answer to the question of, 00:17:15.400 |
you know, when was Barack Obama born or something like that. 00:17:20.980 |
It turns out also that it's useful in machine translation. 00:17:25.720 |
So in most languages, pronouns have features for gender and number, and in quite a lot 00:17:35.240 |
of languages, nouns and adjectives also show features of gender, number, and case. 00:17:43.840 |
And so when you're translating a sentence, you want to be aware of these features and 00:17:51.880 |
what is coreferent as what to be able to get the translations correct. 00:18:00.080 |
So you know, if you want to be able to work out a translation and know whether it's saying 00:18:06.600 |
Alicia likes Juan because he's smart, or Alicia likes Juan because she's smart, then you have 00:18:12.740 |
to be sensitive to coreference relationships to be able to choose the right translation. 00:18:22.720 |
When people build dialogue systems, dialogue systems also have issues of coreference a 00:18:31.160 |
So you know, if it's sort of book tickets to see James Bond and the system replies Spectre 00:18:38.440 |
is playing near you at 2 and 3 today, well, there's actually a coreference relation, sorry, 00:18:43.560 |
there's a reference relation between Spectre and James Bond because Spectre is a James 00:18:51.880 |
But then it's how many tickets would you like, two tickets for the showing at 3. 00:19:01.440 |
That 3 is then a coreference relationship back to the 3pm showing that was mentioned 00:19:12.920 |
So again, to understand these, we need to be understanding the coreference relationships. 00:19:19.900 |
So how now can you go about doing coreference? 00:19:23.940 |
So the standard traditional answer, which I'll present first, is coreference is done 00:19:32.320 |
On the first step, what we do is detect mentions in a piece of text. 00:19:40.920 |
And then in the second step, we work out how to cluster the mentions. 00:19:46.180 |
So as in my example from the Shruti Rao text, basically what you're doing with coreference 00:19:53.280 |
is you're building up these clusters, sets of mentions that refer to the same entity 00:20:01.760 |
So if we explore a little how we could do that as a two-step solution, the first part 00:20:10.020 |
And so pretty much there are three kinds of things, different kinds of noun phrases that 00:20:21.060 |
There are pronouns like I, your, it, she, him, and also some demonstrative pronouns 00:20:29.440 |
There are explicitly named things, so things like Paris, Joe Biden, Nike. 00:20:35.720 |
And then there are plain noun phrases that describe things. 00:20:40.320 |
So a dog, the big fluffy cat stuck in the tree. 00:20:43.800 |
And so all of these are things that we'd like to identify as mentions. 00:20:49.440 |
And the straightforward way to identify these mentions is to use natural language processing 00:20:56.680 |
tools, several of which we've talked about already. 00:21:01.080 |
So to work out pronouns, we can use what's called a part of speech tagger, 00:21:12.760 |
which we haven't really explicitly talked about, 00:21:18.200 |
but which you used when you built dependency parsers. 00:21:21.880 |
So it first of all assigns parts of speech to each word so we can just find the words 00:21:30.680 |
For named entities, we did talk just a little bit about named entity recognizers as a use 00:21:36.120 |
of sequence models for neural networks so we can pick out things like person names and 00:21:44.040 |
And then for the ones like the big fluffy, a big fluffy dog, we could then be sort of 00:21:50.400 |
picking out from syntactic structure noun phrases and regarding them as descriptions 00:21:58.020 |
So that we could use all of these tools and those would give us basically our mentions. 00:22:03.600 |
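To make this pipeline view concrete, here is a minimal sketch of mention detection in Python. It assumes spaCy's off-the-shelf English model (en_core_web_sm) purely as an illustration; real coreference pipelines use their own taggers, parsers, and NER models and do more careful span merging.

```python
import spacy

# A pipeline-style mention detector: POS tags for pronouns, NER for named
# entities, and the parser's noun chunks for descriptive noun phrases.
nlp = spacy.load("en_core_web_sm")

def detect_mentions(text):
    doc = nlp(text)
    mentions = []
    mentions += [tok.text for tok in doc if tok.pos_ == "PRON"]   # pronouns
    mentions += [ent.text for ent in doc.ents]                    # named entities
    mentions += [np.text for np in doc.noun_chunks]               # plain noun phrases
    # Overlapping or duplicate candidates would normally be merged or filtered here.
    return mentions

print(detect_mentions("The big fluffy cat stuck in the tree saw Joe Biden. It meowed."))
```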
It's a little bit more subtle than that, because it turns out there are some noun phrases and 00:22:11.440 |
things of all of those kinds which don't actually refer so that they're not referential in the 00:22:18.880 |
So when you say it is sunny, it doesn't really refer. 00:22:22.800 |
When you make universal claims like every student, well, every student isn't referring 00:22:31.680 |
And more dramatically, when you have no student and make a negative universal claim, it's 00:22:39.120 |
There are also things that you can describe functionally, which don't have any clear reference. 00:22:47.280 |
So if I say the best donut in the world, that's a functional claim, but it doesn't necessarily 00:22:55.440 |
Like if I've established that I think a particular kind of donut is the best donut in the world, 00:23:03.020 |
I could then say to you, I ate the best donut in the world yesterday. 00:23:10.000 |
And you know what I mean, it might have reference. 00:23:12.700 |
But if I say something like I'm going around to all the donut stores trying to find the 00:23:17.220 |
best donut in the world, then it doesn't have any reference yet. 00:23:20.860 |
It's just a sort of a functional description I'm trying to satisfy. 00:23:24.980 |
You also then have things like quantities, 100 miles. 00:23:29.560 |
It's that quantity that is not really something that has any particular reference. 00:23:33.920 |
You can mark out 100 miles, all sorts of places. 00:23:38.880 |
So how do we deal with those things that aren't really mentions? 00:23:44.040 |
Well one way is we could train a machine learning classifier to get rid of those spurious mentions. 00:23:54.680 |
Most commonly if you're using this kind of pipeline model where you use a parser and 00:24:01.400 |
a named entity recognizer, you regard everything as you've found as a candidate mention, and 00:24:10.680 |
And some of them, like those ones, hopefully aren't made coref with anything else. 00:24:16.280 |
And so then you just discard them at the end of the process. 00:24:22.440 |
>> I've got an interesting question that linguistics bears on this. 00:24:40.280 |
So people have actually tried to suggest that when you say it is sunny, it means the weather 00:24:51.680 |
But I guess the majority opinion at least is that isn't plausible. 00:24:59.180 |
And I mean, for I guess many of you aren't native speakers of English, but similar phenomena 00:25:08.740 |
I mean, it just intuitively doesn't seem plausible when you say it's sunny or it's raining today 00:25:17.460 |
that you're really saying that as a shortcut for the weather is raining today. 00:25:23.480 |
It just seems like really what the case is, is English likes to have something filling 00:25:30.620 |
And when there's nothing better to fill the subject position, you stick it in there and 00:25:39.360 |
And so in general, it's believed that you get this phenomenon of having these empty 00:25:46.900 |
I mean, another place in which it seems like you clearly get dummy its is that when you 00:25:52.740 |
have clauses that are subjects of a verb, you can move them to the end of the sentence. 00:25:59.540 |
So if you have a sentence where you put a clause in the subject position, they normally 00:26:07.580 |
So you have a sentence something like "That CS224N is a lot of work is known by all students." 00:26:17.660 |
People don't normally say that; the normal thing to do is to shift the clause to the end. 00:26:22.660 |
But when you do that, you stick in a dummy it to fill the subject position. 00:26:27.020 |
So you then have it is known by all students that CS224N is a lot of work. 00:26:33.660 |
So that's the general feeling that this is a dummy it that doesn't have any reference. 00:26:45.540 |
But if someone asks, among other things, "How is the weather?" 00:26:56.820 |
and you answer "It is sunny," it then does seem like the it is in reference to the weather. 00:27:04.260 |
Well, you know, I guess this is what our coreference systems are built trying to do in situations 00:27:10.660 |
like that, they're making a decision of coreference or not. 00:27:14.180 |
And I guess what you'd want to say in that case is, it seems reasonable to regard this 00:27:18.260 |
one as coreference that weather that did appear before it. 00:27:23.020 |
I mean, but that also indicates another reason to think that in the normal case is not coreference, 00:27:30.980 |
Because normally, pronouns are only used when their reference is established that you've 00:27:35.700 |
referred to now like, John is answering questions, and then you can say, he types really quickly, 00:27:44.940 |
sort of seem odd to just sort of start the conversation by he types really quickly, because 00:27:50.620 |
it doesn't have any established reference, whereas that doesn't seem to be the case, 00:27:54.940 |
it seems like you can just sort of start a conversation by saying it's raining really 00:28:07.180 |
So I've sort of there presented the traditional picture. 00:28:13.540 |
But you know, this traditional picture doesn't mean something that was done last millennium 00:28:19.820 |
I mean, essentially, that was the picture until about 2016. 00:28:28.740 |
That essentially, every coreference system that was built, use tools like part of speech 00:28:34.440 |
taggers, NER systems, and parsers to analyze sentences, to identify mentions, and to give 00:28:46.940 |
But more recently, in our neural systems, people have moved to avoiding traditional 00:28:53.620 |
pipeline systems and doing one shot end to end coreference resolution systems. 00:29:02.340 |
So if I skip directly to the second bullet, there's a new generation of neural systems 00:29:09.300 |
where you just start with your sequence of words, and you do the maximally dumb thing, 00:29:15.520 |
you just say, let's take all spans, commonly with some heuristics for efficiency, but you 00:29:21.700 |
know, conceptually, all subsequences of this sentence, they might be mentions, let's feed 00:29:28.060 |
them into a neural network, which will simultaneously do mention detection and coreference resolution 00:29:37.580 |
And I'll give an example of that kind of system later in the lecture. 00:29:41.900 |
Okay, is everything good to there and I should go on? 00:29:51.660 |
So I'm going to get on to how to do coreference resolution systems. 00:29:58.340 |
But before I do that, I do actually want to show a little bit more of the linguistics 00:30:03.820 |
of coreference, because there are actually a few more interesting things to understand 00:30:10.820 |
I mean, when we say coreference resolution, we really confuse together two linguistic 00:30:23.380 |
And so it's really actually good to understand the difference between these things. 00:30:30.020 |
One is that you can have mentions, which are essentially standalone, but happen to refer 00:30:41.140 |
So if I have a piece of text that said, Barack Obama traveled yesterday to Nebraska, Obama 00:30:48.940 |
was there to open a new meat processing plant or something like that. 00:30:54.060 |
I've mentioned with Barack Obama and Obama, there are two mentions there, they refer to 00:30:59.660 |
the same person in the world, they are coreferent. 00:31:05.100 |
But there's a different related linguistic concept called anaphora. 00:31:09.820 |
And anaphora is when you have a textual dependence of an anaphor on another term, which is the antecedent. 00:31:17.740 |
And in this case, the meaning of the anaphor is determined by the antecedent in a textual 00:31:29.700 |
So when it's Barack Obama said he would sign the bill, he is an anaphor. 00:31:35.700 |
It's not a word that independently we can work out what its meaning is in the world, 00:31:40.820 |
apart from knowing the vaguest feature that it's referring to something probably male. 00:31:48.300 |
But in the context of this text, we have that this anaphor is textually dependent on Barack 00:31:57.540 |
And so then we have an anaphoric relationship, which sort of means they refer to the same 00:32:04.820 |
And so therefore, you can say they're coreferent. 00:32:11.160 |
So for coreference, we have these separate textual mentions, which are basically standalone, 00:32:20.940 |
Whereas in anaphora, we actually have a textual relationship. 00:32:25.860 |
And you essentially have to use pronouns like he and she in legitimate ways in which the 00:32:33.940 |
hearer can reconstruct the relationship from the text, because they can't work out what 00:32:49.540 |
But it's actually a little bit more to realize, because there are more complex forms of anaphora, 00:32:56.740 |
which aren't coreference, because you have a textual dependence, but it's not actually 00:33:07.140 |
And so this comes back to things like these quantifier noun phrases that don't have reference. 00:33:13.980 |
So when you have sentences like these ones, every dancer twisted her knee, well, this 00:33:20.780 |
her here has an anaphoric dependency on every dancer, or even more clearly with no dancer 00:33:28.900 |
twisted her knee, the her here has an anaphoric dependence on no dancer. 00:33:35.500 |
But for no dancer twisted her knee, no dancer isn't referential. 00:33:43.980 |
And so there's no coreferential relationship, because there's no reference relationship, 00:33:49.660 |
but there's still an anaphoric relationship between these two noun phrases. 00:33:57.420 |
And then you have this other complex case that turns up quite a bit, where you can have 00:34:03.500 |
where the things being talked about do have reference, but an anaphoric relationship is 00:34:13.380 |
So you commonly get constructions like this one. 00:34:18.660 |
We went to a concert last night, the tickets were really expensive. 00:34:23.180 |
Well, the concert and the tickets are two different things. 00:34:32.220 |
But in interpreting this sentence, what this really means is the tickets to the concert, 00:34:42.220 |
And so there's sort of this hidden, not said dependence where this is referring back to 00:34:49.660 |
And so what we say is that these, the tickets does have an anaphoric dependence on the concert, 00:34:59.500 |
And so that's referred to as bridging anaphora. 00:35:02.540 |
And so overall, there's the simple case and the common case, which is pronominal anaphora, 00:35:13.220 |
You then have other cases of coreference, such as mentions of the United States: 00:35:21.060 |
every mention of the United States is coreferential with every other mention of 00:35:25.940 |
the United States, but those don't have any textual dependence on each other. 00:35:31.180 |
And then you have textual dependencies like bridging anaphora, which aren't coreference. 00:35:37.780 |
That's probably about as, now I was going to say that's probably as much linguistics 00:35:43.300 |
as you wanted to hear, but actually I have one more point of linguistics. 00:35:49.440 |
One or two of you, but probably not many, might've been troubled by the fact that the 00:35:57.380 |
term anaphora as a classical term means that you are looking backward for your antecedent, 00:36:06.660 |
that the "ana" part of anaphora means that you're looking backward for your antecedent. 00:36:12.700 |
And in sort of classical terminology, you have both anaphora and cataphora, and it's 00:36:20.940 |
cataphora where you look forward for your antecedent. 00:36:25.700 |
Cataphora isn't that common, but it does occur. 00:36:33.740 |
From the corner of the divan of Persian saddlebags on which he was lying, smoking as was his 00:36:39.420 |
custom innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey 00:36:47.300 |
sweet and honey colored blossoms of a laburnum. 00:36:51.780 |
So in this example here, the he and then this his are actually referring to Lord Henry Wotton. 00:37:07.620 |
But in modern linguistics, even though most reference to pronouns is backwards, we don't 00:37:23.460 |
And so the term anaphor and anaphora is used for textual dependence, regardless of whether 00:37:31.660 |
A lot of details there, but taking stock of this. 00:37:37.140 |
So the basic observation is language is interpreted in context, that in general, you can't work 00:37:45.780 |
out the meaning or reference of things without looking at the context of the linguistic utterance. 00:37:57.260 |
So for something like word sense disambiguation, if you see just the words, the bank, you don't 00:38:05.380 |
And you need to look at a context to get some sense as to whether it means a financial institution 00:38:10.660 |
or the bank of a river or something like that. 00:38:13.860 |
And so anaphora and coreference give us additional examples where you need to be doing contextual 00:38:23.860 |
So when you see a pronoun, you need to be looking at the context to see what it refers 00:38:31.240 |
And so if you think about text understanding as a human being does it, reading a story 00:38:36.980 |
or an article, that we progress through the article from beginning to end. 00:38:42.280 |
And as we do it, we build up a pretty complex discourse model in which new entities are 00:38:49.220 |
introduced by mentions and then they're referred back to and relationships between them are 00:38:54.900 |
established and they take actions and things like that. 00:38:58.540 |
And it sort of seems like in our head that we sort of build up a kind of a complex graph 00:39:03.300 |
like discourse representation of a piece of text with all these relationships. 00:39:09.180 |
And so part of that is these anaphoric relationships and coreference that we're talking about here. 00:39:15.060 |
And indeed in terms of CS224N, the only kind of whole discourse meaning that we're going 00:39:22.220 |
to look at is looking a bit at anaphora and coreference. 00:39:26.700 |
But if you want to see more about higher level natural language understanding, you can get 00:39:37.100 |
So I want to tell you a bit about several different ways of doing coreference. 00:39:45.620 |
So broadly there are four different kinds of coreference models. 00:39:51.200 |
So the traditional old way of doing it was rule-based systems. 00:39:55.700 |
And this isn't the topic of this class and this is pretty archaic at this point. 00:40:04.400 |
But I wanted to say a little bit about it because it's actually kind of interesting 00:40:09.380 |
as sort of food for thought as to how far along we are or aren't in solving artificial 00:40:16.220 |
intelligence and really being able to understand texts. 00:40:20.740 |
Then there are sort of classic machine learning methods of doing it, which you can sort of 00:40:25.340 |
divide up as mention pair methods, mention ranking methods, and really clustering methods. 00:40:31.900 |
And I'm sort of going to skip the clustering methods today because most of the work, especially 00:40:36.400 |
most of the recent work, implicitly makes clusters by using either mention pair or 00:40:43.720 |
And so I'm going to talk about a couple of neural methods for doing that. 00:40:49.800 |
But first of all, let me just tell you a little bit about rule-based coreference. 00:40:55.380 |
So there's a famous historical algorithm in NLP for doing pronominal anaphora resolution, 00:41:05.300 |
which is referred to as the Hobbs algorithm. 00:41:09.420 |
So everyone just refers to it as the Hobbs algorithm. 00:41:12.100 |
And if you sort of look up a textbook like Jurafsky and Martin's textbook, it's referred 00:41:19.660 |
But actually, if you go back to Jerry Hobbs, that's his picture over there in the corner, 00:41:24.620 |
if you actually go back to his original paper, he refers to it as the naive algorithm. 00:41:32.460 |
And his naive algorithm for pronoun coreference was this sort of intricate handwritten set 00:41:43.220 |
So this is the start of the set of the rules, but there are more rules or more clauses of 00:41:53.540 |
And this looks like a hot mess, but the funny thing was that this set of rules for determining 00:42:04.140 |
And so in the sort of 1990s and 2000s decade, even when people were using machine learning 00:42:12.300 |
based systems for doing coreference, they'd hide into those machine learning based systems 00:42:18.460 |
that one of their features was the Hobbs algorithm and that the predictions it made 00:42:24.220 |
with a certain weight was then a feature in making your final decisions. 00:42:29.580 |
And it's only really in the last five years that people have moved away from using the Hobbs algorithm. 00:42:35.660 |
Let me give you a little bit of a sense of how it works. 00:42:43.740 |
This is an example from a Guardian book review. 00:42:46.740 |
Niall Ferguson is prolific, well-paid and a snappy dresser. 00:42:53.220 |
So what the Hobbs algorithm does is we start with a pronoun. 00:43:01.100 |
We start with a pronoun and then it says step one, go to the NP that's immediately dominating the pronoun. 00:43:09.180 |
And then it says go up to the first NP or S, call this X and the path P. 00:43:18.760 |
Then it says traverse all branches below X to the left of P, left to right, breadth first. 00:43:26.080 |
So then it's saying to go left to right for other branches below breadth first. 00:43:34.940 |
So we're going down and left to right and look for an NP. 00:43:44.220 |
But then we have to read more carefully and say propose as antecedent any NP that has 00:43:55.940 |
Well, this NP here has no NP or S between it and X. 00:44:06.540 |
So this is all very, you know, complex and handwritten, but basically he sort of fit 00:44:13.180 |
into the clauses of this kind of a lot of facts about how the grammar of English works. 00:44:20.260 |
And so what this is capturing is if you imagine a different sentence, you know, if you imagine 00:44:25.900 |
the sentence, Stephen Moss's brother hated him. 00:44:35.820 |
Well then Stephen Moss would naturally be coreferent with him. 00:44:40.980 |
And in that case, well, precisely what you'd have is the noun phrase with, well, the noun 00:44:50.700 |
phrase brother, and you'd have another noun phrase inside it for the Stephen Moss. 00:45:02.340 |
So in the case of Stephen Moss's brother, when you looked at this noun phrase, there 00:45:08.140 |
would be an intervening noun phrase before you got to the node X. 00:45:14.060 |
And therefore Stephen Moss is a possible and in fact, good antecedent of him. 00:45:23.500 |
And the algorithm would choose Stephen Moss, but the algorithm correctly captures that 00:45:29.020 |
when you have the sentence, Stephen Moss hated him, that him cannot refer to Stephen Moss. 00:45:35.980 |
So having worked that out, it then says if X is the highest S in the sentence, okay, 00:45:42.660 |
so my X here is definitely the highest S in the sentence because I've got the whole sentence. 00:45:48.300 |
What you should do is then traverse the parse trees of previous sentences in the order of recency. 00:45:56.620 |
So what I should do now is sort of work backwards in the text, one sentence at a time, going 00:46:07.800 |
And then for each tree, traverse each tree left to right, breadth first. 00:46:13.100 |
So then within each tree, I'm doing the same of going breadth first. 00:46:18.680 |
So sort of working down and then going left to right with an equal breadth. 00:46:24.200 |
And so hidden inside these clauses, it's capturing a lot of the facts of how coreference typically 00:46:33.760 |
So what you find in English, I'll say, but in general, this is true of lots of languages, 00:46:42.440 |
is that there are general preferences and tendencies for coreference. 00:46:46.600 |
So a lot of the time, a pronoun will be coreferent with something in the same sentence, like 00:46:52.400 |
Stephen Moss's brother hated him, but it can't be if it's too close to it. 00:46:57.480 |
So you can't say Stephen Moss hated him and have the him be Stephen Moss. 00:47:02.220 |
And if you're then looking for coreference that's further away, the thing it's coreferent 00:47:10.920 |
And so that's why you work backwards through sentences one by one. 00:47:15.440 |
But then once you're looking within a particular sentence, the most likely thing it's going 00:47:20.680 |
to be coreferent to is a topical noun phrase and default topics in English are subjects. 00:47:28.880 |
So by doing things breadth first, left to right, a preferred antecedent is then a subject. 00:47:36.520 |
And so this algorithm, I won't go through all the complex clauses five through nine, 00:47:41.440 |
ends up saying, okay, what you should do is propose Niall Ferguson as what is coreferent 00:47:48.200 |
to him, which is the obvious correct reading in this example. 00:47:56.340 |
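To give a feel for the traversal order being described, here is a simplified sketch in Python using nltk's Tree. It only shows the left-to-right, breadth-first search for NP nodes that the steps above rely on; it deliberately omits the "intervening NP or S" condition, the back-off to earlier sentences, and the later clauses, so it is not the full Hobbs algorithm.

```python
from collections import deque
from nltk import Tree

def candidate_nps_breadth_first(node):
    """Yield NP subtrees below `node`, breadth first, left to right."""
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in current:
            if isinstance(child, Tree):
                if child.label() == "NP":
                    yield child          # subjects tend to be found first
                queue.append(child)

sent = Tree.fromstring(
    "(S (NP (NNP Niall) (NNP Ferguson)) (VP (VBZ is) (ADJP (JJ prolific))))"
)
for np in candidate_nps_breadth_first(sent):
    print(" ".join(np.leaves()))
```

Because the search is breadth first and left to right, topical subject noun phrases like "Niall Ferguson" surface before more deeply embedded ones, which is exactly the preference the lecture describes.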
And in some sense, the details of that aren't interesting. 00:48:00.960 |
But what is I think actually still interesting in 2021 is what points Jerry Hobbs was actually 00:48:15.240 |
And the point he was trying to make was the following. 00:48:19.580 |
So Jerry Hobbs wrote this algorithm, the naive algorithm, because what he said was, 00:48:28.040 |
well, look, if you want to try and crudely determine coreference, well, there are these 00:48:42.280 |
There's a preference for topical things like subject. 00:48:45.400 |
And there are things where, you know, if it has gender, it has to agree in gender. 00:48:49.860 |
So there are sort of strong constraints of that sort. 00:48:54.300 |
So I can write an algorithm using my linguistic nous, which captures all the main preferences. 00:49:06.180 |
Doing that is a pretty strong baseline system. 00:49:10.300 |
But what Jerry Hobbs wanted to argue is that this algorithm just isn't something you should 00:49:23.420 |
This is just sort of, you know, making a best guess according to the preferences of what's 00:49:31.600 |
most likely without actually understanding what's going on in the text at all. 00:49:37.160 |
And so actually, what Jerry Hobbs wanted to argue was, the so-called Hobbs algorithm 00:49:43.720 |
now, he wasn't a fan of the Hobbs algorithm. 00:49:46.480 |
He was wanting to argue that the Hobbs algorithm is completely inadequate as a solution to 00:49:53.080 |
And the only way we'll actually make progress in natural language understanding is by building 00:49:57.760 |
systems that actually really understand the text. 00:50:02.580 |
And this is actually something that has come to the fore again more recently. 00:50:09.580 |
So the suggestion is that in general, you can't work out coreference or pronominal 00:50:17.020 |
anaphora in particular unless you're really understanding the meaning of the text. 00:50:21.860 |
And people look at pairs of examples like these ones. 00:50:25.300 |
So she poured water from the pitcher into the cup until it was full. 00:50:30.520 |
So think for just half a moment, well, what is it in that example that is full? 00:50:43.320 |
But then if I say she poured water from the pitcher into the cup until it was empty, well, 00:50:52.740 |
And the point that is being made with this example is the only thing that's been changed 00:50:59.640 |
in these examples is the adjective right here. 00:51:05.240 |
So these two examples have exactly the same grammatical structure. 00:51:11.640 |
So in terms of the Hobbs naive algorithm, the Hobbs naive algorithm necessarily has 00:51:19.160 |
to predict the same answer for both of these. 00:51:24.480 |
You just cannot determine the correct pronoun antecedent based on grammatical preferences 00:51:31.200 |
of the kind that are used in the naive algorithm. 00:51:34.520 |
You actually have to conceptually understand about pitchers and cups and water and full 00:51:40.040 |
and empty to be able to choose the right antecedent. 00:51:45.080 |
Here's another famous example that goes along the same lines. 00:51:49.880 |
So Terry Winograd, shown here as a young man. 00:51:53.920 |
So long, long ago, Terry Winograd came to Stanford as the natural language processing 00:51:59.240 |
faculty and Terry Winograd became disillusioned with the symbolic AI of those days and just 00:52:10.360 |
And he reinvented himself as being an HCI person. 00:52:14.080 |
And so Terry was then essentially the person who established the HCI program at Stanford. 00:52:20.720 |
But before he lost faith in symbolic AI, he talked about the coreference problem and pointed 00:52:32.320 |
So we have the city council refused the women a permit because they feared violence versus 00:52:39.440 |
the city council refused the women a permit because they advocated violence. 00:52:44.560 |
So again, you have this situation where these two sentences have identical syntactic structure 00:52:50.800 |
and they differ only in the choice of verb here. 00:52:54.200 |
But once you add knowledge, common sense knowledge of how the human world works, well, how this 00:53:03.680 |
should pretty obviously be interpreted that in the first one that they is referring to 00:53:10.280 |
the city council, whereas in the second one that they is referring to the women. 00:53:17.640 |
And so coming off of that example of Terry, these have been referred to as Winograd schemas. 00:53:26.120 |
So Winograd schema challenges sort of choosing the right reference here. 00:53:31.920 |
And so it's basically just doing pronominal anaphora resolution. 00:53:35.800 |
But the interesting thing is people have been interested in what are tests of general intelligence 00:53:43.080 |
and one famous general test of intelligence, which I won't talk about now, is the Turing test. 00:53:49.040 |
And there's been a lot of debate about problems with the Turing test and is it good? 00:53:52.920 |
And so in particular, Hector Levesque, who's a very well-known senior AI person, he actually 00:54:00.320 |
proposed that a better alternative to the Turing test might be to do what he then dubbed the Winograd Schema Challenge. 00:54:08.280 |
And Winograd schema is just solving pronominal coreference in cases like this where you have 00:54:13.880 |
to have knowledge about the situation in the world to get the answer right. 00:54:18.280 |
And so he's basically arguing that, you know, you can view really solving coreference 00:54:27.140 |
And that's sort of the position that Hobbs wanted to advocate. 00:54:33.080 |
So what he actually said about his algorithm was that the naive approach is quite good. 00:54:38.800 |
Computationally speaking, it will be a long time before a semantically based algorithm 00:54:46.320 |
And these results set a very high standard for any other approach to aim for. 00:54:50.480 |
And he was proven right about that because it sort of really took until around 2015 00:54:55.400 |
before people thought they could do without the Hobbes algorithm. 00:54:59.400 |
But then he notes, yet there is every reason to pursue a semantically based approach. 00:55:09.640 |
In these cases, it not only fails, it gives no indication that it has failed and offers 00:55:18.720 |
And so I think this is actually still interesting stuff to think about because, you know, really 00:55:23.480 |
for the kind of machine learning based coreference systems that we are building, you know, they're 00:55:29.320 |
not a hot mess of rules like the Hobbes algorithm, but basically they're still sort of working 00:55:36.720 |
out statistical preferences of what patterns are most likely and choosing the antecedent 00:55:46.800 |
They really have exactly the same deficiencies still that Hobbs was talking about, right? 00:56:00.600 |
The algorithms give you no idea when they fail. 00:56:03.880 |
They're not really understanding the text in a way that a human does to determine the 00:56:09.520 |
So we still actually have a lot more work to do before we're really doing full artificial 00:56:16.920 |
But I'd best get on now and actually tell you a bit about some coreference algorithms. 00:56:23.320 |
So the simple way of thinking about coreference is to say that you're making just a binary 00:56:34.680 |
So if you have your mentions, you can then say, well, I've come to my next mention, she, 00:56:43.760 |
I want to work out what it's coreferent with. 00:56:47.480 |
And I can just look at all of the mentions that came before it and say, is it coreferent 00:56:56.080 |
So at training time, I'll be able to say I have positive examples, assuming I've got 00:57:01.120 |
some data labeled for what's coreferent to what, as to these ones are coreferent. 00:57:06.260 |
And I've got some negative examples of these ones are not coreferent. 00:57:11.440 |
And what I want to do is build a model that learns to predict coreferent things. 00:57:16.760 |
And I can do that fairly straightforwardly in the kind of ways that we have talked about. 00:57:22.360 |
So I train with the regular kind of cross entropy loss, where I'm now summing over every 00:57:30.760 |
pairwise binary decision as to whether two mentions are coreferent to each other or not. 00:57:38.460 |
And so then when I'm at test time, what I want to do is cluster the mentions that correspond 00:57:46.420 |
And I do that by making use of my pairwise scorer. 00:57:50.760 |
So I can run my pairwise scorer, and it will give a probability or a score that any two 00:58:00.980 |
So by picking some threshold, like 0.5, I can add coreference links for when the classifier 00:58:11.320 |
And then I do one more step to give me a clustering. 00:58:15.080 |
I then say, okay, let's also make the transitive closure to give me clusters. 00:58:21.200 |
So it thought that I and she were coreferent and my and she were coreferent. 00:58:26.720 |
Therefore, I also have to regard I and my as coreferent. 00:58:32.440 |
And so that's sort of the completion by transitivity. 00:58:36.300 |
And so since we always complete by transitivity, note that this algorithm is very sensitive 00:58:47.160 |
Because if you make one mistake, for example, you say that he and my are coreferent, then 00:58:53.440 |
by transitivity, all of the mentions in the sentence become one big cluster and that they're 00:59:03.940 |
So that's a workable algorithm and people have often used it. 00:59:08.200 |
But often people go a little bit beyond that and prefer a mention ranking model. 00:59:15.320 |
So let me just explain the advantages of that. 00:59:18.520 |
That normally, if you have a long document where it's Ralph Nader and he did this and 00:59:24.120 |
some of them did something to him and we visited his house and blah, blah, blah, blah. 00:59:28.720 |
And then somebody voted for Nader because he. 00:59:33.400 |
In terms of building a coreference classifier, it seems like it's easy and reasonable to 00:59:43.000 |
be able to recover that this he refers to Nader. 00:59:47.860 |
But in terms of building a classifier for it to recognize that this he should be referring 00:59:54.500 |
to this Nader, which might be three paragraphs back, seems kind of unreasonable how you're 01:00:01.900 |
So those faraway ones might be almost impossible to get correct. 01:00:06.400 |
And so that suggests that maybe we should have a different way of configuring this task. 01:00:13.460 |
So instead of doing it that way, what we should say is, well, this he here has various possible 01:00:21.700 |
antecedents and our job is to just choose one of them. 01:00:26.720 |
And that's almost sufficient apart from we need to add one more choice, which is, well, 01:00:35.900 |
some mentions won't be coreferent with anything that proceeds because we're introducing a 01:00:43.680 |
So we can add one more dummy mention, the N/A mention. 01:00:49.920 |
So it doesn't refer to anything previously in the discourse. 01:00:55.220 |
And then our job at each point is to do mention ranking to choose which one of these she refers 01:01:02.780 |
And then at that point, rather than doing binary yes/no classifiers, that what we can 01:01:09.040 |
do is say, aha, this is choose one classification and then we can use the kind of softmax classifiers 01:01:22.020 |
So that gets us in business for building systems. 01:01:25.760 |
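To make the mention-ranking idea concrete, here is a small PyTorch sketch of the decision for one anaphor: score every earlier mention plus a dummy "no antecedent" choice, and train with an ordinary cross-entropy loss over that softmax. The `score_pair` function and the convention of fixing the dummy score to zero are illustrative assumptions, not a specific published system.

```python
import torch
import torch.nn.functional as F

def ranking_loss(anaphor, candidates, gold_index, score_pair):
    # Slot 0 is the dummy "no antecedent" choice; its score is often just fixed to 0.
    scores = [torch.zeros(())]
    scores += [score_pair(anaphor, c) for c in candidates]   # one score per earlier mention
    scores = torch.stack(scores).unsqueeze(0)                # shape (1, num_candidates + 1)
    # gold_index = 0 if the mention starts a new entity, else 1 + antecedent position.
    return F.cross_entropy(scores, torch.tensor([gold_index]))
```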
And for either of these kind of models, there are several ways in which we can build the 01:01:32.820 |
We could use any kind of traditional machine learning classifier. 01:01:40.100 |
We can use more advanced ones with all of the tools that we've been learning about more 01:01:45.340 |
So let me just quickly show you a simple neural network way of doing it. 01:01:51.540 |
So this is a model that my PhD student, Kevin Clark, did in 2015. 01:02:00.820 |
But what he was doing was doing coreference resolution based on the mentions with a simple 01:02:07.120 |
feedforward neural network, kind of in some sense like we did dependency parsing with 01:02:14.220 |
So for the mention, it had word embeddings, antecedent had word embeddings. 01:02:22.580 |
There were some additional features of each of the mention and candidate antecedent. 01:02:27.660 |
And then there were some final additional features that captured things like distance 01:02:31.660 |
away, which you can't see from either the mention or the candidate. 01:02:36.560 |
And all of those features were just fed into several feedforward layers of a neural network. 01:02:43.180 |
And it gave you a score of whether these things are coreferent or not. 01:02:58.540 |
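A sketch of that kind of feedforward pair scorer in PyTorch is below: concatenate the mention features, the candidate antecedent features, and the pair features (distance and so on), then pass them through a few ReLU layers to a single score. The layer sizes here are placeholders, not the ones from Clark's model.

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    def __init__(self, mention_dim=500, pair_feat_dim=20, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * mention_dim + pair_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),          # single "are these coreferent?" score
        )

    def forward(self, mention_feats, antecedent_feats, pair_feats):
        x = torch.cat([mention_feats, antecedent_feats, pair_feats], dim=-1)
        return self.net(x).squeeze(-1)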
But what I do want to show is sort of a more advanced and modern neural coreference system. 01:03:06.340 |
But before I do that, I want to take a digression and sort of say a few words about convolutional neural networks. 01:03:16.380 |
So the idea of when you apply a convolutional neural network to language, i.e. to sequences, 01:03:25.900 |
is that what you're going to do is you're going to compute vectors, features effectively, 01:03:32.700 |
for every possible word subsequence of a certain length. 01:03:36.960 |
So that if you have a piece of text like tentative deal reached to keep government open, you 01:03:42.260 |
might say I'm going to take every three words of that, i.e. tentative deal reached, deal 01:03:48.300 |
reached to, reached to keep, and I'm going to compute a vector based on that subsequence 01:03:55.860 |
of words and use those computed vectors in my model by somehow grouping them together. 01:04:04.620 |
So the canonical case of convolutional neural networks is in vision. 01:04:11.420 |
And so if after this next quarter you go along to CS231N, you'll be able to spend weeks doing 01:04:22.500 |
And so the idea there is that you've got these convolutional filters that you sort of slide 01:04:30.580 |
over an image and you compute a function of each place. 01:04:36.120 |
So the sort of little red numbers are showing you what you're computing, but then you'll 01:04:41.000 |
slide it over to the next position and fill in this cell, and then you'll slide it over 01:04:46.480 |
the next position and fill in this cell, and then you'll slide it down and fill in this 01:04:53.480 |
And so you've got this sort of little function of a patch, which you're sliding over your 01:04:58.800 |
image and computing a convolution, which is just a dot product effectively, that you're 01:05:06.000 |
then using to get an extra layer of representation. 01:05:10.280 |
And so by sliding things over, you can pick out features and you've got a sort of a feature 01:05:16.000 |
identifier that runs across every piece of the image. 01:05:21.080 |
Well, for language, we've just got a sequence, but you can do basically the same thing. 01:05:28.360 |
And what you then have is a 1D convolution for text. 01:05:32.640 |
So if here's my sentence, tentative deal, reach to keep the government open, that what 01:05:38.120 |
I can do is have, so these words have a word representation, which, so this is my vector 01:05:49.920 |
And then I can have a filter, sometimes called a kernel, which I use for my convolution. 01:05:57.800 |
And what I'm going to do is slide that down the text. 01:06:01.560 |
So I can start with it, with the first three words, and then I sort of treat them as sort 01:06:08.360 |
of elements I can dot product and sum, and then I can compute a value as to what they 01:06:13.920 |
all add up to, which is minus one, it turns out. 01:06:18.100 |
And so then I might have a bias that I add on and get an updated value if my bias is 01:06:28.080 |
And then I'd run it through a nonlinearity, and that will give me a final value. 01:06:33.400 |
And then I'll slide my filter down, and I'd work out a computation for this window of 01:06:41.120 |
three words, and take 0.5 times 3 plus 0.2 times 1, et cetera, and that comes out as 01:06:52.360 |
I put it I'm going to put it through my nonlinearity, and then I keep on sliding down, and I'll 01:06:58.880 |
do the next three words, and keep on going down. 01:07:02.880 |
And so that gives me a 1D convolution and computes a representation of the text. 01:07:10.680 |
You might have noticed in the previous example that I started here with seven words, but 01:07:17.120 |
because I wanted to have a window of three for my convolution, the end result is that 01:07:30.120 |
So commonly people will deal with that with padding. 01:07:33.960 |
So if I put padding on both sides, I can then start my three by three convolution, my three 01:07:40.560 |
sorry, not three by three, my three convolution here, and compute this one, and then slide 01:07:50.960 |
And so now my output is the same size as my real input, and so that's a convolution with padding. 01:08:00.200 |
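Here is the same 1D convolution written as a small PyTorch sketch: seven words, 4-dimensional word vectors, one filter spanning three words, and padding of one on each side so the output has the same length as the input. The dimensions and the tanh nonlinearity are illustrative choices.

```python
import torch
import torch.nn as nn

words = torch.randn(1, 4, 7)          # (batch, embedding dim, sentence length)
conv = nn.Conv1d(in_channels=4, out_channels=1, kernel_size=3, padding=1)
out = torch.tanh(conv(words))         # dot product per window + bias, then nonlinearity
print(out.shape)                      # torch.Size([1, 1, 7]): one value per word position
```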
Okay, so that was the start of things, but you know, how you get more power of the convolutional 01:08:08.080 |
network is you don't only have one filter, you have several filters. 01:08:13.720 |
So if I have three filters, each of which will have their own bias and nonlinearity, 01:08:18.560 |
I can then get a three dimensional representation coming out the end, and sort of you can think 01:08:25.680 |
of these as conceptually computing different features of your text. 01:08:32.080 |
Okay, so that gives us a kind of a new feature re-representation of our text. 01:08:40.840 |
But commonly, we then want to somehow summarize what we have. 01:08:47.920 |
And a very common way of summarizing what we have is to then do pooling. 01:08:53.960 |
So if we sort of think of these features as detecting different things in the text, so 01:09:00.600 |
you know, they might even be high level features like, you know, does this show signs of toxicity 01:09:13.200 |
So if you want to be interested in does it occur anywhere in the text, what people often 01:09:17.480 |
then do is a max pooling operation, where for each feature, they simply sort of compute 01:09:24.560 |
the maximum value it ever achieved in any position as you went through the text. 01:09:30.440 |
And say that this vector ends up as the sentence representation. 01:09:35.720 |
Sometimes for other purposes, rather than max pooling, people use average pooling, where 01:09:40.320 |
you take the averages of the different vectors to get the sentence representation. 01:09:46.160 |
In general, max pooling has been found to be more successful. 01:09:50.000 |
And that's kind of because if you think of it as feature detectors that are wanting to 01:09:54.560 |
detect was this present somewhere, then, you know, something like positive sentiment isn't 01:10:00.960 |
going to be present in every three word subsequence you choose. 01:10:13.800 |
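Continuing the sketch from above with several filters plus pooling: three filters give a 3-dimensional feature vector at each position; max pooling keeps, for each feature, the largest value it reached anywhere in the sentence, while average pooling takes the mean instead. Again, all sizes are illustrative.

```python
import torch
import torch.nn as nn

words = torch.randn(1, 4, 7)                       # (batch, embedding dim, length)
conv = nn.Conv1d(in_channels=4, out_channels=3, kernel_size=3, padding=1)
features = torch.relu(conv(words))                 # (1, 3, 7): 3 features per position

max_pooled = features.max(dim=2).values            # (1, 3) sentence representation
avg_pooled = features.mean(dim=2)                  # (1, 3) sentence representation
```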
And so that's a very quick look at convolutional neural networks. 01:10:19.760 |
Just to say this example is doing 1D convolutions with words. 01:10:25.280 |
But a very common place that convolutional neural networks are being used in natural 01:10:30.320 |
language is actually using them with characters. 01:10:34.160 |
And so what you can do is you can do convolutions over subsequences of the characters in the 01:10:43.280 |
And if you do that, this allows you to compute a representation for any sequence of characters. 01:10:50.160 |
So you don't have any problems with being out of vocabulary or anything like that. 01:10:55.640 |
Because for any sequence of characters, you just compute your convolutional representation 01:11:02.440 |
And so quite commonly, people use a character convolution to give a representation of words. 01:11:14.000 |
But otherwise, it's something that you use in addition to a word vector. 01:11:20.080 |
And so in both BIDAP and the model I'm about to show, that at the base level, it makes 01:11:25.040 |
use of both a word vector representation that we saw right at the beginning of the text 01:11:30.520 |
and a character level convolutional representation of the words. 01:11:38.680 |
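A minimal sketch of such a character-level CNN word representation is below: embed the characters of a word, convolve over character windows, max-pool over positions, and concatenate the result with the word's word vector. The vocabulary size, embedding widths, and filter count are placeholder assumptions, not the settings of BiDAF or the coreference model discussed next.

```python
import torch
import torch.nn as nn

class CharCNNWordRep(nn.Module):
    def __init__(self, n_chars=100, char_dim=16, n_filters=50, width=3, word_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=width, padding=1)
        self.out_dim = word_dim + n_filters

    def forward(self, char_ids, word_vec):
        # char_ids: (word length,) indices; word_vec: (word_dim,) pretrained word vector
        chars = self.char_emb(char_ids).t().unsqueeze(0)            # (1, char_dim, word length)
        char_rep = torch.relu(self.conv(chars)).max(dim=2).values   # pool over character positions
        # Works for any character string, so there is no out-of-vocabulary problem.
        return torch.cat([word_vec, char_rep.squeeze(0)], dim=-1)
```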
With that said, I now want to show you, before time runs out, an end-to-end neural coref model. 01:11:45.160 |
So the model I'm going to show you is Kenton Lee's one from University of Washington, 2017. 01:11:55.080 |
I'll mention the state of the art at the end. 01:11:57.980 |
But this was the first model that really said, get rid of all of that old stuff of having 01:12:07.200 |
a pipeline of separate stages. Build one big end-to-end model that does everything and returns coreference. 01:12:14.440 |
So compared to the earlier simple thing I showed, we're now going to process the text with 01:12:24.360 |
a neural network, and we're going to do all of mention detection and coreference in one step, end to end. 01:12:29.920 |
And the way it does that is by considering every span of the text up to a certain length 01:12:37.240 |
as a candidate mention, and it just figures out a representation for it and whether it's a mention. 01:12:44.720 |
So what we do at the start is we start with a sequence of words and we calculate from 01:12:50.920 |
those a standard word embedding and a character-level CNN embedding. 01:12:57.380 |
We then feed those as inputs into a bidirectional LSTM of the kind that we saw quite a lot of earlier. 01:13:07.800 |
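A tiny sketch of that step (PyTorch, hypothetical sizes): the per-word inputs, here the word vector concatenated with the character CNN vector from above, go through a bidirectional LSTM to give a contextual representation for every position:

```python
import torch
import torch.nn as nn

input_dim, hidden_dim, seq_len = 332, 200, 6     # 332 matches the sketch above
inputs = torch.randn(1, seq_len, input_dim)      # (batch, time, features)

bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
x_star, _ = bilstm(inputs)                       # (1, 6, 400): forward and backward states concatenated
```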
But then after this, what we do is we compute representations for spans. 01:13:14.840 |
So when we have a subsequence of words, we're then going to work out a representation of 01:13:21.760 |
that sequence of words, which we can then put into our coreference model. 01:13:27.980 |
So I can't fully illustrate it in this picture, but subsequences of different lengths, so 01:13:35.520 |
like "General", "General Electric", "General Electric said", will all have a span representation, 01:13:41.360 |
of which I've only shown a subset in green. 01:13:47.560 |
Well, the way they're computed is that the span representation is a vector that concatenates 01:13:55.060 |
several vectors and it consists of four parts. 01:13:59.020 |
It consists of the representation that was computed for the start of the span from the 01:14:08.220 |
LSTM, the representation for the end from the LSTM, that's over here. 01:14:16.000 |
And then it has a third part that's kind of interesting. 01:14:19.320 |
This is an attention-based representation that is calculated from the whole span, but 01:14:26.020 |
particularly sort of looks for the head of a span. 01:14:29.100 |
And then there are still a few additional features. 01:14:32.480 |
So it turns out that, you know, some of these additional things, like the length of the span and so on, are useful. 01:14:40.980 |
So to work out that attention-based part, which isn't just the beginning and the end, what's done is to calculate 01:14:49.400 |
an attention-weighted average of the word embeddings. 01:14:52.600 |
So what you're doing is you're taking the X star representation of the final word of 01:14:59.480 |
the span, and you're feeding that into a neural network to get attention scores for every 01:15:06.980 |
word in the span, which are these three, and that's giving you an attention distribution over the span. 01:15:16.300 |
And then you're calculating the third component of this as an attention-weighted sum of the word embeddings. 01:15:29.440 |
And so therefore you've got sort of a soft average of the representations of the words in the span. 01:15:39.480 |
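Here is a rough sketch of that span representation, loosely in the spirit of Lee et al. (2017) rather than an exact reproduction of the model; the attention scorer, the span-length feature, and all of the dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

lstm_dim, word_dim, feat_dim = 400, 332, 20
attn_score = nn.Linear(lstm_dim, 1)          # hypothetical scorer: one attention score per position
length_embed = nn.Embedding(30, feat_dim)    # hypothetical span-length feature

def span_representation(x_star, word_embs, start, end):
    # x_star: (seq_len, lstm_dim) BiLSTM outputs; word_embs: (seq_len, word_dim) word embeddings
    scores = attn_score(x_star[start:end + 1]).squeeze(-1)               # one score per word in the span
    alphas = torch.softmax(scores, dim=0)                                # attention distribution over the span
    head = (alphas.unsqueeze(-1) * word_embs[start:end + 1]).sum(dim=0)  # soft "head word" average
    length_feat = length_embed(torch.tensor(end - start))                # extra feature(s), e.g. span length
    # concatenate: start state, end state, attention-weighted head, extra features
    return torch.cat([x_star[start], x_star[end], head, length_feat])
```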
So then once you've got that, what you're doing is then feeding these representations 01:15:48.600 |
in to get scores for whether spans are coreferent mentions. 01:15:55.560 |
So you have a representation of the two spans, you have a score that's calculated for whether 01:16:05.880 |
two different spans look coreferent, and so overall you're getting a score for whether different spans corefer. 01:16:17.520 |
And so this model is just run end to end on all spans. 01:16:22.320 |
And that sort of would get intractable if you scored literally every span in a long document. 01:16:30.160 |
They sort of only allow spans up to a certain maximum size. 01:16:34.240 |
They only consider pairs of spans that aren't too distant from each other, et cetera, et cetera. 01:16:42.360 |
But basically it's sort of an approximation to just a complete comparison of spans. 01:16:48.160 |
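And a hedged sketch of the scoring step (again with made-up layer sizes, not the actual model's code): each span gets a score for being a mention, each pair of spans gets a compatibility score, and these are added to give the overall coreference score for the pair; in the real system, overly long spans and very distant pairs are pruned to keep this tractable:

```python
import torch
import torch.nn as nn

span_dim = 1152   # size of the span representation from the sketch above
mention_scorer = nn.Sequential(nn.Linear(span_dim, 150), nn.ReLU(), nn.Linear(150, 1))
pair_scorer = nn.Sequential(nn.Linear(3 * span_dim, 150), nn.ReLU(), nn.Linear(150, 1))

def coref_score(g_i, g_j):
    s_m_i = mention_scorer(g_i)                    # does span i look like a mention?
    s_m_j = mention_scorer(g_j)                    # does span j look like a mention?
    pair_input = torch.cat([g_i, g_j, g_i * g_j])  # include an elementwise interaction term
    s_a = pair_scorer(pair_input)                  # do the two spans look coreferent?
    return s_m_i + s_m_j + s_a                     # overall score for the pair
```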
And this turns into a very effective coreference resolution algorithm. 01:16:52.880 |
Today it's not the best coreference resolution algorithm because, maybe not surprisingly, 01:16:59.440 |
like everything else that we've been dealing with, transformer models 01:17:05.000 |
like BERT have now come along, and they produce even better results. 01:17:09.720 |
So the best coreference systems now make use of BERT. 01:17:15.640 |
In particular, when Danqi spoke, she briefly mentioned SpanBERT, which is a variant of 01:17:21.520 |
BERT which blanks out for reconstruction subsequences of words rather than just single words. 01:17:30.480 |
And SpanBERT has actually proven to be very effective for doing coreference, perhaps because coreference is precisely about representing and relating spans of text. 01:17:40.800 |
We've also gotten gains, funnily enough, by treating coreference as a question answering problem. 01:17:47.400 |
So effectively you can take a mention like "he" or "the person" and ask, what is its antecedent? 01:18:05.280 |
So if we put that together as time is running out, let me just sort of give you some sense 01:18:11.880 |
of how results come out for coreference systems. 01:18:17.200 |
So I'm skipping a bit, actually, that you can find in the slides, which is how coreference systems are scored. 01:18:24.320 |
But essentially it's scored on a clustering metric. 01:18:27.080 |
So a perfect clustering would give you 100, and something that makes no correct decisions would give you 0. 01:18:35.760 |
And so this is sort of how the coreference numbers have been panning out. 01:18:42.800 |
So back in 2010, actually, this was a Stanford system. 01:18:48.160 |
This was a state of the art system for coreference. 01:18:52.040 |
It was actually a non-machine learning model, because again, we wanted to sort of prove 01:18:57.760 |
that these rule-based methods in practice work kind of well. 01:19:02.280 |
And so its accuracy was around 55 for English, 50 for Chinese. 01:19:08.720 |
Then gradually machine learning models, at this point sort of statistical machine learning models, started to do better. 01:19:16.480 |
Wiseman was the very first neural coreference system, and that gave some gains. 01:19:22.640 |
Here's a system that Kevin Clark and I did, which gave a little bit further gains. 01:19:28.600 |
So Lee is the model that I've just shown you, the end-to-end model, and it got a bit better again. 01:19:37.560 |
But then again, what gave the huge breakthrough, just like for question answering, was the arrival of these pre-trained transformer models. 01:19:46.920 |
So once we move to here, we're now using SpanBERT, and that's giving you a considerable extra gain. 01:19:58.000 |
And the very latest best results effectively combine together a larger version 01:20:05.920 |
of SpanBERT and CorefQA, getting up to 83. 01:20:11.640 |
So you might think from that, that coref is sort of doing really well and is getting close to being solved. 01:20:22.680 |
Well it's certainly true that in neural times, the results have been getting way, way better 01:20:28.560 |
But I would caution you that these results that I just showed were on a corpus called OntoNotes, which is largely newswire. 01:20:37.520 |
And it turns out that Newswire coreference is pretty easy. 01:20:41.880 |
I mean, in particular, there's a lot of mention of the same entities, right? 01:20:46.200 |
So the newspaper articles are full of mentions of the United States and China and their leaders. 01:20:54.880 |
And it's sort of very easy to work out what they're coreferent to. 01:20:58.780 |
And so the coreference scores are fairly high. 01:21:03.920 |
Whereas if what you do is take something like a page of dialogue from a novel and feed that 01:21:10.840 |
into a system and say, okay, do the coreference correctly, you'll find pretty rapidly that 01:21:17.840 |
the performance of the models is much more modest. 01:21:21.800 |
If you'd like to try out a coreference system for yourself, there are pointers to a couple 01:21:27.760 |
of them here, where the top one is ours, Kevin Clark's neural coreference system. 01:21:36.240 |
And the other is one that goes with the Hugging Face repository that we've mentioned.