Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 13 - Coreference Resolution
Chapters
0:00 Introduction
4:25 Lecture Plan
5:58 What is Coreference Resolution
6:23 Example The Star
13:00 Example The Tree
17:20 Machine Translation
20:01 Detecting mentions
22:02 Noun phrases
23:38 How to deal with spurious mentions
24:21 Can we say that it's sunny
32:08 Coreference vs Anaphora
32:53 Complex forms of Anaphora
37:34 Context
39:37 Coreference Models
40:49 Rule-based Coreference
42:41 Hobbs Algorithm
56:17 Coreference Algorithms
00:00:14.080 |
If you're following along the syllabus really closely, we actually did a little bit of a 00:00:21.240 |
And so today it's me and I'm going to talk about coreference resolution, which is another 00:00:27.380 |
chance we get to take a deeper dive into a more linguistic topic. 00:00:31.200 |
I will also show you a couple of new things for deep learning models at the same time. 00:00:36.980 |
And then the lecture that had previously been scheduled at this point, which was going to 00:00:41.880 |
be John on explanation in neural models, is being shifted later down into week nine, I 00:00:53.120 |
So we're getting underway, just a couple of announcements on things. 00:00:57.480 |
Well, first of all, congratulations on surviving assignment five, I hope. 00:01:02.840 |
I know it was a bit of a challenge for some of you, but I hope it was a rewarding state 00:01:07.680 |
of the art learning experience on the latest in neural nets. 00:01:12.040 |
And at any rate, you know, this was a brand new assignment that we used for the first 00:01:17.320 |
So we'll really appreciate later on when we do the second survey, getting your feedback 00:01:22.240 |
We've been busy reading people's final project proposals. 00:01:29.040 |
Our goal is to get them back to you tomorrow. 00:01:32.440 |
But you know, as soon as you've had a good night's sleep after assignment five, now is 00:01:36.720 |
also a great time to get started working on your final projects, because there's just 00:01:44.600 |
And I particularly want to encourage all of you to chat to your mentor regularly, go and 00:01:49.560 |
visit office hours and keep in touch, get advice, just talking through things is a good 00:01:57.240 |
We also plan to be getting back assignment four grades later this week. 00:02:02.920 |
There's sort of the work never stops at this point. 00:02:05.320 |
So the next thing for the final project is the final project milestone. 00:02:11.360 |
So that we handed out the details of that last Friday, and it's due a week from today. 00:02:18.280 |
So the idea of this final project milestone is really to help keep you on track and keep 00:02:25.080 |
things moving towards having a successful final project. 00:02:29.040 |
So our hope is that sort of most of what you write for the final project milestone is material 00:02:34.760 |
you can also include in your final project, except for a few paragraphs of here's exactly 00:02:42.000 |
So the overall hope is that doing this in two parts and having a milestone before the 00:02:47.160 |
final thing, it's just making you make progress and be on track to having a successful final 00:02:53.760 |
Finally, the next class on Thursday is going to be Colin Raffel, and this is going to be 00:03:01.920 |
So he's going to be talking more about the very latest in large pre-trained language 00:03:06.960 |
models, both what some of their successes are, and also what some of the disconcerting, 00:03:12.680 |
not quite so good aspects of those models are. 00:03:16.000 |
So that should be a really good, interesting lecture. 00:03:19.480 |
When we had him come and talk to our NLP seminar, we had several hundred people come along for 00:03:27.680 |
And so for this talk, again, we're asking that you write a reaction paragraph following 00:03:34.320 |
the same instructions as last time about what's in this lecture. 00:03:40.240 |
And someone asked in the questions, well, what about last Thursday's? 00:03:47.320 |
So the distinction here is we're only doing the reaction paragraphs for outside guest 00:03:55.040 |
And although it was great to have Antoine Bosselut for last Thursday's lecture, he's a postdoc 00:04:02.160 |
So we don't count him as an outside guest speaker. 00:04:05.220 |
And so nothing needs to be done for that one. 00:04:07.520 |
So there are three classes for which you need to do it. 00:04:12.440 |
So there was the one before from Danqi Chen, Colin Raffel, which is Thursday, and then 00:04:19.440 |
towards the end of the course, there's Yulia Tsvetkov. 00:04:27.880 |
So in the first part of it, I'm actually going to spend a bit of time talking about what 00:04:33.000 |
coreference is, what different kinds of reference and language are. 00:04:38.320 |
And then I'm going to move on and talk about some of the kind of methods that people have 00:04:45.720 |
Now there's one bug in our course design, which was a lot of years, we've had a whole 00:04:53.960 |
lecture on doing convolutional neural nets for language applications. 00:04:58.380 |
And that slight bug appeared the other day when Danqi Chen talked about the BiDAF model, 00:05:08.460 |
because she sort of slipped in, oh, there's a character CNN representation of words, and 00:05:17.420 |
I mean, actually, for applications in coreference as well, people commonly make use of character 00:05:24.800 |
So I wanted to sort of spend a few minutes sort of doing basics of conv nets for language. 00:05:31.140 |
The sort of reality here is that given that there's no exam week this year, to give people 00:05:37.140 |
more time for final projects, we sort of shortened the content by a week this year. 00:05:42.700 |
And so you're getting a little bit less of that content. 00:05:46.940 |
Then going on from there, say some stuff about a state of the art neural coreference system, 00:05:53.140 |
and right at the end, talk about how coreference is evaluated and what some of the results 00:05:59.900 |
So first of all, what is this coreference resolution term that I've been talking about 00:06:05.660 |
So coreference resolution is meaning to find all the mentions in a piece of text that refer 00:06:23.160 |
So here's part of a short story by Shruti Rao called The Star. 00:06:28.660 |
Now I have to make a confession here, because this is an NLP class, not a literature class, 00:06:36.380 |
I crudely made some cuts to the story to be able to have relevant parts appear on my slide 00:06:43.580 |
in a decent sized font for illustrating coreference. 00:06:47.420 |
So it's not quite the full original text, but it basically is a piece of this story. 00:06:53.980 |
So what we're doing in coreference resolution is we're working out what people are mentioned. 00:07:02.580 |
So here's a mention of a person, Banarja, and here's a mention of another person, Akila. 00:07:15.340 |
And then here's Akila again, and Akila's son. 00:07:23.500 |
Then there's another son here, and then her son, and Akash. 00:07:42.740 |
And then there's a naughty child, Lord Krishna. 00:07:48.540 |
And there's some that are a bit complicated, like the lead role, is that a mention? 00:07:55.140 |
It's sort of more of a functional specification of something in the play. 00:08:05.980 |
But I mean, in general, there are noun phrases that are mentioning things in the world. 00:08:12.260 |
And so then what we want to do for coreference resolution is work out which of these mentions 00:08:20.300 |
are talking about the same real world entity. 00:08:31.260 |
And so Banarja is the same person as her there. 00:08:38.020 |
And then we could read through, she resigned herself. 00:08:48.180 |
She bought him a brown T-shirt and brown trousers. 00:09:50.600 |
And so an interesting thing here is that you can get nested syntactic structure so that 00:10:04.380 |
So that if, you know, overall we have sort of this noun phrase, Akila's son Prajwal, 00:10:10.260 |
which consists of two noun phrases in apposition. 00:10:16.220 |
And then for the noun phrase Akila's son, it sort of breaks down to itself having an 00:10:22.980 |
extra possessive noun phrase in it and then a noun so that you have Akila's and then this 00:10:32.300 |
So that you have these multiple noun phrases. 00:10:36.820 |
And so that you can then be sort of having different parts of this be one person in the 00:10:46.740 |
But this noun phrase here referring to a different person in the coreference. 00:11:02.020 |
So while there's some easy other Prajwals, right, so there's Prajwal here. 00:11:12.700 |
And then you've got some more complicated things. 00:11:15.580 |
So one of the complicated cases here is that we have they went to the same school. 00:11:23.540 |
So that they there is what gets referred to as split antecedents. 00:11:32.460 |
Because the they refers to both Prajwal and Akash. 00:11:41.520 |
And that's an interesting phenomenon that and so I could try and show that somehow I 00:11:50.380 |
And if I get a different color, Akash, we have Akash and her son. 00:11:57.140 |
And then this one sort of both of them at once. 00:12:00.580 |
So human languages have this phenomenon of split antecedents. 00:12:06.960 |
But you know, one of the things that you should notice when we start talking about algorithms 00:12:13.860 |
that people use for doing coreference resolution is that they make some simplified assumptions 00:12:20.780 |
as to how they go about treating the problem. 00:12:24.660 |
And one of the simplifications that most algorithms make is for any noun phrase like this pronoun 00:12:34.180 |
say that's trying to work out what is a coreference with. 00:12:40.760 |
And so actually most NLP algorithms for coreference resolution just cannot get split antecedents 00:12:49.300 |
Any time it occurs in the text, they guess something and they always get it wrong. 00:12:54.080 |
So that's the sort of a bit of a sad state of affairs. 00:13:08.100 |
So moving on from there, we then have this tree. 00:13:15.460 |
So well, in this context of this story, Akash is going to be the tree. 00:13:25.840 |
So you could feel that it was okay to say, well, this tree is also Akash. 00:13:35.080 |
You could also feel that that's a little bit weird and not want to do that. 00:13:39.780 |
And I mean, actually different people's coreference datasets differ in this. 00:13:47.900 |
So really that, you know, we're predicating identity relationship here between Akash and 00:13:56.560 |
So do we regard the tree as the same as Akash or not? 00:14:03.300 |
But then going ahead, we have here's Akash and she bought him. 00:14:20.820 |
So then if we don't regard the tree as the same as Akash, we have a tree here. 00:14:32.860 |
But then note that the next place over here, where we have a mention of a tree, the best 00:14:41.700 |
tree, but that's sort of really a functional description of, you know, of possible trees 00:14:56.420 |
And so it seems like that's not really coreferent. 00:14:59.900 |
But if we go on, there's definitely more mention of a tree. 00:15:06.980 |
So when she has made the tree truly the nicest tree, or well, I'm not sure. 00:15:18.420 |
And maybe this one again is a sort of a functional description that isn't referring to the tree. 00:15:27.940 |
So there's definitely and so maybe this one, though, where it's a tree is referring to 00:15:37.980 |
But what I hope to have illustrated from this is, you know, most of the time when we do 00:15:44.940 |
coreference in NLP, we just make it look sort of like the conceptual phenomenon is, you 00:15:54.260 |
know, kind of obvious that there's a mention of Sarah and then it says she and you say, 00:16:05.500 |
But if you actually start looking at real text, especially when you're looking at something 00:16:11.700 |
like this, that is a piece of literature, the kind of phenomenon you get for coreference 00:16:18.020 |
and overlapping reference, and it varies other phenomena that I'll talk about, you know, 00:16:26.860 |
And it's not, you know, there are a lot of hard cases that you actually have to think 00:16:30.500 |
about as to what things you think about as coreferent or not. 00:16:37.420 |
But basically, we do want to be able to do something with coreference because it's useful 00:16:45.100 |
for a lot of things that we'd like to do in natural language processing. 00:16:49.280 |
So for one task that we've already talked about, question answering, but equally for 00:16:53.960 |
other tasks such as summarization, information extraction, if you're doing something like 00:17:00.360 |
reading through a piece of text, and you've got a sentence like he was born in 1961, you 00:17:07.760 |
really want to know who he refers to, to know if this is a good answer to the question of, 00:17:15.400 |
you know, when was Barack Obama born or something like that. 00:17:20.980 |
It turns out also that it's useful in machine translation. 00:17:25.720 |
So in most languages, pronouns have features for gender and number, and in quite a lot 00:17:35.240 |
of languages, nouns and adjectives also show features of gender, number, and case. 00:17:43.840 |
And so when you're translating a sentence, you want to be aware of these features and 00:17:51.880 |
what is coreferent as what to be able to get the translations correct. 00:18:00.080 |
So you know, if you want to be able to work out a translation and know whether it's saying 00:18:06.600 |
Alicia likes Juan because he's smart, or Alicia likes Juan because she's smart, then you have 00:18:12.740 |
to be sensitive to coreference relationships to be able to choose the right translation. 00:18:22.720 |
When people build dialogue systems, dialogue systems also have issues of coreference a 00:18:31.160 |
So you know, if it's sort of book tickets to see James Bond and the system replies Spectre 00:18:38.440 |
is playing near you at 2 and 3 today, well, there's actually a coreference relation, sorry, 00:18:43.560 |
there's a reference relation between Spectre and James Bond because Spectre is a James 00:18:51.880 |
But then it's how many tickets would you like, two tickets for the showing at 3. 00:19:01.440 |
That 3 is then a coreference relationship back to the 3pm showing that was mentioned 00:19:12.920 |
So again, to understand these, we need to be understanding the coreference relationships. 00:19:19.900 |
So how now can you go about doing coreference? 00:19:23.940 |
So the standard traditional answer, which I'll present first, is coreference is done 00:19:32.320 |
On the first step, what we do is detect mentions in a piece of text. 00:19:40.920 |
And then in the second step, we work out how to cluster the mentions. 00:19:46.180 |
So as in my example from the Shruti Rao text, basically what you're doing with coreference 00:19:53.280 |
is you're building up these clusters, sets of mentions that refer to the same entity 00:20:01.760 |
So if we explore a little how we could do that as a two-step solution, the first part 00:20:10.020 |
And so pretty much there are three kinds of things, different kinds of noun phrases that 00:20:21.060 |
There are pronouns like I, your, it, she, him, and also some demonstrative pronouns 00:20:29.440 |
There are explicitly named things, so things like Paris, Joe Biden, Nike. 00:20:35.720 |
And then there are plain noun phrases that describe things. 00:20:40.320 |
So a dog, the big fluffy cat stuck in the tree. 00:20:43.800 |
And so all of these are things that we'd like to identify as mentions. 00:20:49.440 |
And the straightforward way to identify these mentions is to use natural language processing 00:20:56.680 |
tools, several of which we've talked about already. 00:21:01.080 |
So to work out pronouns, we can use what's called a part of speech tagger, 00:21:12.760 |
which we haven't really explicitly talked about, 00:21:18.200 |
but which you used when you built dependency parsers. 00:21:21.880 |
So it first of all assigns parts of speech to each word so we can just find the words 00:21:30.680 |
For named entities, we did talk just a little bit about named entity recognizers as a use 00:21:36.120 |
of sequence models for neural networks so we can pick out things like person names and 00:21:44.040 |
And then for the ones like the big fluffy, a big fluffy dog, we could then be sort of 00:21:50.400 |
picking out from syntactic structure noun phrases and regarding them as descriptions 00:21:58.020 |
So that we could use all of these tools and those would give us basically our mentions. 00:22:03.600 |
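To make this pipeline view concrete, here is a minimal sketch of mention detection in Python. It assumes spaCy's off-the-shelf English model (en_core_web_sm) purely as an illustration; real coreference pipelines use their own taggers, parsers, and NER models and do more careful span merging.

```python
import spacy

# A pipeline-style mention detector: POS tags for pronouns, NER for named
# entities, and the parser's noun chunks for descriptive noun phrases.
nlp = spacy.load("en_core_web_sm")

def detect_mentions(text):
    doc = nlp(text)
    mentions = []
    mentions += [tok.text for tok in doc if tok.pos_ == "PRON"]   # pronouns
    mentions += [ent.text for ent in doc.ents]                    # named entities
    mentions += [np.text for np in doc.noun_chunks]               # plain noun phrases
    # Overlapping or duplicate candidates would normally be merged or filtered here.
    return mentions

print(detect_mentions("The big fluffy cat stuck in the tree saw Joe Biden. It meowed."))
```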
It's a little bit more subtle than that, because it turns out there are some noun phrases and 00:22:11.440 |
things of all of those kinds which don't actually refer so that they're not referential in the 00:22:18.880 |
So when you say it is sunny, it doesn't really refer. 00:22:22.800 |
When you make universal claims like every student, well, every student isn't referring 00:22:31.680 |
And more dramatically, when you have no student and make a negative universal claim, it's 00:22:39.120 |
There are also things that you can describe functionally, which don't have any clear reference. 00:22:47.280 |
So if I say the best donut in the world, that's a functional claim, but it doesn't necessarily 00:22:55.440 |
Like if I've established that I think a particular kind of donut is the best donut in the world, 00:23:03.020 |
I could then say to you, I ate the best donut in the world yesterday. 00:23:10.000 |
And you know what I mean, it might have reference. 00:23:12.700 |
But if I say something like I'm going around to all the donut stores trying to find the 00:23:17.220 |
best donut in the world, then it doesn't have any reference yet. 00:23:20.860 |
It's just a sort of a functional description I'm trying to satisfy. 00:23:24.980 |
You also then have things like quantities, 100 miles. 00:23:29.560 |
It's that quantity that is not really something that has any particular reference. 00:23:33.920 |
You can mark out 100 miles, all sorts of places. 00:23:38.880 |
So how do we deal with those things that aren't really mentions? 00:23:44.040 |
Well one way is we could train a machine learning classifier to get rid of those spurious mentions. 00:23:54.680 |
Most commonly if you're using this kind of pipeline model where you use a parser and 00:24:01.400 |
a named entity recognizer, you regard everything as you've found as a candidate mention, and 00:24:10.680 |
And some of them, like those ones, hopefully aren't made coref with anything else. 00:24:16.280 |
And so then you just discard them at the end of the process. 00:24:22.440 |
>> I've got an interesting question that linguistics bears on this. 00:24:40.280 |
So people have actually tried to suggest that when you say it is sunny, it means the weather 00:24:51.680 |
But I guess the majority opinion at least is that isn't plausible. 00:24:59.180 |
And I mean, for I guess many of you aren't native speakers of English, but similar phenomena 00:25:08.740 |
I mean, it just intuitively doesn't seem plausible when you say it's sunny or it's raining today 00:25:17.460 |
that you're really saying that as a shortcut for the weather is raining today. 00:25:23.480 |
It just seems like really what the case is, is English likes to have something filling 00:25:30.620 |
And when there's nothing better to fill the subject position, you stick it in there and 00:25:39.360 |
And so in general, it's believed that you get this phenomenon of having these empty 00:25:46.900 |
I mean, another place in which it seems like you clearly get dummy its is that when you 00:25:52.740 |
have clauses that are subjects of a verb, you can move them to the end of the sentence. 00:25:59.540 |
So if you have a sentence where you put a clause in the subject position, they normally 00:26:07.580 |
So you have a sentence something like "That CS224N is a lot of work is known by all students." 00:26:17.660 |
People don't normally say that; the normal thing to do is to shift the clause to the end. 00:26:22.660 |
But when you do that, you stick in a dummy it to fill the subject position. 00:26:27.020 |
So you then have it is known by all students that CS224N is a lot of work. 00:26:33.660 |
So that's the general feeling that this is a dummy it that doesn't have any reference. 00:26:45.540 |
But if someone asks, among other things, "How is the weather?" 00:26:56.820 |
and you answer "It is sunny," it then does seem like the it is in reference to the weather. 00:27:04.260 |
Well, you know, I guess this is what our coreference systems are built trying to do in situations 00:27:10.660 |
like that, they're making a decision of coreference or not. 00:27:14.180 |
And I guess what you'd want to say in that case is, it seems reasonable to regard this 00:27:18.260 |
one as coreference that weather that did appear before it. 00:27:23.020 |
I mean, but that also indicates another reason to think that in the normal case is not coreference, 00:27:30.980 |
Because normally, pronouns are only used when their reference is established that you've 00:27:35.700 |
referred to now like, John is answering questions, and then you can say, he types really quickly, 00:27:44.940 |
sort of seem odd to just sort of start the conversation by he types really quickly, because 00:27:50.620 |
it doesn't have any established reference, whereas that doesn't seem to be the case, 00:27:54.940 |
it seems like you can just sort of start a conversation by saying it's raining really 00:28:07.180 |
So I've sort of there presented the traditional picture. 00:28:13.540 |
But you know, this traditional picture doesn't mean something that was done last millennium 00:28:19.820 |
I mean, essentially, that was the picture until about 2016. 00:28:28.740 |
That essentially, every coreference system that was built, use tools like part of speech 00:28:34.440 |
taggers, NER systems, and parsers to analyze sentences, to identify mentions, and to give 00:28:46.940 |
But more recently, in our neural systems, people have moved to avoiding traditional 00:28:53.620 |
pipeline systems and doing one shot end to end coreference resolution systems. 00:29:02.340 |
So if I skip directly to the second bullet, there's a new generation of neural systems 00:29:09.300 |
where you just start with your sequence of words, and you do the maximally dumb thing, 00:29:15.520 |
you just say, let's take all spans, commonly with some heuristics for efficiency, but you 00:29:21.700 |
know, conceptually, all subsequences of this sentence, they might be mentions, let's feed 00:29:28.060 |
them into a neural network, which will simultaneously do mention detection and coreference resolution 00:29:37.580 |
And I'll give an example of that kind of system later in the lecture. 00:29:41.900 |
Okay, is everything good to there and I should go on? 00:29:51.660 |
So I'm going to get on to how to do coreference resolution systems. 00:29:58.340 |
But before I do that, I do actually want to show a little bit more of the linguistics 00:30:03.820 |
of coreference, because there are actually a few more interesting things to understand 00:30:10.820 |
I mean, when we say coreference resolution, we really confuse together two linguistic 00:30:23.380 |
And so it's really actually good to understand the difference between these things. 00:30:30.020 |
One is that you can have mentions, which are essentially standalone, but happen to refer 00:30:41.140 |
So if I have a piece of text that said, Barack Obama traveled yesterday to Nebraska, Obama 00:30:48.940 |
was there to open a new meat processing plant or something like that. 00:30:54.060 |
I've mentioned with Barack Obama and Obama, there are two mentions there, they refer to 00:30:59.660 |
the same person in the world, they are coreferent. 00:31:05.100 |
But there's a different related linguistic concept called anaphora. 00:31:09.820 |
And anaphora is when you have a textual dependence of an anaphor on another term, which is the antecedent. 00:31:17.740 |
And in this case, the meaning of the anaphor is determined by the antecedent in a textual 00:31:29.700 |
So when it's Barack Obama said he would sign the bill, he is an anaphor. 00:31:35.700 |
It's not a word that independently we can work out what its meaning is in the world, 00:31:40.820 |
apart from knowing the vaguest feature that it's referring to something probably male. 00:31:48.300 |
But in the context of this text, we have that this anaphor is textually dependent on Barack 00:31:57.540 |
And so then we have an anaphoric relationship, which sort of means they refer to the same 00:32:04.820 |
And so therefore, you can say they're coreferent. 00:32:11.160 |
So for coreference, we have these separate textual mentions, which are basically standalone, 00:32:20.940 |
Whereas in anaphora, we actually have a textual relationship. 00:32:25.860 |
And you essentially have to use pronouns like he and she in legitimate ways in which the 00:32:33.940 |
hearer can reconstruct the relationship from the text, because they can't work out what 00:32:49.540 |
But it's actually a little bit more to realize, because there are more complex forms of anaphora, 00:32:56.740 |
which aren't coreference, because you have a textual dependence, but it's not actually 00:33:07.140 |
And so this comes back to things like these quantifier noun phrases that don't have reference. 00:33:13.980 |
So when you have sentences like these ones, every dancer twisted her knee, well, this 00:33:20.780 |
her here has an anaphoric dependency on every dancer, or even more clearly with no dancer 00:33:28.900 |
twisted her knee, the her here has an anaphoric dependence on no dancer. 00:33:35.500 |
But for no dancer twisted her knee, no dancer isn't referential. 00:33:43.980 |
And so there's no coreferential relationship, because there's no reference relationship, 00:33:49.660 |
but there's still an anaphoric relationship between these two noun phrases. 00:33:57.420 |
And then you have this other complex case that turns up quite a bit, where you can have 00:34:03.500 |
where the things being talked about do have reference, but an anaphoric relationship is 00:34:13.380 |
So you commonly get constructions like this one. 00:34:18.660 |
We went to a concert last night, the tickets were really expensive. 00:34:23.180 |
Well, the concert and the tickets are two different things. 00:34:32.220 |
But in interpreting this sentence, what this really means is the tickets to the concert, 00:34:42.220 |
And so there's sort of this hidden, not said dependence where this is referring back to 00:34:49.660 |
And so what we say is that these, the tickets does have an anaphoric dependence on the concert, 00:34:59.500 |
And so that's referred to as bridging anaphora. 00:35:02.540 |
And so overall, there's the simple case and the common case, which is pronominal anaphora, 00:35:13.220 |
You then have other cases of coreference, such as mentions of the United States: 00:35:21.060 |
every mention of the United States is coreferential with every other mention of 00:35:25.940 |
the United States, but those don't have any textual dependence on each other. 00:35:31.180 |
And then you have textual dependencies like bridging anaphora, which aren't coreference. 00:35:37.780 |
That's probably about as, now I was going to say that's probably as much linguistics 00:35:43.300 |
as you wanted to hear, but actually I have one more point of linguistics. 00:35:49.440 |
One or two of you, but probably not many, might've been troubled by the fact that the 00:35:57.380 |
term anaphora as a classical term means that you are looking backward for your antecedent, 00:36:06.660 |
that the "ana" part of anaphora means that you're looking backward for your antecedent. 00:36:12.700 |
And in sort of classical terminology, you have both anaphora and cataphora, and it's 00:36:20.940 |
cataphora where you look forward for your antecedent. 00:36:25.700 |
Cataphora isn't that common, but it does occur. 00:36:33.740 |
From the corner of the divan of Persian saddlebags on which he was lying, smoking as was his 00:36:39.420 |
custom innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey 00:36:47.300 |
sweet and honey colored blossoms of a laburnum. 00:36:51.780 |
So in this example here, the he and then this his are actually referring to Lord Henry Wotton. 00:37:07.620 |
But in modern linguistics, even though most reference to pronouns is backwards, we don't 00:37:23.460 |
And so the term anaphor and anaphora is used for textual dependence, regardless of whether 00:37:31.660 |
A lot of details there, but taking stock of this. 00:37:37.140 |
So the basic observation is language is interpreted in context, that in general, you can't work 00:37:45.780 |
out the meaning or reference of things without looking at the context of the linguistic utterance. 00:37:57.260 |
So for something like word sense disambiguation, if you see just the words, the bank, you don't 00:38:05.380 |
And you need to look at a context to get some sense as to whether it means a financial institution 00:38:10.660 |
or the bank of a river or something like that. 00:38:13.860 |
And so anaphora and coreference give us additional examples where you need to be doing contextual 00:38:23.860 |
So when you see a pronoun, you need to be looking at the context to see what it refers 00:38:31.240 |
And so if you think about text understanding as a human being does it, reading a story 00:38:36.980 |
or an article, that we progress through the article from beginning to end. 00:38:42.280 |
And as we do it, we build up a pretty complex discourse model in which new entities are 00:38:49.220 |
introduced by mentions and then they're referred back to and relationships between them are 00:38:54.900 |
established and they take actions and things like that. 00:38:58.540 |
And it sort of seems like in our head that we sort of build up a kind of a complex graph 00:39:03.300 |
like discourse representation of a piece of text with all these relationships. 00:39:09.180 |
And so part of that is these anaphoric relationships and coreference that we're talking about here. 00:39:15.060 |
And indeed in terms of CS224N, the only kind of whole discourse meaning that we're going 00:39:22.220 |
to look at is looking a bit at anaphora and coreference. 00:39:26.700 |
But if you want to see more about higher level natural language understanding, you can get 00:39:37.100 |
So I want to tell you a bit about several different ways of doing coreference. 00:39:45.620 |
So broadly there are four different kinds of coreference models. 00:39:51.200 |
So the traditional old way of doing it was rule-based systems. 00:39:55.700 |
And this isn't the topic of this class and this is pretty archaic at this point. 00:40:04.400 |
But I wanted to say a little bit about it because it's actually kind of interesting 00:40:09.380 |
as sort of food for thought as to how far along we are or aren't in solving artificial 00:40:16.220 |
intelligence and really being able to understand texts. 00:40:20.740 |
Then there are sort of classic machine learning methods of doing it, which you can sort of 00:40:25.340 |
divide up as mention pair methods, mention ranking methods, and really clustering methods. 00:40:31.900 |
And I'm sort of going to skip the clustering methods today because most of the work, especially 00:40:36.400 |
most of the recent work, implicitly makes clusters by using either mention pair or 00:40:43.720 |
And so I'm going to talk about a couple of neural methods for doing that. 00:40:49.800 |
But first of all, let me just tell you a little bit about rule-based coreference. 00:40:55.380 |
So there's a famous historical algorithm in NLP for doing pronominal anaphora resolution, 00:41:05.300 |
which is referred to as the Hobbs algorithm. 00:41:09.420 |
So everyone just refers to it as the Hobbs algorithm. 00:41:12.100 |
And if you sort of look up a textbook like Jurafsky and Martin's textbook, it's referred 00:41:19.660 |
But actually, if you go back to Jerry Hobbs, that's his picture over there in the corner, 00:41:24.620 |
if you actually go back to his original paper, he refers to it as the naive algorithm. 00:41:32.460 |
And his naive algorithm for pronoun coreference was this sort of intricate handwritten set 00:41:43.220 |
So this is the start of the set of the rules, but there are more rules or more clauses of 00:41:53.540 |
And this looks like a hot mess, but the funny thing was that this set of rules for determining 00:42:04.140 |
And so in the sort of 1990s and 2000s decade, even when people were using machine learning 00:42:12.300 |
based systems for doing coreference, they'd hide into those machine learning based systems 00:42:18.460 |
that one of their features was the Hobbs algorithm and that the predictions it made 00:42:24.220 |
with a certain weight was then a feature in making your final decisions. 00:42:29.580 |
And it's only really in the last five years that people have moved away from using the Hobbs algorithm. 00:42:35.660 |
Let me give you a little bit of a sense of how it works. 00:42:43.740 |
This is an example from a Guardian book review. 00:42:46.740 |
Niall Ferguson is prolific, well-paid and a snappy dresser. 00:42:53.220 |
So what the Hobbs algorithm does is we start with a pronoun. 00:43:01.100 |
We start with a pronoun and then it says step one, go to the NP that's immediately dominating the pronoun. 00:43:09.180 |
And then it says go up to the first NP or S, call this X and the path P. 00:43:18.760 |
Then it says traverse all branches below X to the left of P, left to right, breadth first. 00:43:26.080 |
So then it's saying to go left to right for other branches below breadth first. 00:43:34.940 |
So we're going down and left to right and look for an NP. 00:43:44.220 |
But then we have to read more carefully and say propose as antecedent any NP that has 00:43:55.940 |
Well, this NP here has no NP or S between it and X. 00:44:06.540 |
So this is all very, you know, complex and handwritten, but basically he sort of fit 00:44:13.180 |
into the clauses of this kind of a lot of facts about how the grammar of English works. 00:44:20.260 |
And so what this is capturing is if you imagine a different sentence, you know, if you imagine 00:44:25.900 |
the sentence, Stephen Moss's brother hated him. 00:44:35.820 |
Well then Stephen Moss would naturally be coreferent with him. 00:44:40.980 |
And in that case, well, precisely what you'd have is the noun phrase with, well, the noun 00:44:50.700 |
phrase brother, and you'd have another noun phrase inside it for the Stephen Moss. 00:45:02.340 |
So in the case of Stephen Moss's brother, when you looked at this noun phrase, there 00:45:08.140 |
would be an intervening noun phrase before you got to the node X. 00:45:14.060 |
And therefore Stephen Moss is a possible and in fact, good antecedent of him. 00:45:23.500 |
And the algorithm would choose Stephen Moss, but the algorithm correctly captures that 00:45:29.020 |
when you have the sentence, Stephen Moss hated him, that him cannot refer to Stephen Moss. 00:45:35.980 |
So having worked that out, it then says if X is the highest S in the sentence, okay, 00:45:42.660 |
so my X here is definitely the highest S in the sentence because I've got the whole sentence. 00:45:48.300 |
What you should do is then traverse the parse trees of previous sentences in the order of recency. 00:45:56.620 |
So what I should do now is sort of work backwards in the text, one sentence at a time, going 00:46:07.800 |
And then for each tree, traverse each tree left to right, breadth first. 00:46:13.100 |
So then within each tree, I'm doing the same of going breadth first. 00:46:18.680 |
So sort of working down and then going left to right with an equal breadth. 00:46:24.200 |
And so hidden inside these clauses, it's capturing a lot of the facts of how coreference typically 00:46:33.760 |
So what you find in English, I'll say, but in general, this is true of lots of languages, 00:46:42.440 |
is that there are general preferences and tendencies for coreference. 00:46:46.600 |
So a lot of the time, a pronoun will be coreferent with something in the same sentence, like 00:46:52.400 |
Stephen Moss's brother hated him, but it can't be if it's too close to it. 00:46:57.480 |
So you can't say Stephen Moss hated him and have the him be Stephen Moss. 00:47:02.220 |
And if you're then looking for coreference that's further away, the thing it's coreferent 00:47:10.920 |
And so that's why you work backwards through sentences one by one. 00:47:15.440 |
But then once you're looking within a particular sentence, the most likely thing it's going 00:47:20.680 |
to be coreferent to is a topical noun phrase and default topics in English are subjects. 00:47:28.880 |
So by doing things breadth first, left to right, a preferred antecedent is then a subject. 00:47:36.520 |
And so this algorithm, I won't go through all the complex clauses five through nine, 00:47:41.440 |
ends up saying, okay, what you should do is propose Niall Ferguson as what is coreferent 00:47:48.200 |
to him, which is the obvious correct reading in this example. 00:47:56.340 |
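To give a feel for the traversal order being described, here is a simplified sketch in Python using nltk's Tree. It only shows the left-to-right, breadth-first search for NP nodes that the steps above rely on; it deliberately omits the "intervening NP or S" condition, the back-off to earlier sentences, and the later clauses, so it is not the full Hobbs algorithm.

```python
from collections import deque
from nltk import Tree

def candidate_nps_breadth_first(node):
    """Yield NP subtrees below `node`, breadth first, left to right."""
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in current:
            if isinstance(child, Tree):
                if child.label() == "NP":
                    yield child          # subjects tend to be found first
                queue.append(child)

sent = Tree.fromstring(
    "(S (NP (NNP Niall) (NNP Ferguson)) (VP (VBZ is) (ADJP (JJ prolific))))"
)
for np in candidate_nps_breadth_first(sent):
    print(" ".join(np.leaves()))
```

Because the search is breadth first and left to right, topical subject noun phrases like "Niall Ferguson" surface before more deeply embedded ones, which is exactly the preference the lecture describes.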
And in some sense, the details of that aren't interesting. 00:48:00.960 |
But what is I think actually still interesting in 2021 is what points Jerry Hobbs was actually 00:48:15.240 |
And the point he was trying to make was the following. 00:48:19.580 |
So Jerry Hobbs wrote this algorithm, the naive algorithm, because what he said was, 00:48:28.040 |
well, look, if you want to try and crudely determine coreference, well, there are these 00:48:42.280 |
There's a preference for topical things like subject. 00:48:45.400 |
And there are things where, you know, if it has gender, it has to agree in gender. 00:48:49.860 |
So there are sort of strong constraints of that sort. 00:48:54.300 |
So I can write an algorithm using my linguistic nous, which captures all the main preferences. 00:49:06.180 |
Doing that is a pretty strong baseline system. 00:49:10.300 |
But what Jerry Hobbs wanted to argue is that this algorithm just isn't something you should 00:49:23.420 |
This is just sort of, you know, making a best guess according to the preferences of what's 00:49:31.600 |
most likely without actually understanding what's going on in the text at all. 00:49:37.160 |
And so actually, what Jerry Hobbs wanted to argue was, the so-called Hobbs algorithm 00:49:43.720 |
now, he wasn't a fan of the Hobbs algorithm. 00:49:46.480 |
He was wanting to argue that the Hobbs algorithm is completely inadequate as a solution to 00:49:53.080 |
And the only way we'll actually make progress in natural language understanding is by building 00:49:57.760 |
systems that actually really understand the text. 00:50:02.580 |
And this is actually something that has come to the fore again more recently. 00:50:09.580 |
So the suggestion is that in general, you can't work out coreference or pronominal 00:50:17.020 |
anaphora in particular unless you're really understanding the meaning of the text. 00:50:21.860 |
And people look at pairs of examples like these ones. 00:50:25.300 |
So she poured water from the pitcher into the cup until it was full. 00:50:30.520 |
So think for just half a moment, well, what is it in that example that is full? 00:50:43.320 |
But then if I say she poured water from the pitcher into the cup until it was empty, well, 00:50:52.740 |
And the point that is being made with this example is the only thing that's been changed 00:50:59.640 |
in these examples is the adjective right here. 00:51:05.240 |
So these two examples have exactly the same grammatical structure. 00:51:11.640 |
So in terms of the Hobbs naive algorithm, the Hobbs naive algorithm necessarily has 00:51:19.160 |
to predict the same answer for both of these. 00:51:24.480 |
You just cannot determine the correct pronoun antecedent based on grammatical preferences 00:51:31.200 |
of the kind that are used in the naive algorithm. 00:51:34.520 |
You actually have to conceptually understand about pitchers and cups and water and full 00:51:40.040 |
and empty to be able to choose the right antecedent. 00:51:45.080 |
Here's another famous example that goes along the same lines. 00:51:49.880 |
So Terry Winograd, shown here as a young man. 00:51:53.920 |
So long, long ago, Terry Winograd came to Stanford as the natural language processing 00:51:59.240 |
faculty and Terry Winograd became disillusioned with the symbolic AI of those days and just 00:52:10.360 |
And he reinvented himself as being an HCI person. 00:52:14.080 |
And so Terry was then essentially the person who established the HCI program at Stanford. 00:52:20.720 |
But before he lost faith in symbolic AI, he talked about the coreference problem and pointed 00:52:32.320 |
So we have the city council refused the women a permit because they feared violence versus 00:52:39.440 |
the city council refused the women a permit because they advocated violence. 00:52:44.560 |
So again, you have this situation where these two sentences have identical syntactic structure 00:52:50.800 |
and they differ only in the choice of verb here. 00:52:54.200 |
But once you add knowledge, common sense knowledge of how the human world works, well, how this 00:53:03.680 |
should pretty obviously be interpreted that in the first one that they is referring to 00:53:10.280 |
the city council, whereas in the second one that they is referring to the women. 00:53:17.640 |
And so coming off of that example of Terry, these have been referred to as Winograd schemas. 00:53:26.120 |
So Winograd schema challenges sort of choosing the right reference here. 00:53:31.920 |
And so it's basically just doing pronominal anaphora resolution. 00:53:35.800 |
But the interesting thing is people have been interested in what are tests of general intelligence 00:53:43.080 |
and one famous general test of intelligence, which I won't talk about now, is the Turing test. 00:53:49.040 |
And there's been a lot of debate about problems with the Turing test and is it good? 00:53:52.920 |
And so in particular, Hector Levesque, who's a very well-known senior AI person, he actually 00:54:00.320 |
proposed that a better alternative to the Turing test might be to do what he then dubbed the Winograd Schema Challenge. 00:54:08.280 |
And Winograd schema is just solving pronominal coreference in cases like this where you have 00:54:13.880 |
to have knowledge about the situation in the world to get the answer right. 00:54:18.280 |
And so he's basically arguing that, you know, you can view really solving coreference 00:54:27.140 |
And that's sort of the position that Hobbs wanted to advocate. 00:54:33.080 |
So what he actually said about his algorithm was that the naive approach is quite good. 00:54:38.800 |
Computationally speaking, it will be a long time before a semantically based algorithm 00:54:46.320 |
And these results set a very high standard for any other approach to aim for. 00:54:50.480 |
And he was proven right about that because it sort of really took until around 2015 00:54:55.400 |
before people thought they could do without the Hobbes algorithm. 00:54:59.400 |
But then he notes, yet there is every reason to pursue a semantically based approach. 00:55:09.640 |
In these cases, it not only fails, it gives no indication that it has failed and offers 00:55:18.720 |
And so I think this is actually still interesting stuff to think about because, you know, really 00:55:23.480 |
for the kind of machine learning based coreference systems that we are building, you know, they're 00:55:29.320 |
not a hot mess of rules like the Hobbes algorithm, but basically they're still sort of working 00:55:36.720 |
out statistical preferences of what patterns are most likely and choosing the antecedent 00:55:46.800 |
They really have exactly the same deficiencies still that Hobbs was talking about, right? 00:56:00.600 |
The algorithms give you no idea when they fail. 00:56:03.880 |
They're not really understanding the text in a way that a human does to determine the 00:56:09.520 |
So we still actually have a lot more work to do before we're really doing full artificial 00:56:16.920 |
But I'd best get on now and actually tell you a bit about some coreference algorithms. 00:56:23.320 |
So the simple way of thinking about coreference is to say that you're making just a binary 00:56:34.680 |
So if you have your mentions, you can then say, well, I've come to my next mention, she, 00:56:43.760 |
I want to work out what it's coreferent with. 00:56:47.480 |
And I can just look at all of the mentions that came before it and say, is it coreferent 00:56:56.080 |
So at training time, I'll be able to say I have positive examples, assuming I've got 00:57:01.120 |
some data labeled for what's coreferent to what, as to these ones are coreferent. 00:57:06.260 |
And I've got some negative examples of these ones are not coreferent. 00:57:11.440 |
And what I want to do is build a model that learns to predict coreferent things. 00:57:16.760 |
And I can do that fairly straightforwardly in the kind of ways that we have talked about. 00:57:22.360 |
So I train with the regular kind of cross entropy loss, where I'm now summing over every 00:57:30.760 |
pairwise binary decision as to whether two mentions are coreferent to each other or not. 00:57:38.460 |
And so then when I'm at test time, what I want to do is cluster the mentions that correspond 00:57:46.420 |
And I do that by making use of my pairwise scorer. 00:57:50.760 |
So I can run my pairwise scorer, and it will give a probability or a score that any two 00:58:00.980 |
So by picking some threshold, like 0.5, I can add coreference links for when the classifier 00:58:11.320 |
And then I do one more step to give me a clustering. 00:58:15.080 |
I then say, okay, let's also make the transitive closure to give me clusters. 00:58:21.200 |
So it thought that I and she were coreferent and my and she were coreferent. 00:58:26.720 |
Therefore, I also have to regard I and my as coreferent. 00:58:32.440 |
And so that's sort of the completion by transitivity. 00:58:36.300 |
And so since we always complete by transitivity, note that this algorithm is very sensitive 00:58:47.160 |
Because if you make one mistake, for example, you say that he and my are coreferent, then 00:58:53.440 |
by transitivity, all of the mentions in the sentence become one big cluster and that they're 00:59:03.940 |
So that's a workable algorithm and people have often used it. 00:59:08.200 |
But often people go a little bit beyond that and prefer a mention ranking model. 00:59:15.320 |
So let me just explain the advantages of that. 00:59:18.520 |
That normally, if you have a long document where it's Ralph Nader and he did this and 00:59:24.120 |
some of them did something to him and we visited his house and blah, blah, blah, blah. 00:59:28.720 |
And then somebody voted for Nader because he. 00:59:33.400 |
In terms of building a coreference classifier, it seems like it's easy and reasonable to 00:59:43.000 |
be able to recover that this he refers to Nader. 00:59:47.860 |
But in terms of building a classifier for it to recognize that this he should be referring 00:59:54.500 |
to this Nader, which might be three paragraphs back, seems kind of unreasonable how you're 01:00:01.900 |
So those faraway ones might be almost impossible to get correct. 01:00:06.400 |
And so that suggests that maybe we should have a different way of configuring this task. 01:00:13.460 |
So instead of doing it that way, what we should say is, well, this he here has various possible 01:00:21.700 |
antecedents and our job is to just choose one of them. 01:00:26.720 |
And that's almost sufficient apart from we need to add one more choice, which is, well, 01:00:35.900 |
some mentions won't be coreferent with anything that proceeds because we're introducing a 01:00:43.680 |
So we can add one more dummy mention, the N/A mention. 01:00:49.920 |
So it doesn't refer to anything previously in the discourse. 01:00:55.220 |
And then our job at each point is to do mention ranking to choose which one of these she refers 01:01:02.780 |
And then at that point, rather than doing binary yes/no classifiers, that what we can 01:01:09.040 |
do is say, aha, this is choose one classification and then we can use the kind of softmax classifiers 01:01:22.020 |
So that gets us in business for building systems. 01:01:25.760 |
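To make the mention-ranking idea concrete, here is a small PyTorch sketch of the decision for one anaphor: score every earlier mention plus a dummy "no antecedent" choice, and train with an ordinary cross-entropy loss over that softmax. The `score_pair` function and the convention of fixing the dummy score to zero are illustrative assumptions, not a specific published system.

```python
import torch
import torch.nn.functional as F

def ranking_loss(anaphor, candidates, gold_index, score_pair):
    # Slot 0 is the dummy "no antecedent" choice; its score is often just fixed to 0.
    scores = [torch.zeros(())]
    scores += [score_pair(anaphor, c) for c in candidates]   # one score per earlier mention
    scores = torch.stack(scores).unsqueeze(0)                # shape (1, num_candidates + 1)
    # gold_index = 0 if the mention starts a new entity, else 1 + antecedent position.
    return F.cross_entropy(scores, torch.tensor([gold_index]))
```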
And for either of these kind of models, there are several ways in which we can build the 01:01:32.820 |
We could use any kind of traditional machine learning classifier. 01:01:40.100 |
We can use more advanced ones with all of the tools that we've been learning about more 01:01:45.340 |
So let me just quickly show you a simple neural network way of doing it. 01:01:51.540 |
So this is a model that my PhD student, Kevin Clark, did in 2015. 01:02:00.820 |
But what he was doing was doing coreference resolution based on the mentions with a simple 01:02:07.120 |
feedforward neural network, kind of in some sense like we did dependency parsing with 01:02:14.220 |
So for the mention, it had word embeddings, antecedent had word embeddings. 01:02:22.580 |
There were some additional features of each of the mention and candidate antecedent. 01:02:27.660 |
And then there were some final additional features that captured things like distance 01:02:31.660 |
away, which you can't see from either the mention or the candidate. 01:02:36.560 |
And all of those features were just fed into several feedforward layers of a neural network. 01:02:43.180 |
And it gave you a score of whether these things are coreferent or not. 01:02:58.540 |
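A sketch of that kind of feedforward pair scorer in PyTorch is below: concatenate the mention features, the candidate antecedent features, and the pair features (distance and so on), then pass them through a few ReLU layers to a single score. The layer sizes here are placeholders, not the ones from Clark's model.

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    def __init__(self, mention_dim=500, pair_feat_dim=20, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * mention_dim + pair_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),          # single "are these coreferent?" score
        )

    def forward(self, mention_feats, antecedent_feats, pair_feats):
        x = torch.cat([mention_feats, antecedent_feats, pair_feats], dim=-1)
        return self.net(x).squeeze(-1)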
But what I do want to show is sort of a more advanced and modern neural coreference system. 01:03:06.340 |
But before I do that, I want to take a digression and sort of say a few words about convolutional neural networks. 01:03:16.380 |
So the idea of when you apply a convolutional neural network to language, i.e. to sequences, 01:03:25.900 |
is that what you're going to do is you're going to compute vectors, features effectively, 01:03:32.700 |
for every possible word subsequence of a certain length. 01:03:36.960 |
So that if you have a piece of text like tentative deal reached to keep government open, you 01:03:42.260 |
might say I'm going to take every three words of that, i.e. tentative deal reached, deal 01:03:48.300 |
reached to, reached to keep, and I'm going to compute a vector based on that subsequence 01:03:55.860 |
of words and use those computed vectors in my model by somehow grouping them together. 01:04:04.620 |
So the canonical case of convolutional neural networks is in vision. 01:04:11.420 |
And so if after this next quarter you go along to CS231N, you'll be able to spend weeks doing 01:04:22.500 |
And so the idea there is that you've got these convolutional filters that you sort of slide 01:04:30.580 |
over an image and you compute a function of each place. 01:04:36.120 |
So the sort of little red numbers are showing you what you're computing, but then you'll 01:04:41.000 |
slide it over to the next position and fill in this cell, and then you'll slide it over 01:04:46.480 |
the next position and fill in this cell, and then you'll slide it down and fill in this 01:04:53.480 |
And so you've got this sort of little function of a patch, which you're sliding over your 01:04:58.800 |
image and computing a convolution, which is just a dot product effectively, that you're 01:05:06.000 |
then using to get an extra layer of representation. 01:05:10.280 |
And so by sliding things over, you can pick out features and you've got a sort of a feature 01:05:16.000 |
identifier that runs across every piece of the image. 01:05:21.080 |
Well, for language, we've just got a sequence, but you can do basically the same thing. 01:05:28.360 |
And what you then have is a 1D convolution for text. 01:05:32.640 |
So if here's my sentence, tentative deal, reach to keep the government open, that what 01:05:38.120 |
I can do is have, so these words have a word representation, which, so this is my vector 01:05:49.920 |
And then I can have a filter, sometimes called a kernel, which I use for my convolution. 01:05:57.800 |
And what I'm going to do is slide that down the text. 01:06:01.560 |
So I can start with it, with the first three words, and then I sort of treat them as sort 01:06:08.360 |
of elements I can dot product and sum, and then I can compute a value as to what they 01:06:13.920 |
all add up to, which is minus one, it turns out. 01:06:18.100 |
And so then I might have a bias that I add on and get an updated value if my bias is 01:06:28.080 |
And then I'd run it through a nonlinearity, and that will give me a final value. 01:06:33.400 |
And then I'll slide my filter down, and I'd work out a computation for this window of 01:06:41.120 |
three words, and take 0.5 times 3 plus 0.2 times 1, et cetera, and that comes out as 01:06:52.360 |
I put it I'm going to put it through my nonlinearity, and then I keep on sliding down, and I'll 01:06:58.880 |
do the next three words, and keep on going down. 01:07:02.880 |
And so that gives me a 1D convolution and computes a representation of the text. 01:07:10.680 |
You might have noticed in the previous example that I started here with seven words, but 01:07:17.120 |
because I wanted to have a window of three for my convolution, the end result is that 01:07:30.120 |
So commonly people will deal with that with padding. 01:07:33.960 |
So if I put padding on both sides, I can then start my three by three convolution, my three 01:07:40.560 |
sorry, not three by three, my three convolution here, and compute this one, and then slide 01:07:50.960 |
And so now my output is the same size as my real input, and so that's a convolution with padding. 01:08:00.200 |
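Here is the same 1D convolution written as a small PyTorch sketch: seven words, 4-dimensional word vectors, one filter spanning three words, and padding of one on each side so the output has the same length as the input. The dimensions and the tanh nonlinearity are illustrative choices.

```python
import torch
import torch.nn as nn

words = torch.randn(1, 4, 7)          # (batch, embedding dim, sentence length)
conv = nn.Conv1d(in_channels=4, out_channels=1, kernel_size=3, padding=1)
out = torch.tanh(conv(words))         # dot product per window + bias, then nonlinearity
print(out.shape)                      # torch.Size([1, 1, 7]): one value per word position
```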
Okay, so that was the start of things, but you know, how you get more power of the convolutional 01:08:08.080 |
network is you don't only have one filter, you have several filters. 01:08:13.720 |
So if I have three filters, each of which will have their own bias and nonlinearity, 01:08:18.560 |
I can then get a three dimensional representation coming out the end, and sort of you can think 01:08:25.680 |
of these as conceptually computing different features of your text. 01:08:32.080 |
Okay, so that gives us a kind of a new feature re-representation of our text. 01:08:40.840 |
But commonly, we then want to somehow summarize what we have. 01:08:47.920 |
And a very common way of summarizing what we have is to then do pooling. 01:08:53.960 |
So if we sort of think of these features as detecting different things in the text, so 01:09:00.600 |
you know, they might even be high level features like, you know, does this show signs of toxicity 01:09:13.200 |
So if you want to be interested in does it occur anywhere in the text, what people often 01:09:17.480 |
then do is a max pooling operation, where for each feature, they simply sort of compute 01:09:24.560 |
the maximum value it ever achieved in any position as you went through the text. 01:09:30.440 |
And say that this vector ends up as the sentence representation. 01:09:35.720 |
Sometimes for other purposes, rather than max pooling, people use average pooling, where 01:09:40.320 |
you take the averages of the different vectors to get the sentence representation. 01:09:46.160 |
In general, max pooling has been found to be more successful. 01:09:50.000 |
And that's kind of because if you think of it as feature detectors that are wanting to 01:09:54.560 |
detect was this present somewhere, then, you know, something like positive sentiment isn't 01:10:00.960 |
going to be present in every three word subsequence you choose. 01:10:13.800 |
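Continuing the sketch from above with several filters plus pooling: three filters give a 3-dimensional feature vector at each position; max pooling keeps, for each feature, the largest value it reached anywhere in the sentence, while average pooling takes the mean instead. Again, all sizes are illustrative.

```python
import torch
import torch.nn as nn

words = torch.randn(1, 4, 7)                       # (batch, embedding dim, length)
conv = nn.Conv1d(in_channels=4, out_channels=3, kernel_size=3, padding=1)
features = torch.relu(conv(words))                 # (1, 3, 7): 3 features per position

max_pooled = features.max(dim=2).values            # (1, 3) sentence representation
avg_pooled = features.mean(dim=2)                  # (1, 3) sentence representation
```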
And so that's a very quick look at convolutional neural networks. 01:10:19.760 |
Just to say this example is doing 1D convolutions with words. 01:10:25.280 |
But a very common place that convolutional neural networks are being used in natural 01:10:30.320 |
language is actually using them with characters. 01:10:34.160 |
And so what you can do is you can do convolutions over subsequences of the characters in the 01:10:43.280 |
And if you do that, this allows you to compute a representation for any sequence of characters. 01:10:50.160 |
So you don't have any problems with being out of vocabulary or anything like that. 01:10:55.640 |
Because for any sequence of characters, you just compute your convolutional representation 01:11:02.440 |
And so quite commonly, people use a character convolution to give a representation of words. 01:11:14.000 |
But otherwise, it's something that you use in addition to a word vector. 01:11:20.080 |
And so in both BIDAP and the model I'm about to show, that at the base level, it makes 01:11:25.040 |
use of both a word vector representation that we saw right at the beginning of the text 01:11:30.520 |
and a character level convolutional representation of the words. 01:11:38.680 |
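A minimal sketch of such a character-level CNN word representation is below: embed the characters of a word, convolve over character windows, max-pool over positions, and concatenate the result with the word's word vector. The vocabulary size, embedding widths, and filter count are placeholder assumptions, not the settings of BiDAF or the coreference model discussed next.

```python
import torch
import torch.nn as nn

class CharCNNWordRep(nn.Module):
    def __init__(self, n_chars=100, char_dim=16, n_filters=50, width=3, word_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=width, padding=1)
        self.out_dim = word_dim + n_filters

    def forward(self, char_ids, word_vec):
        # char_ids: (word length,) indices; word_vec: (word_dim,) pretrained word vector
        chars = self.char_emb(char_ids).t().unsqueeze(0)            # (1, char_dim, word length)
        char_rep = torch.relu(self.conv(chars)).max(dim=2).values   # pool over character positions
        # Works for any character string, so there is no out-of-vocabulary problem.
        return torch.cat([word_vec, char_rep.squeeze(0)], dim=-1)
```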
With that said, I now want to show you, before time runs out, an end-to-end neural coref model. 01:11:45.160 |
So the model I'm going to show you is Kenton Lee's one from University of Washington, 2017. 01:11:55.080 |
I'll mention the state of the art at the end. 01:11:57.980 |
But this was the first model that really said, get rid of all of that old stuff of having 01:12:07.200 |
a pipeline of separate stages. Build one big end-to-end model that does everything and returns coreference. 01:12:14.440 |
So compared to the earlier simple thing I showed, we're now going to process the text with 01:12:24.360 |
a neural network, and we're going to do all of mention detection and coreference in one step, end to end. 01:12:29.920 |
And the way it does that is by considering every span of the text up to a certain length 01:12:37.240 |
as a candidate mention, and it just figures out a representation for it and whether it's a mention. 01:12:44.720 |
So what we do at the start is we start with a sequence of words and we calculate from 01:12:50.920 |
those a standard word embedding and a character-level CNN embedding. 01:12:57.380 |
We then feed those as inputs into a bidirectional LSTM of the kind that we saw quite a lot of earlier. 01:13:07.800 |
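A tiny sketch of that step (PyTorch, hypothetical sizes): the per-word inputs, here the word vector concatenated with the character CNN vector from above, go through a bidirectional LSTM to give a contextual representation for every position:

```python
import torch
import torch.nn as nn

input_dim, hidden_dim, seq_len = 332, 200, 6     # 332 matches the sketch above
inputs = torch.randn(1, seq_len, input_dim)      # (batch, time, features)

bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
x_star, _ = bilstm(inputs)                       # (1, 6, 400): forward and backward states concatenated
```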
But then after this, what we do is we compute representations for spans. 01:13:14.840 |
So when we have a subsequence of words, we're then going to work out a representation of 01:13:21.760 |
that sequence of words, which we can then put into our coreference model. 01:13:27.980 |
So I can't fully illustrate it in this picture, but subsequences of different lengths, so 01:13:35.520 |
like "General", "General Electric", "General Electric said", will all have a span representation, 01:13:41.360 |
of which I've only shown a subset in green. 01:13:47.560 |
Well, the way they're computed is that the span representation is a vector that concatenates 01:13:55.060 |
several vectors and it consists of four parts. 01:13:59.020 |
It consists of the representation that was computed for the start of the span from the 01:14:08.220 |
LSTM, the representation for the end from the LSTM, that's over here. 01:14:16.000 |
And then it has a third part that's kind of interesting. 01:14:19.320 |
This is an attention-based representation that is calculated from the whole span, but 01:14:26.020 |
particularly sort of looks for the head of a span. 01:14:29.100 |
And then there are still a few additional features. 01:14:32.480 |
So it turns out that, you know, some of these additional things, like the length of the span and so on, are useful. 01:14:40.980 |
So to work out that attention-based part, which isn't just the beginning and the end, what's done is to calculate 01:14:49.400 |
an attention-weighted average of the word embeddings. 01:14:52.600 |
So what you're doing is you're taking the X star representation of the final word of 01:14:59.480 |
the span, and you're feeding that into a neural network to get attention scores for every 01:15:06.980 |
word in the span, which are these three, and that's giving you an attention distribution over the span. 01:15:16.300 |
And then you're calculating the third component of this as an attention-weighted sum of the word embeddings. 01:15:29.440 |
And so therefore you've got sort of a soft average of the representations of the words in the span. 01:15:39.480 |
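Here is a rough sketch of that span representation, loosely in the spirit of Lee et al. (2017) rather than an exact reproduction of the model; the attention scorer, the span-length feature, and all of the dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

lstm_dim, word_dim, feat_dim = 400, 332, 20
attn_score = nn.Linear(lstm_dim, 1)          # hypothetical scorer: one attention score per position
length_embed = nn.Embedding(30, feat_dim)    # hypothetical span-length feature

def span_representation(x_star, word_embs, start, end):
    # x_star: (seq_len, lstm_dim) BiLSTM outputs; word_embs: (seq_len, word_dim) word embeddings
    scores = attn_score(x_star[start:end + 1]).squeeze(-1)               # one score per word in the span
    alphas = torch.softmax(scores, dim=0)                                # attention distribution over the span
    head = (alphas.unsqueeze(-1) * word_embs[start:end + 1]).sum(dim=0)  # soft "head word" average
    length_feat = length_embed(torch.tensor(end - start))                # extra feature(s), e.g. span length
    # concatenate: start state, end state, attention-weighted head, extra features
    return torch.cat([x_star[start], x_star[end], head, length_feat])
```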
So then once you've got that, what you're doing is then feeding these representations 01:15:48.600 |
in to get scores for whether spans are coreferent mentions. 01:15:55.560 |
So you have a representation of the two spans, you have a score that's calculated for whether 01:16:05.880 |
two different spans look coreferent, and so overall you're getting a score for whether different spans corefer. 01:16:17.520 |
And so this model is just run end to end on all spans. 01:16:22.320 |
And that sort of would get intractable if you scored literally every span in a long document. 01:16:30.160 |
They sort of only allow spans up to a certain maximum size. 01:16:34.240 |
They only consider pairs of spans that aren't too distant from each other, et cetera, et cetera. 01:16:42.360 |
But basically it's sort of an approximation to just a complete comparison of spans. 01:16:48.160 |
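And a hedged sketch of the scoring step (again with made-up layer sizes, not the actual model's code): each span gets a score for being a mention, each pair of spans gets a compatibility score, and these are added to give the overall coreference score for the pair; in the real system, overly long spans and very distant pairs are pruned to keep this tractable:

```python
import torch
import torch.nn as nn

span_dim = 1152   # size of the span representation from the sketch above
mention_scorer = nn.Sequential(nn.Linear(span_dim, 150), nn.ReLU(), nn.Linear(150, 1))
pair_scorer = nn.Sequential(nn.Linear(3 * span_dim, 150), nn.ReLU(), nn.Linear(150, 1))

def coref_score(g_i, g_j):
    s_m_i = mention_scorer(g_i)                    # does span i look like a mention?
    s_m_j = mention_scorer(g_j)                    # does span j look like a mention?
    pair_input = torch.cat([g_i, g_j, g_i * g_j])  # include an elementwise interaction term
    s_a = pair_scorer(pair_input)                  # do the two spans look coreferent?
    return s_m_i + s_m_j + s_a                     # overall score for the pair
```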
And this turns into a very effective coreference resolution algorithm. 01:16:52.880 |
Today it's not the best coreference resolution algorithm because, maybe not surprisingly, 01:16:59.440 |
like everything else that we've been dealing with, transformer models 01:17:05.000 |
like BERT have now come along, and they produce even better results. 01:17:09.720 |
So the best coreference systems now make use of BERT. 01:17:15.640 |
In particular, when Danqi spoke, she briefly mentioned SpanBERT, which is a variant of 01:17:21.520 |
BERT which blanks out for reconstruction subsequences of words rather than just single words. 01:17:30.480 |
And SpanBERT has actually proven to be very effective for doing coreference, perhaps because coreference is precisely about representing and relating spans of text. 01:17:40.800 |
We've also gotten gains, funnily enough, by treating coreference as a question answering problem. 01:17:47.400 |
So effectively you can take a mention like "he" or "the person" and ask, what is its antecedent? 01:18:05.280 |
So if we put that together as time is running out, let me just sort of give you some sense 01:18:11.880 |
of how results come out for coreference systems. 01:18:17.200 |
So I'm skipping a bit, actually, that you can find in the slides, which is how coreference systems are scored. 01:18:24.320 |
But essentially it's scored on a clustering metric. 01:18:27.080 |
So a perfect clustering would give you 100, and something that makes no correct decisions would give you 0. 01:18:35.760 |
And so this is sort of how the coreference numbers have been panning out. 01:18:42.800 |
So back in 2010, actually, this was a Stanford system. 01:18:48.160 |
This was a state of the art system for coreference. 01:18:52.040 |
It was actually a non-machine learning model, because again, we wanted to sort of prove 01:18:57.760 |
that these rule-based methods in practice work kind of well. 01:19:02.280 |
And so its accuracy was around 55 for English, 50 for Chinese. 01:19:08.720 |
Then gradually machine learning models, at this point sort of statistical machine learning models, started to do better. 01:19:16.480 |
Wiseman was the very first neural coreference system, and that gave some gains. 01:19:22.640 |
Here's a system that Kevin Clark and I did, which gave a little bit further gains. 01:19:28.600 |
So Lee is the model that I've just shown you, the end-to-end model, and it got a bit better again. 01:19:37.560 |
But then again, what gave the huge breakthrough, just like for question answering, was the arrival of these pre-trained transformer models. 01:19:46.920 |
So once we move to here, we're now using SpanBERT, and that's giving you a considerable extra gain. 01:19:58.000 |
And the very latest best results effectively combine together a larger version 01:20:05.920 |
of SpanBERT and CorefQA, getting up to 83. 01:20:11.640 |
So you might think from that, that coref is sort of doing really well and is getting close to being solved. 01:20:22.680 |
Well it's certainly true that in neural times, the results have been getting way, way better 01:20:28.560 |
But I would caution you that these results that I just showed were on a corpus called OntoNotes, which is largely newswire. 01:20:37.520 |
And it turns out that Newswire coreference is pretty easy. 01:20:41.880 |
I mean, in particular, there's a lot of mention of the same entities, right? 01:20:46.200 |
So the newspaper articles are full of mentions of the United States and China and their leaders. 01:20:54.880 |
And it's sort of very easy to work out what they're coreferent to. 01:20:58.780 |
And so the coreference scores are fairly high. 01:21:03.920 |
Whereas if what you do is take something like a page of dialogue from a novel and feed that 01:21:10.840 |
into a system and say, okay, do the coreference correctly, you'll find pretty rapidly that 01:21:17.840 |
the performance of the models is much more modest. 01:21:21.800 |
If you'd like to try out a coreference system for yourself, there are pointers to a couple 01:21:27.760 |
of them here, where the top one is ours, Kevin Clark's neural coreference system. 01:21:36.240 |
And the other is one that goes with the Hugging Face repository that we've mentioned.