
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 13 - Coreference Resolution


Chapters

0:00 Introduction
4:25 Lecture Plan
5:58 What is Coreference Resolution
6:23 Example: The Star
13:00 Example: The Tree
17:20 Machine Translation
20:01 Detecting mentions
22:02 Noun phrases
23:38 How to deal with spurious mentions
24:21 Can we say that it's sunny
32:08 Coreference vs Anaphora
32:53 Complex forms of Anaphora
37:34 Context
39:37 Coreference Models
40:49 Rule-based Coreference
42:41 Hobbs Algorithm
56:17 Coreference Algorithms

Transcript

Okay, hi everyone. So we'll get started again. We're now into week seven of CS224N. If you're following along the syllabus really closely, we actually did a little bit of a rearrangement in classes. And so today it's me and I'm going to talk about coreference resolution, which is another chance we get to take a deeper dive into a more linguistic topic.

I will also show you a couple of new things for deep learning models at the same time. And then the lecture that had previously been scheduled at this point, which was going to be John on explanation in neural models, is being shifted later down into week nine, I think it is.

But you'll still get him later. So we're getting underway, just a couple of announcements on things. Well, first of all, congratulations on surviving assignment five, I hope. I know it was a bit of a challenge for some of you, but I hope it was a rewarding state of the art learning experience on the latest in neural nets.

And at any rate, you know, this was a brand new assignment that we used for the first time this year. So we'll really appreciate later on when we do the second survey, getting your feedback on that. We've been busy reading people's final project proposals. Thanks. Lots of interesting stuff there.

Our goal is to get them back to you tomorrow. But you know, as soon as you've had a good night's sleep after assignment five, now is also a great time to get started working on your final projects, because there's just not that much time till the end of quarter.

And I particularly want to encourage all of you to chat to your mentor regularly, go and visit office hours and keep in touch, get advice, just talking through things is a good way to keep you on track. We also plan to be getting back assignment four grades later this week.

The work sort of never stops at this point. So the next thing for the final project is the final project milestone. We handed out the details of that last Friday, and it's due a week from today. So the idea of this final project milestone is really to help keep you on track and keep things moving towards having a successful final project.

So our hope is that sort of most of what you write for the final project milestone is material you can also include in your final project, except for a few paragraphs of here's exactly where I'm up to now. So the overall hope is that doing this in two parts and having a milestone before the final thing, it's just making you make progress and be on track to having a successful final project.

Finally, the next class on Thursday is going to be Colin Raffel, and this is going to be super exciting. So he's going to be talking more about the very latest in large pre-trained language models, both what some of their successes are, and also what some of the disconcerting, not quite so good aspects of those models are.

So that should be a really good, interesting lecture. When we had him come and talk to our NLP seminar, we had several hundred people come along for that. And so for this talk, again, we're asking that you write a reaction paragraph following the same instructions as last time about what's in this lecture.

And someone asked in the questions, well, what about last Thursday's? The answer to that is no. So the distinction here is we're only doing the reaction paragraphs for outside guest speakers. And although it was great to have Antoine Bosselut for last Thursday's lecture, he's a postdoc at Stanford.

So we don't count him as an outside guest speaker. And so nothing needs to be done for that one. So there are three classes for which you need to do it. So there was the one before from Danqi Chen, Colin Raffel, which is Thursday, and then towards the end of the course, there's Yulia Tsvetkov.

Okay, so this is the plan today. So in the first part of it, I'm actually going to spend a bit of time talking about what coreference is, what different kinds of reference and language are. And then I'm going to move on and talk about some of the kind of methods that people have used for solving coreference resolution.

Now there's one bug in our course design, which was, a lot of years, we've had a whole lecture on doing convolutional neural nets for language applications. And that slight bug appeared the other day when Danqi Chen talked about the BiDAF model, because she sort of slipped in, oh, there's a character CNN representation of words, and we hadn't actually covered that.

And so that was a slight oopsie. I mean, actually, for applications in coreference as well, people commonly make use of character level conv nets. So I wanted to sort of spend a few minutes doing basics of conv nets for language. The sort of reality here is that given that there's no exam week this year, to give people more time for final projects, we sort of shortened the content by a week.

And so you're getting a little bit less of that content. Then going on from there, say some stuff about a state of the art neural coreference system, and right at the end, talk about how coreference is evaluated and what some of the results are. Yeah. So first of all, what is this coreference resolution term that I've been talking about a lot?

So coreference resolution is the task of finding all the mentions in a piece of text that refer to the same entity in the world. And sorry, that's a typo on the slide; it should be "in the world", not "in the word". So let's make this concrete. So here's part of a short story by Shruti Rao called The Star.

Now I have to make a confession here, because this is an NLP class, not a literature class, I crudely made some cuts to the story to be able to have relevant parts appear on my slide in a decent sized font for illustrating coreference. So it's not quite the full original text, but it basically is a piece of this story.

So what we're doing in coreference resolution is we're working out what people are mentioned. So here's a mention of a person, Banarja, and here's a mention of another person, Akila. And well, mentions don't have to be people. So the local park, that's also a mention. And then here's Akila again, and Akila's son.

And then there's Prajwal. Then there's another son here, and then her son, and Akash. And they both went to the same school. And then there's a preschool play. And there's Prajwal again. And then there's a naughty child, Lord Krishna. And there's some that are a bit complicated, like the lead role, is that a mention?

It's sort of more of a functional specification of something in the play. There's Akash, and it's a tree. I won't go through the whole thing yet. But I mean, in general, there are noun phrases that are mentioning things in the world. And so then what we want to do for coreference resolution is work out which of these mentions are talking about the same real world entity.

So if we start off, so there's Banarja. And so Banarja is the same person as her there. And then we could read through, she resigned herself. So that's both Banarja. She bought him a brown T-shirt and brown trousers. And then she made a large cut out tree. She attached, right.

So all of that's about Banarja. But then we can have another person. So here's Akila. And here's Akila. Maybe those are the only mentions of Akila. So then we can go on from there. Okay. And so then there's Prajwal. But note that Prajwal is also Akila's son. So really Akila's son is also Prajwal.

And so an interesting thing here is that you can get nested syntactic structure so that we have these sort of noun phrases. So that if, you know, overall we have sort of this noun phrase, Akila's son Prajwal, which consists of two noun phrases in apposition. Here's Prajwal. And then for the noun phrase Akila's son, it sort of breaks down to itself having an extra possessive noun phrase in it and then a noun so that you have Akila's and then this is son.

So that you have these multiple noun phrases. And so that you can then be sort of having different parts of this be one person in the coreference. But this noun phrase here referring to a different person in the coreference. Okay, so back to Prajwal. So while there's some easy other Prajwals, right, so there's Prajwal here.

And then you've got some more complicated things. So one of the complicated cases here is that we have they went to the same school. So that they there is what gets referred to as split antecedents, because the they refers to both Prajwal and Akash. And that's an interesting phenomenon. And so I could try and show that somehow, I could put some slashes in or something.

And if I get a different color, Akash, we have Akash and her son. And then this one sort of both of them at once. So human languages have this phenomenon of split antecedents. But you know, one of the things that you should notice when we start talking about algorithms that people use for doing coreference resolution is that they make some simplified assumptions as to how they go about treating the problem.

And one of the simplifications that most algorithms make is that for any noun phrase, like this pronoun, say, when they try to work out what it is coreferent with, the answer is one thing. And so actually most NLP algorithms for coreference resolution just cannot get split antecedents right. Any time they occur in the text, they guess something and they always get it wrong.

So that's the sort of a bit of a sad state of affairs. But that's the truth of how it is. Okay. So then going ahead, we have Akash here. And then we have another tricky one. So moving on from there, we then have this tree. So well, in this context of this story, Akash is going to be the tree.

So you could feel that it was okay to say, well, this tree is also Akash. You could also feel that that's a little bit weird and not want to do that. And I mean, actually different people's coreference datasets differ in this. So really that, you know, we're predicating identity relationship here between Akash and the property of being a tree.

So do we regard the tree as the same as Akash or not? And people make different decisions there. Okay. But then going ahead, we have here's Akash and she bought him. So that's Akash. And then we have Akash here. And so then we go on. Okay. So then if we don't regard the tree as the same as Akash, we have a tree here.

But then note that the next place over here, where we have a mention of a tree, the best tree, but that's sort of really a functional description of, you know, of possible trees making someone the best tree. It's not really referential to a tree. And so it seems like that's not really coreferent.

But if we go on, there's definitely more mention of a tree. So when she has made the tree truly the nicest tree, or well, I'm not sure. Is that one coreferent? It is definitely referring to our tree. And maybe this one again is a sort of a functional description that isn't referring to the tree.

Okay. So there's definitely and so maybe this one, though, where it's a tree is referring to the tree. But what I hope to have illustrated from this is, you know, most of the time when we do coreference in NLP, we just make it look sort of like the conceptual phenomenon is, you know, kind of obvious that there's a mention of Sarah and then it says she and you say, oh, they're coreferent.

This is easy. But if you actually start looking at real text, especially when you're looking at something like this, that is a piece of literature, the kinds of phenomena you get for coreference and overlapping reference, and various other phenomena that I'll talk about, you know, they actually get pretty complex.

And it's not, you know, there are a lot of hard cases that you actually have to think about as to what things you think about as coreferent or not. Okay. But basically, we do want to be able to do something with coreference because it's useful for a lot of things that we'd like to do in natural language processing.

So for one task that we've already talked about, question answering, but equally for other tasks such as summarization, information extraction, if you're doing something like reading through a piece of text, and you've got a sentence like he was born in 1961, you really want to know who he refers to, to know if this is a good answer to the question of, you know, when was Barack Obama born or something like that.

It turns out also that it's useful in machine translation. So in most languages, pronouns have features for gender and number, and in quite a lot of languages, nouns and adjectives also show features of gender, number, and case. And so when you're translating a sentence, you want to be aware of these features and what is coreferent as what to be able to get the translations correct.

So you know, if you want to be able to work out a translation and know whether it's saying Alicia likes Juan because he's smart, or Alicia likes Juan because she's smart, then you have to be sensitive to coreference relationships to be able to choose the right translation. When people build dialogue systems, dialogue systems also have issues of coreference a lot of the time.

So you know, if it's sort of book tickets to see James Bond and the system replies Spectre is playing near you at 2 and 3 today, well, there's actually a coreference relation, sorry, there's a reference relation between Spectre and James Bond because Spectre is a James Bond film. I'll come back to that one in a minute.

But then it's how many tickets would you like, two tickets for the showing at 3. That 3 is not just the number 3. That 3 is then a coreference relationship back to the 3pm showing that was mentioned by the agent in the dialogue system. So again, to understand these, we need to be understanding the coreference relationships.

So how now can you go about doing coreference? So the standard traditional answer, which I'll present first, is coreference is done in two steps. On the first step, what we do is detect mentions in a piece of text. And that's actually a pretty easy problem. And then in the second step, we work out how to cluster the mentions.

So as in my example from the Shruti Rao text, basically what you're doing with coreference is you're building up these clusters, sets of mentions that refer to the same entity in the world. So if we explore a little how we could do that as a two-step solution, the first part was detecting the mentions.

And so pretty much there are three kinds of things, different kinds of noun phrases that can be mentions. There are pronouns like I, your, it, she, him, and also some demonstrative pronouns like this and that and things like that. There are explicitly named things, so things like Paris, Joe Biden, Nike.

And then there are plain noun phrases that describe things. So a dog, the big fluffy cat stuck in the tree. And so all of these are things that we'd like to identify as mentions. And the straightforward way to identify these mentions is to use natural language processing tools, several of which we've talked about already.

So to work out pronouns, we can use what's called a part of speech tagger, which we haven't really explicitly talked about, but which you used when you built dependency parsers. It assigns parts of speech to each word, so we can just find the words that are pronouns.

For named entities, we did talk just a little bit about named entity recognizers as a use of sequence models for neural networks so we can pick out things like person names and company names. And then for the ones like the big fluffy, a big fluffy dog, we could then be sort of picking out from syntactic structure noun phrases and regarding them as descriptions of things.
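As a rough sketch of how those three sources combine into one candidate mention list, here is a minimal Python illustration. The pre-tagged input, the PRON tag, and the (start, end) span format are assumptions for illustration, not the output of any particular tagger, NER system, or parser.

```python
def extract_mentions(tagged_tokens, ner_spans, np_spans):
    """Collect candidate mention spans (start, end) from three sources:
    pronouns from a POS tagger, named entities from an NER system,
    and noun phrases from a parser. Spans are end-exclusive token indices."""
    mentions = set()
    # Pronouns: any single token the tagger marked as PRON
    for i, (_token, pos) in enumerate(tagged_tokens):
        if pos == "PRON":
            mentions.add((i, i + 1))
    # Named entities and noun phrases already come in as spans
    mentions.update(ner_spans)
    mentions.update(np_spans)
    return sorted(mentions)

# Hypothetical pre-tagged input for "Akila bought him a shirt"
tagged = [("Akila", "PROPN"), ("bought", "VERB"), ("him", "PRON"),
          ("a", "DET"), ("shirt", "NOUN")]
ner = [(0, 1)]          # "Akila" from the NER system
nps = [(0, 1), (3, 5)]  # "Akila", "a shirt" from the parser
print(extract_mentions(tagged, ner, nps))  # → [(0, 1), (2, 3), (3, 5)]
```

Note that the same span can be proposed by several tools ("Akila" here comes from both the NER system and the parser); the set just deduplicates them.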

So that we could use all of these tools and those would give us basically our mentions. It's a little bit more subtle than that, because it turns out there are some noun phrases and things of all of those kinds which don't actually refer so that they're not referential in the world.

So when you say it is sunny, it doesn't really refer. When you make universal claims like every student, well, every student isn't referring to something you can point to in the world. And more dramatically, when you have no student and make a negative universal claim, it's not referential to anything.

There are also things that you can describe functionally, which don't have any clear reference. So if I say the best donut in the world, that's a functional claim, but it doesn't necessarily have reference. Like if I've established that I think a particular kind of donut is the best donut in the world, I could then say to you, I ate the best donut in the world yesterday.

And you know what I mean, it might have reference. But if I say something like I'm going around to all the donut stores trying to find the best donut in the world, then it doesn't have any reference yet. It's just a sort of a functional description I'm trying to satisfy.

You also then have things like quantities, 100 miles. It's that quantity that is not really something that has any particular reference. You can mark out 100 miles, all sorts of places. So how do we deal with those things that aren't really mentions? Well one way is we could train a machine learning classifier to get rid of those spurious mentions.

But actually mostly people don't do that. Most commonly if you're using this kind of pipeline model where you use a parser and a named entity recognizer, you regard everything as you've found as a candidate mention, and then you try and run your coref system. And some of them, like those ones, hopefully aren't made coref with anything else.

And so then you just discard them at the end of the process. >> Hey, Chris. >> Yeah. >> I've got an interesting question that linguistics bears on. A student asks, when we say that it is sunny, does the it refer to the weather? And I think the answer is yes.

>> So that's a fair question. So people have actually tried to suggest that when you say it is sunny, it means the weather is sunny. But I guess the majority opinion at least is that isn't plausible. And I mean, for I guess many of you aren't native speakers of English, but similar phenomena occur in many other languages.

I mean, it just intuitively doesn't seem plausible when you say it's sunny or it's raining today that you're really saying that as a shortcut for the weather is raining today. It just seems like really what the case is, is English likes to have something filling the subject position. And when there's nothing better to fill the subject position, you stick it in there and get it's raining.

And so in general, it's believed that you get this phenomenon of having these empty dummy its that appear in various places. I mean, another place in which it seems like you clearly get dummy its is that when you have clauses that are subjects of a verb, you can move them to the end of the sentence.

So if you have a sentence where you put a clause in the subject position, it normally sounds fairly awkward in English. So you have a sentence something like "That CS224N is a lot of work is known by all students." People don't normally say that; the normal thing to do is to shift the clause to the end of the sentence.

But when you do that, you stick in a dummy it to fill the subject position. So you then have it is known by all students that CS224N is a lot of work. So that's the general feeling that this is a dummy it that doesn't have any reference. Okay, there's one more question.

So if someone says it is sunny among other things, and we ask how is the weather? Hmm, okay, good point. You've got me on that one. Right. So someone says, how's the weather? And you answer it is sunny, it then does seem like the it is in reference to the weather.

Fair enough. Well, you know, I guess this is what our coreference systems are built trying to do: in situations like that, they're making a decision of coreference or not. And I guess what you'd want to say in that case is, it seems reasonable to regard this one as coreferent with that weather that did appear before it.

I mean, but that also indicates another reason to think that in the normal case it's not coreference, right? Because normally, pronouns are only used when their reference is established, when you've referred to them already. Like, "John is answering questions", and then you can say, "he types really quickly". It would seem odd to just start the conversation with "he types really quickly", because it doesn't have any established reference, whereas that doesn't seem to be the case here; it seems like you can just start a conversation by saying it's raining really hard today.

And that doesn't sound odd at all. Okay. So I've sort of there presented the traditional picture. But you know, this traditional picture doesn't mean something that was done last millennium before you were born. I mean, essentially, that was the picture until about 2016. That essentially every coreference system that was built used tools like part of speech taggers, NER systems, and parsers to analyze sentences, to identify mentions, and to give you features for coreference resolution.

And I'll show a bit more about that later. But more recently, in our neural systems, people have moved to avoiding traditional pipeline systems and doing one-shot, end-to-end coreference resolution systems. So if I skip directly to the second bullet, there's a new generation of neural systems where you just start with your sequence of words, and you do the maximally dumb thing, you just say, let's take all spans, commonly with some heuristics for efficiency, but you know, conceptually, all subsequences of this sentence, they might be mentions, let's feed them into a neural network, which will simultaneously do mention detection and coreference resolution end to end in one model.
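As a sketch of that "take all spans" step, here is what exhaustive span enumeration looks like. The maximum span width is a stand-in for the efficiency heuristics just mentioned; a real end-to-end system would then score and prune these spans with a learned mention score rather than keep them all.

```python
def candidate_spans(tokens, max_width=4):
    """Enumerate every contiguous span of up to max_width tokens.
    Each span is a (start, end) pair with end exclusive; an end-to-end
    coref model would score all of these as potential mentions."""
    n = len(tokens)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, min(i + max_width, n) + 1)]

tokens = "Barack Obama said he would sign the bill".split()
spans = candidate_spans(tokens)
print(len(spans))       # 26 candidate spans for 8 tokens, width <= 4
print((0, 2) in spans)  # True: "Barack Obama" is among the candidates
```

Even with a width cap, the number of candidates grows roughly linearly in sentence length times the cap, which is why real systems prune aggressively before scoring pairs of spans.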

And I'll give an example of that kind of system later in the lecture. Okay, is everything good to there and I should go on? Yep. Okay. So I'm going to get on to how to do coreference resolution systems. But before I do that, I do actually want to show a little bit more of the linguistics of coreference, because there are actually a few more interesting things to understand and know here.

I mean, when we say coreference resolution, we really confuse together two linguistic things which are overlapping, but different. And so it's really actually good to understand the difference between these things. So there are two things that can happen. One is that you can have mentions, which are essentially standalone, but happen to refer to the same entity in the world.

So if I have a piece of text that said, Barack Obama traveled yesterday to Nebraska, Obama was there to open a new meat processing plant or something like that, with Barack Obama and Obama, there are two mentions there; they refer to the same person in the world; they are coreferent.

So that is true coreference. But there's a different related linguistic concept called anaphora. And anaphora is when you have a textual dependence of an anaphor on another term, which is the antecedent. And in this case, the meaning of the anaphor is determined by the antecedent in a textual context.

And the canonical case of this is pronouns. So when it's Barack Obama said he would sign the bill, he is an anaphor. It's not a word that independently we can work out what its meaning is in the world, apart from knowing the vaguest feature that it's referring to something probably male.

But in the context of this text, we have that this anaphor is textually dependent on Barack Obama. And so then we have an anaphoric relationship, which sort of means they refer to the same thing in the world. And so therefore, you can say they're coreferent. So the picture we have is like this.

So for coreference, we have these separate textual mentions, which are basically standalone, which refer to the same thing in the world. Whereas in anaphora, we actually have a textual relationship. And you essentially have to use pronouns like he and she in legitimate ways in which the hearer can reconstruct the relationship from the text, because they can't work out what he refers to if that's not there.

And so that's a fair bit of the distinction. But it's actually a little bit more to realize, because there are more complex forms of anaphora, which aren't coreference, because you have a textual dependence, but it's not actually one of reference. And so this comes back to things like these quantifier noun phrases that don't have reference.

So when you have sentences like these ones, every dancer twisted her knee, well, this her here has an anaphoric dependency on every dancer, or even more clearly with no dancer twisted her knee, the her here has an anaphoric dependence on no dancer. But for no dancer twisted her knee, no dancer isn't referential.

It's not referring to anything in the world. And so there's no coreferential relationship, because there's no reference relationship, but there's still an anaphoric relationship between these two noun phrases. And then you have this other complex case that turns up quite a bit, where you can have where the things being talked about do have reference, but an anaphoric relationship is more subtle than identity.

So you commonly get constructions like this one. We went to a concert last night, the tickets were really expensive. Well, the concert and the tickets are two different things. They're not coreferential. But in interpreting this sentence, what this really means is the tickets to the concert, right? And so there's sort of this hidden, not said dependence where this is referring back to the concert.

And so what we say is that these, the tickets does have an anaphoric dependence on the concert, but they're not coreferential. And so that's referred to as bridging anaphora. And so overall, there's the simple case and the common case, which is pronominal anaphora, where it's both coreference and anaphora.

You then have other cases of coreference, such as every mention of the United States being coreferential with every other mention of the United States, but those don't have any textual dependence on each other. And then you have textual dependencies like bridging anaphora, which aren't coreference.

That's probably about as, now I was going to say that's probably as much linguistics as you wanted to hear, but actually I have one more point of linguistics. One or two of you, but probably not many, might've been troubled by the fact that the term anaphora as a classical term means that you are looking backward for your antecedent.

And in sort of classical terminology, you have both anaphora and cataphora, and it's cataphora where you look forward for your antecedent. Cataphora isn't that common, but it does occur. Here's a beautiful example of cataphora. So this is from Oscar Wilde. From the corner of the divan of Persian saddlebags on which he was lying, smoking as was his custom innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey sweet and honey colored blossoms of a laburnum.

Okay. So in this example here, the he and then this his are actually referring to Lord Henry Wotton. And so these are both examples of cataphora. But in modern linguistics, even though most reference of pronouns is backwards, we don't distinguish in terms of order. And so the terms anaphor and anaphora are used for textual dependence, regardless of whether it's forward or backward.

Okay. A lot of details there, but taking stock of this. So the basic observation is language is interpreted in context, that in general, you can't work out the meaning or reference of things without looking at the context of the linguistic utterance. So we'd seen some simple examples before. So for something like word sense disambiguation, if you see just the words, the bank, you don't know what it means.

And you need to look at a context to get some sense as to whether it means a financial institution or the bank of a river or something like that. And so anaphora and coreference give us additional examples where you need to be doing contextual interpretation of language. So when you see a pronoun, you need to be looking at the context to see what it refers to.

And so if you think about text understanding as a human being does it, reading a story or an article, that we progress through the article from beginning to end. And as we do it, we build up a pretty complex discourse model in which new entities are introduced by mentions and then they're referred back to and relationships between them are established and they take actions and things like that.

And it sort of seems like in our head that we sort of build up a kind of a complex graph like discourse representation of a piece of text with all these relationships. And so part of that is these anaphoric relationships and coreference that we're talking about here. And indeed in terms of CS224N, the only kind of whole discourse meaning that we're going to look at is looking a bit at anaphora and coreference.

But if you want to see more about higher level natural language understanding, you can get more of this next quarter in CS224U. So I want to tell you a bit about several different ways of doing coreference. So broadly there are four different kinds of coreference models. So the traditional old way of doing it was rule-based systems.

And this isn't the topic of this class and this is pretty archaic at this point. This is stuff from last millennium. But I wanted to say a little bit about it because it's actually kind of interesting as sort of food for thought as to how far along we are or aren't in solving artificial intelligence and really being able to understand texts.

Then there are sort of classic machine learning methods of doing it, which you can sort of divide up as mention pair methods, mention ranking methods, and really clustering methods. And I'm sort of going to skip the clustering methods today because most of the work, especially most of the recent work, implicitly makes clusters by using either mention pair or mention ranking methods.
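To make the mention pair idea concrete, here is a minimal sketch: a pairwise coreference decision over all mention pairs, with clusters formed as the transitive closure of the positive decisions via union-find. The toy classifier here is a hand-written stand-in for a trained pairwise model.

```python
def cluster_mentions(mentions, is_coreferent):
    """Mention-pair clustering: run a pairwise classifier over all
    mention pairs, then merge positives transitively with union-find."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    for i, a in enumerate(mentions):
        for b in mentions[i + 1:]:
            if is_coreferent(a, b):
                parent[find(b)] = find(a)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return sorted(clusters.values())

# Toy stand-in for a trained pairwise classifier
links = {("Banarja", "her"), ("her", "she")}
def toy_classifier(a, b):
    return (a, b) in links or (b, a) in links

print(cluster_mentions(["Banarja", "her", "Akila", "she"], toy_classifier))
# → [['Akila'], ['Banarja', 'her', 'she']]
```

Notice that Banarja and she end up in the same cluster even though the classifier never linked them directly; that transitivity is exactly how mention pair decisions implicitly build the entity clusters.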

And so I'm going to talk about a couple of neural methods for doing that. Okay. But first of all, let me just tell you a little bit about rule-based coreference. So there's a famous historical algorithm in NLP for doing pronoun anaphora resolution, which is referred to as the Hobbs algorithm.

So everyone just refers to it as the Hobbs algorithm. And if you sort of look up a textbook like Jurafsky and Martin's textbook, it's referred to as the Hobbs algorithm. But actually, if you go back to Jerry Hobbs, that's his picture over there in the corner, if you actually go back to his original paper, he refers to it as the naive algorithm.

And his naive algorithm for pronoun coreference was this sort of intricate handwritten set of rules to work out coreference. So this is the start of the set of the rules, but there are more rules or more clauses of these rules for working out coreference. And this looks like a hot mess, but the funny thing was that this set of rules for determining coreference were actually pretty good.

And so in the sort of 1990s and 2000s decade, even when people were using machine learning based systems for doing coreference, one of the features they'd build into those machine learning based systems was the Hobbs algorithm, so that the predictions it made, with a certain weight, were then a feature in making the final decisions.

And it's only really in the last five years that people have moved away from using the Hobbs algorithm. Let me give you a little bit of a sense of how it works. So for the Hobbs algorithm, here's our example. This is an example from a Guardian book review: Niall Ferguson is prolific, well-paid and a snappy dresser.

Stephen Moss hated him. So what the Hobbs algorithm does is we start with a pronoun, and then it says step one, go to the NP that's immediately dominating the pronoun. And then it says go up to the first NP or S; call this X and the path P.

Then it says traverse all branches below X to the left of P, left to right, breadth first. So then it's saying to go left to right for other branches below breadth first. So that's sort of working down the tree. So we're going down and left to right and look for an NP.

Okay. And here's an NP. But then we have to read more carefully and say propose as antecedent any NP that has an NP or S between it and X. Well, this NP here has no NP or S between it and X. So this isn't a possible antecedent. So this is all very, you know, complex and handwritten, but basically he sort of built into the clauses of this a lot of facts about how the grammar of English works.

And so what this is capturing is if you imagine a different sentence, you know, if you imagine the sentence, Stephen Moss's brother hated him. Well then Stephen Moss would naturally be coreferent with him. And in that case, well, precisely what you'd have is the noun phrase with, well, the noun phrase brother, and you'd have another noun phrase inside it for the Stephen Moss.

And then that would go up to the sentence. So in the case of Stephen Moss's brother, when you looked at this noun phrase, there would be an intervening noun phrase before you got to the node X. And therefore Stephen Moss is a possible and in fact, good antecedent of him.

And the algorithm would choose Stephen Moss, but the algorithm correctly captures that when you have the sentence, Stephen Moss hated him, that him cannot refer to Stephen Moss. Okay. So having worked that out, it then says if X is the highest S in the sentence, okay, so my X here is definitely the highest S in the sentence because I've got the whole sentence.

What you should do is then traverse the parse trees of previous sentences in the order of recency. So what I should do now is sort of work backwards in the text, one sentence at a time, going backwards, looking for an antecedent. And then for each tree, traverse each tree left to right, breadth first.

So then within each tree, I'm doing the same of going breadth first. So sort of working down and then going left to right with an equal breadth. And so hidden inside these clauses, it's capturing a lot of the facts of how coreference typically works. So what you find in English, I'll say, but in general, this is true of lots of languages, is that there are general preferences and tendencies for coreference.

So a lot of the time, a pronoun will be coreferent with something in the same sentence, like Stephen Moss's brother hated him, but it can't be if it's too close to it. So you can't say Stephen Moss hated him and have the him be Stephen Moss. And if you're then looking for coreference that's further away, the thing it's coreferent with is normally close by.

And so that's why you work backwards through sentences one by one. But then once you're looking within a particular sentence, the most likely thing it's going to be coreferent to is a topical noun phrase and default topics in English are subjects. So by doing things breadth first, left to right, a preferred antecedent is then a subject.

And so this algorithm, I won't go through all the complex clauses five through nine, ends up saying, okay, what you should do is propose Niall Ferguson as what is coreferent to him, which is the obvious correct reading in this example. Okay. Phew. You probably didn't want to know that.
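As a rough sketch, the breadth-first, left-to-right tree search at the heart of those steps can be illustrated in a few lines of Python. The toy tree, the tuple encoding, and the helper names here are my own illustration rather than Hobbs's actual implementation, and this only shows the search order, not the full intervening-NP-or-S checks or clauses five through nine.

```python
from collections import deque

# A toy constituency tree as nested tuples: (label, child1, child2, ...).
# Leaves are (tag, "word"). Illustrative sketch of the breadth-first,
# left-to-right search order used in the Hobbs algorithm, nothing more.
tree = ("S",
        ("NP", ("NNP", "Niall"), ("NNP", "Ferguson")),
        ("VP", ("VBZ", "is"),
               ("ADJP", ("JJ", "prolific"))))

def bfs_left_to_right(node):
    """Yield nodes breadth first, visiting siblings left to right."""
    queue = deque([node])
    while queue:
        n = queue.popleft()
        yield n
        for child in n[1:]:
            if isinstance(child, tuple):
                queue.append(child)

def first_np(node):
    """Return the first NP encountered in this search order."""
    for n in bfs_left_to_right(node):
        if n[0] == "NP":
            return n
    return None

print(first_np(tree))
```

Because the search goes breadth first and left to right, the subject NP (Niall Ferguson) is found before anything inside the VP, which is exactly how the algorithm encodes the preference for subjects as antecedents.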

And in some sense, the details of that aren't interesting. But what is I think actually still interesting in 2021 is what points Jerry Hobbs was actually trying to make last millennium. And the point he was trying to make was the following. So Jerry Hobbs wrote this algorithm, the naive algorithm, because what he said was, well, look, if you want to try and crudely determine coreference, well, there are these various preferences, right?

There's the preference for same sentence. There's the preference for recency. There's a preference for topical things like subjects. And there are things where, you know, if it has gender, it has to agree in gender. So there are sort of strong constraints of that sort. So I can write an algorithm using my linguistic nous which captures all the main preferences.

And actually, it works pretty well. Doing that is a pretty strong baseline system. But what Jerry Hobbs wanted to argue is that this algorithm just isn't something you should believe in. This isn't a solution to the problem. This is just sort of, you know, making a best guess according to the preferences of what's most likely, without actually understanding what's going on in the text at all.

And so actually, even though it's now called the Hobbs algorithm, Jerry Hobbs wasn't a fan of the Hobbs algorithm. He was wanting to argue that the Hobbs algorithm is completely inadequate as a solution to the problem, and that the only way we'll actually make progress in natural language understanding is by building systems that actually really understand the text.

And this is actually something that has come to the fore again more recently. So the suggestion is that in general, you can't work out coreference, or pronominal anaphora in particular, unless you're really understanding the meaning of the text. And people look at pairs of examples like these ones.

So she poured water from the pitcher into the cup until it was full. So think for just half a moment, well, what is it in that example that is full? So that what's full there is the cup. But then if I say she poured water from the pitcher into the cup until it was empty, well, what's empty?

Well, that's the pitcher. And the point that is being made with this example is that the only thing that's been changed between these examples is the adjective right here. So these two examples have exactly the same grammatical structure. So in terms of Hobbs's naive algorithm, the naive algorithm necessarily has to predict the same answer for both of these.

But that's wrong. You just cannot determine the correct pronoun antecedent based on grammatical preferences of the kind that are used in the naive algorithm. You actually have to conceptually understand about pitchers and cups and water and full and empty to be able to choose the right antecedent. Here's another famous example that goes along the same lines.

So Terry Winograd, shown here as a young man. So long, long ago, Terry Winograd came to Stanford as the natural language processing faculty and Terry Winograd became disillusioned with the symbolic AI of those days and just gave it up altogether. And he reinvented himself as being an HCI person.

And so Terry was then essentially the person who established the HCI program at Stanford. But before he lost faith in symbolic AI, he talked about the coreference problem and pointed out a similar pair of examples here. So we have the city council refused the women a permit because they feared violence versus the city council refused the women a permit because they advocated violence.

So again, you have this situation where these two sentences have identical syntactic structure and they differ only in the choice of verb here. But once you add knowledge, common sense knowledge of how the human world works, it's pretty obvious how this should be interpreted: in the first one, they is referring to the city council, whereas in the second one, they is referring to the women.

And so coming off of that example of Terry's, these have been referred to as Winograd schemas. So the Winograd schema challenge is sort of choosing the right referent here. And so it's basically just doing pronominal anaphora. But the interesting thing is people have been interested in what are tests of general intelligence, and one famous general test of intelligence, which I won't talk about now, is the Turing test.

And there's been a lot of debate about problems with the Turing test and is it good? And so in particular, Hector Levesque, who's a very well-known senior AI person, actually proposed that a better alternative to the Turing test might be to do what he then dubbed Winograd schemas.

And a Winograd schema is just solving pronominal coreference in cases like this, where you have to have knowledge about the situation in the world to get the answer right. And so he's basically arguing that, you know, you can view really solving coreference as solving artificial intelligence. And that's sort of the position that Hobbs wanted to advocate.

So what he actually said about his algorithm was that the naive approach is quite good. Computationally speaking, it will be a long time before a semantically based algorithm is sophisticated enough to perform as well. And these results set a very high standard for any other approach to aim for.

And he was proven right about that, because it really took till around 2015 before people thought they could do without the Hobbs algorithm. But then he notes, yet there is every reason to pursue a semantically based approach. The naive algorithm does not work. Anyone can think of examples where it fails.

In these cases, it not only fails, it gives no indication that it has failed and offers no help in finding the real antecedent. And so I think this is actually still interesting stuff to think about because, you know, really for the kind of machine learning based coreference systems that we are building, you know, they're not a hot mess of rules like the Hobbs algorithm, but basically they're still sort of working out statistical preferences of what patterns are most likely and choosing the antecedent that way.

They really have exactly the same deficiencies still that Hobbs was talking about, right? They fail in various cases. It's easy to find places where they fail. The algorithms give you no idea when they fail. They're not really understanding the text in the way that a human does to determine the antecedent.

So we still actually have a lot more work to do before we're really doing full artificial intelligence. But I'd best get on now and actually tell you a bit about some coreference algorithms. Right? So the simple way of thinking about coreference is to say that you're making just a binary decision about a coreference pair.

So if you have your mentions, you can then say, well, I've come to my next mention, she, I want to work out what it's coreferent with. And I can just look at all of the mentions that came before it and say, is it coreferent or not? And do a binary decision.

So at training time, I'll be able to say I have positive examples, assuming I've got some data labeled for what's coreferent to what, as to these ones are coreferent. And I've got some negative examples of these ones are not coreferent. And what I want to do is build a model that learns to predict coreferent things.

And I can do that fairly straightforwardly in the kind of ways that we have talked about. So I train with the regular kind of cross entropy loss, where I'm now summing over every pairwise binary decision as to whether two mentions are coreferent to each other or not. And so then when I'm at test time, what I want to do is cluster the mentions that correspond to the same entity.

And I do that by making use of my pairwise scorer. So I can run my pairwise scorer, and it will give a probability or a score that any two mentions are coreferent. So by picking some threshold, like 0.5, I can add coreference links for when the classifier says it's above the threshold.

And then I do one more step to give me a clustering. I then say, okay, let's also make the transitive closure to give me clusters. So it thought that I and she were coreferent and my and she were coreferent. Therefore, I also have to regard I and my as coreferent.

And so that's sort of the completion by transitivity. And so since we always complete by transitivity, note that this algorithm is very sensitive to making any mistake in a positive sense. Because if you make one mistake, for example, you say that he and my are coreferent, then by transitivity, all of the mentions in the sentence become one big cluster and that they're all coreferent with each other.
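To make that transitive closure step concrete, here's a small Python sketch using union-find; the mention names and pairwise probabilities are made up for illustration, and in a real system they'd come from a trained pairwise classifier.

```python
# Turn pairwise coreference decisions into clusters by transitive
# closure, using union-find. Scores are invented for illustration.
mentions = ["I", "my", "she", "he"]
pair_probs = {            # hypothetical classifier outputs
    ("I", "my"): 0.9,
    ("I", "she"): 0.8,
    ("my", "she"): 0.3,   # below threshold, but linked transitively via "I"
    ("I", "he"): 0.1,
}

parent = {m: m for m in mentions}

def find(m):
    while parent[m] != m:
        parent[m] = parent[parent[m]]  # path compression
        m = parent[m]
    return m

def union(a, b):
    parent[find(a)] = find(b)

# Add a coreference link whenever the probability clears the threshold.
for (a, b), p in pair_probs.items():
    if p > 0.5:
        union(a, b)

clusters = {}
for m in mentions:
    clusters.setdefault(find(m), []).append(m)
print(list(clusters.values()))
```

Notice how "my" and "she" end up in the same cluster even though their direct pairwise score was below threshold, which is exactly the transitivity completion described above, and also exactly why one false positive link can merge everything into one big cluster.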

So that's a workable algorithm and people have often used it. But often people go a little bit beyond that and prefer a mention ranking model. So let me just explain the advantages of that. That normally, if you have a long document where it's Ralph Nader and he did this and some of them did something to him and we visited his house and blah, blah, blah, blah.

And then somebody voted for Nader because he. In terms of building a coreference classifier, it seems like it's easy and reasonable to be able to recover that this he refers to Nader. But in terms of building a classifier for it to recognize that this he should be referring to this Nader, which might be three paragraphs back, seems kind of unreasonable how you're going to recover that.

So those faraway ones might be almost impossible to get correct. And so that suggests that maybe we should have a different way of configuring this task. So instead of doing it that way, what we should say is, well, this he here has various possible antecedents and our job is to just choose one of them.

And that's almost sufficient, apart from we need to add one more choice, which is, well, some mentions won't be coreferent with anything that precedes, because we're introducing a new entity into the discourse. So we can add one more dummy mention, the NA mention, so it doesn't refer to anything previously in the discourse.

And then our job at each point is to do mention ranking to choose which one of these she refers to. And at that point, rather than doing binary yes/no classifiers, what we can do is say, aha, this is choose-one classification, and then we can use the kind of softmax classifiers that we've seen at many points previously.
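A minimal sketch of that mention-ranking decision: a softmax over the candidate antecedents plus the dummy NA option. The scores here are hypothetical; in a real model they'd come from a learned scoring function, with the NA score conventionally fixed at zero.

```python
import math

# Mention ranking: softmax over candidate antecedents plus a dummy
# "NA" (new entity) option. Scores are invented for illustration.
scores = {"NA": 0.0, "Nader": 2.5, "he": 1.0, "his": 0.7}

z = sum(math.exp(s) for s in scores.values())
probs = {a: math.exp(s) / z for a, s in scores.items()}

best = max(probs, key=probs.get)
print(best)  # the single highest-scoring antecedent is chosen
```

Unlike the binary mention-pair setup, exactly one antecedent (or NA) is chosen per mention, so the probabilities over candidates sum to one.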

Okay. So that gets us in business for building systems. And for either of these kind of models, there are several ways in which we can build the system. We could use any kind of traditional machine learning classifier. We could use a simple neural network. We can use more advanced ones with all of the tools that we've been learning about more recently.

So let me just quickly show you a simple neural network way of doing it. So this is a model that my PhD student, Kevin Clark, did in 2015. So not that long ago. But what he was doing was doing coreference resolution based on the mentions with a simple feedforward neural network, kind of in some sense like we did dependency parsing with a simple feedforward neural network.

So for the mention, it had word embeddings; the antecedent had word embeddings. There were some additional features of each of the mention and candidate antecedent. And then there were some final additional features that captured things like distance away, which you can't get from either the mention or the candidate on their own. And all of those features were just fed into several feedforward layers of a neural network.

And it gave you a score of are these things coreferent or not. And that by itself just worked pretty well. And I won't say more details about that. But what I do want to show is sort of a more advanced and modern neural coreference system. But before I do that, I want to take a digression and sort of say a few words about convolutional neural networks.

So the idea when you apply a convolutional neural network to language, i.e. to sequences, is that what you're going to do is compute vectors, features effectively, for every possible word subsequence of a certain length. So if you have a piece of text like tentative deal reached to keep government open, you might say I'm going to take every three words of that, i.e.

tentative deal reached, deal reached to, reached to keep, and I'm going to compute a vector based on that subsequence of words and use those computed vectors in my model by somehow grouping them together. So the canonical case of convolutional neural networks is in vision. And so if after this next quarter you go along to CS231N, you'll be able to spend weeks doing convolutional neural networks for vision.

And so the idea there is that you've got these convolutional filters that you sort of slide over an image and you compute a function of each place. So the sort of little red numbers are showing you what you're computing, but then you'll slide it over to the next position and fill in this cell, and then you'll slide it over the next position and fill in this cell, and then you'll slide it down and fill in this cell.

And so you've got this sort of little function of a patch, which you're sliding over your image and computing a convolution, which is just a dot product effectively, that you're then using to get an extra layer of representation. And so by sliding things over, you can pick out features and you've got a sort of a feature identifier that runs across every piece of the image.

Well, for language, we've just got a sequence, but you can do basically the same thing. And what you then have is a 1D convolution for text. So if here's my sentence, tentative deal reached to keep government open, what I can do is have, so these words have a word representation, this is my vector for each word.

And then I can have a filter, sometimes called a kernel, which I use for my convolution. And what I'm going to do is slide that down the text. So I can start with it, with the first three words, and then I sort of treat them as sort of elements I can dot product and sum, and then I can compute a value as to what they all add up to, which is minus one, it turns out.

And so then I might have a bias that I add on and get an updated value if my bias is plus one. And then I'd run it through a nonlinearity, and that will give me a final value. And then I'll slide my filter down, and I'd work out a computation for this window of three words, and take 0.5 times 3 plus 0.2 times 1, et cetera, and that comes out as this value.

I add the bias. I put it through my nonlinearity, and then I keep on sliding down, and I'll do the next three words, and keep on going down. And so that gives me a 1D convolution and computes a representation of the text. You might have noticed in the previous example that I started here with seven words, but because I wanted to have a window of three for my convolution, the end result is that things shrunk.

So in the output, I only had five things. That's not necessarily desirable. So commonly people will deal with that with padding. So if I put padding on both sides, I can then start my width-three convolution here, and compute this one, and then slide it down one, and compute this one.

And so now my output is the same size as my real input, and so that's a convolution with padding. Okay, so that was the start of things, but you know, how you get more power of the convolutional network is you don't only have one filter, you have several filters.

So if I have three filters, each of which will have their own bias and nonlinearity, I can then get a three dimensional representation coming out the end, and sort of you can think of these as conceptually computing different features of your text. Okay, so that gives us a kind of a new feature re-representation of our text.
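Here's a minimal pure-Python sketch of that 1D convolution with padding. The embedding values and filter weights are made up, there's just one filter rather than three, and a real system would of course use a library like PyTorch rather than hand-rolled loops.

```python
import math

# A minimal 1D convolution over word vectors. Each word has a
# d-dimensional embedding; a width-3 filter dot-products against each
# window of 3 words, adds a bias, and applies a tanh nonlinearity.
# All numbers are invented for illustration.
def conv1d(embeddings, kernel, bias, pad=True):
    d = len(embeddings[0])
    k = len(kernel) // d               # filter width in words
    if pad:
        zero = [0.0] * d               # one zero-vector pad on each side
        embeddings = [zero] + embeddings + [zero]
    out = []
    for i in range(len(embeddings) - k + 1):
        # flatten the window of k word vectors and dot with the kernel
        window = [x for word in embeddings[i:i + k] for x in word]
        score = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(math.tanh(score))   # nonlinearity
    return out

words = [[0.2, 0.1], [0.5, -0.3], [0.1, 0.4], [-0.2, 0.6]]  # 4 words, d=2
kernel = [0.3, -0.1, 0.2, 0.4, -0.2, 0.1]                   # one width-3 filter
out = conv1d(words, kernel, bias=0.1)
print(len(out))  # with padding, output length equals input length
```

To get the three-filter version, you'd simply run three different kernels (each with its own bias) and stack the resulting sequences into a three-dimensional feature representation per position.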

But commonly, we then want to somehow summarize what we have. And a very common way of summarizing what we have is to then do pooling. So if we sort of think of these features as detecting different things in the text, so you know, they might even be high level features like, you know, does this show signs of toxicity or hate speech?

Is there reference to something? So if you want to be interested in does it occur anywhere in the text, what people often then do is a max pooling operation, where for each feature, they simply sort of compute the maximum value it ever achieved in any position as you went through the text.

And say that this vector ends up as the sentence representation. Sometimes for other purposes, rather than max pooling, people use average pooling, where you take the averages of the different vectors to get the sentence representation. But in general, max pooling has been found to be more successful. And that's kind of because if you think of it as feature detectors that are wanting to detect whether something was present somewhere, then, you know, something like positive sentiment isn't going to be present in every three word subsequence you choose.
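As a small illustration of the two pooling operations, with made-up feature values for three filters at three window positions:

```python
# Max pooling vs average pooling over per-position feature vectors
# produced by several convolutional filters. Values are invented.
features = [            # one row per window position, one column per filter
    [0.2, -0.5, 0.1],
    [0.9,  0.3, -0.2],
    [0.1,  0.7,  0.4],
]

# Max pooling: for each filter, the maximum value over all positions —
# "did this feature fire anywhere in the text?"
max_pooled = [max(col) for col in zip(*features)]

# Average pooling: the mean value of each filter over all positions.
avg_pooled = [sum(col) / len(col) for col in zip(*features)]

print(max_pooled)
print(avg_pooled)
```

Note how max pooling keeps the strongest activation of each feature regardless of where it occurred, which is why it suits detecting whether something like positive sentiment appeared anywhere in the text.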

But if it was there somewhere, it's there. And so often max pooling works better. And so that's a very quick look at convolutional neural networks. Just to say this example is doing 1D convolutions with words. But a very common place that convolutional neural networks are being used in natural language is actually using them with characters.

And so what you can do is you can do convolutions over subsequences of the characters in the same way. And if you do that, this allows you to compute a representation for any sequence of characters. So you don't have any problems with being out of vocabulary or anything like that.

Because for any sequence of characters, you just compute your convolutional representation and max pool it. And so quite commonly, people use a character convolution to give a representation of words. Sometimes this is the only representation of words, but otherwise it's something that you use in addition to a word vector.

And so in both BiDAF and the model I'm about to show, at the base level, it makes use of both a word vector representation of the kind we saw right at the beginning of the course and a character level convolutional representation of the words. Okay. With that said, I now want to show you, before time runs out, an end to end neural coref model.

So the model I'm going to show you is Kenton Lee's one from the University of Washington, 2017. This is no longer the state of the art; I'll mention the state of the art at the end. But this was the first model that really said, get rid of all of that old stuff of having pipelines and mention detection first.

Build one end to end big model that does everything and returns coreference. So it's a good one to show. So compared to the earlier simple thing I showed, we're now going to process the text with BiLSTMs. We're going to make use of attention. And we're going to do all of mention detection and coreference in one step, end to end.

And the way it does that is by considering every span of the text up to a certain length as a candidate mention, and it just figures out a representation for it and whether it's coreferent to other things. So what we do at the start is we start with a sequence of words and we calculate from those a standard word embedding and a character level CNN embedding.

We then feed those as inputs into a bidirectional LSTM of the kind that we saw quite a lot of before. But then after this, what we do is we compute representations for spans. So when we have a sequence of words, we're then going to work out a representation of a sequence of words which we can then put into our coreference model.

So I can't fully illustrate it in this picture, but subsequences of different lengths, so like general, general electric, general electric said, will all have a span representation, of which I've only shown a subset in green. So how are those computed? Well, the way they're computed is that the span representation is a vector that concatenates several vectors, and it consists of four parts.

It consists of the representation that was computed for the start of the span from the LSTM, the representation for the end from the LSTM, that's over here. And then it has a third part that's kind of interesting. This is an attention-based representation that is calculated from the whole span, but particularly sort of looks for the head of the span.

And then there are still a few additional features. So it turns out that, you know, some of these additional things like length and so on are still a bit useful. So to work out the third part, which isn't just the beginning and the end, what's done is to calculate an attention-weighted average of the word embeddings.

So what you're doing is you're taking the X star representation of the final word of the span, and you're feeding that into a neural network to get attention scores for every word in the span, which are these three, and that's giving you an attention distribution as we've seen previously.

And then you're calculating the third component of this as an attention-weighted sum of the different words in the span. And so therefore you've got a sort of soft average of the representations of the words of the span. Okay. So then once you've got that, what you're doing is feeding these representations into a scorer for whether spans are coreferent mentions.
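Putting those four parts together, here's a toy sketch of the span representation. All the vectors and the attention scoring weights below are invented, and the real model computes attention scores with a learned feedforward network rather than the fixed dot-product vector used here.

```python
import math

# Sketch of the Lee et al. (2017) span representation: concatenate the
# BiLSTM state at the span start, the state at the span end, an
# attention-weighted average of the word embeddings in the span (a soft
# "head"), and extra features (here just span length). All values are
# invented for illustration.
span_states = [[0.1, 0.2], [0.4, -0.1], [0.3, 0.5]]  # BiLSTM outputs per word
span_embs   = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # word embeddings

def attention_weights(states, w):
    # Score each word with a (hypothetical) scoring vector w, then softmax.
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in states]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

alphas = attention_weights(span_states, w=[1.0, 0.5])

# Attention-weighted sum of word embeddings: the soft head of the span.
head = [sum(a * e[j] for a, e in zip(alphas, span_embs))
        for j in range(len(span_embs[0]))]

# Concatenate: start state + end state + soft head + length feature.
span_rep = span_states[0] + span_states[-1] + head + [len(span_states)]
print(len(span_rep))  # 2 + 2 + 2 + 1 = 7
```

Each candidate span gets such a vector, and pairs of span vectors are then fed to the coreference scorer.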

So you have a representation of the two spans, you have a score that's calculated for whether two different spans look coreferent, and that overall you're getting a score for are different spans looking coreferent or not. And so this model is just run end to end on all spans. And that sort of would get intractable if you scored literally every span in a long piece of text.

So they do some pruning. They sort of only allow spans up to a certain maximum size. They only consider pairs of spans that aren't too distant from each other, et cetera, et cetera. But basically it's sort of an approximation to just a complete comparison of spans. And this turns into a very effective coreference resolution algorithm.

Today it's not the best coreference resolution algorithm because, maybe not surprisingly, like everything else that we've been dealing with, these transformer models like BERT have now come along, and they produce even better results. So the best coreference systems now make use of BERT. In particular, when Danqi spoke, she briefly mentioned SpanBERT, which is a variant of BERT that blanks out, for reconstruction, subsequences of words rather than just a single word.

And SpanBERT has actually proven to be very effective for doing coreference, perhaps because you can blank out whole mentions. We've also, funnily enough, gotten gains by treating coreference as a question answering task. So effectively you can find a mention like he or the person, ask what is its antecedent, and get a question answering answer.

And that's a good way to do coreference. So if we put that together as time is running out, let me just sort of give you some sense of how results come out for coreference systems. So I'm skipping a bit actually that you can find in the slides, which is how coreference is scored.

But essentially it's scored on a clustering metric. So a perfect clustering would give you 100 and something that makes no correct decisions would give you zero. And so this is sort of how the coreference numbers have been panning out. So back in 2010, actually, this was a Stanford system.

This was a state of the art system for coreference. It won a competition. It was actually a non-machine learning model, which again goes to show how these rule-based methods can work kind of well in practice. And so its accuracy was around 55 for English, 50 for Chinese.

Then gradually machine learning, sort of statistical machine learning models, got a bit better. Wiseman was the very first neural coreference system, and that gave some gains. Here's a system that Kevin Clark and I did, which gave a little bit further gains. So Lee is the model that I've just shown you as the end-to-end model, and it got a bit of further gains.

But then again, what gave the huge breakthrough, just like in question answering, was the use of SpanBERT. So once we move to here, we're now using SpanBERT, and that's giving you about an extra 10% or so. The CorefQA technique proved to be useful. And the very latest best results effectively combine together a larger version of SpanBERT and CorefQA, and get up to 83.

So you might think from that that coref is doing really well and is getting close to solved, like other NLP tasks. Well, it's certainly true that in neural times the results have been getting way, way better than they had been before. But I would caution you that these results I just showed were on a corpus called OntoNotes, which is mainly newswire.

And it turns out that Newswire coreference is pretty easy. I mean, in particular, there's a lot of mention of the same entities, right? So the newspaper articles are full of mentions of the United States and China and leaders of the different countries. And it's sort of very easy to work out what they're coreferent to.

And so the coreference scores are fairly high. Whereas if what you do is take something like a page of dialogue from a novel and feed that into a system and say, okay, do the coreference correctly, you'll find pretty rapidly that the performance of the models is much more modest.

If you'd like to try out a coreference system for yourself, there are pointers to a couple of them here, where the top one's ours, Kevin Clark's neural coreference, and this one goes with the Hugging Face repository that we've mentioned.