
Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Pt 5: Adversarial Testing | Spring 2023


Transcript

Hello everyone, welcome back. This is the fifth screencast in our series on advanced behavioral testing for NLU. What we've done so far in the unit is reflect on the nature of behavioral testing and think about its motivations, and we've tried to come to grips with its strengths and its weaknesses.

With that context in place, I thought it would be good to look at some recent prominent cases of adversarial tests to see what lessons they can teach us, especially taking a kind of historical perspective, because we've learned a bunch of things about the nature of these challenges. Let's begin with SQuAD.

On this slide here, I have some screenshots from the SQuAD leaderboard that I took recently. The SQuAD leaderboard is nice and friendly to humans because it gives us this privileged place at the top of the leaderboard. You can see that humans are getting around 87% exact match, so 87% accuracy on SQuAD.

But be careful here: you actually have to travel all the way down to position 31 on the leaderboard to find a system that is worse than humans according to this metric. And many of the systems that are above humans are well above humans according to this metric on SQuAD.

So you can essentially picture the headlines: computers have gotten better than humans at answering questions, and the underlying evidence is SQuAD. And I think that's the kind of headline that motivated this first and very prominent adversarial test from Jia and Liang, 2017. This is an important initial entry into this modern era of adversarial testing in NLU.

To start, let me just remind you of what the SQuAD task is like. We're given a context passage as evidence, a question is posed, and the task is to find an answer to that question, with essentially a guarantee that that answer will be a substring of the context passage.

Jia and Liang had the intuition that models might be overfit to the particular data that was in SQuAD, and so they set up some adversaries to try to diagnose that problem. And the way their adversaries worked is that they would append misleading sentences to the ends of these passages.

So in this case, I've appended "Quarterback Leland Stanford had jersey number 37 in Champ Bowl 34." And what Jia and Liang find is that, with these appended sentences in place, models switch and start to answer "Leland Stanford." They're distracted by this new and false evidence. So that's worrisome, but you might have an intuition that we can surely overcome this adversary.
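
To make the setup concrete, here is a minimal sketch of probing an off-the-shelf extractive QA model with an appended distractor. This is not Jia and Liang's own code or one of the systems on their leaderboard; the checkpoint name, passage, and question are illustrative choices, with the distractor drawn from the example above.

```python
# A minimal sketch (not Jia and Liang's setup) of probing an extractive QA model
# with an appended distractor. "distilbert-base-cased-distilled-squad" is just an
# off-the-shelf SQuAD checkpoint from the Hugging Face Hub, chosen for illustration.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

# Illustrative passage and question; not an actual SQuAD item.
passage = "Quarterback John Elway wore jersey number 7 and won Super Bowl XXXIII at age 38."
question = "What is the name of the quarterback who was 38 in Super Bowl XXXIII?"

# Distractor in the style of Jia and Liang: plausible-looking but irrelevant and false.
distractor = "Quarterback Leland Stanford had jersey number 37 in Champ Bowl 34."

print(qa(question=question, context=passage))                     # original passage
print(qa(question=question, context=passage + " " + distractor))  # with appended distractor
```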

What we should do is take this augmented train set with these appended sentences on it and retrain our models, and then surely the models will overcome this adversary. And indeed, we find that they do overcome this particular adversary, and they will stop being misled by the appended sentence. But Jia and Liang are ahead of you.

What about an adversary where we simply prepend these misleading sentences to the evidence passages? They find again that models are distracted and start to answer "Leland Stanford" using that first new sentence. And you could think, well, we can now train on the augmented train set that has both the prepending and the appending, and maybe now we'll have overcome the adversary.

But you can see what kind of dynamic we've set up. Now we could put the misleading sentences in the middle of the passage, and again, we'd probably see models start to fall down. And fall down they did. Here is a kind of leaderboard showing original system performance for what at the time were top-performing systems for SQuAD, as well as their performance on this adversary that Jia and Liang had set up. Obviously there's an enormous drop in overall system performance for this adversary, and that is worrisome enough. But I think we should look even more closely. It's noteworthy that the original ranking has gotten essentially totally shuffled on this adversarial leaderboard.

The original number one system has fallen to position five. The original number two position went all the way to 10, and number three to 12. I think the original seventh-place system is now in the number one slot on the adversarial leaderboard. But the point is that there's essentially no relationship between the original and adversarial rankings.

This is a plot that kind of substantiates that. I have original system performance along the x-axis, adversarial performance along the y-axis, and it's just a cloud of dots showing no evident correlation between the two. So this probably suggests that systems were kind of overfit to the original SQuAD problem, and they're dealing with this adversary in pretty chaotic ways.
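
To put a number on that impression, one could compute a rank correlation between the two leaderboard columns. A minimal sketch, with placeholder scores rather than the actual Jia and Liang numbers:

```python
# A minimal sketch of quantifying the "shuffled leaderboard" observation.
# The scores below are placeholders, not the actual numbers from Jia and Liang 2017.
from scipy.stats import spearmanr

original_f1    = [81.0, 79.5, 78.9, 78.2, 77.8, 77.1, 76.5]  # hypothetical original SQuAD scores
adversarial_f1 = [39.0, 34.2, 41.5, 36.8, 44.0, 33.1, 46.2]  # hypothetical adversarial scores

rho, p = spearmanr(original_f1, adversarial_f1)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
# A rho near zero says the original ranking tells us little about the adversarial one.
```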

And that itself is worrisome. I should say that I'm not sure of the current state of things for this particular adversary. I would love to have evidence for you about how more modern transformer systems behave with these adversaries, but as far as I know, no one has done that systematic testing.

I think it would be valuable to have those data points. Let's move to a second example. This is natural language inference. What I've got on the slide here is a picture of performance on SNLI, one of the major benchmarks for this task, over time. So along the x-axis, I have time.

Along the y-axis, I have the F1 score, and the red line marks our estimate of human performance. And the blue line is tracking different systems that are from the published literature. I think it's important to emphasize that these are essentially all published papers. What you see is a very rapid progress over time, eventually surpassing our evidence of human performance.

And the line is kind of almost monotonically increasing, which strongly suggests to me that published papers are learning implicit lessons from earlier papers about how to do well on the SNLI task. But the point is, we do now have systems that are superhuman by this estimate, and those are published papers. The MultiNLI leaderboard is a little bit different.

This is hosted on Kaggle. Anyone can enter. And as a result, you get many more systems competing on this leaderboard, and you get much less of that kind of community-wide hill climbing on the task, because people who aren't communicating with each other are simply entering systems to see how they did.

So the blue line oscillates all over the place. But it's still a story of progress toward that estimate of human performance. And if you took these numbers at face value, you might conclude that we are developing systems that are really, really good at doing inference in natural language, at doing what is in effect common sense reasoning.

But again, the intuition behind adversaries is that we might worry about that. And one of the first and most influential entries into the adversarial space for NLI was Glockner et al., 2018. This is the Breaking NLI paper. And what they did is conceptually very simple and draws really nicely on intuitions around systematicity and compositionality.

I've got two examples on the table here. The first one is going to play on synonyms. The original premise from SNLI is "A little girl kneeling in the dirt crying," and that entails "A little girl is very sad." What they did for their adversary is simply change "sad" to "unhappy."

Those are synonyms, essentially. And so we have an expectation that systems will continue to say entails in this case. But what Glockner et al. observe is that even the best systems are apt to start switching to calling this a contradiction case. And they do that because they're kind of overfit to assuming that the presence of negation is an indicator for contradiction.

And again, I love this test because it's clearly drawing on intuition around systematicity. We assume that substitution of synonyms should preserve the SNLI label, and systems are just not obeying that kind of underlying principle. The second example is sort of similar. The premise is "An elderly couple are sitting outside a restaurant enjoying wine."

The original SNLI hypothesis is "A couple drinking wine," which had the entailment label. For Breaking NLI, they switched "wine" to "champagne." Now we have two terms that are kind of siblings in the conceptual hierarchy. They're disjoint from each other, but they're very semantically related. The human intuition is that these examples are now neutral because wine and champagne are disjoint.

But systems have only a very fuzzy understanding of that kind of lexical nuance. And so they are very prone to saying entails for this case as well. Again, an intuition about the systematicity of the lexicon and how it will play into our judgments about natural language inference. And we're just not seeing human-like behavior from these systems.
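
As we'll see in a moment, these substitutions were built from WordNet. Here is a rough sketch, assuming NLTK's WordNet interface, of how one might inspect the lexical relations behind these two swaps; it is illustrative only, not Glockner et al.'s construction code, and the exact synsets returned depend on the WordNet sense inventory.

```python
# A rough sketch of inspecting WordNet relations of the kind behind these swaps.
# Illustrative only; not Glockner et al.'s construction code. Requires nltk and
# the WordNet data.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# Synonym-style swap: which lemmas does WordNet group with "sad"?
sad_lemmas = {lemma for synset in wn.synsets("sad", pos=wn.ADJ)
              for lemma in synset.lemma_names()}
print("lemmas WordNet relates to 'sad':", sorted(sad_lemmas))

# Semantically related nouns: where do "wine" and "champagne" sit relative to each other?
wine, champagne = wn.synset("wine.n.01"), wn.synset("champagne.n.01")
print("lowest common hypernyms:", wine.lowest_common_hypernyms(champagne))
print("path similarity:", wine.path_similarity(champagne))
```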

Here is the results table from the Glockner et al. paper. We should exempt the final lines, I think, on the grounds that those systems consumed WordNet, which was the resource that was used to create the adversary. And look at the systems that are in the top three rows. What you find is that they do really well on SNLI's test set.

They're in the mid-80s. And they do abysmally on the new test set; there are huge deltas in performance there. So that's interesting and worrisome. But we should pause here. Let's look more carefully at this table, in particular at the model column. All of these models are from the pre-transformer era.

In fact, they're just on the cusp of the transformer era. And these are all instances of recurrent neural networks with tons of attention mechanisms added onto them, almost reaching the insight about how we should go forward with attention being the primary mechanism. And you see their historical period reflected also in the SNLI test results, which are lower than systems we routinely train today built on the transformer architecture.

So we should ask ourselves, what will happen if we test some of these newer transformer models? So I decided to do that. I simply downloaded a RoBERTa model that was fine-tuned on the MultiNLI dataset, so a different dataset from SNLI, which is what Glockner et al. used to create their challenge set.

This is the code for doing it; it's a little bit fiddly, so I decided to reproduce it here. And the headline is that this model, off the shelf, essentially solves the Glockner et al. adversary. We should look just at contradiction and entailment, because neutral is too small a category in this little challenge set.
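
Here is a minimal sketch of the kind of check being described, using the public roberta-large-mnli checkpoint from the Hugging Face Hub. This may not be the exact model or evaluation code from the lecture; the fiddly part is mainly mapping the model's label ids onto the Breaking NLI labels.

```python
# A minimal sketch of evaluating an MNLI-fine-tuned RoBERTa on a Glockner-style pair.
# "roberta-large-mnli" is a public Hugging Face checkpoint; it may not be the exact
# model used in the lecture, and the label alignment is the fiddly part.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()

def predict_nli(premise, hypothesis):
    """Return the model's label (CONTRADICTION / NEUTRAL / ENTAILMENT) for a pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))]

premise = "A little girl kneeling in the dirt crying."
print(predict_nli(premise, "A little girl is very sad."))      # original SNLI hypothesis
print(predict_nli(premise, "A little girl is very unhappy."))  # Glockner-style synonym swap
```

A full evaluation would simply loop this prediction function over the Breaking NLI examples and compute per-label F1 scores.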

And you see impressively high F1 scores, dramatically different from the results that Glockner et al. reported from the pre-transformer era. And remember, this is even under domain shift, because this RoBERTa model was trained on MultiNLI and we're testing it on examples derived from SNLI. So that looks like a real success story, an adversary that has essentially been overcome.

And I think even the most cynical among us would regard that as an instance of progress. Let's look at a second NLI case. This one will play out somewhat differently, but again, there are some important lessons learned. This is from Naik et al. 2018. It's a larger adversarial test benchmark.

It's got a bunch of different categories: antonyms, numerical reasoning, word overlap, negation, and I think a few others. For some of them, we're doing something very similar to Glockner et al., drawing on underlying intuitions about compositionality or systematicity for things like love/hate antonym pairs or reasoning around numerical terms.

But the dataset also includes some things that look more directly adversarial to me, as when we append some sort of redundant or confusing elements onto the end of the examples to see whether that affects system performance. That's more of a SQuAD adversary intuition, I would say.
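
Perturbations of that second kind can be approximated with simple string edits. The sketch below is my paraphrase of the idea, not the exact templates from the paper:

```python
# A rough sketch of stress-test-style perturbations in the spirit of Naik et al. (2018).
# These are my own approximations of the idea, not the paper's exact templates.

def add_tautology(hypothesis: str) -> str:
    """Append a semantically vacuous clause that should not change the NLI label."""
    return hypothesis.rstrip(".") + " and true is true."

def add_length_padding(premise: str, n: int = 5) -> str:
    """Pad the premise with repeated vacuous clauses to create a length mismatch."""
    return premise.rstrip(".") + "," + " and true is true" * n + "."

print(add_tautology("A couple is drinking wine."))
print(add_length_padding("An elderly couple are sitting outside a restaurant enjoying wine."))
```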

So that's the benchmark. It's pretty substantial; there are lots of examples in it. And the results tell a similar story. We have systems that are pretty good at MultiNLI, but really, really bad at essentially all of the splits from this adversarial benchmark. But we've seen this dataset before.

We talked about it when we talked about inoculation by fine-tuning. From the inoculation by fine-tuning paper, five of the six panels in the central figure are actually from this Naik et al. benchmark, and they tell very different stories. We saw that word overlap and negation were diagnosed as dataset weaknesses, instances where models had no problem solving those tasks once given enough relevant training evidence.

Spelling errors and length mismatch, by contrast, were model weaknesses; models really couldn't get traction on those, so those might still worry us. And numerical reasoning is an example of a dataset artifact, where there's something about the examples themselves that really disrupts the performance of otherwise pretty solid models. So three very different lessons coming from the same challenge or adversarial benchmark.
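
As a reminder of how that diagnosis works, here is a conceptual sketch of the inoculation-by-fine-tuning recipe. The fine_tune and evaluate arguments are hypothetical stand-ins for an ordinary training loop and evaluation metric; they are not functions from any particular library.

```python
# A conceptual sketch of inoculation by fine-tuning (Liu et al. 2019), not the authors'
# code. `fine_tune` and `evaluate` are hypothetical stand-ins for an ordinary training
# loop and an accuracy/F1 evaluation; nothing here names a real library API.

def inoculate(model, challenge_train, original_dev, challenge_dev,
              fine_tune, evaluate, sample_sizes=(100, 500, 1000)):
    """Fine-tune on small samples of challenge data, tracking both dev sets."""
    outcomes = []
    for n in sample_sizes:
        tuned = fine_tune(model, challenge_train[:n])        # a few epochs on n challenge examples
        outcomes.append({
            "n": n,
            "original_dev": evaluate(tuned, original_dev),    # e.g. MultiNLI dev performance
            "challenge_dev": evaluate(tuned, challenge_dev),  # e.g. one stress-test split
        })
    return outcomes

# Reading the outcomes:
#   challenge score recovers, original score holds  -> dataset weakness
#   challenge score stays low                       -> model weakness
#   original score degrades as challenge recovers   -> dataset artifact / distribution quirk
```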

And I think that's also a sign of progress, because we now have the tooling to help us go one layer deeper in understanding why models might be failing on these different challenge test sets, which is, after all, the kind of analytic insight that we set out to achieve with this mode of behavioral testing.

Thank you.