Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Pt 5: Adversarial Testing | Spring 2023
This is the fifth screencast in our series on advanced behavioral testing for NLU. What we've done so far in the unit is reflect on the nature of behavioral testing, think about its motivations, and try to come to grips with its strengths and weaknesses. With that context in place, I thought it would be good to look at some recent prominent cases of adversarial tests to see what lessons they can teach us, taking a kind of historical perspective, because we've learned a bunch of things about the nature of these adversarial tests.
On this slide here, I have some screenshots from the SQuAD leaderboard that I took recently. The SQuAD leaderboard is nice and friendly to humans because it gives us a privileged entry reporting human performance. You can see that humans are getting around 87% exact match, so 87% accuracy on SQuAD. But be careful here: you actually have to travel all the way down to position 31 on the leaderboard to find a system that is worse than humans according to this metric, and many of the systems above humans are well above humans on it. So you can essentially picture the headlines: machines have gotten better than humans at answering questions, and the underlying evidence is the SQuAD leaderboard. I think that's the kind of headline that motivated the first and very prominent adversarial test I want to discuss, from Jia and Liang. This is an important initial entry into this modern era of adversarial testing in NLU.
To start, let me just remind you of what the SQuAD task is like. A question is posed, and the task is to find an answer to that question, and we have essentially a guarantee that the answer will be a literal substring of the context passage.
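To make that format concrete, here is a minimal sketch of a SQuAD-style example and the substring guarantee; the passage, question, and answer are illustrative stand-ins, not actual SQuAD data.

```python
# A SQuAD-style QA example: the gold answer is guaranteed to appear as a
# literal substring of the context, so systems can treat the task as span
# selection over the passage. (Illustrative example, not real SQuAD data.)
example = {
    "context": (
        "Leland Stanford and his wife, Jane Stanford, founded the "
        "university in 1885 in memory of their son."
    ),
    "question": "Who founded the university?",
    "answer": "Leland Stanford",
}

# The guarantee that span-selection models rely on:
assert example["answer"] in example["context"]
print(example["context"].index(example["answer"]))  # start offset of the answer span
```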
Jia and Liang had the intuition that models might be overfit to the particular data that was in SQuAD, and so they set up some adversaries to try to diagnose that problem. The way their adversaries worked is that they would append misleading sentences to the context passage. In this case, I've appended "quarterback Leland Stanford had jersey number 37 in Champ Bowl XXXIV." What Jia and Liang find is that, with these appended sentences in place, models switch to answering "Leland Stanford." They're distracted by this new and false evidence. That's worrisome, but you might have an intuition that we can surely overcome this adversary: what we should do is take the augmented train set with these appended sentences, retrain our models, and then surely the models will overcome the adversary. And indeed, we find that they do overcome this particular adversary, and they stop being distracted by the appended sentences.
What about an adversary where we simply prepend these misleading sentences to the evidence passage? Jia and Liang find again that models are distracted and start to answer "Leland Stanford" using the misleading prepended sentence. And you could think, well, we can now train on an augmented train set that has both the prepending and the appending, and maybe now we'll have overcome the adversary. But you can see what kind of dynamic we've set up: now we could put the misleading sentences in the middle of the passage, and again we'd probably see models start to fall down.
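Here is a minimal sketch of this family of perturbations: inserting a distractor sentence at different positions in the passage without changing the gold answer. The function and variable names are my own, and the distractor is the one quoted above; this is not Jia and Liang's actual code.

```python
# Distractor-insertion adversaries in the spirit of Jia and Liang: the
# distractor looks relevant to the question but supports a wrong answer,
# and inserting it does not change the gold answer span.
DISTRACTOR = "Quarterback Leland Stanford had jersey number 37 in Champ Bowl XXXIV."

def perturb(context: str, distractor: str = DISTRACTOR, position: str = "append") -> str:
    """Return the context with the distractor appended, prepended, or
    dropped into the middle of the passage."""
    if position == "append":
        return context + " " + distractor
    if position == "prepend":
        return distractor + " " + context
    if position == "middle":
        sentences = context.split(". ")
        sentences.insert(len(sentences) // 2, distractor.rstrip("."))
        return ". ".join(sentences)
    raise ValueError(f"unknown position: {position}")

passage = "Peyton Manning wore jersey number 18. He was the winning quarterback in Super Bowl 50."
for pos in ("append", "prepend", "middle"):
    print(pos, "->", perturb(passage, position=pos))
```

Retraining on one variant, as discussed above, tends to defend against that variant without defending against the others, which is exactly the dynamic this sketch makes easy to see.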
Here is a kind of leaderboard showing original system performance for what at the time were top-performing systems on SQuAD, as well as their performance on the adversary that Jia and Liang developed. Obviously there's an enormous drop in overall system performance under this adversary. But I think we should look even more closely. It's noteworthy that the original ranking has gotten essentially totally shuffled under the adversary. The original number-one system has fallen to position five, the original number two went all the way to ten, and number three to twelve. I think the original seventh-place system is now in the number-one slot on the adversary. But the point is that there's essentially no relationship between original and adversarial performance.
This is a plot that kind of substantiates that. I have original system performance along the x-axis and adversarial performance along the y-axis, and it's just a cloud of dots showing no evident correlation between the two. So this probably suggests that systems were kind of overfit to the original SQuAD problem, and they're dealing with this adversary in pretty chaotic ways.
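If you want to quantify that "cloud of dots" impression, a rank correlation between original and adversarial scores is the natural check. A minimal sketch with invented placeholder numbers (not the published leaderboard values):

```python
# Sketch: how well does original SQuAD performance predict adversarial
# performance? Scores below are invented placeholders for illustration.
from scipy.stats import spearmanr

original_f1 = [81.0, 79.5, 78.8, 78.2, 77.9, 77.1, 76.5, 75.9]
adversarial_f1 = [39.0, 46.2, 34.1, 41.7, 35.3, 47.8, 37.5, 44.0]

rho, pval = spearmanr(original_f1, adversarial_f1)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2f})")
# A rho near zero substantiates the claim that the original ranking carries
# essentially no information about the adversarial ranking.
```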
I should say that I'm not sure what the current state of things is for this particular adversary. I would love to have evidence for you about more modern transformer systems and how they fare, but as far as I know, no one has done that systematic testing. I think it would be valuable to have those data points.
What I've got on the slide here is a picture of performance on SNLI, one of the major benchmarks for natural language inference. Along the y-axis, I have the F1 score, and the red line marks our estimate of human performance. The blue line tracks different systems from the published literature, and I think it's important to emphasize that these are essentially all published papers. What you see is very rapid progress over time, eventually surpassing our estimate of human performance. The line is almost monotonically increasing, which strongly suggests to me that published papers are learning implicit lessons from earlier papers about how to do well on this task. But the point is, we do now have systems that are superhuman on this benchmark.
The MultiNLI leaderboard is a little bit different: anyone can enter systems directly. As a result, you get many more systems competing on this leaderboard, and you get much less of that kind of community-wide hill climbing on the task, because people who aren't communicating with each other are simply entering systems to see how they did. So the blue line oscillates all over the place, but it's still a story of progress toward that estimate of human performance. And if you took these numbers at face value, you might conclude that we are developing systems that are really, really good at doing inference in natural language. But again, the intuition behind adversaries is that we might worry about that.
One of the first and most influential entries into the adversarial space for NLI is the Breaking NLI dataset from Glockner et al. What they did is conceptually very simple and draws really nicely on intuitions around systematicity. The original premise from SNLI is "a little girl kneeling in the dirt crying," and the hypothesis says that the little girl is sad; the gold label is entailment. What they did for their adversary is simply change "sad" to "unhappy." So we have an expectation that systems will continue to say entails in this case. But what Glockner et al. observe is that even the best systems are apt to start switching to contradiction, and they do that because they're kind of overfit to assuming that the presence of negation is a signal of contradiction. And again, I love this test because it's clearly drawing on an intuition around systematicity: we assume that substitution of synonyms should preserve the SNLI label.
And systems are just not obeying that kind of underlying principle. Here is another example. The premise is "an elderly couple are sitting outside a restaurant, enjoying wine," and the original SNLI hypothesis is "a couple drinking wine." For Breaking NLI, they switched "wine" to "champagne." Now we have two terms that are kind of siblings in the conceptual hierarchy: they're disjoint from each other, but they're very semantically related. The human intuition is that these examples are now neutral, because wine and champagne are closely related but distinct categories. But systems have only a very fuzzy understanding of that kind of lexical nuance, and so they are very prone to saying entails for this case as well. Again, this is an intuition about the systematicity of the lexicon and how it will play into our inference judgments, and we're just not seeing human-like behavior from these systems.
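Glockner et al. built these pairs by making single-word substitutions using lexical resources. Here is a rough sketch of that idea using NLTK's WordNet interface to collect synonym candidates (expected to preserve an entailment label) and co-hyponym candidates (expected to break it); this is my own simplification, not their actual generation pipeline.

```python
# Sketch of Breaking-NLI-style single-word substitutions via WordNet.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def synonym_candidates(word, pos):
    """Lemmas sharing a synset with `word`: label-preserving swaps."""
    out = set()
    for syn in wn.synsets(word, pos=pos):
        out.update(lemma.name().replace("_", " ") for lemma in syn.lemmas())
    out.discard(word)
    return out

def cohyponym_candidates(word, pos):
    """Sister terms under a shared hypernym: label-changing swaps."""
    out = set()
    for syn in wn.synsets(word, pos=pos):
        for hyper in syn.hypernyms():
            for sibling in hyper.hyponyms():
                out.update(lemma.name().replace("_", " ") for lemma in sibling.lemmas())
    out.discard(word)
    return out

print(synonym_candidates("sad", wn.ADJ))      # candidate synonyms for 'sad'
print(cohyponym_candidates("wine", wn.NOUN))  # sister terms of 'wine', e.g. other drinks
```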
Here is the results table from the Glockner et al. paper. We should exempt the final lines, I think, on the grounds that those systems consumed WordNet, which was the resource that was used to create the adversary. So look at the systems in the top three rows. What you find is that they do really well on SNLI's test set, but their performance drops dramatically on the new adversarial test set.
Let's look more carefully at this table, in particular at the model column. All of these models are from the pre-transformer era; in fact, they're just on the cusp of the transformer era. These are all instances of recurrent neural networks with lots of attention mechanisms added onto them, almost reaching the insight that attention is what we should build on going forward. And you see their historical period reflected also in the SNLI test results, which are lower than those of systems we routinely train today on the transformer architecture.
So we should ask ourselves: what will happen if we test some of these newer transformer-based systems on this benchmark? I simply downloaded a RoBERTa model that was fine-tuned on the MultiNLI dataset, so different from SNLI, which was used by Glockner et al. to create their dataset. And the headline here is that this model, off the shelf, essentially solves the Glockner et al. benchmark. We should look just at contradiction and entailment, because neutral is too small a category in this dataset, and you see impressively high F1 scores, dramatically different from the results that Glockner et al. reported. And remember, this is even under a domain shift, because this RoBERTa model was trained on MultiNLI and we're testing it on examples derived from SNLI.
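If you want to run this kind of check yourself, a minimal sketch looks like the following. It uses the publicly available roberta-large-mnli checkpoint from the Hugging Face hub (a RoBERTa model fine-tuned on MultiNLI); I can't say whether that is the exact checkpoint behind the numbers on the slide, and the premise/hypothesis pair is just the Glockner-style example discussed above.

```python
# Sketch: score a MultiNLI-tuned RoBERTa model on a Breaking-NLI-style pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # RoBERTa fine-tuned on MultiNLI
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

premise = "A little girl kneeling in the dirt crying."
hypothesis = "A little girl is unhappy."  # adversarial synonym swap for 'sad'

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
label = model.config.id2label[logits.argmax(dim=-1).item()]
print(label)  # a robust model should still predict ENTAILMENT here
```

Looping this over all of the Glockner et al. examples and computing per-label F1 (setting aside neutral, as above) reproduces the kind of evaluation summarized on the slide.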
So that looks like a real success story, an adversary that has essentially been overcome, and I think even the most cynical among us would regard that as an instance of progress.
But again, there are some important lessons to be learned. The second adversarial benchmark I want to mention is the suite of NLI stress tests from Naik et al. It's got a bunch of different categories: antonyms, numerical reasoning, word overlap, negation, length mismatch, and spelling errors. For some of them, we're doing something very similar to Glockner et al., drawing on underlying intuitions about compositionality or systematicity, for example swapping antonyms like "love" and "hate." But the dataset also includes some things that look more directly adversarial to me, as when we append some sort of redundant or confusing element onto the end of the examples to see whether that affects system performance. That's more of a SQuAD-adversary kind of intuition, I would say.
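Here is a minimal sketch of that more directly adversarial construction: tacking a semantically vacuous clause onto the hypothesis so that surface statistics change while the correct label does not. The template strings are approximations of the stress-test constructions, not necessarily the exact published ones.

```python
# Sketch: stress-test-style distractors that append a vacuous clause to the
# hypothesis. The gold label is unchanged, so any prediction flip reflects
# sensitivity to surface cues rather than to meaning.
TAUTOLOGY = "and true is true"                # word-overlap-style distractor
NEGATION_TAUTOLOGY = "and false is not true"  # negation-style distractor

def append_distractor(hypothesis: str, distractor: str = TAUTOLOGY) -> str:
    return hypothesis.rstrip(".") + " " + distractor + "."

hypothesis = "A couple is drinking wine."
print(append_distractor(hypothesis))                      # word-overlap variant
print(append_distractor(hypothesis, NEGATION_TAUTOLOGY))  # negation variant
```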
We have systems that are pretty good at MultiNLI but really, really bad at essentially all of the splits from this adversarial benchmark. We talked about this benchmark when we talked about inoculation by fine-tuning.
From the inoculation-by-fine-tuning paper, five of the six panels in its central figure report results for splits from this stress-test benchmark. We saw that word overlap and negation were diagnosed as dataset weaknesses, instances where models had no problem solving those tasks once they were given enough relevant evidence. Spelling errors and length mismatch, by contrast, were model weaknesses; models really couldn't get traction on those even after fine-tuning on them. And numerical reasoning is an example of a dataset artifact, where there's something about the examples themselves that really disrupts the performance of otherwise pretty solid models.
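The logic behind those three diagnoses can be written down directly. Here is a small sketch of the decision rule; the function name, tolerance, and numbers are mine for illustration, whereas the actual paper compares full accuracy curves as a function of how many challenge examples are used for fine-tuning.

```python
# Sketch of the inoculation-by-fine-tuning diagnosis: fine-tune briefly on a
# few challenge-set examples, then compare accuracies before and after.
def diagnose(orig_before, chall_before, orig_after, chall_after, tol=0.05):
    """Classify a challenge split as a dataset weakness, model weakness,
    or dataset artifact, following the three inoculation outcomes."""
    gap_closed = chall_after >= orig_after - tol      # challenge performance recovers
    orig_preserved = orig_after >= orig_before - tol  # original performance retained
    if gap_closed and orig_preserved:
        return "dataset weakness"   # original training data just lacked such examples
    if not gap_closed and orig_preserved:
        return "model weakness"     # the model cannot learn the phenomenon
    return "dataset artifact"       # challenge data degrades original performance

# Illustrative numbers only (not from the paper):
print(diagnose(0.85, 0.40, 0.84, 0.83))  # -> dataset weakness
print(diagnose(0.85, 0.40, 0.84, 0.45))  # -> model weakness
print(diagnose(0.85, 0.40, 0.60, 0.80))  # -> dataset artifact
```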
So we have three very different lessons coming from the same challenge or adversarial benchmark. And I think that's also a sign of progress, because we now have the tooling to help us go one layer deeper in understanding why models might be failing on these different challenge test sets, which is, after all, the kind of analytic insight that we set out to achieve with this mode of behavioral testing.