
Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 6: Adversarial NLI | Spring 2023


Transcript

Welcome back everyone. This is part six in our series on advanced behavioral testing for NLU. To this point, we've been focused on adversarial testing. We're now going to take a more expansive view and think about the potential benefits of training on adversarial cases. The foundational entry in this literature is the ANLI paper and the associated benchmark.

As far as I know, ANLI is the first attempt to create a really large train set that is filled with adversarial examples. That is, with examples that fooled a top performing model but were intuitive for humans. I think it's fair to say that ANLI is a direct response to the adversarial test results that we reviewed in the previous screencast where we saw NLI models that were surpassing our estimates for human performance but nonetheless falling down on very simple phenomena turning on systematicity or compositionality in language.

The vision for ANLI is that by introducing an adversarial dynamic into the train set creation, we can get models that are more robust. Here's how dataset creation worked. The annotator is presented with a premise sentence and a condition, which is one of the NLI labels: entailment, contradiction, or neutral.

The annotator writes a hypothesis, and then a state-of-the-art model makes a prediction about the resulting premise-hypothesis pair. If the model's prediction matches the condition, that is, if the model was correct in some sense, the annotator returns to the hypothesis-writing step to try again. Whereas if the model was fooled, the premise-hypothesis pair is independently validated by other human annotators.
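
To make that loop concrete, here is a minimal sketch of the collection dynamic in Python. This is not the authors' actual pipeline: the three callables (write_hypothesis, predict, validate) are hypothetical placeholders standing in for the annotator interface, the state-of-the-art model in the loop, and the independent human validation step.

```python
NLI_LABELS = ("entailment", "contradiction", "neutral")


def collect_adversarial_example(premise, target_label, write_hypothesis,
                                predict, validate, max_attempts=10):
    """Ask for hypotheses until one fools the model and survives validation."""
    assert target_label in NLI_LABELS
    for _ in range(max_attempts):
        hypothesis = write_hypothesis(premise, target_label)
        if predict(premise, hypothesis) == target_label:
            # The model got it right, so this pair isn't adversarial; the
            # annotator goes back and tries another hypothesis. (In the real
            # dataset, these non-fooling pairs are still kept for training.)
            continue
        if validate(premise, hypothesis, target_label):
            # The model was fooled and independent annotators agree that the
            # target label is correct: a verified adversarial example.
            return {"premise": premise,
                    "hypothesis": hypothesis,
                    "label": target_label}
    return None  # gave up without finding a verified adversarial example
```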

The result of this dynamic, of this interaction with a top-performing model, is a train set that is full of really hard cases, cases that fooled this top-performing model, in addition to cases that didn't fool that model. The examples are interesting. The premises in ANLI tend to be long.

The hypotheses are, of course, challenging. Interestingly, the dataset also contains these reason texts. This is the annotator's best attempt to explain why the model might have struggled with that particular example. As far as I know, the reason texts haven't been used very much in the literature, but they strike me as an interesting source of indirect supervision about the task. You might check those out.
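
If you want to look at these fields yourself, here is a short sketch, assuming the copy of ANLI distributed through the Hugging Face `datasets` library under the name "anli", with per-round splits such as train_r1, dev_r1, and test_r1; the exact split and field names are worth double-checking against the dataset card.

```python
from datasets import load_dataset

# Load all rounds; the result is keyed by splits like "train_r1", "dev_r1", ...
anli = load_dataset("anli")

example = anli["train_r1"][0]
print(example["premise"])     # premises tend to be long, multi-sentence passages
print(example["hypothesis"])  # the annotator-written (often adversarial) hypothesis
print(example["label"])       # integer-coded NLI label
print(example["reason"])      # the annotator's explanation of why the model may have erred
```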

This is the core results table for the ANLI paper. There's a lot of information here, but I think the story is pretty straightforward. Let's focus on the BERT model. The BERT model is doing really well on SNLI and MultiNLI across all of these different variants of the training regimes.

When the model is trained only on SNLI and MultiNLI, it does really poorly on ANLI. You can see ANLI had three rounds. When we pool them together, this model gets around 20 percent accuracy. As we take that model and augment its training data with ANLI data from previous rounds, we do see improvements overall in the ANLI column, which is encouraging.

It looks like the models are getting better at the task as they get more of these adversarial examples as part of training. But the fundamental insight here is that performance on ANLI is well below performance on the other benchmarks. This is a substantial challenge, and I believe it still stands: models do not excel at ANLI even to this day, as far as I know.
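
If you want to experiment with this kind of training-set augmentation yourself, here is a sketch of pooling SNLI, MultiNLI, and ANLI round 1, assuming the Hugging Face `datasets` library and the Hub identifiers "snli", "multi_nli", and "anli"; the shared column names and label encodings are assumptions to verify against the dataset cards before training on the result.

```python
from datasets import load_dataset, concatenate_datasets

snli = load_dataset("snli", split="train")
mnli = load_dataset("multi_nli", split="train")
anli_r1 = load_dataset("anli", split="train_r1")

cols = ["premise", "hypothesis", "label"]
parts = []
for d in (snli, mnli, anli_r1):
    d = d.filter(lambda ex: ex["label"] != -1)  # drop examples without a gold label
    parts.append(d.select_columns(cols))        # keep only the columns shared by all three

combined = concatenate_datasets(parts)          # pooled training set
print(len(combined))
```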

One thing I love about ANLI is that it projects a really interesting vision for the future development of train and test assets for the field. Actually, all credit is due to Zellers et al., who also described this vision in their papers on SWAG and HellaSwag.

They write, "A path for NLP progress going forward towards benchmarks that adversarially co-evolve with evolving state-of-the-art models." I didn't have time to tell this full story in detail, but the Zellers et al. work is an interesting story. There are two papers. The first one introduced SWAG, which is a synthetically created train and test environment for adversarial testing.

They found that it was very difficult, but when the BERT paper came out, BERT essentially solved the SWAG problem. In response to that, Zellers et al. made some adjustments to the SWAG dataset that produced HellaSwag. HellaSwag was substantially harder for BERT, and I believe that HellaSwag remains a challenging benchmark to this day.

I think that started us on the path of seeing how productive it could be to create datasets, use them to develop models, and then, when models seem to succeed, respond with even harder challenges. In the ANLI paper, they project this vision very directly: this process yields a "moving post," a dynamic target for NLU systems, rather than a static benchmark that will eventually saturate.

This sounds so productive to me. Throughout the field, large teams of very talented people spend lots of time and money getting epsilon more performance out of our established benchmarks. Wouldn't it be wonderful if instead, when we saw a benchmark saturating, we simply created new benchmarks and posed new challenges for ourselves?

I think it's a very safe bet that models would improve more rapidly and become more capable if we did this moving-post thing. That really is the vision for Dynabench. Dynabench is an open-source platform for doing, among other things, dynamic adversarial data collection. Dynabench has produced a number of datasets to this point.

ANLI is the first one; that's the precursor. We also have Dynabench-derived datasets for sentiment, a few for QA, and a number for hate speech, including counter-speech to hate speech and one on German hate speech. I think this list will continue to grow and offer us these incredible new resources.

Let me stop there. For the next screencast, I'm going to do a deep dive on a Dynabench-derived dataset that we created called DynaSent.