Stanford XCS224U: NLU | NLP Methods and Metrics, Part 4: Datasets | Spring 2023
Chapters
0:00 Intro
0:24 Water and air for our field
1:07 We ask a lot of our datasets
2:22 Benchmarks saturate faster than ever
3:16 Limitations found more quickly
5:33 Central questions
6:31 Trade-offs
8:56 DynaSent: Prompts increase naturalism
10:32 Adversarial examples or the most common cases?
11:20 Dynamics of adversarial datasets
11:28 Counterpoint from Bowman and Dahl (2021)
14:00 The job to be done
15:14 Major lessons thus far
16:53 Synthetic or naturalistic benchmarks
18:08 Negation as a learning target
19:01 MoNLI: A slightly synthetic dataset
19:52 MoNLI as challenge dataset
20:41 The value of messy data
21:08 Other vital issues for datasets
00:00:06.040 |
This is part 4 in our series on methods and metrics. 00:00:20.920 |
Our topic is the role that datasets play in our field and how we construct them. 00:00:33.920 |
Jacques Cousteau described water and air as "the two essential fluids on which all life depends." 00:00:39.340 |
For us, data is the resource on which all progress in the field of NLP depends. 00:00:43.480 |
Now, Cousteau's quotation actually continues, 00:00:46.280 |
"Have become global garbage cans, which is worrisome." 00:00:49.640 |
For datasets, I think there are some in the field who would 00:00:52.280 |
make the analogy extend to this worrisome aspect, 00:01:05.940 |
But it really is important that we get this right. 00:01:17.360 |
We rely on datasets to enable new capabilities in models via training, 00:01:23.520 |
and for fundamental aspects of scientific inquiry. 00:01:32.360 |
You can see it's important that we get these datasets right. 00:01:36.820 |
If we don't, then we've got a very shaky foundation and we might be 00:01:39.200 |
tricking ourselves when we think we're making a lot of progress. 00:01:52.440 |
There's also the analogy that datasets are like the telescopes of our field. 00:02:01.480 |
He said that NLP-ers were like astronomers who want to see 00:02:04.880 |
the stars but refuse to build any telescopes. 00:02:11.260 |
Happily, the field has moved toward creating more datasets and valuing dataset contributions. 00:02:21.360 |
In that context though, it's worth mentioning 00:02:24.040 |
this plot that I've used a few times in this course, 00:02:26.760 |
under the heading of benchmarks saturating faster than ever. 00:02:32.520 |
I have time going back to the 90s, and the y-axis is performance relative to human performance. 00:02:47.520 |
What you can see is that the benchmarks are indeed saturating faster than ever before. 00:02:51.080 |
I think we can't deny that that is evident here. 00:02:54.800 |
But the other aspect of this is just the worrisome fact that 00:03:00.520 |
none of the systems in this chart are superhuman in any meaningful sense. 00:03:08.720 |
Rather, our benchmarks are simply not up to the task of measuring what we want them to measure. 00:03:15.660 |
Now let's talk about the limitations that we find in 00:03:18.400 |
these datasets, and we do indeed find them more quickly. 00:03:41.200 |
Here are some papers that are finding errors in the Penn Treebank. 00:03:46.120 |
These are from Detmar Meurers and colleagues, and hat tip to them for caring about 00:03:51.120 |
the quality of the data and trying to improve it. 00:03:53.780 |
But one thing that's noteworthy for me is that 00:03:57.760 |
there are relatively few such papers, and they're all from many years after the Treebank was released. 00:04:04.780 |
Contrast that with the Stanford Natural Language Inference benchmark. 00:04:15.080 |
It's actually rarer to find papers pointing out 00:04:18.060 |
errors in the era of natural language understanding 00:04:42.380 |
Instead, we find papers, in this case, pointing out artifacts in the dataset. 00:04:51.020 |
which feels like a previous era of dataset generation. 00:05:00.940 |
But then you get an outpouring of papers identifying things 00:05:04.800 |
like biases and errors and artifacts and gaps. 00:05:08.900 |
We've entered into this era in which if you are 00:05:11.520 |
successful with your benchmark and you get a lot of users, you also get a lot of critical scrutiny. 00:05:17.620 |
I think that is a healthy dynamic that we should embrace. 00:05:20.340 |
It's a little bit hard to take as the creator of a dataset, 00:05:27.420 |
this skeptical inquiry about these fundamental devices. 00:05:37.140 |
There are three central questions that I'll address for datasets, 00:05:39.980 |
and then I'll list out some more at the end of the screencast. 00:05:43.140 |
First question, should we rely on naturalistic data, or on data that we create ourselves, for example with crowdworkers? 00:05:58.200 |
Second question, should we use adversarial examples or 00:06:01.740 |
benchmarks that consist only of the most common cases? 00:06:11.260 |
Third question, should we use synthetic benchmarks or naturalistic benchmarks? 00:06:17.660 |
There is a prominent view that synthetic benchmarks are problematic and that we should use only naturalistic ones, 00:06:21.260 |
but you can probably anticipate my answer at this point. 00:06:28.420 |
I'll say more about both of these as we move through the screencast. 00:06:46.380 |
With naturalistic data, you do some work to harvest examples from a website, 00:07:04.980 |
On the other hand, these are also weaknesses. 00:07:09.700 |
You're at the mercy of what you observe in the world. 00:07:18.660 |
It's probably not the case that you got opt-in from 00:07:23.660 |
everyone who contributed a data point to this dataset that you've created. 00:07:29.480 |
So you might worry about the rights of the people who contributed. 00:07:42.300 |
With data created in the lab, by contrast, it could be privacy preserving in the sense that everyone involved 00:07:46.940 |
knows that they're contributing to the dataset, 00:07:52.860 |
and could even withdraw their contributions at a later date if they decide that that's important. 00:07:55.820 |
This will be genuinely expressive because you can have people produce exactly the kinds of examples you care about. 00:08:05.780 |
But then you have the corresponding weaknesses. 00:08:15.820 |
You're having people do things that are very unnatural, 00:08:19.020 |
not things that they would do as a matter of course, 00:08:25.100 |
The results of this might feel contrived also in the sense that 00:08:28.540 |
you know the crowd workers are trying to please you, 00:08:35.420 |
and their guess about the task might be different from the one that you actually have in mind. 00:08:41.540 |
So the question is, how could we balance all these things? 00:08:46.180 |
Maybe we can find hybrid models that allow us to be both genuine and expressive, 00:08:52.700 |
maximizing the strengths across these two modes and minimizing the weaknesses. 00:09:05.620 |
In DynaSent, crowdworkers either wrote sentences from scratch to try to fool a top-performing model for sentiment, or edited naturally occurring prompt sentences with the same goal. 00:09:15.500 |
Fundamentally, I think the editing condition offers a promising hybrid of the two modes. 00:09:24.980 |
I would first observe that they did edit the text. 00:09:27.820 |
We see a wide range of different edit distances 00:09:30.380 |
between the original and the thing they produced. 00:09:43.700 |
The prompt ones have lengths that are more like 00:09:46.620 |
just naturally occurring sentences that we would encounter in the wild, 00:09:59.860 |
and we get much more diversity for the prompt condition than for the scratch condition. 00:10:08.300 |
This looks like a clear win for prompting which 00:10:10.860 |
mixes naturalism with things we do in the lab. 00:10:13.980 |
The result was really wonderful examples that would be hard to 00:10:18.620 |
observe in naturalistic data, and that do all sorts of interesting things. 00:10:36.740 |
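To make that comparison concrete, here is a minimal sketch, in Python, of the kind of analysis being described: token-level edit distance between each prompt and the sentence the worker produced, plus average sentence length by condition. The example sentences, condition names, and helper function are illustrative assumptions, not the DynaSent release code.

```python
# A minimal sketch of comparing the prompt and scratch conditions:
# edit distance from the original prompt, and length statistics.
from statistics import mean

def edit_distance(a, b):
    """Levenshtein distance over token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical examples: (condition, original prompt, worker-written sentence).
examples = [
    ("prompt",  "The service was slow but the staff were kind.",
                "The service was painfully slow, though the staff could not have been kinder."),
    ("scratch", "", "Great food."),
    ("scratch", "", "Terrible food."),
]

by_condition = {}
for condition, original, written in examples:
    stats = by_condition.setdefault(condition, {"lengths": [], "edits": []})
    stats["lengths"].append(len(written.split()))
    if original:  # edit distance is only defined for the prompt condition
        stats["edits"].append(edit_distance(original.split(), written.split()))

for condition, stats in by_condition.items():
    print(condition,
          "mean length:", round(mean(stats["lengths"]), 1),
          "mean edit distance:", round(mean(stats["edits"]), 1) if stats["edits"] else "n/a")
```

Aggregates like these, computed over the full dataset, are what back up the claim that the prompt condition yields longer, more varied, more natural-looking sentences.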
On to the second question: should we use adversarial examples, or just benchmarks that contain the most common cases? 00:10:55.400 |
In an adversarial assessment, we have a separate test set created in a way that you suspect or 00:11:01.220 |
know to be adversarial given your system and the way it was developed initially. 00:11:09.700 |
In the adversarial case, those elements were guided by attempts by people, 00:11:12.340 |
usually to fool a top-performing model or set of models. 00:11:17.180 |
These are the comparisons that we're thinking about here. 00:11:20.460 |
I mentioned for you before that there are a bunch of 00:11:23.000 |
these fully adversarial datasets covering a wide range of domains. 00:11:28.180 |
I think that's been fruitful and I think it's a lesson of 00:11:31.400 |
that literature that we're seeing lots of good results, 00:11:33.640 |
especially from adversarial training and testing. 00:11:36.700 |
But there is an alternative perspective out there, 00:11:39.340 |
and I think the most vocal statement of that perspective is Bowman and Dahl 2021. 00:11:44.760 |
I'll offer you some quotes and you should definitely check out the paper. 00:11:48.680 |
They write, under the heading of adversarial examples not being 00:11:52.080 |
a panacea, that adversarial filtering can systematically 00:11:56.880 |
eliminate coverage of linguistic phenomena or skills that are 00:12:00.080 |
necessary for the task but already well-solved by the adversary model. 00:12:06.720 |
The worry is that if we lose mass-covering behavior by adversarial filtering, 00:12:11.760 |
we reduce dataset diversity and thus make validity harder to achieve. 00:12:17.640 |
I think that's a totally reasonable perspective, and the disconnect may lie in the notion of adversarial filtering. 00:12:23.560 |
That is certainly not something I would advocate for. 00:12:31.240 |
What I favored was a mixture of cases that were adversarial and cases that the model actually got right, 00:12:36.320 |
giving us some of the mode-seeking behavior that they're talking about here. 00:12:39.940 |
I do think you could damage a model by doing adversarial filtering, 00:12:43.480 |
especially for training, because I think you could put real gaps into the data the model learns from. 00:12:49.440 |
But again, that's not something I was arguing for. 00:12:55.120 |
What I was arguing for is that we have benchmarks that contain both the adversarial cases and 00:12:59.760 |
the truly normal mode-seeking cases that they're mentioning here. 00:13:07.320 |
They also write, "This position paper argues that 00:13:12.240 |
the concerns that motivate methods like adversarial filtering are justified, 00:13:15.360 |
but that they can and should be addressed directly, 00:13:17.920 |
and that it is possible and reasonable to do so in 00:13:23.200 |
Again, let's set aside the distracting thing about filtering, 00:13:33.140 |
The idea seems to be that if we collect a massive benchmark, we will, simply by virtue of having that massive benchmark, cover everything we need. 00:13:39.640 |
I actually just think that that's factually incorrect. 00:13:42.360 |
I think it is very difficult given the complexity of language to develop 00:13:46.280 |
a benchmark that is so large that, just as a matter of course, it covers all the phenomena we care about. 00:13:52.200 |
The role of adversarial training examples could be to help us 00:13:56.200 |
fill in those gaps in a much more efficient way. 00:13:59.880 |
Because remember, the job to be done is a complicated one. 00:14:06.760 |
Yes, we need our models to get normal cases like "The food was good" correct. 00:14:14.560 |
But we also need them to handle these complicated shifts in perspective, as in, 00:14:17.140 |
"My sister hated the food, but she's massively wrong," 00:14:20.240 |
or "The cookies seem dry to my boss, but I couldn't disagree more." 00:14:24.600 |
We also need them to get things like non-literal language use, 00:14:28.240 |
like "Breakfast is really good if you're trying to feed it to dogs." 00:14:33.180 |
As well as really creative things that people do with language like, 00:14:46.980 |
But we know models will struggle with this innovative use of language, 00:14:51.160 |
and we need to push them to overcome that hurdle. 00:14:56.360 |
If you sample naturalistic data, you might not see any of these examples, or certainly not in 00:14:59.520 |
the density that you need to see them to improve our systems. 00:15:03.120 |
That's why I would introduce a measure of adversarial pressure into how we construct datasets. 00:15:17.400 |
Stepping back, here's what I would say are the major lessons we've learned so far. 00:15:20.720 |
Often, our top performing systems like the one from 00:15:26.320 |
found unsystematic solutions that should worry us. 00:15:29.720 |
I also noted in earlier units of this course that progress on 00:15:33.440 |
challenge sets does seem to correlate with meaningful progress in general. 00:15:39.520 |
Present-day systems get traction on adversarial cases 00:15:45.740 |
It'd be worrisome if training on adversarial examples 00:15:50.020 |
caused our systems to perform worse in the general case, but that doesn't seem to be what happens. 00:15:56.000 |
Then the final thing I would say is that whatever your view is 00:15:59.500 |
on the role of adversarials in system development, 00:16:05.700 |
the adversarial examples that people cook up and throw at 00:16:09.240 |
your system will define public perception for your system. 00:16:15.960 |
I would encourage you to think about adversarial dynamics for 00:16:19.080 |
evaluation before you do any kind of deployment. 00:16:22.780 |
That's why I exhorted you all in an earlier unit of this course to really think 00:16:27.400 |
deeply about evaluation and have diverse teams of people with multiple perspectives on 00:16:33.440 |
your system participate in that internal evaluation to really find 00:16:37.880 |
the cases where your system performs in a problematic way. 00:16:41.720 |
You should be your own adversary to the extent that you can to avoid having 00:16:46.240 |
really adversarial problems emerge when your system is used in the world. 00:16:52.000 |
Final question, synthetic benchmarks or naturalistic benchmarks? 00:16:57.960 |
As I said, there is a prominent perspective in the field 00:17:00.880 |
that naturalistic benchmarks are the only ones we should be using. 00:17:07.280 |
For me, this is deeply worrisome because what it does is 00:17:10.400 |
introduce two unknowns into almost all the experiments that we run. 00:17:14.660 |
The dataset is an unknown in the sense that we don't fully command what its structure is 00:17:19.160 |
like and the model is almost by definition in these contexts an unknown. 00:17:25.560 |
The situation is like you have this massive dataset that you cannot audit comprehensively. 00:17:31.800 |
You might not even fully understand the process that produced it. 00:17:43.680 |
The question is, what are the causal factors in this output? 00:17:50.240 |
Answering that is very difficult because we have two unknowns. 00:17:53.840 |
If we could fix the dataset and call it a known quantity, 00:18:01.200 |
then we would be in a much better position to attribute the output to properties of the model that we have manipulated. 00:18:12.500 |
This is under the heading of negation as a learning target. 00:18:16.160 |
Remember, we have this idea that we should have systems that know that if A entails B, 00:18:22.920 |
then not B entails not A, which is the entailment-reversing property of negation. 00:18:26.400 |
We have an observation across a lot of different papers that 00:18:29.860 |
top performing NLI models fail to hit that learning target. 00:18:34.320 |
It's very tempting to conclude here that the model is the problem. 00:18:39.440 |
Top performing models seem incapable of learning negation, 00:18:43.040 |
but we have an observation that our datasets, 00:18:46.400 |
the naturalistic benchmarks these models were trained on, dramatically underrepresent negation. 00:18:56.000 |
So we can't tell whether the problem lies with the models or with the dataset, because we have two unknowns. 00:19:02.920 |
In response, we created what I've called here a slightly synthetic benchmark, MoNLI. 00:19:11.200 |
MoNLI has a positive part where we take existing SNLI hypotheses and use 00:19:19.700 |
WordNet substitutions to generate the systematic cases where we get A neutral B and B entailment A, and a negative part where the same substitutions occur under negation, so the relations reverse. 00:19:39.960 |
It's not fully naturalistic data, but a systematic manipulation that leaves us with 00:19:42.920 |
complete guarantees that we have a certain representation of negation in the dataset. 00:19:52.800 |
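Here is a toy sketch of that construction idea, assuming a hand-coded hyponym-to-hypernym lexicon and a single sentence template in place of the real SNLI hypotheses and WordNet machinery; the word pairs, templates, and function name are hypothetical, not the released generation code.

```python
# A toy sketch of the MoNLI construction idea, not the actual generation pipeline.
# A tiny hand-coded hyponym -> hypernym lexicon and one template stand in for
# SNLI hypotheses and WordNet.
HYPONYM_TO_HYPERNYM = {"puppy": "dog", "tulip": "flower", "sedan": "car"}

POSITIVE = "The woman saw a {}."
NEGATED = "The woman did not see a {}."

def monli_pairs(hypo, hyper):
    """Return (premise, hypothesis, label) triples for positive and negated contexts."""
    return [
        # Positive context: substituting a word with its hypernym is entailed;
        # the reverse direction is neutral.
        (POSITIVE.format(hypo), POSITIVE.format(hyper), "entailment"),
        (POSITIVE.format(hyper), POSITIVE.format(hypo), "neutral"),
        # Under negation the relation reverses: hypernym-to-hyponym is now
        # the entailment direction.
        (NEGATED.format(hyper), NEGATED.format(hypo), "entailment"),
        (NEGATED.format(hypo), NEGATED.format(hyper), "neutral"),
    ]

dataset = [triple for hypo, hyper in HYPONYM_TO_HYPERNYM.items()
           for triple in monli_pairs(hypo, hyper)]

for premise, hypothesis, label in dataset:
    print(f"{label:10s} {premise} => {hypothesis}")
```

The point of the systematic manipulation is exactly what the comments encode: we know, by construction, which examples exercise negation and what the correct labels are.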
Then when we use this as a challenge dataset, 00:20:02.760 |
we find a model like BERT doing extremely well on the positive part of our synthetic benchmark, 00:20:06.660 |
but essentially hitting zero for the negative part of our benchmark. 00:20:17.920 |
Well, when we do a modest amount of fine-tuning on negative MoNLI examples, 00:20:22.740 |
we immediately boost performance for the model on that split. 00:20:38.380 |
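As a rough illustration of what a modest amount of fine-tuning on the negative split could look like, here is a sketch using the Hugging Face transformers library; the checkpoint name, two-way label scheme, tiny in-line examples, and hyperparameters are placeholders rather than the actual experimental setup.

```python
# A rough sketch of fine-tuning a BERT-style NLI classifier on a few negated
# examples. All names and settings here are illustrative placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "bert-base-uncased"          # assumption: any NLI-ready encoder
LABELS = {"entailment": 0, "neutral": 1}  # two-way label scheme, as in MoNLI

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=len(LABELS))

# Tiny stand-in for the negative split: entailment reverses under negation.
train_examples = [
    ("The man did not see a dog.", "The man did not see a puppy.", "entailment"),
    ("The man did not see a puppy.", "The man did not see a dog.", "neutral"),
]

def collate(batch):
    premises, hypotheses, labels = zip(*batch)
    enc = tokenizer(list(premises), list(hypotheses),
                    padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor([LABELS[label] for label in labels])
    return enc

loader = DataLoader(train_examples, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                 # a modest amount of fine-tuning
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```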
Because the dataset is a known quantity here, we have learned something directly about our model. 00:20:43.660 |
When we return to our messy naturalistic data, and I emphasize "when" there because I do think we should return to it, 00:20:51.440 |
we do so knowing that BERT can in principle learn negation, 00:20:55.620 |
and that data coverage will be a major factor in its performance there. 00:21:00.020 |
Those are crisp analytic lessons that we learned 00:21:03.020 |
only because we allowed some synthetic evaluations. 00:21:07.260 |
That's it. Those are three major questions for datasets in the field. 00:21:13.300 |
Those are the ones I addressed, but we can also think about issues like datasheets, 00:21:18.640 |
that is, disclosures for datasets that help us understand 00:21:21.740 |
how they can be used responsibly and where their limits lie. 00:21:25.460 |
We should also be thinking much more about how we're going to 00:21:28.220 |
achieve cross-linguistic coverage for our benchmarks. 00:21:31.060 |
Right now, we still have too much focus on English, 00:21:35.140 |
when in fact we want systems and models that are performant the world over. 00:21:42.960 |
And of course, we should also worry deeply about 00:21:46.020 |
the pernicious social biases that are embedded in our datasets, 00:21:49.860 |
and how we will get rid of those in order to create technologies that are more equitable.