
Stanford XCS224U: NLU | NLP Methods and Metrics, Part 4: Datasets | Spring 2023


Chapters

0:00 Intro
0:24 Water and air for our field
1:07 We ask a lot of our datasets
2:22 Benchmarks saturate faster than ever
3:16 Limitations found more quickly
5:33 Central questions
6:31 Trade-offs
8:56 DynaSent: Prompts increase naturalism
10:32 Adversarial examples or the most common cases?
11:20 Dynamics of adversarial datasets
11:28 Counterpoint from Bowman and Dahl (2021)
14:00 The job to be done
15:14 Major lessons thus far
16:53 Synthetic or naturalistic benchmarks
18:08 Negation as a learning target
19:01 MoNLI: A slightly synthetic dataset
19:52 MoNLI as challenge dataset
20:41 The value of messy data
21:08 Other vital issues for datasets


00:00:00.000 | Welcome back everyone.
00:00:06.040 | This is part 4 in our series on methods and metrics.
00:00:08.960 | We're going to talk about datasets.
00:00:10.820 | In the previous two screencasts,
00:00:12.520 | we really got in the weeds around
00:00:14.320 | classifier and generation metrics.
00:00:16.520 | I want to pop up a level now and talk
00:00:18.800 | conceptually about the role that
00:00:20.920 | datasets play in our field and how we construct them.
00:00:23.940 | This is really a central topic.
00:00:26.280 | In this context, I'd like to mention
00:00:28.000 | this quotation from the famous oceanographer
00:00:30.400 | and explorer Jacques Cousteau.
00:00:32.140 | Cousteau said, "Water and air,
00:00:33.920 | the two essential fluids on which all life depends."
00:00:36.940 | My analogy here is that datasets are
00:00:39.340 | the resource on which all progress in the field of NLP depends.
00:00:43.480 | Now, Cousteau's quotation actually continues,
00:00:46.280 | "have become global garbage cans," which is worrisome.
00:00:49.640 | For datasets, I think there are some in the field who would
00:00:52.280 | make the analogy extend to this worrisome aspect,
00:00:55.100 | but I feel optimistic.
00:00:56.560 | I feel we've learned a lot about how to
00:00:58.400 | develop datasets effectively and we
00:01:00.060 | have more datasets than ever.
00:01:02.000 | I think things are on a good trajectory
00:01:04.380 | as long as we're thoughtful.
00:01:05.940 | But it really is important that we get this right
00:01:08.520 | because we ask so much of our datasets.
00:01:11.320 | We use them to optimize models,
00:01:13.800 | to evaluate models, to compare models,
00:01:17.360 | to enable new capabilities in models via training,
00:01:20.960 | to measure field-wide progress,
00:01:23.520 | and for fundamental aspects of scientific inquiry.
00:01:26.680 | This list is pretty much everything
00:01:28.680 | that we do in the field of NLP.
00:01:30.520 | All of it depends on datasets.
00:01:32.360 | You can see it's important that we get these datasets right.
00:01:35.400 | After all, if we don't,
00:01:36.820 | then we've got a very shaky foundation and we might be
00:01:39.200 | tricking ourselves when we think we're making a lot of
00:01:41.920 | progress because datasets really
00:01:44.920 | are in a way the fundamental instrument.
00:01:47.780 | I like this quotation from Aravind Joshi.
00:01:50.120 | The late great Aravind Joshi had
00:01:52.440 | the analogy that datasets are like the telescopes of our field.
00:01:56.280 | When he said this,
00:01:57.640 | this was back in around 2007,
00:01:59.640 | he was actually expressing a concern.
00:02:01.480 | He said that NLP-ers were like astronomers who want to see
00:02:04.880 | the stars but refuse to build any telescopes.
00:02:08.240 | Aravind indeed did try to lead the way
00:02:11.260 | toward creating more datasets and valuing dataset contributions.
00:02:15.200 | I think he would be pleased with
00:02:17.160 | the current state of the field when we
00:02:18.640 | have more datasets than ever.
00:02:21.360 | In that context though, it's worth mentioning
00:02:24.040 | this plot that I've used a few times in this course,
00:02:26.760 | under the heading of benchmarks saturating faster than ever.
00:02:30.320 | Remember that along the x-axis of this chart,
00:02:32.520 | I have time going back to the 90s, and the y-axis is
00:02:36.000 | a normalized measure of distance from
00:02:37.940 | our so-called human performance,
00:02:39.800 | although we've talked about what
00:02:41.360 | human performance actually means here.
00:02:43.640 | When some people look at this chart,
00:02:45.460 | they see a story of progress where
00:02:47.520 | the benchmarks are indeed saturating faster than ever before.
00:02:51.080 | I think we can't deny that that is evident here.
00:02:54.800 | But the other aspect of this is just the worrisome fact that
00:02:58.360 | none of the systems that are represented in
00:03:00.520 | this chart are superhuman in any meaningful sense.
00:03:03.880 | The fundamental problem there might be that
00:03:06.320 | our datasets are simply not up
00:03:08.720 | to the task of measuring what we want them to measure.
00:03:12.320 | For an alternative perspective on this,
00:03:15.660 | let's talk about the limitations that we find in
00:03:18.400 | these datasets and we do indeed find them more quickly.
00:03:21.460 | Again, for this slide along the x-axis,
00:03:23.840 | I have time stretching back into the 90s.
00:03:26.560 | So far I have one dataset represented,
00:03:28.940 | the Penn Treebank, which is a collection
00:03:30.840 | of syntactic parses for sentences.
00:03:33.060 | It drove progress on syntactic parsing
00:03:35.600 | for decades, maybe for too long.
00:03:38.280 | The dots here, the red dots are papers
00:03:41.200 | that are finding errors in the Penn Treebank.
00:03:43.840 | Most of these papers trace to work by
00:03:46.120 | Detmar Meurers and colleagues, and hat tip to them for
00:03:49.360 | really thinking carefully about
00:03:51.120 | the quality of the data and trying to improve it.
00:03:53.780 | But one thing that's noteworthy for me is that
00:03:56.160 | despite the very long timeline,
00:03:57.760 | there are relatively few papers and they're all
00:03:59.800 | just about errors in the parse trees.
00:04:02.720 | Let's fast forward to SNLI,
00:04:04.780 | the Stanford Natural Language Inference Benchmark.
00:04:07.460 | This was launched in 2015 and right away,
00:04:10.520 | you get an outpouring of papers that are
00:04:12.660 | finding limitations in this dataset.
00:04:15.080 | It's actually rarer to find papers pointing out
00:04:18.060 | errors in the era of natural language understanding,
00:04:20.400 | and error is harder to define.
00:04:22.520 | But we can identify things like artifacts,
00:04:25.200 | those are the orange dots,
00:04:26.620 | and biases, those are the blue dots,
00:04:28.960 | as well as gaps in the dataset
00:04:30.680 | as I've shown in that maroon color.
00:04:32.780 | There are lots of dots here
00:04:34.240 | compared with the Penn Treebank.
00:04:35.960 | A similar story holds for SQuAD.
00:04:38.040 | It was launched soon after SNLI,
00:04:40.320 | and again, you get a bunch of papers,
00:04:42.380 | in this case, pointing out artifacts in the dataset.
00:04:45.580 | Then finally, for another illustration,
00:04:47.540 | ImageNet is an interesting case.
00:04:49.140 | It was launched in 2009,
00:04:51.020 | which feels like a previous era of dataset generation.
00:04:54.500 | For a while, it got to lead a quiet life as
00:04:57.540 | a trusted benchmark just like the PTB did.
00:05:00.940 | But then you get an outpouring of papers identifying things
00:05:04.800 | like biases and errors and artifacts and gaps.
00:05:08.900 | We've entered into this era in which if you are
00:05:11.520 | successful with your benchmark and you get a lot of users,
00:05:14.640 | people will also find limitations quickly.
00:05:17.620 | I think that is a healthy dynamic that we should embrace.
00:05:20.340 | It's a little bit hard to take as the creator of a dataset,
00:05:23.660 | but ultimately, I think we can
00:05:25.380 | see that this is a marker of progress,
00:05:27.420 | this skeptical inquiry about these fundamental devices.
00:05:33.060 | To keep things succinct here,
00:05:36.040 | I'm going to identify
00:05:37.140 | three central questions that I'll address for datasets,
00:05:39.980 | and then I'll list out some more at the end of the screencast.
00:05:43.140 | First question, should we rely on naturalistic data,
00:05:47.040 | like data that you scrape from a website
00:05:49.220 | or extract from an existing database,
00:05:51.860 | or should we turn to crowdsourcing?
00:05:54.100 | It's a commonly debated question.
00:05:55.700 | My answer will be, use both.
00:05:58.200 | Second question, should we use adversarial examples or
00:06:01.740 | benchmarks that consist only of the most common cases?
00:06:05.060 | Another thing that's hotly debated,
00:06:06.960 | and my answer is both.
00:06:08.900 | Third question, should we use
00:06:11.260 | synthetic benchmarks or naturalistic benchmarks?
00:06:14.260 | A lot of people in the field think
00:06:15.700 | synthetic benchmarks are fundamentally
00:06:17.660 | problematic and that we should use only naturalistic ones,
00:06:21.260 | but you can probably anticipate my answer at this point.
00:06:24.300 | I think both of them have a role to play.
00:06:26.580 | I'll substantiate all three of
00:06:28.420 | these "both" answers as we move through the screencast.
00:06:31.380 | Let's start with that question of
00:06:34.060 | naturalistic data versus crowdsourcing.
00:06:36.740 | The reason I answer both is that this is
00:06:38.580 | basically about trade-offs for me.
00:06:40.420 | For naturalistic datasets,
00:06:42.160 | which you could call found or curated,
00:06:44.260 | like you scrape a website or do
00:06:46.380 | some work to harvest examples from a website,
00:06:49.160 | you have abundance.
00:06:51.300 | It's probably inexpensive to gather
00:06:53.540 | these examples and they will be genuine in
00:06:55.780 | some sense because they were presumably not
00:06:57.780 | created for the sake of your experiment,
00:06:59.680 | but rather for some naturalistic purpose,
00:07:01.900 | some genuine communicative purpose.
00:07:04.980 | On the other hand, there are also weaknesses.
00:07:07.980 | The dataset will be uncontrolled.
00:07:09.700 | You're at the mercy of what you observe in the world.
00:07:12.020 | It will be limited in terms of
00:07:13.700 | the kind of information that you can gather.
00:07:16.180 | It may also be intrusive.
00:07:18.660 | It's probably not the case that you got opt-in from
00:07:21.960 | every single person who contributed
00:07:23.660 | a data point to this dataset that you've created.
00:07:26.500 | In some sense, you might have a deep concern
00:07:29.480 | about the rights of the people who contributed.
00:07:32.580 | Let's contrast this with crowdsourcing.
00:07:34.860 | I've called this lab-grown.
00:07:36.300 | This is a more artificial thing that you do.
00:07:38.460 | This could be highly controlled
00:07:40.300 | because you set up the task.
00:07:42.300 | It could be privacy preserving in the sense that you could just
00:07:45.060 | make sure everyone who contributes
00:07:46.940 | knows that they're contributing to the dataset,
00:07:49.260 | and you could even offer them
00:07:50.860 | the opportunity to remove themselves at
00:07:52.860 | a later date if they decide that that's important.
00:07:55.820 | This will be genuinely expressive because you can have
00:07:58.980 | crowd workers in principle do even very
00:08:01.420 | complicated things to get data
00:08:03.260 | that you wouldn't observe in the wild.
00:08:05.780 | But then you have the corresponding weaknesses.
00:08:08.180 | This will be scarce.
00:08:09.140 | You'll never have enough crowdsourced
00:08:10.860 | data and it will be expensive.
00:08:13.220 | In addition, it can get very contrived.
00:08:15.820 | You're having people do things that are very unnatural,
00:08:19.020 | not things that they would do as a matter of course
00:08:21.700 | with communication, but rather things
00:08:23.320 | that you set them up to do.
00:08:25.100 | The results of this might feel contrived also in the sense that
00:08:28.540 | you know the crowd workers are trying to please you,
00:08:31.940 | the person who launched the task,
00:08:33.660 | and that might be a goal that's very
00:08:35.420 | different from the one that you actually have in mind.
00:08:38.740 | For me, looking at these trade-offs,
00:08:41.540 | the question is how could we balance all these things?
00:08:44.140 | I do think that we can find
00:08:46.180 | hybrid models that allow us to be both genuine and expressive,
00:08:50.820 | and to preserve in general a lot of
00:08:52.700 | the strengths across these two and minimize the weaknesses.
00:08:56.300 | I've shown you an example of this already.
00:08:58.860 | For DynaSent round 2,
00:09:01.020 | we had two conditions.
00:09:03.060 | One, where workers just wrote a text from
00:09:05.620 | scratch to try to fool a top performing model for sentiment,
00:09:09.060 | and another condition where we gave them
00:09:11.260 | existing sentences that they could
00:09:12.980 | edit in order to achieve that goal.
00:09:15.500 | Fundamentally, I think the editing condition offers
00:09:19.360 | much more naturalism while still
00:09:20.940 | giving us the results that we wanted.
00:09:22.980 | For that prompt condition,
00:09:24.980 | I would first observe that they did edit the text.
00:09:27.820 | We see a wide range of different edit distances
00:09:30.380 | between the original and the thing they produced.
00:09:33.100 | That seems healthy.
00:09:34.500 | Then this is more important.
00:09:36.100 | For example, in terms of length,
00:09:38.500 | we find that the no-prompt examples were
00:09:40.880 | very short compared to the prompt ones.
00:09:43.700 | The prompt ones have lengths that are more like
00:09:46.620 | just naturally occurring sentences that we would
00:09:49.180 | harvest in domain from a site like Yelp.
00:09:52.140 | Here's a similar thing for vocabulary size.
00:09:54.880 | The no-prompt condition is
00:09:56.660 | very limited in terms of its vocabulary,
00:09:59.860 | whereas we get much more diversity for the prompt condition
00:10:03.040 | approaching the diversity of
00:10:05.020 | vocabulary for naturally occurring cases.
00:10:08.300 | This looks like a clear win for prompting which
00:10:10.860 | mixes naturalism with things we do in the lab.
00:10:13.980 | The result was really wonderful examples that would be hard to
00:10:18.620 | observe otherwise, and that do all sorts of interesting things
00:10:21.580 | linguistically and also play
00:10:23.500 | with non-literal language use and so forth.
00:10:26.540 | I think the hybrid model gave us
00:10:28.460 | the best of both worlds in some sense.
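To make those length and vocabulary comparisons concrete, here is a minimal sketch of how such measurements could be computed. The file names, field names, and the use of NLTK's edit_distance are illustrative assumptions, not the actual DynaSent analysis code.

```python
# Sketch: compare the prompt and no-prompt conditions on edit distance,
# token length, and vocabulary size. File and field names are hypothetical.
import json
import statistics
from nltk.metrics.distance import edit_distance  # Levenshtein distance

def load_jsonl(path):
    """Load one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def describe(examples, has_prompt=False):
    """Summarize lengths and vocabulary; add edit distance if prompts exist."""
    tokenized = [ex["sentence"].lower().split() for ex in examples]  # rough tokenization
    report = {
        "mean_length": statistics.mean(len(toks) for toks in tokenized),
        "vocab_size": len({tok for toks in tokenized for tok in toks}),
    }
    if has_prompt:
        # Distance between the prompt sentence and the worker's edited version.
        report["mean_edit_distance"] = statistics.mean(
            edit_distance(ex["prompt"], ex["sentence"]) for ex in examples
        )
    return report

print("prompt:   ", describe(load_jsonl("round2_prompt.jsonl"), has_prompt=True))
print("no prompt:", describe(load_jsonl("round2_no_prompt.jsonl")))
```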
00:10:31.740 | Let's move to our second question.
00:10:34.300 | Should we use adversarial examples or
00:10:36.740 | just benchmarks that contain the most common cases?
00:10:39.900 | Remember, my answer is both.
00:10:41.680 | Just as a reminder,
00:10:43.100 | we talked about this in a previous unit.
00:10:45.060 | For standard evaluations,
00:10:46.620 | you create a dataset from
00:10:48.300 | a single model-independent process
00:10:50.520 | and divide it into train, dev, and test.
00:10:52.920 | Whereas for adversarial assessment,
00:10:55.400 | we have a separate test set created in a way that you suspect or
00:10:59.180 | know will be challenging given
00:11:01.220 | your system and the way it was developed initially.
00:11:04.260 | Then for adversarial datasets in general,
00:11:07.020 | this would be train, dev, and test where all of
00:11:09.700 | those elements were guided by attempts by people
00:11:12.340 | usually to fool a top-performing model or set of models.
00:11:17.180 | These are the comparisons that we're thinking about here.
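As a rough sketch of how these three regimes differ in code, something like the following could organize the splits. The pool names and the 80/10/10 proportions are assumptions for illustration, not a prescribed recipe.

```python
# Sketch of the three evaluation regimes described above.
# `standard_pool`: examples from a single, model-independent process.
# `adversarial_pool`: examples written by people trying to fool a model.
import random

def split(examples, seed=0, dev_frac=0.1, test_frac=0.1):
    """Shuffle a pool of examples and divide it into train/dev/test."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    return {
        "train": examples[: n - n_dev - n_test],
        "dev": examples[n - n_dev - n_test : n - n_test],
        "test": examples[n - n_test :],
    }

def standard_evaluation(standard_pool):
    # Train, dev, and test all come from the same model-independent process.
    return split(standard_pool)

def adversarial_assessment(standard_pool, adversarial_pool):
    # Train and dev as usual, but the test set is adversarially constructed.
    splits = split(standard_pool)
    splits["test"] = list(adversarial_pool)
    return splits

def adversarial_dataset(adversarial_pool):
    # Every split is guided by attempts to fool a top-performing model.
    return split(adversarial_pool)
```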
00:11:20.460 | I mentioned for you before that there are a bunch of
00:11:23.000 | these fully adversarial datasets covering a wide range of domains.
00:11:28.180 | I think that's been fruitful and I think it's a lesson of
00:11:31.400 | that literature that we're seeing lots of good results,
00:11:33.640 | especially from adversarial training and testing.
00:11:36.700 | But there is an alternative perspective out there,
00:11:39.340 | and I think the most vocal expression of that perspective is Bowman and Dahl 2021.
00:11:44.760 | I'll offer you some quotes and you should definitely check out the paper.
00:11:48.680 | They write under the heading of adversarial examples not being
00:11:52.080 | a panacea that adversarial filtering can systematically
00:11:56.880 | eliminate coverage of linguistic phenomena or skills that are
00:12:00.080 | necessary for the task but already well-solved by the adversary model.
00:12:04.320 | This mode-seeking, as opposed to
00:12:06.720 | mass-covering, behavior by adversarial filtering,
00:12:09.560 | if left unchecked, tends to reduce
00:12:11.760 | dataset diversity and thus make validity harder to achieve.
00:12:15.560 | I actually frankly think that this is
00:12:17.640 | a totally reasonable perspective and the disconnect
00:12:20.240 | here is the notion of adversarial filtering.
00:12:23.560 | That is certainly not something I would advocate for.
00:12:26.480 | If you think about DynaSent,
00:12:28.220 | our train, dev, and test sets all contain
00:12:31.240 | a mixture of cases that were adversarial and cases that the model actually got right,
00:12:36.320 | more like the mode-seeking behavior that they're talking about here.
00:12:39.940 | I do think you could damage a model by doing adversarial filtering,
00:12:43.480 | especially for training, because I think you could put
00:12:46.240 | the model in a very unusual state.
00:12:49.440 | But again, that's not something I was arguing for.
00:12:52.600 | I was arguing for the both perspective,
00:12:55.120 | have benchmarks that contain both the adversarial cases and
00:12:59.760 | the truly normal mode-seeking cases that they're mentioning here.
00:13:04.400 | I would not leave this pressure unchecked.
00:13:07.320 | They also write, "This position paper argues that
00:13:10.360 | concerns about standard benchmarks that
00:13:12.240 | motivate methods like adversarial filtering are justified,
00:13:15.360 | but that they can and should be addressed directly,
00:13:17.920 | and that it is possible and reasonable to do so in
00:13:20.240 | the context of static IID evaluation."
00:13:23.200 | Again, let's set aside the distracting thing about filtering,
00:13:27.000 | and focus on what they claim here,
00:13:29.000 | which is that you,
00:13:30.520 | if you have a massive benchmark,
00:13:33.140 | will simply by virtue of having that massive benchmark,
00:13:36.680 | cover all of the relevant phenomena.
00:13:39.640 | I actually just think that that's factually incorrect.
00:13:42.360 | I think it is very difficult given the complexity of language to develop
00:13:46.280 | a benchmark that is so large that just as a matter of course,
00:13:49.720 | you've covered all the hard cases.
00:13:52.200 | The role of adversarial training examples could be to help us
00:13:56.200 | fill in those gaps in a much more efficient way.
00:13:59.880 | Because remember, the job to be done is a complicated one.
00:14:04.600 | Let's focus on the domain of sentiment.
00:14:06.760 | Yes, we need our models to get normal cases like "The food was good" correct.
00:14:12.160 | But we also need them to deal with
00:14:14.560 | these complicated shifts in perspective, as in
00:14:17.140 | "My sister hated the food but she's massively wrong,"
00:14:20.240 | or "The cookies seem dry to my boss but I couldn't disagree more."
00:14:24.600 | We also need them to get things like non-literal language use,
00:14:28.240 | like "Breakfast is really good if you're trying to feed it to dogs."
00:14:31.280 | That's some sarcasm or irony.
00:14:33.180 | As well as really creative things that people do with language, like
00:14:36.840 | "worthy of gasps of foodgasms,"
00:14:39.240 | where we get a new use of a suffix here.
00:14:41.860 | We can all immediately intuit what
00:14:44.040 | this means: it's a positive statement.
00:14:46.980 | But we know models will struggle with this innovative use of language,
00:14:51.160 | and we need to push them to overcome that hurdle.
00:14:54.100 | If you just do standard data collection,
00:14:56.360 | you might not see any of these examples or certainly not in
00:14:59.520 | the density that you need to see them to improve our systems.
00:15:03.120 | That's why I would just introduce a measure of
00:15:06.280 | adversarialness into train, dev, and test.
00:15:09.700 | But I would not do any of the filtering that
00:15:11.720 | Bowman and Dahl are worried about.
00:15:14.440 | So for adversarial testing in general,
00:15:17.400 | here's what I would say are the major lessons we've learned so far.
00:15:20.720 | Often, our top-performing systems, like the one from
00:15:23.880 | that benchmark-saturation slide, have
00:15:26.320 | found unsystematic solutions that should worry us.
00:15:29.720 | I also noted in earlier units of this course that progress on
00:15:33.440 | challenge sets does seem to correlate with meaningful progress in general.
00:15:37.560 | That's an important insight.
00:15:39.520 | Present-day systems get traction on adversarial cases
00:15:42.920 | without degradation on the general cases.
00:15:45.740 | It'd be worrisome if training on adversarial examples,
00:15:48.600 | even a little bit of them,
00:15:50.020 | caused our systems to perform worse in the general case,
00:15:53.360 | but I think we do not see that happening.
00:15:56.000 | Then the final thing I would say is that whatever your view is
00:15:59.500 | on the role of adversarials in system development,
00:16:02.860 | if you deploy a system out into the world,
00:16:05.700 | the adversarial examples that people cook up and throw at
00:16:09.240 | your system will define public perception of it.
00:16:13.440 | In the interest of self-preservation,
00:16:15.960 | I would encourage you to think about adversarial dynamics for
00:16:19.080 | evaluation before you do any kind of deployment.
00:16:22.780 | That's why I exhorted you all in an earlier unit for this course to really think
00:16:27.400 | deeply about evaluation and have diverse teams of people with multiple perspectives on
00:16:33.440 | your system participate in that internal evaluation to really find
00:16:37.880 | the cases where your system performs in a problematic way.
00:16:41.720 | You should be your own adversary to the extent that you can to avoid having
00:16:46.240 | really adversarial problems emerge when your system is used in the world.
00:16:52.000 | Final question, synthetic benchmarks or naturalistic benchmarks?
00:16:57.960 | As I said, there is a prominent perspective in the field
00:17:00.880 | that naturalistic benchmarks are the only ones we should be using.
00:17:04.920 | To me, at a scientific level,
00:17:07.280 | this is deeply worrisome because what it does is
00:17:10.400 | introduce two unknowns into almost all the experiments that we run.
00:17:14.660 | The dataset is an unknown in the sense that we don't fully command what its structure is
00:17:19.160 | like, and the model is, almost by definition in these contexts, an unknown.
00:17:23.260 | We're trying to explore its properties.
00:17:25.560 | The situation is like you have this massive dataset that you cannot audit comprehensively.
00:17:31.800 | You might not even fully understand the process that
00:17:34.200 | created it even if you did crowdsourcing.
00:17:36.760 | Then you have that as the input to a model,
00:17:39.560 | which is also a major unknown,
00:17:41.660 | and then you get some output.
00:17:43.680 | The question is, what are the causal factors in this output?
00:17:47.460 | Causal assignment in this case is
00:17:50.240 | very difficult because of the fact that we have two unknowns.
00:17:53.840 | If we could fix the dataset and call it a known quantity,
00:17:59.040 | then we could trace aspects of
00:18:01.200 | the output to properties of the model that we have manipulated.
00:18:04.800 | But with two unknowns in play,
00:18:06.580 | this will always be uncertain.
00:18:08.720 | I gave you a story about this before.
00:18:11.100 | Let me briefly rehearse it.
00:18:12.500 | This is under the heading of negation as a learning target.
00:18:16.160 | Remember, we have this idea that we should have systems that know that if A entails B,
00:18:21.060 | then not B entails not A,
00:18:22.920 | the entailment-reversing property of negation.
00:18:26.400 | We have an observation across a lot of different papers that
00:18:29.860 | top performing NLI models fail to hit that learning target.
00:18:34.320 | It's very tempting to conclude here that the model is the problem.
00:18:39.440 | Top performing models seem incapable of learning negation,
00:18:43.040 | but we have an observation that our datasets,
00:18:46.400 | the naturalistic benchmarks these models were trained on,
00:18:49.500 | severely under-represent negation.
00:18:52.800 | Now, we don't know whether the issue is
00:18:56.000 | with the models or with the dataset because we have two unknowns.
00:19:01.240 | In response to that,
00:19:02.920 | we created what I've called here a slightly synthetic benchmark,
00:19:06.460 | that is monotonicity NLI or MoNLI.
00:19:09.640 | Recall it has two parts,
00:19:11.200 | a positive part where we take existing SNLI hypotheses and use
00:19:16.160 | WordNet to create new examples that fire off
00:19:19.700 | the systematic cases where we get A neutral B and B entailment A.
00:19:24.600 | That's the positive part.
00:19:26.000 | We did the same thing for negated examples.
00:19:28.880 | Now, after the replacement,
00:19:31.380 | we get the reverse of those patterns.
00:19:33.880 | What this leads us to is a dataset that
00:19:36.920 | has naturally occurring cases as its basis,
00:19:39.960 | but a systematic manipulation that leaves us with
00:19:42.920 | complete guarantees that we have a certain representation
00:19:46.960 | for lexical entailment and negation.
00:19:50.200 | That's why it's slightly synthetic.
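Here is a minimal sketch of the kind of WordNet substitution that yields such examples, with the label logic reflecting the monotonicity pattern just described. The helper names are illustrative, and the real MoNLI generation pipeline differs in its details.

```python
# Sketch of MoNLI-style example generation: swap a noun in an existing
# hypothesis for a WordNet hypernym or hyponym, then label the pair by the
# monotonicity of the context (negation reverses the pattern).
# Illustrative only; not the actual MoNLI construction code.
from nltk.corpus import wordnet as wn

def substitutions(word):
    """Yield (replacement, relation) pairs for the first noun sense of `word`."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return
    for hyper in synsets[0].hypernyms():
        yield hyper.lemma_names()[0].replace("_", " "), "hypernym"
    for hypo in synsets[0].hyponyms():
        yield hypo.lemma_names()[0].replace("_", " "), "hyponym"

def make_examples(sentence, word, negated=False):
    """Create premise/hypothesis pairs by swapping `word` inside `sentence`."""
    examples = []
    for replacement, relation in substitutions(word):
        hypothesis = sentence.replace(word, replacement)
        if not negated:
            # Upward-monotone context: the hypernym substitution is entailed.
            label = "entailment" if relation == "hypernym" else "neutral"
        else:
            # Negation flips the pattern: the hyponym substitution is entailed.
            label = "entailment" if relation == "hyponym" else "neutral"
        examples.append({"premise": sentence, "hypothesis": hypothesis, "label": label})
    return examples

print(make_examples("The man is eating a pizza.", "pizza"))
print(make_examples("The man is not eating a pizza.", "pizza", negated=True))
```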
00:19:52.800 | Then when we use this as a challenge dataset,
00:19:55.160 | we get a blast of insight, I claim.
00:19:57.480 | Let's look at the BERT row here.
00:19:59.460 | BERT is performing extremely well on SNLI,
00:20:02.760 | extremely well on the positive part of our synthetic benchmark,
00:20:06.660 | but essentially hitting zero for the negative part of our benchmark.
00:20:11.140 | It's obviously just ignoring the negations.
00:20:13.900 | What is the issue here?
00:20:15.460 | Is it data or is it the model?
00:20:17.920 | Well, when we do a modest amount of fine-tuning on negative MoNLI examples,
00:20:22.740 | we immediately boost performance for the model on that split.
00:20:27.020 | That shows us definitively that when we show
00:20:30.220 | a model like BERT relevant negation cases,
00:20:33.280 | it can handle the task.
00:20:35.460 | Now, as a result of having a known dataset,
00:20:38.380 | we have learned something directly about our model.
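A minimal sketch of that kind of targeted fine-tuning with Hugging Face transformers might look like the following. The checkpoint, hyperparameters, and toy examples are assumptions for illustration; in practice you would start from a model already fine-tuned on SNLI and use the actual negated MoNLI split.

```python
# Sketch: fine-tune an NLI classifier on a modest number of negated
# MoNLI-style examples. All settings here are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # in practice, a checkpoint already trained on SNLI
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A couple of toy negated examples; 0 = neutral, 1 = entailment.
examples = [
    {"premise": "The man is not eating food.",
     "hypothesis": "The man is not eating a pizza.", "label": 1},
    {"premise": "The man is not eating a pizza.",
     "hypothesis": "The man is not eating food.", "label": 0},
]

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=64)

train_data = Dataset.from_list(examples).map(tokenize, batched=True)

args = TrainingArguments(output_dir="monli_finetune", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_data).train()
```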
00:20:41.140 | When we turn to naturalistic data,
00:20:43.660 | and I emphasize "when" there because I do think
00:20:45.940 | that that's an important component in NLP.
00:20:48.580 | When we move from synthetic to naturalistic,
00:20:51.440 | we do so knowing that BERT can in principle learn negation,
00:20:55.620 | and that data coverage will be a major factor in its performance there.
00:21:00.020 | Those are crisp analytic lessons that we learned
00:21:03.020 | only because we allowed some synthetic evaluations.
00:21:07.260 | That's it. Those are three major questions for datasets in the field.
00:21:12.020 | There are many more though.
00:21:13.300 | I addressed those, but we can also think about issues like datasheets,
00:21:18.640 | that is, disclosures for datasets that help us understand
00:21:21.740 | how they can be used responsibly and where their limits lie.
00:21:25.460 | We should also be thinking much more about how we're going to
00:21:28.220 | achieve cross-linguistic coverage for our benchmarks.
00:21:31.060 | Right now, still to this day, we have too much focus on English,
00:21:35.140 | when in fact we want systems and models that are performant the world over.
00:21:40.220 | We could worry about statistical power,
00:21:42.960 | and of course we should also worry deeply about
00:21:46.020 | the pernicious social biases that are embedded in our datasets,
00:21:49.860 | and how we will get rid of those in order to create technologies that are more equitable.