Stanford XCS224U: NLU I NLP Methods and Metrics, Part 4: Datasets I Spring 2023

00:00:00.000 | Welcome back everyone.

00:00:06.040 | This is part 4 in our series on methods and metrics.

00:00:08.960 | We're going to talk about datasets.

00:00:10.820 | In the previous two screencasts,

00:00:12.520 | we really got in the weeds around

00:00:14.320 | classifier and generation metrics.

00:00:16.520 | I want to pop up a level now and talk

00:00:18.800 | conceptually about the role that

00:00:20.920 | datasets play in our field and how we construct them.

00:00:23.940 | This is really a central topic.

00:00:26.280 | In this context, I'd like to mention

00:00:28.000 | this quotation from the famous oceanographer

00:00:30.400 | and explorer Jacques Cousteau.

00:00:32.140 | Cousteau said, "Water and air,

00:00:33.920 | the two essential fluids on which all life depends."

00:00:36.940 | My analogy here is that datasets are

00:00:39.340 | the resource on which all progress in the field of NLP depends.

00:00:43.480 | Now, Cousteau's quotation actually continues,

00:00:46.280 | "Have become global garbage cans, which is worrisome."

00:00:49.640 | For datasets, I think there are some in the field who would

00:00:52.280 | make the analogy extend to this worrisome aspect,

00:00:55.100 | but I feel optimistic.

00:00:56.560 | I feel we've learned a lot about how to

00:00:58.400 | develop datasets effectively and we

00:01:00.060 | have more datasets than ever.

00:01:02.000 | I think things are on a good trajectory

00:01:04.380 | as long as we're thoughtful.

00:01:05.940 | But it really is important that we get this right

00:01:08.520 | because we ask so much of our datasets.

00:01:11.320 | We use them to optimize models,

00:01:13.800 | to evaluate models, to compare models,

00:01:17.360 | to enable new capabilities in models via training,

00:01:20.960 | to measure field-wide progress,

00:01:23.520 | and for fundamental aspects of scientific inquiry.

00:01:26.680 | This list is pretty much everything

00:01:28.680 | that we do in the field of NLP.

00:01:30.520 | All of it depends on datasets.

00:01:32.360 | You can see it's important that we get these datasets right.

00:01:35.400 | After all, if we don't,

00:01:36.820 | then we've got a very shaky foundation and we might be

00:01:39.200 | tricking ourselves when we think we're making a lot of

00:01:41.920 | progress because datasets really

00:01:44.920 | are in a way the fundamental instrument.

00:01:47.780 | I like this quotation from Aravind Joshi.

00:01:50.120 | The late great Aravind Joshi had

00:01:52.440 | the analogy that datasets are like the telescopes of our field.

00:01:56.280 | When he said this,

00:01:57.640 | this was back in around 2007,

00:01:59.640 | he was actually expressing a concern.

00:02:01.480 | He said that NLP-ers were like astronomers who want to see

00:02:04.880 | the stars but refuse to build any telescopes.

00:02:08.240 | Aravind indeed did try to lead the way

00:02:11.260 | toward creating more datasets and valuing dataset contributions.

00:02:15.200 | I think he would be pleased with

00:02:17.160 | the current state of the field when we

00:02:18.640 | have more datasets than ever.

00:02:21.360 | In that context though, it's worth mentioning

00:02:24.040 | this plot that I've used a few times in this course,

00:02:26.760 | under the heading of benchmark saturating faster than ever.

00:02:30.320 | Remember along this chart, along the x-axis,

00:02:32.520 | I have time going back to the 90s and the y-axis is

00:02:36.000 | a normalized measure of distance from

00:02:37.940 | our so-called human performance,

00:02:39.800 | although we've talked about what

00:02:41.360 | human performance actually means here.

00:02:43.640 | When some people look at this chart,

00:02:45.460 | they see a story of progress where

00:02:47.520 | the benchmarks are indeed saturating faster than ever before.

00:02:51.080 | I think we can't deny that that is evident here.

00:02:54.800 | But the other aspect of this is just the worrisome fact that

00:02:58.360 | none of the systems that are represented in

00:03:00.520 | this chart are superhuman in any meaningful sense.

00:03:03.880 | The fundamental problem there might be that

00:03:06.320 | our datasets are simply not up

00:03:08.720 | to the task of measuring what we want them to measure.

00:03:12.320 | For a alternative perspective on this,

00:03:15.660 | let's talk about the limitations that we find in

00:03:18.400 | these datasets and we do indeed find them more quickly.

00:03:21.460 | Again, for this slide along the x-axis,

00:03:23.840 | I have time stretching back into the 90s.

00:03:26.560 | So far I have one dataset represented,

00:03:28.940 | the Penn Treebank, which is a collection

00:03:30.840 | of syntactic parses for sentences.

00:03:33.060 | It drove progress on syntactic parsing

00:03:35.600 | for decades for maybe too long.

00:03:38.280 | The dots here, the red dots are papers

00:03:41.200 | that are finding errors in the Penn Treebank.

00:03:43.840 | Most of these papers trace to work by

00:03:46.120 | Detmar Meurer and colleagues and hat tip to them for

00:03:49.360 | really thinking carefully about

00:03:51.120 | the quality of the data and trying to improve it.

00:03:53.780 | But one thing that's noteworthy for me is that

00:03:56.160 | despite the very long timeline,

00:03:57.760 | there are relatively few papers and they're all

00:03:59.800 | just about errors in the parse trees.

00:04:02.720 | Let's fast forward to SNLI,

00:04:04.780 | the Stanford Natural Language Inference Benchmark.

00:04:07.460 | This was launched in 2015 and right away,

00:04:10.520 | you get an outpouring of papers that are

00:04:12.660 | finding limitations in this dataset.

00:04:15.080 | It's actually rarer to find papers pointing out

00:04:18.060 | errors in the era of natural language understanding

00:04:20.400 | and error is harder to define.

00:04:22.520 | But we can identify things like artifacts,

00:04:25.200 | those are the orange dots,

00:04:26.620 | and biases, those are the blue dots,

00:04:28.960 | as well as gaps in the dataset

00:04:30.680 | as I've given in that maroon color.

00:04:32.780 | There are lots of dots here

00:04:34.240 | compared with the Penn Treebank.

00:04:35.960 | A similar story holds for SQuAD.

00:04:38.040 | It was launched soon after SNLI,

00:04:40.320 | and again, you get a bunch of papers,

00:04:42.380 | in this case, pointing out artifacts in the dataset.

00:04:45.580 | Then finally, for another illustration,

00:04:47.540 | ImageNet is an interesting case.

00:04:49.140 | It was launched in 2009,

00:04:51.020 | which feels like a previous era of dataset generation.

00:04:54.500 | For a while, it got to lead a quiet life as

00:04:57.540 | a trusted benchmark just like the PTB did.

00:05:00.940 | But then you get an outpouring of papers identifying things

00:05:04.800 | like biases and errors and artifacts and gaps.

00:05:08.900 | We've entered into this era in which if you are

00:05:11.520 | successful with your benchmark and you get a lot of users,

00:05:14.640 | people will also find limitations quickly.

00:05:17.620 | I think that is a healthy dynamic that we should embrace.

00:05:20.340 | It's a little bit hard to take as the creator of a dataset,

00:05:23.660 | but ultimately, I think we can

00:05:25.380 | see that this is a marker of progress,

00:05:27.420 | this skeptical inquiry about these fundamental devices.

00:05:33.060 | To keep things succinct here,

00:05:36.040 | I'm going to identify

00:05:37.140 | three central questions that I'll address for datasets,

00:05:39.980 | and then I'll list out some more at the end of the screencast.

00:05:43.140 | First question, should we rely on naturalistic data,

00:05:47.040 | like data that you scrape from a website

00:05:49.220 | or extract from an existing database,

00:05:51.860 | or should we turn to crowdsourcing?

00:05:54.100 | It's a commonly debated question.

00:05:55.700 | My answer will be, use both.

00:05:58.200 | Second question, should we use adversarial examples or

00:06:01.740 | benchmarks that consist only of the most common cases?

00:06:05.060 | Another thing that's hotly debated,

00:06:06.960 | and my answer is both.

00:06:08.900 | Third question, should we use

00:06:11.260 | synthetic benchmarks or naturalistic benchmarks?

00:06:14.260 | A lot of people in the field think

00:06:15.700 | synthetic benchmarks are fundamentally

00:06:17.660 | problematic and that we should use only naturalistic ones,

00:06:21.260 | but you can probably anticipate my answer at this point.

00:06:24.300 | I think both of them have a role to play.

00:06:26.580 | I'll substantiate all three of

00:06:28.420 | these both as we move through the screencast.

00:06:31.380 | Let's start with that question of

00:06:34.060 | naturalistic data versus crowdsourcing.

00:06:36.740 | The reason I answer both is that this is

00:06:38.580 | basically about trade-offs for me.

00:06:40.420 | For naturalistic datasets,

00:06:42.160 | which you could call found or curated,

00:06:44.260 | like you scrape a website or do

00:06:46.380 | some work to harvest examples from a website,

00:06:49.160 | you have abundance.

00:06:51.300 | It's probably inexpensive to gather

00:06:53.540 | these examples and they will be genuine in

00:06:55.780 | some sense because they were presumably not

00:06:57.780 | created for the sake of your experiment,

00:06:59.680 | but rather for some naturalistic purpose,

00:07:01.900 | some genuine communicative purpose.

00:07:04.980 | On the other hand, these are also weaknesses.

00:07:07.980 | The dataset will be uncontrolled.

00:07:09.700 | You're at the mercy of what you observe in the world.

00:07:12.020 | It will be limited in terms of

00:07:13.700 | the kind of information that you can gather.

00:07:16.180 | It will be maybe intrusive.

00:07:18.660 | It's probably not the case that you got opt-in from

00:07:21.960 | every single person who contributed

00:07:23.660 | data point for this dataset that you've created.

00:07:26.500 | In some sense, you might have a deep concern

00:07:29.480 | about the rights of the people who contributed.

00:07:32.580 | Let's contrast this with crowdsourcing.

00:07:34.860 | I've called this lab-grown.

00:07:36.300 | This is a more artificial thing that you do.

00:07:38.460 | This could be highly controlled

00:07:40.300 | because you set up the task.

00:07:42.300 | It could be privacy preserving in the sense that you could just

00:07:45.060 | make sure everyone who contributes

00:07:46.940 | knows that they're contributing to the dataset,

00:07:49.260 | and you could even offer them

00:07:50.860 | the opportunity to remove themselves at

00:07:52.860 | a later date if they decide that that's important.

00:07:55.820 | This will be genuinely expressive because you can have

00:07:58.980 | crowd workers in principle do even very

00:08:01.420 | complicated things to get data

00:08:03.260 | that you wouldn't observe in the wild.

00:08:05.780 | But then you have the corresponding weaknesses.

00:08:08.180 | This will be scarce.

00:08:09.140 | You'll never have enough crowdsource

00:08:10.860 | data and it will be expensive.

00:08:13.220 | In addition, it can get very contrived.

00:08:15.820 | You're having people do things that are very unnatural,

00:08:19.020 | not things that they would do as a matter of course,

00:08:21.700 | with communication, but rather things

00:08:23.320 | that you set them up to do.

00:08:25.100 | The results of this might feel contrived also in the sense that

00:08:28.540 | you know the crowd workers are trying to please you,

00:08:31.940 | the person who launched the task,

00:08:33.660 | and that might be a goal that's very

00:08:35.420 | different from the one that you actually have in mind.

00:08:38.740 | For me, looking at these trade-offs,

00:08:41.540 | the question is how could we balance all these things?

00:08:44.140 | I do think that we can find

00:08:46.180 | hybrid models that allow us to be both genuine and expressive,

00:08:50.820 | and to preserve in general a lot of

00:08:52.700 | the strengths across these two and minimize the weaknesses.

00:08:56.300 | I've shown you an example of this already.

00:08:58.860 | For Dynascent round 2,

00:09:01.020 | we had two conditions.

00:09:03.060 | One, where workers just wrote a text from

00:09:05.620 | scratch to try to fool a top performing model for sentiment,

00:09:09.060 | and another condition where we gave them

00:09:11.260 | existing sentences that they could

00:09:12.980 | edit in order to achieve that goal.

00:09:15.500 | Fundamentally, I think the editing condition offers

00:09:19.360 | much more naturalism while still

00:09:20.940 | giving us the results that we wanted.

00:09:22.980 | For that prompt condition,

00:09:24.980 | I would first observe that they did edit the text.

00:09:27.820 | We see a wide range of different edit distances

00:09:30.380 | between the original and the thing they produced.

00:09:33.100 | That seems healthy.

00:09:34.500 | Then this is more important.

00:09:36.100 | For example, in terms of length,

00:09:38.500 | we find that the no prompt examples were

00:09:40.880 | very short compared to the prompt ones.

00:09:43.700 | The prompt ones have lengths that are more like

00:09:46.620 | just naturally occurring sentences that we would

00:09:49.180 | harvest in domain from a site like Yelp.

00:09:52.140 | Here's a similar thing for vocabulary size.

00:09:54.880 | The no prompt condition is

00:09:56.660 | very limited in terms of its vocabulary,

00:09:59.860 | whereas we get much more diversity for the prompt condition

00:10:03.040 | approaching the diversity of

00:10:05.020 | vocabulary for naturally occurring cases.

00:10:08.300 | This looks like a clear win for prompting which

00:10:10.860 | mixes naturalism with things we do in the lab.

00:10:13.980 | The result was really wonderful examples that would be hard to

00:10:18.620 | observe that do all sorts of interesting things

00:10:21.580 | linguistically and also play

00:10:23.500 | with non-literal language use and so forth.

00:10:26.540 | I think the hybrid model gave us

00:10:28.460 | the best of both worlds in some sense.

00:10:31.740 | Let's move to our second question.

00:10:34.300 | Should we use adversarial examples or

00:10:36.740 | just benchmarks that contain the most common cases?

00:10:39.900 | Remember, my answer is both.

00:10:41.680 | Just as a reminder,

00:10:43.100 | we talked about this in a previous unit.

00:10:45.060 | For standard evaluations,

00:10:46.620 | you create a dataset from

00:10:48.300 | a single model independent process

00:10:50.520 | and divide it into trained dev test.

00:10:52.920 | Whereas for adversarial assessment,

00:10:55.400 | we have a separate test set created in a way that you suspect or

00:10:59.180 | know will be challenging given

00:11:01.220 | your system and the way it was developed initially.

00:11:04.260 | Then for adversarial datasets in general,

00:11:07.020 | this would be trained dev test where all of

00:11:09.700 | those elements were guided by attempts by people

00:11:12.340 | usually to fool a top-performing model or set of models.

00:11:17.180 | These are the comparisons that we're thinking about here.

00:11:20.460 | I mentioned for you before that there are a bunch of

00:11:23.000 | these fully adversarial datasets covering a wide range of domains.

00:11:28.180 | I think that's been fruitful and I think it's a lesson of

00:11:31.400 | that literature that we're seeing lots of good results,

00:11:33.640 | especially from adversarial training and testing.

00:11:36.700 | But there is an alternative perspective out there,

00:11:39.340 | and I think the most vocal of that perspective is Bowman and Dahl 2021.

00:11:44.760 | I'll offer you some quotes and you should definitely check out the paper.

00:11:48.680 | They write under the heading of adversarial examples not being

00:11:52.080 | a panacea that adversarial filtering can systematically

00:11:56.880 | eliminate coverage of linguistic phenomena or skills that are

00:12:00.080 | necessary for the task but already well-solved by the adversary model.

00:12:04.320 | This mode-seeking as opposed to

00:12:06.720 | mass covering behavior by adversarial filtering,

00:12:09.560 | if left unchecked, tends to reduce

00:12:11.760 | dataset diversity and thus make validity harder to achieve.

00:12:15.560 | I actually frankly think that this is

00:12:17.640 | a totally reasonable perspective and the disconnect

00:12:20.240 | here is the notion of adversarial filtering.

00:12:23.560 | That is certainly not something I would advocate for.

00:12:26.480 | If you think about Dynascent,

00:12:28.220 | our training and Devon test sets all contain

00:12:31.240 | a mixture of cases that were adversarial and cases that the model actually got right,

00:12:36.320 | more like the mode-seeking behavior that they're talking about here.

00:12:39.940 | I do think you could damage a model by doing adversarial filtering,

00:12:43.480 | especially for training, because I think you could put

00:12:46.240 | the model in a very unusual state.

00:12:49.440 | But again, that's not something I was arguing for.

00:12:52.600 | I was arguing for the both perspective,

00:12:55.120 | have benchmarks that contain both the adversarial cases and

00:12:59.760 | the truly normal mode-seeking cases that they're mentioning here.

00:13:04.400 | I would not leave this pressure unchecked.

00:13:07.320 | They also write, "This position paper argues that

00:13:10.360 | concerns about standard benchmarks that

00:13:12.240 | motivate methods like adversarial filtering are justified,

00:13:15.360 | but that they can and should be addressed directly,

00:13:17.920 | and that it is possible and reasonable to do so in

00:13:20.240 | the context of static IID evaluation."

00:13:23.200 | Again, let's set aside the distracting thing about filtering,

00:13:27.000 | and focus on what they claim here,

00:13:29.000 | which is that you,

00:13:30.520 | if you have a massive benchmark,

00:13:33.140 | will simply by virtue of having that massive benchmark,

00:13:36.680 | cover all of the relevant phenomena.

00:13:39.640 | I actually just think that that's factually incorrect.

00:13:42.360 | I think it is very difficult given the complexity of language to develop

00:13:46.280 | a benchmark that is so large that just as a matter of course,

00:13:49.720 | you've covered all the hard cases.

00:13:52.200 | The role of adversarial training examples could be to help us

00:13:56.200 | fill in those gaps in a much more efficient way.

00:13:59.880 | Because remember, the job to be done is a complicated one.

00:14:04.600 | Let's focus on the domain of sentiment.

00:14:06.760 | Yes, we need our models to get normal cases like the food was good, correct.

00:14:12.160 | But we also need them to deal with

00:14:14.560 | these complicated shifts in perspective as in,

00:14:17.140 | my sister hated the food but she's massively wrong,

00:14:20.240 | or the cookies seem dry to my boss but I couldn't disagree more.

00:14:24.600 | We also need them to get things like non-literal language use,

00:14:28.240 | like breakfast is really good if you're trying to feed it to dogs.

00:14:31.280 | That's some sarcasm or irony.

00:14:33.180 | As well as really creative things that people do with language like,

00:14:36.840 | worthy of gasps of foodgasms,

00:14:39.240 | where we get a new use of a suffix here.

00:14:41.860 | We can all immediately intuitive what

00:14:44.040 | this means it's a positive statement.

00:14:46.980 | But we know models will struggle with this innovative use of language,

00:14:51.160 | and we need to push them to overcome that hurdle.

00:14:54.100 | If you just do standard data collection,

00:14:56.360 | you might not see any of these examples or certainly not in

00:14:59.520 | the density that you need to see them to improve our systems.

00:15:03.120 | That's why I would just introduce a measure of

00:15:06.280 | adversarialness into train, dev, and test.

00:15:09.700 | But I would not do any of the filtering that

00:15:11.720 | Bowman and Dahl are worried about.

00:15:14.440 | So for adversarial testing in general,

00:15:17.400 | here's what I would say are the major lessons we've learned so far.

00:15:20.720 | Often, our top performing systems like the one from

00:15:23.880 | that benchmark saturating slide have

00:15:26.320 | found unsystematic solutions that should worry us.

00:15:29.720 | I also noted in earlier units of this course that progress on

00:15:33.440 | challenge sets does seem to correlate with meaningful progress in general.

00:15:37.560 | That's an important insight.

00:15:39.520 | Present-day systems get traction on adversarial cases

00:15:42.920 | without degradation on the general cases.

00:15:45.740 | It'd be worrisome if training on adversarial examples,

00:15:48.600 | even a little bit of them,

00:15:50.020 | caused our systems to perform worse in the general case,

00:15:53.360 | but I think we do not see that happening.

00:15:56.000 | Then the final thing I would say is that whatever your view is

00:15:59.500 | on the role of adversarials in system development,

00:16:02.860 | if you deploy a system out into the world,

00:16:05.700 | the adversarial examples that people cook up and throw at

00:16:09.240 | your system will define public perception for your system.

00:16:13.440 | In the interest of self-preservation,

00:16:15.960 | I would encourage you to think about adversarial dynamics for

00:16:19.080 | evaluation before you do any kind of deployment.

00:16:22.780 | That's why I exhorted you all in an earlier unit for this course to really think

00:16:27.400 | deeply about evaluation and have diverse teams of people with multiple perspectives on

00:16:33.440 | your system participate in that internal evaluation to really find

00:16:37.880 | the cases where your system performs in a problematic way.

00:16:41.720 | You should be your own adversary to the extent that you can to avoid having

00:16:46.240 | really adversarial problems emerge when your system is used in the world.

00:16:52.000 | Final question, synthetic benchmarks or naturalistic benchmarks?

00:16:57.960 | As I said, there is a prominent perspective in the field

00:17:00.880 | that naturalistic benchmarks are the only ones we should be using.

00:17:04.920 | To me, at a scientific level,

00:17:07.280 | this is deeply worrisome because what it does is

00:17:10.400 | introduce two unknowns into almost all the experiments that we run.

00:17:14.660 | The dataset is an unknown in the sense that we don't fully command what its structure is

00:17:19.160 | like and the model is almost by definition in these contexts an unknown.

00:17:23.260 | We're trying to explore its properties.

00:17:25.560 | The situation is like you have this massive dataset that you cannot audit comprehensively.

00:17:31.800 | You might not even fully understand the process that

00:17:34.200 | created it even if you did crowdsourcing.

00:17:36.760 | Then you have that as the input to a model,

00:17:39.560 | which is also a major unknown,

00:17:41.660 | and then you get some output.

00:17:43.680 | The question is, what are the causal factors in this output?

00:17:47.460 | Causal assignment in this case is

00:17:50.240 | very difficult because of the fact that we have two unknowns.

00:17:53.840 | If we could fix the dataset and call it a known quantity,

00:17:59.040 | then we could trace aspects of

00:18:01.200 | the output to properties of the model that we have manipulated.

00:18:04.800 | But with two unknowns in play,

00:18:06.580 | this will always be uncertain.

00:18:08.720 | I gave you a story about this before.

00:18:11.100 | Let me briefly rehearse it.

00:18:12.500 | This is under the heading of negation as a learning target.

00:18:16.160 | Remember, we have this idea that we should have systems that know that if A entails B,

00:18:21.060 | then not B entails not A,

00:18:22.920 | the entailment reversing property of negation.

00:18:26.400 | We have an observation across a lot of different papers that

00:18:29.860 | top performing NLI models fail to hit that learning target.

00:18:34.320 | It's very tempting to conclude here that the model is the problem.

00:18:39.440 | Top performing models seem incapable of learning negation,

00:18:43.040 | but we have an observation that our datasets,

00:18:46.400 | the naturalistic benchmarks these models were trained on,

00:18:49.500 | severely under-represent negation.

00:18:52.800 | Now, we don't know whether the issue is

00:18:56.000 | with the models or with the dataset because we have two unknowns.

00:19:01.240 | In response to that,

00:19:02.920 | we created what I've called here a slightly synthetic benchmark,

00:19:06.460 | that is monotonicity NLI or MoNLI.

00:19:09.640 | Recall it has two parts,

00:19:11.200 | a positive part where we take existing SNLI hypotheses and use

00:19:16.160 | WordNet to create new examples that fire off

00:19:19.700 | the systematic cases where we get A neutral B and B entailment A.

00:19:24.600 | That's the positive part.

00:19:26.000 | We did the same thing for negated examples.

00:19:28.880 | Now, after the replacement,

00:19:31.380 | we get the reverse of those patterns.

00:19:33.880 | What this leads us to is a dataset that

00:19:36.920 | has naturally occurring cases as its basis,

00:19:39.960 | but a systematic manipulation that leaves us with

00:19:42.920 | complete guarantees that we have a certain representation

00:19:46.960 | for lexical entailment and negation.

00:19:50.200 | That's why it's slightly synthetic.

00:19:52.800 | Then when we use this as a challenge dataset,

00:19:55.160 | we get a blast of insight, I claim.

00:19:57.480 | Let's look at the BERT row here.

00:19:59.460 | BERT is performing extremely well on SNLI,

00:20:02.760 | extremely well on the positive part of our synthetic benchmark,

00:20:06.660 | but essentially hitting zero for the negative part of our benchmark.

00:20:11.140 | It's obviously just ignoring the negations.

00:20:13.900 | What is the issue here?

00:20:15.460 | Is it data or is it the model?

00:20:17.920 | Well, when we do a modest amount of fine-tuning on negative MoNLI examples,

00:20:22.740 | we immediately boost performance for the model on that split.

00:20:27.020 | That shows us definitively that when we show

00:20:30.220 | a model like BERT relevant negation cases,

00:20:33.280 | it can handle the task.

00:20:35.460 | Now, as a result of having a known dataset,

00:20:38.380 | we have learned something directly about our model.

00:20:41.140 | When we turn to naturalistic data,

00:20:43.660 | and I emphasize when there because I do think

00:20:45.940 | that that's an important component in NLP.

00:20:48.580 | When we move from synthetic to naturalistic,

00:20:51.440 | we do so knowing that BERT can in principle learn negation,

00:20:55.620 | and that data coverage will be a major factor in its performance there.

00:21:00.020 | Those are crisp analytic lessons that we learned

00:21:03.020 | only because we allowed some synthetic evaluations.

00:21:07.260 | That's it. Those are three major questions for datasets in the field.

00:21:12.020 | There are many more though.

00:21:13.300 | I address these, but we can also think about issues like data sheets,

00:21:18.640 | that is disclosures for datasets that help us understand

00:21:21.740 | how they can be used responsibly and where their limits lie.

00:21:25.460 | We should also be thinking much more about how we're going to

00:21:28.220 | achieve cross-linguistic coverage for our benchmarks.

00:21:31.060 | Right now we have still to this day too much focus on English,

00:21:35.140 | when in fact we want systems and models that are performant the world over.

00:21:40.220 | We could worry about statistical power,

00:21:42.960 | and of course we should also worry deeply about

00:21:46.020 | the pernicious social biases that are embedded in our datasets,

00:21:49.860 | and how we will get rid of those in order to create technologies that are more equitable.

00:21:55.860 | [BLANK_AUDIO]

Stanford XCS224U: NLU I NLP Methods and Metrics, Part 4: Datasets I Spring 2023

Chapters