Stanford XCS224U: NLU I Behavioral Eval of NLU Models, Pt 7: DynaSent and Conclusion I Spring 2023

00:00:00.000 | Hello everyone, welcome back.
00:00:06.600 | This is the seventh and final screencast in
00:00:09.120 | our series on Advanced Behavioral Evaluation for NLU.
00:00:12.440 | In the previous screencast,
00:00:13.900 | I introduced the idea of having
00:00:15.640 | adversarial train sets in the mix, and I did that in
00:00:18.320 | the context of adversarial NLI
00:00:20.780 | and I talked briefly about the Dynabench platform.
00:00:23.800 | For this screencast, we're going to build on those ideas.
00:00:26.580 | This is going to be a deep dive on the DynaSent dataset,
00:00:29.600 | which I was involved in creating.
00:00:31.360 | You've actually worked already with DynaSent in
00:00:33.760 | the context of assignment 1 and the associated bake-off.
00:00:36.960 | I'm describing it now because I think it
00:00:38.820 | introduces some new concepts and also
00:00:41.400 | because it might be a useful resource
00:00:43.220 | for you as you think about projects.
00:00:45.240 | All our data, code, and models are available on GitHub.
00:00:48.680 | DynaSent is a substantial resource with
00:00:51.280 | over 120,000 sentences across two rounds,
00:00:54.720 | and each one of those examples has
00:00:56.560 | five gold labels from crowd workers.
00:00:59.520 | There's the associated paper.
00:01:01.800 | As I said, round 2 for this was created on
00:01:05.000 | Dynabench with an interesting adversarial
00:01:07.280 | dynamic that I'll talk about.
00:01:09.400 | This is a complete project overview.
00:01:11.960 | We're going to walk through this diagram in some detail.
00:01:14.440 | At a high level though, I think you can see there are
00:01:16.240 | two rounds and there are also two models in the mix.
00:01:19.380 | Those are the red boxes.
00:01:21.080 | At each round, we're going to do extensive human validation.
00:01:24.800 | Let's dive into round 1.
00:01:27.040 | The starting point for this is model 0,
00:01:29.400 | which we use as a device for finding
00:01:32.480 | interesting cases naturally occurring on the web.
00:01:36.400 | Those are human validated and that
00:01:38.200 | gives us our round 1 dataset.
00:01:40.720 | In a bit more detail,
00:01:42.400 | model 0 is a RoBERTa-based model that was
00:01:45.160 | fine-tuned on a whole lot of sentiment examples.
00:01:49.080 | These are the five benchmarks that
00:01:50.840 | we used to develop this model.
00:01:52.720 | All of these datasets for us are cast as
00:01:55.220 | ternary sentiment problems, that is, positive,
00:01:57.880 | negative, and neutral.
00:01:59.320 | You can see from this slide that we are training on
00:02:01.940 | a really substantial number of sentiment examples.
00:02:05.960 | We're going to benchmark these models against
00:02:08.440 | three external datasets, SST-3, Yelp, and Amazon.
00:02:13.440 | The table here is showing that the model
00:02:15.560 | does well on all three of those benchmarks.
00:02:17.900 | The results are not stellar.
00:02:19.300 | I think this is a pretty hard multi-domain
00:02:22.400 | problem for sentiment.
00:02:23.840 | But in general, this looks like
00:02:25.600 | a solid device for finding interesting cases.
00:02:29.040 | That's the role that this will play.
00:02:30.840 | We are primarily thinking of using model 0 as
00:02:33.360 | a device for harvesting examples from the wild.
00:02:37.080 | The space we explore is the Yelp open dataset,
00:02:40.640 | and we use the following heuristic.
00:02:42.500 | We favor sentences where the review is
00:02:44.920 | one star and model 0 predicted positive.
00:02:48.880 | Conversely, where the review is five star and
00:02:52.000 | model 0 predicted negative.
00:02:53.920 | This is a heuristic that we think on average will lead us
00:02:57.360 | to examples that model 0 is getting incorrect.
00:03:00.800 | But it is just a heuristic.
00:03:03.020 | The only labels we use are ones that are
00:03:05.320 | derived from a human validation phase.
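
The harvesting heuristic just described can be sketched as follows. This is a minimal illustration, not the project's code: `harvest_candidates`, the `(text, stars)` tuple layout, and the toy `keyword_model` standing in for model 0 are all hypothetical, and the sketch treats the heuristic as a hard filter even though the talk only says such sentences were favored.

```python
from typing import Callable, Iterable, List, Tuple


def harvest_candidates(
    sentences: Iterable[Tuple[str, int]],
    predict: Callable[[str], str],
) -> List[Tuple[str, int, str]]:
    """Keep sentences whose review star rating conflicts with the prediction.

    `sentences` yields (sentence_text, review_stars) pairs and `predict` maps
    a sentence to "positive", "negative", or "neutral". Both interfaces are
    assumptions made for this sketch.
    """
    selected = []
    for text, stars in sentences:
        prediction = predict(text)
        one_star_positive = stars == 1 and prediction == "positive"
        five_star_negative = stars == 5 and prediction == "negative"
        if one_star_positive or five_star_negative:
            selected.append((text, stars, prediction))
    return selected


def keyword_model(text: str) -> str:
    """Toy stand-in for model 0 (hypothetical): a crude keyword classifier."""
    lowered = text.lower()
    if "great" in lowered:
        return "positive"
    if "bad" in lowered:
        return "negative"
    return "neutral"


reviews = [
    ("Great, another hour waiting for a table.", 1),
    ("The food was great.", 5),
    ("Not bad at all.", 5),
    ("We arrived at noon.", 3),
]
candidates = harvest_candidates(reviews, keyword_model)
```

The selected sentences are only candidates. As the talk stresses, the labels that end up in the dataset come from human validation, not from this heuristic.
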
00:03:07.840 | This slide is showing you the interface that we used.
00:03:10.420 | The code is actually available in the project repository.
00:03:13.600 | You can see at a high level that reviewers were making
00:03:16.000 | a choice about whether a sentence had positive,
00:03:18.780 | negative, no sentiment, or mixed sentiment labels.
00:03:21.960 | Each example was validated by five workers.
00:03:25.920 | The resulting dataset is quite substantial.
00:03:28.920 | This is a summary of the numbers.
00:03:30.880 | First, I would point out that 47 percent of
00:03:34.080 | the examples are adversarial,
00:03:36.080 | which seems to me a high rate.
00:03:37.720 | But the dataset includes
00:03:39.180 | both adversarial and non-adversarial cases.
00:03:41.720 | I think that's important to making a high-quality benchmark.
00:03:45.360 | There are two ways that you can think about
00:03:47.760 | training on this resource.
00:03:49.520 | The standard one would be what we call majority label training.
00:03:53.060 | This is the case where you infer that the label for an example is
00:03:57.200 | the label that was chosen by
00:03:58.860 | at least three of the five people who labeled it.
00:04:01.760 | If there is no such majority label,
00:04:03.920 | you put that in that separate elsewhere category.
00:04:06.760 | That leads to a substantial resource.
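
Majority label inference as described here can be sketched in a few lines. The function name `majority_label` and the label strings are assumptions for illustration; the threshold of three out of five comes from the talk.

```python
from collections import Counter
from typing import List, Optional


def majority_label(labels: List[str], threshold: int = 3) -> Optional[str]:
    """Return the label chosen by at least `threshold` annotators, else None."""
    if not labels:
        return None
    # most_common(1) gives the single most frequent label and its count.
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= threshold else None


clear_case = ["positive", "positive", "positive", "neutral", "mixed"]
split_case = ["positive", "positive", "negative", "negative", "neutral"]

clear_result = majority_label(clear_case)  # three votes for "positive"
split_result = majority_label(split_case)  # no label reaches three votes
```

With five annotators and a threshold of three, at most one label can qualify, so checking only the most frequent label is enough. Examples for which the function returns `None` are the ones that fall into the no-majority category.
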
00:04:09.240 | However, we find that it is more powerful to
00:04:12.040 | do what we call distributional training.
00:04:14.300 | In distributional training,
00:04:15.560 | you repeat each example five times with each of
00:04:18.840 | the labels that it got from the crowd workers
00:04:21.200 | and train on that entire set.
00:04:23.500 | The result is that you don't have to worry about
00:04:25.800 | the no majority category anymore,
00:04:27.800 | so you keep all your examples.
00:04:29.820 | You also intuitively get a much more nuanced perspective
00:04:33.720 | on the sentiment judgments that people offered.
00:04:36.300 | Some are clear cases with five out of five,
00:04:38.680 | and some actually have pretty mixed distributions across
00:04:41.520 | the labels and you're training models
00:04:43.840 | on all of that information.
00:04:46.040 | Then we find in practice that that leads to more robust models.
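
The data expansion behind distributional training can be sketched like this. The name `expand_distributional`, the `(text, labels)` layout, and the example sentences are hypothetical; only the idea of repeating each example once per crowd-worker label comes from the talk.

```python
from typing import List, Tuple


def expand_distributional(
    examples: List[Tuple[str, List[str]]],
) -> List[Tuple[str, str]]:
    """Repeat each sentence once per crowd-worker label it received."""
    return [(text, label) for text, labels in examples for label in labels]


annotated = [
    ("Best tacos in town.", ["positive"] * 5),
    (
        "The wait was long but the staff were kind.",
        ["mixed", "mixed", "positive", "negative", "neutral"],
    ),
]
train_pairs = expand_distributional(annotated)
```

Every sentence is kept, including the second one, which has no majority label. If the classifier is trained with a standard cross-entropy loss (an assumption here, not something stated in the talk), the repeated copies pull its predictions toward the empirical distribution of labels for each sentence.
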
00:04:50.000 | For the dev and test sets,
00:04:51.600 | we restrict attention to positive,
00:04:53.200 | negative, and neutral to have a clean three-class ternary sentiment problem,
00:04:57.680 | and we balanced across those three labels for both dev and test.
00:05:02.400 | How do we do? Well, let's think first about
00:05:05.920 | Model 0 and its performance on this benchmark.
00:05:08.380 | This is a summary.
00:05:09.520 | We set things up so that Model 0 performs at
00:05:13.260 | chance on round 1.
00:05:14.920 | No information coming from Model 0 about the labels,
00:05:18.120 | and then you have the summary numbers from before on
00:05:21.600 | how it does on all of those external benchmarks.
00:05:24.800 | Humans by contrast do extremely well on round 1.
00:05:28.480 | We estimate that the F1 for humans is around 88 percent.
00:05:33.640 | That's a high number and it also
00:05:35.680 | arguably understates the level of agreement.
00:05:37.960 | We note that 614 of our 1,200 workers
00:05:41.380 | never disagreed with the majority label.
00:05:44.000 | This looks to us like a very high rate of agreement and
00:05:47.840 | consistency for humans on this resource.
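
A count like "workers who never disagreed with the majority label" could be computed along the following lines. This is a sketch under stated assumptions, not the procedure used in the paper: the `(example_id, worker_id, label)` layout and the function names are hypothetical, and examples with no majority label are simply skipped when checking for disagreement.

```python
from collections import Counter, defaultdict
from typing import List, Optional, Set, Tuple


def majority(labels: List[str], threshold: int = 3) -> Optional[str]:
    """Label chosen by at least `threshold` annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= threshold else None


def workers_never_disagreeing(
    responses: List[Tuple[str, str, str]],
) -> Set[str]:
    """Return workers whose labels always matched the majority label.

    `responses` holds (example_id, worker_id, label) triples. Examples
    without a majority label are ignored (an assumption of this sketch).
    """
    labels_by_example = defaultdict(list)
    for example_id, _, label in responses:
        labels_by_example[example_id].append(label)
    majorities = {ex: majority(labs) for ex, labs in labels_by_example.items()}

    all_workers, disagreed = set(), set()
    for example_id, worker_id, label in responses:
        all_workers.add(worker_id)
        gold = majorities[example_id]
        if gold is not None and label != gold:
            disagreed.add(worker_id)
    return all_workers - disagreed


responses = [
    ("ex1", "w1", "positive"), ("ex1", "w2", "positive"),
    ("ex1", "w3", "positive"), ("ex1", "w4", "negative"),
    ("ex1", "w5", "positive"),
]
consistent = workers_never_disagreeing(responses)
```

In this toy input only `w4` ever departs from the majority label, so the other four workers are returned.
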
00:05:52.200 | Here just to round out the discussion of round 1 are
00:05:56.040 | some randomly sampled short examples showing you
00:05:58.960 | every combination of model prediction and distribution
00:06:02.400 | across the labels focused on the majority label in this case.
00:06:06.120 | You see a lot of interesting nuanced linguistic things,
00:06:09.360 | and I think a lot of use of non-literal language.
00:06:13.580 | Let's move now to round 2.
00:06:16.260 | This is substantially different.
00:06:18.080 | We begin from Model 1,
00:06:19.900 | and this is a RoBERTa model that was fine-tuned on
00:06:22.520 | those external sentiment benchmarks
00:06:25.400 | as well as all of our round 1 data.
00:06:27.780 | The intuition here coming from the ANLI project is that we
00:06:31.100 | should train models on previous rounds
00:06:33.480 | of our own dynamic dataset collection.
00:06:36.580 | Instead of harvesting examples from the wild in this phase,
00:06:39.820 | we're going to use Dynabench to crowdsource
00:06:41.860 | sentences that fool Model 1.
00:06:44.380 | We'll human validate those and that will
00:06:46.100 | lead us to our round 2 dataset.
00:06:48.720 | Let's think a little bit about Model 1.
00:06:51.260 | Again, this is a RoBERTa-based classifier,
00:06:53.720 | and it is trained on those same external benchmarks,
00:06:56.860 | but now down-sampled somewhat so that we can give
00:06:59.660 | a lot of weight to round 1,
00:07:01.660 | which is now in the mix.
00:07:03.300 | These models are still trained on
00:07:05.100 | a substantial amount of data.
00:07:07.660 | We're trying to offer some evidence that round 1
00:07:11.740 | is the important thing to actually focus on for this model.
00:07:15.140 | How do we do? This is a summary of performance on
00:07:18.540 | the external datasets as well as round 1.
00:07:21.740 | You can see down here that this model is getting
00:07:24.340 | around 80 percent on our round 1 data with
00:07:27.420 | essentially no loss in
00:07:29.140 | performance on those external benchmarks.
00:07:31.180 | There is a bit of a drop.
00:07:32.540 | I think we are performing some domain shift
00:07:35.280 | by emphasizing round 1 as I described.
00:07:38.300 | But overall, we're maintaining pretty good performance while
00:07:41.480 | doing quite well on the round 1 dataset.
00:07:45.940 | I want to do a deep dive a little bit on how
00:07:49.940 | the examples were crowdsourced because I think this is
00:07:52.180 | an interesting nuance around how to get people
00:07:54.700 | to write productively in a crowdsourcing context.
00:07:57.820 | In the original interface,
00:07:59.600 | we simply did more or less what was done for ANLI,
00:08:02.340 | which is that we asked people to write a sentence from
00:08:05.560 | scratch that would fool the model in a particular way.
00:08:09.020 | We found though that that's
00:08:10.980 | a very difficult creative writing task,
00:08:13.560 | and it leads people to do
00:08:14.820 | similar things over multiple examples,
00:08:17.060 | which we intuited would lead to
00:08:19.060 | artifacts in the resulting dataset.
00:08:21.700 | We switched to emphasizing what we call the prompt condition.
00:08:25.560 | In the prompt condition,
00:08:26.920 | we actually offer crowd workers
00:08:29.340 | a naturally occurring sentence that comes from
00:08:31.740 | the Yelp open dataset and their task is to edit
00:08:35.300 | that sentence in order to achieve
00:08:37.140 | this task of fooling the model.
00:08:39.180 | The result is a dataset that's much more high-quality and
00:08:42.420 | has much more naturalistic examples in it.
00:08:45.500 | For validation, we did the same thing as round 1,
00:08:48.740 | and that leads to a dataset that looks like this.
00:08:51.100 | There are only 19 percent adversarial examples in this.
00:08:54.460 | I think this shows that by now in the process,
00:08:56.980 | we have a very strong sentiment model
00:08:58.980 | that is very difficult to fool.
00:09:01.060 | But 19 percent is still a substantial number numerically,
00:09:04.340 | and so we feel like we're in good shape.
00:09:06.460 | Overall, it's a somewhat smaller benchmark,
00:09:09.040 | but it has similar structure.
00:09:10.900 | We can do majority label training
00:09:13.020 | as well as distributional training,
00:09:15.020 | and we have balanced dev and test.
00:09:16.820 | They just happen to be a little smaller than round 1.
00:09:20.700 | How does model 1 do versus humans?
00:09:23.980 | Well, again, we set things up so that model 1
00:09:26.460 | would perform at chance on our round 2 data,
00:09:28.940 | and you saw that model 1 does pretty
00:09:30.700 | well on the round 1 data.
00:09:32.540 | For humans though, this round is extremely intuitive.
00:09:36.100 | Our estimate of F1 for humans
00:09:37.820 | is actually higher than for round 1.
00:09:39.460 | We're now at around 90 percent.
00:09:41.740 | Here, 116 of our 244 workers
00:09:45.180 | never disagreed with the majority label.
00:09:47.420 | Again, a substantial level of agreement on what
00:09:51.300 | are clearly very difficult sentiment problems.
00:09:54.860 | Just to round this out, I thought I'd show
00:09:56.780 | another sample of examples from this round.
00:09:59.860 | Again, showing model 1 predictions
00:10:02.220 | in every way that the majority label could have played out.
00:10:05.220 | I think even more than in round 1,
00:10:07.580 | what we start to see are examples that make
00:10:10.020 | extensive use of intricate syntactic structures,
00:10:13.820 | and also intricate use of non-literal language like
00:10:17.780 | metaphor and sarcasm and irony as
00:10:21.140 | techniques for coming up with examples that are
00:10:23.020 | intuitive for us as humans,
00:10:24.660 | but are routinely very challenging for even our best models.
00:10:30.260 | That is DynaSent.
00:10:32.260 | Let me use this opportunity to
00:10:33.780 | just wrap things up with a few conclusions.
00:10:36.620 | These are all meant to be open questions designed to have us
00:10:39.660 | looking ahead to the future of adversarial training and testing.
00:10:44.020 | Core question here, can adversarial training improve systems?
00:10:49.100 | I think overall, we're seeing evidence that the answer is yes,
00:10:52.540 | but there is some nuance there,
00:10:54.060 | and I think it's going to take some calibration
00:10:56.100 | to get this exactly right.
00:10:58.500 | What constitutes a fair non-IID generalization test?
00:11:03.540 | I introduced this notion of fairness when we discussed
00:11:06.220 | the analytic considerations around
00:11:07.900 | all these behavioral evaluations,
00:11:09.880 | and then this became very pressing when we talked about why
00:11:12.900 | some of the COGS and ReCOGS splits are so difficult.
00:11:16.420 | The question arises whether it's even fair to be asking
00:11:20.020 | our machine learning systems to generalize in
00:11:22.860 | particular ways that might nonetheless
00:11:25.140 | seem pretty intuitive for us as humans.
00:11:28.340 | Can hard behavioral testing provide us with
00:11:31.460 | the insights we need when it comes
00:11:33.460 | to certifying systems as trustworthy?
00:11:35.920 | If so, which tests?
00:11:37.540 | If not, what should we do instead?
00:11:39.720 | I think this is a crucial question.
00:11:41.500 | I think in a way we know that the answer is no.
00:11:44.500 | No amount of behavioral testing can offer
00:11:47.180 | us the guarantees that we're seeking.
00:11:49.520 | But it is a powerful component
00:11:52.780 | in getting closer to
00:11:54.020 | deeply understanding what these systems are like,
00:11:56.260 | and certainly we can use behavioral testing to
00:11:58.460 | find cases where they definitely fall down.
00:12:01.120 | But for actual certification of safety and trustworthiness,
00:12:04.740 | I believe we will need to go deeper,
00:12:06.660 | and that is the topic of the next unit of the course.
00:12:10.500 | Fundamentally, are our best systems finding systematic solutions?
00:12:15.560 | If the answer is yes,
00:12:16.900 | we will feel as humans that we can trust them.
00:12:19.780 | If the answer is no,
00:12:21.140 | even if they seem to behave well in some scenarios,
00:12:23.800 | we might always worry that they're going to do
00:12:25.760 | things that are totally baffling to us.
00:12:28.300 | Then finally, the big juicy cognitive and philosophical question,
00:12:32.920 | where humans generalize in ways that
00:12:35.200 | are unsupported by direct experience,
00:12:37.600 | how should AI respond in terms of system design?
00:12:40.820 | What should we do in order to achieve
00:12:43.260 | these very unusual quasi-cognitive,
00:12:46.520 | quasi-behavioral learning targets?
00:12:48.660 | I don't have a way to resolve this question,
00:12:50.900 | but I think it's really pressing when we think about
00:12:53.620 | really challenging our systems to do complex things with language.