
Stanford XCS224U: NLU | Behavioral Eval of NLU Models, Pt 7: DynaSent and Conclusion | Spring 2023



00:00:00.000 | Hello everyone, welcome back.
00:00:06.600 | This is the seventh and final screencast in
00:00:09.120 | our series on Advanced Behavioral Evaluation for NLU.
00:00:12.440 | In the previous screencast,
00:00:13.900 | I introduced the idea of having
00:00:15.640 | adversarial train sets in the mix, and I did that in
00:00:18.320 | the context of adversarial NLI
00:00:20.780 | and I talked briefly about the Dynabench platform.
00:00:23.800 | For this screencast, we're going to build on those ideas.
00:00:26.580 | This is going to be a deep dive on the DynaSent dataset,
00:00:29.600 | which I was involved in creating.
00:00:31.360 | You've actually worked already with DynaSent in
00:00:33.760 | the context of assignment 1 and the associated bake-off.
00:00:36.960 | I'm describing it now because I think it
00:00:38.820 | introduces some new concepts and also
00:00:41.400 | because it might be a useful resource
00:00:43.220 | for you as you think about projects.
00:00:45.240 | All our data, code, and models are available on GitHub.
00:00:48.680 | DynaSent is a substantial resource with
00:00:51.280 | over 120,000 sentences across two rounds,
00:00:54.720 | and each one of those examples has
00:00:56.560 | five gold labels from crowd workers.
00:00:59.520 | There's the associated paper.
00:01:01.800 | As I said, round 2 for this was created on
00:01:05.000 | Dynabench with an interesting adversarial
00:01:07.280 | dynamic that I'll talk about.
00:01:09.400 | This is a complete project overview.
00:01:11.960 | We're going to walk through this diagram in some detail.
00:01:14.440 | At a high level though, I think you can see there are
00:01:16.240 | two rounds and there are also two models in the mix.
00:01:19.380 | Those are the red boxes.
00:01:21.080 | At each round, we're going to do extensive human validation.
00:01:24.800 | Let's dive into round 1.
00:01:27.040 | The starting point for this is model 0,
00:01:29.400 | which we use as a device for finding
00:01:32.480 | interesting cases naturally occurring on the web.
00:01:36.400 | Those are human validated and that
00:01:38.200 | gives us our round 1 dataset.
00:01:40.720 | In a bit more detail,
00:01:42.400 | model 0 is a RoBERTa-based model that was
00:01:45.160 | fine-tuned on a whole lot of sentiment examples.
00:01:49.080 | These are the five benchmarks that
00:01:50.840 | we used to develop this model.
00:01:52.720 | All of these datasets for us are cast as
00:01:55.220 | ternary sentiment problems, that is, positive,
00:01:57.880 | negative, and neutral.
00:01:59.320 | You can see from this slide that we are training on
00:02:01.940 | a really substantial number of sentiment examples.
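As a rough illustration of that ternary casting, here is a minimal sketch; the specific star-to-label mapping is an assumed convention, not necessarily the exact scheme used for these training sets.

```python
# Hypothetical sketch: casting star-rated reviews as a ternary sentiment task.
# The 1-2 / 3 / 4-5 star mapping is an assumed convention, not necessarily the
# exact scheme used to build these training sets.

def stars_to_ternary(stars: int) -> str:
    """Map a 1-5 star rating to a ternary sentiment label."""
    if stars <= 2:
        return "negative"
    elif stars == 3:
        return "neutral"
    return "positive"

reviews = [
    {"text": "Terrible service.", "stars": 1},
    {"text": "It was fine, nothing special.", "stars": 3},
    {"text": "Loved every minute!", "stars": 5},
]

ternary_data = [
    {"text": r["text"], "label": stars_to_ternary(r["stars"])} for r in reviews
]
print(ternary_data)
```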
00:02:05.960 | We're going to benchmark these models against
00:02:08.440 | three external datasets, SST3, Yelp, and Amazon.
00:02:13.440 | The table here is showing that the model
00:02:15.560 | does well on all three of those benchmarks.
00:02:17.900 | The results are not stellar.
00:02:19.300 | I think this is a pretty hard multi-domain
00:02:22.400 | problem for sentiment.
00:02:23.840 | But in general, this looks like
00:02:25.600 | a solid device for finding interesting cases.
00:02:29.040 | That's the role that this will play.
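For concreteness, here is a small sketch of how that benchmarking can be scored; macro-averaged F1 over the three classes is assumed here as the metric.

```python
from sklearn.metrics import f1_score

# Toy illustration of scoring a ternary sentiment model on one benchmark;
# macro-averaged F1 over the three classes is assumed here as the metric.
gold = ["positive", "negative", "neutral", "positive", "negative"]
pred = ["positive", "neutral", "neutral", "positive", "negative"]

labels = ["positive", "negative", "neutral"]
macro_f1 = f1_score(gold, pred, labels=labels, average="macro")
print(f"Macro-F1: {macro_f1:.3f}")
```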
00:02:30.840 | We are primarily thinking of using model 0 as
00:02:33.360 | a device for harvesting examples from the wild.
00:02:37.080 | The space we explore is the Yelp open dataset,
00:02:40.640 | and we use the following heuristic.
00:02:42.500 | We favor sentences where the review is
00:02:44.920 | one star and model 0 predicted positive.
00:02:48.880 | Conversely, where the review is five stars and
00:02:52.000 | model 0 predicted negative.
00:02:53.920 | This is a heuristic that we think on average will lead us
00:02:57.360 | to examples that model 0 is getting incorrect.
00:03:00.800 | But it is just a heuristic.
00:03:03.020 | The only labels we use are ones that are
00:03:05.320 | derived from a human validation phase.
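Here is a minimal sketch of that harvesting heuristic, assuming Yelp reviews have already been split into sentences paired with their star ratings; the `predict_sentiment` callable is a hypothetical stand-in for Model 0 inference.

```python
# Hypothetical sketch of the harvesting heuristic: keep sentences from
# one-star reviews that the model labels positive, and sentences from
# five-star reviews that the model labels negative. `predict_sentiment`
# stands in for whatever inference function wraps Model 0.

def harvest_candidates(sentences, predict_sentiment):
    """`sentences` is an iterable of (sentence, review_stars) pairs;
    `predict_sentiment` maps a sentence to 'positive', 'negative', or 'neutral'."""
    candidates = []
    for sentence, stars in sentences:
        pred = predict_sentiment(sentence)
        # On average, these disagreements should be cases the model gets wrong.
        if (stars == 1 and pred == "positive") or (stars == 5 and pred == "negative"):
            candidates.append(sentence)
    return candidates

# Example usage with a stand-in predictor that always says "positive":
demo = [("Worst meal I've ever had.", 1), ("Absolutely perfect evening.", 5)]
print(harvest_candidates(demo, lambda s: "positive"))  # -> ["Worst meal I've ever had."]
```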
00:03:07.840 | This slide is showing you the interface that we used.
00:03:10.420 | The code is actually available in the project repository.
00:03:13.600 | You can see at a high level that reviewers were making
00:03:16.000 | a choice about whether a sentence should receive the positive,
00:03:18.780 | negative, no sentiment, or mixed sentiment label.
00:03:21.960 | Each example was validated by five workers.
00:03:25.920 | The resulting dataset is quite substantial.
00:03:28.920 | This is a summary of the numbers.
00:03:30.880 | First, I would point out that 47 percent of
00:03:34.080 | the examples are adversarial,
00:03:36.080 | which seems to me a high rate.
00:03:37.720 | But the dataset includes
00:03:39.180 | both adversarial and non-adversarial cases.
00:03:41.720 | I think that's important to making a high-quality benchmark.
00:03:45.360 | There are two ways that you can think about
00:03:47.760 | training on this resource.
00:03:49.520 | The standard one would be what we call majority label training.
00:03:53.060 | This is the case where you infer that the label for an example is
00:03:57.200 | the label that was chosen by
00:03:58.860 | at least three of the five people who labeled it.
00:04:01.760 | If there is no such majority label,
00:04:03.920 | you put that example into a separate 'no majority' category.
00:04:06.760 | That leads to a substantial resource.
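A minimal sketch of majority-label inference, assuming each example carries its list of five worker labels:

```python
from collections import Counter

def majority_label(worker_labels, threshold=3):
    """Return the label chosen by at least `threshold` of the workers,
    or 'no majority' if no label reaches that threshold."""
    label, count = Counter(worker_labels).most_common(1)[0]
    return label if count >= threshold else "no majority"

print(majority_label(["positive", "positive", "neutral", "positive", "mixed"]))  # positive
print(majority_label(["positive", "negative", "neutral", "mixed", "positive"]))  # no majority
```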
00:04:09.240 | However, we find that it is more powerful to
00:04:12.040 | do what we call distributional training.
00:04:14.300 | In distributional training,
00:04:15.560 | you repeat each example five times with each of
00:04:18.840 | the labels that it got from the crowd workers
00:04:21.200 | and train on that entire set.
00:04:23.500 | The result is that you don't have to worry about
00:04:25.800 | the no majority category anymore,
00:04:27.800 | so you keep all your examples.
00:04:29.820 | You also intuitively get a much more nuanced perspective
00:04:33.720 | on the sentiment judgments that people offered.
00:04:36.300 | Some are clear cases with five out of five,
00:04:38.680 | and some actually have pretty mixed distributions across
00:04:41.520 | the labels, and you're training models
00:04:43.840 | on all of that information.
00:04:46.040 | Then we find in practice that that leads to more robust models.
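Here is a small sketch contrasting the two training formats, again assuming each example stores its five worker responses; the distributional version simply repeats each text once per response.

```python
from collections import Counter

def to_majority_training_set(examples):
    """One (text, label) row per example; drop cases with no 3-of-5 majority."""
    rows = []
    for ex in examples:
        label, count = Counter(ex["worker_labels"]).most_common(1)[0]
        if count >= 3:
            rows.append((ex["text"], label))
    return rows

def to_distributional_training_set(examples):
    """Repeat each text once per worker response, so nothing is dropped and
    the full label distribution is preserved."""
    return [(ex["text"], label) for ex in examples for label in ex["worker_labels"]]

example = {
    "text": "The food was great but the service was slow.",
    "worker_labels": ["positive", "mixed", "positive", "negative", "positive"],
}
print(to_majority_training_set([example]))        # one row labeled 'positive'
print(to_distributional_training_set([example]))  # five rows, one per response
```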
00:04:50.000 | For the dev and test sets,
00:04:51.600 | we restrict attention to positive,
00:04:53.200 | negative, and neutral to have a clean ternary sentiment problem,
00:04:57.680 | and we balanced across those three labels for both dev and test.
00:05:02.400 | How do we do? Well, let's think first about
00:05:05.920 | Model 0 and its performance on this benchmark.
00:05:08.380 | This is a summary.
00:05:09.520 | We set things up so that Model 0 performs at
00:05:13.260 | chance on round 1.
00:05:14.920 | No information coming from Model 0 about the labels,
00:05:18.120 | and then you have the summary numbers from before on
00:05:21.600 | how it does on all of those external benchmarks.
00:05:24.800 | Humans by contrast do extremely well on round 1.
00:05:28.480 | We estimate that the F1 for humans is around 88 percent.
00:05:33.640 | That's a high number and it also
00:05:35.680 | arguably understates the level of agreement.
00:05:37.960 | We note that 614 of our 1,200 workers
00:05:41.380 | never disagreed with the majority label.
00:05:44.000 | This looks to us like a very high rate of agreement and
00:05:47.840 | consistency for humans on this resource.
00:05:52.200 | Here just to round out the discussion of round 1 are
00:05:56.040 | some randomly sampled short examples showing you
00:05:58.960 | every combination of model prediction and distribution
00:06:02.400 | across the labels focused on the majority label in this case.
00:06:06.120 | You see a lot of interesting nuanced linguistic things,
00:06:09.360 | and I think a lot of use of non-literal language.
00:06:13.580 | Let's move now to round 2.
00:06:16.260 | This is substantially different.
00:06:18.080 | We begin from Model 1,
00:06:19.900 | and this is a RoBERTa model that was fine-tuned on
00:06:22.520 | those external sentiment benchmarks
00:06:25.400 | as well as all of our round 1 data.
00:06:27.780 | The intuition here coming from the ANLI project is that we
00:06:31.100 | should train models on previous rounds
00:06:33.480 | of our own dynamic dataset collection.
00:06:36.580 | Instead of harvesting examples from the wild in this phase,
00:06:39.820 | we're going to use Dynabench to crowdsource
00:06:41.860 | sentences that fool Model 1.
00:06:44.380 | We'll human validate those and that will
00:06:46.100 | lead us to our round 2 dataset.
00:06:48.720 | Let's think a little bit about Model 1.
00:06:51.260 | Again, this is a RoBERTa-based classifier,
00:06:53.720 | and it is trained on those same external benchmarks,
00:06:56.860 | but now down-sampled somewhat so that we can give
00:06:59.660 | a lot of weight to round 1,
00:07:01.660 | which is now in the mix.
00:07:03.300 | These models are still trained on
00:07:05.100 | a substantial amount of data.
00:07:07.660 | We're trying to offer some evidence that round 1
00:07:11.740 | is the important thing to actually focus on for this model.
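As a rough sketch of that kind of mixing, here is one way to down-sample the external benchmarks while keeping all of round 1; the sampling fraction is purely illustrative, not the ratio used for Model 1.

```python
import random

def mix_training_data(external_examples, round1_examples, external_frac=0.25, seed=0):
    """Down-sample the external sentiment benchmarks and keep all of round 1,
    so round 1 carries relatively more weight during fine-tuning.
    `external_frac` is an illustrative value, not the ratio used for Model 1."""
    rng = random.Random(seed)
    k = int(len(external_examples) * external_frac)
    sampled_external = rng.sample(list(external_examples), k)
    mixed = sampled_external + list(round1_examples)
    rng.shuffle(mixed)
    return mixed
```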
00:07:15.140 | How do we do? This is a summary of performance on
00:07:18.540 | the external datasets as well as round 1.
00:07:21.740 | You can see down here that this model is getting
00:07:24.340 | around 80 percent on our round 1 data with
00:07:27.420 | essentially no loss in
00:07:29.140 | performance on those external benchmarks.
00:07:31.180 | There is a bit of a drop.
00:07:32.540 | I think we are performing some domain shift
00:07:35.280 | by emphasizing round 1 as I described.
00:07:38.300 | But overall, we're maintaining pretty good performance while
00:07:41.480 | doing quite well on the round 1 dataset.
00:07:45.940 | I want to do a deep dive a little bit on how
00:07:49.940 | the examples were crowdsourced because I think this is
00:07:52.180 | an interesting nuance around how to get people
00:07:54.700 | to write productively in a crowdsourcing context.
00:07:57.820 | In the original interface,
00:07:59.600 | we simply did more or less what was done for ANLI,
00:08:02.340 | which is that we asked people to write a sentence from
00:08:05.560 | scratch that would fool the model in a particular way.
00:08:09.020 | We found though that that's
00:08:10.980 | a very difficult creative writing task,
00:08:13.560 | and it leads people to do
00:08:14.820 | similar things over multiple examples,
00:08:17.060 | which we intuited would lead to
00:08:19.060 | artifacts in the resulting dataset.
00:08:21.700 | We switched to emphasizing what we call the prompt condition.
00:08:25.560 | In the prompt condition,
00:08:26.920 | we actually offer crowd workers
00:08:29.340 | a naturally occurring sentence that comes from
00:08:31.740 | the Yelp open dataset and their task is to edit
00:08:35.300 | that sentence in order to achieve
00:08:37.140 | this task of fooling the model.
00:08:39.180 | The result is a dataset that's much higher quality and
00:08:42.420 | has much more naturalistic examples in it.
00:08:45.500 | For validation, we did the same thing as round 1,
00:08:48.740 | and that leads to a dataset that looks like this.
00:08:51.100 | Only 19 percent of the examples here are adversarial.
00:08:54.460 | I think this shows that by now in the process,
00:08:56.980 | we have a very strong sentiment model
00:08:58.980 | that is very difficult to fool.
00:09:01.060 | But 19 percent is still a substantial number numerically,
00:09:04.340 | and so we feel like we're in good shape.
00:09:06.460 | Overall, it's a somewhat smaller benchmark,
00:09:09.040 | but it has similar structure.
00:09:10.900 | We can do majority label training
00:09:13.020 | as well as distributional training,
00:09:15.020 | and we have balanced Dev and test.
00:09:16.820 | They just happen to be a little smaller than round 1.
00:09:20.700 | How does model 1 do versus humans?
00:09:23.980 | Well, again, we set things up so that model 1
00:09:26.460 | would perform at chance on our round 2 data,
00:09:28.940 | and you saw that model 1 does pretty
00:09:30.700 | well on the round 1 data.
00:09:32.540 | For humans though, this round is extremely intuitive.
00:09:36.100 | Our estimate of F1 for humans
00:09:37.820 | is actually higher than for round 1.
00:09:39.460 | We're now at around 90 percent.
00:09:41.740 | Here, 116 of our 244 workers
00:09:45.180 | never disagreed with the majority label.
00:09:47.420 | Again, a substantial level of agreement on what
00:09:51.300 | are clearly very difficult sentiment problems.
00:09:54.860 | Just to round this out, I thought I'd show
00:09:56.780 | another sample of examples from this round.
00:09:59.860 | Again, showing model 1 predictions
00:10:02.220 | in every way that the majority label could have played out.
00:10:05.220 | I think even more than in round 1,
00:10:07.580 | what we start to see are examples that make
00:10:10.020 | extensive use of intricate syntactic structures,
00:10:13.820 | and also intricate use of non-literal language like
00:10:17.780 | metaphor and sarcasm and irony as
00:10:21.140 | techniques for coming up with examples that are
00:10:23.020 | intuitive for us as humans,
00:10:24.660 | but are routinely very challenging for even our best models.
00:10:30.260 | That is DynaSent.
00:10:32.260 | Let me use this opportunity to
00:10:33.780 | just wrap things up with a few conclusions.
00:10:36.620 | These are all meant to be open questions designed to have us
00:10:39.660 | looking ahead to the future of adversarial training and testing.
00:10:44.020 | Core question here, can adversarial training improve systems?
00:10:49.100 | I think overall, we're seeing evidence that the answer is yes,
00:10:52.540 | but there is some nuance there,
00:10:54.060 | and I think it's going to take some calibration
00:10:56.100 | to get this exactly right.
00:10:58.500 | What constitutes a fair non-IID generalization test?
00:11:03.540 | I introduced this notion of fairness when we discussed
00:11:06.220 | the analytic considerations around
00:11:07.900 | all these behavioral evaluations,
00:11:09.880 | and then this became very pressing when we talked about why
00:11:12.900 | some of the COGS and ReCOGS splits are so difficult.
00:11:16.420 | The question arises whether it's even fair to be asking
00:11:20.020 | our machine learning systems to generalize in
00:11:22.860 | particular ways that might nonetheless
00:11:25.140 | seem pretty intuitive for us as humans.
00:11:28.340 | Can hard behavioral testing provide us with
00:11:31.460 | the insights we need when it comes
00:11:33.460 | to certifying systems as trustworthy?
00:11:35.920 | If so, which tests?
00:11:37.540 | If not, what should we do instead?
00:11:39.720 | I think this is a crucial question.
00:11:41.500 | I think in a way we know that the answer is no.
00:11:44.500 | No amount of behavioral testing can offer
00:11:47.180 | us the guarantees that we're seeking.
00:11:49.520 | But it is a powerful component
00:11:52.780 | in getting closer to
00:11:54.020 | deeply understanding what these systems are like,
00:11:56.260 | and certainly we can use behavioral testing to
00:11:58.460 | find cases where they definitely fall down.
00:12:01.120 | But for actual certification of safety and trustworthiness,
00:12:04.740 | I believe we will need to go deeper,
00:12:06.660 | and that is the topic of the next unit of the course.
00:12:10.500 | Fundamentally, are our best systems finding systematic solutions?
00:12:15.560 | If the answer is yes,
00:12:16.900 | we will feel as humans that we can trust them.
00:12:19.780 | If the answer is no,
00:12:21.140 | even if they seem to behave well in some scenarios,
00:12:23.800 | we might always worry that they're going to do
00:12:25.760 | things that are totally baffling to us.
00:12:28.300 | Then finally, the big juicy cognitive and philosophical question,
00:12:32.920 | where humans generalize in ways that
00:12:35.200 | are unsupported by direct experience,
00:12:37.600 | how should AI respond in terms of system design?
00:12:40.820 | What should we do in order to achieve
00:12:43.260 | these very unusual quasi-cognitive,
00:12:46.520 | quasi-behavioral learning targets?
00:12:48.660 | I don't have a way to resolve this question,
00:12:50.900 | but I think it's really pressing when we think about
00:12:53.620 | really challenging our systems to do complex things with language.