Stanford XCS224U: NLU | Behavioral Eval of NLU Models, Pt 7: DynaSent and Conclusion | Spring 2023
Welcome back everyone. This screencast wraps up our series on Advanced Behavioral Evaluation for NLU.
In the previous screencast, I introduced the idea of having adversarial train sets in the mix, and I talked briefly about the Dynabench platform.
For this screencast, we're going to build on those ideas. This is going to be a deep dive on the DynaSent dataset.
You've actually already worked with DynaSent in the context of assignment 1 and the associated bake-off.
All of our data, code, and models are available on GitHub.
We're going to walk through this diagram in some detail. At a high level, though, I think you can see that there are two rounds and there are also two models in the mix.
At each round, we're going to do extensive human validation.
In round 1, we use Model 0 to help us find interesting cases naturally occurring on the web. Model 0 is a RoBERTa model that was fine-tuned on a whole lot of sentiment examples.
You can see from this slide that we are training on a really substantial number of sentiment examples.
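To make that setup concrete, here is a minimal sketch of what fine-tuning a RoBERTa sentiment classifier of this general kind could look like with Hugging Face Transformers. This is not the project's actual training code: the toy examples, hyperparameters, and the `SentimentDataset` wrapper are all illustrative.

```python
# Illustrative only: a tiny RoBERTa fine-tuning run for ternary sentiment,
# standing in for the much larger training runs described above.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["positive", "negative", "neutral"]

class SentimentDataset(Dataset):
    """Wraps parallel lists of sentences and string labels for the Trainer."""
    def __init__(self, texts, labels, tokenizer):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.label_ids = [LABELS.index(y) for y in labels]

    def __len__(self):
        return len(self.label_ids)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.label_ids[idx])
        return item

# Toy data; in practice this would be the pooled external sentiment datasets.
train_texts = ["Great food and friendly staff.",
               "The service was painfully slow.",
               "It is a restaurant on Main Street."]
train_labels = ["positive", "negative", "neutral"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="model0_sketch", num_train_epochs=1),
    train_dataset=SentimentDataset(train_texts, train_labels, tokenizer))
trainer.train()
```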
We're going to benchmark these models against three external datasets: SST-3, Yelp, and Amazon.
Model 0 does well on those benchmarks, which suggests it will be a solid device for finding interesting cases. We are primarily thinking of using Model 0 as a device for harvesting examples from the wild.
The space we explore is the Yelp open dataset. We focus on sentences from one-star reviews where Model 0 predicts positive sentiment and, conversely, sentences from five-star reviews where Model 0 predicts negative sentiment. This is a heuristic that we think on average will lead us to examples that Model 0 is getting incorrect.
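Here is a rough sketch of that harvesting heuristic. The data format and the `predict_sentiment` function are stand-ins, not the actual pipeline from the project repository.

```python
def harvest_candidates(reviews, predict_sentiment):
    """Keep sentences whose review-level star rating conflicts with Model 0's
    sentence-level prediction: 1-star reviews predicted positive, and 5-star
    reviews predicted negative.

    `reviews` is assumed to be an iterable of dicts with 'stars' (int) and
    'sentences' (list of str) keys, roughly Yelp-open-dataset style.
    """
    candidates = []
    for review in reviews:
        for sentence in review["sentences"]:
            pred = predict_sentiment(sentence)  # "positive" | "negative" | "neutral"
            if (review["stars"] == 1 and pred == "positive") or \
               (review["stars"] == 5 and pred == "negative"):
                candidates.append(sentence)
    return candidates
```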
This slide is showing you the interface that we used; the code is available in the project repository. You can see at a high level that reviewers were making a choice about whether a sentence had a positive, negative, no-sentiment, or mixed-sentiment label.
Every example was labeled by five crowdworkers, and I think that's important to making a high-quality benchmark.
Those five responses give us options for how to train. The standard one would be what we call majority-label training. This is the case where you infer that the label for an example is whichever one was chosen by at least three of the five people who labeled it. If no label reaches that threshold, you put that example in a separate category off to the side.
The alternative is a distributional approach: you repeat each example five times, once with each of the labels that it got from the crowdworkers. The result is that you don't have to worry about examples that lack a majority label. You also intuitively get a much more nuanced perspective on the sentiment judgments that people offered. Some examples get near-unanimous labels, and some actually have pretty mixed distributions across the label set. We find in practice that training on those distributions leads to more robust models.
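As a concrete illustration of the two labeling policies, here is a small sketch of my own (not the project's code):

```python
from collections import Counter

def majority_label(votes, threshold=3):
    """Majority-label policy: return the label chosen by at least `threshold`
    of the five validators, or None if no label reaches that threshold."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= threshold else None

def distributional_examples(sentence, votes):
    """Distributional policy: repeat the example once per validator response,
    so the full label distribution shows up in the training data."""
    return [(sentence, vote) for vote in votes]

votes = ["positive", "positive", "neutral", "positive", "mixed"]
print(majority_label(votes))                          # 'positive'
print(distributional_examples("Great spot.", votes))  # five (sentence, label) pairs
```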
For the final dataset, we restricted to positive, negative, and neutral to have a clean ternary sentiment problem, and we balanced across those three labels for both dev and test.
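A simple way to realize that balancing step might look like the following; the `gold_label` field name and the per-label count are placeholders, not the actual DynaSent split sizes.

```python
import random

def balanced_split(examples, n_per_label,
                   labels=("positive", "negative", "neutral"), seed=0):
    """Sample an equal number of validated examples per gold label.
    Each example is assumed to be a dict with a 'gold_label' field."""
    rng = random.Random(seed)
    split = []
    for label in labels:
        pool = [ex for ex in examples if ex["gold_label"] == label]
        split.extend(rng.sample(pool, n_per_label))
    rng.shuffle(split)
    return split
```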
Let's look at Model 0 and its performance on this benchmark. By design, there is essentially no information coming from Model 0 about the round 1 labels, and then you have the summary numbers from before on how it does on all of those external benchmarks.
Humans, by contrast, do extremely well on round 1. We estimate that the F1 for humans is around 88 percent. This looks to us like a very high rate of agreement on what are genuinely difficult sentiment problems.
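One plausible way to compute an estimate like that (not necessarily the exact protocol used here) is to score each validator's response against the majority of the other responses and macro-average:

```python
from collections import Counter
from sklearn.metrics import f1_score

def estimated_human_macro_f1(all_votes):
    """`all_votes` is a list of per-example vote lists (five labels each).
    Each vote is scored against the majority of the remaining votes."""
    y_true, y_pred = [], []
    for votes in all_votes:
        for i, vote in enumerate(votes):
            rest = votes[:i] + votes[i + 1:]
            label, count = Counter(rest).most_common(1)[0]
            if count >= 3:  # only score when the held-out annotators clearly agree
                y_true.append(label)
                y_pred.append(vote)
    return f1_score(y_true, y_pred, average="macro")
```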
Here, just to round out the discussion of round 1, are some randomly sampled short examples showing you every combination of model prediction and distribution across the labels, focused on the majority label in this case.
You see a lot of interesting, nuanced linguistic things, and I think a lot of use of non-literal language.
That brings us to round 2 and Model 1. This is a RoBERTa model that was fine-tuned on sentiment data that now includes our round 1 dataset. The intuition here, coming from the ANLI project, is that the model in the loop for each new round should have learned from the adversarial data gathered in earlier rounds. Instead of harvesting examples from the wild in this phase, we have crowdworkers write sentences with Model 1 in the loop. Model 1 is trained on those same external benchmarks, but now down-sampled somewhat so that we can give the round 1 examples more prominence.
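A toy version of that data mixing might look like the following; the down-sampling fraction here is made up, not the value used in the project.

```python
import random

def model1_training_mix(external_examples, round1_examples,
                        keep_frac=0.25, seed=0):
    """Down-sample the external sentiment data and concatenate it with the
    round 1 examples so the new adversarial data carries more weight."""
    rng = random.Random(seed)
    k = int(len(external_examples) * keep_frac)
    return rng.sample(list(external_examples), k) + list(round1_examples)
```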
We're trying to offer some evidence that round 1 is the important thing to actually focus on for this model.
How do we do? This is a summary of Model 1's performance. You can see down here that this model now does much better on the round 1 data. But overall, we're maintaining pretty good performance on the external benchmarks while also handling round 1.
Let me say a bit more about how the examples were crowdsourced, because I think this is an interesting nuance around how to get people to write productively in a crowdsourcing context.
At first, we simply did more or less what was done for ANLI, which is that we asked people to write a sentence from scratch that would fool the model in a particular way.
We switched to emphasizing what we call the prompt condition. In this condition, the worker is shown a naturally occurring sentence that comes from the Yelp open dataset, and their task is to edit that sentence so that it expresses a particular sentiment and fools the model. The result is a dataset that is much higher quality and, we think, more naturalistic.
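Conceptually, each prompt-condition submission can be thought of as a record like the one below. The field names and the `predict_sentiment` stand-in are mine, not the Dynabench schema.

```python
def prompt_condition_record(prompt, edited, intended_label, predict_sentiment):
    """The worker edits a naturally occurring Yelp sentence (`prompt`) so that
    it expresses `intended_label`; the example counts as adversarial when the
    model-in-the-loop predicts something else."""
    prediction = predict_sentiment(edited)
    return {
        "prompt": prompt,
        "sentence": edited,
        "intended_label": intended_label,
        "model_prediction": prediction,
        "fooled_model": prediction != intended_label,
    }
```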
For validation, we did the same thing as in round 1, and that leads to a dataset that looks like this. Only 19 percent of the examples in it are adversarial.
I think this shows that, by this point in the process, it has become genuinely hard to fool the model. But 19 percent is still a substantial number of examples numerically, so we can still construct good dev and test sets; they just happen to be a little smaller than in round 1.
How does Model 1 do on this new dataset? Well, again, we set things up so that Model 1 would perform at chance on our round 2 data. For humans, though, this round is extremely intuitive: again, we see a substantial level of agreement on what are clearly very difficult sentiment problems.
Here are some randomly sampled round 2 examples, again covering every way that the majority label could have played out. You see extensive use of intricate syntactic structures, and also intricate use of non-literal language.
The take-away is that we have techniques for coming up with examples that are intuitive for humans but are routinely very challenging for even our best models.
Let me close with some questions. These are all meant to be open questions designed to have us looking ahead to the future of adversarial training and testing.
A core question here: can adversarial training improve systems? I think overall we're seeing evidence that the answer is yes, and I think it's going to take some calibration to get those benefits reliably.
What constitutes a fair non-IID generalization test? I introduced this notion of fairness earlier in the series, and it became very pressing when we talked about why some of the COGS and ReCOGS splits are so difficult.
The question arises whether it's even fair to be asking our machine learning systems to generalize in these ways.
Another question is whether behavioral testing will be enough on its own, and I think in a way we know that the answer is no. Behavioral testing by itself won't give us a deep understanding of what these systems are like. Certainly we can use behavioral testing to learn a great deal about their behavior, but for actual certification of safety and trustworthiness, we will need to go deeper, and that is the topic of the next unit of the course.
Fundamentally, are our best systems finding systematic solutions? If they are, then I think we will feel as humans that we can trust them. If they aren't, then even if they seem to behave well in some scenarios, we might always worry that they're going to do something unexpected when conditions change.
Then finally, the big, juicy cognitive and philosophical question: how should AI respond in terms of system design? That question has been with us for a long time, but I think it's really pressing when we think about really challenging our systems to do complex things with language.