Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 1: Overview | Spring 2023
This unit is about advanced behavioral evaluation of NLU models. For this unit, we're going to switch gears a little bit. We've been very focused on architectures and models; now the focus is on evaluation, and in particular on behavioral evaluation: posing input-output problems to models and looking at how models manage those input-output mappings.
To kick this off, I think it's useful to reflect a little bit on the different kinds of behavioral testing. In all of them, we are just focused on whether or not models produce the desired outputs for given inputs, not on how they manage to do so. The most standard mode is what I've called IID evaluations, for independent and identically distributed: the test examples are disjoint from the ones the system was trained on, but they come from the same process and so are very much like those that were seen in training. That gives us some guarantees about what we can expect at test time, but only for inputs that look like the training distribution.
With exploratory analyses, you might start to pose examples that are unlike anything seen in training, but the idea is that you're now going to start to probe to see how the model behaves in those new situations. With hypothesis-driven challenge datasets, you begin from a specific question about the model and construct examples specifically for ways of answering that particular question. You're using a behavioral evaluation to answer a more conceptual, hypothesis-driven question there.
With challenge datasets, you might be posing problems that you know are going to be hard for the model. That could become truly adversarial in the sense that you might have studied the train data and the properties of the model, and then constructed examples where you know the model is going to fail, with the goal of exposing some problematic behavior or important weakness.
At the far end of this spectrum are what might be called security-oriented behavioral evaluations. There, you are constructing examples that you would expect to fall outside of the normal user interaction with your model. In particular, you might be looking to see whether, with those very unusual out-of-distribution inputs, you can induce problematic or unsafe behavior from the model.
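To make this concrete, here is a minimal sketch of what a hypothesis-driven behavioral evaluation can look like in code. Everything in it is hypothetical: `predict` stands in for whatever system is under test, and the two negation examples are purely illustrative, not drawn from any actual benchmark.

```python
# Minimal sketch of a hypothesis-driven behavioral evaluation.
# `predict` and the examples below are hypothetical placeholders.

def behavioral_accuracy(predict, examples):
    """Score a system purely on its input-output behavior."""
    correct = sum(predict(text) == expected for text, expected in examples)
    return correct / len(examples)

# A tiny challenge set targeting one hypothesis: does the system handle negation?
negation_probe = [
    ("The movie was not good.", "negative"),
    ("The movie was not bad.", "positive"),
]

# Placeholder system: a "cheap trick" that keys on the word "good".
def predict(text):
    return "positive" if "good" in text.lower() else "negative"

print(behavioral_accuracy(predict, negation_probe))  # 0.0 -- the trick fails under negation
```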
We could contrast those behavioral evaluations with what I've called structural evaluations: probing, feature attribution, and interventions. With structural evaluations, what we try to do is go beyond input-output behavior and understand the mechanisms at work behind those behaviors. I think the ideal for that would be that we uncover the causal mechanisms behind model behaviors. Those methods go beyond behavioral testing, and I think they complement behavioral testing really powerfully.
Let's reflect a little bit on standard evaluations in the field. They are extremely friendly to our systems, in ways that should give us pause when we think about systems being deployed in the wider world.
In the standard evaluation mode, you first create a dataset from a single process. That is the part to emphasize: a single process. Then you divide the dataset into disjoint train and test sets, and you set the test set aside. That's really good, because that's going to be our estimate of the capacity of the system to generalize. But notice that you've already been very friendly to your system: the test examples, though disjoint from training, are very much like those that you saw in training. Then you develop your system on the train set, and once development is complete, you assess it according to some notion of accuracy, standardly, on the test set. Then, and this is crucial, you report the results as an estimate of the system's capacity to generalize.
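As a rough sketch, the whole standard protocol fits in a few lines. The dataset and the "model" here are hypothetical placeholders; the point is only the shape of the procedure, with one creation process feeding both splits.

```python
import random

# Step 1: one dataset, created by a single process (placeholder data).
dataset = [(f"example {i}", i % 2) for i in range(1000)]

# Step 2: disjoint train/test split -- but both halves come from that same process.
random.seed(0)
random.shuffle(dataset)
train, test = dataset[:800], dataset[800:]

# Step 3: develop the system on the train set only (placeholder "model":
# always predict the majority label seen in training).
train_labels = [label for _, label in train]
majority = max(set(train_labels), key=train_labels.count)

def model(text):
    return majority

# Step 4: assess with accuracy on the held-out test set.
accuracy = sum(model(text) == label for text, label in test) / len(test)

# Step 5: report this as an estimate of the capacity to generalize --
# an estimate that only covers inputs drawn from the step-1 process.
print(f"IID test accuracy: {accuracy:.2f}")
```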
When you report that number, you know people will infer that it tells us how the system will behave on new cases out in the world. But remember that step 1 was a single process for creating the data. Even though we know full well that the world is far more diverse than any single dataset-creation process, we absolutely know that once the model is deployed, it will encounter examples that are very different from those that were created at step 1.
That disconnect is the worry, and that is where so-called adversarial evaluations come in, as a response to the fragility of that standard evaluation mode.
In the adversarial mode, you create a dataset by whatever means you like, as usual, and you develop and assess your system on that dataset according to whatever protocols you choose. Then you develop a new test dataset of examples that you suspect or know will be challenging given your system and the original dataset. Only after all system development is complete do you evaluate the system on that new test set. Then you report the results as providing some estimate of the system's capacity to generalize, as before.
We have our dataset that we use to create the system, and we have a separate challenge dataset, and that plays the crucial role of now offering us an estimate of the system's capacity to generalize. To the extent that we have created some hard and genuinely diverse examples, we can probably gain increasing confidence that we are simulating what life will be like for the model if it is deployed.
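A sketch of the adversarial variant, again with hypothetical placeholders, might look like this; what matters is that the challenge set comes from a different process and is only touched once development is frozen.

```python
# Steps 1-2 happen elsewhere: `model` below stands in for a system that has
# already been developed and assessed on the original dataset.

def model(text):
    return 1  # hypothetical frozen system

# Step 3: a separate challenge set of examples you suspect or know will be
# hard given the system and the original dataset (placeholders here).
challenge_set = [
    ("a case unlike anything in the original data", 1),
    ("another deliberately difficult case", 0),
]

# Step 4: only after all development is complete, evaluate on the challenge set.
challenge_accuracy = sum(model(text) == label for text, label in challenge_set) / len(challenge_set)

# Step 5: report this as the estimate of the system's capacity to generalize.
print(f"Challenge-set accuracy: {challenge_accuracy:.2f}")
```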
That's also a call to action: to do this really effectively, to really feel like you can get behind step 5 here, you should construct these adversarial or challenge datasets in a way that covers as much of the spectrum of user behaviors as possible. Even then, you will not have a real guarantee for how the model will behave when it is deployed.
It's a hallmark of behavioral testing that you will never have that kind of guarantee; you might supplement behavioral testing with some deeper understanding of how the model works. But in any case, I feel like this is the mode that we should be in when we think about AI systems in this modern era of ever-widening impact.
Adversarial testing might sound like a recent development, but in fact it stretches all the way back to at least the Turing test. You might recall that the fundamental insight behind the Turing test is that we'll get a reliable evaluation when we pit people against computers, where the goal of the computer is to try to fool the person into thinking it is a person itself, and the human is trying their level best to figure out whether they are talking to a person or a machine, all in a game that is centered around linguistic interaction. I think we have to call that the first, or certainly the most influential, adversarial test.
Sometime later, Terry Winograd proposed developing datasets that involved very intricate problems that he hoped would get past simple statistical tricks and really probe to see whether models truly understood what the world was like. Hector Levesque, in his lovely paper "On Our Best Behaviour," revived this idea from Winograd of adversarially testing models to see whether they truly understand what language and the world are like.
The Winograd sentences are really interesting to reflect on now. They are simple problems that can be quite revealing about whether a system grasps physical reality and social reality and all the rest. Consider: "The trophy doesn't fit into this brown suitcase because it's too small." What is too small? The human intuition is to say the suitcase, and that's probably because you can do a kind of mental simulation of trophies and suitcases and then arrive at an answer to the question. The minimal pair there is "The trophy doesn't fit into this brown suitcase because it's too big." Here the human answer is the trophy, again because of that same kind of simulation of the physical situation. The idea is that this is a behavioral test that will help us see whether models have that deep understanding of our physical reality.
Here's a case that keys more into social norms and the roles that people play: "The council refused the demonstrators a permit because they feared violence." Who feared violence? The intuitive answer is the council, given stereotypical roles for demonstrators and politicians. In the minimal-pair variant, "because they advocated violence," we default to saying the demonstrators, again because of our default assumptions about the roles that people will play. The idea is that for a model to get these responses correct, it too needs to deeply understand what's happening with these entities and with the social norms involved.
We know in hindsight that solving sentences like these did not turn out to be a perfect proxy for the underlying capability that we cared about. But examples like this are inspiring in terms of getting us closer to that ideal.
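One simple way to turn schemas like these into an automatic behavioral test is to score both readings with an off-the-shelf language model and pick the more probable one. The following is just a sketch of that idea, not the protocol of any published Winograd benchmark; it assumes the Hugging Face `transformers` library and uses GPT-2 purely as an example model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_nll(sentence):
    """Average per-token negative log-likelihood of a sentence under the LM."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()

def winograd_choice(template, candidates):
    """Fill the pronoun slot with each candidate and pick the likelier reading."""
    return min(candidates, key=lambda c: mean_nll(template.format(c)))

schema = "The trophy doesn't fit into this brown suitcase because {} is too small."
print(winograd_choice(schema, ["the trophy", "the suitcase"]))
```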
Hector Levesque took this further in a way that I think has proved really inspiring for the field. He offers questions like: could a crocodile run a steeplechase? The question can be answered by thinking it through: the hedges in a steeplechase would be too tall for the crocodile to jump over, so no, a crocodile cannot run a steeplechase. Again, this evokes the idea of doing a mental simulation of a very unfamiliar situation and arriving at a systematic answer to the question.
What Levesque was really after was what he called foiling cheap tricks: can we find questions where cheap tricks like these will not be sufficient to produce the desired behavior? He grants that there is no easy recipe for this; the best we can do, perhaps, is to come up with a suite of multiple-choice questions carefully and then study the sorts of computer programs that might be able to answer them. Again, what I hear in this early paper, back in 2013, is a call for constructing adversarial datasets that will reveal much more about the solutions that our models have found.
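Levesque's proposal suggests a very simple evaluation harness: a carefully constructed suite of multiple-choice questions plus a pluggable scorer for whatever program we want to study. Here is a hypothetical sketch; the scorer below is deliberately a cheap trick, exactly the kind of thing such a suite is meant to expose.

```python
# Hypothetical multiple-choice harness; the question, options, and scorer are placeholders.

def score_option(question, option):
    # Stand-in for a model-based plausibility score (e.g., an LM log-probability).
    # This one is a cheap trick: it simply prefers longer answers.
    return float(len(option))

def answer(question, options):
    return max(options, key=lambda option: score_option(question, option))

suite = [
    {"question": "Could a crocodile run a steeplechase?",
     "options": ["yes", "no"],
     "answer": "no"},
]

accuracy = sum(answer(q["question"], q["options"]) == q["answer"] for q in suite) / len(suite)
print(f"Suite accuracy: {accuracy:.2f}")  # 0.00 -- the cheap trick is foiled
```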