
Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 1: Overview | Spring 2023



00:00:00.000 | Welcome, everyone.
00:00:06.120 | This screencast kicks off our unit on
00:00:08.420 | advanced behavioral evaluation of NLU models.
00:00:11.660 | For this unit, we're going to switch gears a little bit.
00:00:13.960 | We've been very focused on architectures and models.
00:00:16.880 | We are now going to turn our attention
00:00:18.900 | to the nature of evaluation,
00:00:21.260 | how we gather evidence,
00:00:22.440 | and how we mark progress in the field.
00:00:24.640 | For this unit, we're going to focus
00:00:26.300 | on behavioral evaluations,
00:00:27.820 | those based in input-output behavior.
00:00:30.100 | In the next unit, we're going to try to go
00:00:32.080 | one layer deeper to uncover information
00:00:34.680 | about how models manage those input-output mappings.
00:00:38.600 | To kick this off, I think it's useful to reflect a little bit
00:00:42.000 | on the varieties of evaluation that we
00:00:44.380 | conduct in the field and in AI more broadly.
00:00:47.640 | For this unit, as I said,
00:00:49.360 | we'll be focused on behavioral methods.
00:00:51.160 | We are just focused on whether or not models produce
00:00:53.680 | the desired output given some inputs,
00:00:56.400 | and we don't directly attend to
00:00:58.400 | how they manage to do that mapping.
00:01:00.960 | Standard evaluations are often called
00:01:03.560 | IID evaluations for independent and identically distributed.
00:01:07.780 | The intuition here is that we have
00:01:10.760 | some test examples that are
00:01:12.160 | disjoint from the ones the system was trained on,
00:01:14.840 | but we have an underlying guarantee that
00:01:17.460 | the test examples are
00:01:18.960 | very much like those that were seen in training.
00:01:21.540 | This standard mode gives us a lot of
00:01:23.760 | guarantees about what we can expect at test time,
00:01:26.480 | but it is also very friendly to our systems.
00:01:29.960 | With exploratory analyses, you might start to
00:01:33.240 | venture outside of that IID assumption.
00:01:36.380 | You might or might not know
00:01:38.500 | what the training data were like,
00:01:41.180 | but the idea is that you're now going to start to probe to see
00:01:43.840 | whether the model has certain capabilities
00:01:46.180 | via examples that you construct,
00:01:48.440 | and they might go outside
00:01:49.960 | of what you'd expect from the training data.
00:01:52.520 | That could also be hypothesis-driven.
00:01:54.920 | You might just ask a question like, "Hey,
00:01:57.300 | does my model know about synonyms
00:01:59.380 | or does it know about lexical entailment?"
00:02:01.640 | You might construct a dataset that probes
00:02:04.100 | specifically for ways of answering that particular question.
00:02:08.240 | You're using a behavioral evaluation to answer
00:02:11.100 | a more conceptual hypothesis-driven question there.
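
As a minimal sketch of what such a hypothesis-driven probe could look like in practice (my own illustration, not from the lecture), here is a tiny hand-built lexical-entailment probe, scored against a hypothetical `predict_fn` standing in for whatever premise/hypothesis interface your model exposes:

```python
# Illustrative, hand-built probe items targeting lexical entailment.
lexical_entailment_probe = [
    {"premise": "A dog is sleeping.",     "hypothesis": "An animal is sleeping.", "label": "entailment"},
    {"premise": "An animal is sleeping.", "hypothesis": "A dog is sleeping.",     "label": "neutral"},
    {"premise": "She bought a rose.",     "hypothesis": "She bought a flower.",   "label": "entailment"},
]

def probe_accuracy(predict_fn, probe):
    """Fraction of probe items labeled correctly. `predict_fn` is a hypothetical
    stand-in: it maps (premise, hypothesis) to one of the NLI labels."""
    correct = sum(predict_fn(ex["premise"], ex["hypothesis"]) == ex["label"] for ex in probe)
    return correct / len(probe)
```
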
00:02:15.040 | In challenge datasets, you might start to
00:02:17.840 | venture outside of the friendly mode.
00:02:20.340 | In this mode with challenge datasets,
00:02:23.020 | you might be posing problems that you know are going to be
00:02:26.420 | difficult given the nature
00:02:28.520 | of the training experiences of your model.
00:02:30.720 | You're trying to push the limits to see
00:02:32.840 | where it's going to fall down essentially.
00:02:35.280 | That could become truly adversarial in the sense that you might have
00:02:39.020 | done a full study of
00:02:40.700 | the training data and the properties of the model,
00:02:43.100 | and then constructed examples where you know the model is going to
00:02:46.580 | fail as a way of revealing
00:02:48.620 | some problematic behavior or important weakness.
00:02:52.460 | We could escalate all the way to what I've
00:02:55.280 | called security-oriented behavioral evaluations.
00:02:58.280 | In this mode, you might be deliberately
00:03:00.680 | constructing examples that you would expect to fall
00:03:03.300 | outside of the normal user interaction with your model,
00:03:06.960 | maybe with unfamiliar characters or
00:03:09.220 | character combinations to see what happens.
00:03:12.060 | In particular, you might be looking to see whether with
00:03:15.140 | those very unusual out-of-distribution inputs,
00:03:18.140 | the model does something that is
00:03:20.100 | toxic or problematic or unsafe in some way.
00:03:24.300 | Those are all behavioral evaluations.
00:03:27.540 | We could contrast those with what I've called structural evaluations,
00:03:31.340 | probing, feature attribution, and interventions.
00:03:34.420 | These are the topic of the next unit.
00:03:36.680 | With structural evaluations, what we try to do is go beyond
00:03:40.100 | input-output mappings and really
00:03:42.260 | understand the mechanisms at work behind those behaviors.
00:03:45.900 | I think the ideal for that would be that we uncover
00:03:48.660 | the causal mechanisms behind model behaviors.
00:03:52.300 | Those go beyond behavioral testing and I think
00:03:55.020 | complement behavioral testing really powerfully.
00:03:58.700 | Let's reflect a little bit on standard evaluations in the field.
00:04:03.660 | I think the upshot here is that they are
00:04:06.420 | extremely friendly to our systems in ways that
00:04:09.540 | should increasingly worry us as we think
00:04:11.620 | about systems being deployed in the wider world.
00:04:14.620 | For standard evaluations, at step 1,
00:04:18.140 | you create a dataset from a single process.
00:04:21.900 | That is the part to emphasize, a single process.
00:04:24.660 | You could scrape a website,
00:04:26.660 | you could reformat a database,
00:04:28.980 | you could crowdsource some labels for
00:04:31.060 | some examples and so forth and so on.
00:04:32.980 | Whatever you do, you run this one process.
00:04:36.540 | Then in the next phase,
00:04:38.280 | you divide the dataset into disjoint train and test sets,
00:04:42.500 | and you set the test set aside.
00:04:44.140 | It's under lock and key,
00:04:45.560 | you won't look at it until the very end.
00:04:48.220 | That's really good because that's going to be
00:04:50.380 | our estimate of the capacity of the system to generalize.
00:04:53.580 | But notice you've already been very friendly to your system
00:04:56.860 | because step 1 offers you a guarantee that
00:04:59.900 | those test examples will in some sense be
00:05:02.460 | very much like those that you saw in training.
00:05:05.900 | Then you develop your system on the train set.
00:05:09.080 | Only after all development is complete,
00:05:11.480 | you evaluate the system based on
00:05:13.400 | some notion of accuracy standardly on the test set.
00:05:17.000 | Then, and this is crucial, you report the results as
00:05:20.600 | an estimate of the system's capacity to generalize.
00:05:24.840 | At that point, you're communicating with
00:05:26.680 | the wider world and saying,
00:05:28.440 | here is a measure of the system's accuracy,
00:05:30.940 | and you know people will infer that this is
00:05:33.640 | the accuracy that they will experience
00:05:35.820 | if they use the model in free usage.
00:05:38.980 | This is the part that worries me.
00:05:40.900 | Step 1 was a single process for creating the data,
00:05:44.260 | and we report that as an estimate
00:05:46.380 | of the system's capacity to generalize.
00:05:48.640 | Even though we know full well that the world is
00:05:51.260 | not a single homogeneous process,
00:05:53.540 | we absolutely know that once the model is deployed,
00:05:56.620 | it will encounter examples that are
00:05:58.700 | very different from those that were created at step 1.
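
To make the five steps concrete, here is a minimal sketch of the standard IID protocol (my own illustration using scikit-learn with synthetic data, not code from the course): one data-generating process, a disjoint split, development on the train set only, and a single accuracy number reported as the generalization estimate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: a dataset created by a single process (here, one synthetic generator).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Step 2: disjoint train and test sets drawn from that same process.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 3: develop the system on the train set only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Steps 4-5: evaluate once on the held-out test set and report the number
# as the estimate of the system's capacity to generalize.
print("IID test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```
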
00:06:02.100 | That is the worrisome part,
00:06:03.700 | and that is where so-called adversarial evaluations come in.
00:06:07.880 | They needn't be full-on adversarial,
00:06:09.900 | but the idea is to expose some of
00:06:12.520 | the fragility of that standard evaluation mode.
00:06:15.900 | At step 1 in adversarial evaluations,
00:06:18.460 | you create a dataset by whatever means you like, as usual.
00:06:22.160 | You develop and assess the system using
00:06:24.480 | that dataset according to whatever protocols you choose.
00:06:28.560 | Now the new part,
00:06:30.200 | you develop a new test dataset of examples that you
00:06:33.260 | suspect or know will be
00:06:35.560 | challenging given your system and the original dataset.
00:06:39.640 | Only after all system development is complete,
00:06:42.280 | you evaluate the system based on
00:06:43.760 | accuracy on that new test set.
00:06:46.960 | Then you report the results as providing some estimate
00:06:50.560 | of the system's capacity to generalize as before.
00:06:54.000 | This is the new piece, this contrast.
00:06:56.440 | We have our dataset that we use to create the system,
00:06:58.920 | especially for training.
00:07:00.440 | But then in step 3,
00:07:01.800 | we have a new test dataset,
00:07:03.680 | and that plays a crucial role of now offering
00:07:06.440 | us an estimate of the system's capacity to generalize.
00:07:10.040 | To the extent that we have created some hard and
00:07:12.520 | diverse new test sets in this way,
00:07:15.160 | we can probably gain increasing confidence that we are
00:07:18.360 | simulating what life will be like for the model if it is deployed.
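
Continuing the sketch above (again my own illustration, not the course's code, and reusing X_test, y_test, model, and accuracy_score from that sketch), the adversarial protocol keeps the original development setup but adds a second, deliberately shifted test set and reports performance on that instead:

```python
import numpy as np

# Step 3 of the adversarial protocol: construct a new test set you suspect or
# know will be challenging. Here a crude stand-in: the same test items
# perturbed with noise to simulate a distribution shift.
rng = np.random.default_rng(0)
X_challenge = X_test + rng.normal(scale=2.0, size=X_test.shape)

# Steps 4-5: evaluate the already-developed model on the challenge set and
# report this number as the generalization estimate. It will typically sit
# well below the IID number, which is exactly the fragility being exposed.
print("Challenge test accuracy:", accuracy_score(y_test, model.predict(X_challenge)))
```
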
00:07:22.960 | That's a call to action to do this really effectively,
00:07:26.520 | to really feel like you can get behind step 5 here.
00:07:29.720 | You should construct these adversarial or challenge datasets in a way that
00:07:34.000 | covers as much of the spectrum of user behaviors,
00:07:37.300 | user goals, and user inputs
00:07:39.120 | as you expect to see.
00:07:40.720 | That implies having diverse teams of people
00:07:43.920 | battle-testing these models,
00:07:45.560 | creating hard examples,
00:07:47.040 | and studying the resulting behavior.
00:07:48.880 | In that way, with a concerted effort there,
00:07:51.660 | you can inch closer to having
00:07:53.640 | a real guarantee for how the model will behave when it is deployed.
00:07:58.280 | It's a hallmark of behavioral testing that you will never have
00:08:01.420 | a full guarantee, but you can approach one,
00:08:03.760 | and then as the next unit will show,
00:08:05.880 | you might supplement that with some deeper understanding of how the model works.
00:08:10.560 | But in any case, I feel like this is the mode that we should be in when we think
00:08:14.960 | about AI systems in this modern era of ever-widening impact.
00:08:19.880 | The history of this is interesting.
00:08:22.720 | Adversarial testing feels like a new idea,
00:08:25.040 | but in fact, it stretches all the way back to at least the Turing test.
00:08:29.600 | You might recall that the fundamental insight behind
00:08:32.240 | the Turing test is that we'll get a reliable evaluation when we
00:08:35.880 | pit people against computers where the goal of
00:08:39.020 | the computer is to try to fool the person into thinking it is a person itself,
00:08:43.360 | and the human is trying their level best to figure out whether
00:08:46.480 | that entity is a human or some AI.
00:08:50.460 | That is an inherently adversarial dynamic
00:08:53.160 | that is centered around linguistic interaction.
00:08:55.640 | I think we have to call that the first or certainly the most influential adversarial test.
00:09:01.800 | Sometime later, Terry Winograd proposed developing datasets that involved
00:09:07.360 | very intricate problems that he hoped would get past
00:09:10.760 | simple statistical tricks and really probe to see
00:09:13.440 | whether models truly understood what the world was like.
00:09:17.680 | Hector Levesque, in his lovely paper "On Our Best Behavior," revived this idea from
00:09:23.460 | Winograd of adversarially testing models to see whether they
00:09:27.420 | truly understand what language and the world are like.
00:09:31.400 | The Winograd sentences are really interesting to reflect on now.
00:09:34.860 | They are simple problems that can be quite revealing about
00:09:38.900 | physical reality and social reality and all the rest.
00:09:42.680 | Here's a typical Winograd case.
00:09:44.920 | The trophy doesn't fit into the brown suitcase because it's too small.
00:09:49.100 | What is too small?
00:09:50.400 | The human intuition is to say the suitcase and that's probably because you can do
00:09:54.380 | some mental simulation of these two objects
00:09:57.760 | and then arrive at an answer to the question.
00:10:00.320 | The minimal pair there is the trophy doesn't fit into
00:10:03.060 | the brown suitcase because it's too large.
00:10:05.200 | What is too large?
00:10:06.460 | Here the human answer is the trophy again because of
00:10:09.220 | that mental simulation that you can do.
00:10:12.340 | The idea is that this is a behavioral test that will help us
00:10:15.700 | understand whether models also have
00:10:18.260 | that deep understanding of our physical reality.
00:10:21.820 | Here's a case that keys more into social norms and roles that people play.
00:10:27.140 | The council refused the demonstrators a permit
00:10:29.740 | because they feared violence.
00:10:31.260 | Who feared violence?
00:10:32.480 | The human answer is the council based on
00:10:34.980 | stereotypical roles for demonstrators and politicians.
00:10:38.980 | Versus the council refused the demonstrators
00:10:41.900 | a permit because they advocated violence.
00:10:44.020 | Who advocated violence?
00:10:45.100 | Again, we default to saying the demonstrators because
00:10:48.580 | of our default assumptions about the roles that people will play.
00:10:51.740 | The idea is for a model to get these responses correct,
00:10:56.800 | it too needs to deeply understand what's happening
00:11:00.140 | with these entities and with the social norms involved.
00:11:03.420 | That's the guiding hypothesis.
00:11:05.260 | Again, behavioral testing can never give us
00:11:07.220 | full guarantees that we've probed fully
00:11:09.580 | for the underlying capability that we cared about.
00:11:12.300 | But examples like this are inspiring in terms of getting us closer to that ideal.
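
One simple way to turn such a minimal pair into a behavioral test (a sketch of my own, not part of the lecture, assuming the Hugging Face transformers library and an off-the-shelf GPT-2) is to substitute each candidate referent for the pronoun and prefer the reading the language model finds more probable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_loss(sentence):
    """Average token-level loss the LM assigns to the sentence (lower = more probable)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

def resolve(template, candidates):
    """Pick the referent whose substitution yields the more probable sentence."""
    return min(candidates, key=lambda c: sentence_loss(template.format(c)))

template = "The trophy doesn't fit into the brown suitcase because the {} is too small."
print(resolve(template, ["trophy", "suitcase"]))  # humans answer: suitcase
```

Whether a model passes such items is, as the lecture stresses, suggestive evidence rather than a guarantee about the underlying capability.
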
00:11:17.940 | Hector Levesque took this further in a way that I
00:11:20.700 | think has proved really inspiring for the field.
00:11:23.140 | He says, for example,
00:11:24.340 | could a crocodile run a steeplechase?
00:11:26.660 | The intent here is clear.
00:11:28.200 | The question can be answered by thinking it through.
00:11:30.260 | A crocodile has short legs.
00:11:32.100 | The hedges in a steeplechase would be too tall for the crocodile to jump over.
00:11:36.060 | So no, a crocodile cannot run a steeplechase.
00:11:39.020 | Again, evoking this idea of doing a mental simulation about
00:11:42.980 | a very unfamiliar situation and arriving at a systematic answer to the question.
00:11:48.620 | What Levesque was really after was what he called foiling cheap tricks.
00:11:53.180 | Can we find questions where cheap tricks like
00:11:55.580 | this will not be sufficient to produce the desired behavior?
00:11:59.640 | This unfortunately has no easy answer.
00:12:02.480 | The best we can do perhaps is to come up with a suite of
00:12:05.440 | multiple choice questions carefully and then study
00:12:08.560 | the sorts of computer programs that might be able to answer them.
00:12:12.280 | Again, what I hear in this early paper back in 2013 is a call for
00:12:17.120 | constructing adversarial datasets that will reveal
00:12:20.280 | much more about the solutions that our models have found.