
Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 1: Overview | Spring 2023



00:00:00.000 | Welcome, everyone.
00:00:06.120 | This screencast kicks off our unit on
00:00:08.420 | advanced behavioral evaluation of NLU models.
00:00:11.660 | For this unit, we're going to switch gears a little bit.
00:00:13.960 | We've been very focused on architectures and models.
00:00:16.880 | We are now going to turn our attention
00:00:18.900 | to the nature of evaluation,
00:00:21.260 | how we gather evidence,
00:00:22.440 | and how we mark progress in the field.
00:00:24.640 | For this unit, we're going to focus
00:00:26.300 | on behavioral evaluations,
00:00:27.820 | those based in input-output behavior.
00:00:30.100 | In the next unit, we're going to try to go
00:00:32.080 | one layer deeper to uncover information
00:00:34.680 | about how models manage those input-output mappings.
00:00:38.600 | To kick this off, I think it's useful to reflect a little bit
00:00:42.000 | on the varieties of evaluation that we
00:00:44.380 | conduct in the field and in AI more broadly.
00:00:47.640 | For this unit, as I said,
00:00:49.360 | we'll be focused on behavioral methods.
00:00:51.160 | We are just focused on whether or not models produce
00:00:53.680 | the desired output given some inputs,
00:00:56.400 | and we don't directly attend to
00:00:58.400 | how they manage to do that mapping.
00:01:00.960 | Standard evaluations are often called
00:01:03.560 | IID evaluations for independent and identically distributed.
00:01:07.780 | The intuition here is that we have
00:01:10.760 | some test examples that are
00:01:12.160 | disjoint from the ones the system was trained on,
00:01:14.840 | but we have an underlying guarantee that
00:01:17.460 | the test examples are
00:01:18.960 | very much like those that were seen in training.
00:01:21.540 | This standard mode gives us a lot of
00:01:23.760 | guarantees about what we can expect at test time,
00:01:26.480 | but it is also very friendly to our systems.
00:01:29.960 | With exploratory analyses, you might start to
00:01:33.240 | venture outside of that IID assumption.
00:01:36.380 | You might or might not know
00:01:38.500 | what the training data were like,
00:01:41.180 | but the idea is that you're now going to start to probe to see
00:01:43.840 | whether the model has certain capabilities
00:01:46.180 | via examples that you construct,
00:01:48.440 | and they might go outside
00:01:49.960 | of what you'd expect from the training data.
00:01:52.520 | That could also be hypothesis-driven.
00:01:54.920 | You might just ask a question like, "Hey,
00:01:57.300 | does my model know about synonyms
00:01:59.380 | or does it know about lexical entailment?"
00:02:01.640 | You might construct a dataset that probes
00:02:04.100 | specifically for ways of answering that particular question.
00:02:08.240 | You're using a behavioral evaluation to answer
00:02:11.100 | a more conceptual hypothesis-driven question there.
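
As a minimal sketch of what such a hypothesis-driven probe could look like in practice (my own illustration, not from the lecture), here is a tiny hand-built lexical-entailment probe, scored against a hypothetical `predict_fn` standing in for whatever premise/hypothesis interface your model exposes:

```python
# Illustrative, hand-built probe items targeting lexical entailment.
lexical_entailment_probe = [
    {"premise": "A dog is sleeping.",     "hypothesis": "An animal is sleeping.", "label": "entailment"},
    {"premise": "An animal is sleeping.", "hypothesis": "A dog is sleeping.",     "label": "neutral"},
    {"premise": "She bought a rose.",     "hypothesis": "She bought a flower.",   "label": "entailment"},
]

def probe_accuracy(predict_fn, probe):
    """Fraction of probe items labeled correctly. `predict_fn` is a hypothetical
    stand-in: it maps (premise, hypothesis) to one of the NLI labels."""
    correct = sum(predict_fn(ex["premise"], ex["hypothesis"]) == ex["label"] for ex in probe)
    return correct / len(probe)
```
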
00:02:15.040 | In challenge datasets, you might start to
00:02:17.840 | venture outside of the friendly mode.
00:02:20.340 | In this mode with challenge datasets,
00:02:23.020 | you might be posing problems that you know are going to be
00:02:26.420 | difficult given the nature
00:02:28.520 | of the training experiences of your model.
00:02:30.720 | You're trying to push the limits to see
00:02:32.840 | where it's going to fall down essentially.
00:02:35.280 | That could become truly adversarial in the sense that you might have
00:02:39.020 | done a full study of
00:02:40.700 | the training data and the properties of the model,
00:02:43.100 | and then constructed examples where you know the model is going to
00:02:46.580 | fail as a way of revealing
00:02:48.620 | some problematic behavior or important weakness.
00:02:52.460 | We could escalate all the way to what I've
00:02:55.280 | called security-oriented behavioral evaluations.
00:02:58.280 | In this mode, you might be deliberately
00:03:00.680 | constructing examples that you would expect to fall
00:03:03.300 | outside of the normal user interaction with your model,
00:03:06.960 | maybe with unfamiliar characters or
00:03:09.220 | character combinations to see what happens.
00:03:12.060 | In particular, you might be looking to see whether with
00:03:15.140 | those very unusual out-of-distribution inputs,
00:03:18.140 | the model does something that is
00:03:20.100 | toxic or problematic or unsafe in some way.
00:03:24.300 | Those are all behavioral evaluations.
00:03:27.540 | We could contrast those with what I've called structural evaluations,
00:03:31.340 | probing, feature attribution, and interventions.
00:03:34.420 | These are the topic of the next unit.
00:03:36.680 | With structural evaluations, what we try to do is go beyond
00:03:40.100 | input-output mappings and really
00:03:42.260 | understand the mechanisms at work behind those behaviors.
00:03:45.900 | I think the ideal for that would be that we uncover
00:03:48.660 | the causal mechanisms behind model behaviors.
00:03:52.300 | Those go beyond behavioral testing and I think
00:03:55.020 | complement behavioral testing really powerfully.
00:03:58.700 | Let's reflect a little bit on standard evaluations in the field.
00:04:03.660 | I think the upshot here is that they are
00:04:06.420 | extremely friendly to our systems in ways that
00:04:09.540 | should increasingly worry us as we think
00:04:11.620 | about systems being deployed in the wider world.
00:04:14.620 | For standard evaluations, at step 1,
00:04:18.140 | you create a dataset from a single process.
00:04:21.900 | That is the part to emphasize, a single process.
00:04:24.660 | You could scrape a website,
00:04:26.660 | you could reformat a database,
00:04:28.980 | you could crowdsource some labels for
00:04:31.060 | some examples and so forth and so on.
00:04:32.980 | Whatever you do, you run this one process.
00:04:36.540 | Then in the next phase,
00:04:38.280 | you divide the dataset into disjoint train and test sets,
00:04:42.500 | and you set the test set aside.
00:04:44.140 | It's under lock and key,
00:04:45.560 | you won't look at it until the very end.
00:04:48.220 | That's really good because that's going to be
00:04:50.380 | our estimate of the capacity of the system to generalize.
00:04:53.580 | But notice you've already been very friendly to your system
00:04:56.860 | because step 1 offers you a guarantee that
00:04:59.900 | those test examples will in some sense be
00:05:02.460 | very much like those that you saw in training.
00:05:05.900 | Then you develop your system on the train set.
00:05:09.080 | Only after all development is complete,
00:05:11.480 | you evaluate the system based on
00:05:13.400 | some notion of accuracy standardly on the test set.
00:05:17.000 | Then, and this is crucial, you report the results as
00:05:20.600 | an estimate of the system's capacity to generalize.
00:05:24.840 | At that point, you're communicating with
00:05:26.680 | the wider world and saying,
00:05:28.440 | here is a measure of the system's accuracy,
00:05:30.940 | and you know people will infer that this is
00:05:33.640 | the accuracy that they will experience
00:05:35.820 | if they use the model in free usage.
00:05:38.980 | This is the part that worries me.
00:05:40.900 | Step 1 was a single process for creating the data,
00:05:44.260 | and we report that as an estimate
00:05:46.380 | of the system's capacity to generalize.
00:05:48.640 | Even though we know full well that the world is
00:05:51.260 | not a single homogeneous process,
00:05:53.540 | we absolutely know that once the model is deployed,
00:05:56.620 | it will encounter examples that are
00:05:58.700 | very different from those that were created at step 1.
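
To make the five steps concrete, here is a minimal sketch of the standard IID protocol (my own illustration using scikit-learn with synthetic data, not code from the course): one data-generating process, a disjoint split, development on the train set only, and a single accuracy number reported as the generalization estimate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: a dataset created by a single process (here, one synthetic generator).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Step 2: disjoint train and test sets drawn from that same process.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 3: develop the system on the train set only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Steps 4-5: evaluate once on the held-out test set and report the number
# as the estimate of the system's capacity to generalize.
print("IID test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```
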
00:06:02.100 | That is the worrisome part,
00:06:03.700 | and that is where so-called adversarial evaluations come in.
00:06:07.880 | They needn't be full-on adversarial,
00:06:09.900 | but the idea is to expose some of
00:06:12.520 | the fragility of that standard evaluation mode.
00:06:15.900 | At step 1 in adversarial evaluations,
00:06:18.460 | you create a dataset by whatever means you like, as usual.
00:06:22.160 | You develop and assess the system using
00:06:24.480 | that dataset according to whatever protocols you choose.
00:06:28.560 | Now the new part,
00:06:30.200 | you develop a new test dataset of examples that you
00:06:33.260 | suspect or know will be
00:06:35.560 | challenging given your system and the original dataset.
00:06:39.640 | Only after all system development is complete,
00:06:42.280 | you evaluate the system based on
00:06:43.760 | accuracy on that new test set.
00:06:46.960 | Then you report the results as providing some estimate
00:06:50.560 | of the system's capacity to generalize as before.
00:06:54.000 | This is the new piece, this contrast.
00:06:56.440 | We have our dataset that we use to create the system,
00:06:58.920 | especially for training.
00:07:00.440 | But then in step 3,
00:07:01.800 | we have a new test dataset,
00:07:03.680 | and that plays a crucial role of now offering
00:07:06.440 | us an estimate of the system's capacity to generalize.
00:07:10.040 | To the extent that we have created some hard and
00:07:12.520 | diverse new test sets in this way,
00:07:15.160 | we can probably gain increasing confidence that we are
00:07:18.360 | simulating what life will be like for the model if it is deployed.
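
Continuing the sketch above (again my own illustration, not the course's code, and reusing X_test, y_test, model, and accuracy_score from that sketch), the adversarial protocol keeps the original development setup but adds a second, deliberately shifted test set and reports performance on that instead:

```python
import numpy as np

# Step 3 of the adversarial protocol: construct a new test set you suspect or
# know will be challenging. Here a crude stand-in: the same test items
# perturbed with noise to simulate a distribution shift.
rng = np.random.default_rng(0)
X_challenge = X_test + rng.normal(scale=2.0, size=X_test.shape)

# Steps 4-5: evaluate the already-developed model on the challenge set and
# report this number as the generalization estimate. It will typically sit
# well below the IID number, which is exactly the fragility being exposed.
print("Challenge test accuracy:", accuracy_score(y_test, model.predict(X_challenge)))
```
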
00:07:22.960 | That's a call to action to do this really effectively,
00:07:26.520 | to really feel like you can get behind step 5 here.
00:07:29.720 | You should construct these adversarial or challenge datasets in a way that
00:07:34.000 | covers as much of the spectrum of user behaviors,
00:07:37.300 | user goals, and user inputs
00:07:39.120 | as you expect to see.
00:07:40.720 | That implies having diverse teams of people
00:07:43.920 | battle-testing these models,
00:07:45.560 | creating hard examples,
00:07:47.040 | and studying the resulting behavior.
00:07:48.880 | In that way, with a concerted effort there,
00:07:51.660 | you can inch closer to having
00:07:53.640 | a real guarantee for how the model will behave when it is deployed.
00:07:58.280 | It's a hallmark of behavioral testing that you will never have
00:08:01.420 | a full guarantee, but you can approach one,
00:08:03.760 | and then as the next unit will show,
00:08:05.880 | you might supplement that with some deeper understanding of how the model works.
00:08:10.560 | But in any case, I feel like this is the mode that we should be in when we think
00:08:14.960 | about AI systems in this modern era of ever-widening impact.
00:08:19.880 | The history of this is interesting.
00:08:22.720 | Adversarial testing feels like a new idea,
00:08:25.040 | but in fact, it stretches all the way back to at least the Turing test.
00:08:29.600 | You might recall that the fundamental insight behind
00:08:32.240 | the Turing test is that we'll get a reliable evaluation when we
00:08:35.880 | pit people against computers where the goal of
00:08:39.020 | the computer is to try to fool the person into thinking it is a person itself,
00:08:43.360 | and the human is trying their level best to figure out whether
00:08:46.480 | that entity is a human or some AI.
00:08:50.460 | That is an inherently adversarial dynamic
00:08:53.160 | that is centered around linguistic interaction.
00:08:55.640 | I think we have to call that the first or certainly the most influential adversarial test.
00:09:01.800 | Sometime later, Terry Winograd proposed developing datasets that involved
00:09:07.360 | very intricate problems that he hoped would get past
00:09:10.760 | simple statistical tricks and really probe to see
00:09:13.440 | whether models truly understood what the world was like.
00:09:17.680 | Hector Levesque, in his lovely paper "On Our Best Behavior," revived this idea from
00:09:23.460 | Winograd of adversarially testing models to see whether they
00:09:27.420 | truly understand what language and the world are like.
00:09:31.400 | The Winograd sentences are really interesting to reflect on now.
00:09:34.860 | They are simple problems that can be quite revealing about
00:09:38.900 | physical reality and social reality and all the rest.
00:09:42.680 | Here's a typical Winograd case.
00:09:44.920 | The trophy doesn't fit into the brown suitcase because it's too small.
00:09:49.100 | What is too small?
00:09:50.400 | The human intuition is to say the suitcase and that's probably because you can do
00:09:54.380 | some mental simulation of these two objects
00:09:57.760 | and then arrive at an answer to the question.
00:10:00.320 | The minimal pair there is the trophy doesn't fit into
00:10:03.060 | the brown suitcase because it's too large.
00:10:05.200 | What is too large?
00:10:06.460 | Here the human answer is the trophy again because of
00:10:09.220 | that mental simulation that you can do.
00:10:12.340 | The idea is that this is a behavioral test that will help us
00:10:15.700 | understand whether models also have
00:10:18.260 | that deep understanding of our physical reality.
00:10:21.820 | Here's a case that keys more into social norms and roles that people play.
00:10:27.140 | The council refused the demonstrators a permit
00:10:29.740 | because they feared violence.
00:10:31.260 | Who feared violence?
00:10:32.480 | The human answer is the council based on
00:10:34.980 | stereotypical roles for demonstrators and politicians.
00:10:38.980 | Versus the council refused the demonstrators
00:10:41.900 | a permit because they advocated violence.
00:10:44.020 | Who advocated violence?
00:10:45.100 | Again, we default to saying the demonstrators because
00:10:48.580 | of our default assumptions about the roles that people will play.
00:10:51.740 | The idea is for a model to get these responses correct,
00:10:56.800 | it too needs to deeply understand what's happening
00:11:00.140 | with these entities and with the social norms involved.
00:11:03.420 | That's the guiding hypothesis.
00:11:05.260 | Again, behavioral testing can never give us
00:11:07.220 | full guarantees that we've probed fully
00:11:09.580 | for the underlying capability that we cared about.
00:11:12.300 | But examples like this are inspiring in terms of getting us closer to that ideal.
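
One simple way to turn such a minimal pair into a behavioral test (a sketch of my own, not part of the lecture, assuming the Hugging Face transformers library and an off-the-shelf GPT-2) is to substitute each candidate referent for the pronoun and prefer the reading the language model finds more probable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_loss(sentence):
    """Average token-level loss the LM assigns to the sentence (lower = more probable)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

def resolve(template, candidates):
    """Pick the referent whose substitution yields the more probable sentence."""
    return min(candidates, key=lambda c: sentence_loss(template.format(c)))

template = "The trophy doesn't fit into the brown suitcase because the {} is too small."
print(resolve(template, ["trophy", "suitcase"]))  # humans answer: suitcase
```

Whether a model passes such items is, as the lecture stresses, suggestive evidence rather than a guarantee about the underlying capability.
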
00:11:17.940 | Hector Levesque took this further in a way that I
00:11:20.700 | think has proved really inspiring for the field.
00:11:23.140 | He says, for example,
00:11:24.340 | could a crocodile run a steeplechase?
00:11:26.660 | The intent here is clear.
00:11:28.200 | The question can be answered by thinking it through.
00:11:30.260 | A crocodile has short legs.
00:11:32.100 | The hedges in a steeplechase would be too tall for the crocodile to jump over.
00:11:36.060 | So no, a crocodile cannot run a steeplechase.
00:11:39.020 | Again, evoking this idea of doing a mental simulation about
00:11:42.980 | a very unfamiliar situation and arriving at a systematic answer to the question.
00:11:48.620 | What Levesque was really after was what he called foiling cheap tricks.
00:11:53.180 | Can we find questions where cheap tricks like
00:11:55.580 | this will not be sufficient to produce the desired behavior?
00:11:59.640 | This unfortunately has no easy answer.
00:12:02.480 | The best we can do perhaps is to come up with a suite of
00:12:05.440 | multiple choice questions carefully and then study
00:12:08.560 | the sorts of computer programs that might be able to answer them.
00:12:12.280 | Again, what I hear in this early paper back in 2013 is a call for
00:12:17.120 | constructing adversarial datasets that will reveal
00:12:20.280 | much more about the solutions that our models have found.