
Stanford XCS224U: NLU | Analysis Methods for NLU, Part 1: Overview | Spring 2023


Chapters

0:00 Intro
0:11 Varieties of evaluation
0:50 Limits of behavioral testing
3:37 Models today
4:16 The interpretability dream
5:08 Progress on benchmarks
6:08 Systematicity
7:31 A crucial prerequisite
8:50 Probing internal representations
10:37 Feature attribution
11:34 Intervention-based methods
12:06 Analytical framework


00:00:00.000 | Welcome everyone.
00:00:06.040 | This screencast kicks off our unit on analysis methods in NLP.
00:00:10.600 | In the previous unit for the course,
00:00:12.640 | we were very focused on behavioral testing,
00:00:15.360 | and we looked in particular at hypothesis-driven challenge and adversarial tests
00:00:20.780 | as a vehicle for deeply understanding how our models will behave,
00:00:25.080 | especially in unfamiliar scenarios.
00:00:27.560 | What we're going to try to do in this unit is go
00:00:30.320 | one layer deeper and talk about what I've called structural methods,
00:00:33.880 | including probing, feature attribution,
00:00:36.480 | and a class of intervention-based methods.
00:00:38.840 | The idea is that we're going to go beyond simple behavioral testing to understand,
00:00:43.680 | we hope, the causal mechanisms that are guiding the input-output behavior of our models.
00:00:50.480 | In the previous unit,
00:00:52.640 | I tried to make you very aware of the limits of behavioral testing.
00:00:56.440 | Of course, it plays an important role in the field,
00:00:58.800 | and it will complement the methods that we discuss,
00:01:01.160 | but it is intrinsically limited in ways that should worry us when it
00:01:04.920 | comes to offering guarantees about how models will behave.
00:01:08.800 | To make that very vivid for you,
00:01:11.000 | I use this example of an even-odd detector.
00:01:13.840 | Let me walk through that now, taking a slightly different perspective,
00:01:17.840 | which is the illuminated feeling that we get when
00:01:21.080 | we finally get to see how the model actually works.
00:01:24.120 | But recall this even-odd model takes in strings like
00:01:27.040 | four and predicts whether they refer to even or odd numbers.
00:01:30.960 | Four comes in and it predicts even,
00:01:33.160 | 21 and it rightly predicts odd,
00:01:35.520 | 32 even, 36 even, 63 odd.
00:01:41.720 | This is all making you feel that the model is a good model of even-odd detection.
00:01:47.680 | But you need to be careful,
00:01:49.280 | you've only done five tests.
00:01:50.960 | Now I show you how the model actually works,
00:01:53.520 | and it is immediately revealed to you that this is a very poor model.
00:01:57.480 | We got lucky with our first five inputs.
00:01:59.960 | It is a simple lookup on those inputs,
00:02:02.360 | and when it gets an unfamiliar input,
00:02:04.560 | it defaults to predicting odd.
00:02:07.120 | Once we see that,
00:02:08.480 | we know exactly how the model is broken,
00:02:10.600 | and exactly how to foil it behaviorally.
00:02:12.720 | We input 22, and it thinks that that is odd.
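(As a minimal sketch, this first model amounts to something like the following program. The lookup entries and the default are reconstructed from the examples above, not copied from the slide.)

```python
# Hypothetical reconstruction of even-odd model 1: a lookup table over
# the five lucky inputs, defaulting to "odd" on anything unfamiliar.
def even_odd_model_1(s: str) -> str:
    lookup = {"4": "even", "21": "odd", "32": "even", "36": "even", "63": "odd"}
    return lookup.get(s, "odd")  # the fatal default

print(even_odd_model_1("4"))   # even (correct)
print(even_odd_model_1("22"))  # odd  (wrong: 22 is even)
```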
00:02:16.320 | But then we get a second even-odd model.
00:02:19.080 | It passes the first five tests,
00:02:21.040 | it makes a good prediction about 22,
00:02:23.360 | good prediction about 5,
00:02:25.240 | and 89, and 56.
00:02:27.400 | Again, your confidence is building,
00:02:29.600 | but you should be aware of the fact that you might have missed some crucial examples.
00:02:34.640 | Again, when I show you the inner workings of this model,
00:02:38.880 | you get immediately illuminated about where it works and where it doesn't.
00:02:43.520 | This model is more sophisticated.
00:02:45.680 | It tokenizes its input and uses the final token as the basis for predicting even-odd.
00:02:51.760 | That is a pretty good theory,
00:02:53.520 | but it has this else clause where it predicts odd,
00:02:56.360 | and now we know exactly how to foil the model.
00:02:59.120 | We input 16, and it thinks that that is odd.
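(A minimal sketch of the second model, assuming the inputs are number words such as "thirty six"; the slide's exact tokenization is not recoverable from the transcript. What matters is the mechanism: a final-token check plus an else clause.)

```python
# Hypothetical reconstruction of even-odd model 2: tokenize, then decide
# based on the final token, with an else clause that predicts "odd".
EVEN_WORDS = {"zero", "two", "four", "six", "eight"}

def even_odd_model_2(s: str) -> str:
    final_token = s.split()[-1]      # e.g. "thirty six" -> "six"
    if final_token in EVEN_WORDS:
        return "even"
    return "odd"                     # the fatal else clause

print(even_odd_model_2("thirty six"))  # even (correct)
print(even_odd_model_2("sixteen"))     # odd  (wrong: a single token hits the else)
```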
00:02:59.120 | It was really at the point when we got to see the internal causal mechanisms
00:03:06.800 | that we knew exactly how the model would work,
00:03:08.760 | and exactly where it would fail.
00:03:10.680 | Now we move at last to model 3.
00:03:12.800 | Let's suppose that it gets all of those previous inputs correct.
00:03:15.600 | Is it the one true model of even-odd detection?
00:03:19.000 | Well, we can keep up our behavioral testing,
00:03:21.880 | but you should see by now that no matter how many inputs we offer this model,
00:03:25.640 | we will never get a guarantee for
00:03:27.800 | every integer string that it will behave as intended.
00:03:30.840 | For that guarantee,
00:03:32.760 | we need to look inside this black box.
00:03:36.720 | But of course, in the modern era of NLP models,
00:03:41.500 | they're hardly ever as easy to
00:03:43.760 | understand as the symbolic programs that I was just showing you.
00:03:47.060 | Instead, our models look like this huge array of bird's nests,
00:03:52.840 | lots of internal states all connected to all the other states,
00:03:56.560 | completely opaque, they consist mainly of
00:03:59.400 | weights and multiplications of weights,
00:04:01.940 | they have no symbols in them.
00:04:03.440 | Therefore, they are very difficult for us to understand as
00:04:07.180 | humans in a way that will illuminate how they'll behave in unfamiliar settings.
00:04:11.980 | Of course, the dream of these models is that somehow we'll see
00:04:16.500 | patterns of activation or something that look like this and
00:04:20.480 | begin to reveal what is clearly a tree structure.
00:04:24.280 | You might think, aha, the model actually does implicitly represent
00:04:28.440 | constituents or named entities or other kinds of meaningful units in language,
00:04:34.160 | and then you would feel like you truly understood it.
00:04:37.320 | But of course, that never happens.
00:04:39.720 | Instead, what we get when we look at these models is
00:04:42.400 | apparently just a mass of activations.
00:04:45.980 | You get the feeling that either there's nothing systematic
00:04:48.480 | happening here or we're just looking at it incorrectly.
00:04:52.280 | I'm going to offer a hopeful message on this point.
00:04:55.860 | The mess is only apparent: when we
00:04:58.720 | use the right techniques and take the right perspective on these models,
00:05:02.400 | we find that the best of them have actually found really systematic and interesting solutions.
00:05:08.440 | There's another angle we could take on this which connects
00:05:11.560 | back to the stuff about behavioral testing.
00:05:14.340 | I've shown this slide a few times in the course;
00:05:17.000 | it shows progress on benchmarks.
00:05:19.120 | Along the x-axis, we have time and the y-axis is a normalized measure
00:05:24.080 | of distance from our estimate of human performance, shown as the red line.
00:05:28.280 | One perspective on this slide is that progress is incredible.
00:05:32.560 | Benchmarks used to take decades to
00:05:35.520 | saturate, and now saturation happens in a matter of years.
00:05:40.480 | The other perspective on this plot, of course,
00:05:43.640 | is that the benchmarks are too weak.
00:05:45.660 | We have a suspicion that even the models that are performing well on
00:05:49.840 | these tasks are very far from
00:05:52.240 | the human capability that we are trying to diagnose.
00:05:55.080 | We feel that they have brittle solutions,
00:05:58.320 | concerning solutions that are going to reveal themselves in problematic ways.
00:06:03.480 | To really get past that concern,
00:06:05.720 | we need to go beyond this behavioral testing.
00:06:08.980 | There's another underlying motivation for this,
00:06:12.180 | which is systematicity.
00:06:13.640 | We talked about this in detail in the previous unit.
00:06:16.240 | It's an idea from Fodor and Pylyshyn.
00:06:18.420 | They say, what we mean when we say that linguistic capacities are
00:06:21.780 | systematic is that the ability to produce or understand
00:06:25.440 | some sentences is intrinsically connected to
00:06:28.220 | the ability to produce/understand certain others.
00:06:30.760 | This is the idea that if you know what Sandy loves the puppy means,
00:06:34.640 | then you just know what the puppy loves Sandy means.
00:06:37.540 | If you recognize the distributional affinity between the turtle and
00:06:41.360 | the puppy, you also understand the turtle loves the puppy,
00:06:44.880 | Sandy loves the turtle, and so forth and so on for
00:06:47.320 | an enormous number of sentences.
00:06:50.640 | The human capacity for language makes it feel like
00:06:53.800 | these aren't new facts that you're learning,
00:06:55.720 | but rather things that follow directly from
00:06:58.200 | an underlying capability that you have.
00:07:00.600 | We offered compositionality as one possible explanation for why in
00:07:05.120 | the language realm our understanding and use of language is so systematic.
00:07:11.320 | The related point here is that you get the feeling that we won't fully trust
00:07:16.480 | our models until we can validate that the solutions that they have
00:07:20.080 | found are also systematic or maybe even compositional in this way.
00:07:24.160 | Otherwise, we'll have concerns that at crucial moments,
00:07:27.440 | their behaviors will seem arbitrary to us.
00:07:31.360 | There's another angle that you can take on
00:07:33.900 | this project of explaining model behaviors.
00:07:36.560 | The field has a lot of really crucial high-level goals that
00:07:40.880 | relate to safety and trustworthiness and so forth.
00:07:44.760 | We want to be able to certify where models
00:07:47.680 | can be used and where they should not be used.
00:07:50.600 | We want to be able to certify that our models are free from
00:07:54.320 | pernicious social biases and we want to offer
00:07:57.160 | guarantees that our models are safe in certain contexts.
00:08:01.000 | Given what I've said about behavioral testing,
00:08:03.360 | you can anticipate what I'll say now,
00:08:05.440 | behavioral testing alone will not suffice to achieve these goals.
00:08:09.920 | It could possibly tell us that a model does have
00:08:12.880 | a pernicious social bias or is unsafe in
00:08:15.640 | a certain context or has a certain area
00:08:17.760 | where it should be disapproved for use.
00:08:20.040 | But the positive guarantees: free from social bias,
00:08:23.720 | safe in a context, or approved for a given use,
00:08:26.880 | those will not be achieved until we get beyond behavioral testing.
00:08:31.080 | For those, we need to understand at a deep level how our models are
00:08:36.640 | structured and what mechanisms guide their behavior.
00:08:40.360 | We need analytic guarantees about how they will behave,
00:08:43.920 | and that means going beyond behavioral testing
00:08:46.040 | to really understand the causal mechanisms.
00:08:49.560 | In service of moving toward that goal,
00:08:52.960 | we're going to discuss, as I said, three main methods.
00:08:56.440 | The first one is probing.
00:08:58.280 | There are some precedents before Tenney et al.
00:09:00.640 | 2019 in the literature,
00:09:02.120 | but I think Tenney et al. deserve real credit for showing that
00:09:05.720 | probing was viable and interesting in the BERT era.
00:09:09.400 | Because what they did is essentially fit
00:09:12.000 | small supervised models to
00:09:13.540 | different layers in the BERT architecture.
00:09:15.680 | What they discovered is that there is a lot of
00:09:18.100 | systematic information encoded in those layers.
00:09:21.020 | This was really eye-opening.
00:09:22.640 | I think that most people believed that
00:09:24.520 | even though BERT was performant,
00:09:26.520 | it was performant in ways that depended on
00:09:29.040 | entirely unsystematic solutions.
00:09:31.640 | What probing began to suggest is that BERT had
00:09:35.000 | induced some really interesting causal structure
00:09:38.200 | about language as part of its training regime.
00:09:41.440 | The way this plot works is that we have
00:09:43.320 | the layers of BERT along the x-axis,
00:09:45.480 | and we have different phenomena in these different panels.
00:09:48.920 | What you can see in the blue especially,
00:09:51.640 | is that different kinds of information are emerging
00:09:54.280 | pretty systematically at
00:09:55.680 | different points in the BERT layer structure.
00:09:58.400 | For example, part of speech
00:09:59.880 | seems to emerge around the middle.
00:10:01.680 | Dependency parses emerge a bit later,
00:10:04.380 | named entities are fainter and later in the structure,
00:10:07.680 | semantic roles pretty strong near the middle,
00:10:10.840 | coreference information emerging later in
00:10:13.400 | the network, and so forth and so on.
00:10:15.480 | This was really eye-opening because I think people
00:10:18.080 | didn't anticipate that all of this would be so
00:10:20.400 | accessible in the hidden representations of these models.
00:10:24.800 | What we'll see is that probing is very
00:10:27.280 | rich in terms of characterizing
00:10:28.960 | these internal representations,
00:10:30.600 | but it cannot offer causal guarantees that
00:10:33.240 | this information is shaping model performance.
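(To make the probing recipe concrete, here is a minimal sketch: freeze BERT, read off the hidden states at one layer, and fit a small supervised classifier to them. The toy sentences, labels, pooling, and layer choice are illustrative placeholders, not Tenney et al.'s actual setup.)

```python
# A minimal probing sketch: a small supervised model fit to frozen
# BERT hidden states at a single layer.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["The puppy sleeps.", "Sandy loves the turtle."]  # placeholder data
labels = [0, 1]  # placeholder labels for some linguistic property

layer = 6  # probe a middle layer, where information like POS tends to emerge
features = []
with torch.no_grad():
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, 768)
        features.append(hidden.mean(dim=1).squeeze(0).numpy())  # pool over tokens

probe = LogisticRegression(max_iter=1000).fit(features, labels)
# High probe accuracy suggests the layer encodes the property, but it does
# not show that the model causally uses that information.
```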
00:10:36.600 | We can complement that with a class of methods that
00:10:39.240 | are called feature attribution methods.
00:10:41.680 | The idea here is that we will essentially,
00:10:44.400 | in the deep learning context,
00:10:45.600 | study the gradients of our model and use those to
00:10:48.720 | understand which neurons and which collections of
00:10:52.080 | neurons are most guiding its input-output behavior.
00:10:55.440 | For these methods, we're going to get only
00:10:57.920 | faint characterizations of what
00:10:59.560 | the representations are doing,
00:11:01.160 | but we will get some causal guarantees.
00:11:04.000 | What I've got here to illustrate is
00:11:06.160 | a simple sentiment challenge set.
00:11:08.920 | There are a bunch of hard cases involving
00:11:11.240 | attitude-taking with verbs like
00:11:13.200 | say and shifts in sentiment.
00:11:15.760 | What you see here in the highlighting is that the model
00:11:18.680 | seems to be making use of very intuitive information
00:11:21.800 | to shape what are very good predictions for these cases.
00:11:25.240 | Again, that might be reassuring to us that the model is
00:11:28.480 | doing something human interpretable
00:11:30.960 | and systematic under the hood.
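(Here is a minimal sketch of gradient-based feature attribution, in the gradient-times-input style. The tiny randomly initialized classifier is a stand-in for the real sentiment model behind the slide; only the attribution recipe is the point.)

```python
# A minimal feature-attribution sketch: gradient-times-input saliency
# scores for each token of a toy text classifier.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"they": 0, "said": 1, "it": 2, "was": 3, "great": 4, "awful": 5}
embed = nn.Embedding(len(vocab), 8)
classifier = nn.Linear(8, 2)  # two classes: negative / positive

tokens = ["they", "said", "it", "was", "great"]
ids = torch.tensor([vocab[t] for t in tokens])

embs = embed(ids)                      # (seq_len, 8)
embs.retain_grad()                     # keep gradients on the embeddings
logits = classifier(embs.mean(dim=0))  # mean-pool, then classify
logits[1].backward()                   # gradient of the "positive" logit

# Attribution per token: gradient times input, summed over embedding dims.
scores = (embs.grad * embs).sum(dim=1)
for tok, score in zip(tokens, scores.tolist()):
    print(f"{tok:>6}: {score:+.3f}")
```

Integrated gradients, discussed later, refines this recipe by averaging such gradients along a path from a baseline input to the actual input.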
00:11:33.640 | Then finally, we're going to study intervention-based methods.
00:11:37.320 | This is a large class of methods.
00:11:39.480 | I think I'll save the details for a later screencast,
00:11:42.360 | but the essence of this is that we're going to
00:11:44.840 | perform brain surgery on our models.
00:11:47.080 | We are going to manipulate their internal states and
00:11:49.800 | study the effects that that
00:11:51.400 | has on their input-output behavior.
00:11:53.800 | In that way, we can piece together
00:11:56.760 | an understanding of the causal mechanisms
00:11:59.200 | that shape the model's behavior,
00:12:01.360 | pushing us toward exactly the guarantees that we need.
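(The details come in a later screencast, but here is a minimal sketch of the core move: overwrite one internal representation during a forward pass and compare the outputs before and after. The two-layer network is just a stand-in for a real model.)

```python
# A minimal intervention sketch: patch a hidden state from one input's
# forward pass into another's, and observe the effect on the output.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

source = torch.randn(1, 4)  # input whose hidden state we transplant
base = torch.randn(1, 4)    # input whose computation we intervene on

# Record the first layer's output on the source input.
stash = {}
handle = model[0].register_forward_hook(lambda mod, inp, out: stash.update(h=out))
model(source)
handle.remove()

# Patch that state into the base input's forward pass (returning a value
# from a forward hook replaces the layer's output).
handle = model[0].register_forward_hook(lambda mod, inp, out: stash["h"])
patched_output = model(base)
handle.remove()

print("original:", model(base))      # unmodified computation
print("patched: ", patched_output)   # the shift reveals the state's causal role
```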
00:12:06.120 | Let me, by way of wrapping up this opening screencast,
00:12:09.720 | offer you an analytical framework for
00:12:11.760 | thinking about the methods that we're going to discuss.
00:12:14.080 | Let's say we have three goals.
00:12:15.960 | First, we want to characterize representations,
00:12:19.000 | input representations, output representations,
00:12:21.560 | but maybe most crucially,
00:12:23.220 | internal representations for our models.
00:12:26.240 | We also want to make
00:12:27.920 | causal claims about the role of those representations.
00:12:32.320 | Once we have started to learn about how the models behave,
00:12:35.320 | we would like to have an easy path to actually improving
00:12:39.200 | models based on those insights so that we don't simply
00:12:41.760 | passively study them but rather actively make them better.
00:12:46.400 | That's a scorecard. Let's think about these methods.
00:12:49.160 | What we'll see is that probing is great,
00:12:51.640 | as I said, at characterizing representations,
00:12:54.120 | but it cannot offer causal inferences,
00:12:57.000 | and it's unclear whether there's a path from
00:12:59.340 | probing to actually improving models.
00:13:02.100 | For feature attributions, we get
00:13:04.480 | only faint characterizations of the model internal states.
00:13:08.040 | We pretty much just get weights that tell us how
00:13:10.560 | much individual neurons contribute
00:13:12.680 | to the input-output behavior.
00:13:14.580 | But we can get causal guarantees from some of these methods.
00:13:18.040 | We'll talk about integrated gradients as an example of that.
00:13:21.720 | Then these intervention-based methods,
00:13:23.960 | I've got smileys across the board.
00:13:25.640 | This is the class of methods that
00:13:27.080 | I've been most deeply involved with.
00:13:28.760 | It's the class of methods that I favor,
00:13:30.840 | and that is in large part because of
00:13:32.760 | how well they do on this scorecard.
00:13:34.520 | With these methods, we can characterize representations,
00:13:37.440 | we can offer causal guarantees,
00:13:39.700 | and as you'll see, there's an easy path to using
00:13:42.400 | the insights we gain to actually improve our models.
00:13:45.600 | That's the name of the game for me.
00:13:48.680 | We will now begin systematically working through
00:13:51.840 | these three classes of methods trying to more deeply
00:13:54.600 | understand how they work and
00:13:56.380 | why my scorecard looks the way it does.