
Stanford XCS224U: NLU | Analysis Methods for NLU, Part 1: Overview | Spring 2023


Chapters

0:00 Intro
0:11 Varieties of evaluation
0:50 Limits of behavioral testing
3:37 Models today
4:16 The interpretability dream
5:08 Progress on benchmarks
6:08 Systematicity
7:31 A crucial prerequisite
8:50 Probing internal representations
10:37 Feature attribution
11:34 Intervention-based methods
12:06 Analytical framework


00:00:00.000 | Welcome everyone.
00:00:06.040 | This screencast kicks off our unit on analysis methods in NLP.
00:00:10.600 | In the previous unit for the course,
00:00:12.640 | we were very focused on behavioral testing,
00:00:15.360 | and we looked in particular at hypothesis-driven challenge and adversarial tests
00:00:20.780 | as a vehicle for deeply understanding how our models will behave,
00:00:25.080 | especially in unfamiliar scenarios.
00:00:27.560 | What we're going to try to do in this unit is go
00:00:30.320 | one layer deeper and talk about what I've called structural methods,
00:00:33.880 | including probing, feature attribution,
00:00:36.480 | and a class of intervention-based methods.
00:00:38.840 | The idea is that we're going to go beyond simple behavioral testing to understand,
00:00:43.680 | we hope, the causal mechanisms that are guiding the input-output behavior of our models.
00:00:50.480 | In the previous unit,
00:00:52.640 | I tried to make you very aware of the limits of behavioral testing.
00:00:56.440 | Of course, it plays an important role in the field,
00:00:58.800 | and it will complement the methods that we discuss,
00:01:01.160 | but it is intrinsically limited in ways that should worry us when it
00:01:04.920 | comes to offering guarantees about how models will behave.
00:01:08.800 | To make that very vivid for you,
00:01:11.000 | I use this example of an even-odd detector.
00:01:13.840 | Let me walk through that now, taking a slightly different perspective,
00:01:17.840 | which is the illuminated feeling that we get when
00:01:21.080 | we finally get to see how the model actually works.
00:01:24.120 | But recall this even-odd model takes in strings like
00:01:27.040 | four and predicts whether they refer to even or odd numbers.
00:01:30.960 | Four comes in and it predicts even,
00:01:33.160 | 21 and it rightly predicts odd,
00:01:35.520 | 32 even, 36 even, 63 odd.
00:01:41.720 | This is all making you feel that the model is a good model of even-odd detection.
00:01:47.680 | But you need to be careful,
00:01:49.280 | you've only done five tests.
00:01:50.960 | Now I show you how the model actually works,
00:01:53.520 | and it is immediately revealed to you that this is a very poor model.
00:01:57.480 | We got lucky with our first five inputs.
00:01:59.960 | It is a simple lookup on those inputs,
00:02:02.360 | and when it gets an unfamiliar input,
00:02:04.560 | it defaults to predicting odd.
00:02:07.120 | Once we see that,
00:02:08.480 | we know exactly how the model is broken,
00:02:10.600 | and exactly how to foil it behaviorally.
00:02:12.720 | We input 22, and it thinks that that is odd.
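(As a minimal sketch, this first model amounts to something like the following program. The lookup entries and the default are reconstructed from the examples above, not copied from the slide.)

```python
# Hypothetical reconstruction of even-odd model 1: a lookup table over
# the five lucky inputs, defaulting to "odd" on anything unfamiliar.
def even_odd_model_1(s: str) -> str:
    lookup = {"4": "even", "21": "odd", "32": "even", "36": "even", "63": "odd"}
    return lookup.get(s, "odd")  # the fatal default

print(even_odd_model_1("4"))   # even (correct)
print(even_odd_model_1("22"))  # odd  (wrong: 22 is even)
```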
00:02:16.320 | But then we get a second even-odd model.
00:02:19.080 | It passes the first five tests,
00:02:21.040 | it makes a good prediction about 22,
00:02:23.360 | good prediction about 5,
00:02:25.240 | and 89, and 56.
00:02:27.400 | Again, your confidence is building,
00:02:29.600 | but you should be aware of the fact that you might have missed some crucial examples.
00:02:34.640 | Again, when I show you the inner workings of this model,
00:02:38.880 | you get immediately illuminated about where it works and where it doesn't.
00:02:43.520 | This model is more sophisticated.
00:02:45.680 | It tokenizes its input and uses the final token as the basis for predicting even-odd.
00:02:51.760 | That is a pretty good theory,
00:02:53.520 | but it has this else clause where it predicts odd,
00:02:56.360 | and now we know exactly how to foil the model.
00:02:59.120 | We input 16, and it thinks that that is odd.
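(A minimal sketch of the second model, assuming the inputs are number words such as "thirty six"; the slide's exact tokenization is not recoverable from the transcript. What matters is the mechanism: a final-token check plus an else clause.)

```python
# Hypothetical reconstruction of even-odd model 2: tokenize, then decide
# based on the final token, with an else clause that predicts "odd".
EVEN_WORDS = {"zero", "two", "four", "six", "eight"}

def even_odd_model_2(s: str) -> str:
    final_token = s.split()[-1]      # e.g. "thirty six" -> "six"
    if final_token in EVEN_WORDS:
        return "even"
    return "odd"                     # the fatal else clause

print(even_odd_model_2("thirty six"))  # even (correct)
print(even_odd_model_2("sixteen"))     # odd  (wrong: a single token hits the else)
```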
00:02:59.120 | It was really at the point when we got to see the internal causal mechanisms
00:03:06.800 | that we knew exactly how the model would work,
00:03:08.760 | and exactly where it would fail.
00:03:10.680 | Now we move at last to model 3.
00:03:12.800 | Let's suppose that it gets all of those previous inputs correct.
00:03:15.600 | Is it the one true model of even-odd detection?
00:03:19.000 | Well, we can keep up our behavioral testing,
00:03:21.880 | but you should see by now that no matter how many inputs we offer this model,
00:03:25.640 | we will never get a guarantee for
00:03:27.800 | every integer string that it will behave as intended.
00:03:30.840 | For that guarantee,
00:03:32.760 | we need to look inside this black box.
00:03:36.720 | But of course, in the modern era of NLP models,
00:03:41.500 | they're hardly ever as easy to
00:03:43.760 | understand as the symbolic programs that I was just showing you.
00:03:47.060 | Instead, our models look like this huge array of bird's nests,
00:03:52.840 | lots of internal states all connected to all the other states,
00:03:56.560 | completely opaque, they consist mainly of
00:03:59.400 | weights and multiplications of weights,
00:04:01.940 | they have no symbols in them.
00:04:03.440 | Therefore, they are very difficult for us to understand as
00:04:07.180 | humans in a way that will illuminate how they'll behave in unfamiliar settings.
00:04:11.980 | Of course, the dream of these models is that somehow we'll see
00:04:16.500 | patterns of activation or something that look like this and
00:04:20.480 | begin to reveal what is clearly a tree structure.
00:04:24.280 | You might think, aha, the model actually does implicitly represent
00:04:28.440 | constituents or named entities or other kinds of meaningful units in language,
00:04:34.160 | and then you would feel like you truly understood it.
00:04:37.320 | But of course, that never happens.
00:04:39.720 | Instead, what we get when we look at these models is
00:04:42.400 | apparently just a mass of activations.
00:04:45.980 | You get the feeling that either there's nothing systematic
00:04:48.480 | happening here or we're just looking at it incorrectly.
00:04:52.280 | I'm going to offer a hopeful message on this point.
00:04:55.860 | The mess is only apparent: when we
00:04:58.720 | use the right techniques and take the right perspective on these models,
00:05:02.400 | we find that the best of them have actually found really systematic and interesting solutions.
00:05:08.440 | There's another angle we could take on this which connects
00:05:11.560 | back to the stuff about behavioral testing.
00:05:14.340 | I've shown this slide a few times in the course;
00:05:17.000 | it shows progress on benchmarks.
00:05:19.120 | Along the x-axis, we have time and the y-axis is a normalized measure
00:05:24.080 | of distance from our estimate of human performance, shown as the red line.
00:05:28.280 | One perspective on this slide is that progress is incredible.
00:05:32.560 | Benchmarks used to take decades to
00:05:35.520 | saturate, and now saturation happens in a matter of years.
00:05:40.480 | The other perspective on this plot, of course,
00:05:43.640 | is that the benchmarks are too weak.
00:05:45.660 | We have a suspicion that even the models that are performing well on
00:05:49.840 | these tasks are very far from
00:05:52.240 | the human capability that we are trying to diagnose.
00:05:55.080 | We feel that they have brittle solutions,
00:05:58.320 | concerning solutions that are going to reveal themselves in problematic ways.
00:06:03.480 | To really get past that concern,
00:06:05.720 | we need to go beyond this behavioral testing.
00:06:08.980 | There's another underlying motivation for this,
00:06:12.180 | which is systematicity.
00:06:13.640 | We talked about this in detail in the previous unit.
00:06:16.240 | It's an idea from Fodor and Pylyshyn.
00:06:18.420 | They say, what we mean when we say that linguistic capacities are
00:06:21.780 | systematic is that the ability to produce or understand
00:06:25.440 | some sentences is intrinsically connected to
00:06:28.220 | the ability to produce/understand certain others.
00:06:30.760 | This is the idea that if you know what Sandy loves the puppy means,
00:06:34.640 | then you just know what the puppy loves Sandy means.
00:06:37.540 | If you recognize the distributional affinity between the turtle and
00:06:41.360 | the puppy, you also understand the turtle loves the puppy,
00:06:44.880 | Sandy loves the turtle, and so forth and so on for
00:06:47.320 | an enormous number of sentences.
00:06:50.640 | The human capacity for language makes it feel like
00:06:53.800 | these aren't new facts that you're learning,
00:06:55.720 | but rather things that follow directly from
00:06:58.200 | an underlying capability that you have.
00:07:00.600 | We offered compositionality as one possible explanation for why in
00:07:05.120 | the language realm our understanding and use of language is so systematic.
00:07:11.320 | The related point here is that you get the feeling that we won't fully trust
00:07:16.480 | our models until we can validate that the solutions that they have
00:07:20.080 | found are also systematic or maybe even compositional in this way.
00:07:24.160 | Otherwise, we'll have concerns that at crucial moments,
00:07:27.440 | their behaviors will seem arbitrary to us.
00:07:31.360 | There's another angle that you can take on
00:07:33.900 | this project of explaining model behaviors.
00:07:36.560 | The field has a lot of really crucial high-level goals that
00:07:40.880 | relate to safety and trustworthiness and so forth.
00:07:44.760 | We want to be able to certify where models
00:07:47.680 | can be used and where they should not be used.
00:07:50.600 | We want to be able to certify that our models are free from
00:07:54.320 | pernicious social biases and we want to offer
00:07:57.160 | guarantees that our models are safe in certain contexts.
00:08:01.000 | Given what I've said about behavioral testing,
00:08:03.360 | you can anticipate what I'll say now,
00:08:05.440 | behavioral testing alone will not suffice to achieve these goals.
00:08:09.920 | It could possibly tell us that a model does have
00:08:12.880 | a pernicious social bias or is unsafe in
00:08:15.640 | a certain context or has a certain area
00:08:17.760 | where it should be disapproved for use.
00:08:20.040 | But the positive guarantees: free from social bias,
00:08:23.720 | safe in a context, or approved for a given use,
00:08:26.880 | those will not be achieved until we get beyond behavioral testing.
00:08:31.080 | For those, we need to understand at a deep level how our models are
00:08:36.640 | structured and what mechanisms guide their behavior.
00:08:40.360 | We need analytic guarantees about how they will behave,
00:08:43.920 | and that means going beyond behavioral testing
00:08:46.040 | to really understand the causal mechanisms.
00:08:49.560 | In service of moving toward that goal,
00:08:52.960 | we're going to discuss, as I said, three main methods.
00:08:56.440 | The first one is probing.
00:08:58.280 | There are some precedents before Tenney et al.
00:09:00.640 | 2019 in the literature,
00:09:02.120 | but I think Tenney et al. deserve real credit for showing that
00:09:05.720 | probing was viable and interesting in the BERT era.
00:09:09.400 | Because what they did is essentially fit
00:09:12.000 | small supervised models to
00:09:13.540 | different layers in the BERT architecture.
00:09:15.680 | What they discovered is that there is a lot of
00:09:18.100 | systematic information encoded in those layers.
00:09:21.020 | This was really eye-opening.
00:09:22.640 | I think that most people believed that
00:09:24.520 | even though BERT was performant,
00:09:26.520 | it was performant in ways that depended on
00:09:29.040 | entirely unsystematic solutions.
00:09:31.640 | What probing began to suggest is that BERT had
00:09:35.000 | induced some really interesting causal structure
00:09:38.200 | about language as part of its training regime.
00:09:41.440 | The way this plot works is that we have
00:09:43.320 | the layers of BERT along the x-axis,
00:09:45.480 | and we have different phenomena in these different panels.
00:09:48.920 | What you can see in the blue especially,
00:09:51.640 | is that different kinds of information are emerging
00:09:54.280 | pretty systematically at
00:09:55.680 | different points in the BERT layer structure.
00:09:58.400 | For example, part of speech
00:09:59.880 | seems to emerge around the middle.
00:10:01.680 | Dependency parses emerge a bit later,
00:10:04.380 | named entities are fainter and later in the structure,
00:10:07.680 | semantic roles pretty strong near the middle,
00:10:10.840 | coreference information emerging later in
00:10:13.400 | the network, and so forth and so on.
00:10:15.480 | This was really eye-opening because I think people
00:10:18.080 | didn't anticipate that all of this would be so
00:10:20.400 | accessible in the hidden representations of these models.
00:10:24.800 | What we'll see is that probing is very
00:10:27.280 | rich in terms of characterizing
00:10:28.960 | these internal representations,
00:10:30.600 | but it cannot offer causal guarantees that
00:10:33.240 | this information is shaping model performance.
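(To make the probing recipe concrete, here is a minimal sketch: freeze BERT, read off the hidden states at one layer, and fit a small supervised classifier to them. The toy sentences, labels, pooling, and layer choice are illustrative placeholders, not Tenney et al.'s actual setup.)

```python
# A minimal probing sketch: a small supervised model fit to frozen
# BERT hidden states at a single layer.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["The puppy sleeps.", "Sandy loves the turtle."]  # placeholder data
labels = [0, 1]  # placeholder labels for some linguistic property

layer = 6  # probe a middle layer, where information like POS tends to emerge
features = []
with torch.no_grad():
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, 768)
        features.append(hidden.mean(dim=1).squeeze(0).numpy())  # pool over tokens

probe = LogisticRegression(max_iter=1000).fit(features, labels)
# High probe accuracy suggests the layer encodes the property, but it does
# not show that the model causally uses that information.
```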
00:10:36.600 | We can complement that with a class of methods that
00:10:39.240 | are called feature attribution methods.
00:10:41.680 | The idea here is that we will essentially,
00:10:44.400 | in the deep learning context,
00:10:45.600 | study the gradients of our model and use those to
00:10:48.720 | understand which neurons and which collections of
00:10:52.080 | neurons are most guiding its input-output behavior.
00:10:55.440 | For these methods, we're going to get only
00:10:57.920 | faint characterizations of what
00:10:59.560 | the representations are doing,
00:11:01.160 | but we will get some causal guarantees.
00:11:04.000 | What I've got here to illustrate is
00:11:06.160 | a simple sentiment challenge set.
00:11:08.920 | There are a bunch of hard cases involving
00:11:11.240 | attitude-taking with verbs like
00:11:13.200 | say and shifts in sentiment.
00:11:15.760 | What you see here in the highlighting is that the model
00:11:18.680 | seems to be making use of very intuitive information
00:11:21.800 | to shape what are very good predictions for these cases.
00:11:25.240 | Again, that might be reassuring to us that the model is
00:11:28.480 | doing something human interpretable
00:11:30.960 | and systematic under the hood.
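(Here is a minimal sketch of gradient-based feature attribution, in the gradient-times-input style. The tiny randomly initialized classifier is a stand-in for the real sentiment model behind the slide; only the attribution recipe is the point.)

```python
# A minimal feature-attribution sketch: gradient-times-input saliency
# scores for each token of a toy text classifier.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"they": 0, "said": 1, "it": 2, "was": 3, "great": 4, "awful": 5}
embed = nn.Embedding(len(vocab), 8)
classifier = nn.Linear(8, 2)  # two classes: negative / positive

tokens = ["they", "said", "it", "was", "great"]
ids = torch.tensor([vocab[t] for t in tokens])

embs = embed(ids)                      # (seq_len, 8)
embs.retain_grad()                     # keep gradients on the embeddings
logits = classifier(embs.mean(dim=0))  # mean-pool, then classify
logits[1].backward()                   # gradient of the "positive" logit

# Attribution per token: gradient times input, summed over embedding dims.
scores = (embs.grad * embs).sum(dim=1)
for tok, score in zip(tokens, scores.tolist()):
    print(f"{tok:>6}: {score:+.3f}")
```

Integrated gradients, discussed later, refines this recipe by averaging such gradients along a path from a baseline input to the actual input.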
00:11:33.640 | Then finally, we're going to study intervention-based methods.
00:11:37.320 | This is a large class of methods.
00:11:39.480 | I think I'll save the details for a later screencast,
00:11:42.360 | but the essence of this is that we're going to
00:11:44.840 | perform brain surgery on our models.
00:11:47.080 | We are going to manipulate their internal states and
00:11:49.800 | study the effects that that
00:11:51.400 | has on their input-output behavior.
00:11:53.800 | In that way, we can piece together
00:11:56.760 | an understanding of the causal mechanisms
00:11:59.200 | that shape the model's behavior,
00:12:01.360 | pushing us toward exactly the guarantees that we need.
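(The details come in a later screencast, but here is a minimal sketch of the core move: overwrite one internal representation during a forward pass and compare the outputs before and after. The two-layer network is just a stand-in for a real model.)

```python
# A minimal intervention sketch: patch a hidden state from one input's
# forward pass into another's, and observe the effect on the output.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

source = torch.randn(1, 4)  # input whose hidden state we transplant
base = torch.randn(1, 4)    # input whose computation we intervene on

# Record the first layer's output on the source input.
stash = {}
handle = model[0].register_forward_hook(lambda mod, inp, out: stash.update(h=out))
model(source)
handle.remove()

# Patch that state into the base input's forward pass (returning a value
# from a forward hook replaces the layer's output).
handle = model[0].register_forward_hook(lambda mod, inp, out: stash["h"])
patched_output = model(base)
handle.remove()

print("original:", model(base))      # unmodified computation
print("patched: ", patched_output)   # the shift reveals the state's causal role
```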
00:12:06.120 | Let me, by way of wrapping up this opening screencast,
00:12:09.720 | offer you an analytical framework for
00:12:11.760 | thinking about the methods that we're going to discuss.
00:12:14.080 | Let's say we have three goals.
00:12:15.960 | First, we want to characterize representations,
00:12:19.000 | input representations, output representations,
00:12:21.560 | but maybe most crucially,
00:12:23.220 | internal representations for our models.
00:12:26.240 | We also want to make
00:12:27.920 | causal claims about the role of those representations.
00:12:32.320 | Once we have started to learn about how the models behave,
00:12:35.320 | we would like to have an easy path to actually improving
00:12:39.200 | models based on those insights so that we don't simply
00:12:41.760 | passively study them but rather actively make them better.
00:12:46.400 | That's a scorecard. Let's think about these methods.
00:12:49.160 | What we'll see is that probing is great,
00:12:51.640 | as I said, at characterizing representations,
00:12:54.120 | but it cannot offer causal inferences,
00:12:57.000 | and it's unclear whether there's a path from
00:12:59.340 | probing to actually improving models.
00:13:02.100 | For feature attributions, we get
00:13:04.480 | only faint characterizations of the model internal states.
00:13:08.040 | We pretty much just get weights that tell us how
00:13:10.560 | much individual neurons contribute
00:13:12.680 | to the input-output behavior.
00:13:14.580 | But we can get causal guarantees from some of these methods.
00:13:18.040 | We'll talk about integrated gradients as an example of that.
00:13:21.720 | Then these intervention-based methods,
00:13:23.960 | I've got smileys across the board.
00:13:25.640 | This is the class of methods that
00:13:27.080 | I've been most deeply involved with.
00:13:28.760 | It's the class of methods that I favor,
00:13:30.840 | and that is in large part because of
00:13:32.760 | how well they do on this scorecard.
00:13:34.520 | With these methods, we can characterize representations,
00:13:37.440 | we can offer causal guarantees,
00:13:39.700 | and as you'll see, there's an easy path to using
00:13:42.400 | the insights we gain to actually improve our models.
00:13:45.600 | That's the name of the game for me.
00:13:48.680 | We will now begin systematically working through
00:13:51.840 | these three classes of methods trying to more deeply
00:13:54.600 | understand how they work and
00:13:56.380 | why my scorecard looks the way it does.