Stanford XCS224U: NLU | Analysis Methods for NLU, Part 1: Overview | Spring 2023
Chapters
0:00 Intro
0:11 Varieties of evaluation
0:50 Limits of behavioral testing
3:37 Models today
4:16 The interpretability dream
5:08 Progress on benchmarks
6:08 Systematicity
7:31 A crucial prerequisite
8:50 Probing internal representations
10:37 Feature attribution
11:34 Intervention-based methods
12:06 Analytical framework
This screencast kicks off our unit on analysis methods in NLP. In the previous unit, we focused on behavioral evaluation, and we looked in particular at hypothesis-driven challenge and adversarial tests as a vehicle for deeply understanding how our models will behave. What we're going to try to do in this unit is go one layer deeper and talk about what I've called structural methods. The idea is that we're going to go beyond simple behavioral testing to understand, we hope, the causal mechanisms that are guiding the input-output behavior of our models.
I tried to make you very aware of the limits of behavioral testing. Of course, it plays an important role in the field, and it will complement the methods that we discuss, but it is intrinsically limited in ways that should worry us when it comes to offering guarantees about how models will behave.
Let me walk through that now from a slightly different perspective, which is the illuminated feeling that we get when we finally get to see how a model actually works. Recall the even-odd model: it takes in strings like "four" and predicts whether they refer to even or odd numbers. Suppose you test it behaviorally on a handful of inputs and it gets them all right. This is all making you feel that the model is a good model of even-odd detection. But then you get to look inside at how the model computes its predictions, and it is immediately revealed to you that this is a very poor model.
So far so good, but you should be aware of the fact that you might have missed some crucial examples. Now consider a second model. Again, when I show you the inner workings of this model, you are immediately illuminated about where it works and where it doesn't. It tokenizes its input and uses the final token as the basis for predicting even-odd. The final tokens it knows about are handled correctly, but it has this else clause where it predicts odd, and now we know exactly how to foil the model: give it an even number whose final token falls into that else clause.
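To make that vivid, here is a minimal sketch in Python of a model like this one. It is my own reconstruction of the idea, not the exact program from the slides:

```python
# A hypothetical even-odd "model": it tokenizes its input and bases its
# prediction entirely on the final token.
EVEN_FINAL_TOKENS = {"zero", "two", "four", "six", "eight"}

def predict_even_odd(text: str) -> str:
    final_token = text.lower().split()[-1]  # crude whitespace tokenization
    if final_token in EVEN_FINAL_TOKENS:
        return "even"
    else:
        return "odd"  # the fateful else clause: every unknown token is "odd"

print(predict_even_odd("four"))        # "even" -- correct
print(predict_even_odd("twenty one"))  # "odd"  -- correct
print(predict_even_odd("twenty"))      # "odd"  -- wrong: twenty is even
```

Behavioral tests that happen to avoid final tokens like "twenty" would never expose the flaw, but one look at the else clause tells us exactly where the model fails.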
It was really only at the point at which we got to see the internal causal mechanisms that we knew exactly how the model would work. Now imagine a further model. Let's suppose that it gets all of those previous inputs correct. Is it the one true model of even-odd detection? It might be, but you should see by now that no matter how many inputs we offer this model, behavioral testing alone can never assure us that it will behave as intended for every integer string.
But of course, in the modern era of NLP, our models are nowhere near as easy to understand as the symbolic programs that I was just showing you. Instead, our models look like this huge array of birds' nests: lots of internal states all connected to all the other states. Therefore, they are very difficult for us to understand as humans in a way that will illuminate how they'll behave in unfamiliar settings.
Of course, the dream for these models is that somehow we'll see patterns of activation or something that look like this and begin to reveal what is clearly a tree structure. You might think, aha, the model actually does implicitly represent constituents or named entities or other kinds of meaningful units in language, and then you would feel like you truly understood it.
Instead, what we get when we look at these models is something that seems hopelessly opaque. You get the feeling that either there's nothing systematic happening here or we're just looking at it incorrectly. I'm going to offer a hopeful message on this point: I think we can understand these models if we use the right techniques and take the right perspective on them. The best of them actually have found really systematic and interesting solutions.
There's another angle we could take on this, which connects to progress on benchmarks. I've shown this slide a few times in the course. Along the x-axis, we have time, and the y-axis is a normalized measure of distance from our estimate of human performance, the red line. One perspective on this slide is that progress is incredible: benchmarks used to take decades to saturate, and now saturation happens in a matter of years. The other perspective on this plot, of course, is more sobering. We have a suspicion that even the models that are performing well on these benchmarks do not really have the human capability that we are trying to diagnose. They may have found concerning solutions that are going to reveal themselves in problematic ways. To find out whether that is so, we need to go beyond this behavioral testing.
There's another underlying motivation for this, which is systematicity. We talked about this in detail in the previous unit. The idea traces to Fodor and Pylyshyn. They say: what we mean when we say that linguistic capacities are systematic is that the ability to produce or understand some sentences is intrinsically connected to the ability to produce or understand certain others. This is the idea that if you know what "Sandy loves the puppy" means, then you just know what "the puppy loves Sandy" means. If you recognize the distributional affinity between "the turtle" and "the puppy", you also understand "the turtle loves the puppy", "Sandy loves the turtle", and so forth and so on, for an enormous range of related sentences.
The human capacity for language makes it feel like all of this comes for free. We offered compositionality as one possible explanation for why, in the language realm, our understanding and use of language is so systematic. The related point here is that you get the feeling that we won't fully trust our models until we can validate that the solutions they have found are also systematic, or maybe even compositional, in this way. Otherwise, we'll have concerns that at crucial moments, their behaviors will seem arbitrary to us.
The field has a lot of really crucial high-level goals that relate to safety and trustworthiness and so forth. We want to know where models can be used and where they should not be used. We want to be able to certify that our models are free from pernicious social biases, and we want to offer guarantees that our models are safe in certain contexts.
Given what I've said about behavioral testing, you can see that behavioral testing alone will not suffice to achieve these goals. It could possibly tell us that a model does have a problematic behavior, for example, a social bias. But the positive guarantees, that a model is free from social bias, safe in a context, or approved for a given use, will not be achieved until we get beyond behavioral testing. For those, we need to understand at a deep level how our models are structured and what mechanisms guide their behavior. We need analytic guarantees about how they will behave.
In this unit, we're going to discuss, as I said, three main methods. The first is probing internal representations. There are some precedents before Tenney et al., but I think Tenney et al. deserve real credit for showing that probing was viable and interesting in the BERT era. They fit small supervised models, probes, to the hidden representations at the different layers of BERT. What they discovered is that there is a lot of systematic information encoded in those layers. What probing began to suggest is that BERT had induced some really interesting causal structure about language as part of its training regime.
This is one of their figures, and we have different phenomena in these different panels. What you can see is that different kinds of information are emerging at different points in the BERT layer structure: part-of-speech information shows up quite early, named entities are fainter and later in the structure, semantic roles are pretty strong near the middle, and so forth. This was really eye-opening, because I think people didn't anticipate that all of this would be so accessible in the hidden representations of these models. The caveat is that a probe, on its own, cannot tell us whether this information is shaping model performance.
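To make the technique concrete, here is a minimal probing sketch in Python. It is an illustration under my own assumptions, not Tenney et al.'s exact setup, and the tiny dataset is purely for demonstration:

```python
# Fit a linear probe on frozen BERT hidden states to predict part of speech.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Toy labeled data; a real probe would use a large annotated corpus.
examples = [("the puppy sleeps", 1, "NOUN"), ("the puppy sleeps", 2, "VERB"),
            ("a turtle runs",   1, "NOUN"), ("a turtle runs",   2, "VERB")]

X, y = [], []
for sentence, word_idx, tag in examples:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # Layer-8 hidden state; position word_idx + 1 skips the [CLS] token
    # (these words happen to be single tokens for this tokenizer).
    X.append(out.hidden_states[8][0, word_idx + 1].numpy())
    y.append(tag)

probe = LogisticRegression(max_iter=1000).fit(X, y)
# High accuracy on held-out examples would suggest layer 8 encodes POS,
# but it would not show that BERT causally uses that information.
```

The final comment is the key caveat: a successful probe shows the information is present, not that the model relies on it.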
We can complement that with a class of methods that fall under the heading of feature attribution. Many of these study the gradients of our model and use those to understand which neurons, and which collections of neurons, are most guiding its input-output behavior. Here is an example of attributions displayed over some inputs. What you see in the highlighting is that the model seems to be making use of very intuitive information to shape what are very good predictions for these cases. Again, that might be reassuring to us that the model is doing something systematic.
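As a preview, here is a minimal sketch of one such method, integrated gradients, applied to a toy PyTorch model. This is my own illustration; the details come in a later screencast, and in a real NLP setting you would attribute over embedding vectors rather than raw inputs:

```python
# Integrated gradients: accumulate gradients along the straight-line path
# from a baseline input to the actual input, then scale by the difference.
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 2))

def integrated_gradients(x, baseline, target, steps=50):
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)  # point on the path
        point.requires_grad_(True)
        model(point)[target].backward()                  # d(score)/d(point)
        total += point.grad
    return (x - baseline) * total / steps                # per-feature scores

x = torch.randn(4)
attributions = integrated_gradients(x, torch.zeros(4), target=0)
print(attributions)  # one importance score per input feature
```

The scores sum (approximately) to the difference between the model's output at the input and at the baseline, which is part of what gives this method its guarantees.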
Then finally, we're going to study intervention-based methods. I think I'll save the details for a later screencast, but the essence of this is that we're going to go into our models and study them causally. We are going to manipulate their internal states and see how those manipulations change the input-output behavior. That can give us genuinely causal insights, pushing us toward exactly the guarantees that we need.
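To give a flavor of what "manipulate their internal states" means in code, here is a minimal sketch using a PyTorch forward hook. It is my own illustration: a real analysis would typically swap in hidden states computed from other inputs (interchange interventions) rather than zeroing units out:

```python
# Intervene on a model's internal state and observe the effect on its output.
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 2))
x = torch.randn(1, 4)

def intervene(module, inputs, output):
    patched = output.clone()
    patched[:, :4] = 0.0   # overwrite the first four hidden units
    return patched         # returning a tensor replaces the module's output

handle = model[1].register_forward_hook(intervene)  # hook after the ReLU
with torch.no_grad():
    intervened_out = model(x)
handle.remove()
with torch.no_grad():
    original_out = model(x)

# If the outputs differ, those hidden units causally contribute to the
# model's behavior on this input.
print(original_out, intervened_out)
```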
Let me, by way of wrapping up this opening screencast, offer an analytical framework for thinking about the methods that we're going to discuss. First, we want to characterize representations: input representations, output representations, and especially internal representations. Second, we want to be in a position to make causal claims about the role of those representations. Third, once we have started to learn about how the models behave, we would like to have an easy path to actually improving models based on those insights, so that we don't simply passively study them but rather actively make them better.
That's a scorecard. Let's think about these methods. Probing excels, as I said, at characterizing representations, but it does not support causal claims, and it offers no obvious path to improving models. Feature attribution methods give us only faint characterizations of the model-internal states: we pretty much just get weights that tell us how important the different neurons are to the input-output behavior. But we can get causal guarantees from some of these methods; we'll talk about integrated gradients as an example of that. Intervention-based methods score well across the board. With these methods, we can characterize representations, we can make causal claims, and, as you'll see, there's an easy path to using the insights we gain to actually improve our models. We will now begin systematically working through these three classes of methods, trying to more deeply understand how our models work.