Stanford XCS224U: NLU | Analysis Methods for NLU, Part 3: Feature Attribution | Spring 2023
Chapters
0:00
3:34 Attributions wrt predicted vs. actual labels
5:22 Gradients by inputs fails sensitivity
6:32 Integrated gradients: Intuition
7:27 Integrated gradients: Core computation
8:29 Sensitivity again
9:22 BERT example
12:11 A small challenge test
13:25 Summary
This is part three in our series on analysis methods for NLP. We're going to be focused on feature attribution methods. I should say at the start that, to keep things manageable, I'm going to focus on integrated gradients from Sundararajan et al. 2017. I think it's a good choice of attribution method for reasons I will discuss, but it's by no means the only method in this space. If you want to explore other approaches, the Captum library is a good place to start: it will give you access to lots of gradient-based methods like IG, among many others. In addition, if you would like a deeper dive on the calculations and examples that I use in this screencast, I recommend the course notebook on feature attribution.
Sundararajan et al. 2017 offer a really nice framework for thinking about attribution in general, and they do that in the form of two axioms. Of the two, the most important one is sensitivity. The axiom of sensitivity for attribution methods says: if two inputs x and x′ differ only at dimension i and the model gives them different predictions, then the feature associated with dimension i has non-zero attribution. For example, suppose we have a model that takes inputs that are three-dimensional, and two inputs that differ only in position 2 get different outputs; that means that the feature in position 2 must have non-zero attribution. This seems very intuitive, because obviously this feature made a difference to the model's behavior.
Just quickly, I'll mention a second axiom, implementation invariance. It says that if two models M and M′ have identical input-output behavior, then the attributions for M and M′ are identical. If the models can't be distinguished behaviorally, then we should give them identical attributions: we should not be sensitive to incidental details of how they were structured or how they were implemented.
Let's begin with a baseline method, gradients by inputs. This is very intuitive and makes some sense from the perspective of doing feature attribution in deep networks. What we're going to do is calculate the gradient for our model with respect to the chosen feature that we want to target, and multiply that value by the actual value of the feature. I've described this in terms of input features, but obviously we could do this gradient-taking with respect to any neuron in one of these deep learning models and multiply it by the actual value of that neuron for some example. In that sense, the method generalizes really nicely, and it's really straightforward to implement. The notebook gives two implementations; the second one is just a lightweight wrapper around the Captum implementation of input-by-gradient.
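As a rough illustration of the idea (not the notebook's exact code), here is a minimal sketch of gradients-by-inputs in PyTorch, together with the equivalent call to Captum's InputXGradient. The names `model`, `x`, and `target` are placeholders for whatever classifier, float input tensor, and class index you happen to be studying.

```python
import torch
from captum.attr import InputXGradient

def grad_times_input(model, x, target):
    """Hand-rolled gradients-by-inputs for one class of a differentiable model.

    Assumes `model` maps a float tensor x of shape (batch, d) to logits of
    shape (batch, num_classes); `target` is the class index to attribute.
    """
    x = x.clone().requires_grad_(True)
    score = model(x)[:, target].sum()  # scalar score for the chosen class
    score.backward()                   # populates x.grad with d(score)/dx
    return (x.grad * x).detach()       # elementwise gradient times input value

# The same attributions via Captum's built-in implementation:
# attributions = InputXGradient(model).attribute(x, target=target)
```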
One issue that I want to linger over here, because I find it genuinely interesting, is whether we should compute attributions with respect to the predicted labels or with respect to the actual labels. Now, if the model you're studying is very high-performing, the predicted and actual labels will be almost identical, and this is unlikely to matter. But you might be studying a model that makes lots of mistakes, and that's precisely the case where these two will come apart. If I do attributions with respect to the true labels, I get one set of attribution scores. If I do attributions with respect to the predicted labels, I get a totally different set of attribution scores. You have to decide which ones you want to use to guide your analysis, because they're giving us different pictures of this model. I think it really comes down to what you're trying to accomplish with the analysis that you're constructing.
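In Captum-style APIs, this choice shows up as nothing more than the target argument. A hedged sketch, where `model`, `X`, and `y_true` are illustrative placeholders (X is assumed to be a float feature tensor, y_true a tensor of gold label indices):

```python
import torch
from captum.attr import InputXGradient

attributer = InputXGradient(model)
y_pred = model(X).argmax(dim=-1)                     # the model's own predictions

attr_true = attributer.attribute(X, target=y_true)   # attributions w.r.t. the actual labels
attr_pred = attributer.attribute(X, target=y_pred)   # attributions w.r.t. the predicted labels
# For a low-accuracy model, these two sets of scores can look very different.
```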
This issue, by the way, will carry forward to integrated gradients, which also requires you to choose a target class.
Here's the fundamental sticking point for gradients by inputs. Consider a tiny one-dimensional model whose output is one minus the ReLU activation applied to one minus the input. The inputs 0 and 2 differ in this single dimension, and the model gives them different outputs (0 versus 1), so we are now required by sensitivity to give this feature non-zero attribution. But when you run input-by-gradients on the input 0, you get 0, and when you run input-by-gradients on the input 2, you also get 0. That's a worrisome thing about this baseline method.
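Here is a minimal sketch of that counterexample as I understand it, following the construction in Sundararajan et al. 2017: the model is f(x) = 1 − ReLU(1 − x), and the two inputs are 0 and 2.

```python
import torch

def f(x):
    # One-dimensional model: 1 - ReLU(1 - x)
    return 1.0 - torch.relu(1.0 - x)

def grad_times_input(x0):
    x = torch.tensor(float(x0), requires_grad=True)
    f(x).backward()                # gradient of the output with respect to x
    return (x.grad * x).item()     # gradients-by-inputs attribution

print(f(torch.tensor(0.0)).item(), f(torch.tensor(2.0)).item())  # 0.0 vs 1.0: the outputs differ
print(grad_times_input(0.0), grad_times_input(2.0))              # 0.0 and 0.0: both attributions vanish
```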
The intuition behind integrated gradients is that we are going to explore counterfactual versions of our input. As we try to get causal insights into model behavior, it becomes ever more essential to think about counterfactuals. We start with the actual example, the example that we would like to do attributions for. With integrated gradients, the first thing we do is set up a baseline for that example. Then we are going to create synthetic examples interpolated between the baseline and our actual example. We are going to study the gradients with respect to every single one of these interpolated examples and aggregate them into attributions.
Here's a look at the IG calculation in some detail, as you might implement it for an actual piece of software. First we choose a number of steps, and this is going to determine how many interpolated inputs we use to approximate the integral. We're going to interpolate these inputs between the baseline and the actual example. Then we compute the gradients for each interpolated input. Next, we do an integral approximation by averaging over all of these examples that we've created and computed gradients for. Then, finally, we do some scaling to remain in the space of the original example: we multiply the averaged gradients by the difference between the actual input and the baseline.
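Here is a from-scratch sketch of those steps in PyTorch. It assumes a differentiable `model` that maps a batch of d-dimensional feature vectors to one score per example; it is meant to mirror the recipe above, not any particular library's implementation.

```python
import torch

def integrated_gradients(model, x, baseline, n_steps=50):
    # 1. Interpolate between the baseline and the actual input.
    alphas = torch.linspace(0.0, 1.0, n_steps + 1).unsqueeze(-1)   # (n_steps+1, 1)
    interpolated = baseline + alphas * (x - baseline)              # (n_steps+1, d)
    interpolated.requires_grad_(True)

    # 2. Compute the gradients at every interpolated input.
    outputs = model(interpolated)
    grads = torch.autograd.grad(outputs.sum(), interpolated)[0]    # (n_steps+1, d)

    # 3. Approximate the path integral by averaging the gradients.
    avg_grads = grads.mean(dim=0)                                  # (d,)

    # 4. Scale back into the space of the actual example.
    return (x - baseline) * avg_grads
```

On the counterexample above, `integrated_gradients(f, torch.tensor([2.0]), torch.tensor([0.0]))` returns roughly 0.98 with the default 50 steps, approaching 1 as n_steps grows.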
Back to our model M with its one-dimensional inputs, the one that made gradients by inputs fail sensitivity. We run the IG computation for the input 2 against the baseline 0, working through those gradient calculations and averaging through them. The result of all of that summing and averaging is an attribution of approximately one, depending on how many steps you use in the approximation. This example is no longer a counterexample to sensitivity. In fact, it's provable that IG satisfies the sensitivity axiom.
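You can check this with Captum's own IntegratedGradients implementation as well. A small sketch (the exact value depends on n_steps and on the integral approximation Captum uses):

```python
import torch
from captum.attr import IntegratedGradients

def forward(x):                                       # x: (batch, 1)
    return (1.0 - torch.relu(1.0 - x)).squeeze(-1)    # one score per example

ig = IntegratedGradients(forward)
attr = ig.attribute(torch.tensor([[2.0]]),
                    baselines=torch.tensor([[0.0]]),
                    n_steps=300)
print(attr)   # close to 1.0, so the feature gets non-zero attribution
```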
In practice, of course, we're likely to be thinking about BERT-style models, where I have a lot of hidden states and a lot of other internal quantities that I might want to do attributions for. The fundamental thing about IG that makes it so freeing is that we can do attributions with respect to any neuron in any state in this entire model. That starts to give us some purchase on the causal efficacy of neurons on input-output behavior.
Let me walk through the code at a high level for now. To use Captum, you need to define your own probabilistic prediction function, and you need to define a function that will create for you representations for your baseline as well as for your actual example. Those are the things that IG will interpolate between. The forward function and whatever layer I pick are the core arguments to LayerIntegratedGradients in Captum. For this example, I'll take attributions with respect to the positive class, which is the true class. Then I call the attribution method on the inputs and the baseline IDs, with the target set to the true class. The result of that is a set of scores, which I then average across the individual representations in the BERT model. That's an additional assumption that I'm bringing in: that it makes sense to average attributions across these hidden representations. A little more calculating, and then Captum provides a nice function for visualizing these things. What you get out are little snippets of HTML that summarize the attributions and associated metadata.
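Here is a hedged sketch of that pipeline using Captum's LayerIntegratedGradients with a Hugging Face sentiment classifier. The checkpoint name, the example sentence, the choice of the embedding layer, and the baseline construction (padding out every non-special token) are illustrative assumptions of mine, not necessarily what the course notebook does.

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"   # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

def forward(input_ids):
    # Probabilistic prediction function: softmax over the classifier logits.
    return torch.softmax(model(input_ids).logits, dim=-1)

def make_inputs(text):
    # Actual example plus a baseline that keeps [CLS]/[SEP] and pads out everything else.
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    base_ids = input_ids.clone()
    base_ids[0, 1:-1] = tokenizer.pad_token_id
    return input_ids, base_ids

input_ids, base_ids = make_inputs("This was a great movie.")
lig = LayerIntegratedGradients(forward, model.distilbert.embeddings)  # layer to attribute over
scores = lig.attribute(input_ids, baselines=base_ids,
                       target=1)   # positive class for this checkpoint, taken as the true class
token_scores = scores.sum(dim=-1).squeeze(0)   # collapse the embedding dimension per token
print(list(zip(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()),
               token_scores.tolist())))

# Captum also ships helpers for rendering scores like these as HTML:
# from captum.attr import visualization as viz
# viz.visualize_text([viz.VisualizationDataRecord(...)])   # arguments elided here
```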
It can be a little hard to keep track of what the attribution label in these visualizations is actually supposed to convey. But the nice thing is that we have the color coding: green means evidence toward the positive label, and negative scores, shown in red, mean evidence away from the true label.
Here's a fuller example, and for this one, to avoid confusing myself, I relabeled the legend away from the true-label terminology. What you see is a lot of activation around tokens like wrong and less activation for things like great. These are attributions with respect to the true label. My guess about these attributions is that the model is actually doing something reasonable with this example, and this slide is a reassuring picture that the model is doing what we hope it is doing.
To summarize: feature attribution methods give okay characterizations of the representations. You really just get a scalar value for the degree of importance, and you have to make further guesses about what role the feature is actually playing. We can get some causal guarantees when we use methods like IG. But I'm afraid that there's no clear, direct path to using methods like IG to directly improve models; at best, they give you some ambient information that might guide your own decisions about how to improve the model. Still, IG is a powerful, pretty flexible heuristic method that can offer useful insights about how models are solving tasks.