
Stanford XCS224U: NLU | Analysis Methods for NLU, Part 3: Feature Attribution | Spring 2023


Chapters

0:00
3:34 Attributions wrt predicted vs. actual labels
5:22 Gradients × inputs fails sensitivity
6:32 Integrated gradients: Intuition
7:27 Integrated gradients: Core computation
8:29 Sensitivity again
9:22 BERT example
12:11 A small challenge test
13:25 Summary

Transcript

Welcome back everyone. This is part three in our series on analysis methods for NLU. We're going to be focused on feature attribution methods. I should say at the start that, to keep things manageable, we're going to focus mainly on integrated gradients from Sundararajan et al. 2017. This is a shining, powerful, inspiring example of an attribution method, for reasons I will discuss.

But it's by no means the only method in this space. For one-stop shopping on these methods, I recommend the captum.ai library. It will give you access to lots of gradient-based methods like IG, as well as many others, including more traditional methods like feature ablation and feature permutation. Check out captum.ai.

In addition, if you would like a deeper dive into the calculations and examples that I use in this screencast, I recommend the notebook on feature attribution that is part of the course code repository. Now, I love the integrated gradients paper, Sundararajan et al. 2017, because of its method, but also because it offers a really nice framework for thinking about attribution in general, which they set out in the form of three axioms.

I'm going to talk about two of them. Of the two, the most important one is sensitivity. This is very intuitive. The axiom of sensitivity for attribution methods says: if two inputs x and x′ differ only at dimension i and lead to different predictions, then the feature associated with dimension i has non-zero attribution.

Here's a quick example. Our model is M, and it takes inputs that are three-dimensional. For input [1, 0, 1], this model outputs positive, and for [1, 1, 1], it outputs negative. That's a difference in the predictions, and that means that the feature in position 2 must have non-zero attribution.

Seems very intuitive because obviously this feature is important to the behavior of this model. Just quickly, I'll mention a second axiom, implementation invariance. If two models, M and M prime, have identical input-output behavior, then the attributions for M and M prime are identical. That's very intuitive. If the models can't be distinguished behaviorally, then we should give them identical attributions.

We should not be sensitive to incidental details of how they were structured or how they were implemented. Let's begin with a baseline method, gradients by inputs. This is very intuitive and makes some sense from the perspective of doing feature attribution in deep networks. What we're going to do is calculate the gradients for our model with respect to the chosen feature that we want to target and multiply that value by the actual value of the feature.

It's called gradients by inputs, but we could take this gradient with respect to any neuron in one of these deep learning models and multiply it by the actual value of that neuron for some example. In fact, this method generalizes really nicely to any state in a deep learning model.
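To make this concrete, here is a minimal sketch of the computation for a toy setting, assuming a model that maps a batch of feature vectors to class logits; the function name and setup are my own illustration, not the course repository's code:

```python
import torch

def gradients_by_inputs(model, features, target_class):
    """Gradient of the target-class score with respect to each input
    feature, multiplied elementwise by the feature value itself."""
    features = features.clone().detach().requires_grad_(True)
    logits = model(features)               # assumed shape: (batch, n_classes)
    score = logits[:, target_class].sum()  # scalar we differentiate
    grads = torch.autograd.grad(score, features)[0]
    return grads * features                # one attribution per feature value
```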

It's really straightforward to implement. I've depicted that on the slide here: the first implementation uses raw PyTorch, and the second one is just a lightweight wrapper around the Captum implementation of input × gradient. That shows you how straightforward this can be. One issue that I want to linger over here, which I find conceptually difficult, is the question of how we should do the attributions.

For classifier models, we have a choice. We can take attributions with respect to the predicted labels or with respect to the actual labels, which are two different dimensions in the output vector for these models. Now, if the model you're studying is very high-performing, then the predicted and actual labels will be almost identical and this is unlikely to matter.

But you might be trying to study a really poorly performing model to try to understand where its deficiencies lie, and that's precisely the case where these two will come apart. As an illustration, on this slide I've defined a simple make_classification synthetic problem using scikit-learn. It has four features.

Then I set up a shallow neural classifier and I deliberately under-trained it. It has just one training iteration. This is a very bad model. If I do attributions with respect to the true labels, I get one vector of attribution scores. If I do attributions with respect to the predicted labels, I get a totally different set of attribution scores.
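The slide's code isn't reproduced in the transcript. Here is a hedged sketch of that kind of experiment, assuming a tiny PyTorch classifier and Captum's InputXGradient; the layer sizes and single optimizer step are my stand-ins for the course's shallow neural classifier:

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_classification
from captum.attr import InputXGradient

# Synthetic four-feature binary classification problem.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y)

# A small classifier, deliberately under-trained: one gradient step only.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
opt = torch.optim.Adam(model.parameters())
loss = nn.CrossEntropyLoss()(model(X), y)
loss.backward()
opt.step()

preds = model(X).argmax(dim=1)
ixg = InputXGradient(model)

# Attributions with respect to the true labels vs. the predicted labels.
attr_true = ixg.attribute(X, target=y).mean(dim=0)
attr_pred = ixg.attribute(X, target=preds).mean(dim=0)
print(attr_true)
print(attr_pred)  # typically quite different for a model this bad
```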

That confronts you with a difficult conceptual question of which ones you want to use to guide your analysis. They're giving us different pictures of this model. I think that there is no a priori reason to favor one over the other. I think it really comes down to what you're trying to accomplish with the analysis that you're constructing.

The best answer I can give is to be explicit about your assumptions and about the methods that you used. This issue, by the way, will carry forward through all of these gradient-based methods, not just gradients by inputs. Here's the fundamental sticking point for gradients by inputs: it fails the sensitivity axiom.

Here is a counterexample to sensitivity that comes directly from Sundararajan et al. 2017. We have a very simple model M. It takes one-dimensional inputs, and what it computes is one minus the ReLU activation applied to one minus the input: M(x) = 1 − ReLU(1 − x). Very simple model. When we use the model with input 0, we get 0 as the output.

When we use the model with input 2, we get 1 as the output. We have a difference in output predictions. These are one-dimensional inputs, so we are now required by sensitivity to give non-zero attribution to this feature. But sadly, we do not. When you run gradients by inputs on input 0, you get 0, and when you run it on input 2, you also get 0, and that is a direct failure to meet the sensitivity requirement.
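Here is a quick check of that failure in PyTorch; the model definition follows the formula above, and the rest is just my own verification code:

```python
import torch

def M(x):
    # The counterexample model: M(x) = 1 - ReLU(1 - x)
    return 1 - torch.relu(1 - x)

for val in (0.0, 2.0):
    x = torch.tensor(val, requires_grad=True)
    y = M(x)
    grad = torch.autograd.grad(y, x)[0]
    print(f"x={val}: output={y.item()}, gradients by inputs={(grad * x).item()}")
# x=0.0: output=0.0, gradients by inputs=0.0
# x=2.0: output=1.0, gradients by inputs=0.0
```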

That's a worrisome thing about this baseline method, and it sets us up nicely to think about how integrated gradients will do better. The intuition behind integrated gradients is that we are going to explore counterfactual versions of our input, and I think that is an important insight: as we try to get causal insights into model behavior, it becomes ever more essential to think about counterfactuals.

Here's the way IG does this. We have two features in our space, X_1 and X_2, and the blue dot represents the example that we would like to do attributions for. With integrated gradients, the first thing we do is set up a baseline, and a standard baseline for this would be the zero vector.

Then we are going to create synthetic examples interpolated between the baseline and our actual example. We are going to study the gradients with respect to every single one of these interpolated examples, aggregate them together, and use all of that information to do our attributions. Here's a look at the IG calculation in some detail as you might implement it for an actual piece of software.
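The slide itself isn't reproduced in the transcript; as a rough stand-in, here is a from-scratch sketch of the approximation, where the function name and the assumption that the model maps a batch of inputs to class logits are mine:

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=50):
    # Step 1: alphas in (0, 1] set how many interpolation steps we take.
    alphas = torch.linspace(1.0 / steps, 1.0, steps).view(-1, 1)
    # Step 2: interpolate inputs between the baseline and the actual example.
    interpolated = baseline + alphas * (x - baseline)
    interpolated.requires_grad_(True)
    # Step 3: gradients of the target score at each interpolated input
    # (each individual one looks just like the gradient step in gradients by inputs).
    scores = model(interpolated)[:, target].sum()
    grads = torch.autograd.grad(scores, interpolated)[0]
    # Step 4: average the gradients (a Riemann approximation of the integral),
    # then scale by (x - baseline) to stay in the region of the original input.
    return (x - baseline) * grads.mean(dim=0)
```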

Let's break this down into some pieces. Step 1, we have this vector of alphas, and this is going to determine the number of steps that we use to get different synthetic inputs between the baseline and the actual example. We're going to interpolate these inputs between the baseline and the actual example.

That's what happens in purple here, according to these alpha steps. Then we compute the gradients for each interpolated input, and each of those individual calculations looks exactly like gradients by inputs, of course. Next step, we do an integral approximation through averaging; that's the summation that you see.

We're going to sum over all of these examples that we've created and computed gradients for. Then finally, we do some scaling to remain in the region of the original example. That is the complete IG calculation. Let's return to sensitivity. We have our model M with its one-dimensional inputs, M(x) = 1 − ReLU(1 − x).

This is the example from Sundararajan et al. I showed you before that gradients by inputs fails sensitivity for this example and this model. Integrated gradients does better. The reason it does better, as you can see here, is that we are summing over all of those gradient calculations and averaging over them.

The result of all of that summing and averaging is an attribution of approximately 1, depending on exactly which steps you decide to use for the IG calculation. This example is no longer a counterexample to sensitivity. In fact, it's provable that IG satisfies the sensitivity axiom.
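As a quick numerical check of that claim, reusing the integrated_gradients sketch from above (same model and same assumptions):

```python
import torch

M = lambda inputs: 1 - torch.relu(1 - inputs)  # the counterexample model again
x = torch.tensor([2.0])
baseline = torch.zeros_like(x)
attr = integrated_gradients(M, x, baseline, target=0, steps=300)
print(attr)  # roughly 1.0: a non-zero attribution, as sensitivity requires
```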

Let's think in practical terms now. We're likely to be thinking about BERT-style models. This is a cartoon version of BERT, where I have some output labels up at the top, a lot of hidden states, and a lot of things happening all the way down to perhaps multiple fixed embedding layers.

The fundamental thing about IG that makes it so freeing is that we can do attributions with respect to any neuron in any state in this entire model. We have some of the flexibility of probing, but now we will get causal guarantees that our attributions relate to the causal efficacy of neurons on input-output behavior.

Here's a complete worked example that you might want to work with yourselves, modify, study, and so forth. Let me walk through it at a high level for now. The first thing I do is load in a Twitter-based sentiment classifier built on RoBERTa that I got from Hugging Face. For the sake of Captum, you need to define your own probabilistic prediction function, and you need to define a function that will create representations for your baseline as well as for your actual example.

Those are the things that IG will interpolate between. You need one more function to define the forward part of your model. Here I just needed to grab the logits, and then the forward function and whatever layer I pick are the core arguments to layer integrated gradients in Captum.

Here's my example. This is illuminating. It has true class positive, and I'll take attributions with respect to the positive class, the true class. Here are my baseline and actual inputs, and here's the actual attribution step: the inputs, the baseline IDs, the target (which is the true class), and one other keyword argument.

The result of that is a set of scores, which I then z-score normalize across the individual representations in the BERT model. That's an additional assumption that I'm bringing in, which does averaging of attributions across these hidden representations. A little more calculating, and then Captum provides a nice function for visualizing these things.
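The notebook code isn't reproduced in the transcript. Here is a rough, self-contained sketch of the pipeline being described; the checkpoint name (cardiffnlp/twitter-roberta-base-sentiment), the label index for "positive", the pad-token baseline, and the final normalization are my assumptions, not necessarily the notebook's exact choices:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from captum.attr import LayerIntegratedGradients

model_name = "cardiffnlp/twitter-roberta-base-sentiment"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def forward_func(input_ids):
    # Captum needs a forward function; here we just grab the logits.
    return model(input_ids).logits

text = "This movie was great!"
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
# Baseline: same length, but with every token replaced by the pad token.
base_ids = torch.full_like(input_ids, tokenizer.pad_token_id)

# Attribute with respect to the embedding layer; target 2 is assumed to be
# the "positive" class for this checkpoint.
lig = LayerIntegratedGradients(forward_func, model.roberta.embeddings)
attrs = lig.attribute(input_ids, baselines=base_ids, target=2)

# Collapse the per-dimension attributions into one score per token and
# normalize (the lecture z-scores across representations instead).
scores = attrs.sum(dim=-1).squeeze(0)
scores = scores / scores.norm()
for tok, s in zip(tokenizer.convert_ids_to_tokens(input_ids[0]), scores.tolist()):
    print(f"{tok}\t{s:.3f}")
```

Captum's captum.attr.visualization module (for example, visualize_text) is the kind of helper being referred to for rendering scores like these as color-coded HTML.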

What you get out are little snippets of HTML that summarize the attributions and associated metadata. I've got the true label and the predicted label, and those align. There are some scores, and I'm not sure, actually, what the attribution label field is supposed to do. But the nice thing is that we have color coding here, with color proportional to the attribution score.

You have to be a little bit careful here. Green means positive attribution, that is, evidence toward the label we're attributing with respect to, which here is the true label; in sentiment, that true label might itself be negative. Negative attribution is evidence away from the true label, and white is neutral. Here's a fuller example, and for this one, to avoid confusing myself, I relabeled the legend as "away from the true label", "neutral with respect to the true label", and "toward the true label".

This is very intuitive. Where the true label is positive, we get strong attributions for "great". Where the true label is negative, we get strong attributions for words like "wrong" and less activation for things like "great". It's also intuitive that we're getting more activation for "said" than for "great", suggesting the model has learned that the reporting verb modulates the positive sentiment in its scope.

Then down at the bottom here, we have one of these tricky situations: the true label is 0 and the predicted label is 1. These are attributions with respect to the true label. We're seeing, for example, that "incorrect" is biased toward negative. My guess about these attributions is that the model is actually doing pretty well with this example and maybe missed for some incidental reason.

But overall, I would say qualitatively, this slide is a reassuring picture that the model is doing something systematic with its features in making these sentiment predictions. To summarize, feature attribution can give okay characterizations of the representations. You really just get a scalar value about the degree of importance and you have to make further guesses about why representations are important, but it's still useful guidance.

We can get causal guarantees when we use methods like IG. But I'm afraid there's no clear, direct path to using methods like IG to directly improve models. It's as if you've just got some ambient information that might guide you in a subsequent and separate modeling step that would improve your model.

That's a summary of feature attribution, a powerful, pretty flexible heuristic method that can offer useful insights about how models are solving tasks.