
Stanford XCS224U: NLU | Analysis Methods for NLU, Part 1: Overview | Spring 2023


Chapters

0:00 Intro
0:11 Varieties of evaluation
0:50 Limits of behavioral testing
3:37 Models today
4:16 The interpretability dream
5:08 Progress on benchmarks
6:08 Systematicity
7:31 A crucial prerequisite
8:50 Probing internal representations
10:37 Feature attribution
11:34 Intervention-based methods
12:06 Analytical framework

Transcript

Welcome everyone. This screencast kicks off our unit on analysis methods in NLP. In the previous unit of the course, we were very focused on behavioral testing, and we looked in particular at hypothesis-driven challenge and adversarial tests as a vehicle for deeply understanding how our models will behave, especially in unfamiliar scenarios.

What we're going to try to do in this unit is go one layer deeper and talk about what I've called structural methods, including probing, feature attribution, and a class of intervention-based methods. The idea is that we're going to go beyond simple behavioral testing to understand, we hope, the causal mechanisms that are guiding the input-output behavior of our models.

In the previous unit, I tried to make you very aware of the limits of behavioral testing. Of course, it plays an important role in the field, and it will complement the methods that we discuss, but it is intrinsically limited in ways that should worry us when it comes to offering guarantees about how models will behave.

To make that very vivid for you, I used this example of an even-odd detector. Let me walk through it again now from a slightly different perspective, namely the illuminated feeling that we get when we finally see how the model actually works. Recall that this even-odd model takes in strings like four and predicts whether they refer to even or odd numbers.

Four comes in and it predicts even, 21 and it rightly predicts odd, 32 even, 36 even, 63 odd. This is all making you feel that the model is a good model of even-odd detection. But you need to be careful, you've only done five tests. Now I show you how the model actually works, and it is immediately revealed to you that this is a very poor model.

We got lucky with our first five inputs. It is a simple lookup on those inputs, and when it gets an unfamiliar input, it defaults to predicting odd. Once we see that, we know exactly how the model is broken, and exactly how to foil it behaviorally. We input 22, and it thinks that that is odd.
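
To make the failure mode concrete, here is a minimal sketch of what such a lookup-based model might look like; the exact lookup keys and the default are invented for illustration.

```python
# Hypothetical sketch of model 1: a lookup table over the handful of inputs
# we happened to test, with a default of "odd" for anything unfamiliar.
LOOKUP = {"4": "even", "21": "odd", "32": "even", "36": "even", "63": "odd"}

def model_1(number_string: str) -> str:
    # Known inputs are answered correctly; everything else falls through
    # to the default, so an unseen even number like "22" comes back "odd".
    return LOOKUP.get(number_string, "odd")

print(model_1("4"))   # even
print(model_1("22"))  # odd: wrong, and the code shows exactly why
```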

But then we get a second even-odd model. It passes the first five tests, it makes a good prediction about 22, good prediction about 5, and 89, and 56. Again, your confidence is building, but you should be aware of the fact that you might have missed some crucial examples. Again, when I show you the inner workings of this model, you get immediately illuminated about where it works and where it doesn't.

This model is more sophisticated. It tokenizes its input and uses the final token as the basis for predicting even-odd. That is a pretty good theory, but it has this else clause where it predicts odd, and now we know exactly how to foil the model. We input 16, and it thinks that that is odd.
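
Again as a concrete illustration, here is a minimal sketch of a model with this shape; the toy tokenizer and the exact token inventory are invented, chosen only so that the behavior matches the lecture's examples (56 handled correctly, 16 not).

```python
# Hypothetical sketch of model 2: tokenize the input, look up the final
# token in a table of known tokens, and fall back to "odd" otherwise.
KNOWN_TOKENS = {
    "2": "even", "4": "even", "22": "even", "32": "even", "36": "even", "56": "even",
    "1": "odd", "5": "odd", "21": "odd", "63": "odd", "89": "odd",
}

def tokenize(number_string: str) -> list[str]:
    # Toy tokenizer: keep the whole string as one token if it is in the
    # vocabulary, otherwise split it into individual digits.
    if number_string in KNOWN_TOKENS:
        return [number_string]
    return list(number_string)

def model_2(number_string: str) -> str:
    final_token = tokenize(number_string)[-1]
    if final_token in KNOWN_TOKENS:
        return KNOWN_TOKENS[final_token]
    else:
        return "odd"   # the else clause that lets us foil the model

print(model_2("56"))  # even
print(model_2("16"))  # odd: "16" splits into ["1", "6"], and "6" is unknown, so the else clause fires
```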

It was really at the point at which we got to see the internal causal mechanisms that we knew exactly how the model would work, and exactly where it would fail. Now we move at last to model 3. Let's suppose that it gets all of those previous inputs correct. Is it the one true model of even-odd detection?

Well, we can keep up our behavioral testing, but you should see by now that no matter how many inputs we offer this model, we will never get a guarantee for every integer string that it will behave as intended. For that guarantee, we need to look inside this black box.

But of course, in the modern era of NLP, models are hardly ever as easy to understand as the symbolic programs that I was just showing you. Instead, our models look like this huge array of bird's nests: lots of internal states all connected to all the other states, completely opaque. They consist mainly of weights and multiplications of weights, and they have no symbols in them.

Therefore, they are very difficult for us to understand as humans in a way that will illuminate how they'll behave in unfamiliar settings. Of course, the dream for these models is that somehow we'll see patterns of activation, something like this, that begin to reveal what is clearly a tree structure.

You might think, aha, the model actually does implicitly represent constituents or named entities or other kinds of meaningful units in language, and then you would feel like you truly understood it. But of course, that never happens. Instead, what we get when we look at these models is apparently just a mass of activations.

You get the feeling that either there's nothing systematic happening here or we're just looking at it incorrectly. I'm going to offer a hopeful message on this point: the mess is only apparent. When we use the right techniques and take the right perspective on these models, we find that the best of them have actually found really systematic and interesting solutions.

There's another angle we could take on this which connects back to the material on behavioral testing. I've shown this slide a few times in the course; it shows progress on benchmarks. Along the x-axis we have time, and the y-axis is a normalized measure of distance from our estimate of human performance, shown as the red line.

One perspective on this slide is that progress is incredible. Benchmarks used to take decades to saturate, and now saturation happens in a matter of years. The other perspective on this plot, of course, is that the benchmarks are too weak. We have a suspicion that even the models that are performing well on these tasks are very far from the human capability that we are trying to diagnose.

We feel that they have brittle solutions, concerning solutions that are going to reveal themselves in problematic ways. To really get past that concern, we need to go beyond this behavioral testing. There's another underlying motivation for this, which is systematicity. We talked about this in detail in the previous unit.

It's an idea from Fodor and Pylyshyn. They say, what we mean when we say that linguistic capacities are systematic is that the ability to produce or understand some sentences is intrinsically connected to the ability to produce or understand certain others. This is the idea that if you know what Sandy loves the puppy means, then you just know what the puppy loves Sandy means.

If you recognize the distributional affinity between the turtle and the puppy, you also understand the turtle loves the puppy, Sandy loves the turtle, and so forth and so on, for what is suddenly an enormous number of sentences. The human capacity for language makes it feel like these aren't new facts that you're learning, but rather things that follow directly from an underlying capability that you have.

We offered compositionality as one possible explanation for why in the language realm our understanding and use of language is so systematic. The related point here is that you get the feeling that we won't fully trust our models until we can validate that the solutions that they have found are also systematic or maybe even compositional in this way.

Otherwise, we'll have concerns that at crucial moments, their behaviors will seem arbitrary to us. There's another angle that you can take on this project of explaining model behaviors. The field has a lot of really crucial high-level goals that relate to safety, trustworthiness, and so forth. We want to be able to certify where models can be used and where they should not be used.

We want to be able to certify that our models are free from pernicious social biases and we want to offer guarantees that our models are safe in certain contexts. Given what I've said about behavioral testing, you can anticipate what I'll say now, behavioral testing alone will not suffice to achieve these goals.

It could possibly tell us that a model does have a pernicious social bias, or is unsafe in a certain context, or has a certain area where it should be disapproved for use. But the positive guarantees, that a model is free from social bias, safe in a context, or approved for a given use, will not be achieved until we get beyond behavioral testing.

For those, we need to understand at a deep level what our models are structured by and what mechanisms guide their behavior. We need analytic guarantees about how they will behave, and that means going beyond behavioral testing to really understand the causal mechanisms. In service of moving toward that goal, we're going to discuss, as I said, three main methods.

The first one is probing. There are some precedents in the literature before Tenney et al. 2019, but I think Tenney et al. deserve real credit for showing that probing was viable and interesting in the BERT era, because what they did is essentially fit small supervised models to different layers in the BERT architecture.

What they discovered is that there is a lot of systematic information encoded in those layers. This was really eye-opening. I think that most people believed that even though BERT was performant, it was performant in ways that depended on entirely unsystematic solutions. What probing began to suggest is that BERT had induced some really interesting causal structure about language as part of its training regime.
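
For concreteness, here is a minimal probing sketch in the spirit of that work, not Tenney et al.'s exact setup: freeze BERT, pull hidden states from one layer, and fit a small supervised classifier on top. It assumes the Hugging Face transformers and scikit-learn libraries, and the tiny dataset and task are purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_features(sentences, layer):
    """Mean-pooled hidden states from the chosen layer, with BERT frozen."""
    feats = []
    with torch.no_grad():
        for s in sentences:
            enc = tokenizer(s, return_tensors="pt")
            hidden = model(**enc).hidden_states[layer]   # (1, seq_len, 768)
            feats.append(hidden.mean(dim=1).squeeze(0).numpy())
    return feats

# Toy labeled data; a real probe would use an annotated dataset for
# part of speech, dependencies, coreference, and so on.
train_sents = ["the puppy sleeps", "Sandy loves the puppy", "the turtle runs", "it rained today"]
train_labels = [0, 1, 0, 0]   # illustrative binary task: contains a proper name

probe = LogisticRegression(max_iter=1000)
probe.fit(layer_features(train_sents, layer=8), train_labels)

# The probe's held-out accuracy, computed layer by layer, is the evidence
# that a given kind of information is (or is not) recoverable at that layer.
```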

The way this plot works is that we have the layers of BERT along the x-axis, and we have different phenomena in these different panels. What you can see in the blue especially, is that different kinds of information are emerging pretty systematically at different points in the BERT layer structure.

For example, part of speech seems to emerge around the middle. Dependency parses emerge a bit later, named entities are fainter and later in the structure, semantic roles pretty strong near the middle, coreference information emerging later in the network, and so forth and so on. This was really eye-opening because I think people didn't anticipate that all of this would be so accessible in the hidden representations of these models.

What we'll see is that probing is very rich in terms of characterizing these internal representations, but it cannot offer causal guarantees that this information is shaping model performance. We can complement that with a class of methods that are called feature attribution methods. The idea here is that we will essentially, in the deep learning context, study the gradients of our model and use those to understand which neurons and which collections of neurons are most guiding its input-output behavior.
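
As a rough sketch of the gradient-based idea, here is a generic gradient-times-input recipe rather than any one published method; the sentiment checkpoint named here is just an illustrative choice, and integrated gradients, discussed later, refines this basic recipe.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"   # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

sentence = "They said it would be great, but it was disappointing."
enc = tokenizer(sentence, return_tensors="pt")

# Embed the tokens ourselves so we can take gradients with respect to the embeddings.
embeddings = model.get_input_embeddings()(enc["input_ids"])
embeddings.retain_grad()
logits = model(inputs_embeds=embeddings, attention_mask=enc["attention_mask"]).logits
pred = logits.argmax(dim=-1).item()
logits[0, pred].backward()

# Gradient-times-input saliency: one attribution score per input token.
scores = (embeddings.grad * embeddings).sum(dim=-1).squeeze(0)
for tok, score in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), scores):
    print(f"{tok:>15s} {score.item():+.4f}")
```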

For these methods, we're going to get only faint characterizations of what the representations are doing, but we will get some causal guarantees. What I've got here to illustrate is a simple sentiment challenge set. There are a bunch of hard cases involving attitude-taking with verbs like say and shifts in sentiment.

What you see here in the highlighting is that the model seems to be making use of very intuitive information to shape what are very good predictions for these cases. Again, that might be reassuring to us that the model is doing something human interpretable and systematic under the hood. Then finally, we're going to study intervention-based methods.

This is a large class of methods. I think I'll save the details for a later screencast, but the essence of this is that we're going to perform brain surgery on our models. We are going to manipulate their internal states and study the effects that that has on their input-output behavior.
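
As a rough sketch of one intervention of this kind, here is an activation-swapping experiment written against Hugging Face transformers; the checkpoint, layer choice, and sentences are all illustrative assumptions, and the hook relies on DistilBERT's blocks returning their hidden state in a one-element tuple.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"   # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

LAYER = 3   # 0-indexed transformer block to intervene on

def hidden_state(sentence, index):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    return out.hidden_states[index]   # hidden_states[0] is the embedding output

# Donor activations from a positive sentence with the same token length.
donor = hidden_state("The movie was wonderful.", LAYER + 1)

def patch(module, inputs, output):
    # The block returns a one-element tuple holding its hidden state when
    # attentions are not requested; swap in the donor activations.
    return (donor,)

handle = model.distilbert.transformer.layer[LAYER].register_forward_hook(patch)
enc = tokenizer("The movie was terrible.", return_tensors="pt")
with torch.no_grad():
    patched_probs = model(**enc).logits.softmax(dim=-1)
handle.remove()

# If the prediction shifts toward positive, that layer's representation
# causally carries sentiment information for this input.
print(patched_probs)
```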

In that way, we can piece together an understanding of the causal mechanisms that shape the model's behavior, pushing us toward exactly the guarantees that we need. Let me, by way of wrapping up this opening screencast, offer you an analytical framework for thinking about the methods that we're going to discuss.

Let's say we have three goals. First, we want to characterize representations, input representations, output representations, but maybe most crucially, internal representations for our models. We also want to make causal claims about the role of those representations. Once we have started to learn about how the models behave, we would like to have an easy path to actually improving models based on those insights so that we don't simply passively study them but rather actively make them better.

That's a scorecard. Let's think about these methods. What we'll see is that probing is great, as I said, at characterizing representations, but it cannot offer causal inferences, and it's unclear whether there's a path from probing to actually improving models. For feature attributions, we get only faint characterizations of the model internal states.

We pretty much just get weights that tell us how much individual neurons contribute to the input-output behavior. But we can get causal guarantees from some of these methods. We'll talk about integrated gradients as an example of that. Then these intervention-based methods, I've got smileys across the board. This is the class of methods that I've been most deeply involved with.

It's the class of methods that I favor, and that is in large part because of how well they do on this scorecard. With these methods, we can characterize representations, we can offer causal guarantees, and as you'll see, there's an easy path to using the insights we gained to actually improve our models.

That's the name of the game for me. We will now begin systematically working through these three classes of methods trying to more deeply understand how they work and why my scorecard looks the way it does.