
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 17 - Model Analysis and Explanation


Chapters

0:00 Introduction
4:07 Why Care
6:19 Model biases
8:28 Deep model analysis
12:47 Natural language inference
14:48 HANS
17:40 How do models perform
19:27 Linguistic properties
24:38 Error rates
26:23 Examples
27:40 Questions
31:50 Unit Testing
36:19 Language Models
37:31 Long Term Memory
40:05 Saliency Maps
42:02 Simple Gradient Method
47:29 Example from Squad
48:52 Example from Quest
50:34 Breaking Models
53:22 Robust to Noise
56:23 Attention

Transcript

Welcome to CS224N, lecture 17, Model Analysis and Explanation. OK, look at us. We're here. Start with some course logistics. We have updated the policy on the guest lecture reactions. They're all due Friday, all at 11:59 PM. You can't use late days for this, so please get them in. Watch the lectures.

They're awesome lectures. They're awesome guests. And you get something like half a point for each of them. And yeah, all three can be submitted up through Friday. OK, so final projects. Remember that the due date is Tuesday. It's Tuesday at 4:30 PM, March 16. And let me emphasize that there's a hard deadline three days from then, on Friday.

We won't be accepting final projects-- even with additional points off-- that are submitted after the 4:30 deadline on Friday. We need to get these graded and get grades in. So it's the end stretch, week nine. Our week 10 lectures are really us giving you help on the final projects.

So this is really the last week of lectures. Thanks for all your hard work and for asking awesome questions in lecture and in office hours and on Ed. And let's get right into it. So today, we get to talk about one of my favorite subjects in natural language processing.

It's model analysis and explanation. So first, we're going to do what I love doing, which is motivating why we want to talk about the topic at all. We'll talk about how we can look at a model at different levels of abstraction to perform different kinds of analysis on it.

We'll talk about out-of-domain evaluation sets. So this will feel familiar to the robust QA folks. Then we'll talk about trying to figure out, for a given example, why did it make the decision that it made? It had some input. It produced some output. Can we come up with some sort of interpretable explanation for it?

And then we'll look at, actually, the representations of the models. So these are the sort of hidden states, the vectors that are being built throughout the processing of the model, try to figure out if we can understand some of the representations and mechanisms that the model is performing. And then we'll actually come back to one of the default states that we've been in this course, which is trying to look at model improvements, removing things from models, seeing how it performs, and relate that to the analysis that we're doing in this lecture, show how it's not all that different.

So if you haven't seen this XKCD, now you have. And it's one of my favorites. I'm going to say all the words. So person A says, this is your machine learning system. Person B says, yup, you pour the data into this big pile of linear algebra and then collect the answers on the other side.

Person A, what if the answers are wrong? And person B, just stir the pile until they start looking right. And I feel like, at its worst, deep learning can feel like this from time to time. You have a model. Maybe it works for some things. Maybe it doesn't work for other things.

You're not sure why it works for some things and doesn't work for others. And the changes that we make to our models are based on intuition. But frequently, what have the TAs told everyone in office hours? It's like, ah, sometimes you just have to try it and see if it's going to work out, because it's very hard to tell.

It's very, very difficult to understand our models on sort of any level. And so today, we'll go through a number of ways for trying to sort of carve out little bits of understanding here and there. So beyond it being important because it's in an XKCD comic, why should we care about understanding our models?

One is that we want to know what our models are doing. So here, you have a black box. Black box functions are sort of this idea that you can't look into them and interpret what they're doing. You have an input sentence, say, and then some output prediction. Maybe this black box is actually your final project model, and it gets some accuracy.

Now, we summarize our models. And in your final projects, you'll summarize your model with sort of one or a handful of summary metrics of accuracy or F1 score or BLEU score or something. But it's a lot of model to explain with just a small number of metrics. So what do they learn?

Why do they succeed, and why do they fail? What's another motivation? So we want to sort of know what our models are doing, OK. But maybe that's because we want to be able to make tomorrow's model. So today, when you're building models in this class or at a company, you start out with some kind of recipe that is known to work, either at the company or because you have experience from this class.

And it's not perfect. It makes mistakes. You look at the errors. And then over time, you take what works, maybe, and then you find what needs changing. So it seems like maybe adding another layer to the model helped. And maybe that's a nice tweak, and the model performance gets better, et cetera.

And incremental progress doesn't always feel exciting. But I want to pitch to you that it's actually very important for us to understand how much incremental progress can kind of get us towards some of our goals. So that we can have a better job of evaluating when we need big leaps, when we need major changes, because there are problems that we're attacking with our incremental sort of progress, and we're not getting very far.

OK, so we want to make tomorrow's model. Another thing that is, I think, very related to and sort of both a part of and bigger than this field of analysis is model biases. So let's say you take your word analogies solver from GloVe or Word2Vec-- that is, from assignment one-- and you give it the analogy, man is to computer programmer as woman is to-- and it gives you the output, homemaker.

This is a real example from the paper below. You should be like, wow, well, I'm glad I know that now. And of course, you saw the lecture from Yulia Tsvetkov last week. You say, wow, I'm glad I know that now. And that's a huge problem. What did the model use in its decision?

What biases is it learning from data and possibly making even worse? So that's the kind of thing that you can also do with model analysis beyond just making models better according to some sort of summary metric as well. And then another thing, we don't just want to make tomorrow's model.

And this is something that I think is super important. We don't just want to look at that time scale. We want to say, what about 10, 15, 25 years from now? What kinds of things will we be doing? What are the limits? What can be learned by language model pre-training?

What's the model that will replace the transformer? What's the model that will replace that model? What does deep learning struggle to do? What are we sort of attacking over and over again and failing to make significant progress on? What do neural models tell us about language potentially? There's some people who are primarily interested in understanding language better using neural networks.

Cool. How are our models affecting people, transferring power between groups of people, governments, et cetera? That's an excellent type of analysis. What can't be learned via language model pre-training? So that's sort of the complementary question there. If you sort of come to the edge of what you can learn via language model pre-training, is there stuff that we need total paradigm shifts in order to do well?

So all of this falls under some category of trying to really deeply understand our models and their capabilities. And there's a lot of different methods here that we'll go over today. And one thing that I want you to take away from it is that they're all going to tell us some aspect of the model elucidates some kind of intuition or something, but none of them are we going to say, aha, I really understand 100% about what this model is doing now.

So they're going to provide some clarity, but never total clarity. And one way, if you're trying to decide how you want to understand your model more, I think you should sort of start out by thinking about is, at what level of abstraction do I want to be looking at my model?

So the sort of very high level abstraction, let's say you've trained a QA model to estimate the probabilities of start and end indices in a reading comprehension problem, or you've trained a language model that assigns probabilities to words in context. You can just look at the model as that object.

So it's just a probability distribution defined by your model. You are not looking into it any further than the fact that you can sort of give it inputs and see what outputs it provides. So that's like, who even cares if it's a neural network? It could be anything. But it's a way to understand its behavior.

Another level of abstraction that you can look at, you can dig a little deeper. You can say, well, I know that my network is a bunch of layers that are kind of stacked on top of each other. You've got sort of maybe your transformer encoder with one layer, two layer, three layer.

You can try to see what it's doing as it goes deeper in the layers. So maybe your neural model is a sequence of these vector representations. A third option of sort of specificity is to look at as much detail as you can. You've got these parameters in there. You've got the connections in the computation graph.

So now you're sort of trying to remove all of the abstraction that you can and look at as many details as possible. And all three of these sort of ways of looking at your model and performing analysis are going to be useful and will actually sort of travel slowly from one to two to three as we go through this lecture.

OK. So we haven't actually talked about any analyses yet. So we're going to get started on that now. And we're starting with sort of testing our models' behaviors. So we want to see: will my model perform well? I mean, the natural thing to ask is, well, how does it behave on some sort of test set?

And so we don't really care about mechanisms yet. Why is it performing this way? By what method is it making its decision? Instead, we're just interested in the higher level of abstraction of, does it perform the way I want it to perform? So let's take our model evaluation that we are already doing and sort of recast it in the framework of analysis.

So you've trained your model on some samples from some distribution. So you've got input/output pairs of some kind. So how does the model behave on samples from the same distribution? It's a simple question. And it's sort of-- it's known as in-domain accuracy. Or you can say that the samples are IID, and that's what you're testing on.

And this is just what we've been doing this whole time. It's your test set accuracy, or F1, or BLEU score. And so you've got some model with some accuracy. And maybe it's better than some model with some other accuracy on this test set. So this is what you're doing as you're iterating on your models in your final project as well.

You say, well, on my test set, which is what I've decided to care about for now, model A does better. They both seem pretty good. And so maybe I'll choose model A to keep working on. Maybe I'll choose it if you were putting something into production. But remember back to this idea that it's just one number to summarize a very complex system.

It's not going to be sufficient to tell you how it's going to perform in a wide variety of settings. OK. So we've been doing this. This is model evaluation as model analysis. Now we are going to say, what if we are not testing on exactly the same type of data that we trained on?

So now we're asking, did the model learn something such that it's able to sort of extrapolate or perform how I want it to on data that looks a little bit different from what it was trained on? And we're going to take the example of natural language inference. So to recall the task of natural language inference-- and this is through the multi-NLI data set that we're just pulling our definition-- you have a premise.

He turned and saw John sleeping in his half tent. And you have a hypothesis. He saw John was asleep. And then you give them both to a model. And this is the model that we had before that gets some good accuracy. And the model is supposed to tell whether the hypothesis is sort of implied by the premise or contradicting.

So it could be contradicting, maybe, if the hypothesis is John was awake, for example, or he saw John was awake. Maybe that would be a contradiction. Neutral, if sort of both could be true at the same time, so to speak. And then entailment, in this case, it seems like they're saying that the premise implies the hypothesis.

And so you would say, probably, this is likely to get the right answer, since the accuracy of the model is 95%. 95% of the time, it gets the right answer. And we're going to dig deeper into that. What if the model is not doing what we think we want it to be doing in order to perform natural language inference?

So in a data set like multi-NLI, the authors who gathered the data set will have asked humans to perform the task and gotten the accuracy that the humans achieved. And models nowadays are achieving accuracies that are around where humans are achieving, which sounds great at first. But as we'll see, it's not the same as actually performing the task more broadly in the right way.

So what if the model is not doing something smart, effectively? We're going to use a diagnostic test set of carefully constructed examples that seem like things the model should be able to do to test for a specific skill or capacity. In this case, we'll use HANS. So HANS is the heuristic analysis for NLI systems data set.

And it's intended to take systems that do natural language inference and test whether they're using some simple syntactic heuristics. What we'll have in each of these cases, we'll have some heuristic. We'll talk through the definition. We'll get an example. So the first thing is lexical overlap. So the model might do this thing where it assumes that a premise entails all hypotheses constructed from words in the premise.

So in this example, you have the premise, the doctor was paid by the actor. And then the hypothesis is the doctor paid the actor. And you'll notice that in bold here, you've got the doctor, and then paid, and then the actor. And so if you use this heuristic, you will think that the doctor was paid by the actor implies the doctor paid the actor.

That does not imply it, of course. And so you could expect a model-- you want the model-- to be able to do this. It's somewhat simple. But if it's using this heuristic, it won't get this example right. Next is the subsequence heuristic. So here, if the model assumes that the premise entails all of its contiguous subsequences, it will get this one wrong as well.

So this example is: the doctor near the actor danced. That's the premise. The hypothesis is: the actor danced. Now, this is a simple syntactic thing. The doctor is doing the dancing; near the actor is this prepositional phrase. And so the model uses this heuristic. Oh, look, the actor danced.

That's a subsequence, so it's entailed. Awesome. Then it'll get this one wrong as well. And here's another one that's a lot like subsequence: the model thinks that the premise entails all complete subtrees-- so this is like fully formed phrases. So the artist slept here is a fully formed subtree.

If the artist slept, the actor ran-- that's the premise. Does it entail the hypothesis the actor slept? No. Sorry, the artist slept. That does not entail it, because the artist slept is inside that conditional. OK. Let me pause here for some questions before I move on to see how these models do.

Anyone unclear about how this sort of evaluation is being set up? No? OK. Cool. OK. OK, so how do models perform? That's sort of the question of the hour. What we'll do is we'll look at these results from the same paper that released the data set. So they took four strong MultiNLI models with the following accuracies.

So the accuracies here are something between 60% and 80%. BERT over here is doing the best. And in domain, in that first setting that we talked about, you get these reasonable accuracies. And that is sort of what we said before about it looking pretty good. And when we evaluate on HANS, in this setting here, we have examples where the heuristics we talked about actually work.

So if the model is using the heuristic, it will get this right. And it gets very high accuracies. And then if we evaluate the model in the settings where, if it uses the heuristic, it gets the examples wrong, maybe BERT's doing epsilon better than some of the other stuff here.

But it's a very different story. And you saw those examples. They're not complex in our own idea of complexity. And so this is why it feels like a clear failure of the system. Now, you can say, though, that, well, maybe the training data sort of didn't have any of those sort of phenomena.

So the model couldn't have learned not to do that. And that's sort of a reasonable argument, except, well, BERT is pre-trained on a bunch of language text. So you might expect, you might hope that it does better. So we saw that example of models performing well on examples that are like those that they were trained on, and then performing not very well at all on examples that seem reasonable but are sort of a little bit tricky.
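To make this concrete, here's a minimal sketch of how you might run a HANS-style diagnostic evaluation yourself. The three examples are the ones from the slides above; predict_nli is a hypothetical stand-in for whatever NLI model you've trained, assumed to return "entailment" or "non-entailment".

```python
# A minimal sketch of a HANS-style diagnostic evaluation: report accuracy
# separately for each heuristic rather than one aggregate number.
diagnostic_examples = [
    # (premise, hypothesis, gold label, heuristic being probed)
    ("The doctor was paid by the actor.", "The doctor paid the actor.",
     "non-entailment", "lexical_overlap"),
    ("The doctor near the actor danced.", "The actor danced.",
     "non-entailment", "subsequence"),
    ("If the artist slept, the actor ran.", "The artist slept.",
     "non-entailment", "constituent"),
]

def evaluate_heuristics(predict_nli):
    correct, total = {}, {}
    for premise, hypothesis, gold, heuristic in diagnostic_examples:
        pred = predict_nli(premise, hypothesis)       # hypothetical model call
        total[heuristic] = total.get(heuristic, 0) + 1
        if pred == gold:
            correct[heuristic] = correct.get(heuristic, 0) + 1
    for heuristic, n in total.items():
        print(f"{heuristic}: {correct.get(heuristic, 0) / n:.2%}")
```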

Now we're going to take this idea of having a test set that we've carefully crafted and go in a slightly different direction. So we're going to have, what does it mean to try to understand the linguistic properties of our models? So that syntactic heuristics question was one thing for natural language inference.

But can we sort of test how the models, whether they think certain things are sort of right or wrong as language models? And the first way that we'll do this is we'll ask, well, how do we think about sort of what humans think of as good language? How do we evaluate their sort of preferences about language?

And one answer is minimal pairs. And the idea of a minimal pair is that you've got one sentence that sounds OK to a speaker. So this sentence is, the chef who made the pizzas is here. We call it acceptable-- it's an acceptable sentence, at least to me. And then with a small change, a minimal change, the sentence is no longer OK to the speaker.

So the chef who made the pizzas are here. And this-- whoops. This should be present tense verbs. In English, present tense verbs agree in number with their subject when they are third person. So chef, pizzas, OK. And this is sort of a pretty general thing. Most people don't like this.

It's a misconjugated verb. And so the syntax here looks like you have the chef who made the pizzas. And then this arc of agreement in number is requiring the verb here to be singular is instead of plural are, despite the fact that there's this noun pizzas, which is plural, closer linearly. It comes back to dependency parsing.

We're back. OK. And what this looks like in the tree structure is, well, chef and is are attached in the tree. Chef is the subject of is. Pizzas is down here in this subtree. And so that subject-verb relationship has this sort of agreement thing. So this is a pretty sort of basic and interesting property of language that also reflects the syntactic sort of hierarchical structure of language.

So we've been training these language models, sampling from them, seeing that they get interesting things. And they tend to seem to generate syntactic content. But does it really understand, or does it behave as if it understands this idea of agreement more broadly? And does it sort of get the syntax right so that it matches the subjects and the verbs?

But language models can't tell us exactly whether they think that a sentence is good or bad. They just tell us the probability of a sentence. So before, we had acceptable and unacceptable. That's what we get from humans. And the language model's analog is just, does it assign higher probability to the acceptable sentence in the minimal pair?

So you have the probability under the model of the chef who made the pizzas is here. And then you have the probability under the model of the chef who made the pizzas are here. And you want this probability here to be higher. And if it is, that's sort of like a simple way to test whether the model got it right effectively.
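Here's a minimal sketch of that minimal-pairs check, using GPT-2 via the HuggingFace transformers library as a stand-in language model (an assumption for illustration; the results discussed in this lecture come from LSTM language models).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_log_prob(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean negative log-likelihood over predicted tokens,
        # so scale by the number of predicted tokens to get a total log-prob
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

acceptable = "The chef who made the pizzas is here."
unacceptable = "The chef who made the pizzas are here."
# The model "gets the minimal pair right" if it prefers the acceptable sentence.
print(sentence_log_prob(acceptable) > sentence_log_prob(unacceptable))
```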

And just like in HANS, we can develop a test set with very carefully chosen properties. So most sentences in English don't have terribly complex subject-verb agreement structure with a lot of words in the middle, like pizzas, that are going to make it difficult. So if I say, the dog runs, there's sort of no way to get it wrong, because the dependency is very simple.

So we can create, or we can look for sentences that have-- these are the things called attractors in the sentence. So pizzas is an attractor, because the model might be attracted to the plurality here and get the conjugation wrong. So this is our question. Can language models sort of very generally handle these examples with attractors?

So we can take examples with zero attractors, see whether the model gets the minimal pairs evaluation right. We can take examples with one attractor, two attractors. You can see how people would still reasonably understand these sentences, right? Chef who made the pizzas and prepped the ingredients is. It's still the chef who is.

And then on and on and on, it gets rarer, obviously. But you can have more and more attractors. And so now we've created this test set that's intended to evaluate this very specific linguistic phenomenon. So in this paper here, Kuncoro et al. trained an LSTM language model on a subset of Wikipedia back in 2018.

And they evaluated it sort of in these buckets that are specified by the paper that sort of introduced subject-verb agreement to the NLP field, more recently at least. And they evaluated it in buckets based on the number of attractors. And so in this table here that you're about to see, the numbers are sort of the percent of times that the model assigns higher probability to the correct sentence in the minimal pair.

So if you were just to do random or majority class, you get these errors. Oh, sorry, it's the percent of times that you get it wrong. Sorry about that. So lower is better. And so with no attractors, you get very low error rates. So this is 1.3 error rate with a 350-dimensional LSTM.

And with one attractor, your error rate is higher. But actually, humans start to make errors with more attractors too. So zero attractors is easy. The larger the LSTM, it looks like in general, the better you're doing. So the smaller model's doing worse. And then even on very difficult examples with four attractors-- try to think of an example in your head, like the chef who made the pizzas and took out the trash.

It sort of has to be this long sentence. The error rate is definitely higher, so it gets more difficult. But it's still relatively low. And so even on these very hard examples, models are actually performing subject-verb number agreement relatively well. Very cool. OK. Here's some examples that a model got wrong.

This is actually a worse model than the ones from the paper that was just there. But I think, actually, the errors are quite interesting. So here's a sentence. The ship that the player drives has a very high speed. Now, this model thought that was less probable than the ship that the player drives have a very high speed.

My hypothesis is that it sort of misanalyzes drives as a plural noun, for example. It's sort of a difficult construction there. I think it's pretty interesting. Likewise here, this one is fun. The lead is also rather long. Five paragraphs is pretty lengthy. So here, five paragraphs is a singular noun together.

It's like a unit of length, I guess. But the model thought that it was more likely to say five paragraphs are pretty lengthy, because it's referring to this sort of five paragraphs as the five actual paragraphs themselves, as opposed to a single unit of length describing the lead. Fascinating.

OK. Any questions again? So I guess there are a couple. Can we do a similar heuristic analysis for other tasks, such as Q&A, classification? Yes. So yes, I think it's easy to do this kind of HANS-style analysis with question answering and other sorts of tasks, because you can construct examples that similarly have these heuristics and then have the answer depend on the syntax or not.

Asking whether the actual probability of one sentence is higher than the other is, of course, sort of a language-model-specific thing. But the idea that you can develop bespoke test sets for various tasks, I think, is very, very general and something I think is actually quite interesting. Yes. So I won't go on further, but I think the answer is just yes.

So there's another one. How do you know where to find these failure cases? Maybe that's the right time to advertise linguistics classes. Sorry. You're still very quiet over here. How do we find what? How do you know where to find these failure cases? Oh, interesting. Yes, how do we know where to find the failure cases?

That's a good question. I mean, I think I agree with Chris that actually thinking about what is interesting about language is one way to do it. I mean, the heuristics that we saw in our language model-- sorry, in our NLI models with HANS-- you can imagine that if the model was sort of ignoring facts about language and just doing this sort of rough bag of words with some extra magic, then it would do about as badly as it's doing here.

And these sorts of ideas about understanding that this statement, if the artist slept, the actor ran, does not imply the artist slept, is the kind of thing that maybe you'd think up on your own, but also you'd spend time sort of pondering about and thinking broad thoughts about in linguistics curricula as well.

So anything else, Chris? Yeah. So there's also-- well, I guess someone was also saying-- I think it's about the sort of intervening verbs example-- intervening nouns, sorry, example. But the data set itself probably includes mistakes with higher attractors. Yeah, yeah, that's a good point. Yeah, because humans make more and more mistakes as the number of attractors gets larger.

On the other hand, I think that the mistakes are fewer in written text than in spoken. Maybe I'm just making that up. That's what I think. But yeah, it would be interesting to actually go through that test set and see how many of the errors a really strong model makes are actually due to the sort of observed form being incorrect.

I'd be super curious. OK, should I move on? Yeah. Great. OK, so what does it feel like we're doing when we are kind of constructing these sort of bespoke, small, careful test sets for various phenomena? Well, it sort of feels like unit testing. And in fact, this sort of idea has been brought to the fore, you might say, in NLP-- unit tests, but for neural networks.

And in particular, the paper here that I'm citing at the bottom suggests this minimum functionality test. You want a small test set that targets a specific behavior. That should sound like some of the things that we've already talked about. But in this case, we're going to get even more specific.

So here's a single test case. We're going to have an expected label, what was actually predicted, whether the model passed this unit test. And the labels are going to be sentiment analysis here. So negative label, positive label, or neutral are the three options. And the unit test is going to consist simply of sentences that follow this template.

I, then a negation, a positive verb, and then the thing. So if you negate a positive verb, the sentiment should come out negative. And so here's an example. I can't say I recommend the food. The expected label is negative. The answer that the model provided-- and this is, I think, a commercial sentiment analysis system.

So it predicted positive. And then, I didn't love the flight. The expected label was negative. And then the predicted answer was neutral. And this commercial sentiment analysis system gets a lot of what you could imagine are pretty reasonably simple examples wrong. And so what Ribeiro et al. 2020 showed is that they could actually provide this framework of building test cases for NLP models to ML engineers working on these products and give them that interface.

And they would actually find bugs-- bugs being categories of high error-- find bugs in their models that they could then kind of try to go and fix. And this was kind of an efficient way of trying to find things that were simple and still wrong with what should be pretty sophisticated neural systems.
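Here's a minimal sketch of a minimum functionality test like that, built from the "I {negation} {positive verb} {thing}" template discussed above. predict_sentiment is a hypothetical stand-in for the system under test, assumed to return "positive", "negative", or "neutral".

```python
# A minimal sketch of a CheckList-style minimum functionality test for sentiment.
negations = ["can't say I", "didn't", "don't think I"]
positive_verbs = ["recommend", "love", "enjoy"]
things = ["the food", "the flight", "this product"]

def run_mft(predict_sentiment):
    failures = []
    for neg in negations:
        for verb in positive_verbs:
            for thing in things:
                sentence = f"I {neg} {verb} {thing}."
                expected = "negative"              # negated positive verb => negative
                predicted = predict_sentiment(sentence)   # hypothetical system call
                if predicted != expected:
                    failures.append((sentence, expected, predicted))
    total = len(negations) * len(positive_verbs) * len(things)
    print(f"failure rate: {len(failures)}/{total}")
    return failures
```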

But I really like this. And it's sort of a nice way of thinking more specifically about what are the capabilities in sort of precise terms of our models. And altogether now, you've seen problems in natural language inference. You've seen language models actually perform pretty well at the language modeling objective.

But then you see-- you just saw an example of a commercial sentiment analysis system that sort of should do better and doesn't. And this comes to this really, I think, broad and important takeaway, which is if you get high accuracy on the in-domain test set, you are not guaranteed high accuracy on even what you might consider to be reasonable out-of-domain evaluations.

And life is always out of domain. And if you're building a system that will be given to users, it's immediately out of domain, at the very least because it's trained on text that's now older than the things that the users are now saying. So it's a really, really important takeaway that your sort of benchmark accuracy is a single number that does not guarantee good performance on a wide variety of things.

And from a what are our neural networks doing perspective, one way to think about it is that models seem to be learning the data set, fitting sort of the fine-grained heuristics and statistics that help it fit this one data set, as opposed to learning the task. So humans can perform natural language inference.

If you give them examples from whatever data set, once you've told them how to do the task, they'll be very generally strong at it. But you take your MNLI model, and you test it on HANS, and it got whatever that was, below-chance accuracy. That's not the kind of thing that you want to see.

So it definitely learns the data set well, because the accuracy in-domain is high. But our models are seemingly not frequently learning sort of the mechanisms that we would like them to be learning. Last week, we heard about language models and sort of the implicit knowledge that they encode about the world through pre-training.

And one of the ways that we saw it interact with language models was providing them with a prompt, like Dante was born in [MASK], and then seeing if it puts high probability on the correct continuation, which requires you to access knowledge about where Dante was born. And we didn't frame it this way last week, but this fits into the set of behavioral studies that we've done so far.

This is a specific kind of input. You could ask this for multiple people. You could swap out Dante for other people. You could swap out born in for, I don't know, died in or something. And then there are like test suites again. And so it's all connected. OK, so I won't go too deep into sort of the knowledge of language models in terms of world knowledge, because we've gone over it some.

But when you're thinking about ways of interacting with your models, this sort of behavioral study can be very, very general, even though, remember, we're at still this highest level of abstraction, where we're just looking at the probability distributions that are defined. All right, so now we'll go into-- so we've sort of looked at understanding in fine-grained areas what our model is actually doing.

What about sort of why for an individual input is it getting the answer right or wrong? And then are there changes to the inputs that look fine to humans, but actually make the models do a bad job? So one study that I love to reference that really draws back into our original motivation of using LSTM networks instead of simple recurrent neural networks was that they could use long context.

But how long is your long and short-term memory? And the idea of Khandelwal et al. 2018 was to shuffle or remove the context that is farther than some k words away, varying k. And if the accuracy, if the predictive ability of your language model, the perplexity, doesn't change once you do that, it means the model wasn't actually using that context.

I think this is so cool. So on the x-axis, we've got how far away from the word that you're trying to predict. Are you actually sort of corrupting, shuffling, or removing stuff from the sequence? And then on the y-axis is the increase in loss. So if the increase in loss is zero, it means that the model was not using the thing that you just removed.

Because if it was using it, it would now do worse without it. And so if you shuffle-- the blue line here-- the history that's farther away than 50 words, the model does not even notice. I think that's really interesting. One, it says everything past 50 words of this LSTM language model, you could have given it in random order, and it wouldn't have noticed.

And then two, it says that if you're closer than that, it actually is making use of the word order. That's a pretty long memory. OK, that's really interesting. And then if you actually remove the words entirely, you can kind of notice that the words are missing up to 200 words away.

So you don't care about the order they're in, but you care whether they're there or not. And so this is an evaluation of, well, do LSTMs have long-term memory? Well, this one at least has effectively no longer than 200 words of memory, but also no less. So very cool.
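Here's a minimal sketch of that perturbation experiment, again using GPT-2 from HuggingFace transformers as a stand-in (the original study used an LSTM language model): shuffle everything farther than k tokens from the word being predicted and see how much the loss on that word goes up.

```python
import random
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def last_token_loss(ids):
    # Negative log-probability of the final token given everything before it.
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, -2], dim=-1)
    return -log_probs[ids[0, -1]].item()

def loss_increase_when_shuffling_beyond(text, k):
    ids = tokenizer(text, return_tensors="pt").input_ids
    original = last_token_loss(ids)
    far, near = ids[0, :-k].tolist(), ids[0, -k:].tolist()
    random.shuffle(far)                       # scramble the distant context only
    perturbed = torch.tensor([far + near])
    # Zero increase means the model wasn't using the order of that distant context.
    return last_token_loss(perturbed) - original

# e.g. loss_increase_when_shuffling_beyond(some_long_paragraph, k=50)
```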

So that's a general study for a single model. It talks about its average behavior over a wide range of examples. But we want to talk about individual predictions on individual inputs. So let's talk about that. So one way of interpreting why did my model make this decision that's very popular is, for a single example, what parts of the input actually led to the decision?

And this is where we come in with saliency maps. So a saliency map provides a score for each word indicating its importance to the model's prediction. So you've got something like BERT here. BERT is making a prediction for this [MASK]: the [MASK] rushed to the emergency room to see her patient.

And the predictions that the model is making: with 47%, it's going to be nurse that's here in the mask, or maybe woman, or doctor, or mother, or girl. And then the saliency map is being visualized here in orange. According to this method of saliency called simple gradients, which we'll get into, emergency, her, and the SEP token-- let's not worry about the SEP token for now.

But emergency and her are the important words, apparently. And the SEP token shows up in every sentence. So I'm not going to-- and so these two together are, according to this method, what's important for the model to make this prediction to mask. And you can see maybe some statistics, biases, et cetera, that is picked up in the predictions and then have it mapped out onto the sentence.

And this is-- well, it seems like it's really helping interpretability. And yeah, I think that this is a very useful tool. Actually, this is part of a demo from AllenNLP that allows you to do this yourself for any sentence that you want. So what's this way of making saliency maps?

We're not going to go-- there's so many ways to do it. We're going to take a very simple one and work through why it makes sense. So the issue is, how do you define importance? What does it mean to be important to the model's prediction? And here's one way of thinking about it.

It's called the simple gradient method. Let's get a little formal. You've got words x1 to xn. And then you've got a model score for a given output class. So maybe you've got, in the BERT example, each output class was each output word that you could possibly predict. And then you take the norm of the gradient of the score, with respect to each word.

So what we're saying here is, the score is the unnormalized probability for that class. So you've got a single class. You're taking the score. It's how likely it is, not yet normalized by how likely everything else is. Gradient, how much is it going to change if I move it a little bit in one direction or another?

And then you take the norm to get a scalar from a vector. So it looks like this. The salience of word i, you have the norm bars on the outside, gradient with respect to xi. So that's if I change a little bit locally xi, how much does my score change?

So the idea is that a high gradient norm means that if I were to change it locally, I'd affect the score a lot. And that means it was very important to the decision. Let's visualize this a little bit. So here on the y-axis, we've got loss. Just the loss of the model-- sorry, this should be score.

Should be score. And on the x-axis, you've got word space. The word space is like sort of a flattening of the ability to move your word embedding in 1,000 dimensional space. So I've just plotted it here in one dimension. And now, a high saliency thing, you can see that the relationship between what should be score and moving the word in word space, you move it a little bit on the x-axis, and the score changes a lot.

That's that derivative. That's the gradient. Awesome, love it. Low saliency, you move the word around locally, and the score doesn't change. So the interpretation is that means that the actual identity of this word wasn't that important to the prediction, because I could have changed it, and the score wouldn't have changed.
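Here's a minimal sketch of the simple gradient method in PyTorch. I'm using a HuggingFace sentiment classification checkpoint as an example model (an assumption for illustration, not the model from the slide); any model that accepts inputs_embeds works the same way: take the logit for the class you care about, backpropagate to the input embeddings, and take a norm per token.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

def simple_gradient_saliency(sentence, target_class):
    enc = tokenizer(sentence, return_tensors="pt")
    # Look up input embeddings and make them a leaf we can differentiate against.
    embeds = model.get_input_embeddings()(enc.input_ids).detach()
    embeds.requires_grad_(True)
    # Unnormalized score (logit) for the class we care about.
    score = model(inputs_embeds=embeds,
                  attention_mask=enc.attention_mask).logits[0, target_class]
    score.backward()
    # Salience of token i = || d(score) / d(embedding of token i) ||
    norms = embeds.grad[0].norm(dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(enc.input_ids[0])
    return sorted(zip(tokens, norms.tolist()), key=lambda p: -p[1])

print(simple_gradient_saliency("The nurse rushed to the emergency room.", target_class=1))
```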

Now, why are there more methods than this? Because honestly, reading that, I was like, that sounds awesome. That sounds great. So there are sort of lots of issues with this kind of method and lots of ways of getting around them. Here's one issue. It's not perfect, because, well, maybe your linear approximation that the gradient gives you holds only very, very locally.

So here, the gradient is 0. So this is a low saliency word, because I'm at the bottom of this parabola. But if I were to move even a little bit in either direction, the score would shoot up. So is this not an important word? It seems important to be right there, as opposed to anywhere else even sort of nearby in order for the score not to go up.

But the simple gradients method won't capture this, because it just looks at the gradient, which is that 0 right there. But if you want to look into more, there's a bunch of different methods that are sort of applied in these papers. And I think that is a good tool for the toolbox.

OK, so that is one way of explaining a prediction. And it has some issues, like why are individual words being scored, as opposed to phrases or something like that. But for now, we're going to move on to another type of explanation. And I'm going to check the time. OK, cool.

Actually, yeah, let me pause for a second. Any questions about this? I mean, earlier on, there were a couple of questions. One of them was, what are your thoughts on whether looking at attention weights is a methodologically rigorous way of determining the importance that the model places on certain tokens?

It seems like there's some back and forth in the literature. That is a great question. And I probably won't engage with that question as much as I could if we had a second lecture on this. I actually will provide some attention analyses and tell you they're interesting. And then I'll sort of say a little bit about why they can be interesting without being sort of maybe the end all of analysis of where information is flowing in a transformer, for example.

I think the debate is something that we would have to get into in a much longer period of time. But look at the slides that I show about attention and the caveats that I provide. And let me know if that answers your question first, because we have quite a number of slides on it.

And if not, please, please ask again. And we can chat more about it. And maybe you can go on. Great. OK. So I think this is a really fascinating question, which also gets at what was important about the input, but in actually kind of an even more direct way, which is, could I just keep some minimal part of the input and get the same answer?

So here's an example from SQuAD. You have this passage: in 1899, John Jacob Astor IV invested $100,000 for Tesla. OK. And then the answer that is being predicted by the model-- which is always going to be in blue in these examples-- is Colorado Springs Experiments. So you've got this passage. And the question is, what did Tesla spend Astor's money on?

And the prediction is Colorado Springs Experiments. The model gets the answer right, which is nice. And we would like to think it's because it's doing some kind of reading comprehension. But here's the issue. It turns out, based on this fascinating paper, that if you just reduce the question to did, you actually get exactly the same answer.

In fact, with the original question, the model had sort of a 0.78 confidence probability in that answer. And with the reduced question did, you get even higher confidence. And if you give a human this, they would not be able to know really what you're trying to ask about. So it seems like some things are going really wonky here.

Here's another. So here's sort of a very high level overview of the method. In fact, it actually references our input saliency methods. Ah, nice. It's connected. So you iteratively remove non-salient or unimportant words. So here's a passage again, talking about football, I think. Yeah. And-- oh, nice. OK, so the question is, where did the Broncos practice for the Super Bowl, and the prediction is Stanford University.

And that is correct. So again, seems nice. And now, we're not actually going to get the model to be incorrect. We're just going to say, how can I change this question such that I still get the answer right? So I'm going to remove the word that was least important according to a saliency method.

So now, it's where did the practice for the Super Bowl? Already, this is sort of unanswerable, because you've got two teams practicing. You don't even know which one you're asking about. So the fact that the model is still so confident in Stanford University makes no sense. But you can just sort of keep going.

And now, I think, here, the model stops being confident in the answer, Stanford University. But I think this is really interesting just to show that if the model is able to do this with very high confidence, it's not reflecting the uncertainty that really should be there because you can't know what you're even asking about.
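The reduction loop itself is simple. Here's a minimal sketch, where predict (returning an answer string and a confidence) and word_saliencies (returning one importance score per question word, like the gradient method above) are hypothetical stand-ins for your QA model and your saliency method.

```python
def reduce_question(question, passage, predict, word_saliencies):
    """Iteratively drop the least-salient question word while the prediction stays the same."""
    original_answer, _ = predict(question, passage)
    words = question.split()
    while len(words) > 1:
        saliencies = word_saliencies(" ".join(words), passage)
        # Tentatively remove the least important word.
        idx = min(range(len(words)), key=lambda i: saliencies[i])
        candidate = words[:idx] + words[idx + 1:]
        answer, _ = predict(" ".join(candidate), passage)
        if answer != original_answer:
            break                          # stop right before the answer would change
        words = candidate
    return " ".join(words)                 # often a nonsensical question like "did"
```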

OK, so what was important to make this answer? Well, at least these parts were important because you could keep just those parts and get the same answer. Fascinating. All right, so that's sort of the end of the admittedly brief section on thinking about input saliency methods and similar things.

Now, we're going to talk about actually breaking models and understanding models by breaking them. OK, cool. So if we have a passage here, Peyton Manning became the first quarterback, something, Super Bowl, age 39, past record held by John Elway. Again, we're doing question answering. We've got this question. What was the name of the quarterback who was 38 in the Super Bowl?

The prediction is correct. Looks good. Now, we're not going to change the question to try to sort of make it nonsensical while keeping the same answer. Instead, we're going to change the passage by adding a sentence at the end, which really shouldn't distract anyone. The sentence is: well-known quarterback Jeff Dean had jersey number 37 in Champ Bowl.

So this just doesn't-- it's really not even related. But now, the prediction is Jeff Dean for our nice QA model. And so this shows, as well, that it seems like maybe there's this end of the passage bias as to where the answer should be, for example. And so this is an adversarial example where we flipped the prediction by adding something that is innocuous to humans.

And so sort of the higher level takeaway is, oh, it seems like the QA model that we had that seemed good is not actually performing QA how we want it to, even though its in-domain accuracy was good. And here's another example. So you've got this paragraph with a question, what has been the result of this publicity?

The answer is increased scrutiny on teacher misconduct. Now, instead of changing the paragraph, we're going to change the question in really, really seemingly insignificant ways to change the model's prediction. So first, what HA-- now you've got this typo, L-- then the result of this publicity, the answer changes to teacher misconduct.

Likely, a human would sort of ignore this typo or something and answer the right answer. And then this is really nuts. Instead of asking, what has been the result of this publicity, if you ask, what's been the result of this publicity, the answer also changes. And this is-- the authors call this a semantically equivalent adversary.

This is pretty rough. And in general, swapping what's for what has in this QA model breaks it pretty frequently. And so again, when you go back and sort of re-tinker how to build your model, you're going to be thinking about these things, not just the sort of average accuracy. So that's sort of talking about noise.

Are models robust to noise in their inputs? Are humans robust to noise is another question we can ask. And so you can kind of go to this popular sort of meme passed around the internet from time to time, where you have all the letters in these words scrambled. You say, according to research at Cambridge University, it doesn't matter in what order the letters in a word are.

And so it seems like-- I think I did a pretty good job there. Seemingly, we got this noise. That's a specific kind of noise. And we can be robust as humans to reading and processing the language without actually all that much of a difficulty. So that's maybe something that we might want our models to also be robust to.

And it's very practical as well. Noise is a part of all NLP systems inputs at all times. There's just no such thing effectively as having users, for example, and not having any noise. And so there's a study that was performed on some popular machine translation models, where you train machine translation models in French, German, and Czech, I think all to English.

And you get BLEU scores. These BLEU scores will look a lot better than the ones in your assignment four because there's much, much more training data. The idea is these are actually pretty strong machine translation systems. And that's on in-domain, clean text. Now, if you add character swaps, like the ones we saw in that sentence about Cambridge, the BLEU scores take a pretty harsh dive.

Not very good. And even if you take a somewhat more natural typo noise distribution here, you'll see that you're still getting 20-ish, yeah, very high drops in BLEU score through simply natural noise. And so maybe you'll go back and retrain the model on more types of noise. And then you ask, oh, if I do that, is it robust to even different kinds of noise?
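Here's a minimal sketch of the kind of synthetic noise used in these studies: swap a pair of adjacent interior characters in some fraction of the words, then re-translate and re-score. The scoring against references is left out; this just shows the perturbation.

```python
import random

def swap_noise(sentence, prob=0.5, rng=random.Random(0)):
    """Swap two adjacent interior characters in roughly `prob` of the longer words."""
    noisy = []
    for word in sentence.split():
        if len(word) > 3 and rng.random() < prob:
            i = rng.randrange(1, len(word) - 2)          # keep first/last letter in place
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    return " ".join(noisy)

print(swap_noise("According to research it does not matter in what order the letters are"))
```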

These are the questions that are going to be really important. And it's important to know that you're able to break your model really easily so that you can then go and try to make it more robust. OK, now, let's see, 20 minutes, awesome. Now we're going to, I guess, yeah.

So now we're going to look at the representations of our neural networks. We've talked about their behavior and then whether we could change or observe reasons behind their behavior. Now we'll go into less abstraction, look more at the actual vector representations that are being built by models. And we can answer a different kind of question, at the very least, than with the other studies.

The first thing is related to the question I was asked about attention, which is that some modeling components lend themselves to inspection. Now this is a sentence that I chose somewhat carefully, actually, because in part of this debate, are they interpretable components? We'll see. But they lend themselves to inspection in the following way.

You can visualize them well, and you can correlate them easily with various properties. So let's say you have attention heads in BERT. This is from a really nice study that was done here, where you look at attention heads of BERT, and you say, on most sentences, this attention head, head 1, 1, seems to do this very global aggregation.

Simple kind of operation does this pretty consistently. That's cool. Is it interpretable? Well, maybe. So it's the first layer, which means that this word found is sort of uncontextualized. But in deeper layers, the problem is that once you do some rounds of attention, you've had information mixing and flowing between words.

And how do you know exactly what information you're combining, what you're attending to, even? It's a little hard to tell. And saliency methods more directly evaluate the importance of inputs to the model's prediction. But it's still interesting to see, at sort of a local, mechanistic point of view, what kinds of things are being attended to.

So let's take another example. Some attention heads seem to perform simple operations. So you have the global aggregation here that we saw already. Others seem to attend pretty robustly to the next token. Cool. Next token is a great signal. Some heads attend to the SEP token. So here you have attending to SEP.

And then maybe some attend to periods. Maybe that's sort of splitting sentences together and things like that. Not things that are hard to do, but things that some attention heads seem to pretty robustly perform. Again now, though, deep in the network, what's actually represented at this period at layer 11?

Little unclear. Little unclear. OK. So some heads, though, are correlated with really interesting linguistic properties. So this head is actually attending to noun modifiers. So you've got this the complicated language in the huge new law. That's pretty fascinating. Even if the model is not doing this as a causal mechanism to do syntax necessarily, the fact that these things so strongly correlate is actually pretty, pretty cool.
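Here's a minimal sketch of how you can pull out one attention head from BERT for this kind of inspection, using the HuggingFace transformers output_attentions flag. Which layer and head you look at is up to you; and, as noted above, attention maps in deep layers are hard to interpret because information has already been mixed across positions.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

def show_attention_head(sentence, layer, head):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
    weights = out.attentions[layer][0, head]
    tokens = tokenizer.convert_ids_to_tokens(enc.input_ids[0])
    for i, tok in enumerate(tokens):
        j = int(weights[i].argmax())            # the token this position attends to most
        w = weights[i, j].item()
        print(f"{tok:>15} -> {tokens[j]:<15} ({w:.2f})")

# Pick any layer/head to inspect; here, just the first layer's first head.
show_attention_head("The complicated language in the huge new law.", layer=0, head=0)
```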

And so what we have in all of these studies is we've got sort of an approximate interpretation and quantitative analysis allowing us to reason about very complicated model behavior. They're all approximations, but they're definitely interesting. One other example is that of coreference. So we saw some work on coreference.

And it seems like this head does a pretty OK job of actually matching up coreferent entities. These are in red. Talks, negotiations, she, her. And that's not obvious how to do that. This is a difficult task. And so it does so with some percentage of the time. And again, it's sort of connecting very complex model behavior to these sort of interpretable summaries of correlating properties.

Other cases, you can have individual hidden units that lend themselves to interpretation. So here, you've got a character level LSTM language model. Each row here is a sentence. If you can't read it, it's totally OK. The interpretation that you should take is that as we walk along the sentence, this single unit is going from, I think, very negative to very positive or very positive to very negative.

I don't really remember. But it's tracking the position in the line. So it's just a linear position unit and pretty robustly doing so across all of these sentences. So this is from a nice visualization study way back in 2016, way back. Here's another cell from that same LSTM language model that seems to sort of turn on inside quotes.

So here's a quote. And then it turns on. So I guess that's positive in the blue. End quote here. And then it's negative. Here, you start with no quote, negative in the red, see a quote, and then blue. Seems, again, very interpretable. Also, potentially a very useful feature to keep in mind.

And this is just an individual unit in the LSTM that you can just look at and see that it does this. Very, very interesting. Going even farther with this-- and this is actually a study by some AI and neuroscience researchers-- we saw that LSTMs were good at subject-verb number agreement.

Can we figure out the mechanisms by which the LSTM is solving the task? Can we actually get some insight into that? And so we have a word level language model. The word level language model is going to be a little small. But you have a sentence, "the boy gently and kindly greets the." And this cell that's being tracked here-- so it's an individual hidden unit, one dimension-- is actually, after it sees boy, it sort of starts to go higher.

And then it goes down to something very small once it sees greets. And this cell seems to correlate with the scope of a subject verb number agreement instance, effectively. So here, "the boy that watches the dog that watches the cat greets." You've got that cell, again, staying high, maintaining the scope of subject until greets, and at which point it stops.

What allows it to do that? Probably some complex other dynamics in the network. But it's still a fascinating, I think, insight. And yeah, this is just neuron 1,150 in this LSTM. So those are all observational studies that you can do by picking out individual components of the model and correlating them with some behavior.

Now, we'll look at a general class of methods called probing, by which we still use supervised knowledge, like knowledge of the type of coreference that we're looking for. But instead of seeing if it correlates with something that's immediately interpretable, like an attention head, we're going to look into the vector representations of the model and see if these properties can be read out by some simple function, to say, oh, maybe this property was made very easily accessible by my neural network.

So let's dig into this. So the general paradigm is that you've got language data that goes into some big pre-trained transformer with fine-tuning. And you get SOTA results-- SOTA means state-of-the-art. And so the question for the probing methodology is, if it's providing these general purpose language representations, what does it actually encode about language?

Can we quantify this? Can we figure out what kinds of things it's learning about language that we seemingly now don't have to tell it? And so you might have something like a sentence, like I record the record. That's an interesting sentence. And you put it into your transformer model with its word embeddings at the beginning, maybe some layers of self-attention and stuff.

And you make some predictions. And now our objects of study are going to be these intermediate layers. So it's a vector per word or subword for every layer. And the question is, can we use these linguistic properties, like the dependency parsing that we had way back in the early part of the course, to understand correlations between properties in the vectors and these things that we can interpret?

We can interpret dependency parses. So there are a couple of things that we might want to look for here. We might want to look for semantics. So here in this sentence, I record the record. I am an agent. That's a semantics thing. Record is a patient. It's the thing I'm recording.

You might have syntax. So you might have this syntax tree that you're interested in. That's the dependency parse tree. Maybe you're interested in part of speech, because you have record and record. And the first one's a verb. The second one's a noun. They're identical strings. Does the model encode that one is one and the other is the other?

So how do we do this kind of study? So we're going to decide on a layer that we want to analyze. And we're going to freeze BERT. So we're not going to fine tune BERT. All the parameters are frozen. So we're going to decide on layer 2 of BERT.

And we're going to pass it some sentences. We decide on what's called a probe family. And the question I'm asking is, can I use a model from my family, say linear, to decode a property that I'm interested in really well from this layer? So it's indicating that this property is easily accessible to linear models, effectively.

So maybe I train a linear classifier right on top of BERT. And I get a really high accuracy. And that's sort of interesting already, because you know, from prior work in part of speech tagging, that if you run a linear classifier on simpler features that aren't BERT, you probably don't get as high an accuracy.

So that's an interesting sort of takeaway. But then you can also take a baseline. So I want to compare two layers now. So I've got layer 1 here. I want to compare it to layer 2. I train a probe on it as well. Maybe the accuracy isn't as good.

And now I can say, oh, wow, look: by layer 2, part of speech is more easily accessible to linear functions than it was at layer 1. So what did that? Well, the self-attention and feed-forward layers in between made it more easily accessible. That's interesting, because it's a statement about the information processing of the model.

So we're going to analyze these layers. Let's take a second more to think about it; just give me a second. You have the model's representations, h1 through ht, and you have a function family F, say the set of linear models, or maybe feed-forward neural networks with some fixed set of hyperparameters.

You freeze the model and train the probe, so you get some predictions for part-of-speech tagging or whatever. That's just the probe applied to the hidden state of the model, where the probe is a member of the probe family. And then the extent to which we can predict y is a measure of accessibility.

So that's just the same thing written out rather than pictorially, and I'm not going to stay on this for too much longer. It may help in the search for causal mechanisms, but it really just gives us a rough understanding of the processing of the model and what things are accessible at what layer.
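Here's a minimal sketch of what that looks like in code, assuming the Hugging Face transformers library and a part-of-speech probe on layer 2 of a frozen BERT. The tag inventory size, the one-wordpiece-per-word simplification, and the training setup are illustrative assumptions, not the exact recipe of any particular probing paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()  # frozen, no fine-tuning
LAYER, NUM_TAGS = 2, 17                                       # e.g. a UPOS-sized tag set
probe = torch.nn.Linear(bert.config.hidden_size, NUM_TAGS)    # the "linear" probe family
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

def probe_step(words, tag_ids):
    """One training step; only the probe's parameters are updated."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():                                     # BERT stays frozen
        hidden = bert(**enc, output_hidden_states=True).hidden_states[LAYER]
    # Simplification: assume one wordpiece per word (real code must align subwords).
    logits = probe(hidden[0, 1:-1])                           # drop [CLS] and [SEP]
    loss = loss_fn(logits, torch.tensor(tag_ids))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

You would then train the same probe family on a different layer, say layer 1, and compare the resulting accuracies to compare accessibility across layers.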

So what are some results here? One result is that BERT, if you run linear probes on it, does really, really well on things that require syntax, like part-of-speech tagging and named entity recognition; actually, in some cases, approximately as well as the very best thing you could possibly do without BERT.

So it just makes amazingly strong features for these properties easily accessible. And that's an interesting emergent quality of BERT, you might say. It also seems like the layers of BERT have this property; if you look at this plot here, each column is a task.

You've got the input words at layer 0 of BERT here, and layer 24 is the last layer of BERT-large. Lower performance is yellow; higher performance is blue. The resolution isn't perfect, but consistently, the best place to read out these properties is somewhere a bit past the middle of the model, which is a very consistent rule, and fascinating.

It also seems like, if you put increasingly abstract, or increasingly difficult to compute, linguistic properties on one axis and increasing depth in the network on the other axis, then the deeper you go in the network, the more easily you can access more and more abstract linguistic properties. That suggests this accessibility is being constructed over time by the layers of processing of BERT.

So it's building more and more abstract features, which I think is, again, a really interesting result. And one thing that comes to mind that really brings us back right to day one is how we built intuitions around Word2Vec. We were asking, what does each dimension of Word2Vec mean?

And the answer was, not really anything. But we could build intuitions about it and think about properties of it through these connections between simple mathematical properties of Word2Vec and linguistic properties that we could understand. So we had this approximation, which is not 100% true. But it's an approximation that says cosine similarity is effectively correlated with semantic similarity.

And that's worth thinking about even if all we're going to do at the end of the day is fine-tune these word embeddings anyway. Likewise, we had this idea about analogies being encoded by linear offsets. So some relationships are linear in the space, and they didn't have to be. That's fascinating. It's an emergent property that we've been able to study since we discovered it.
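As a reminder of what those two connections look like in practice, here's a tiny sketch with NumPy: cosine similarity as a proxy for semantic similarity, and solving analogies with linear offsets. The `vectors` dictionary of word embeddings is a hypothetical stand-in for a trained Word2Vec model.

```python
import numpy as np

# `vectors` is a hypothetical {word: np.ndarray} mapping from a trained Word2Vec model.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors, topk=1):
    """Solve a : b :: c : ? via the linear-offset trick (b - a + c)."""
    query = vectors[b] - vectors[a] + vectors[c]
    scored = sorted(
        ((w, cosine(query, vec)) for w, vec in vectors.items() if w not in {a, b, c}),
        key=lambda x: -x[1],
    )
    return scored[:topk]

# e.g. analogy("man", "king", "woman", vectors)  ->  hopefully something like [("queen", ...)]
```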

Why is that the case in Word2Vec? And in general, even though you can't interpret the individual dimensions of Word2Vec, these emergent, interpretable connections between approximate linguistic ideas and simple math on these objects are fascinating. And one piece of work that extends this idea comes back to dependency parse trees.

So they describe the syntax of sentences. And in a paper that I did with Chris, we showed that BERT and models like it actually make dependency parse tree structure emergent, more easily accessible in their vector space than one might imagine. So if you've got a tree right here, "the chef who rented the store was out of food," what you can do is think about the tree in terms of distances between words.

The number of edges in the tree between two words is their path distance, so the distance between "chef" and "was" is 1. And we're going to use this interpretation of a tree as a distance metric to make a connection with BERT's embedding space. What we were able to show is that, under a single linear transformation, and if you choose that B matrix right, the squared Euclidean distance between BERT vectors for the same sentence actually correlates well with the distances in the tree.

So here, in this Euclidean space that we've transformed, the approximate distance between "chef" and "was" is also 1. Likewise, the distance between "was" and "store" is 4 in the tree, and in my simple transformation of BERT space, the distance between "store" and "was" is also approximately 4. And this is true across a wide range of sentences.
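In code, the quantity being measured is just a squared Euclidean distance after one linear map. Here's a rough sketch of that distance computation, with the training of the B matrix only indicated in a comment; it's a simplification of the structural probe idea, not the exact published implementation.

```python
import torch

def probe_distance(h_i, h_j, B):
    """Squared distance between two word vectors under the linear map B:
    d_B(h_i, h_j)^2 = ||B (h_i - h_j)||^2. B is trained so that this value
    approximates the path distance between the two words in the dependency tree."""
    diff = B @ (h_i - h_j)
    return torch.sum(diff * diff)

# Training, sketched: with BERT frozen, update only B to minimize
# |probe_distance(h_i, h_j, B) - tree_distance(i, j)| over word pairs in each sentence.
```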

And this is, to me, a fascinating example of, again, emergent approximate structure in these very nonlinear models that don't necessarily need to encode things so simply. OK. All right. Great. So probing studies and correlation studies are, I think, interesting and point us in directions to build intuitions about models.

But they're not arguments that the model is actually using the thing that you're finding to make a decision. They're not causal studies; that's true of probing and correlation studies in general. So in some work that I did around the same time, we showed that certain conditions on probes actually allow you to achieve high accuracy on a task that's effectively just fitting random labels.

And so there's a difficulty in interpreting what the model could or could not be doing with this thing that is somehow easily accessible. It's interesting that the property is easily accessible, but the model might not be doing anything with it, for example, because it's totally random. Likewise, another paper showed that you can achieve high accuracy with a probe even if the model is trained such that the thing you're probing for is not useful.
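One way to make that "fitting random labels" point concrete is a control task: give each word type a fixed but random label and train the same probe on it. Here's a minimal sketch of constructing such labels; the tag count and setup are placeholders, loosely in the spirit of that line of work rather than a faithful reproduction.

```python
import random
from collections import defaultdict

def make_control_labels(sentences, num_tags=17, seed=0):
    """Map each word *type* to a fixed random tag, so the 'task' has no
    linguistic content but is still learnable by memorizing word identity."""
    rng = random.Random(seed)
    word_to_tag = defaultdict(lambda: rng.randrange(num_tags))
    return [[word_to_tag[w] for w in sent] for sent in sentences]

# Train the same probe on these labels; if it also does well here, high accuracy
# alone can't be evidence that the representation encodes the linguistic property.
```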

And there are causal studies that try to extend this work. It's much more difficult, and it's a fascinating line of future work; do read that paper. Now, in my last two minutes, I want to talk about recasting model tweaks and ablations as analysis. So we had this improvement process where we had a network that was going to work OK.

And we would see whether we could tweak it in simple ways to improve it. And then you could see whether you could remove anything and have it still be OK. And that's kind of like analysis. I have my network: is it going to be better if it's more complicated?

Is it going to be better if it's simpler? Can I get away with it being simpler? So one example of some folks who did this is that they took this idea of multi-headed attention and asked: so many heads; are all the heads important? And what they showed is that if you train a system with multi-headed attention and then just remove some of the heads at test time, not retraining at all, you can actually still do pretty well on the original task without those attention heads, showing that they weren't important.
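Here's a rough sketch of what "removing heads at test time" can look like with the Hugging Face transformers API, which accepts a head mask at inference. The checkpoint name and the particular heads being zeroed out are arbitrary choices for illustration (and an actual study would use a fine-tuned checkpoint), not the setup from the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; a real experiment would load a model fine-tuned on the task.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").eval()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

num_layers = model.config.num_hidden_layers
num_heads = model.config.num_attention_heads
head_mask = torch.ones(num_layers, num_heads)
head_mask[0, :4] = 0.0   # e.g. silence the first four heads of layer 0 at test time

inputs = tokenizer("The boy greets the dog.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs, head_mask=head_mask).logits   # predictions without those heads
```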

You could just get rid of them after training. And likewise, you can do the same thing elsewhere; that was on machine translation, and this is on MultiNLI. You can actually get away without a large percentage of your attention heads. Let's see. Yeah, so another thing that you could think about is questioning the basics of the models that we're building.

So we have transformer models that are self-attention, feed-forward, self-attention, feed-forward. But why in that order, with some of the components omitted here? This paper asked that question and said, if this is my transformer, self-attention, feed-forward, self-attention, feed-forward, et cetera, what if I just reordered it so that I had a bunch of self-attention layers at the front and a bunch of feed-forward layers at the back?
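To make the reordering idea concrete, here's a toy sketch that builds a stack of sublayers from a string of 's' (self-attention) and 'f' (feed-forward) characters. It leaves out residual connections, layer norm, and masking, and it is not the paper's implementation; it's just meant to show what "same sublayers, different order" means.

```python
import torch.nn as nn

def build_stack(order, d_model=512, n_heads=8, d_ff=2048):
    """Build a stack of sublayers from a string like 'sfsfsf' (interleaved)
    or 'sssfff' (all self-attention first, then all feed-forward)."""
    layers = nn.ModuleList()
    for ch in order:
        if ch == "s":
            layers.append(nn.MultiheadAttention(d_model, n_heads, batch_first=True))
        elif ch == "f":
            layers.append(nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)))
    return layers

# baseline  = build_stack("sf" * 8)           # alternating, as in a standard transformer
# reordered = build_stack("s" * 8 + "f" * 8)  # self-attention first, feed-forward last
```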

And they tried a bunch of these orderings. And this one actually does better: it achieves a lower perplexity on a benchmark. So this is a way of analyzing what's important about the architectures that I'm building and how they can be changed in order to perform better. So, neural models are very complex.

And they're difficult to characterize, and impossible to characterize with a single statistic like your test-set accuracy, especially in-domain. We want to find intuitive descriptions of model behaviors, but we should look at multiple levels of abstraction, and none of them is going to be complete.

When someone tells you that their neural network is interpretable, I encourage you to engage critically with that. It's not necessarily false. But the levels of interpretability and what you can interpret, these are the questions that you should be asking. Because it's going to be opaque in some ways, almost definitely.

And then bring this lens to your model building as you try to think about how to build better models, even if you're not going to be doing analysis as sort of one of your main driving goals. And with that, good luck on your final projects. I realize we're at time.

The teaching staff is really appreciative of your efforts over this difficult quarter. And yeah, there's a lecture left on Thursday, but this is my last one. So thanks, everyone.