Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 17 - Model Analysis and Explanation
Chapters
0:00 Introduction
4:07 Why Care
6:19 Model biases
8:28 Deep model analysis
12:47 Natural language inference
14:48 HANS
17:40 How do models perform
19:27 Linguistic properties
24:38 Error rates
26:23 Examples
27:40 Questions
31:50 Unit Testing
36:19 Language Models
37:31 Long Term Memory
40:05 Saliency Maps
42:02 Simple Gradient Method
47:29 Example from SQuAD
48:52 Example from Quest
50:34 Breaking Models
53:22 Robust to Noise
56:23 Attention
00:00:00.000 |
Welcome to CS224N, lecture 17, Model Analysis and Explanation. 00:00:21.760 |
We have updated the policy on the guest lecture reactions. 00:00:30.600 |
You can't use late days for this, so please get them in. 00:00:39.680 |
And you get something like half a point for each of them. 00:00:42.360 |
And yeah, all three can be submitted up through Friday. 00:00:55.120 |
And let me emphasize that there's a hard deadline 00:01:10.920 |
are submitted after the 4:30 deadline on Friday. 00:01:15.600 |
We need to get these graded and get grades in. 00:01:24.320 |
are us giving you help on the final projects. 00:01:32.600 |
awesome questions in lecture and in office hours and on Ed. 00:01:40.420 |
of my favorite subjects in natural language processing. 00:01:47.040 |
So first, we're going to do what I love doing, 00:01:49.080 |
which is motivating why we want to talk about the topic at all. 00:01:59.040 |
to perform different kinds of analysis on it. 00:02:02.000 |
We'll talk about out-of-domain evaluation sets. 00:02:05.080 |
So this will feel familiar to the robust QA folks. 00:02:13.680 |
for a given example, why did it make the decision that it made? 00:02:19.520 |
Can we come up with some sort of interpretable explanation for it? 00:02:30.880 |
the vectors that are being built throughout the processing 00:02:34.200 |
of the model, try to figure out if we can understand 00:02:41.400 |
And then we'll actually come back to one of the default 00:02:50.160 |
removing things from models, seeing how it performs, 00:02:55.280 |
doing in this lecture, show how it's not all that different. 00:02:58.360 |
So if you haven't seen this XKCD, now you have. 00:03:09.600 |
So person A says, this is your machine learning system. 00:03:19.120 |
and then collect the answers on the other side. 00:03:46.240 |
But frequently, what have the TAs told everyone in office hours? 00:03:51.040 |
have to try it and see if it's going to work out 00:03:55.080 |
It's very, very difficult to understand our models 00:04:08.040 |
So beyond it being important because it's in an XKCD 00:04:13.640 |
comic, why should we care about understanding our models? 00:04:18.360 |
One is that we want to know what our models are doing. 00:04:33.440 |
You have an input sentence, say, and then some output 00:04:37.680 |
Maybe this black box is actually your final project model, 00:04:48.840 |
And in your final projects, you'll summarize your model 00:04:51.040 |
with sort of one or a handful of summary metrics of accuracy 00:05:09.520 |
So we want to sort of know what our models are doing, OK. 00:05:17.080 |
So today, when you're building models in this class 00:05:21.120 |
at the company, you start out with some kind of recipe 00:05:27.440 |
or because you have experience from this class. 00:05:33.440 |
And then over time, you take what works, maybe, 00:05:39.520 |
So it seems like maybe adding another layer to the model 00:05:43.760 |
And maybe that's a nice tweak, and the model performance 00:05:48.880 |
And incremental progress doesn't always feel exciting. 00:05:53.920 |
But I want to pitch to you that it's actually 00:05:58.320 |
how much incremental progress can kind of get us 00:06:03.080 |
So that we can have a better job of evaluating 00:06:06.720 |
when we need big leaps, when we need major changes, 00:06:12.080 |
attacking with our incremental sort of progress, 00:06:20.320 |
Another thing that is, I think, very related to 00:06:29.480 |
So let's say you take your Word2Vec analogies solver 00:06:34.040 |
from GloVe or Word2Vec, that is, from assignment one, 00:06:39.720 |
and you give it the analogy, man is to computer programmer 00:06:50.840 |
You should be like, wow, well, I'm glad I know that now. 00:06:55.160 |
And of course, you saw the lecture from Yulia Tsvetkov 00:07:10.400 |
So that's the kind of thing that you can also 00:07:12.600 |
do with model analysis beyond just making models better 00:07:15.320 |
according to some sort of summary metric as well. 00:07:22.800 |
And this is something that I think is super important. 00:07:25.120 |
We don't just want to look at that time scale. 00:07:30.200 |
We want to say, what about 10, 15, 25 years from now? 00:07:37.640 |
What can be learned by language model pre-training? 00:07:41.080 |
What's the model that will replace the transformer? 00:07:43.920 |
What's the model that will replace that model? 00:07:48.000 |
What are we sort of attacking over and over again 00:07:52.880 |
What do neural models tell us about language potentially? 00:08:12.720 |
What can't be learned via language model pre-training? 00:08:15.140 |
So that's sort of the complementary question there. 00:08:19.680 |
you can learn via language model pre-training, 00:08:22.240 |
is there stuff that we need total paradigm shifts 00:08:32.160 |
of trying to really deeply understand our models 00:08:40.440 |
And one thing that I want you to take away from it 00:08:49.440 |
some kind of intuition or something, but none of them 00:08:54.640 |
understand 100% about what this model is doing now. 00:09:06.420 |
I think you should sort of start out by thinking about is, 00:09:20.600 |
to estimate the probabilities of start and end indices 00:09:25.960 |
or you've trained a language model that assigns probabilities 00:09:30.480 |
You can just look at the model as that object. 00:09:37.480 |
You are not looking into it any further than the fact 00:09:45.000 |
So that's like, who even cares if it's a neural network? 00:09:53.240 |
Another level of abstraction that you can look at, 00:10:02.000 |
You've got sort of maybe your transformer encoder 00:10:23.300 |
You've got the connections in the computation graph. 00:10:26.120 |
So now you're sort of trying to remove all of the abstraction 00:10:29.560 |
that you can and look at as many details as possible. 00:10:33.640 |
of looking at your model and performing analysis 00:10:38.400 |
sort of travel slowly from one to two to three 00:10:47.080 |
So we haven't actually talked about any analyses yet. 00:10:59.520 |
So would we want to see, will my model perform well? 00:11:10.000 |
And so we don't really care about mechanisms yet. 00:11:17.400 |
Instead, we're just interested in sort of the more higher 00:11:28.540 |
that we are already doing and sort of recast it 00:11:37.340 |
So you've got input/output pairs of some kind. 00:11:53.900 |
And this is just what we've been doing this whole time. 00:11:56.300 |
It's your test set accuracy, or F1, or BLEU score. 00:11:59.940 |
And so you've got some model with some accuracy. 00:12:04.620 |
And maybe it's better than some model with some other accuracy 00:12:10.640 |
iterating on your models in your final project as well. 00:12:22.020 |
And so maybe I'll choose model A to keep working on. 00:12:24.740 |
Maybe I'll choose it if you were putting something 00:12:28.620 |
But remember back to this idea that it's just one number 00:12:37.940 |
how it's going to perform in a wide variety of settings. 00:12:52.020 |
testing on exactly the same type of data that we trained on? 00:12:56.300 |
So now we're asking, did the model learn something 00:12:58.900 |
such that it's able to sort of extrapolate or perform 00:13:02.340 |
how I want it to on data that looks a little bit different 00:13:06.100 |
And we're going to take the example of natural language 00:13:08.980 |
So to recall the task of natural language inference-- 00:13:16.860 |
He turned and saw John sleeping in his half tent. 00:13:31.780 |
whether the hypothesis is sort of implied by the premise 00:13:39.500 |
if the hypothesis is John was awake, for example, 00:13:45.840 |
Neutral, if sort of both could be true at the same time, 00:13:51.900 |
it seems like they're saying that the premise implies 00:14:11.780 |
we think we want it to be doing in order to perform 00:14:19.580 |
who gathered the data set will have asked humans 00:14:38.820 |
as actually performing the task more broadly in the right way. 00:14:45.460 |
So what if the model is not doing something smart, 00:14:54.740 |
seem like things the model should be able to do to test 00:15:03.100 |
So HANS is the Heuristic Analysis for NLI Systems data set, 00:15:11.860 |
and test whether they're using some simple syntactic 00:15:16.140 |
What we'll have in each of these cases, we'll have some heuristic. 00:15:30.260 |
all hypotheses constructed from words in the premise. 00:15:40.820 |
And then the hypothesis is the doctor paid the actor. 00:15:43.740 |
And you'll notice that in bold here, we've got the doctor, 00:15:52.460 |
you will think that the doctor was paid by the actor 00:16:10.500 |
So here, if the model assumes that the premise entails 00:16:19.100 |
So this example is the doctor near the actor danced. 00:16:28.300 |
The doctor is doing the dancing near the actor 00:16:42.420 |
And here's another one that's a lot like subsequence. 00:16:45.940 |
But so if the model thinks that the premise entails 00:16:50.660 |
all complete subtrees-- so this is like fully formed phrases. 00:16:55.660 |
So the artist slept here is a fully formed subtree. 00:17:03.960 |
Does it entail the hypothesis the artist slept? 00:17:10.980 |
That does not entail it, because this is inside the conditional. 00:17:20.460 |
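To make that lexical overlap heuristic concrete, here is a rough sketch (my own toy illustration, not code from the HANS paper) of the shortcut that HANS is designed to expose:

```python
def lexical_overlap_heuristic(premise: str, hypothesis: str) -> str:
    """Toy shortcut: predict 'entailment' whenever every hypothesis word
    also appears in the premise -- the heuristic HANS is built to catch."""
    premise_words = {w.strip(".").lower() for w in premise.split()}
    hypothesis_words = {w.strip(".").lower() for w in hypothesis.split()}
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"

# Every hypothesis word appears in the premise, so the shortcut says "entailment",
# even though the premise does not actually entail the hypothesis.
print(lexical_overlap_heuristic("The doctor was paid by the actor.",
                                "The doctor paid the actor."))
```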
Anyone unclear about how this sort of evaluation 00:17:49.340 |
from the same paper that released the data set. 00:17:51.860 |
So they took four strong MultiNLI models 00:17:57.660 |
So the accuracies here are something between 60% and 80% 00:18:04.660 |
And in domain, in that first setting that we talked about, 00:18:13.820 |
And that is sort of what we said before about it 00:18:19.820 |
And when we evaluate on HANS, in this setting 00:18:24.740 |
here, we have examples where the heuristics we talked about 00:18:37.020 |
And then if we evaluate the model in the settings 00:18:40.700 |
where if it uses the heuristic, it gets the examples wrong, 00:18:55.540 |
They're not complex in our own idea of complexity. 00:19:03.180 |
And so this is why it feels like a clear failure of the system. 00:19:08.420 |
Now, you can say, though, that, well, maybe the training data 00:19:11.780 |
sort of didn't have any of those sort of phenomena. 00:19:14.740 |
So the model couldn't have learned not to do that. 00:19:18.060 |
And that's sort of a reasonable argument, except, well, 00:19:20.700 |
Bert is pre-trained on a bunch of language text. 00:19:23.540 |
So you might expect, you might hope that it does better. 00:19:26.380 |
So we saw that example of models performing well 00:19:37.380 |
on examples that are like those that it was trained on, 00:19:49.380 |
Now we're going to take this idea of having a test 00:19:52.340 |
set that we've carefully crafted and go in a slightly 00:19:57.260 |
mean to try to understand the linguistic properties 00:20:03.380 |
was one thing for natural language inference. 00:20:08.260 |
whether they think certain things are sort of right 00:20:14.300 |
And the first way that we'll do this is we'll ask, well, 00:20:21.260 |
How do we evaluate their sort of preferences about language? 00:20:30.580 |
that you've got one sentence that sounds OK to a speaker. 00:20:34.740 |
So this sentence is, the chef who made the pizzas is here. 00:20:39.660 |
We'd call it an acceptable sentence, at least to me. 00:20:43.700 |
And then with a small change, a minimal change, 00:21:01.260 |
In English, present tense verbs agree in number 00:21:03.460 |
with their subject when they are third person. 00:21:33.180 |
that there's this noun pizzas, which is plural, 00:21:36.580 |
closer linearly, comes back to dependency parsing. 00:21:42.140 |
And what this looks like in the tree structure 00:21:45.060 |
is, well, chef and is are attached in the tree. 00:22:02.500 |
So this is a pretty sort of basic and interesting property 00:22:05.660 |
of language that also reflects the syntactic sort 00:22:11.060 |
So we've been training these language models, 00:22:12.900 |
sampling from them, seeing that they get interesting things. 00:22:15.740 |
And they tend to seem to generate syntactic content. 00:22:21.980 |
behave as if it understands this idea of agreement more broadly? 00:22:28.380 |
so that it matches the subjects and the verbs? 00:22:33.860 |
exactly whether they think that a sentence is good or bad. 00:22:36.980 |
They just tell us the probability of a sentence. 00:22:40.300 |
So before, we had acceptable and unacceptable. 00:22:49.780 |
to the acceptable sentence in the minimal pair? 00:22:52.180 |
So you have the probability under the model of the chef who made the pizzas is here, versus the probability 00:22:59.980 |
under the model of the chef who made the pizzas are here. 00:23:03.740 |
And you want this probability here to be higher. 00:23:08.020 |
And if it is, that's sort of like a simple way 00:23:10.500 |
to test whether the model got it right effectively. 00:23:15.460 |
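As a rough sketch of what that comparison looks like in code (here with a GPT-2 checkpoint from Hugging Face purely as an illustrative stand-in; the work discussed next trains its own LSTM language models):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of the sentence under the language model."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # .loss is the mean per-token cross-entropy; scale back to a total.
        loss = lm(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

acceptable = "The chef who made the pizzas is here."
unacceptable = "The chef who made the pizzas are here."
# The minimal-pair test: the acceptable sentence should get higher probability.
print(sentence_logprob(acceptable) > sentence_logprob(unacceptable))
```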
And just like in HANS, we can develop a test set 00:23:15.460 |
like pizzas, that are going to make it difficult. 00:23:39.340 |
to get it wrong, because this example is very simple. 00:23:44.860 |
So we can create, or we can look for sentences that have-- 00:23:49.940 |
these are the things called attractors in the sentence. 00:24:03.940 |
Can language models sort of very generally handle 00:24:08.500 |
So we can take examples with zero attractors, 00:24:11.340 |
see whether the model gets the minimal pairs evaluation right. 00:24:14.540 |
We can take examples with one attractor, two attractors. 00:24:18.340 |
You can see how people would still reasonably understand 00:24:21.820 |
Chef who made the pizzas and prepped the ingredients is. 00:24:26.460 |
And then on and on and on, it gets rarer, obviously. 00:24:34.180 |
that's intended to evaluate this very specific linguistic 00:24:43.140 |
trained an LSTM language model on a subset of Wikipedia 00:24:47.900 |
And they evaluated it sort of in these buckets 00:24:50.540 |
that are specified by the paper that sort of introduced 00:25:06.140 |
And so in this table here that you're about to see, 00:25:19.660 |
So if you were just to do random or majority class, 00:25:23.220 |
Oh, sorry, it's the percent of times that you get it wrong. 00:25:29.780 |
And so with no attractors, you get very low error rates. 00:25:33.460 |
So this is 1.3 error rate with a 350-dimensional LSTM. 00:25:38.940 |
And with one attractor, your error rate is higher. 00:25:50.220 |
The larger the LSTM, it looks like in general, 00:25:56.460 |
And then even on very difficult examples with four attractors, 00:26:00.220 |
which try to think of an example in your head, 00:26:02.420 |
like the chef made the pizzas and took out the trash. 00:26:10.340 |
so it gets more difficult. But it's still relatively low. 00:26:16.900 |
models are actually performing subject-verb number agreement 00:26:31.960 |
But I think, actually, the errors are quite interesting. 00:26:35.900 |
The ship that the player drives has a very high speed. 00:26:41.320 |
Now, this model thought that was less probable than the ship 00:26:45.100 |
that the player drives have a very high speed. 00:26:50.940 |
My hypothesis is that it sort of misanalyzes drives 00:27:12.520 |
So here, five paragraphs acts as a singular noun phrase. 00:27:20.340 |
But the model thought that it was more likely to say 00:27:26.380 |
because it's referring to this sort of five paragraphs 00:27:33.380 |
as opposed to a single unit of length describing the lead. 00:27:59.180 |
for other tasks, such as Q&A, classification? 00:28:07.580 |
So yes, I think that it's easier to do this kind of analysis 00:28:11.260 |
for the HANS-style analysis with question answering 00:28:18.340 |
and other sorts of tasks, because you can construct 00:28:22.140 |
examples that similarly have these heuristics 00:28:32.820 |
and then have the answer depend on the syntax or not. 00:28:43.300 |
But the idea that you can develop bespoke test 00:28:48.700 |
sets for various tasks, I think, is very, very general 00:28:54.140 |
and something I think is actually quite interesting. 00:28:59.860 |
So I won't go on further, but I think the answer is just yes. 00:29:07.380 |
How do you know where to find these failure cases? 00:29:10.180 |
Maybe that's the right time to advertise linguistics classes. 00:29:19.740 |
How do you know where to find these failure cases? 00:29:24.100 |
Yes, how do we know where to find the failure cases? 00:29:33.740 |
is interesting about things in language is one way to do it. 00:29:39.500 |
I mean, the heuristics that we saw in our language model-- 00:29:53.620 |
if the model was sort of ignoring facts about language 00:29:56.780 |
and sort of just doing this sort of rough bag of words 00:29:59.540 |
with some extra magic, then it would do about as badly 00:30:10.540 |
that this statement, if the artist slept, the actor ran, 00:30:13.260 |
does not imply the artist slept, is the kind of thing 00:30:18.380 |
but also you'd spend time sort of pondering about and thinking 00:30:22.760 |
broad thoughts about in linguistics curricula as well. 00:30:35.940 |
So there's also-- well, I guess someone was also saying-- 00:30:41.020 |
I think it's about the sort of intervening verbs example-- 00:30:46.660 |
But the data set itself probably includes mistakes 00:30:55.540 |
Yeah, because humans make more and more mistakes 00:31:03.880 |
On the other hand, I think that the mistakes are fewer 00:31:13.560 |
But yeah, it would be interesting to actually go 00:31:15.520 |
through that test set and see how many of the errors 00:31:22.360 |
due to the sort of observed form being incorrect. 00:32:03.500 |
And in fact, this sort of idea has been brought to the fore, 00:32:18.380 |
that I'm citing at the bottom suggests this minimum functionality test idea. 00:32:23.220 |
You want a small test set that targets a specific behavior. 00:32:30.820 |
But in this case, we're going to get even more specific. 00:32:36.820 |
We're going to have an expected label, what was actually 00:32:40.220 |
predicted, whether the model passed this unit test. 00:32:43.660 |
And the labels are going to be sentiment analysis here. 00:32:57.780 |
I, then a negation, a positive verb, and then the thing. 00:33:14.660 |
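A minimal sketch of that kind of templated unit test; `predict_sentiment` below is a hypothetical stand-in for whatever model you are testing:

```python
from itertools import product

negations = ["can't say I", "don't think I", "would never say I"]
positive_verbs = ["love", "like", "recommend"]
things = ["the food", "the flight", "the service"]

def negated_positive_failure_rate(predict_sentiment) -> float:
    """Minimum functionality test: 'I {negation} {positive verb} {thing}.'
    should be labeled negative; return the fraction of failures."""
    failures, total = 0, 0
    for neg, verb, thing in product(negations, positive_verbs, things):
        sentence = f"I {neg} {verb} {thing}."
        total += 1
        if predict_sentiment(sentence) != "negative":
            failures += 1
    return failures / total

# A keyword-counting baseline fails every single case in this test.
naive = lambda s: "positive" if any(w in s for w in positive_verbs) else "negative"
print(negated_positive_failure_rate(naive))  # 1.0
```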
and this is, I think, a commercial sentiment analysis 00:33:29.820 |
And this commercial sentiment analysis system 00:33:41.500 |
showed is that they could actually provide a system that 00:33:44.700 |
sort of had this framework of building test cases for NLP 00:33:48.300 |
models to ML engineers working on these products 00:34:01.900 |
find bugs in their models that they could then 00:34:08.380 |
of trying to find things that were simple and still wrong 00:34:11.300 |
with what should be pretty sophisticated neural systems. 00:34:17.660 |
And it's sort of a nice way of thinking more specifically 00:34:21.180 |
about what are the capabilities in sort of precise terms 00:34:33.380 |
You've seen language models actually perform pretty well 00:34:38.860 |
you just saw an example of a commercial sentiment analysis 00:34:41.740 |
system that sort of should do better and doesn't. 00:34:50.180 |
is if you get high accuracy on the in-domain test set, 00:34:58.980 |
what you might consider to be reasonable out-of-domain 00:35:08.180 |
And if you're building a system that will be given to users, 00:35:11.980 |
it's immediately out of domain, at the very least 00:35:15.620 |
now older than the things that the users are now saying. 00:35:23.300 |
is a single number that does not guarantee good performance 00:35:28.060 |
And from a what are our neural networks doing perspective, 00:35:32.100 |
one way to think about it is that models seem 00:35:36.300 |
sort of the fine-grained heuristics and statistics that 00:35:44.580 |
So humans can perform natural language inference. 00:35:46.980 |
If you give them examples from whatever data set, 00:35:55.260 |
But you take your MNLI model, and you test it on HANS, 00:35:55.260 |
and it got whatever that was, below chance accuracy. 00:36:03.100 |
That's not the kind of thing that you want to see. 00:36:23.700 |
that they encode about the world through pre-training. 00:36:26.380 |
And one of the ways that we saw it interact with language 00:36:39.220 |
requires you to access knowledge about where Dante was born. 00:36:45.900 |
but this fits into the set of behavioral studies 00:36:57.140 |
You could swap out born in for, I don't know, 00:37:16.580 |
of interacting with your models, this sort of behavioral study 00:37:20.980 |
can be very, very general, even though, remember, 00:37:23.820 |
we're at still this highest level of abstraction, 00:37:26.900 |
where we're just looking at the probability distributions that 00:37:29.860 |
All right, so now we'll go into-- so we've sort of looked 00:37:41.540 |
What about sort of why for an individual input 00:37:55.980 |
So one study that I love to reference that really draws 00:38:00.000 |
back into our original motivation of using LSTM 00:38:04.380 |
networks instead of simple recurrent neural networks 00:38:10.420 |
But how long is your long short-term memory? 00:38:23.140 |
that are farther than some k words away, changing k. 00:38:29.140 |
And if the accuracy, if the predictive ability 00:38:37.820 |
means the model wasn't actually using that context. 00:38:42.100 |
So on the x-axis, we've got how far away from the word 00:38:48.260 |
Are you actually sort of corrupting, shuffling, 00:38:54.140 |
And then on the y-axis is the increase in loss. 00:39:00.460 |
it means that the model was not using the thing 00:39:11.460 |
if you shuffle the history that's farther away than 50 words, 00:39:20.080 |
One, it says everything past 50 words of this LSTM language 00:39:23.620 |
model, you could have given it in random order, 00:39:28.500 |
And then two, it says that if you're closer than that, 00:39:36.740 |
And then if you actually remove the words entirely, 00:39:45.660 |
So you don't care about the order they're in, 00:39:54.800 |
Well, this model at least has effectively no longer-term memory than about 50 words of context. 00:40:03.860 |
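A rough sketch of that kind of ablation, written for a Hugging Face-style causal LM purely for illustration (the original study used its own LSTM language models; `lm` and `token_ids` are stand-ins):

```python
import random
import torch

def loss_with_shuffled_far_context(lm, token_ids, k):
    """Negative log-probability of the final token when all context farther
    than k tokens back has been shuffled. If this doesn't rise relative to the
    intact context, the model wasn't really using that far-away history.
    `token_ids` is a plain list of vocabulary ids; `lm` is a causal LM."""
    context, target = token_ids[:-1], token_ids[-1]
    far, near = context[:-k], context[-k:]
    random.shuffle(far)                       # corrupt only the distant history
    ids = torch.tensor([far + near + [target]])
    with torch.no_grad():
        logits = lm(ids).logits               # (1, T, vocab)
    # Logits at position -2 are the model's prediction for the final token.
    return -torch.log_softmax(logits[0, -2], dim=-1)[target].item()
```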
So that's a general study for a single model. 00:40:14.580 |
But we want to talk about individual predictions 00:40:19.340 |
So one way of interpreting why did my model make 00:40:23.860 |
this decision that's very popular is, for a single 00:40:27.180 |
example, what parts of the input actually led to the decision? 00:40:31.340 |
And this is where we come in with saliency maps. 00:40:47.580 |
The [MASK] rushed to the emergency room to see her patient. 00:40:57.300 |
It's going to be nurse that's here at the [MASK] instead, 00:41:01.060 |
or maybe woman, or doctor, or mother, or girl. 00:41:04.580 |
And then the saliency map is being visualized here in orange. 00:41:09.740 |
called simple gradients, which we'll get into, 00:41:17.900 |
But emergency and her are the important words, apparently. 00:41:21.860 |
And the SEP token shows up in every sentence. 00:41:25.820 |
and so these two together are, according to this method, 00:41:29.380 |
what's important for the model to make this prediction at the [MASK]. 00:41:33.420 |
And you can see maybe some statistics, biases, et cetera, 00:41:39.100 |
and then have it mapped out onto the sentence. 00:41:41.820 |
And this is-- well, it seems like it's really 00:41:47.060 |
And yeah, I think that this is a very useful tool. 00:41:52.580 |
Actually, this is part of a demo from AllenNLP 00:41:56.300 |
that allows you to do this yourself for any sentence 00:42:05.660 |
We're not going to go-- there's so many ways to do it. 00:42:12.660 |
So the issue is, how do you define importance? 00:42:17.420 |
What does it mean to be important to the model's 00:42:28.300 |
And then you've got a model score for a given output class. 00:42:38.740 |
And then you take the norm of the gradient of the score, 00:42:48.620 |
is the unnormalized probability for that class. 00:43:05.340 |
if I move it a little bit in one direction or another? 00:43:08.380 |
And then you take the norm to get a scalar from a vector. 00:43:12.260 |
The salience of word i, you have the norm bars on the outside, 00:43:18.900 |
So that's if I change a little bit locally xi, 00:43:32.060 |
And that means it was very important to the decision. 00:43:46.980 |
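Here's a rough sketch of the simple gradients method on a Hugging Face sentiment classifier (the particular checkpoint is just an illustrative choice):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"   # illustrative model choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

inputs = tok("The movie was not good at all.", return_tensors="pt")
embeds = model.get_input_embeddings()(inputs["input_ids"])
embeds = embeds.detach().requires_grad_(True)      # treat word embeddings as the inputs x_i

logits = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"]).logits
score = logits[0, logits[0].argmax()]              # unnormalized score s_c for the predicted class
score.backward()

saliency = embeds.grad[0].norm(dim=-1)             # salience(x_i) = || d s_c / d x_i ||
for token, s in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), saliency):
    print(f"{token:>12s}  {s.item():.3f}")
```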
The word space is like sort of a flattening of the ability 00:43:51.700 |
to move your word embedding in 1,000 dimensional space. 00:43:54.740 |
So I've just plotted it here in one dimension. 00:44:00.880 |
can see that the relationship between what should be score 00:44:13.740 |
Low saliency, you move the word around locally, 00:44:27.680 |
because I could have changed it, and the score 00:44:46.620 |
It's not perfect, because, well, maybe your linear approximation 00:44:51.860 |
that the gradient gives you holds only very, very locally. 00:45:06.340 |
in either direction, the score would shoot up. 00:45:15.980 |
as opposed to anywhere else even sort of nearby in order 00:45:22.060 |
But the simple gradients method won't capture this, 00:45:36.420 |
And I think that is a good tool for the toolbox. 00:45:42.540 |
OK, so that is one way of explaining a prediction. 00:45:47.260 |
And it has some issues, like why are individual words being 00:45:53.100 |
scored, as opposed to phrases or something like that. 00:45:56.980 |
But for now, we're going to move on to another type 00:46:07.620 |
I mean, earlier on, there were a couple of questions. 00:46:21.520 |
is a methodologically rigorous way of determining 00:46:24.540 |
the importance that the model places on certain tokens? 00:46:27.960 |
It seems like there's some back and forth in the literature. 00:46:34.820 |
And I probably won't engage with that question 00:46:36.900 |
as much as I could if we had a second lecture on this. 00:46:40.660 |
I actually will provide some attention analyses 00:46:46.900 |
about why they can be interesting without being 00:46:53.380 |
sort of maybe the end all of analysis of where information 00:47:08.420 |
would have to get into in a much longer period of time. 00:47:11.580 |
But look at the slides that I show about attention 00:47:15.740 |
And let me know if that answers your question first, 00:47:17.900 |
because we have quite a number of slides on it. 00:47:28.340 |
So I think this is a really fascinating question, which 00:47:31.820 |
also gets at what was important about the input, 00:47:35.220 |
but in actually kind of an even more direct way, which 00:47:38.260 |
is, could I just keep some minimal part of the input 00:47:47.140 |
John Jacob Astor IV invested $100,000 for Tesla. 00:47:51.940 |
And then the answer that is being predicted by the model 00:47:54.220 |
is going to always be in blue in these examples, Colorado 00:47:59.860 |
And the question is, what did Tesla spend Astor's money on? 00:48:03.660 |
That's why the prediction is Colorado Springs Experiments. 00:48:06.020 |
The model gets the answer right, which is nice. 00:48:10.300 |
And we would like to think it's because it's doing 00:48:16.460 |
It turns out, based on this fascinating paper, 00:48:33.140 |
the model had sort of a 0.78 confidence probability 00:48:46.100 |
would not be able to know really what you're trying to ask about. 00:48:49.720 |
So it seems like some things are going really wonky here. 00:48:58.980 |
In fact, it actually references our input saliency methods. 00:49:03.180 |
So you iteratively remove non-salient or unimportant 00:49:08.980 |
So here's a passage again talking about football, 00:49:16.660 |
OK, so the question is, where did the Broncos practice 00:49:19.060 |
for the Super Bowl, with the prediction of Stanford 00:49:33.820 |
change this question such that I still get the answer right? 00:49:38.940 |
was least important according to a saliency method. 00:49:41.780 |
So now, it's where did the practice for the Super Bowl? 00:49:48.700 |
You don't even know which one you're asking about. 00:49:52.620 |
so confident in Stanford University makes no sense. 00:50:03.220 |
stops being confident in the answer, Stanford University. 00:50:16.620 |
reflecting the uncertainty that really should be there 00:50:19.660 |
because you can't know what you're even asking about. 00:50:23.420 |
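The input reduction loop itself is simple. Here's a hedged sketch, where `predict` and `saliency` are hypothetical stand-ins for the QA model's answer function and a per-token importance score like the gradient method above:

```python
def input_reduction(question_tokens, predict, saliency):
    """Repeatedly delete the least important question token as long as the
    model's prediction doesn't change. `predict` maps tokens -> answer string;
    `saliency` maps tokens -> one importance score per token."""
    original_answer = predict(question_tokens)
    tokens = list(question_tokens)
    while len(tokens) > 1:
        scores = saliency(tokens)
        least = min(range(len(tokens)), key=lambda i: scores[i])
        candidate = tokens[:least] + tokens[least + 1:]
        if predict(candidate) != original_answer:
            break                      # removing more would flip the prediction
        tokens = candidate
    return tokens                      # often a nonsensical remnant of the question
```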
OK, so what was important to make this answer? 00:50:35.900 |
All right, so that's sort of the end of the admittedly brief 00:50:45.340 |
Now, we're going to talk about actually breaking models 00:50:58.500 |
Super Bowl, age 39, past record held by John Elway. 00:51:12.060 |
Now, we're not going to change the question to try to sort 00:51:15.040 |
of make the question nonsensical while keeping the same answer. 00:51:22.540 |
by adding the sentence at the end, which really 00:51:25.540 |
This is quarterback, well-known quarterback, Jeff Dean, 00:51:34.700 |
But now, the prediction is Jeff Dean for our nice QA model. 00:51:44.020 |
seems like maybe there's this end of the passage bias 00:51:47.260 |
as to where the answer should be, for example. 00:51:52.900 |
where we flipped the prediction by adding something 00:52:01.700 |
that we had that seemed good is not actually performing QA 00:52:04.740 |
how we want it to, even though its in-domain accuracy was 00:52:12.220 |
So you've got this paragraph with a question, 00:52:19.620 |
The answer is increased scrutiny on teacher misconduct. 00:52:25.100 |
we're going to change the question in really, 00:52:32.740 |
So first, what HA-- now you've got this typo, L-- 00:52:42.420 |
Likely, a human would sort of ignore this typo or something 00:52:49.420 |
Instead of asking, what has been the result of this publicity, 00:52:52.700 |
if you ask, what's been the result of this publicity, 00:52:59.380 |
And this is-- the authors call this a semantically equivalent 00:53:05.700 |
And in general, swapping 'what has' for 'what's' in this QA model 00:53:13.100 |
And so again, when you go back and sort of re-tinker 00:53:31.060 |
Are humans robust to noise is another question we can ask. 00:53:34.100 |
And so you can kind of go to this popular sort of meme 00:53:38.740 |
passed around the internet from time to time, 00:53:41.620 |
where you have all the letters in these words scrambled. 00:53:44.900 |
You say, according to research at Cambridge University, 00:53:49.140 |
it doesn't matter in what order the letters in a word are. 00:54:01.380 |
And we can be robust as humans to reading and processing 00:54:05.060 |
the language without actually all that much of a difficulty. 00:54:10.140 |
So that's maybe something that we might want our models 00:54:19.020 |
Noise is a part of all NLP systems inputs at all times. 00:54:25.380 |
as having users, for example, and not having any noise. 00:54:32.540 |
on some popular machine translation models, where 00:54:36.300 |
you train machine translation models in French, German, 00:54:48.660 |
The idea is these are actually pretty strong machine 00:54:56.100 |
Now, if you add character swaps, like the ones 00:55:09.620 |
And even if you take a somewhat more natural typo noise 00:55:15.220 |
distribution here, you'll see that you're still 00:55:18.020 |
getting 20-ish, yeah, very high drops in BLEU score 00:55:27.900 |
And so maybe you'll go back and retrain the model on more typos 00:55:32.620 |
is it robust to even different kinds of noise? 00:55:34.820 |
These are the questions that are going to be really important. 00:55:39.120 |
able to break your model really easily so that you can then 00:55:57.580 |
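Here's a quick toy version of that kind of synthetic character-swap noise (my own sketch, not the exact noise model from the paper):

```python
import random

def swap_noise(sentence: str, p: float = 0.5) -> str:
    """With probability p per word, swap one random pair of adjacent inner
    characters -- a crude stand-in for typo-style noise."""
    noised = []
    for word in sentence.split():
        if len(word) > 3 and random.random() < p:
            i = random.randint(1, len(word) - 3)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noised.append(word)
    return " ".join(noised)

random.seed(0)
print(swap_noise("according to research at Cambridge University"))
```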
So now we're going to look at the representations 00:56:12.980 |
look more at the actual vector representations that 00:56:17.380 |
And we can answer a different kind of question, 00:56:20.460 |
at the very least, than with the other studies. 00:56:28.740 |
is that some modeling components lend themselves to inspection. 00:56:33.660 |
Now this is a sentence that I chose somewhat carefully, 00:56:43.220 |
But they lend themselves to inspection in the following way. 00:56:46.580 |
You can visualize them well, and you can correlate them easily 00:56:51.660 |
So let's say you have attention heads in BERT. 00:56:53.860 |
This is from a really nice study that was done here, 00:57:00.580 |
and you say, on most sentences, this attention head, head 1, 00:57:08.380 |
Simple kind of operation does this pretty consistently. 00:57:18.460 |
So it's the first layer, which means that this word found 00:57:29.300 |
is that once you do some rounds of attention, 00:57:32.820 |
you've had information mixing and flowing between words. 00:57:36.820 |
And how do you know exactly what information you're combining, 00:57:52.060 |
at sort of a local mechanistic point of view, 00:57:59.580 |
Some attention heads seem to perform simple operations. 00:58:05.500 |
Others seem to attend pretty robustly to the next token. 00:58:18.760 |
Maybe that's sort of splitting sentences together and things 00:58:25.340 |
that some attention heads seem to pretty robustly perform. 00:58:32.460 |
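You can poke at this yourself in a few lines. Here's a sketch that asks, for one arbitrarily chosen head, which token each position attends to most:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

inputs = tok("The chef who ran to the store was out of food.", return_tensors="pt")
with torch.no_grad():
    # One (1, num_heads, T, T) attention tensor per layer.
    attentions = model(**inputs, output_attentions=True).attentions

layer, head = 2, 0                                  # arbitrary head to inspect
attn = attentions[layer][0, head]                   # rows: attending token, cols: attended-to token
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    print(f"{token:>10s} -> {tokens[attn[i].argmax().item()]}")   # most-attended-to token
```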
what's actually represented at this period at layer 11? 00:58:43.900 |
with really interesting linguistic properties. 00:58:46.060 |
So this head is actually attending to noun modifiers. 00:58:59.980 |
Even if the model is not doing this as a causal mechanism 00:59:11.720 |
is we've got sort of an approximate interpretation 00:59:18.380 |
us to reason about very complicated model behavior. 00:59:29.600 |
And it seems like this head does a pretty OK job of actually 00:59:45.520 |
And so it does so with some percentage of the time. 00:59:49.960 |
And again, it's sort of connecting very complex model 00:59:52.240 |
behavior to these sort of interpretable summaries 01:00:00.240 |
Other cases, you can have individual hidden units 01:00:04.480 |
So here, you've got a character level LSTM language model. 01:00:20.640 |
very negative to very positive or very positive 01:00:31.760 |
and pretty robustly doing so across all of these sentences. 01:00:41.920 |
Here's another cell from that same LSTM language model 01:00:57.160 |
Here, you start with no quote, negative in the red, 01:01:05.560 |
Also, potentially a very useful feature to keep in mind. 01:01:08.000 |
And this is just an individual unit in the LSTM 01:01:10.200 |
that you can just look at and see that it does this. 01:01:19.080 |
and this is actually a study by some AI and neuroscience 01:01:25.120 |
is we saw the LSTMs were good at subject verb number agreement. 01:01:29.560 |
Can we figure out the mechanisms by which the LSTM is 01:01:40.400 |
But you have a sentence, "the boy gently and kindly 01:01:47.840 |
so it's an individual hidden unit, one dimension-- 01:01:52.320 |
is actually, after it sees boy, it sort of starts to go higher. 01:01:57.800 |
And then it goes down to something very small 01:02:02.360 |
And this cell seems to correlate with the scope of a subject 01:02:09.560 |
So here, "the boy that watches the dog that watches the cat 01:02:16.520 |
maintaining the scope of subject until greets, 01:02:23.480 |
Probably some complex other dynamics in the network. 01:02:27.320 |
But it's still a fascinating, I think, insight. 01:02:31.000 |
And yeah, this is just neuron 1,150 in this LSTM. 01:02:42.440 |
that you could do by picking out individual components 01:02:47.040 |
of the model that you can just take each one of 01:02:53.240 |
Now, we'll look at a general class of methods 01:03:04.160 |
of the type of coreference that we're looking for. 01:03:06.840 |
But instead of seeing if it correlates with something 01:03:13.920 |
to look into the vector representations of the model 01:03:19.120 |
by some simple function to say, oh, maybe this property was 01:03:23.440 |
made very easily accessible by my neural network. 01:03:30.720 |
got language data that goes into some big pre-trained 01:03:40.960 |
And so the question for the probing methodology 01:03:44.240 |
is, if it's providing these general purpose language 01:03:47.320 |
representations, what does it actually encode about language? 01:03:56.100 |
is learning about language that we seemingly now 01:04:00.480 |
And so you might have something like a sentence, 01:04:13.840 |
maybe some layers of self-attention and stuff. 01:04:22.560 |
So it's a vector per word or subword for every layer. 01:04:27.020 |
And the question is, can we use these linguistic properties, 01:04:32.380 |
had way back in the early part of the course, 01:04:35.380 |
to understand correlations between properties 01:04:41.040 |
in the vectors and these things that we can interpret? 01:04:53.500 |
So here in this sentence, I record the record. 01:05:19.000 |
Does the model encode that one is a verb and the other a noun? 01:05:26.200 |
So we're going to decide on a layer that we want to analyze. 01:05:47.680 |
to decode a property that I'm interested in really 01:05:53.800 |
So it's indicating that this property is easily 01:06:00.000 |
So maybe I train a linear classifier right on top of BERT. 01:06:08.880 |
And that's sort of interesting already, because you know, 01:06:13.920 |
that if you run a linear classifier on simpler features 01:06:34.440 |
And now I can say, oh, wow, look, by layer 2, 01:06:38.260 |
part of speech is more easily accessible to linear functions 01:06:45.200 |
Well, the self-attention and feed-forward stuff 01:06:49.360 |
That's interesting, because it's a statement about the information 01:07:05.960 |
So if you have the model's representations, h1 to ht, 01:07:13.800 |
So maybe you have a feed-forward neural network, 01:07:21.520 |
so you get some predictions for part of speech tagging 01:07:24.880 |
That's just the probe applied to the hidden state of the model. 01:07:34.760 |
So that's just written out, not as pictorially. 01:07:38.560 |
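A minimal probing sketch: frozen BERT features with a linear classifier on top. The layer choice and the four-sentence "dataset" are purely illustrative:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
LAYER = 7                                      # which layer's representations to probe

def word_features(sentence: str, word_index: int):
    """Frozen hidden state at LAYER for one whitespace-separated word."""
    enc = tok(sentence.split(), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        states = bert(**enc, output_hidden_states=True).hidden_states[LAYER][0]
    piece = enc.word_ids().index(word_index)   # first subword of that word
    return states[piece].numpy()

# Toy probe: is part of speech linearly accessible from these vectors?
data = [("I record the record", 1, "VERB"), ("I record the record", 3, "NOUN"),
        ("They permit the permit", 1, "VERB"), ("They permit the permit", 3, "NOUN")]
X = [word_features(s, i) for s, i, _ in data]
y = [label for _, _, label in data]
probe = LogisticRegression(max_iter=1000).fit(X, y)    # the probe is just a linear model
print(probe.score(X, y))
```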
So I'm not going to stay on this for too much longer. 01:07:44.200 |
And it may help in the search for causal mechanisms, 01:07:48.480 |
but it sort of just gives us a rough understanding 01:07:57.000 |
So one result is that BERT, if you run linear probes on it, 01:08:07.600 |
Actually, in some cases, approximately as well as just 01:08:10.600 |
doing the very best thing you could possibly do without BERT. 01:08:15.440 |
So it just makes easily accessible, amazingly strong 01:08:19.920 |
And that's an interesting sort of emergent quality of BERT, 01:08:26.000 |
It seems like as well that the layers of BERT 01:08:31.200 |
so if you look at the columns of this plot here, 01:08:37.000 |
You've got input words at the sort of layer 0 of BERT here. 01:08:50.240 |
but consistently, the best place to read out these properties 01:08:53.880 |
is somewhere a bit past the middle of the model, which 01:08:57.240 |
is this very consistent rule, which is fascinating. 01:09:04.160 |
look at this function of increasingly abstract 01:09:11.400 |
an increasing depth in the network on that axis. 01:09:19.360 |
can access more and more abstract linguistic properties, 01:09:26.700 |
constructed over time by the layers of processing of BERT. 01:09:30.080 |
So it's building more and more abstract features, which 01:09:33.160 |
I think is, again, a really interesting result. 01:09:41.440 |
to mind that really brings us back right to day one 01:09:48.840 |
We were asking, what does each dimension of Word2Vec mean? 01:09:56.840 |
and think about properties of it through these connections 01:10:00.640 |
between simple mathematical properties of Word2Vec 01:10:04.320 |
and linguistic properties that we could understand. 01:10:08.040 |
So we had this approximation, which is not 100% true. 01:10:11.400 |
But it's an approximation that says cosine similarity is 01:10:15.760 |
effectively correlated with semantic similarity. 01:10:23.560 |
to do at the end of the day is fine tune these word 01:10:27.720 |
Likewise, we had this idea about the analogies being 01:10:47.520 |
interpret the individual dimensions of Word2Vec, 01:10:56.840 |
and simple math on these objects is fascinating. 01:11:00.520 |
And so one piece of work that extends this idea 01:11:14.560 |
we showed that actually BERTs and models like it 01:11:17.840 |
make dependency parse tree structure emergent, 01:11:28.400 |
the chef who ran to the store was out of food, what you can 01:11:38.920 |
So you've got the number of edges in the tree between two 01:11:44.160 |
So you've got that the distance between chef and was is 1. 01:11:48.240 |
And we're going to use this interpretation of a tree 01:11:50.320 |
as a distance to make a connection with BERT's 01:11:54.840 |
And what we were able to show is that under a single linear 01:11:57.800 |
transformation, the squared Euclidean distance between BERT 01:12:02.000 |
vectors for the same sentence actually correlates well with the distances in the parse tree. 01:12:12.280 |
So here in this Euclidean space that we've transformed, 01:12:16.440 |
the approximate distance between chef and was is also 1. 01:12:20.960 |
Likewise, the distance between was and store in the tree is 4. 01:12:25.960 |
And in my simple transformation of BERT space, 01:12:29.560 |
the distance between store and was is also approximately 4. 01:12:33.440 |
And this is true across a wide range of sentences. 01:12:36.480 |
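In code, that probe's distance is just a learned linear map B followed by a squared Euclidean distance. Here's a sketch of the distance computation (training B to match tree distances is omitted; shapes are illustrative):

```python
import torch

def probe_distances(hidden_states: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Pairwise squared distances d(i, j) = || B (h_i - h_j) ||^2 for one sentence.
    hidden_states: (T, d) BERT vectors; B: (k, d) learned linear transformation."""
    transformed = hidden_states @ B.T                              # (T, k)
    diffs = transformed.unsqueeze(1) - transformed.unsqueeze(0)    # (T, T, k)
    return (diffs ** 2).sum(-1)                                    # predicted tree distances

# Toy shapes: 9 words, 768-dim vectors, rank-64 probe.
h, B = torch.randn(9, 768), torch.randn(64, 768)
print(probe_distances(h, B).shape)                                 # torch.Size([9, 9])
```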
And this is, to me, a fascinating example of, 01:12:39.880 |
again, emergent approximate structure in these very 01:12:43.400 |
nonlinear models that don't necessarily need to encode 01:12:56.640 |
are, I think, interesting and point us in directions 01:13:01.680 |
But they're not arguments that the model is actually 01:13:03.800 |
using the thing that you're finding to make a decision. 01:13:12.000 |
So in some work that I did around the same time, 01:13:15.960 |
we showed actually that certain conditions on probes 01:13:19.440 |
allow you to achieve high accuracy on a task that's 01:13:31.000 |
be doing with this thing that is somehow easily accessible. 01:13:34.800 |
It's interesting that this property is easily accessible. 01:13:37.520 |
But the model might not be doing anything with it, for example, 01:13:46.800 |
even if the model is trained to know that thing that you're 01:13:52.160 |
And there's causal studies that try to extend this work. 01:14:04.680 |
to talk about recasting model tweaks and ablations 01:14:11.240 |
where we had a network that was going to work OK. 01:14:17.640 |
And then you could see whether you could remove anything 01:14:24.480 |
is it going to be better if it's more complicated? 01:14:30.320 |
And so one example of some folks who did this 01:14:33.160 |
is they took this idea of multi-headed attention 01:14:50.880 |
not retraining at all, without some of the attention heads, 01:14:56.000 |
You could just get rid of them after training. 01:14:58.480 |
And likewise, you can do the same thing for-- 01:15:03.120 |
You can actually get away without a large, large 01:15:12.040 |
Yeah, so another thing that you could think about 01:15:15.040 |
is questioning sort of the basics of the models 01:15:20.720 |
are sort of self-attention, feedforward, self-attention, 01:15:23.840 |
But why in that order with some of the things omitted here? 01:15:30.960 |
if this is my transformer, self-attention, feedforward, 01:15:33.760 |
self-attention, feedforward, et cetera, et cetera, et cetera, 01:15:45.760 |
So this achieves a lower perplexity on a benchmark. 01:15:51.040 |
important about the architectures that I'm building 01:15:53.320 |
and how can they be changed in order to perform better. 01:15:59.960 |
and impossible to characterize with a single sort 01:16:02.560 |
of statistic, I think, for your test set accuracy, 01:16:07.400 |
And we want to find intuitive descriptions of model 01:16:11.440 |
But we should look at multiple levels of abstraction. 01:16:16.760 |
When someone tells you that their neural network is 01:16:19.160 |
interpretable, I encourage you to engage critically with that. 01:16:30.840 |
Because it's going to be opaque in some ways, 01:16:35.160 |
And then bring this lens to your model building 01:16:39.500 |
as you try to think about how to build better models, 01:16:41.720 |
even if you're not going to be doing analysis as sort of one 01:16:46.880 |
And with that, good luck on your final projects. 01:16:52.120 |
The teaching staff is really appreciative of your efforts 01:16:57.360 |
And yeah, hope-- yeah, there's a lecture left on Thursday.