Stanford XCS224U | Behavioral Eval of NLU Models, Pt 2: Analytical Considerations | Spring 2023

This screencast is about the analytical considerations that surround this kind of assessment and analysis.
We talk about adversarial testing as a shorthand, but there is really a spectrum of motivations here, running from the fairly benign to the truly adversarial. At the benign end, you might ask: does my system understand how negation works? Maybe in its training data negation was sparsely represented, and you want to ask whether the system nonetheless learned what negation means. Behavioral testing could help you address that question. Does my system work with a new style or genre? Maybe it was trained on text from newspapers and you would like to know how it fares on a different kind of text. Moving along the spectrum: this system is supposed to know about numerical terms, so you probe it with examples outside of its training experiences for such terms, perhaps in an effort to show that the system is not good at numerical terms. More adversarial still: does the system produce socially problematic outputs? At this point, you're actively trying to construct examples that fall outside the standard experiences for the model in an effort to expose bad behavior. That is probably more thoroughly adversarial. Maybe the most adversarial of all would be exploring the system for security holes. You might take normal examples and append a certain sequence of tokens, and something very surprising happens as a result; you're probing the system for gaps in its security in some general sense.
These are all interesting behavioral tests to run. But we should keep the fundamental limitation in view: no amount of behavioral testing can truly offer you a guarantee about what our systems will be like. You only ever learn about the behaviors you happened to probe, that is, the set of examples that you decided to construct.
Consider a simple example: a model that is supposed to do even-odd detection. The promise of this model is that it takes in numbers and classifies them as even or odd. Here four has come in and it has predicted even. So far so good. 21 comes in and it predicted odd. This behavioral testing is going great and suggests that we have a very solid model of even-odd detection. But suppose I now expose for you how this model works. I show you the insides, and what that reveals for you is that this model is just a simple lookup, and we got lucky on those five inputs, because those are exactly the five inputs that the model was prepared for. Now you know exactly how to expose a weakness of the system: give it any input that is not in that lookup table. Notice that it was not the behavioral test that gave us this insight, but rather being able to peek under the hood.
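To make the lookup picture concrete, here is a minimal hypothetical sketch of such a model and the kind of behavioral test it would pass. I'm treating the inputs as number words, and the table entries beyond "four" and "21" are invented for illustration, since the other slide inputs aren't in the transcript:

```python
# Hypothetical sketch: an "even-odd model" that is secretly a lookup table.
# Only "four" and "twenty one" come from the lecture; the rest are invented here.
LOOKUP = {
    "four": "even",
    "twenty one": "odd",
    "thirty two": "even",   # invented
    "fifty seven": "odd",   # invented
    "one hundred": "even",  # invented
}

def even_odd_model(text):
    # No arithmetic anywhere: the model only succeeds on inputs it was "prepared for".
    return LOOKUP[text]

# A behavioral test suite that, by bad luck, draws only on those prepared inputs:
tests = [("four", "even"), ("twenty one", "odd"), ("thirty two", "even")]
assert all(even_odd_model(x) == y for x, y in tests)  # passes: looks like a solid model

# Any input outside the table exposes the weakness immediately:
# even_odd_model("six")  ->  KeyError
```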
Suppose we now test a new model on a richer set of cases, and let's assume it gets all those green cases correct as well, and you think now we really have an excellent model of even-odd detection. What's revealed when you look inside is that this is a more sophisticated version of the same lookup. The model finds the final token by splitting on white space and uses that as the basis for its classification decisions. If that final word is not in its lookup table, it predicts odd by following that elsewhere case.
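Here is the same idea as a hypothetical sketch; the table contents are again my own invention, but the mechanism, split on whitespace, keep the final word, and fall back to odd for anything unseen, is the one just described:

```python
# Hypothetical sketch of the more sophisticated lookup: split on whitespace,
# keep the final word, look that word up, and default to "odd" elsewhere.
# The table contents are invented for illustration.
FINAL_WORD_TABLE = {"one": "odd", "two": "even", "three": "odd",
                    "four": "even", "seven": "odd", "nine": "odd"}

def even_odd_model_v2(text):
    final_word = text.split()[-1]                    # "twenty one" -> "one"
    return FINAL_WORD_TABLE.get(final_word, "odd")   # the "elsewhere" case: predict odd

# This passes many more behavioral tests than the plain lookup did:
assert even_odd_model_v2("twenty one") == "odd"
assert even_odd_model_v2("fifty four") == "even"
# ...and yet it still has nothing to do with parity in general:
assert even_odd_model_v2("one hundred") == "odd"     # wrong: one hundred is even
```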
But again, we saw the flaw not from our behavioral test, but from being able to look inside the model. We can keep adding behavioral tests, but we will always have doubts in the back of our minds that we have missed something important. And that is essentially the actual situation that you are in if you deploy a model out into the world: it will encounter inputs unlike the ones you constructed, and now you have to see what happens for the unfamiliar cases.
Another, more incidental, limitation of behavioral testing to keep in mind: when you look through the literature on challenge and adversarial tests, you find that mostly people are adopting the metrics that are already standard for the underlying tasks and simply probing the models within those guardrails.
I think that's fine, but in the fullness of adversarial testing, we should feel free to break out of the confines of these tasks and assess models in new ways to expose new limitations and so forth. I'm going to play by the rules by and large in this lecture, but have in mind that one way to be adversarial would be to put models in entirely unfamiliar situations and ask new things of them.
Here's another really crucial analytical point that we need to think about when we do behavioral testing. When a system fails on one of our challenge examples, is this a failure of the model or is it a failure of the underlying dataset?
A lovely paper that provides a framework for thinking about this is Liu et al. 2019. We're going to talk about that idea in a second, but the guiding idea behind the paper is embodied in this quote: What should we conclude when a system fails on a challenge dataset? In some cases, the challenge might exploit blind spots in the design of the original dataset. In others, the challenge might expose an inherent inability of a particular model family to handle certain kinds of natural language phenomena.
These are, of course, not mutually exclusive. Researchers who pose these challenge tests usually want to claim they have found model weaknesses. If you can show that the transformer architecture is fundamentally incapable of capturing some phenomenon, you will have discovered something fundamental. That is important. It might mean that the transformer is a non-starter when it comes to modeling language.
But frankly, it's more likely that you have found a dataset weakness. There is something about the available training data that means the model has not hit your learning targets. That is a much less interesting result because it often means that we just need to supplement with more data. We need to be careful about this because we don't want to mistake dataset weaknesses for model weaknesses.
We made a similar point in a paper that we did about posing fair but challenging evaluation tasks.
We write: however, for any evaluation method, we should ask whether the available data support the generalization we are asking of it. Unless we can say yes with complete certainty, we can't be sure whether a failed evaluation traces to a model limitation or a data limitation that no model could overcome. When we talk about being fair to our systems, we don't mean that we're particularly worried about them, that they might be mistreated or something. Rather, we are worried about an analytic mistake where we blame a model for a failing when in fact the failing is on us, because something about the data we provided did not fully disambiguate the learning targets that we had in mind. This can easily happen and it can lead to misdiagnosis of problems. Let me give an example at a very human level that can show that any agent could feel stumped by a misspecified problem.
Suppose I begin the numerical sequence 3, 5, 7, and I ask you to guess what the next number is. If you say 9, it seems reasonable to assume that I was listing out odd numbers. But I might have had in mind prime numbers, in which case you should say 11. It would hardly be fair, if I had intended the prime case, for me to scold you for saying 9 in this context. But that is exactly the mistake that we are at risk of making when we pose challenge problems to our systems.
Here's another case in which this could happen that's more oriented toward natural language understanding.
Suppose I want to probe systems to see whether they can learn a simple Boolean connective. What I do is show the system cases of combinations of p and q, where they're both true and where p is false and q is true, and then I ask it to generalize by filling out this entire truth table. Within the hypothesis space for normal Boolean logic, I might have in mind the material conditional, symbolized by the arrow here, or inclusive disjunction, as symbolized by this ∨ symbol down here. My training data, as depicted on the left here, simply did not disambiguate what my learning target was. Again, it is not fair to scold systems if they arrive at a generalization other than the one I happened to intend.
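As a small illustration of that underdetermination, here is a hypothetical sketch. I'm assuming the two observed rows were labeled true, which is what both of the connectives under discussion predict:

```python
# Hypothetical sketch of the underdetermination problem. Assumption: the two
# observed rows were labeled True, as both candidate connectives would predict.
TRAIN = [((True, True), True), ((False, True), True)]   # the only observed cases

CANDIDATES = {
    "material conditional (p -> q)": lambda p, q: (not p) or q,
    "inclusive disjunction (p or q)": lambda p, q: p or q,
}

for name, op in CANDIDATES.items():
    fits = all(op(p, q) == label for (p, q), label in TRAIN)
    print(name, "is consistent with the training data:", fits)   # True for both

# The two hypotheses diverge only on the unseen row p=False, q=False:
# the conditional yields True there, the disjunction yields False, so the
# training data alone cannot tell us which connective was intended.
```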
Liu et al. 2019 provides a lovely framework for thinking about how to distinguish between dataset weaknesses and model weaknesses. This is the framework that they call inoculation by fine-tuning. This is a diagram from their paper; let's walk through it. Suppose we train our system on our original data, and then we evaluate it on both the original test set and some challenge set that we're interested in. We observe that the system does well on that original test set but poorly on the challenge set. I've already presented to you the major choice point here: is that a dataset weakness or a model weakness? The proposed method for resolving that question works as follows. We're going to fine-tune on a few challenge examples, update the model, and then retest on both the original and the challenge datasets.
We have three possible general outcomes here. The dataset weakness case is the case where now, after the fine-tuning, we see good performance on both the original and our challenge dataset; in particular, the challenge performance has gone way up. That is an indication to us that there were simply some gaps in the available training experiences of our model, gaps that a few challenge examples sufficed to fill. The model weakness case is a situation where, even after doing this fine-tuning, we still see poor performance on our challenge dataset, even though we have maintained performance on the original. That might mean that there is something about these challenge cases that is fundamentally difficult for this model. Then there is the third outcome, also important: annotation artifacts. This is the case where, after the fine-tuning, performance on the original test set has plummeted. That's a case where we might discover that our challenge dataset is doing something unusual and problematic to the model.
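Here is the recipe as a hypothetical sketch. The function names, dataset handles, and sample sizes are placeholders of mine, not an API from the paper or from any particular library:

```python
# Hypothetical sketch of the inoculation-by-fine-tuning recipe (Liu et al. 2019).
# `fine_tune`, `evaluate`, and the dataset arguments are placeholders.

def inoculate(model, challenge_train, original_test, challenge_test,
              fine_tune, evaluate, sample_sizes=(100, 500, 1000)):
    """Fine-tune on small samples of challenge data, then retest on both test sets."""
    results = []
    for k in sample_sizes:
        tuned = fine_tune(model, challenge_train[:k])   # a few challenge examples
        results.append({
            "num_challenge_examples": k,
            "original_acc": evaluate(tuned, original_test),
            "challenge_acc": evaluate(tuned, challenge_test),
        })
    return results

# Reading the outcomes, informally:
#   challenge accuracy rises and original accuracy holds  -> dataset weakness
#   challenge accuracy stays low while original holds     -> model weakness
#   original accuracy plummets after fine-tuning          -> artifacts in the challenge set
```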
Here are their results for one challenge dataset, an adversarial test that they study in detail and that was released in relation to NLI models. The plots are organized by the three outcomes that they see. The dots show performance on the original test set, and the other markers here show performance on the new challenge set. This is a dataset weakness, in that you see that, as we fine-tune across this x-axis on more and more challenge examples, performance on that challenge set goes up, with essentially no cost from that fine-tuning process on the original dataset. The model weakness case is also pretty clear to see. Here again, we have the original dataset with these dots, holding steady across all of the different levels of fine-tuning. But when we look at the corresponding line for the challenge dataset, we never really budge on performance on those examples, suggesting that there's a real problem with the underlying model. Then there is the annotation artifacts case, and this is the case where our fine-tuning actually introduces something chaotic into the mix by disturbing the model. The net effect there is that, for the original dataset, performance drops as we fine-tune on more challenge examples; whatever we gain on the challenge set comes at a cost to the overall performance of the system. That suggests that the examples in the challenge set are somehow problematic.
The next example I want to tell you about comes from work that we did in my group, and it relates to having negation as a learning target. Again, this is in the spirit of helping you avoid what could be a serious analytic mistake in behavioral testing.
We have this intuitive learning target related to negation: if A entails B, then not-B entails not-A. That is the classic entailment-reversing property of negation. It applies at all levels in language and is responsible for why, for example, where we have that pizza entails food, we also have that not food entails not pizza.
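Schematically, in my own notation rather than anything from the lecture slides:

```latex
% Entailment reversal under negation (schematic):
\text{If } A \models B \text{, then } \neg B \models \neg A.
\qquad
\text{``we ate pizza''} \models \text{``we ate food''}
\;\Longrightarrow\;
\text{``we did not eat food''} \models \text{``we did not eat pizza''}
```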
It is a simple, intuitive learning target with lots of consequences for language. And then we have this observation, through many papers in the literature, that our top-performing natural language inference models struggle with examples involving negation.
Of course, the tempting conclusion there is that our top-performing models are incapable of learning negation. We want to draw that conclusion because it's a headline result: it would mean we have discovered a really fundamental limitation. But we have to pair that with the observation that negation is severely underrepresented in the NLI benchmarks that are driving these models. That should introduce doubt in our minds that we've really found a model weakness; we might be confronting a dataset weakness instead.
To tease these apart, we followed the inoculation by fine-tuning template and constructed a slightly synthetic dataset that we call MoNLI, for monotonicity NLI. In positive MoNLI, there are about 1,500 examples. We took actual hypotheses from the SNLI benchmark containing a general term like food, and we used WordNet to find a special case of food, like pizza. Substituting one word for the other creates a new example pair: the sentence A with the general term is neutral with respect to the sentence B with the specific term, and B entails A. We also have negative MoNLI, which has a similar number of examples and follows the same protocol, except that it is built from negated examples like "The children are not holding plants." Because of the entailment-reversing property of negation, the relations flip: now A entails B, and B is neutral with respect to A. We did our level best to pose this as a very hard generalization task, in the sense that we held out entire words for testing, to be sure that we were getting a look at whether or not systems had truly acquired a theory of lexical relations in addition to acquiring a theory of negation. That makes the task hard, but we're also trying to be sure that we have good coverage, so that the task remains a fair one.
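To give a rough sense of the construction, here is a hypothetical simplification using NLTK's WordNet interface. The template, helper names, and word pair are mine, not the actual MoNLI pipeline:

```python
# Hypothetical simplification of a MoNLI-style construction, not the authors' code.
# Requires NLTK with the WordNet data installed: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def a_hypernym(word):
    """Return one WordNet hypernym lemma for a noun, i.e. a more general term."""
    for synset in wn.synsets(word, pos=wn.NOUN):
        for hyper in synset.hypernyms():
            return hyper.lemma_names()[0].replace("_", " ")
    return None

def monli_style_pairs(template, specific, general):
    """Build positive and negative pairs by lexical substitution.
    A = sentence with the general term, B = sentence with the specific term."""
    A = template.format("", general)        # e.g. "The children are eating food."
    B = template.format("", specific)       # e.g. "The children are eating pizza."
    notA = template.format("not ", general)
    notB = template.format("not ", specific)
    positive = [(B, A, "entailment"), (A, B, "neutral")]
    # Negation reverses the direction: "not ... food" entails "not ... pizza".
    negative = [(notA, notB, "entailment"), (notB, notA, "neutral")]
    return positive, negative

general = a_hypernym("pizza") or "food"   # WordNet may return e.g. "dish"
pos, neg = monli_style_pairs("The children are {}eating {}.", "pizza", general)
```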
One thing we did with this dataset is use MoNLI as a challenge dataset. Let's look at the BERT row of this table here. This is a BERT-based NLI model that performs well on its original benchmark, and it does extremely well on the positive part of the MoNLI split, but it has essentially zero accuracy on the negative part of MoNLI. What seems to be happening is that the model is simply ignoring negations and therefore getting every single one of these examples wrong, because they look like positive cases to the model. You might think, "Aha, we have found a fundamental limitation of BERT," but that would be too hasty. If we do a little bit of inoculation by fine-tuning on negative MoNLI cases, performance on that split immediately goes up. Performance on the original benchmark is maintained, we have excellent performance on the negative split for MoNLI, and this strongly suggests that we had found not a model weakness, but a dataset weakness.
The final thing I want to say, by way of wrapping up, is that I have emphasized fairness for our systems. I think that is important to have in mind so that we don't confuse ourselves. But I couldn't resist pointing out that biological creatures are amazing at solving tasks that are unfair in the sense that I just described. For example, if I show you two identical shapes and then ask you to pick from these two options here, people reliably go for the pair of identical ones. Whereas if I show you two different shapes and ask you to make a similar choice, now what people do is go for the two different ones. This is something that people do consistently with essentially no training data.
As posed here, I maintain that these tasks are unfair, and yet, nonetheless, humans and many other biological entities are able to systematically solve them. That is a puzzle about the cognition of humans and other biological creatures, and it's something that we should keep in mind: how would we get our machine learning models to solve such tasks? For example, we can do hierarchical versions of these equality tasks, and people solve them out of the box, so to speak, even when there are not enough training instances to fully disambiguate the task. Again, this points out that biological creatures are amazing. We should pose fair tasks to our systems while keeping in mind that there are scenarios in which we might have an expectation for a solution that is not supported by the data, but that is nonetheless the one that all of us arrive at with seemingly no effort.