Stanford XCS224U | Behavioral Eval of NLU Models, Pt 2: Analytical Considerations | Spring 2023
This screencast focuses on the analytical considerations that surround this kind of assessment and analysis. 00:00:31.480 |
We talk about adversarial testing as a shorthand, but there is really a spectrum here, running from relatively friendly probing to truly adversarial attacks. 00:01:01.380 |
Does my system understand how negation works? 00:01:07.280 |
Maybe in your system's training data, negation was sparsely represented, and you want to ask whether it nonetheless handles negation correctly. 00:01:15.960 |
Behavioral testing could help you address that question. 00:01:19.160 |
Does my system work with a new style or genre? 00:01:25.320 |
Maybe it was trained on text from newspapers and you would like to know how it performs on text from some new domain. 00:01:33.480 |
Or maybe the system is supposed to know about numerical terms, 00:01:38.880 | and you want to see whether it can handle inputs that fall outside of its training experiences for such terms. 00:01:49.280 | Getting more adversarial, maybe you have developed a hunch about the system, that it's not good at numerical terms, and you construct examples designed to show that. 00:02:00.640 |
Even more adversarially: does the system produce socially problematic, for example stereotyped or biased, outputs? 00:02:04.800 |
At this point, you're actively trying to construct examples 00:02:10.920 | that fall outside the standard experiences for the model in an effort to expose problematic behavior. 00:02:18.400 |
That is probably more thoroughly adversarial. 00:02:21.560 |
Maybe the most adversarial of all would be exploring whether you can take 00:02:30.980 | otherwise normal examples and append a certain sequence of characters or tokens 00:02:35.840 | and have something very surprising happen as a result, 00:02:41.260 | at which point you are probing the model for gaps in its security in some general sense. 00:02:47.360 |
These are all interesting behavioral tests to run. 00:02:59.220 |
No amount of behavioral testing can truly offer 00:03:02.600 |
you a guarantee about what our systems will be like. 00:03:08.500 |
Your conclusions are always limited by the set of examples that you decided to construct. 00:03:30.520 |
Consider a simple even-odd detector. The promise of this model is that it takes in a number and tells us whether it is even or odd. 00:03:39.120 |
Here 4 has come in and it has predicted even. 00:03:42.200 | So far so good. 21 comes in and it predicts odd, and a few more cases come in and are handled correctly as well. 00:03:55.440 |
This behavioral testing is going great and suggests that we 00:03:58.880 |
have a very solid model of even-odd detection. 00:04:02.480 |
But suppose I now reveal to you how this model works. 00:04:06.720 |
I show you the insides and what that reveals for 00:04:09.440 |
you is that this model is just a simple lookup, 00:04:12.560 |
and we got lucky on those five inputs because those are 00:04:15.400 |
exactly the five inputs that the model was prepared for. 00:04:21.680 |
Now you know exactly how to expose a weakness of the system. 00:04:34.100 |
Notice that it was not the behavioral test that gave us this insight, 00:04:38.160 |
but rather being able to peek under the hood. 00:04:45.840 |
Suppose we run many more behavioral tests, and let's assume it gets all those green cases correct as well. 00:04:56.640 | Now you think we really have an excellent model of even-odd detection. 00:05:06.200 |
What's revealed when you look inside is that this is 00:05:08.600 |
a more sophisticated version of the same lookup. 00:05:13.980 |
It extracts the final token by splitting on white space and 00:05:16.760 | uses that as the basis for its classification decisions. 00:05:20.560 | If that final word is not in its lookup table, 00:05:30.520 | it predicts odd by following that elsewhere case, which just happened to give the right answer on the examples we tried. 00:05:36.600 |
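To make this concrete, here is a minimal sketch of the kind of model being described; the table contents and test strings are invented for illustration and are not taken from the lecture slides.

```python
# Hypothetical sketch of the "more sophisticated" lookup model described above.
LOOKUP = {"4": "even", "21": "odd", "36": "even", "57": "odd", "88": "even"}

def predict_parity(text: str) -> str:
    final_token = text.split()[-1]         # classify based only on the final whitespace-separated token
    return LOOKUP.get(final_token, "odd")  # the "elsewhere" case: anything unfamiliar is called odd

# Behavioral tests that happen to stay inside the table look perfect:
assert predict_parity("4") == "even"
assert predict_parity("the number 21") == "odd"

# But an unfamiliar even number exposes the flaw immediately:
print(predict_parity("102"))  # -> "odd", which is wrong
```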
But again, we saw the flaw not from our behavioral test, but from peeking under the hood once more. 00:05:50.900 |
We can keep devising behavioral tests, but we will always have doubts in the back of our minds that we have missed 00:06:02.500 | something crucial. That is, after all, the actual situation that you are in if you deploy a model out into the world: 00:06:08.980 | unfamiliar inputs come streaming in, and now you have to see what happens for the unfamiliar cases. 00:06:13.340 |
Another, more incidental, limitation of behavioral testing to keep in mind is that it usually happens within the confines of existing tasks and metrics. 00:06:24.460 |
When you look through the literature on challenge and adversarial tests, 00:06:27.820 |
you find that mostly people are adopting the metrics that are 00:06:34.060 |
standard for the underlying tasks and simply probing the models within those guardrails. 00:06:37.420 |
I think that's fine, but in the full spirit of adversarial testing, 00:06:41.260 |
we should feel free to break out of the confines of these tasks and assess 00:06:45.260 |
models in new ways to expose new limitations and so forth. 00:06:50.620 |
I'm going to play by the rules by and large in this lecture, 00:06:53.580 |
but have in mind that one way to be adversarial would be to put 00:06:57.660 |
models in entirely unfamiliar situations and ask new things of them. 00:07:02.740 |
Here's another really crucial analytical point 00:07:07.020 |
that we need to think about when we do behavioral testing. 00:07:11.420 |
When a system fails one of our tests, we have to ask: is this a failure of the model or is it a failure of the underlying dataset? 00:07:17.260 |
A lovely paper that provides a framework for thinking about this is Liu et al. 2019, which proposes a method they call inoculation by fine-tuning. 00:07:26.500 |
We're going to talk about that idea in a second, 00:07:28.700 |
but the guiding idea behind the paper is embodied in this quote. 00:07:33.540 |
What should we conclude when a system fails on a challenge dataset? 00:07:40.740 |
In some cases, the challenge might exploit blind spots in the design of the original dataset. 00:07:45.820 |
In others, the challenge might expose an inherent inability of 00:07:49.180 |
a particular model family to handle certain kinds of natural language phenomena. 00:07:55.660 |
These are, of course, not mutually exclusive. 00:08:03.740 |
As researchers, we naturally want to claim we have found model weaknesses. 00:08:07.860 |
If you can show that the transformer architecture is 00:08:10.780 |
fundamentally incapable of capturing some phenomenon, that is important. 00:08:17.020 | It might mean that the transformer is 00:08:19.860 |
a non-starter when it comes to modeling language. 00:08:23.420 |
But frankly, it's more likely that you have found a dataset weakness. 00:08:28.860 |
There is something about the available training data 00:08:31.500 |
that means the model has not hit your learning targets. 00:08:34.780 |
That is a much less interesting result because it often 00:08:37.580 |
means that we just need to supplement with more data. 00:08:40.820 |
We need to be careful about this because we don't want to 00:08:43.500 |
mistake dataset weaknesses for model weaknesses. 00:08:48.260 |
We made a similar point in a paper that we did 00:08:51.580 |
about posing fair but challenging evaluation tasks. 00:08:55.340 |
We write: however, for any evaluation method, we should ask whether the model has been shown data sufficient to 00:09:05.260 | support the generalization we are asking of it. 00:09:08.700 |
Unless we can say yes with complete certainty, 00:09:12.020 |
we can't be sure whether a failed evaluation traces to 00:09:15.060 |
a model limitation or a data limitation that no model could overcome. 00:09:24.260 |
When we talk about being fair to our models, we don't mean that we're particularly worried about 00:09:26.380 | them, that they might be mistreated or something. 00:09:28.740 |
Rather, we are worried about an analytic mistake where we blame a model for 00:09:33.780 |
a failing when in fact the failing is on us because something about 00:09:39.900 |
our evaluation did not fully disambiguate the learning targets that we had in mind. 00:09:43.340 |
This can easily happen and it can lead to misdiagnosis of problems. 00:09:49.980 |
Here is an example at a very human level that can show that any agent 00:09:52.540 |
could feel stumped by a misspecified problem. 00:09:55.620 |
Suppose I begin the numerical sequence 3, 5, 7, 00:09:59.180 |
and I ask you to guess what the next number is. 00:10:07.180 |
It seems reasonable to assume that I was listing out odd numbers, in which case you should say 9, 00:10:12.740 | or prime numbers, in which case you should say 11. 00:10:19.380 | It would be unfair of me, if I had the prime case in mind, to scold you for saying 9 in this context. 00:10:23.880 |
But that is exactly the mistake that we are at risk of 00:10:26.940 |
making when we pose challenge problems to our systems. 00:10:30.740 |
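As a tiny illustration of the ambiguity (hypothetical code, not from the lecture), both rules below generate the prefix 3, 5, 7 and only diverge at the fourth element.

```python
def odd_numbers(n):
    """The first n odd numbers starting from 3."""
    return [3 + 2 * i for i in range(n)]

def primes_from_3(n):
    """The first n primes starting from 3 (naive trial division)."""
    out, k = [], 3
    while len(out) < n:
        if all(k % d for d in range(2, int(k ** 0.5) + 1)):
            out.append(k)
        k += 1
    return out

print(odd_numbers(4))    # [3, 5, 7, 9]
print(primes_from_3(4))  # [3, 5, 7, 11] -- same prefix, different continuation
```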
Here's another case in which this could happen that's more 00:10:33.540 |
oriented toward natural language understanding. 00:10:36.020 |
Suppose I want to probe systems to see whether they can learn the meaning of a Boolean connective. 00:10:42.060 |
What I do is show the system cases of combinations of p and q, 00:10:46.300 |
where they're both true and where p is false and q is true. 00:10:53.140 |
Then I ask the system to generalize by filling out this entire truth table. 00:10:59.260 | The problem is that, within the hypothesis space for normal Boolean logic, 00:11:04.500 |
I might have in mind the material conditional, 00:11:08.180 | symbolized by the arrow here, or 00:11:11.240 | inclusive disjunction, symbolized by this V symbol down here. 00:11:15.200 |
My training data as depicted on the left here simply did 00:11:18.880 |
not disambiguate what my learning target was. 00:11:21.440 |
Again, it is not fair to scold systems if they arrive at a hypothesis other than the one I secretly had in mind. 00:11:34.300 |
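To make the ambiguity concrete, here is a minimal sketch; it assumes the two observed rows are both labeled true, which is the situation in which both connectives remain live hypotheses (the slide itself is not reproduced here).

```python
from itertools import product

# The two (p, q) combinations described above, both labeled True (an assumption for illustration).
observed = {(True, True): True, (False, True): True}

material_conditional = lambda p, q: (not p) or q   # p -> q
inclusive_disjunction = lambda p, q: p or q        # p v q

# Both hypotheses fit every observed row ...
for (p, q), label in observed.items():
    assert material_conditional(p, q) == label == inclusive_disjunction(p, q)

# ... but they disagree on the unseen rows of the full truth table:
for p, q in product([True, False], repeat=2):
    if (p, q) not in observed:
        print((p, q), material_conditional(p, q), inclusive_disjunction(p, q))
# (True, False)  -> False vs. True
# (False, False) -> True  vs. False
```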
Liu et al. 2019 provides a lovely framework for thinking about how to 00:11:40.220 |
distinguish between dataset weaknesses and model weaknesses. 00:11:43.620 |
This is the framework that they call inoculation by fine-tuning. 00:11:47.280 |
This is a diagram from their paper. Let's walk through it. 00:11:49.920 |
Suppose we train our system on our original data, 00:11:55.700 | and then we evaluate it on both the original test set and some challenge set that we're interested in. 00:12:00.200 | We observe that the system does well on that original test set but poorly on the challenge set. 00:12:10.600 |
I've already presented to you the major choice point here. 00:12:18.400 |
The proposed method for resolving that question works as follows. 00:12:22.260 |
We're going to fine-tune on a few challenge examples. 00:12:26.040 |
We're going to update the model and then retest on 00:12:29.120 |
both the original and the challenge datasets. 00:12:32.640 |
We have three possible general outcomes here. 00:12:36.120 |
The dataset weakness case is the case where now, 00:12:40.840 |
we see good performance on both the original and our challenge dataset. 00:12:45.460 |
In particular, the challenge performance has gone way up. 00:12:48.840 |
That is an indication to us that there were simply some gaps in 00:12:52.320 |
the available training experiences of our model that the fine-tuning on challenge examples was able to fill. 00:13:02.640 | The model weakness case, by contrast, is a situation where even after doing this fine-tuning, 00:13:05.360 |
we still see poor performance on our challenge dataset, 00:13:09.160 |
even though we have maintained performance on the original. 00:13:12.600 |
That might mean that the challenge dataset contains phenomena 00:13:17.620 | that are fundamentally difficult for this model. 00:13:22.700 |
Then there is the third outcome, also important: annotation artifacts. 00:13:31.540 | This is the case where, after fine-tuning on the challenge examples, performance on the original test set has plummeted. 00:13:35.860 |
That's a case where we might discover that our challenge dataset 00:13:39.540 |
is doing something unusual and problematic to the model. 00:13:51.380 |
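Here is a hedged sketch of the decision logic just described; the function name, thresholds, and numbers are invented for illustration, and this is not Liu et al.'s code: they reason about these outcome patterns directly from their plots rather than by applying a fixed rule.

```python
def diagnose(orig_before, orig_after, chal_after, good=0.80, margin=0.05):
    """Interpret an inoculation-by-fine-tuning experiment (illustrative only).

    orig_before: original-test accuracy after training on the original data alone.
    orig_after, chal_after: accuracies after also fine-tuning on a few challenge examples.
    """
    orig_held = orig_after >= orig_before - margin  # did original performance survive fine-tuning?
    if orig_held and chal_after >= good:
        return "dataset weakness: gaps in the original training data"
    if orig_held:
        return "model weakness: fine-tuning never fixes the challenge set"
    return "annotation artifacts: the challenge data disturbs the model"

print(diagnose(orig_before=0.91, orig_after=0.90, chal_after=0.88))
# -> dataset weakness: gaps in the original training data
```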
Here is how this plays out for an adversarial test that they study in detail, 00:13:54.700 | one that was released in relation to NLI models. 00:13:58.120 | The plots are organized by the three outcomes that they see. 00:14:15.560 | In the first plot, the dots show performance on the original test set, and these other markers here are performance on the new challenge set. 00:14:18.140 |
This is a dataset weakness in that you see that as we 00:14:20.880 |
fine-tune across this x-axis on more and more challenge examples, 00:14:25.360 |
we see performance on that challenge set go up, 00:14:30.400 |
while we maintain performance through that fine-tuning process on the original dataset. 00:14:38.520 |
The model weakness case is also pretty clear to see. 00:14:41.840 |
Here again, we have the original dataset with these dots. 00:14:46.500 |
Performance there stays high across all of the different levels of fine-tuning. 00:14:51.580 | But when we look at the corresponding line for the challenge dataset, 00:14:58.040 | we see that we never really budge on performance on those examples, 00:15:01.260 |
suggesting that there's a real problem with the underlying model. 00:15:07.760 |
The third case is the annotation artifacts case, and this is the case where our fine-tuning actually 00:15:10.560 | introduces something chaotic into the mix by disturbing the model. 00:15:14.960 | The net effect there is that for the original dataset, 00:15:24.200 | performance drops: we may gain on the challenge examples, but really at a cost to the overall performance of the system. 00:15:30.280 | That suggests that the examples in the challenge set are somehow problematic. 00:15:40.760 |
Here is a case study I can tell you about that comes from work that we did in my group, 00:15:43.460 |
and this relates to having negation as a learning target. 00:15:46.720 |
Again, this is in the spirit of helping you avoid what could 00:15:49.280 |
be a serious analytic mistake for behavioral testing. 00:15:53.240 |
We have this intuitive learning target related to negation: 00:16:00.080 | if A entails B, then not B entails not A. That is the classic entailment reversing property of negation. 00:16:03.480 | It applies at all levels in language and is responsible for why, 00:16:06.960 | for example, where we have pizza entails food, we also have not food entails not pizza. 00:16:13.280 |
Simple intuitive learning target with lots of consequences for language, 00:16:17.120 |
and then we have this observation through many papers in the literature, 00:16:21.280 |
that our top performing natural language inference models systematically fail on examples involving negation. 00:16:27.360 |
Of course, the tempting conclusion there is that 00:16:29.940 |
our top performing models are incapable of learning negation. 00:16:34.300 |
We want to make that conclusion because it's a headline result 00:16:38.320 |
that will mean we have a really fundamental limitation that we have discovered. 00:16:42.600 |
But we have to pair that with the observation that negation is severely 00:16:48.180 |
underrepresented in the NLI benchmarks that are driving these models. 00:16:54.180 |
That should introduce doubt in our minds that we've really found a model weakness; 00:16:58.660 |
we might be confronting a dataset weakness instead. 00:17:04.420 |
In response, we followed the inoculation by fine-tuning template and constructed 00:17:08.900 | a slightly synthetic dataset that we call MoNLI, short for monotonicity NLI. 00:17:15.380 |
In positive MoNLI, there are about 1,500 examples. 00:17:19.300 |
We took actual hypotheses from the SNLI benchmark, 00:17:25.220 |
and we used WordNet to find a special case of food, like pizza, 00:17:30.340 |
an entailment case, and then we created a new example by substituting the special case, 00:17:41.740 | giving us pairs where A is neutral with respect to B and B entails A. We also have negative MoNLI, 00:17:46.820 |
which has a similar number of examples and follows the same protocol, 00:17:52.300 |
except that it begins from negated examples like "the children are not holding plants." 00:18:03.380 |
Because of the entailment reversing property of negation, 00:18:09.780 |
A entails B, and B is neutral with respect to A. 00:18:16.740 |
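Here is a hedged sketch of the kind of example pairs just described; the sentence frame and helper function are invented for illustration, whereas the real MoNLI examples are built from SNLI hypotheses and WordNet hyponym/hypernym pairs.

```python
def monli_style_pairs(hyponym, hypernym, frame="A person is eating {}."):
    """Build one positive-MoNLI-style and one negative-MoNLI-style pair (illustrative)."""
    positive = {
        "premise": frame.format(hyponym),       # "A person is eating pizza."
        "hypothesis": frame.format(hypernym),   # "A person is eating food."
        "label": "entailment",                  # pizza entails food in this positive context
    }
    negated = frame.replace("is eating", "is not eating")
    negative = {
        "premise": negated.format(hypernym),    # "A person is not eating food."
        "hypothesis": negated.format(hyponym),  # "A person is not eating pizza."
        "label": "entailment",                  # negation reverses the direction of entailment
    }
    return positive, negative

pos, neg = monli_style_pairs("pizza", "food")
# In the reverse direction, each pair is merely neutral: for example,
# "A person is not eating pizza" does not settle whether they are eating food.
```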
We did our level best to pose this as a very hard generalization task. 00:18:22.060 |
In the sense that we held out entire words for testing, 00:18:26.900 |
to be sure that we were getting a look at whether or not systems had truly 00:18:30.860 |
acquired a theory of lexical relations in addition to acquiring a theory of negation. 00:18:38.740 |
That makes the task hard, but we're also trying to be sure that we have good coverage of the relevant vocabulary. 00:18:44.860 |
One thing we did with this dataset is use MoNLI as a challenge dataset. 00:18:53.240 |
Let's look at the BERT row of this table here. 00:18:59.580 |
It does extremely well on the positive part of the MoNLI split, 00:19:04.500 |
but it has essentially zero accuracy on the negative part of MoNLI. 00:19:13.440 |
What seems to be happening is that the model is simply ignoring negations and therefore getting 00:19:17.180 |
every single one of these examples wrong because they look like positive cases to the model. 00:19:22.060 |
You might think, "Aha, we have found a fundamental limitation of BERT," but not so fast. 00:19:27.720 |
If we do a little bit of inoculation by fine-tuning on negative MoNLI cases, 00:19:33.180 |
performance on that split immediately goes up. 00:19:39.460 |
We maintain performance everywhere else, and we have excellent performance on the negative split for MoNLI, 00:19:43.260 |
and this strongly suggests that we had found not a model weakness, but a dataset weakness. 00:19:50.260 |
Final thing I want to say here by way of wrapping up, 00:19:56.660 |
is that I have emphasized fairness for our systems. 00:20:00.860 |
I think that is important to have in mind so that we don't confuse ourselves. 00:20:05.340 |
But I couldn't resist pointing out that biological creatures are amazing, 00:20:13.820 |
in that they routinely solve tasks that are unfair in the sense that I just described. 00:20:33.700 |
For example, if I show you two identical shapes and then ask you to pick from these two options here, people reliably go for the pair of identical ones. 00:20:46.700 |
Whereas if I show you two different shapes and ask you to make a similar choice, 00:20:50.600 |
now what people do is go for the two different ones. 00:20:56.960 |
That is something people can do consistently with essentially no training data. 00:21:00.640 |
As posed here, I maintain that these tasks are unfair, 00:21:04.640 |
and yet nonetheless, humans and many biological entities 00:21:08.440 |
are able to systematically solve these tasks. 00:21:11.300 |
That is a puzzle about the cognition of humans and other biological creatures, 00:21:16.580 |
and it's something that we should keep in mind. 00:21:21.940 |
We might ask how we would ever get our machine learning models to solve such tasks given so little data. 00:21:29.440 |
For example, we can do hierarchical versions of equality, 00:21:37.880 |
and people solve them out of the box, so to speak, 00:21:42.720 |
even though there are not enough training instances to fully disambiguate the task. 00:21:46.400 |
Again, pointing out that biological creatures are amazing. 00:21:51.320 |
We should pose fair tasks to our systems while keeping in 00:21:55.120 |
mind that there are scenarios in which we might have 00:21:58.720 |
an expectation for a solution that is not supported by the data, 00:22:03.000 |
but that is nonetheless the one all of us arrive at with seemingly no effort.