Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 4: COGS and ReCOGS | Spring 2023
Chapters
0:00 Intro
0:55 Task
2:42 Motivations
4:20 Understanding COGS logical forms
6:40 COGS splits
7:09 Generalization categories
8:59 Synthetic leaderboard
10:44 Why removing redundant tokens matters
12:40 What is behind the 0s for CP/PP recursion?
15:45 What is behind the 0s for PP modifiers?
18:22 Modifications for ReCOGS
20:45 Conceptual questions
In the previous screencast, we talked about the principle of compositionality. We come now to our point of intersection with that principle: we're going to talk about the benchmarks COGS and ReCOGS.
What we were trying to do with ReCOGS is, first, understand why some of the generalization splits in COGS have proved so challenging for present-day models, and, in addition, reformulate COGS somewhat so that it comes closer to testing purely semantic phenomena.
The task in both cases is to map input sentences to logical forms, that is, descriptions of the meanings of the sentences. In the first example here, one conjunct of the COGS logical form describes the theme argument of the helping event, and x_3 is the event variable for this helping event. The event description also has an agent that is identified by another variable. The corresponding ReCOGS logical form expresses the same roles, and otherwise the semantics is very similar. You can see that ReCOGS logical forms tend to be shorter, and the event descriptions are somewhat more transparent. The second example again introduces an event variable, which is then used to identify both the theme and the agent. We have the star operator for definite descriptions as well, and the other simplifications are as in the first example.
What are the motivations for both COGS and ReCOGS? Well, they really tie into the compositionality principle: the idea that we combine familiar elements in ways that are systematic. In fact, the COGS generalization splits are sometimes even hard to appreciate as compositional generalization tasks, because the relevant leaps that we need to make seem very small indeed to us as speakers of a language like English. That is one way of explaining why we're able to make so much use of novel combinations of familiar elements: compositionality tells us that, in some sense, the meanings of those novel combinations were fully determined by the meanings of their parts. The core question we're trying to address is: can our best models do compositional generalization? Relatedly, have they too found compositional solutions?
I think the vision of COGS and ReCOGS is that these benchmarks give us a behavioral way to study those questions. One way to think about this is that if a model manages to handle these generalization splits, the best explanation for that is that it has found something like a compositional solution.
Let's look more closely at the COGS logical forms, because they have some interesting features about them that are worth understanding. The event descriptions involve one or more obligatory or optional roles. Here's a related English sentence: "the vase broke". Here the theme appears in grammatical subject position in the English sentence, and the variable names are determined by linear position in the input sentence. These logical forms might look strange because it looks like we have free variables, but they're all existentially bound with widest scope. In fact, we're meant to interpret this as though there were existentially quantified variables at the start of the logical form. The exception is the variables introduced by definite descriptions: they're bound more locally by that star operator.
For example, "the sailor ran" translates as star sailor x_1, and that continues with the sailor being the agent of the running event, bound to variable 2 for the verb in second position there.
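To make that concrete, here is a rough reconstruction of the logical form being described, written out as Python strings in the COGS notation. This is my own rendering, so treat the exact spacing as illustrative; the indices track word position as described above.

    sentence = "The sailor ran"
    cogs_lf = "* sailor ( x _ 1 ) ; run . agent ( x _ 2 , x _ 1 )"
    # x _ 1 names the sailor (word index 1), x _ 2 the running event (word index 2),
    # and the star marks the definite description "the sailor".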
The COGS splits have the following structure. There is a training set, and then a generalization set containing examples that target a range of different compositional generalization phenomena. For example, for subject-to-object with a common noun, at training time we see examples like "a hedgehog ate the cake", with "hedgehog" only ever in subject position; in the generalization split for this category, "hedgehog" appears in object position. For the primitive-to-grammatical-role splits, certain words appear in training only in isolation as primitives, and at generalization time they appear in full sentences, and we have to figure out what to do with them. Object-to-subject modification means that modifiers like PPs, which attach only to objects at training time, have to be interpreted on subjects in the generalization split. For the recursion splits, involving both sentential complements and PP complements, we see some number of levels of recursive depth at training and then deeper recursion at generalization time. Then there are some other categories that involve alternations, like active to passive and passive to active.
If you look at the synthetic leaderboard we assembled from results in the literature, you see what looks like a pretty rosy picture: for many of the generalization categories, you find some impressively high numbers indeed. But then travel with me further to the left in the table, to the structural generalization splits, and you find columns of zeros. That gives a very different picture of how these models are grappling with these COGS splits.
The first thing we did is just observe that there are a lot of redundant tokens in the COGS logical forms. For example, every single variable begins with "x space underscore space" and then the numeral. That prefix carries no information, so for a variable like x _ 1 we can remove it and replace it with just a 1 there. This has a profound effect on the performance of models, and we wanted to develop an understanding of why this matters so much. Part of the story is that, in the original logical forms, the commonest bigram in the dataset is "comma x", because of all those variables appearing inside those parenthetical expressions with commas.
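That observation is easy to check. Here is a minimal sketch for counting token bigrams over a collection of logical forms, assuming whitespace tokenization; this is my own toy helper, not the paper's code.

    from collections import Counter

    def bigram_counts(logical_forms):
        """Count adjacent-token pairs across a list of logical forms."""
        counts = Counter()
        for lf in logical_forms:
            tokens = lf.split()
            counts.update(zip(tokens, tokens[1:]))
        return counts

    # On the original COGS logical forms, the claim is that (',', 'x')
    # comes out as the most frequent bigram.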
When we make these adjustments to the logical forms, that skew goes away, and the bigram distribution over the COGS dataset is much more even with respect to those variables. Overall, we're seeing a much more even picture, one that is less dependent on these local conditional probabilities. Since those tokens carry no meaning, we decided to go ahead and remove them.
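Here is a minimal sketch of that kind of redundant-token removal, assuming the "x _ N" convention described above; the actual ReCOGS preprocessing may differ in its details.

    import re

    def strip_redundant_tokens(lf: str) -> str:
        """Rewrite variables like 'x _ 1' as just '1'."""
        return re.sub(r"x _ (\d+)", r"\1", lf)

    strip_redundant_tokens("* sailor ( x _ 1 ) ; run . agent ( x _ 2 , x _ 1 )")
    # -> '* sailor ( 1 ) ; run . agent ( 2 , 1 )'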
That still leaves those really persistently hard structural splits. First question: what is behind the zeros for CP and PP recursion? That was one of the hardest structural generalization splits.
The fundamental observation here is that models are exposed to a very particular distribution of lengths at training time. Here we're showing the distribution for input sentences: the generalization split lengths are in green, and they have this very long tail out to very long examples. Similarly for the output logical forms: you can see that the generalization ones are much, much longer, again with this very long tail of very long examples. I think it's perfectly reasonable to be pushing models to generalize to ever greater lengths at test time. But remember, our goal was to test for CP and PP recursion, and we can now see that that recursion question has been totally entwined with this question about length generalization.
To address this, we use a simple data augmentation: we concatenate existing examples and re-index the variables so as to cover all of the variable names that we end up seeing at test time. Because remember, one feature of this length issue is that it is not only positions that remain untrained during training and then end up being relevant during testing; the same thing happens with the variable names themselves. If the longest sequence you had in COGS training led you to have variable name 45, and you encounter 46 at test time, that variable name is essentially untrained, by virtue of the fact that it never appeared during training.
Again, it's fine to push models to overcome that problem, but here you can see it's coupled together with this recursion question, and we decoupled them by simply concatenating existing examples together and adjusting the variable names accordingly.
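Here is a minimal sketch of that concatenate-and-reindex idea, assuming whitespace tokenization and the numeral-only variable format from above. The real augmentation pipeline surely handles more details, such as how the two sentences and logical forms are actually joined.

    import re

    def concat_and_reindex(example1, example2):
        """Concatenate two (sentence, lf) pairs, shifting the second example's
        variable indices by the first sentence's length so that indices still
        track word position."""
        sent1, lf1 = example1
        sent2, lf2 = example2
        offset = len(sent1.split())
        lf2_shifted = re.sub(r"\d+", lambda m: str(int(m.group()) + offset), lf2)
        # The ' AND ' joiner is a placeholder; the actual output format may differ.
        return sent1 + " " + sent2, lf1 + " AND " + lf2_shifted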
This simple augmentation completely overcomes the problem, for both LSTMs and for transformers, taking us well past the previous state of the art on this recursion problem. It shows that the hard aspect of this split is not recursion, but rather this persistent issue about length generalization. That is an important lesson about that column of zeros there.
Second hard question: what is behind the zeros for PP modifiers? Here's our hypothesis about what's happening. For COGS, the training data actually teach the model that PPs occur only with a specific set of variables and positions. Models learn that restriction, and they then struggle with examples that contradict it. Everything models see about these PP modifiers suggests that they have a very limited distribution, and then at generalization time we confront them with the fact that that distribution was misleading. I think that poses some pretty deep questions about what's fair in terms of posing a compositional generalization task.
To test this hypothesis, what we did is take original COGS sentences and manipulate them in various ways so that PPs would associate with a wider range of linear positions and, correspondingly, a wider range of variable names. One manipulation is topicalization, which is a pretty routine operation on English objects. It has no effect on the underlying semantic representation, but for COGS, it has the effect of shifting around the variable names. In addition, to get fuller coverage of variable names and positions and so forth, we simply stick "um"s in at various random points in the input sentences. That changes nothing about the meaning, but again, it has the effect of shifting around the variable names and positions. And instead of having just standard prepositional modifiers, we also do things like "a leaf painting the spaceship", again to fill in the gaps in these variable names and positions.
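As a concrete illustration of the "um" trick, here is a minimal sketch under the assumption that variable indices track 0-based word position, so inserting a semantically empty filler shifts every index at or after the insertion point. This is my own toy helper, not the paper's code.

    import random
    import re

    def insert_um(sentence: str, lf: str):
        """Insert 'um' at a random position and bump variable indices that
        point at or past that position."""
        words = sentence.split()
        pos = random.randrange(len(words) + 1)
        words.insert(pos, "um")
        new_lf = re.sub(
            r"\d+",
            lambda m: str(int(m.group()) + 1) if int(m.group()) >= pos else m.group(),
            lf,
        )
        return " ".join(words), new_lf

    # e.g. ("The sailor ran", "* sailor ( 1 ) ; run . agent ( 2 , 1 )") with pos=1
    # -> ("The um sailor ran", "* sailor ( 2 ) ; run . agent ( 3 , 2 )")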
The result is a very large performance increase on these PP modification splits, again suggesting that our hypothesis is on the right track. Functionally, the blocker here was that models had been taught that PPs associate with certain variable names and certain positions, which is not precisely what we were hoping to test for.
Drawing on those insights, we performed some modifications to COGS to get ReCOGS. We did that redundant token removal that I described before. We did some meaning-preserving data augmentation of the sort I just described. Then we introduced this notion of arbitrary variable naming: we no longer use variable names that are tied to the position in the input string; rather, they're randomly assigned in a semantically consistent way. We have many more examples as a result, in an effort to teach models to abstract away from the precise names of variables.
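Here is a minimal sketch of what "randomly assigned in a semantically consistent way" could look like: each distinct index in a logical form gets mapped to a fresh random index, with the same mapping applied everywhere it occurs. Again, this is my own illustration; the actual ReCOGS scheme may differ in its details.

    import random
    import re

    def rename_variables(lf: str, max_index: int = 100) -> str:
        """Consistently replace each distinct variable index with a random new one."""
        old = sorted(set(re.findall(r"\d+", lf)), key=int)
        new = random.sample(range(max_index), k=len(old))
        mapping = {o: str(n) for o, n in zip(old, new)}
        return re.sub(r"\d+", lambda m: mapping[m.group()], lf)

    # e.g. "* sailor ( 1 ) ; run . agent ( 2 , 1 )"
    # might become "* sailor ( 58 ) ; run . agent ( 7 , 58 )"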
The result is a dataset with examples like the one at the bottom of this slide. The overall effect on performance is summarized in the diagram on the right here. For the original COGS logical forms, we have this really challenging aspect that the structural generalization splits show dismal performance. The redundant token removal doesn't really affect that, but the meaning-preserving data augmentation and the arbitrary variable renaming do dramatically improve performance on those structural generalization splits, with the net effect of evening out performance across these two aspects of the COGS/ReCOGS problem.
My high-level takeaway here is that ReCOGS is not necessarily an easier task; in some respects it looks harder, according to our experiments, than COGS. But it does even out performance and show that it's possible to get traction on those totally recalcitrant structural generalization splits that, in the literature before, gave us those columns of all zeros. Overall, we think that this is a healthier benchmark to hill climb on. That seems good as well, because ReCOGS is just getting us closer to testing the semantic phenomena that we care about.
To wrap up, I thought I would pose some conceptual questions that still linger for me, having done this deep dive on COGS and ReCOGS. First, how can we test for meaning if we're predicting logical forms? Logical forms are just more syntactic expressions, and as syntactic objects, the logical forms always have some arbitrariness about them. That's always going to get in the way of us really purely seeing whether models understand the meanings of these input sentences.
Second, what is a fair generalization test in the current context? A lot of our insights about compositionality start to border on things that might look like they are unfair in the sense of my earlier screencasts in this unit, because we are deliberately holding out from the training experiences some examples that we expect the model to grapple with at test time. For example, models are shown a world that manifests a specific restriction, like PPs appearing only in object position, and are then asked to grapple with a world in which that restriction does not hold. In some of these cases where we restrict the training experiences, we want models not to learn those restrictions, as in the COGS splits, whereas in other cases we do want them to learn those restrictions. Actually, just conceptually figuring out, for the different phenomena, which category they fall into seems extremely difficult to me.
Third, how does compositional generalization actually work in humans, and how should that inform our generalization tests? We have an assumption that natural languages are compositional. Maybe the thing to do is figure out where and how humans generalize, and then have a general expectation that our models will be able to follow suit.
For example, if we have goals that are not supported by our datasets, but that seem like good goals for models to reach, how should we express that in our tasks and in our models? We expect these issues to be embodied in the data, in the examples. But here we have goals that seem to reach beyond what we can encode in examples, especially given the variation that we see in 2B and 2C.
These are hard questions, but I find it exciting that we can pose them, and that we have models that might plausibly be on their way to achieving something like the compositional generalization that we are actually seeking.