
Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 4: COGS and ReCOGS | Spring 2023


Chapters

0:00 Intro
0:55 Task
2:42 Motivations
4:20 Understanding COGS logical forms
6:40 COGS splits
7:09 Generalization categories
8:59 Synthetic leaderboard
10:44 Why removing redundant tokens matters
12:40 What is behind the 0s for CP/PP recursion?
15:45 What is behind the 0s for PP modifiers?
18:22 Modifications for ReCOGS
20:45 Conceptual questions

Transcript

Welcome back everyone. This is screencast four in our series on advanced behavioral testing for NLU. In the previous screencast, we talked about the principle of compositionality. We come now to our point of intersection with the homework and the associated bake-off. We're going to talk about the benchmarks COGS and ReCOGS, which are both designed to test compositional generalization for our models.

COGS set the agenda here, and then ReCOGS is our extension of it. What we were trying to do with ReCOGS is, first, understand why some of the generalization splits in COGS have proved so challenging for present-day models and, in addition, reformulate COGS somewhat so that it comes closer to testing purely semantic phenomena, abstracting away from incidental features of some of the logical forms in COGS.

Let's start with the task description. We'll look at COGS first. The inputs are simple English sentences like "A rose was helped by a dog", and the outputs are logical forms, that is, descriptions of the meanings of the sentences. For COGS and ReCOGS, these are event-semantics-style descriptions. Here we've got rose, an indefinite corresponding to the grammatical subject, associated with variable x_1.

The next conjunct is help.theme. This describes the theme argument of the helping event: x_3 is the event variable for the helping event, and x_1 links back to rose, identifying the rose as the theme. The event description also has an agent, identified by variable x_6, which again binds into that helping event description via x_3, and x_6 is identified as being a dog.

Here's a similar example. This one involves the definite description "the sailor", which corresponds to the star operator in the logical form; otherwise the semantics is very similar. ReCOGS is very similar in many respects. Here I've got that same first example, "A rose was helped by a dog". You can see that ReCOGS logical forms tend to be shorter.

We've removed a lot of the symbols that are associated with variables. We've reorganized the conjuncts somewhat, and the event descriptions are somewhat more transparent. For example, we have a separate predication identifying the help event and binding it to variable x_7, which is then used to identify both the theme and the agent.
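
To make this concrete, here is a rough sketch of the two logical forms for this first example, written as plain strings. The exact tokenization and separators are an approximation of the dataset formats rather than verbatim output, and the ReCOGS rendering in particular is an assumption based on the description above.

```python
# Input sentence (0-indexed token positions: rose=1, helped=3, dog=6).
sentence = "A rose was helped by a dog"

# COGS-style logical form: variables carry the "x _" prefix and are numbered
# by the linear position of the word that introduces them.
cogs_lf = (
    "rose ( x _ 1 ) AND help . theme ( x _ 3 , x _ 1 ) "
    "AND help . agent ( x _ 3 , x _ 6 ) AND dog ( x _ 6 )"
)

# ReCOGS-style logical form (approximate): the "x _" prefix is gone, the help
# event gets its own predication (variable 7 here), and the theme and agent
# conjuncts refer back to that event variable.
recogs_lf = "rose ( 1 ) ; dog ( 6 ) ; help ( 7 ) AND theme ( 7 , 1 ) AND agent ( 7 , 6 )"
```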

Here's that second example, "The sailor dusted a boy". We have the star operator for definite descriptions as well, and the other simplifications are as in the first example. What are the motivations for both COGS and ReCOGS? Well, they really tie into the compositionality principle. We have an observation that humans easily interpret novel combinations of familiar elements in ways that are systematic.

This is so effortless in fact that the COGS generalization splits are even sometimes hard to appreciate as compositional generalization tasks because the relevant leaps that we need to make seem very small indeed to us as speakers of a language like English. The explanation for why this is so easy and effortless for us is compositionality.

That is one way of explaining why we're able to make so much use of novel combinations of familiar elements. Compositionality tells us that in some sense, the meanings of those novel combinations were fully determined by the meanings of their parts. The core question we're trying to address is, can our best models do compositional generalization?

Relatedly, have they too found compositional solutions? That would be a more internal question about their underlying causal mechanisms. I think the vision of COGS and ReCOGS is that we have behavioral tasks that can help us resolve the first of these questions, about generalization. The hope is that if models succeed there, we'll have an informed answer to the deeper, second question about the nature of their solutions.

One way to think about this is that if a model manages to succeed at a task like COGS or ReCOGS, the best explanation for that success is that it has found a compositional solution. We should pause to more deeply understand the COGS logical forms, because they have some interesting features that actually help explain the pattern of results that we see in the literature.

To start, and I alluded to this before, verbs specify primitive events that have their own core conceptual structure and can involve one or more obligatory or optional roles. Here's a quick example. Our sentence is "Emma broke a vase". This is, at its heart, a breaking event, and the logical form here is saying that it has two participants: the agent Emma and the theme, the vase, an indefinite identified by this predication up here.

Here's a related English sentence, "The vase broke". This one is lacking its agent argument, and the theme argument has been promoted to grammatical subject position in the English sentence, but otherwise this is very similar. In COGS, variable numbering is determined by linear position in the input sentence. That is, for example, the reason we're using the variable x_2 here for the event description is that the verb broke, which anchors that event, is in position two (counting from zero) in the input sentence.
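
As a concrete illustration of the position-based numbering, and of the star operator for definites, here is a small sketch; the logical-form strings are my approximation of the COGS format rather than verbatim dataset output.

```python
# Variable indices come from 0-based token positions in the input sentence.
tokens = "The vase broke".split()
positions = {tok.lower(): i for i, tok in enumerate(tokens)}
print(positions)  # {'the': 0, 'vase': 1, 'broke': 2}

# So the vase is x _ 1 and the breaking event is x _ 2. "The vase" is a
# definite description, so its predication is bound by the star operator:
the_vase_broke_lf = "* vase ( x _ 1 ) ; break . theme ( x _ 2 , x _ 1 )"

# Compare "Emma broke a vase": broke is now at position 1 and vase at 3,
# so the very same event and theme get different variable names.
emma_broke_a_vase_lf = (
    "break . agent ( x _ 1 , Emma ) AND break . theme ( x _ 1 , x _ 3 ) "
    "AND vase ( x _ 3 )"
)
```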

That turns out to be a really important feature of these logical forms, one that seriously impacts the performance of modern models, especially ones with positional encodings. All the variables in COGS and ReCOGS logical forms are bound: it looks like we have free variables, but they're all existentially bound with widest scope.

For example, here you have a logical form where it looks like variables 1 and 2 are free variables. In fact, we're meant to interpret this as though there were a prefix of existential quantifiers over those variables at the start of the logical form. Definite descriptions are also bound; they're bound more locally by that star operator.

For example, "The sailor ran" translates as star sailor x_1, because sailor is in the first position, and that's a complete definite description. The logical form then continues with the sailor being the agent of the running event, which is bound to variable x_2 because the verb is in the second position. The COGS splits have the following structure.

We have a pretty large train set. We also have a dev set and a test set, and both of those are IID, that is, they're standard evaluations in the terms that we're using in this unit. The interesting part is this group of 21,000 generalization examples corresponding to 21 different splits, each one trying to probe models for different compositional generalization phenomena.

This is a table from the paper that enumerates all of the generalization splits, at least in some fashion; they're broken up into different categories. We've talked a bit about these before, but let me just highlight a few. For this first block here, we're talking about putting familiar phrases into new positions.

For example, subject to object for common noun means that in training, we see examples like "A hedgehog ate the cake", where hedgehog is the grammatical subject. In the generalization split for this category, we first encounter hedgehog in a position that is not the subject, and the others are similar.

For the primitive to grammatical role splits, we see these primitives like shark and Paula as isolated elements in training. Then in generalization, we encounter them in full sentential context, and we have to figure out what to do with them. There are also novel combinations of modified phrases. For example, object modification to subject modification means that modifiers like on the plate occur in object position during the train examples, and then in subject position in the generalization split.

That turns out to be very difficult indeed. For the deeper recursion splits, for both sentential complements and PP complements, we see some amount of recursive depth in training, and then in generalization, we see even greater depths of recursion. Then there are some other splits that involve alternations of syntactic role, like active to passive and passive to active.

Some cases, like "Emily baked" versus "The giraffe baked a cake", involve shifts in argument structure, and so forth. Then at the bottom, we have some splits that involve verb classes. That's a high-level overview of the splits. What we did for the ReCOGS paper is assemble what I've called here a synthetic leaderboard.

This is just us pulling together results from a bunch of prominent papers in the literature that tackled the COGS problem. Let's first look all the way to the right here, at the Overall column. If you look there, which is a standard move for NLPers to make, just looking at the overall number, you see what looks like a pretty rosy picture.

For COGS, models are getting up into the 80s on these generalization splits, and it looks like they've really gotten traction. Then if you look just to the left of that, at the Lex column, which groups together the generalization splits that involve lexical generalization, you find some impressively high numbers indeed.

It looks like models are really good at those lexical tasks. But then travel with me further to the left, to the three columns that we've called structural generalization tasks: object PP to subject PP, a column of all zeros; CP recursion, pretty much all zeros; PP recursion, a similarly dismal story.

This looks to me like models are simply failing to get any traction at all. Something is systematically wrong. These are not just low numbers; this is an indication of something fundamentally being amiss in how these models are grappling with these COGS splits. This was really the puzzle that we posed when we started thinking about the ReCOGS work.

What is behind these columns of zero or near-zero numbers? It's a very worrisome situation, and we want to get to the bottom of it. The first thing we did is just observe that there are a lot of redundant tokens in the COGS logical forms. For example, every single variable begins with x, space, underscore, space, and then the numeral.

The numeral is the only distinguishing feature of that variable. What we decided to do is simply remove that prefix and leave just the numeral there. This has a profound effect on the performance of models, even though it's an obviously semantics-preserving change to the logical forms. I think we can think in terms of basic conditional probabilities to come to an understanding of why this matters so much.
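
Here is a minimal sketch of what that removal amounts to, assuming the logical forms are whitespace-tokenized strings as sketched above; the regular expression is my own illustration, not the authors' code.

```python
import re

# Collapse the three-token prefix "x _ <numeral>" down to just "<numeral>",
# leaving everything else in the logical form untouched.
def strip_variable_prefix(lf: str) -> str:
    return re.sub(r"x _ (\d+)", r"\1", lf)

lf = "* sailor ( x _ 1 ) ; run . agent ( x _ 2 , x _ 1 )"
print(strip_variable_prefix(lf))
# * sailor ( 1 ) ; run . agent ( 2 , 1 )
```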

Think just at the level of bigram frequency. For COGS, overwhelmingly, the commonest bigram in the dataset is comma x, and that's because so many of these expressions involve variables inside those parenthetical expressions with commas. Then way down with much smaller frequency, are commas with names next to them. When we make these adjustments to the logical forms, everything evens out much more.

Those variables are still prominent: 1, 4, 6, and 3. But notice that the proper name Emma, the most frequent proper name in the COGS dataset, is now on par with those variables. Overall, we're seeing a much more even picture, which is a much happier state for language models to be in, because they are so dependent on these local conditional probabilities.
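
To see the distributional point, one can simply count token bigrams before and after the change; this is an illustrative sketch, not the analysis code behind the figures.

```python
from collections import Counter

def bigram_counts(lfs):
    counts = Counter()
    for lf in lfs:
        toks = lf.split()
        counts.update(zip(toks, toks[1:]))
    return counts

original = ["rose ( x _ 1 ) AND help . theme ( x _ 3 , x _ 1 )"]
simplified = ["rose ( 1 ) AND help . theme ( 3 , 1 )"]

# In the original format, bigrams like (',', 'x') pile up, since every variable
# after a comma contributes one; after the change, those counts spread out over
# the numerals themselves (and proper names like Emma).
print(bigram_counts(original).most_common(3))
print(bigram_counts(simplified).most_common(3))
```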

Since this makes for a happier dataset, and is obviously semantics-preserving, we decided to go ahead and remove those tokens. That's a simple one. It helps mainly with the lexical generalization splits, but it is not especially impactful for those really persistently hard structural splits. Let's turn to them now. First question: what is behind the zeros for CP and PP recursion?

Those were among the hardest structural generalization splits. Here we should think first about length. The fundamental observation is that models are exposed to one distribution of example lengths during training and a very different distribution of lengths in the generalization splits. In particular, the longest examples occur in the generalization splits.

Here we're showing the distribution for input sentences: the generalization split lengths are in green, and they have this very long tail out to very long examples. Here are the output LFs, same thing. Look at the green examples. You can see that the generalization ones are much, much longer. We have again this very long tail of very long examples.

Now I should be clear, I think it's perfectly reasonable to be pushing models to generalize to ever greater lengths at test time. This is persistently hard for many models, and we should compel the field to find ways to address that limitation. But remember, our goal was to test for CP and PP recursion, and we can now see that that recursion question has been totally entwined with this question about length generalization.

We wanted to pull them apart. To decouple length from depth, we concatenate existing examples and re-index the variable names using the COGS protocol, so that we cover all of the variable names that we end up seeing at test time. Because remember, one feature of this length issue is that it's not only positions that go untrained during training and then turn out to be relevant during testing, but also the names of the variables.

If the longest sequence you had in COGS training led you to have variable name 45 and you encounter 46 at test time, then not only do the associated positions correspond to random embeddings, but that token itself has a random vector associated with it, by virtue of the fact that it never appeared during training.

Again, it's fine to push models to overcome that problem, but here you can see it's coupled together with the recursion question, and we decoupled them by simply augmenting the data: concatenating examples together and adjusting the variable names accordingly. What we find is that this essentially completely overcomes the problem for both LSTMs and for transformers.
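
Here is a minimal sketch of that augmentation step, assuming examples are (sentence, logical form) string pairs in the format sketched earlier; the helper names, and the choice of "AND" as the joining connective, are my own illustrative assumptions.

```python
import re

def shift_variables(lf: str, offset: int) -> str:
    # Re-index every "x _ <n>" variable by the token length of the sentence
    # that now precedes it, per the position-based numbering protocol.
    return re.sub(r"x _ (\d+)", lambda m: f"x _ {int(m.group(1)) + offset}", lf)

def concatenate(ex1, ex2):
    sent1, lf1 = ex1
    sent2, lf2 = ex2
    offset = len(sent1.split())
    return sent1 + " " + sent2, lf1 + " AND " + shift_variables(lf2, offset)

ex1 = ("A rose was helped by a dog",
       "rose ( x _ 1 ) AND help . theme ( x _ 3 , x _ 1 ) "
       "AND help . agent ( x _ 3 , x _ 6 ) AND dog ( x _ 6 )")
ex2 = ("A cat ran", "cat ( x _ 1 ) AND run . agent ( x _ 2 , x _ 1 )")

# The concatenated example uses variable names (x _ 8, x _ 9) that single
# training sentences alone would never have produced.
print(concatenate(ex1, ex2))
```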

The performance that we reach is well above the previous state of the art on this recursion problem. What this suggests to us is that the hard aspect of this split is not recursion, which we were trying to test for, but rather this persistent issue about length generalization. An important lesson about that column of zeros there.

Second hard question: what is behind the zeros for PP modifiers? Remember, that was literally a column of zeros in our synthetic leaderboard. Here's our hypothesis about what's happening. For COGS, the training data actually teach the model that PPs occur only with a specific set of variables and positions. When models learn this lesson, they then struggle with examples that contradict it.

Every experience that these models have with these PP modifiers suggests that they have a very limited distribution. Then at generalization time, we confront them with the fact that that distribution was misleading. I think that poses some pretty deep questions about what's fair in terms of posing a compositional generalization task.

But at the same time, we have diagnosed, I think, why we have all those zeros. To further that argument, what we did is take original COGS sentences and manipulate them in various ways so that PPs would associate with a wider range of linear positions and correspondingly a wider range of variable names.

We have various tricks for doing that. For example, we can pre-pose object noun phrases to get sentences like, "The box in the tent, Emma was lent." Topicalization is a pretty routine operation on English objects, and here we have simply implemented that. It has no effect on the underlying semantic representation, but for COGS, it has the effect of shifting around the variable names.

In addition, to get fuller coverage of variable names and positions and so forth, we also have interjections. We simply stick "ums" in at various random points in the input sentences. It's a meaning-preserving operation, it's just a filled pause, but again, it has the effect of shifting around the variable names and associated positions.
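
Here is a minimal sketch of that interjection trick, under the same assumptions about the logical-form format as above; the function and its details are my own illustration, not the authors' code.

```python
import random
import re

def insert_um(sentence: str, lf: str, seed: int = 0):
    toks = sentence.split()
    # Pick a random insertion point (not before the first word).
    pos = random.Random(seed).randint(1, len(toks) - 1)
    toks.insert(pos, "um")
    # Every variable whose index is at or beyond the insertion point
    # shifts right by one, keeping the position-based numbering consistent.
    new_lf = re.sub(
        r"x _ (\d+)",
        lambda m: f"x _ {int(m.group(1)) + 1}" if int(m.group(1)) >= pos else m.group(0),
        lf,
    )
    return " ".join(toks), new_lf

print(insert_um("The vase broke", "* vase ( x _ 1 ) ; break . theme ( x _ 2 , x _ 1 )"))
```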

To further cover the space, we also do this thing with participles. Instead of having just standard prepositional modifiers, we also do things like a leaf painting the spaceship, again, to fill in the gaps in these variable names and positions. The result is a very large performance increase for both LSTMs and transformers on these splits.

Again, this suggests that our hypothesis is on the right track and that, functionally, the blocker here was that models had been taught that PPs associate with certain variable names and certain positions, which is not precisely what we were hoping to test for. On the basis of these insights, we performed some modifications to COGS to get ReCOGS.

This is a high-level summary of what we did. Imagine an input sentence, Mia ate a cake. The COGS LF looks like this. We did that redundant token removal that I described before, mostly focused on the variable names. We did some meaning-preserving data augmentation of the sort I described for the CP and PP recursion cases.

Then we introduced this notion of arbitrary variable naming. ReCOGS examples do not have variable names that are tied to position in the input string. Rather, they're randomly assigned in a semantically consistent way. We have many more examples as a result, in an effort to teach models to abstract away from the precise names of variables.
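
Here is a minimal sketch of what "randomly assigned in a semantically consistent way" could look like, assuming logical forms with bare numeral variables as sketched earlier; the helper name and the sampling range are my own assumptions.

```python
import random
import re

def rename_variables(lf: str, rng: random.Random, max_index: int = 100) -> str:
    # Collect the distinct variable numerals in order of first appearance and
    # map each to a fresh random index, applied consistently across the LF.
    old = list(dict.fromkeys(re.findall(r"\b\d+\b", lf)))
    new = rng.sample(range(max_index), k=len(old))
    mapping = dict(zip(old, (str(n) for n in new)))
    return re.sub(r"\b\d+\b", lambda m: mapping[m.group(0)], lf)

lf = "rose ( 1 ) ; dog ( 6 ) ; help ( 7 ) AND theme ( 7 , 1 ) AND agent ( 7 , 6 )"
print(rename_variables(lf, random.Random(0)))
```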

For our given example here, we end up with a ReCOGS logical form that looks like the one at the bottom of this slide. The overall effect on performance is summarized in the diagram on the right here. For COGS LFs, we have this really challenging situation in which the structural generalization tasks show dismal performance.

The redundant token removal doesn't really affect that, as you can see in these two low red bars. But meaning-preserving data augmentation and arbitrary variable renaming do dramatically improve performance on those structural generalization splits, with the net effect of evening out performance across these two aspects of the COGS/ReCOGS problem.

Here's a summary of the results. We have LSTMs on the top, transformers on the bottom. My high-level takeaway here is that ReCOGS is not necessarily an easier task. In fact, according to our experiments, some aspects of it actually look harder than COGS. But what we have done is even out performance and show that it's possible to get traction on those totally recalcitrant structural generalization splits that led to that really strange situation in the literature before, with those columns of all zeros.

Overall, we think that this is a healthier benchmark to hill-climb on. That seems good as well because ReCOGS is just getting us closer to testing the semantic phenomena that we care about. To wrap up the screencast, I thought I would pose some conceptual questions that still linger for me, having done this deep dive on COGS and ReCOGS.

First, how can we test for meaning if we're predicting logical forms? Logical forms are just more syntactic expressions. We want to get at the true meaning, but we have to do that through logical forms, and logical forms always have some arbitrariness about them. That's always going to get in the way of us really purely seeing whether models understand the meanings of these input sentences.

What is a fair generalization test in the current context? A lot of our insights about compositionality start to border on things that might look like they are unfair in the sense of my earlier screencasts in this unit, because we are deliberately holding out from the training experiences some examples that we expect the model to grapple with at test time.

For example, models are shown a world that manifests a specific restriction, like PPs appearing only in object position, and are then asked to grapple with a world in which they also appear in subject position. The hard part about this is that in some of the cases where we restrict the training experiences, we want models not to learn those restrictions, as in the COGS splits, whereas in other cases we do want them to learn those restrictions.

Actually, just conceptually figuring out, for different phenomena, which category they fall into seems extremely difficult to me. What are the limits of compositionality for humans, and how should that inform our generalization tests? Maybe that's the ultimate ground truth here. We have an assumption that natural languages are compositional, but that's a very strong assumption.

It surely has some limitations. Maybe the thing to do is figure out where and how humans generalize, and where they don't, and then just set up a general expectation that our models will be able to follow suit. But that poses many conceptual questions. For example, if we have goals that are not supported by our datasets, but that seem like good goals for models to reach, how should we express that in our tasks and in our models?

We expect these issues to be embodied in the data, in the examples. But here we have goals that seem to reach beyond what we can encode in examples, especially given the variation that we see in 2B and 2C. I don't have answers to these questions, but I find it exciting that we can pose them and that we have models that might plausibly be on their way to achieving something like the compositional generalization that we are actually seeking.