Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 4: COGS and ReCOGS | Spring 2023
Chapters
0:00 Intro
0:55 Task
2:42 Motivations
4:20 Understanding COGS logical forms
6:40 COGS splits
7:09 Generalization categories
8:59 Synthetic leaderboard
10:44 Why removing redundant tokens matters
12:40 What is behind the 0s for CP/PP recursion?
15:45 What is behind the 0s for PP modifiers?
18:22 Modifications for ReCOGS
20:45 Conceptual questions
In the previous screencast, we talked about the principle of compositionality. We come now to our point of intersection with that principle: we're going to talk about the benchmarks COGS and ReCOGS.
What we were trying to do with ReCOGS is, first, understand why some of the generalization splits in COGS have proved so challenging for present-day models, and, in addition, reformulate COGS somewhat so that it comes closer to testing purely semantic phenomena.
The task in both cases is to map input sentences to logical forms, that is, descriptions of the meanings of the sentences. In the first example here, one conjunct of the COGS logical form describes the theme argument of the helping event, and x_3 is the event variable for this helping event. The event description also has an agent that is identified by another variable. The corresponding ReCOGS logical form expresses the same roles, and otherwise the semantics is very similar. You can see that ReCOGS logical forms tend to be shorter, and the event descriptions are somewhat more transparent. The second example again introduces an event variable, which is then used to identify both the theme and the agent. We have the star operator for definite descriptions as well, and the other simplifications are as in the first example.
What are the motivations for both COGS and ReCOGS? Well, they really tie into the compositionality principle: the idea that we combine familiar elements in ways that are systematic. In fact, the COGS generalization splits are sometimes even hard to appreciate as compositional generalization tasks, because the relevant leaps that we need to make seem very small indeed to us as speakers of a language like English. That is one way of explaining why we're able to make so much use of novel combinations of familiar elements: compositionality tells us that, in some sense, the meanings of those novel combinations were fully determined by the meanings of their parts. The core question we're trying to address is: can our best models do compositional generalization? Relatedly, have they too found compositional solutions?
I think the vision of COGS and ReCOGS is that these benchmarks give us a behavioral way to study those questions. One way to think about this is that if a model manages to handle these generalization splits, the best explanation for that is that it has found something like a compositional solution.
Let's look more closely at the COGS logical forms, because they have some interesting features about them that are worth understanding. The event descriptions involve one or more obligatory or optional roles. Here's a related English sentence: "the vase broke". Here the theme appears in grammatical subject position in the English sentence, and the variable names are determined by linear position in the input sentence. These logical forms might look strange because it looks like we have free variables, but they're all existentially bound with widest scope. In fact, we're meant to interpret this as though there were existentially quantified variables at the start of the logical form. The exception is the variables introduced by definite descriptions: they're bound more locally by that star operator.
For example, "the sailor ran" translates as star sailor x_1, and that continues with the sailor being the agent of the running event, bound to variable 2 for the verb in second position there.
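To make that concrete, here is a rough reconstruction of the logical form being described, written out as Python strings in the COGS notation. This is my own rendering, so treat the exact spacing as illustrative; the indices track word position as described above.

    sentence = "The sailor ran"
    cogs_lf = "* sailor ( x _ 1 ) ; run . agent ( x _ 2 , x _ 1 )"
    # x _ 1 names the sailor (word index 1), x _ 2 the running event (word index 2),
    # and the star marks the definite description "the sailor".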
The COGS splits have the following structure. There is a training set, and then a generalization set containing examples that target a range of different compositional generalization phenomena. For example, for subject-to-object with a common noun, at training time we see examples like "a hedgehog ate the cake", with "hedgehog" only ever in subject position; in the generalization split for this category, "hedgehog" appears in object position. For the primitive-to-grammatical-role splits, certain words appear in training only in isolation as primitives, and at generalization time they appear in full sentences, and we have to figure out what to do with them. Object-to-subject modification means that modifiers like PPs, which attach only to objects at training time, have to be interpreted on subjects in the generalization split. For the recursion splits, involving both sentential complements and PP complements, we see some number of levels of recursive depth at training and then deeper recursion at generalization time. Then there are some other categories that involve alternations, like active to passive and passive to active.
If you look at the synthetic leaderboard we assembled from results in the literature, you see what looks like a pretty rosy picture: for many of the generalization categories, you find some impressively high numbers indeed. But then travel with me further to the left in the table, to the structural generalization splits, and you find columns of zeros. That gives a very different picture of how these models are grappling with these COGS splits.
The first thing we did is just observe that there are a lot of redundant tokens in the COGS logical forms. For example, every single variable begins with "x space underscore space" and then the numeral. That prefix carries no information, so for a variable like x _ 1 we can remove it and replace it with just a 1 there. This has a profound effect on the performance of models, and we wanted to develop an understanding of why this matters so much. Part of the story is that, in the original logical forms, the commonest bigram in the dataset is "comma x", because of all those variables appearing inside those parenthetical expressions with commas.
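That observation is easy to check. Here is a minimal sketch for counting token bigrams over a collection of logical forms, assuming whitespace tokenization; this is my own toy helper, not the paper's code.

    from collections import Counter

    def bigram_counts(logical_forms):
        """Count adjacent-token pairs across a list of logical forms."""
        counts = Counter()
        for lf in logical_forms:
            tokens = lf.split()
            counts.update(zip(tokens, tokens[1:]))
        return counts

    # On the original COGS logical forms, the claim is that (',', 'x')
    # comes out as the most frequent bigram.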
When we make these adjustments to the logical forms, that skew goes away, and the bigram distribution over the COGS dataset is much more even with respect to those variables. Overall, we're seeing a much more even picture, one that is less dependent on these local conditional probabilities. Since those tokens carry no meaning, we decided to go ahead and remove them.
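Here is a minimal sketch of that kind of redundant-token removal, assuming the "x _ N" convention described above; the actual ReCOGS preprocessing may differ in its details.

    import re

    def strip_redundant_tokens(lf: str) -> str:
        """Rewrite variables like 'x _ 1' as just '1'."""
        return re.sub(r"x _ (\d+)", r"\1", lf)

    strip_redundant_tokens("* sailor ( x _ 1 ) ; run . agent ( x _ 2 , x _ 1 )")
    # -> '* sailor ( 1 ) ; run . agent ( 2 , 1 )'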
That still leaves those really persistently hard structural splits. First question: what is behind the zeros for CP and PP recursion? That was one of the hardest structural generalization splits.
The fundamental observation here is that models are exposed to a very particular distribution of lengths at training time. Here we're showing the distribution for input sentences: the generalization split lengths are in green, and they have this very long tail out to very long examples. Similarly for the output logical forms: you can see that the generalization ones are much, much longer, again with this very long tail of very long examples. I think it's perfectly reasonable to be pushing models to generalize to ever greater lengths at test time. But remember, our goal was to test for CP and PP recursion, and we can now see that that recursion question has been totally entwined with this question about length generalization.
To address this, we use a simple data augmentation: we concatenate existing examples and re-index the variables so as to cover all of the variable names that we end up seeing at test time. Because remember, one feature of this length issue is that it is not only positions that remain untrained during training and then end up being relevant during testing; the same thing happens with the variable names themselves. If the longest sequence you had in COGS training led you to have variable name 45, and you encounter 46 at test time, that variable name is essentially untrained, by virtue of the fact that it never appeared during training.
Again, it's fine to push models to overcome that problem, but here you can see it's coupled together with this recursion question, and we decoupled them by simply concatenating existing examples together and adjusting the variable names accordingly.
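Here is a minimal sketch of that concatenate-and-reindex idea, assuming whitespace tokenization and the numeral-only variable format from above. The real augmentation pipeline surely handles more details, such as how the two sentences and logical forms are actually joined.

    import re

    def concat_and_reindex(example1, example2):
        """Concatenate two (sentence, lf) pairs, shifting the second example's
        variable indices by the first sentence's length so that indices still
        track word position."""
        sent1, lf1 = example1
        sent2, lf2 = example2
        offset = len(sent1.split())
        lf2_shifted = re.sub(r"\d+", lambda m: str(int(m.group()) + offset), lf2)
        # The ' AND ' joiner is a placeholder; the actual output format may differ.
        return sent1 + " " + sent2, lf1 + " AND " + lf2_shifted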
This simple augmentation completely overcomes the problem, for both LSTMs and for transformers, taking us well past the previous state of the art on this recursion problem. It shows that the hard aspect of this split is not recursion, but rather this persistent issue about length generalization. That is an important lesson about that column of zeros there.
Second hard question: what is behind the zeros for PP modifiers? Here's our hypothesis about what's happening. For COGS, the training data actually teach the model that PPs occur only with a specific set of variables and positions. Models learn that restriction, and they then struggle with examples that contradict it. Everything models see about these PP modifiers suggests that they have a very limited distribution, and then at generalization time we confront them with the fact that that distribution was misleading. I think that poses some pretty deep questions about what's fair in terms of posing a compositional generalization task.
To test this hypothesis, what we did is take original COGS sentences and manipulate them in various ways so that PPs would associate with a wider range of linear positions and, correspondingly, a wider range of variable names. One manipulation is topicalization, which is a pretty routine operation on English objects. It has no effect on the underlying semantic representation, but for COGS, it has the effect of shifting around the variable names. In addition, to get fuller coverage of variable names and positions and so forth, we simply stick "um"s in at various random points in the input sentences. That changes nothing about the meaning, but again, it has the effect of shifting around the variable names and positions. And instead of having just standard prepositional modifiers, we also do things like "a leaf painting the spaceship", again to fill in the gaps in these variable names and positions.
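As a concrete illustration of the "um" trick, here is a minimal sketch under the assumption that variable indices track 0-based word position, so inserting a semantically empty filler shifts every index at or after the insertion point. This is my own toy helper, not the paper's code.

    import random
    import re

    def insert_um(sentence: str, lf: str):
        """Insert 'um' at a random position and bump variable indices that
        point at or past that position."""
        words = sentence.split()
        pos = random.randrange(len(words) + 1)
        words.insert(pos, "um")
        new_lf = re.sub(
            r"\d+",
            lambda m: str(int(m.group()) + 1) if int(m.group()) >= pos else m.group(),
            lf,
        )
        return " ".join(words), new_lf

    # e.g. ("The sailor ran", "* sailor ( 1 ) ; run . agent ( 2 , 1 )") with pos=1
    # -> ("The um sailor ran", "* sailor ( 2 ) ; run . agent ( 3 , 2 )")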
The result is a very large performance increase on these PP modification splits, again suggesting that our hypothesis is on the right track. Functionally, the blocker here was that models had been taught that PPs associate with certain variable names and certain positions, which is not precisely what we were hoping to test for.
Drawing on those insights, we performed some modifications to COGS to get ReCOGS. We did that redundant token removal that I described before. We did some meaning-preserving data augmentation of the sort I just described. Then we introduced this notion of arbitrary variable naming: we no longer use variable names that are tied to the position in the input string; rather, they're randomly assigned in a semantically consistent way. We have many more examples as a result, in an effort to teach models to abstract away from the precise names of variables.
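Here is a minimal sketch of what "randomly assigned in a semantically consistent way" could look like: each distinct index in a logical form gets mapped to a fresh random index, with the same mapping applied everywhere it occurs. Again, this is my own illustration; the actual ReCOGS scheme may differ in its details.

    import random
    import re

    def rename_variables(lf: str, max_index: int = 100) -> str:
        """Consistently replace each distinct variable index with a random new one."""
        old = sorted(set(re.findall(r"\d+", lf)), key=int)
        new = random.sample(range(max_index), k=len(old))
        mapping = {o: str(n) for o, n in zip(old, new)}
        return re.sub(r"\d+", lambda m: mapping[m.group()], lf)

    # e.g. "* sailor ( 1 ) ; run . agent ( 2 , 1 )"
    # might become "* sailor ( 58 ) ; run . agent ( 7 , 58 )"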
The result is a dataset with examples like the one at the bottom of this slide. The overall effect on performance is summarized in the diagram on the right here. For the original COGS logical forms, we have this really challenging aspect that the structural generalization splits show dismal performance. The redundant token removal doesn't really affect that, but the meaning-preserving data augmentation and the arbitrary variable renaming do dramatically improve performance on those structural generalization splits, with the net effect of evening out performance across these two aspects of the COGS/ReCOGS problem.
My high-level takeaway here is that ReCOGS is not necessarily an easier task; in some respects it looks harder, according to our experiments, than COGS. But it does even out performance and show that it's possible to get traction on those totally recalcitrant structural generalization splits that, in the literature before, gave us those columns of all zeros. Overall, we think that this is a healthier benchmark to hill climb on. That seems good as well, because ReCOGS is just getting us closer to testing the semantic phenomena that we care about.
To wrap up, I thought I would pose some conceptual questions that still linger for me, having done this deep dive on COGS and ReCOGS. First, how can we test for meaning if we're predicting logical forms? Logical forms are just more syntactic expressions, and as syntactic objects, the logical forms always have some arbitrariness about them. That's always going to get in the way of us really purely seeing whether models understand the meanings of these input sentences.
Second, what is a fair generalization test in the current context? A lot of our insights about compositionality start to border on things that might look like they are unfair in the sense of my earlier screencasts in this unit, because we are deliberately holding out from the training experiences some examples that we expect the model to grapple with at test time. For example, models are shown a world that manifests a specific restriction, like PPs appearing only in object position, and are then asked to grapple with a world in which that restriction does not hold. In some of these cases where we restrict the training experiences, we want models not to learn those restrictions, as in the COGS splits, whereas in other cases we do want them to learn those restrictions. Actually, just conceptually figuring out, for the different phenomena, which category they fall into seems extremely difficult to me.
Third, how does compositional generalization actually work in humans, and how should that inform our generalization tests? We have an assumption that natural languages are compositional. Maybe the thing to do is figure out where and how humans generalize, and then have a general expectation that our models will be able to follow suit.
For example, if we have goals that are not supported by our datasets, but that seem like good goals for models to reach, how should we express that in our tasks and in our models? We expect these issues to be embodied in the data, in the examples. But here we have goals that seem to reach beyond what we can encode in examples, especially given the variation that we see in 2B and 2C.
These are hard questions, but I find it exciting that we can pose them, and that we have models that might plausibly be on their way to achieving something like the compositional generalization that we are actually seeking.