Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 4: COGS and ReCOGS | Spring 2023


Chapters

0:00 Intro
0:55 Task
2:42 Motivations
4:20 Understanding COGS logical forms
6:40 COGS splits
7:09 Generalization categories
8:59 Synthetic leaderboard
10:44 Why removing redundant tokens matters
12:40 What is behind the zeros for CP/PP recursion?
15:45 What is behind the zeros for PP modifiers?
18:22 Modifications for ReCOGS
20:45 Conceptual questions


00:00:00.000 | Welcome back everyone.
00:00:06.240 | This is screencast four in our series
00:00:08.520 | on advanced behavioral testing for NLU.
00:00:10.920 | In the previous screencast,
00:00:12.300 | we talked about the principle of compositionality.
00:00:14.860 | We come now to our point of intersection with
00:00:17.360 | the homework and the associated bake-off.
00:00:19.580 | We're going to talk about the benchmarks COGS and
00:00:22.060 | ReCOGS, which are both designed to test
00:00:24.440 | compositional generalization for our models.
00:00:27.760 | COGS set the agenda here,
00:00:30.100 | and then ReCOGS is our extension of it.
00:00:32.160 | What we were trying to do with ReCOGS is, first,
00:00:34.600 | understand why some of the generalization splits in
00:00:37.800 | COGS have proved so challenging for present-day models,
00:00:41.720 | and, in addition, reformulate COGS somewhat so that it comes
00:00:45.360 | closer to testing purely semantic phenomena,
00:00:49.120 | abstracting away from incidental features of
00:00:51.920 | some of the logical forms that are in COGS.
00:00:55.040 | Let's start with the task description.
00:00:57.440 | We'll look at COGS first.
00:00:59.080 | The inputs are simple English sentences,
00:01:01.560 | like "a rose was helped by a dog,"
00:01:03.600 | and the outputs are logical forms,
00:01:06.020 | that is, descriptions of the meanings of the sentences.
00:01:08.680 | For COGS and ReCOGS,
00:01:10.440 | these are event semantic style descriptions.
00:01:13.840 | Here we've got rose, an indefinite
00:01:16.300 | corresponding to the grammatical subject,
00:01:18.240 | identified here with variable x_1.
00:01:20.320 | The next conjunct is help theme.
00:01:22.760 | This is describing the theme argument of the helping event,
00:01:26.000 | and x_3 is the event variable for this helping event,
00:01:29.920 | and x_1 links back to rose,
00:01:32.480 | identifying this rose as the theme.
00:01:35.160 | The event description also has an agent that is identified by
00:01:38.520 | variable x_6 and again binds into
00:01:40.640 | that helping event description x_3,
00:01:43.240 | and x_6 is identified as being a dog.
00:01:47.080 | Here's a similar example.
00:01:49.120 | This one involves the definite description
00:01:51.080 | "the sailor," which corresponds to
00:01:53.000 | the star operator in
00:01:54.360 | the logical form; otherwise the semantics is very similar.
00:01:58.640 | ReCOGS is very similar in many respects.
00:02:01.920 | Here I've got that same first example,
00:02:03.800 | a rose was helped by a dog.
00:02:05.480 | You can see that ReCOGS logical forms tend to be shorter.
00:02:08.960 | We've removed a lot of the symbols
00:02:10.940 | that are associated with variables.
00:02:13.060 | We've reorganized the conjuncts somewhat,
00:02:15.600 | and the event descriptions are somewhat more transparent.
00:02:18.760 | For example, we have a separate predication
00:02:21.160 | identifying the help event and
00:02:23.240 | binding it to variable x_7,
00:02:25.200 | which is then used to identify both the theme and the agent.
00:02:29.680 | Here's that second example,
00:02:31.560 | the sailor dusted a boy.
00:02:32.920 | We have the star operator for definite descriptions as well,
00:02:36.640 | and the other simplifications are as in the first example.
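To make the contrast concrete, here is a sketch of the two logical form styles for that first example. The exact spacing and separators, and the particular ReCOGS variable numbers, are illustrative reconstructions from the description above, not verbatim dataset entries:

```python
# Illustrative reconstruction of the two LF formats; consult the released
# COGS/ReCOGS data for the exact tokenization and separators.
sentence = "a rose was helped by a dog"

# COGS: variables look like "x _ N", with N the token's linear position.
cogs_lf = ("rose ( x _ 1 ) AND help . theme ( x _ 3 , x _ 1 ) "
           "AND help . agent ( x _ 3 , x _ 6 ) AND dog ( x _ 6 )")

# ReCOGS: shorter variable tokens, reorganized conjuncts, and a separate
# predication introducing the help event (bound here to 7, per the lecture;
# the dog's variable is illustrative).
recogs_lf = "rose ( 1 ) ; help ( 7 ) ; theme ( 7 , 1 ) ; agent ( 7 , 6 ) ; dog ( 6 )"
```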
00:02:41.120 | What are the motivations for both COGS and ReCOGS?
00:02:45.520 | Well, they really tie into the compositionality principle.
00:02:48.160 | We have an observation that humans easily
00:02:51.120 | interpret novel combinations of
00:02:52.960 | familiar elements in ways that are systematic.
00:02:55.520 | This is so effortless, in fact, that
00:02:57.480 | the COGS generalization splits are sometimes even hard to
00:03:01.100 | appreciate as compositional generalization tasks because
00:03:05.140 | the relevant leaps that we need to make seem very small
00:03:07.960 | indeed to us as speakers of a language like English.
00:03:11.960 | The explanation for why this is so easy and
00:03:15.460 | effortless for us is compositionality.
00:03:18.200 | That is one way of explaining why we're able to make
00:03:21.040 | so much use of novel combinations of familiar elements.
00:03:24.180 | Compositionality tells us that in some sense,
00:03:26.820 | the meanings of those novel combinations were
00:03:28.960 | fully determined by the meanings of their parts.
00:03:32.080 | The core question we're trying to address is,
00:03:34.920 | can our best models do compositional generalization?
00:03:39.400 | Relatedly, have they too found compositional solutions?
00:03:43.580 | That would be a more internal question
00:03:45.300 | about their underlying causal mechanisms.
00:03:47.840 | I think the vision of COGS and ReCOGS is that
00:03:50.840 | we have behavioral tasks that can help us
00:03:53.600 | resolve question 3 about generalization.
00:03:56.920 | The hope is that if they succeed at question 3,
00:03:59.720 | we'll have an informed answer to
00:04:01.660 | the deeper question posed in 4
00:04:03.620 | about the nature of their solutions.
00:04:06.020 | One way to think about this is that if a model manages to
00:04:09.400 | succeed at a task like COGS or ReCOGS,
00:04:12.680 | the best explanation
00:04:13.860 | for that success is that
00:04:16.520 | it has found a compositional solution.
00:04:20.000 | We should pause to more deeply
00:04:23.280 | understand the COGS logical forms,
00:04:25.560 | because they have some interesting features that
00:04:28.000 | actually help explain the pattern
00:04:30.320 | of results that we see in the literature.
00:04:32.760 | To start, I alluded to this before,
00:04:35.520 | verbs specify primitive events that have
00:04:38.360 | their own core conceptual structure and can
00:04:41.520 | involve one or more obligatory or optional roles.
00:04:44.880 | Here's a quick example.
00:04:46.060 | Our sentence is, Emma broke a vase.
00:04:48.720 | This is, at its heart,
00:04:50.160 | a breaking event, and the logical form
00:04:52.760 | here says that it has two participants:
00:04:55.120 | the agent Emma and the theme,
00:04:58.320 | which is the vase, an indefinite
00:05:00.720 | identified by this predication up here.
00:05:03.500 | Here's a related English sentence, the vase broke.
00:05:06.520 | This one is lacking its agent argument and
00:05:09.680 | the theme argument has been promoted to
00:05:11.540 | grammatical subject position in the English sentence,
00:05:14.820 | but otherwise, this is very similar.
00:05:17.440 | In COGS, variable numbering is
00:05:21.280 | determined by linear position in the input sentence.
00:05:25.280 | That is, for example,
00:05:26.640 | the reason that we're using the variable x_2
00:05:29.080 | here for the event description is because
00:05:31.640 | the verb break which anchors that event
00:05:34.220 | is in position two in the input sentence.
00:05:37.800 | That turns out actually to be
00:05:39.680 | a really important feature of
00:05:40.980 | these logical forms that seriously
00:05:43.160 | impacts the performance of modern models,
00:05:45.880 | especially ones with positional encoding.
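As a minimal sketch of that convention (illustrative, not the official COGS generation code), the variable index is simply the 0-indexed position of the anchoring token in the input sentence:

```python
# Sketch of COGS's positional variable numbering: the variable for a
# predicate is indexed by the 0-based position of the token anchoring it.
def cogs_variable(sentence, anchor):
    tokens = sentence.lower().split()
    return f"x _ {tokens.index(anchor)}"

print(cogs_variable("The vase broke", "broke"))  # -> x _ 2
print(cogs_variable("The vase broke", "vase"))   # -> x _ 1
```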
00:05:48.880 | All the variables in COGS and
00:05:51.680 | ReCOGS logical forms are bound.
00:05:54.360 | It looks like we have free variables,
00:05:57.020 | but they're all existentially bound with widest scope.
00:06:00.280 | For example, you might have
00:06:01.520 | a logical form like this where it looks
00:06:03.220 | like variables 1 and 2 are free variables.
00:06:06.620 | In fact, we're meant to interpret this as though there were
00:06:09.360 | a prefix of existentially
00:06:11.400 | quantified variables at the start of the logical form.
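Schematically, and in illustrative notation rather than the official formalism, an LF with apparently free variables is read under existential closure at widest scope:

```latex
% Surface LF (schematic):  cat(x_1) AND run.agent(x_2, x_1)
% Intended interpretation, with an existential prefix of widest scope:
\exists x_1 \, \exists x_2 \, \big[\, \mathit{cat}(x_1) \wedge \mathit{run.agent}(x_2, x_1) \,\big]
```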
00:06:16.040 | Definite descriptions are also bound.
00:06:19.240 | They're bound more locally by that star operator.
00:06:22.520 | For example, the sailor ran translates as star sailor x_1,
00:06:26.640 | because sailor is in the first position,
00:06:28.660 | and that's a complete definite description.
00:06:30.680 | Then that continues with the sailor being the agent of
00:06:33.840 | the running event, bound to variable x_2 in second position there.
00:06:40.240 | The COGS splits have the following structure.
00:06:43.600 | We have a pretty large train set.
00:06:46.320 | We also have a dev set and a test set,
00:06:49.020 | and both of those are IID,
00:06:50.840 | that is, they're standard evaluations
00:06:52.800 | in the terms that we're using in this unit.
00:06:54.980 | The interesting part is this group of
00:06:57.560 | 21,000 generalization examples
00:07:01.380 | corresponding to 21 different splits,
00:07:03.680 | each one trying to probe models for
00:07:06.080 | different compositional generalization phenomena.
00:07:10.040 | This is a table from the paper that
00:07:12.240 | enumerates all of the generalization splits,
00:07:15.240 | at least in some fashion;
00:07:16.440 | they're broken up into different categories.
00:07:18.680 | We've talked a bit about these before,
00:07:20.680 | but let me just highlight a few.
00:07:22.360 | For this first block here,
00:07:23.680 | we're talking about putting
00:07:24.760 | familiar phrases into new positions.
00:07:28.320 | For example, subject to object for common noun,
00:07:31.360 | means that in training,
00:07:32.840 | we see examples like a hedgehog ate the cake,
00:07:35.440 | where hedgehog is a grammatical subject.
00:07:38.120 | In the generalization split in this category,
00:07:40.880 | we first encounter hedgehog in
00:07:43.480 | a position that is not the subject,
00:07:45.520 | and the others are similar.
00:07:46.800 | For the primitive to grammatical role splits,
00:07:49.920 | we see these primitives like shark and
00:07:53.120 | Paula as isolated elements in training.
00:07:56.640 | Then in generalization, we encounter
00:07:58.680 | them in full sentential context,
00:08:01.360 | and we have to figure out what to do with them.
00:08:03.680 | There are also novel combinations
00:08:05.560 | of modified phrases.
00:08:06.880 | For example, object modification to
00:08:09.600 | subject modification means that modifiers like
00:08:12.960 | on the plate occur in
00:08:15.120 | object position during the train examples,
00:08:17.680 | and then in subject position
00:08:20.000 | in the generalization split.
00:08:21.840 | That turns out to be very difficult indeed.
00:08:24.680 | For deeper recursion for
00:08:26.520 | both sentential complements and PP complements,
00:08:29.120 | we see some amount of recursive depth in training,
00:08:32.920 | and then in generalization,
00:08:34.400 | we see even greater depth for those recursive structures.
00:08:38.280 | Then there are some other things that involve
00:08:40.440 | alternation of syntactic role
00:08:43.120 | like active to passive and passive to active.
00:08:46.080 | Some cases, like "Emily
00:08:47.840 | baked" and "the giraffe baked a cake,"
00:08:49.640 | involve shifting of
00:08:50.800 | the argument structures, and so forth.
00:08:52.960 | Then at the bottom, we have some splits
00:08:55.360 | that involve verb classes.
00:08:58.520 | That's a high-level overview of the splits.
00:09:02.480 | What we did for the ReCOGS paper is
00:09:04.680 | assemble what I've here
00:09:05.760 | called a synthetic leaderboard.
00:09:07.600 | This is just us pulling together
00:09:09.640 | results from a bunch of
00:09:11.040 | prominent papers in the literature
00:09:12.760 | that tackled the COGS problem.
00:09:15.000 | Let's first look all the way to
00:09:17.080 | the right here at the overall column.
00:09:19.520 | If you look just there, which is
00:09:21.120 | a standard move for NLPers to make,
00:09:23.400 | considering only the overall column,
00:09:24.880 | you see what looks like a pretty rosy picture.
00:09:27.400 | For COGS, models are getting up into
00:09:30.560 | the 80s on these generalization splits,
00:09:32.840 | and that looks like they've
00:09:33.720 | really gotten traction.
00:09:35.440 | Then if you look just to the left of that,
00:09:37.880 | at that lex column,
00:09:39.120 | which groups together
00:09:40.480 | the generalization splits that
00:09:41.880 | involve lexical generalization,
00:09:43.840 | you find some impressively high numbers indeed.
00:09:46.640 | It looks like models are really
00:09:48.280 | good at those lexical tasks.
00:09:51.080 | But then travel with me further to the left from
00:09:54.160 | there to the three columns that we've
00:09:56.480 | called structural generalization tasks,
00:09:59.560 | object PP to subject PP:
00:10:01.960 | a column of all zeros;
00:10:04.040 | CP recursion: pretty much all zeros;
00:10:07.760 | PP recursion: a similarly dismal story.
00:10:12.040 | This looks to me like models are
00:10:14.480 | simply failing to get any traction at all.
00:10:17.560 | Something is systematically wrong.
00:10:19.960 | These are not just low numbers,
00:10:22.080 | this is an indication of something
00:10:23.800 | fundamentally being amiss about how
00:10:26.560 | these models are grappling with these COGS splits.
00:10:30.280 | This was really the puzzle that we
00:10:32.400 | posed when we started
00:10:33.600 | thinking about the ReCOGS work.
00:10:35.320 | What is behind these columns
00:10:37.720 | of zero or near zero numbers?
00:10:40.000 | It's a very worrisome situation,
00:10:42.480 | and we want to get to the bottom of it.
00:10:44.560 | The first thing we did is just observe that there are a lot
00:10:48.440 | of redundant tokens in the COGS logical forms.
00:10:52.480 | For example, every single variable begins with
00:10:55.720 | "x _ " (x, space, underscore, space) followed by the numeral.
00:11:00.080 | The numeral is the only
00:11:01.640 | distinguishing feature of that variable.
00:11:04.520 | What we decided to do is simply
00:11:06.880 | remove that prefix, leaving just the numeral there.
00:11:09.840 | This has a profound effect on the performance of models,
00:11:13.920 | even though it's obviously a semantics-preserving
00:11:16.400 | change to the logical forms.
00:11:19.120 | I think we can think in terms of
00:11:20.960 | basic conditional probabilities to come to
00:11:23.600 | an understanding of why this matters so much.
00:11:26.600 | Think just at the level of bigram frequency.
00:11:29.640 | For COGS, overwhelmingly,
00:11:32.480 | the most common bigram in the dataset is comma followed by x,
00:11:36.200 | and that's because so many of
00:11:38.160 | these expressions involve variables
00:11:40.880 | inside those parenthetical expressions with commas.
00:11:44.320 | Then, way down with much smaller frequency,
00:11:47.680 | come commas with names next to them.
00:11:50.400 | When we make these adjustments to the logical forms,
00:11:53.720 | everything evens out much more.
00:11:56.000 | Those variables are still prominent:
00:11:58.160 | 1, 4, 6, and 3.
00:11:59.880 | But notice that proper name Emma,
00:12:02.440 | the most frequent proper name in
00:12:04.000 | the COGS dataset is now on par with those variables.
00:12:08.280 | Overall, we're seeing a much more even picture,
00:12:11.040 | which is a much happier state for
00:12:13.520 | language models to be in because they are so
00:12:16.080 | dependent on these local conditional probabilities.
00:12:20.200 | Since this makes for a happier dataset
00:12:23.560 | and is obviously semantics-preserving,
00:12:26.000 | we decided to go ahead and remove those tokens.
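Here is a minimal sketch of both steps, counting bigrams to expose the skew and stripping the redundant prefix; this is illustrative code, not the released ReCOGS pipeline:

```python
from collections import Counter

def bigram_counts(lfs):
    """Count token bigrams across logical forms, to expose skew like the
    overwhelmingly frequent (',', 'x') pair in COGS."""
    counts = Counter()
    for lf in lfs:
        toks = lf.split()
        counts.update(zip(toks, toks[1:]))
    return counts

def strip_variable_prefix(lf):
    """Semantics-preserving removal of the redundant 'x _ ' prefix,
    leaving just the distinguishing numeral."""
    return lf.replace("x _ ", "")

lf = "rose ( x _ 1 ) AND help . theme ( x _ 3 , x _ 1 )"
print(strip_variable_prefix(lf))  # rose ( 1 ) AND help . theme ( 3 , 1 )
print(bigram_counts([lf]).most_common(3))
```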
00:12:29.240 | That's a simple one. That helps
00:12:30.920 | mainly with lexical generalization splits,
00:12:33.440 | but it is not especially impactful for
00:12:36.320 | those really persistently hard structural splits.
00:12:40.280 | Let's turn to them now.
00:12:42.000 | First question: what is behind the zeros for CP and PP recursion?
00:12:47.840 | That was one of the hardest structural generalization splits.
00:12:52.000 | Here we should think first about length.
00:12:55.240 | The fundamental observation here is that models are exposed to
00:13:00.120 | one distribution of example lengths during training and
00:13:03.800 | a very different distribution of
00:13:06.080 | lengths during the generalization splits.
00:13:08.640 | In particular, the longest examples
00:13:11.600 | occur in the generalization splits.
00:13:13.480 | Here we're showing the distribution for input sentences;
00:13:16.280 | the generalization split lengths are in green, and they have
00:13:19.040 | this very long tail out to very long examples.
00:13:22.120 | Here are the output LFs, same thing.
00:13:24.520 | Look at the green examples.
00:13:26.560 | You can see that the generalization ones are much, much longer.
00:13:30.240 | We have again this very long tail of very long examples.
00:13:34.560 | Now I should be clear,
00:13:36.280 | I think it's perfectly reasonable to be pushing
00:13:39.360 | models to generalize to ever greater lengths at test time.
00:13:43.680 | This is persistently hard for many models,
00:13:46.160 | and we should compel the field to
00:13:48.440 | find ways to address that limitation.
00:13:51.120 | But remember, our goal was to test for CP and PP recursion,
00:13:55.560 | and we can now see that that recursion question has been
00:13:58.640 | totally entwined with this question about length generalization.
00:14:03.880 | We wanted to tease them apart.
00:14:06.440 | To decouple length from depth,
00:14:09.080 | we concatenate existing examples and re-index
00:14:13.120 | the variable names using the COGS protocol
00:14:15.920 | to cover all of the variable names that we end up seeing at test time.
00:14:20.240 | Because remember, one feature of this length issue is that
00:14:24.760 | not only positions that remain untrained during training,
00:14:29.840 | and then end up appearing relevant during testing,
00:14:32.320 | but also the names of the variables.
00:14:34.880 | If the longest sequence you had in COGS training led you to have
00:14:39.160 | variable name 45 and you encounter 46 at test time,
00:14:44.040 | then not only are the associated positions
00:14:46.760 | corresponding to random embeddings,
00:14:51.520 | but that token itself also has a
00:14:55.120 | random vector associated with it, in
00:14:55.120 | virtue of the fact that it never appeared during training.
00:14:58.660 | Again, it's fine to push models to overcome that problem,
00:15:02.440 | but here you can see it's coupled together with
00:15:04.760 | this recursion question and we decoupled them by simply
00:15:08.840 | augmenting examples by concatenating them
00:15:11.600 | together and adjusting the variable names accordingly.
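Here is a minimal sketch of that augmentation, assuming COGS-style "x _ N" variables indexed by 0-based token position; the exact convention for joining the two sentences and LFs is an assumption here:

```python
import re

def concat_and_reindex(ex1, ex2):
    """Concatenate two (sentence, LF) examples, shifting the second LF's
    positional variables by the first sentence's length so the result
    still obeys the COGS numbering protocol (sketch)."""
    sent1, lf1 = ex1
    sent2, lf2 = ex2
    offset = len(sent1.split())
    lf2 = re.sub(r"x _ (\d+)",
                 lambda m: f"x _ {int(m.group(1)) + offset}", lf2)
    # Joining with a space / AND is an assumed convention.
    return f"{sent1} {sent2}", f"{lf1} AND {lf2}"

ex = ("a vase broke", "vase ( x _ 1 ) AND break . theme ( x _ 2 , x _ 1 )")
print(concat_and_reindex(ex, ex))
# -> ('a vase broke a vase broke',
#     'vase ( x _ 1 ) AND break . theme ( x _ 2 , x _ 1 ) AND
#      vase ( x _ 4 ) AND break . theme ( x _ 5 , x _ 4 )')
```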
00:15:15.320 | What we find is that that essentially
00:15:18.120 | completely overcomes the problem for both LSTMs and for transformers.
00:15:23.680 | The performance that we reach is well above
00:15:26.480 | the previous state of the art on this recursion problem.
00:15:29.840 | What this suggests to us is that
00:15:32.020 | the hard aspect of this split is not recursion,
00:15:34.640 | which we were trying to test for,
00:15:36.200 | but rather this persistent issue about length generalization.
00:15:40.080 | An important lesson about that column of zeros there.
00:15:44.880 | Second hard question: what is behind the zeros for PP modifiers?
00:15:50.200 | Remember that was literally a column
00:15:52.380 | of zeros in our synthetic leaderboard.
00:15:54.640 | Here's our hypothesis about what's happening.
00:15:57.600 | For COGS, the training data actually teach the model
00:16:01.300 | that PPs occur only with a specific set of variables and positions.
00:16:06.200 | When models learn this lesson,
00:16:08.320 | they then struggle with examples that contradict it.
00:16:11.680 | Every experience that these models have with
00:16:14.840 | these PP modifiers suggests that they have a very limited distribution.
00:16:19.420 | Then at generalization time,
00:16:21.320 | we confront them with the fact that that distribution was misleading.
00:16:25.520 | I think that poses some pretty deep questions about what's
00:16:28.920 | fair in terms of posing a compositional generalization task.
00:16:33.680 | But at the same time, we have diagnosed,
00:16:35.780 | I think, why we have all those zeros.
00:16:38.160 | To further that argument,
00:16:39.980 | what we did is take original COGS sentences and manipulate them in
00:16:44.280 | various ways so that PPs would associate with a wider range of
00:16:48.940 | linear positions and correspondingly a wider range of variable names.
00:16:54.120 | We have various tricks for doing that.
00:16:55.920 | For example, we can prepose
00:16:58.700 | object noun phrases to get sentences like,
00:17:01.860 | "The box in the tent, Emma was lent."
00:17:04.840 | Topicalization is a pretty routine operation on English objects,
00:17:09.660 | and here we have simply implemented that.
00:17:11.400 | It has no effect on the underlying semantic representation,
00:17:15.060 | but for COGS, it has the effect of shifting around the variable names.
00:17:19.900 | In addition, to get fuller coverage of variable names and positions and so forth,
00:17:26.320 | we also use interjections.
00:17:28.160 | We simply stick in "um"s at various random points in the input sentences.
00:17:33.120 | It's a meaning-preserving operation.
00:17:35.240 | It's just a filled pause,
00:17:36.720 | but again, it has the effect of shifting around
00:17:39.360 | the variable names and associated positions.
00:17:42.480 | To further cover the space,
00:17:44.400 | we also do this thing with participles.
00:17:46.240 | Instead of having just standard prepositional modifiers,
00:17:49.340 | we also do things like a leaf painting the spaceship,
00:17:52.680 | again, to fill in the gaps in these variable names and positions.
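As a minimal sketch of one of these manipulations, here is the filled-pause trick, again assuming COGS-style positional "x _ N" variables: inserting a token at position p shifts every later token, so every variable index at or above p goes up by one:

```python
import random
import re

def insert_um(sentence, lf, seed=None):
    """Insert 'um' at a random point and shift the LF's positional
    variables to match (sketch of the meaning-preserving interjection
    augmentation described above)."""
    rng = random.Random(seed)
    tokens = sentence.split()
    p = rng.randint(0, len(tokens))  # insertion point
    tokens.insert(p, "um")
    lf = re.sub(r"x _ (\d+)",
                lambda m: f"x _ {int(m.group(1)) + 1}"
                          if int(m.group(1)) >= p else m.group(0),
                lf)
    return " ".join(tokens), lf

print(insert_um("a vase broke",
                "vase ( x _ 1 ) AND break . theme ( x _ 2 , x _ 1 )",
                seed=0))
```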
00:17:57.200 | The result is a very large performance increase for
00:18:01.320 | both LSTMs and transformers on these splits.
00:18:05.000 | Again, this suggests that our hypothesis is on the right track and that, functionally,
00:18:09.940 | the blocker here was that models had been taught that
00:18:13.000 | PPs associate with certain variable names and certain positions,
00:18:17.480 | which is not precisely what we were hoping to test for.
00:18:22.120 | On the basis of these insights,
00:18:25.560 | we perform some modifications to COGS to get recogs.
00:18:29.140 | This is a high-level summary of what we did.
00:18:31.380 | Imagine an input sentence,
00:18:32.620 | Mia ate a cake.
00:18:34.020 | The COGS LF looks like this.
00:18:36.420 | We did that redundant token removal that I described before,
00:18:39.860 | mostly focused on the variable names.
00:18:42.280 | We did some meaning-preserving data augmentation of the sort I
00:18:45.820 | described for the CP and PP recursion cases.
00:18:49.180 | Then we introduced this notion of arbitrary variable naming.
00:18:53.020 | ReCOGS examples do not have
00:18:55.380 | variable names that are tied to the position in the input string.
00:18:59.220 | Rather, they're randomly assigned in a semantically consistent way.
00:19:03.340 | We have many more examples as a result in an effort to teach
00:19:07.140 | models to abstract away from the precise names of variables.
00:19:11.940 | For our given example here,
00:19:13.860 | we end up with a ReCOGS logical form
00:19:16.340 | that looks like the one at the bottom of this slide.
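Here is a minimal sketch of that arbitrary-but-consistent renaming, assuming LFs whose variables are standalone integers after the redundant-token removal:

```python
import random
import re

def rename_variables(lf, max_name=100, seed=None):
    """Randomly but consistently rename the integer variables in an LF,
    so that models must abstract away from the particular names (sketch
    of ReCOGS's arbitrary variable naming)."""
    rng = random.Random(seed)
    old = sorted(set(re.findall(r"\b\d+\b", lf)), key=int)
    mapping = dict(zip(old, map(str, rng.sample(range(max_name), len(old)))))
    return re.sub(r"\b\d+\b", lambda m: mapping[m.group(0)], lf)

print(rename_variables("rose ( 1 ) ; help ( 7 ) ; theme ( 7 , 1 ) ; agent ( 7 , 6 ) ; dog ( 6 )", seed=0))
```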
00:19:19.420 | The overall effect on performance is summarized in the diagram on the right here.
00:19:24.780 | For COGS LFs, we have this really challenging aspect
00:19:28.020 | that structural generalization tasks have dismal performance.
00:19:31.940 | The redundant token removal doesn't really affect that,
00:19:35.140 | as you can see in these two low red bars.
00:19:37.840 | But meaning-preserving data augmentation and
00:19:40.580 | arbitrary variable renaming do dramatically improve performance on
00:19:45.020 | those structural generalization splits with the net effect of
00:19:48.500 | evening out performance across these two aspects of the COGS/ReCOGS problem.
00:19:54.980 | Here's a summary of the results.
00:19:56.980 | We have LSTMs on the top,
00:19:59.840 | transformers on the bottom.
00:20:01.580 | My high-level takeaway here is that ReCOGS is not necessarily an easier task.
00:20:08.260 | In fact, some aspects of it actually
00:20:10.780 | look harder, according to our experiments, than COGS.
00:20:14.120 | But what we have done is even out
00:20:16.480 | performance and show that it's possible to get traction on
00:20:20.260 | those totally recalcitrant structural generalization splits that led to
00:20:25.140 | that really strange situation for
00:20:27.460 | the literature before with those columns of all zeros.
00:20:31.200 | Overall, we think that this is a healthier benchmark to hill climb on.
00:20:36.180 | That seems good as well, because ReCOGS is just getting us closer
00:20:40.660 | to testing the semantic phenomena that we care about.
00:20:44.980 | To wrap up the screencast,
00:20:47.040 | I thought I would pose some conceptual questions that still linger for
00:20:50.240 | me, having done this deep dive on COGS and ReCOGS.
00:20:54.200 | First, how can we test for meaning if we're predicting logical forms?
00:20:58.720 | Logical forms are just more syntactic expressions.
00:21:02.160 | We want to get at the true meaning,
00:21:04.140 | but we have to do that through logical forms, and
00:21:06.760 | the logical forms always have some arbitrariness about them.
00:21:10.260 | That's always going to get in the way of us really purely
00:21:13.080 | seeing whether models understand the meanings of these input sentences.
00:21:17.900 | What is a fair generalization test in the current context?
00:21:22.280 | A lot of our insights about compositionality start to border on things that might
00:21:26.700 | look like they are unfair in the sense of my earlier screencasts in this unit,
00:21:31.740 | because we are deliberately holding out from the training experiences
00:21:36.100 | some examples that we expect the model to grapple with at test time.
00:21:41.580 | For example, models are shown a world that manifests
00:21:45.760 | a specific restriction like PPs appear only in object position,
00:21:49.920 | and then asked to grapple with the world in which
00:21:52.220 | they also appear in subject positions.
00:21:54.760 | The hard part about this is that in some of
00:21:57.440 | these cases where we restrict the training experiences,
00:22:01.120 | we want models not to learn those restrictions, as in the COGS splits,
00:22:07.900 | whereas in other cases we do want them to learn those restrictions.
00:22:11.880 | Actually, just conceptually figuring out, for different phenomena,
00:22:15.280 | which category they fall into seems extremely difficult to me.
00:22:20.280 | What are the limits of compositionality for
00:22:23.280 | humans, and how should that inform our generalization tests?
00:22:26.340 | Maybe that's the ultimate ground truth here.
00:22:29.140 | We have an assumption that natural languages are compositional,
00:22:32.660 | but that's a very strong assumption.
00:22:34.580 | It surely has some limitations.
00:22:36.980 | Maybe the thing to do is figure out where and how humans generalize,
00:22:40.980 | and where they don't, and then just set up
00:22:43.020 | a general expectation that our models will be able to follow suit.
00:22:47.420 | But that poses many conceptual questions.
00:22:49.800 | For example, if we have goals that are not supported by our datasets,
00:22:54.980 | but that seem like good goals for models to reach,
00:22:57.860 | how should we express that in our tasks and in our models?
00:23:01.020 | We expect these issues to be embodied in the data, in the examples.
00:23:05.640 | But here we have goals that seem to reach beyond what we can encode in examples,
00:23:10.580 | especially given the variation that we see in 2B and 2C.
00:23:15.260 | I don't have answers to these questions,
00:23:17.260 | but I find it exciting that we can pose them and that we have models that might
00:23:21.500 | plausibly be on their way to achieving something
00:23:24.380 | like the compositional generalization that we are actually seeking.