Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 3: Compositionality | Spring 2023
Our focus is the principle of compositionality. This is a principle that is important to me as a linguistic semanticist, and it's also arguably a prerequisite for understanding the goals of a lot of the behavioral testing we do in this field. It says: "The meaning of a phrase is a function of the meanings of its immediate syntactic constituents and the way they are combined." That's the principle. Let's unpack it by way of an example.
Here we have a full sentence, "Every student admired the idea." The compositionality principle says that the meaning of the whole sentence is determined by the meaning of its two constituent parts: the subject noun phrase "every student" and the verb phrase "admired the idea."
You can see that this implies a recursive process. What determines the meaning of the subject noun phrase? Well, that is fully determined by the meanings of this Det (determiner) node and this N (noun) node. Those are lexical items, and that's where this recursive process grounds out.
Once you have learned all the lexical items of the languages that you speak and figured out how they combine with each other, you have a recursive process that allows you to combine them and understand novel combinations of these elements. The meaning of any phrase is determined by the meaning of the parts and how they are combined.
We could also think about this in a bottom-up fashion. The meanings of the lexical items determine the meanings of these parent nodes here, and those in turn determine the meanings of the complex nodes above them, and so forth, until we have derived, bottom-up, a meaning for the entire sentence.
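To make the recursion concrete, here is a minimal sketch in Python of bottom-up composition over a toy parse tree. The lexicon, the bracketing, and the tuple-building combination rule are all illustrative assumptions on my part, not the lecture's; real compositional semantics would use richer meaning representations, such as typed lambda terms.

```python
# A minimal sketch of bottom-up composition over a syntax tree.
# Lexicon, tree, and combination rule are all illustrative assumptions.

# Hypothetical lexicon mapping words to toy meaning representations:
LEXICON = {
    "every": "EVERY",
    "student": "student",
    "admired": "admire",
    "the": "THE",
    "idea": "idea",
}

def interpret(node):
    """Return a meaning for `node`, a word (str) or a tuple of subtrees.

    The meaning of a phrase is computed only from the meanings of its
    immediate constituents -- the compositionality principle."""
    if isinstance(node, str):        # lexical item: the recursion grounds out
        return LEXICON[node]
    child_meanings = [interpret(child) for child in node]
    return tuple(child_meanings)     # toy "combination" rule

# "Every student admired the idea", bracketed as
# [[every student] [admired [the idea]]]:
tree = (("every", "student"), ("admired", ("the", "idea")))
print(interpret(tree))
# (('EVERY', 'student'), ('admire', ('THE', 'idea')))
```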
Why do we adhere to this principle? Well, this can be a little bit hard to reconstruct, but I would say that the usual motivations are as follows. If the principle holds, we would model all the meaningful units of the language, and that would imply that we have gone all the way down to the individual lexical items. In practice, I should point out, that means there's a lot of abstraction in linguistic semantics, since we often model the meaning of a word in isolation from the things that it combines with.
Returning to our example: the meaning of the subject noun phrase, combined with the meaning of this verb phrase, finally gives us a meaning for this S node up here, and it's something like universal quantification: for everything that is a student, it has the property of admiring the idea. That would be the fundamental claim of the sentence, and you can see there that that claim was driven by the meanings of the parts and the way they were combined.
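As a rough logical form (my notation, not necessarily the lecture's slide; I treat "the idea" as a constant, glossing over the definite determiner), the claim is something like:

```latex
\forall x \, \big( \textit{student}(x) \rightarrow \textit{admire}(x, \textit{the\_idea}) \big)
```

read as: for every x, if x is a student, then x admires the idea.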
Another common motivation is the apparent infinitude of natural language. I grant that there is some sense in which this is true, because there seems to be no principled bound on the length of the sentences we can produce and understand. But I'm sad to report that we are all finite beings, and therefore there is only a finite capacity in all of us. We have encountered only finitely many sentences, and yet, nonetheless, we are able to instantly and effortlessly interpret entirely novel ones. That does imply that there is some capacity in us for recombining familiar parts, and compositionality could be seen as an explanation for that.
A closely related notion is systematicity, which I think is a slightly more general notion than compositionality and may be a more correct characterization of what we care about. A lot of work in this area travels under the heading of compositionality or systematicity. The classic definition of systematicity is that "the ability to produce or understand certain sentences is intrinsically connected to the ability to produce or understand certain other ones."
The idea is that if you understand the sentence "the puppy loves the turtle," then you effortlessly understand "the turtle loves the puppy," "the turtle loves Sandy," and so forth and so on. You get this instant explosion in the number of things that you can produce and understand, and that comes from your own understanding of language being so systematic.
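Here is a tiny illustration (mine, not the lecture's) of that explosion: recombining a handful of familiar parts yields many sentences, all of which a systematic understander should handle.

```python
# The systematicity "explosion": a few building blocks, many sentences.
# The vocabulary here is illustrative.
from itertools import product

subjects = ["the puppy", "the turtle", "Sandy"]
verbs = ["loves", "admires"]
objects = ["the puppy", "the turtle", "Sandy"]

sentences = [f"{s} {v} {o}" for s, v, o in product(subjects, verbs, objects)]
print(len(sentences))    # 18 sentences from 5 building blocks
print(sentences[:3])     # ['the puppy loves the puppy', ...]
```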
Compositionality could be one explanation for systematicity, but I think systematicity is arguably more general. There is no mention of syntactic structure in the characterization here, so it might allow for things that are systematic without being compositional in the strict sense.
Systematicity is a powerful idea for thinking about how we evaluate our models, especially the hypothesis-driven challenge tests that we run, because very often, when we express concerns about systems, the underlying worry is a failure of systematicity. Here is an example. This is from a real sentiment classification model that I developed, that I thought was pretty good, and I started posing little challenge problems to it.
I was initially very encouraged by these examples. An example like this makes a generally positive claim about this bakery's pies, and it involves this very unusual sense of "mean," on which a mean pie is an excellent pie. The model predicted positive, and I started to think that my model truly understood this very specialized sense of the adjective "mean."
But that fell apart with the next two examples. There the model predicted negative, whereas the gold label is, of course, still positive. I have no expectation that changing the subject to a pronoun, as opposed to a full noun phrase like "the bakery," should have any effect on the interpretation of the adjective "mean" in these cases. But the model's predictions changed, and that manifests for me as a lack of systematicity.
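Here is a sketch of the kind of challenge test being described, under stated assumptions: the `predict_sentiment` function is a hypothetical stand-in for a trained classifier, and the minimal pair is illustrative data I made up in the spirit of the bakery examples.

```python
# A minimal-pair challenge test. `predict_sentiment` is a hypothetical
# stand-in for a trained sentiment classifier; the pair below is
# illustrative data, not the lecture's exact examples.

def predict_sentiment(text: str) -> str:
    # Toy dummy model that mishandles pronoun subjects, mimicking the
    # non-systematic failure described above. Replace with a real model.
    return "negative" if text.startswith("They") else "positive"

minimal_pairs = [
    # (original, subject swapped for a pronoun, gold label for both)
    ("The bakery sells mean pies.", "They sell mean pies.", "positive"),
]

for original, variant, gold in minimal_pairs:
    pred_orig = predict_sentiment(original)
    pred_var = predict_sentiment(variant)
    if pred_orig != pred_var:
        print(f"Non-systematic: {original!r} -> {pred_orig}, "
              f"{variant!r} -> {pred_var} (gold: {gold})")
```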
I think that is typical of how challenge tests arise: researchers have a hypothesis grounded in the systematicity of language, they observe departures from that in their models, and they begin to worry about those models.
Let's put this in a bit of historical context. In earlier phases of the field, we got compositionality by design, because those systems were built on grammars that themselves adhere to the compositionality principle. In semantic parsing systems like this one, depicted from Percy Liang's work, underlyingly there was a compositional grammar, even with the flexibility associated with them being probabilistic models. In the earlier deep learning era, we again saw systems that were arguably compositional. Here is an example: it's a recursive, tree-structured neural network. It abides by the compositionality principle in the sense that the representation for each node is determined entirely by the representations of its child nodes.
There was a complicated deep learning function that combined the child representations, which is not the way of much work in linguistic semantics, but the result is still recognizably compositional.
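Here is a minimal sketch of that composition step, assuming the classic formulation where a parent vector is a learned function of its two child vectors; the dimensions, weights, and example tree are all illustrative, not the lecture's exact model.

```python
# Composition step of a recursive tree-structured neural network:
# each parent vector depends only on its two child vectors, which is
# what makes the model compositional by design.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # toy embedding dimension
W = rng.normal(size=(d, 2 * d))      # composition weights (learned in practice)
b = np.zeros(d)

def compose(h_left: np.ndarray, h_right: np.ndarray) -> np.ndarray:
    """Parent representation from the two child representations."""
    return np.tanh(W @ np.concatenate([h_left, h_right]) + b)

# Bottom-up pass over [[every student] [admired [the idea]]]:
emb = {w: rng.normal(size=d) for w in ["every", "student", "admired", "the", "idea"]}
np_subj = compose(emb["every"], emb["student"])
np_obj = compose(emb["the"], emb["idea"])
vp = compose(emb["admired"], np_obj)
sentence = compose(np_subj, vp)
print(sentence.shape)                # (4,)
```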
Contrast that with these huge, typically Transformer-based models, where everything is connected to everything else.
The worry was that they were learning non-systematic solutions, and that motivated a lot of challenge testing for them.
What we have seen since then, though, is often an amazing discovery about the power of even these models to deliver systematic solutions to language problems, and that may be part of how they solve the hard behavioral tasks that we pose for them.