Stanford XCS224U: NLU | Behavioral Evaluation of NLU Models, Part 3: Compositionality | Spring 2023


00:00:00.000 | Welcome back everyone.
00:00:06.120 | This is part three in our series on
00:00:08.040 | advanced behavioral testing for NLU.
00:00:10.440 | Our focus is the principle of compositionality.
00:00:13.040 | This is a principle that is important to me as
00:00:15.260 | a linguistic semanticist and it's also arguably
00:00:18.480 | a prerequisite for understanding the goals of
00:00:21.480 | COGS and ReCOGS, which are
00:00:23.120 | compositional generalization benchmarks.
00:00:26.260 | Let's start with an informal statement of
00:00:28.800 | the compositionality principle.
00:00:30.540 | It says, "The meaning of a phrase is a function of
00:00:33.880 | the meanings of its immediate syntactic constituents
00:00:36.720 | and the way they are combined."
00:00:38.800 | That's the principle. Let's unpack it by way of an example.
00:00:42.000 | I have a simple syntactic structure here,
00:00:44.800 | a full sentence, "Every student admired the idea."
00:00:48.040 | The compositionality principle says that the meaning of
00:00:50.960 | this S node for sentence here is fully
00:00:53.880 | determined by the meaning of its two constituent parts,
00:00:56.880 | NP and VP.
00:00:59.040 | You can see that this implies a recursive process.
00:01:01.760 | What is the meaning for the NP?
00:01:03.080 | Well, that is fully determined by the meanings of
00:01:05.780 | this Det (for determiner) node and this N (for noun) node.
00:01:10.480 | The meanings of those are easy to see.
00:01:13.240 | Those are fully determined by their parts,
00:01:15.080 | of which there is just one part for each,
00:01:16.880 | and those are lexical items.
00:01:18.480 | That's where this recursive process grounds out.
00:01:21.340 | The intuition is that you just have to learn
00:01:24.340 | all the lexical items of the languages that you speak.
00:01:27.400 | Having done that and having
00:01:28.940 | figured out how they combine with each other,
00:01:30.920 | you have a recursive process that allows you to combine
00:01:33.800 | things in new ways and
00:01:35.200 | understand novel combinations of these elements.
00:01:37.860 | Compositionality is saying that you have
00:01:39.620 | a guarantee there because the meaning of
00:01:41.940 | the whole will be a function of
00:01:43.980 | the meaning of the parts and how they are combined.
00:01:46.420 | We could also think about this in a bottom-up fashion.
00:01:49.600 | We start with those lexical items,
00:01:51.480 | their meanings are stipulated or
00:01:53.280 | learned and memorized.
00:01:55.640 | Then those meanings in turn
00:01:57.600 | determine the meanings of these parent nodes here,
00:02:00.980 | which in turn determine the meanings of
00:02:03.240 | the complex nodes above them and so forth until we have
00:02:06.320 | derived bottom-up a meaning for the entire sentence.
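Here is a minimal sketch of that bottom-up derivation in Python. The entity names, the toy lexicon, and the use of function application as the only mode of combination are all invented for illustration; real compositional grammars are far richer.

```python
# A toy, bottom-up composition for "Every student admired the idea."
# Entities, lexical meanings, and the combination rule are invented for illustration.

ENTITIES = {"s1", "s2", "idea1"}

# Lexical meanings are stipulated (learned and memorized) at the leaves.
LEXICON = {
    "student": lambda x: x in {"s1", "s2"},
    "idea":    lambda x: x == "idea1",
    "the":     lambda noun: next(e for e in ENTITIES if noun(e)),
    "every":   lambda noun: lambda vp: all(vp(e) for e in ENTITIES if noun(e)),
    "admired": lambda obj: lambda subj: (subj, obj) in {("s1", "idea1"), ("s2", "idea1")},
}

def interpret(tree):
    """The meaning of a node is a function of the meanings of its immediate parts."""
    if isinstance(tree, str):               # a lexical item: the recursion grounds out
        return LEXICON[tree]
    fn, arg = (interpret(t) for t in tree)  # meanings of the two constituents
    return fn(arg)                          # the mode of combination: function application

# (S (NP (Det every) (N student)) (VP (V admired) (NP (Det the) (N idea))))
tree = (("every", "student"), ("admired", ("the", "idea")))
print(interpret(tree))   # True: every student admired the idea
```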
00:02:10.640 | Why do linguists tend
00:02:13.320 | to adhere to the compositionality principle?
00:02:15.560 | Well, this can be a little bit hard to reconstruct,
00:02:18.200 | but I would say that the usual motivations are as follows.
00:02:21.600 | First, we might just hope that
00:02:23.920 | as semantics is trying to study language,
00:02:26.480 | we would model all the meaningful units of the language,
00:02:29.600 | and that would imply that we have gone all the way
00:02:31.480 | down to even the most incidental-looking
00:02:33.960 | lexical items and given them meanings in
00:02:36.400 | isolation, as good lexicographers
00:02:38.800 | might feel they should do.
00:02:40.460 | In practice, I should point out that this means there's
00:02:43.260 | a lot of abstraction in linguistic semantics,
00:02:45.960 | because it is just hard,
00:02:47.800 | perhaps impossible, to give
00:02:49.640 | a meaning for a word like "every" in
00:02:51.520 | isolation from the things that it combines with.
00:02:53.920 | What happens in practice actually
00:02:56.080 | is that the meanings assigned are functions.
00:02:58.560 | What we're saying here is that "every" has
00:03:00.360 | a functional meaning that, when
00:03:02.160 | combined with the meaning for "student",
00:03:04.440 | delivers another function that, when
00:03:06.840 | combined with the meaning of this verb phrase,
00:03:09.000 | finally gives us a meaning for this S node up here,
00:03:11.720 | and it's something like universal quantification where,
00:03:14.780 | in this case, if something is a student,
00:03:16.680 | then it has the property of admiring the idea.
00:03:19.320 | That would be the fundamental claim of the sentence,
00:03:21.600 | and you can see there that that claim was driven by
00:03:24.720 | every down there in
00:03:26.100 | this determiner position inside the subject.
00:03:28.960 | A great deal of abstraction,
00:03:30.600 | but that is a technique for
00:03:32.160 | giving meanings to all the meaningful units,
00:03:34.220 | which should be a consequence
00:03:35.880 | of adhering to compositionality.
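In the standard notation from linguistic semantics, that functional meaning for "every" is written as a generalized quantifier. This is the textbook treatment, sketched here rather than anything specific to the slide:

```latex
% "every" consumes the noun meaning P, then the VP meaning Q, returning a truth value:
[\![\text{every}]\!] = \lambda P.\,\lambda Q.\,\forall x\,[P(x) \rightarrow Q(x)]

% Applied to "student" and "admired the idea", it yields the claim described above:
[\![\text{every student admired the idea}]\!]
  = \forall x\,[\mathit{student}(x) \rightarrow \mathit{admired}(x, \mathit{the\_idea})]
```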
00:03:38.220 | You often hear linguists talk about
00:03:40.900 | the supposed infinite capacity
00:03:42.720 | that humans have for dealing with language.
00:03:44.900 | I grant that there is some sense in which this is true
00:03:47.520 | because there seems to be no principled bound on
00:03:50.360 | the complexity or length of
00:03:52.000 | the sentences that we can understand
00:03:53.640 | in an abstract way.
00:03:55.440 | But this needs to be heavily qualified.
00:03:58.020 | I'm sad to report that we are all finite beings,
00:04:01.040 | and therefore there is only a finite capacity in all of
00:04:03.720 | us to make and understand language.
00:04:08.280 | This is a little bit overblown,
00:04:09.920 | but I see what this is getting at,
00:04:11.040 | and I think the fundamental intuition is
00:04:12.680 | something more like creativity.
00:04:14.780 | We have an impressive ability to be
00:04:17.000 | creative with language.
00:04:18.400 | By and large, the sentences that you
00:04:21.000 | produced today and the sentences that you
00:04:23.040 | interpreted today had never been
00:04:25.200 | encountered before in all of human history.
00:04:27.540 | Most sentences are like that,
00:04:29.320 | and yet nonetheless, we are able to instantly and
00:04:31.940 | effortlessly produce these sentences
00:04:34.240 | and understand what they mean.
00:04:35.880 | That does imply that there is some capacity in
00:04:39.680 | us for making use of a finite resource,
00:04:42.800 | say the lexical items,
00:04:44.040 | combining them in new ways in order to be
00:04:46.680 | creative with language and
00:04:48.160 | compositionality could be seen as an explanation for that.
00:04:51.760 | There's also a related idea from
00:04:54.080 | cognitive science called systematicity,
00:04:56.040 | which I think is a slightly more general notion than
00:04:58.940 | compositionality and may be a more correct characterization.
00:05:03.280 | Let's dive into that a little bit under
00:05:05.760 | the heading of compositionality or systematicity.
00:05:09.200 | The systematicity idea traces,
00:05:11.400 | as far as I know, to Fodor and Pylyshyn.
00:05:13.840 | They say, "What we mean when we say
00:05:16.300 | that linguistic capacities are systematic
00:05:18.940 | is that the ability to produce or understand
00:05:21.680 | some sentences is intrinsically connected to
00:05:24.700 | the ability to produce or understand certain other ones."
00:05:28.360 | The idea is that if you understand the sentence
00:05:31.000 | "Sandy loves the puppy,"
00:05:32.620 | then just by that very fact,
00:05:34.920 | you also understand "The puppy loves Sandy."
00:05:37.600 | If you recognize that there is
00:05:39.520 | a certain distributional affinity
00:05:41.440 | between "the turtle" and "the puppy,"
00:05:43.520 | you can also instantly and
00:05:45.680 | effortlessly understand "The turtle loves the puppy,"
00:05:48.280 | "The puppy loves the turtle,"
00:05:49.800 | "The turtle loves Sandy," and so forth and so on.
00:05:52.240 | You get this instant explosion in the number of things that
00:05:55.760 | you know in some sense as a consequence of
00:05:58.680 | your own understanding of language being so systematic.
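As a tiny illustration of that explosion, here is a sketch that recombines a handful of the items just mentioned; the vocabulary list and the subject/object restriction are choices made only for this example.

```python
from itertools import product

# If you understand these pieces and how they combine,
# systematicity says you thereby understand every recombination.
noun_phrases = ["Sandy", "the puppy", "the turtle"]
verbs = ["loves"]

sentences = [f"{subj} {verb} {obj}"
             for subj, verb, obj in product(noun_phrases, verbs, noun_phrases)
             if subj != obj]   # skip self-directed sentences for this illustration

print(len(sentences))   # 6 sentences from 3 noun phrases and 1 verb
for s in sentences:
    print(s)             # e.g., "the puppy loves the turtle"
```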
00:06:02.320 | I do think that compositionality could be
00:06:05.240 | a particular way of explaining what we
00:06:07.600 | observe about the systematicity
00:06:09.360 | of the human capacity for language.
00:06:11.360 | But I think systematicity is arguably more general.
00:06:14.200 | You can see that it's given a distributional
00:06:16.440 | characterization here that might allow for things that are
00:06:19.340 | not strictly compositional but nonetheless,
00:06:22.140 | importantly, systematic.
00:06:24.680 | Systematicity is a powerful idea for thinking about
00:06:28.460 | the intuition behind many of
00:06:29.920 | the behavioral tests that we run,
00:06:31.420 | especially the hypothesis-driven challenge tests.
00:06:35.380 | Because very often when we express concerns about systems,
00:06:38.960 | they are concerns that are grounded in
00:06:40.880 | a certain lack of systematicity.
00:06:43.080 | Here's a brief example to illustrate this.
00:06:45.200 | This is from a real sentiment classification model
00:06:48.440 | that I developed that I thought was pretty good,
00:06:50.940 | and I started posing little challenge problems to it.
00:06:54.160 | I was initially very encouraged by these examples.
00:06:58.040 | "The bakery sells a mean apple pie" is
00:07:01.800 | generally a positive claim about this bakery's pies,
00:07:05.400 | and it involves this very unusual sense of "mean,"
00:07:08.640 | which essentially means good.
00:07:10.440 | A mean apple pie is typically a good one.
00:07:12.920 | I was encouraged that the gold label
00:07:15.440 | and the predicted label aligned here.
00:07:17.920 | Similarly, for "They sell a mean apple pie,"
00:07:21.200 | I was happy to see this alignment,
00:07:23.160 | and I started to think that my model truly understood
00:07:26.180 | this very specialized sense of the adjective "mean."
00:07:30.140 | But that fell apart with the next two examples.
00:07:33.160 | "She sells a mean apple pie,"
00:07:34.880 | "He sells a mean apple pie,"
00:07:36.360 | both of those were predicted negative,
00:07:38.340 | whereas the gold label is of course still positive.
00:07:41.120 | The errors are worrisome,
00:07:42.720 | but the deeper thing that I was worried
00:07:44.400 | about is the lack of systematicity,
00:07:46.340 | because as a human,
00:07:48.000 | I have no expectation that changing the subject from
00:07:51.260 | a plural pronoun to a singular one or using
00:07:54.280 | a pronoun as opposed to a full noun phrase like
00:07:56.560 | the bakery would have any effect on
00:07:58.800 | the interpretation of the adjective "mean" in these cases.
00:08:01.900 | Yet nonetheless, the model's predictions
00:08:04.320 | changed and that manifests for me as a lack of systematicity.
00:08:08.720 | That's a guiding intuition behind many of
00:08:11.380 | the adversarial or challenge
00:08:12.880 | datasets that people have posed.
00:08:14.360 | They have a hypothesis grounded in the systematicity of
00:08:17.400 | language and they observe departures from that in
00:08:20.160 | their models and they begin to worry about those models.
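Here is a minimal sketch of that kind of hypothesis-driven, minimal-pair challenge test. The `predict_sentiment` argument is a hypothetical stand-in for whatever model is under examination, not an API from any particular library.

```python
# Minimal-pair challenge test for the "mean apple pie" construction.
# Hypothesis: changing only the subject should not change the
# interpretation of "mean", so every prediction should stay positive.

def run_challenge(predict_sentiment):
    examples = [
        ("The bakery sells a mean apple pie.", "positive"),
        ("They sell a mean apple pie.",        "positive"),
        ("She sells a mean apple pie.",        "positive"),
        ("He sells a mean apple pie.",         "positive"),
    ]
    failures = []
    for sentence, gold in examples:
        pred = predict_sentiment(sentence)   # any string -> label interface works
        if pred != gold:
            failures.append((sentence, gold, pred))
    return failures

# failures = run_challenge(my_model)
# A non-empty list signals the kind of unsystematic behavior described above.
```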
00:08:24.560 | It's interesting to reflect on
00:08:26.720 | the compositionality principle in
00:08:28.240 | the context of the history of AI models.
00:08:31.200 | In the earliest eras of AI like
00:08:33.860 | the SHRDLU system or the Chat-80 system
00:08:36.560 | that we saw on the first day,
00:08:38.140 | we got compositionality by design because those were
00:08:41.960 | implemented grammars, symbolic grammars that
00:08:44.640 | themselves adhere to the compositionality principle.
00:08:47.520 | We didn't wonder about whether
00:08:49.720 | these NLU models were compositional
00:08:51.600 | because we presupposed that they would be.
00:08:55.040 | Parts of that actually did carry forward
00:08:58.120 | into the more modern machine learning era.
00:09:00.800 | For example, many semantic parsing systems,
00:09:03.440 | like this one depicted from Percy Liang's work,
00:09:06.360 | were also compositional in the sense that
00:09:08.620 | underlyingly there was a compositional grammar
00:09:11.240 | and the task was to learn weights
00:09:13.140 | on the rules of that grammar.
00:09:14.740 | Arguably, the resulting artifacts were
00:09:17.420 | compositional with some stochasticity
00:09:19.500 | associated with them being probabilistic models.
00:09:23.080 | Even in the more modern deep learning era,
00:09:26.180 | we again saw systems that were arguably compositional.
00:09:29.540 | This is from the paper that launched
00:09:31.140 | the Stanford Sentiment Treebank.
00:09:32.640 | It's a recursive tree-structured neural network.
00:09:36.940 | It abides by the compositionality principle in
00:09:40.120 | the sense that all the nodes depicted
00:09:41.900 | in these structures denote vectors.
00:09:44.180 | There was a complicated deep learning function that
00:09:46.560 | combined those vectors to derive
00:09:48.480 | the meaning for their parent nodes.
00:09:50.280 | It did that recursively until we
00:09:52.200 | got a meaning for the entire sentence.
00:09:54.600 | It's not symbolic in the way of
00:09:56.800 | these older systems and in
00:09:58.400 | the way of much work in linguistic semantics,
00:10:00.720 | but it is arguably a compositional system,
00:10:03.660 | an intriguing property of
00:10:05.580 | those deep learning artifacts in fact.
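Here is a minimal sketch of that recursive composition in Python/NumPy. The dimensions, random word vectors, and single shared weight matrix are invented for illustration; the actual Sentiment Treebank models used richer, tensor-based composition functions, so this only shows the recursive shape of the computation.

```python
import numpy as np

DIM = 4
rng = np.random.default_rng(0)

# Toy word vectors and one shared composition matrix (untrained, for illustration only).
word_vecs = {w: rng.normal(size=DIM) for w in
             ["every", "student", "admired", "the", "idea"]}
W = rng.normal(size=(DIM, 2 * DIM))

def compose(tree):
    """Every node denotes a vector; a parent is a learned function of its children."""
    if isinstance(tree, str):                           # leaf: look up the word vector
        return word_vecs[tree]
    left, right = (compose(t) for t in tree)            # vectors for the two children
    return np.tanh(W @ np.concatenate([left, right]))   # parent vector

sentence = (("every", "student"), ("admired", ("the", "idea")))
root_vec = compose(sentence)   # vector meaning for the entire sentence
# A classifier head on root_vec would then predict, e.g., a sentiment label.
```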
00:10:08.240 | But we have, it seems,
00:10:10.640 | moved away from that perspective.
00:10:12.720 | Now we're confronted all the time with
00:10:15.100 | these huge typically transformer-based models
00:10:17.800 | where everything is connected to everything else.
00:10:20.360 | It is certainly clear that we have
00:10:22.180 | no guarantees of compositionality
00:10:24.520 | or systematicity built into these networks.
00:10:27.360 | In fact, in the earliest days,
00:10:29.180 | people often worried that
00:10:30.860 | even though they were performing well,
00:10:32.840 | they were learning non-systematic solutions and that
00:10:35.800 | motivated a lot of challenge testing for them.
00:10:39.120 | The question now is,
00:10:40.840 | can we pose behavioral tests that will truly
00:10:43.520 | assess whether models like this,
00:10:45.840 | which are hard to understand analytically,
00:10:48.000 | have found systematic solutions
00:10:50.960 | to the language problems that we pose?
00:10:53.000 | If the answer is no, we should worry.
00:10:55.360 | If the answer is yes,
00:10:56.880 | it's an amazing discovery about why
00:10:59.020 | these models perform so well and also
00:11:01.360 | often an amazing discovery about the power of
00:11:04.200 | data-driven learning alone to
00:11:06.320 | deliver systematic solutions to language problems.
00:11:10.120 | It's an open question but
00:11:11.840 | a tremendously exciting set of
00:11:14.080 | questions to be exploring now as
00:11:16.200 | our models are getting so good even at
00:11:18.480 | the hard behavioral tasks that we pose for them.