Stanford XCS224U: NLU | NLP Methods and Metrics, Part 3: Generation Metrics | Spring 2023

00:00:00.000 | Welcome back everyone.
00:00:06.040 | This is part 3 in our series on methods and metrics.
00:00:08.820 | We're going to talk about generation metrics.
00:00:11.380 | In the previous screencast,
00:00:12.920 | we talked about classifier metrics.
00:00:14.700 | Those seem conceptually straightforward at first,
00:00:17.740 | but turn out to harbor lots of intricacies.
00:00:21.080 | That goes double at least for generation.
00:00:23.820 | Generation is incredibly conceptually challenging.
00:00:27.080 | I would say the fundamental issue here is that
00:00:29.760 | there is more than one effective way to say most things.
00:00:33.680 | That immediately raises the question of
00:00:36.160 | what we are even trying to measure.
00:00:38.220 | Is it fluency?
00:00:39.860 | Is it truthfulness?
00:00:41.400 | Communicative effectiveness? Maybe something else?
00:00:44.200 | These are all interestingly different questions.
00:00:46.900 | After all, you could have a system that was
00:00:48.940 | highly fluent but spewing falsehoods,
00:00:51.880 | or even a system that was highly disfluent,
00:00:54.080 | but achieving its goals in communication.
00:00:56.540 | Those examples show that you really need to have
00:00:59.240 | clarity on the high-level goals before you can even
00:01:02.160 | think about which metrics to choose for generation.
00:01:06.760 | Let's begin the discussion with perplexity.
00:01:09.840 | That's a natural starting point.
00:01:11.440 | It's an analog of accuracy,
00:01:14.040 | but in the generation space.
00:01:16.160 | For some sequence X and
00:01:18.280 | some probability distribution or model
00:01:20.320 | that can assign probability distributions,
00:01:22.800 | the perplexity for that sequence according to that model is
00:01:26.360 | really just the geometric mean of the inverse probabilities
00:01:29.520 | assigned to the individual time steps.
00:01:32.180 | Then when we average over an entire corpus for the mean perplexity,
00:01:36.560 | we just do the geometric mean of
00:01:38.720 | the individual perplexity scores per sequence.
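As a minimal sketch of the definition just given, assuming we already have the probability the model assigned to each token (the function names and toy probabilities here are illustrative, not from the lecture):

```python
import math

def perplexity(token_probs):
    """Perplexity of one sequence: geometric mean of the inverse token probabilities."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

def mean_perplexity(corpus_probs):
    """Corpus-level score: geometric mean of the per-sequence perplexities."""
    logs = [math.log(perplexity(probs)) for probs in corpus_probs]
    return math.exp(sum(logs) / len(logs))

# Toy probabilities (made up for illustration).
uniform = [0.25, 0.25, 0.25, 0.25]   # model is guessing among 4 options
confident = [1.0, 1.0, 1.0]          # model is certain and correct

print(perplexity(uniform))                    # approximately 4.0
print(perplexity(confident))                  # 1.0, the best possible value
print(mean_perplexity([uniform, confident]))  # approximately 2.0
```

Working in log space avoids underflow when sequences are long, which is why the product is computed as a sum of logs.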
00:01:42.320 | Properties of perplexity, well,
00:01:44.920 | its bounds are one and infinity with one the best,
00:01:47.560 | so we are seeking to minimize this quantity.
00:01:50.380 | It is equivalent to the exponentiation of the cross entropy loss.
00:01:54.760 | This is really important.
00:01:56.360 | Most modern day language models use a cross entropy loss.
00:02:01.080 | What that means is that whether you wanted to or not,
00:02:03.780 | you are effectively optimizing that model for perplexity.
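The equivalence can be written out in one line. This is a sketch assuming natural logarithms and writing p(x_i) for the probability the model assigns to token x_i given the preceding tokens:

```latex
\mathrm{PP}(p, x)
  = \prod_{i=1}^{n} \left( \frac{1}{p(x_i)} \right)^{1/n}
  = \exp\!\left( -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i) \right)
```

The quantity inside the exponential is the mean per-token cross-entropy loss, so lowering that loss lowers perplexity.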
00:02:08.600 | What's the value encoded?
00:02:10.800 | It's something like, does the model assign
00:02:12.880 | high probability to the input sequence?
00:02:15.760 | When we think about assessment,
00:02:17.520 | what that means is that we have some assessment set of sequences.
00:02:20.900 | We run our model on those examples,
00:02:23.260 | we get the average perplexity across the examples,
00:02:26.180 | and we report that number as an estimate of system quality.
00:02:31.520 | Relatedly, there are a number of
00:02:33.760 | weaknesses that we really need to think about.
00:02:35.800 | First, perplexity is heavily dependent on the underlying vocabulary.
00:02:40.240 | One easy way to see this is just to imagine that we take
00:02:43.000 | every token in the sequences and map them to a single UNK character.
00:02:47.520 | In that case, we will have perfect perplexity,
00:02:50.440 | but we will have a terrible generation system.
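A small illustration of the vocabulary point, assuming a toy unigram model estimated by counting and, for simplicity, evaluated on its own training tokens (the helper name and data are hypothetical):

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under a unigram model counted from train_tokens."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    log_sum = sum(math.log(counts[t] / total) for t in test_tokens)
    return math.exp(-log_sum / len(test_tokens))

tokens = "the cat sat on the mat".split()
unked = ["UNK"] * len(tokens)  # every token mapped to a single UNK symbol

print(unigram_perplexity(tokens, tokens))  # greater than 1
print(unigram_perplexity(unked, unked))    # 1.0: "perfect", but useless
```

With a one-symbol vocabulary the model assigns probability 1 to every token, so the score is perfect even though nothing has been learned.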
00:02:54.180 | Perplexity also really does not allow comparisons between datasets.
00:02:59.780 | The issue here is that we don't have
00:03:01.700 | any ground truth on what's a good or bad perplexity separate from the data.
00:03:06.160 | What that means is that comparing across two datasets with
00:03:09.180 | perplexity numbers is just comparing incomparables.
00:03:13.020 | Relatedly, even comparisons between models are tricky.
00:03:17.020 | You can see this emerging here because we really need
00:03:20.100 | a lot of things to be held constant to do a model comparison.
00:03:23.380 | Certainly, the tokenization needs to be the same.
00:03:25.940 | Certainly, the datasets need to be the same.
00:03:28.300 | Ideally, many other aspects of the systems are held
00:03:31.860 | constant when we do a perplexity comparison.
00:03:34.840 | Otherwise, again, we are in danger of comparing incomparables.
00:03:39.860 | Word error rate might be better,
00:03:43.620 | and this is more tightly aligned with actual human-created reference texts,
00:03:47.820 | which could be a step up from
00:03:49.780 | perplexity in terms of how we think about assessment.
00:03:53.500 | This is really a family of measures.
00:03:55.860 | You need to pick a distance measure between strings,
00:03:58.540 | and then your word error rate is parameterized by that distance metric.
00:04:02.540 | Here's the full calculation for a gold text X and a predicted text pred.
00:04:09.520 | We do the distance between X and pred
00:04:12.220 | according to whatever distance measure we chose,
00:04:14.280 | and we divide that by the length of the reference or gold text.
00:04:18.740 | Then when we average over a whole corpus,
00:04:21.400 | what we do is for the numerator,
00:04:23.000 | sum up all of the distances between gold and predicted texts,
00:04:26.960 | and we divide that by the total length of all of the reference texts.
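Here is a minimal sketch of that calculation, assuming word-level Levenshtein distance as the chosen distance measure (other distances could be swapped in; the function names and example texts are illustrative):

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Distance between gold and predicted text, divided by the gold length."""
    return edit_distance(ref, hyp) / len(ref)

def corpus_wer(refs, hyps):
    """Sum of all distances divided by the total length of the reference texts."""
    total_dist = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    return total_dist / total_len

refs = ["the cat sat on the mat".split(), "it was good".split()]
hyps = ["the cat sat on mat".split(), "it was not very good at all".split()]

print(wer(refs[0], hyps[0]))   # 1 edit over 6 reference words
print(wer(refs[1], hyps[1]))   # 4 edits over 3 reference words: above 1
print(corpus_wer(refs, hyps))  # 5 edits over 9 reference words
```

The second pair shows why the upper bound is infinity rather than 1: a long prediction can need more edits than the reference has words.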
00:04:32.200 | Properties of word error rate,
00:04:34.460 | its bounds are zero and infinity with zero the best,
00:04:36.920 | it's an error rate, so we're trying to minimize it.
00:04:39.480 | The value encoded is something like how aligned is
00:04:42.420 | the predicted sequence with the actual sequence,
00:04:44.800 | so in that way, it's similar to F scores
00:04:47.560 | once you have thought about your distance metric.
00:04:50.740 | Weaknesses, well, it can accommodate just one reference text.
00:04:55.100 | Our fundamental challenge here is that there's
00:04:57.140 | more than one good way to say most things,
00:04:59.640 | whereas here we can accommodate only a single way,
00:05:02.620 | presumably a good one,
00:05:04.020 | of saying the thing that we care about.
00:05:06.400 | It's also a very syntactic notion by default.
00:05:10.300 | Most distance metrics are string edit metrics,
00:05:13.600 | and they're very sensitive to the actual structure of the string.
00:05:16.960 | As a result, by and large,
00:05:19.320 | these metrics will treat "it was good",
00:05:21.700 | "it was not good", and "it was great" as all similarly distant from each other,
00:05:26.120 | when of course semantically,
00:05:27.600 | "it was good" and "it was great" are alike and different from "it was not good".
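To make this concrete, here is a sketch that computes word-level Levenshtein distance for the three sentences (assuming that distance measure; the function name is illustrative):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

good = "it was good".split()
not_good = "it was not good".split()
great = "it was great".split()

print(edit_distance(good, not_good))   # 1 (insert "not")
print(edit_distance(good, great))      # 1 (substitute "great" for "good")
print(edit_distance(not_good, great))  # 2
```

The negated sentence is exactly as close to "it was good" as the near-synonym is, which is the sense in which the metric is syntactic rather than semantic.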
00:05:34.040 | BLEU scores build on intuitions around word error rate,
00:05:38.440 | and they're trying to be more sensitive to the fact that there's
00:05:40.700 | more than one way to say most things.
00:05:43.220 | Here's how BLEU scores work.
00:05:44.560 | It's again going to be a balance of precision and
00:05:46.820 | recall but now tailored to the generation space.
00:05:50.020 | The notion of precision is modified N-gram precision.
00:05:53.740 | Let's walk through this simple example here.
00:05:56.000 | We have a candidate which is unusual,
00:05:58.100 | it's just seven occurrences of the word "the".
00:06:00.900 | Obviously, not a good candidate.
00:06:02.860 | We have two reference texts,
00:06:04.960 | these are presumed to be human-created texts.
00:06:07.620 | The modified N-gram precision for "the" is 2/7.
00:06:11.980 | There are seven occurrences of "the" in the candidate,
00:06:15.100 | and the two comes from
00:06:16.760 | the reference text that contains the maximum number of occurrences of "the".
00:06:21.460 | That's reference text 1,
00:06:23.260 | it has two tokens of "the",
00:06:24.860 | whereas reference text 2 has just one token of "the".
00:06:28.100 | We get two points for that,
00:06:29.840 | and then the modified N-gram precision for "the" is 2/7.
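The clipping step can be sketched as follows. The transcript does not spell out the two reference texts, so the ones below are an assumption: they are the classic pair from the original BLEU paper, which match the counts described (two occurrences of "the" in the first, one in the second). Function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n=1):
    """Return (clipped matches, total candidate n-grams)."""
    cand_counts = Counter(ngrams(candidate, n))
    # For each n-gram, the maximum count found in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Candidate counts are clipped at that maximum.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

candidate = ["the"] * 7
references = [                          # assumed reference texts
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(modified_precision(candidate, references))  # (2, 7)
```

Without clipping, the candidate would score 7/7, since every one of its tokens appears somewhere in a reference.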
00:06:34.500 | That's a notion of precision.
00:06:36.760 | We need to balance that with a notion of recall,
00:06:39.380 | otherwise we might end up with systems that do
00:06:41.500 | very short generations in order to be very precise.
00:06:45.940 | For recall, BLEU introduces what they call a brevity penalty.
00:06:50.020 | In essence, what this is doing is saying,
00:06:52.500 | if the generated text is
00:06:54.340 | shorter than the text I expect for my corpus,
00:06:57.740 | I pay a little penalty.
00:06:59.280 | Once I get to the expected length,
00:07:01.100 | I stop paying a recall or brevity penalty,
00:07:04.100 | and I start to rely on the modified N-gram precision.
00:07:08.140 | The BLEU score itself is just a product of
00:07:10.620 | the brevity penalty score and the sum of
00:07:12.980 | the weighted modified N-gram precision values for each N.
00:07:16.780 | What I mean by that is we could do this for
00:07:18.980 | unigrams, bigrams, trigrams.
00:07:21.660 | We could assign different weight to
00:07:23.120 | those different notions of N-gram,
00:07:24.840 | and all of those are incorporated
00:07:26.700 | if we want into the BLEU score.
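Putting the pieces together, here is a sentence-level sketch. Two assumptions to flag: it follows the formulation in the original BLEU paper (Papineni et al. 2002), which combines the precisions as the exponential of a weighted sum of log precisions rather than a plain weighted sum, and it uses the reference length closest to the candidate length for the brevity penalty. The function names, weights, and example texts are illustrative.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    clipped = 0
    for gram, count in cand.items():
        max_ref = max(ngram_counts(ref, n)[gram] for ref in references)
        clipped += min(count, max_ref)
    return clipped / sum(cand.values())

def brevity_penalty(candidate, references):
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, weights=(0.5, 0.5)):
    """Sentence-level BLEU over n = 1..len(weights)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, len(weights) + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; score the candidate as 0
    log_avg = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(candidate, references) * math.exp(log_avg)

references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(bleu("the cat is on the mat".split(), references))  # 1.0
print(bleu("the cat".split(), references))                # precise but too short
print(bleu(["the"] * 7, references))                      # 0.0
```

The second candidate has perfect unigram and bigram precision, so its low score comes entirely from the brevity penalty. In practice BLEU is usually computed at the corpus level with an established implementation rather than a hand-rolled one like this.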
00:07:29.420 | Properties of the BLEU score,
00:07:31.640 | its bounds are 0 and 1, with 1 the best, though we
00:07:34.000 | have no expectation for
00:07:35.660 | naturalistic data that any system will achieve 1,
00:07:38.500 | because there's no way we can have
00:07:39.900 | all the relevant reference texts conceptually.
00:07:43.280 | The value encoded is something like an appropriate,
00:07:46.440 | we hope, balance of precision and
00:07:48.640 | recall as implemented in that brevity penalty.
00:07:52.300 | It's similar to the word error rate,
00:07:54.620 | but it seeks to accommodate the fact that there are
00:07:56.620 | typically multiple suitable outputs for a given input,
00:07:59.520 | our fundamental challenge for generation.
00:08:02.420 | Weaknesses, well, there's a long literature on this,
00:08:06.420 | some of it arguing that BLEU fails to
00:08:09.100 | correlate with human scoring for translation,
00:08:11.540 | which is an important application domain for BLEU,
00:08:14.060 | so that's worrisome.
00:08:15.520 | It's very sensitive to the N-gram order of things,
00:08:18.740 | so in that way, it is very attuned to
00:08:20.940 | syntactic elements of these comparisons.
00:08:23.780 | It's insensitive to N-gram type.
00:08:26.140 | Again, that's a notion of string dependence.
00:08:29.020 | "That dog", "the dog",
00:08:30.800 | and "that toaster" might be
00:08:32.720 | treated identically with your BLEU scoring,
00:08:34.680 | even though "that dog" and "the dog" are obviously
00:08:36.820 | closer to each other than to "that toaster".
00:08:40.460 | Finally, you have to be really
00:08:43.260 | thoughtful about the domain that you're operating in,
00:08:45.860 | because BLEU might be just mismatched
00:08:48.060 | to the goals of generation in that space.
00:08:49.880 | For example, Liu et al.
00:08:51.440 | 2016, in the process of developing and
00:08:53.700 | evaluating neural conversational agents,
00:08:56.560 | just argue against BLEU as a metric for dialogue.
00:09:00.660 | Think carefully about what your generations mean,
00:09:04.240 | what reference texts you actually have,
00:09:06.640 | and whether or not everything is
00:09:08.380 | aligned given your high-level goals.
00:09:10.340 | Again, a common refrain for this unit of the course.
00:09:14.500 | I mentioned two reference-based metrics,
00:09:17.560 | and I call them reference-based.
00:09:19.020 | Word error rate and BLEU are both like this because they
00:09:21.360 | depend on these reference texts,
00:09:23.820 | these human-created texts.
00:09:25.920 | Others in that family include ROUGE and
00:09:29.100 | METEOR, and what you can see happening here,
00:09:31.460 | especially with METEOR, is that we're
00:09:33.060 | trying to be oriented toward a task,
00:09:35.760 | maybe a semantic task like summarization,
00:09:37.720 | and also less sensitive to fine-grained details of
00:09:41.840 | the reference texts and the generated text
00:09:43.920 | to key into more semantic notions.
00:09:46.520 | METEOR does that with things like stemming and synonyms.
00:09:50.440 | With scoring procedures like CIDEr and BERTScore,
00:09:53.440 | we actually move into
00:09:54.800 | vector space models that we might hope
00:09:56.600 | capture many deep aspects of semantics.
00:09:59.560 | CIDEr does this with TF-IDF vectors.
00:10:02.480 | It's a powerful idea,
00:10:03.540 | though it does make it heavily
00:10:04.880 | dependent on the nature of the dataset.
00:10:07.680 | Then BERTScore uses weighted MaxSim at
00:10:10.920 | the token level to define scores between two texts.
00:10:14.820 | That's a very semantic notion.
00:10:16.380 | In fact, the scoring procedure looks a lot like
00:10:18.480 | the one that we use for the ColBERT retrieval model.
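A rough sketch of the token-level MaxSim idea, under several assumptions: the token vectors below are made-up toy values standing in for contextual embeddings, the importance weighting mentioned above is omitted, and the function names are illustrative rather than any library's API.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim_scores(ref_vecs, cand_vecs):
    """Unweighted MaxSim precision, recall, and F1 between two sets of token vectors."""
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-dimensional "embeddings" (hypothetical values).
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0]]

print(maxsim_scores(reference, candidate))  # precision 1.0, recall 0.5
```

Because matching happens in vector space, two different strings with similar embeddings can score highly, which is what string-overlap metrics cannot do.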
00:10:21.760 | What you can see happening,
00:10:23.120 | especially with BERTScore,
00:10:24.480 | is that we're trying to get away from
00:10:26.420 | all the details of these strings and
00:10:28.360 | really key into deeper aspects of meaning.
00:10:31.960 | I thought I'd mention image-based NLG metrics.
00:10:36.040 | Some of you might be developing systems that
00:10:37.880 | take images as input, produce text,
00:10:40.080 | and then we want to ask the question of whether or not
00:10:41.920 | that's a good text for the image.
00:10:44.600 | For this task, reference-based metrics
00:10:48.200 | like BLEU and word error rate will be fine,
00:10:50.520 | assuming that the human annotations exist and are
00:10:53.440 | aligned with the actual goal that you
00:10:55.600 | have for the text that you're
00:10:57.200 | generating conditional on these images.
00:11:00.300 | But that could be a large burden
00:11:02.500 | for many domains and many tasks.
00:11:04.260 | We won't have annotations
00:11:06.200 | in the right way for these reference-based metrics.
00:11:08.860 | You might think about reference-less metrics.
00:11:11.640 | Metrics in this space seek to score
00:11:13.980 | text-image pairs with no need for human-created references.
00:11:18.120 | At this moment, the most popular of these is certainly
00:11:20.860 | CLIPScore, but there are others like UMIC and SPURTS.
00:11:24.220 | The vision here is to drop the need for human annotation,
00:11:27.060 | which is a major bottleneck for
00:11:29.780 | evaluation and instead just score
00:11:32.020 | these image-text pairs in isolation.
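As a sketch of how a reference-less score of this kind can work, here is the core CLIPScore computation as I understand it from the CLIPScore paper: a rescaled cosine similarity between an image embedding and a text embedding, floored at zero. The embeddings below are made-up toy vectors; in practice they would come from a CLIP-style model, and the weight of 2.5 is the paper's rescaling constant as best I recall.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, text_emb, w=2.5):
    """Rescaled, non-negative cosine similarity between image and text embeddings."""
    return w * max(cosine(image_emb, text_emb), 0.0)

# Hypothetical embeddings, for illustration only.
image = [1.0, 0.0]
matching_caption = [1.0, 0.0]
opposite_caption = [-1.0, 0.0]

print(clip_score(image, matching_caption))  # 2.5
print(clip_score(image, opposite_caption))  # 0.0
```

Note that nothing in this computation refers to a human-written caption, a purpose, or a surrounding context: only the image and the text being scored.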
00:11:35.180 | I think this is really promising.
00:11:37.000 | We do have a paper, Kreiss et al.
00:11:38.420 | 2022, where we criticize
00:11:40.340 | these reference-less metrics on the grounds that they are
00:11:43.300 | insensitive to the purpose of
00:11:45.140 | the text and the context in which the image appeared.
00:11:48.420 | Those are crucial aspects when you think about
00:11:50.820 | our goals for generation in this context.
00:11:53.300 | It's a shame that these reference-less metrics
00:11:55.700 | are just missing that information.
00:11:57.940 | However, we are optimistic that we can design
00:12:00.900 | variants of CLIPScore and related metrics
00:12:03.900 | that can actually bring in these notions of quality.
00:12:06.880 | I think reference-less metrics may be
00:12:09.140 | a fruitful path forward for
00:12:10.580 | evaluation for image-based NLG.
00:12:14.060 | Then finally, to round this out,
00:12:15.940 | I thought I'd offer a more high-level comment
00:12:18.340 | under the heading of task-oriented metrics.
00:12:21.860 | We've been very focused so far on comparisons with
00:12:24.740 | reference texts and other notions of intrinsic quality.
00:12:28.400 | But we should reflect on the fact that by and large,
00:12:30.820 | when we do generation,
00:12:31.800 | we're trying to achieve some goal of
00:12:33.420 | communication or to help an agent take some action.
00:12:37.140 | The classical off-the-shelf reference-based metrics will
00:12:40.980 | capture aspects of the task
00:12:42.880 | only to the extent that the human annotations did.
00:12:46.100 | If your reference texts aren't sensitive to
00:12:48.180 | what the goal of generation was,
00:12:50.220 | then that won't be reflected in your evaluation.
00:12:53.200 | You can imagine model-based metrics that are tuned to
00:12:56.700 | specific tasks and therefore task-oriented in their nature,
00:13:00.100 | but that's actually currently a very rare mode.
00:13:03.120 | I think it's fruitful though to think about what the goal of
00:13:06.340 | the text was and consider whether
00:13:08.140 | your evaluation could be based in that goal.
00:13:10.420 | This would be a new mode of thinking.
00:13:12.380 | You would ask yourself, can an agent that
00:13:14.480 | receives the generated text use it to solve the task?
00:13:17.940 | Then your metric would be task success.
00:13:21.580 | Or was a specific piece of
00:13:23.780 | information reliably communicated?
00:13:25.700 | Again, we could just ask directly whether the agent receiving
00:13:29.060 | the message reliably extracted the information we care about.
00:13:33.420 | Or did the message lead the person to take a desirable action,
00:13:36.660 | which would be a more indirect measure of communicative success?
00:13:40.860 | That could be the fundamental thing that we
00:13:43.100 | use for our metric for generation.
00:13:45.280 | That will capture some aspects and
00:13:47.580 | leave out some others, for example, fluency.
00:13:49.820 | But I think overall,
00:13:51.060 | you can imagine that this is much more tightly aligned
00:13:54.000 | with the goals that we actually have for our generation systems.
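One way to picture a task-success metric is sketched below. Everything here is hypothetical: the episode format, the toy keyword-matching listener, and the function names are stand-ins for whatever agent and task a real evaluation would use.

```python
def task_success_rate(episodes, listener):
    """Fraction of episodes in which the listener's choice matches the target."""
    successes = sum(1 for message, options, target in episodes
                    if listener(message, options) == target)
    return successes / len(episodes)

def keyword_listener(message, options):
    """Toy listener: pick the first option that is mentioned in the message."""
    words = message.split()
    for option in options:
        if option in words:
            return option
    return None  # the message did not identify any option

# Each episode: (generated message, available options, intended target).
episodes = [
    ("pick the red block", ["red", "blue"], "red"),
    ("the blue one please", ["red", "blue"], "blue"),
    ("that one over there", ["red", "blue"], "red"),  # fluent but uninformative
]

print(task_success_rate(episodes, keyword_listener))  # 2 of 3 episodes succeed
```

The third message is perfectly fluent yet fails the task, which is the kind of gap a task-oriented metric is meant to expose.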