Stanford XCS224U: NLU | NLP Methods and Metrics, Part 3: Generation Metrics | Spring 2023

This is part 3 in our series on methods and metrics. We're going to talk about generation metrics. These seem conceptually straightforward at first, but evaluating generation is actually incredibly challenging conceptually. I would say the fundamental issue here is that there is more than one effective way to say most things. That immediately raises the question of what we are actually measuring. Fluency? Truthfulness? Communicative effectiveness? Maybe something else? These are all interestingly different questions. Those examples show that you really need to have clarity on the high-level goals before you can even think about which metrics to choose for generation.
Let's start with perplexity. Given a sequence and a model, the perplexity for that sequence according to that model is really just the geometric mean of the inverse probabilities the model assigns to the tokens in the sequence. Then, when we average over an entire corpus for the mean perplexity, we average the individual perplexity scores per sequence.

Some properties of perplexity: its bounds are one and infinity, with one the best, so we are trying to minimize it. It is equivalent to the exponentiation of the cross-entropy loss, and most modern-day language models use a cross-entropy loss. What that means is that, whether you wanted to or not, you are effectively optimizing that model for perplexity.
When we use perplexity as an evaluation metric, what that means is that we have some assessment set of sequences, we get the average perplexity across those examples, and we report that number as an estimate of system quality.
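To make that calculation concrete, here is a minimal sketch of sequence-level perplexity and its average over an assessment set. The token probabilities are placeholders standing in for whatever your model assigns.

```python
import math

def sequence_perplexity(token_probs):
    """Perplexity of one sequence: the geometric mean of the inverse
    token probabilities, i.e., exp of the average negative
    log-probability (the cross-entropy)."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

def mean_corpus_perplexity(corpus_token_probs):
    """Average the per-sequence perplexities over an assessment set."""
    scores = [sequence_perplexity(probs) for probs in corpus_token_probs]
    return sum(scores) / len(scores)

# Toy example: probabilities a hypothetical model assigns to each token.
corpus = [[0.2, 0.5, 0.9], [0.1, 0.4]]
print(mean_corpus_perplexity(corpus))
```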
But perplexity has some weaknesses that we really need to think about. First, perplexity is heavily dependent on the underlying vocabulary. One easy way to see this is just to imagine that we take every token in the sequences and map them all to a single UNK character. In that case, we will have perfect perplexity, but we will have a terrible generation system.
Perplexity also really does not allow comparisons between datasets. We have no ground truth on what counts as a good or bad perplexity separate from the data. What that means is that comparing perplexity numbers across two datasets is just comparing incomparables.

Relatedly, even comparisons between models are tricky. You can see the issue emerging here, because we really need a lot of things to be held constant to do a model comparison. Certainly the tokenization needs to be the same, and ideally many other aspects of the systems are held constant as well. Otherwise, again, we are in danger of comparing incomparables.
Next is word error rate. This is more tightly aligned with actual human-created reference texts, which differentiates it from perplexity in terms of how we think about assessment. You need to pick a distance measure between strings, and then your word error rate is parameterized by that distance metric.

Here's the full calculation for a gold text X and a predicted text pred: we compute the distance between the two according to whatever distance measure we chose, and we divide that by the length of the reference or gold text. For a corpus-level number, we sum up all of the distances between gold and predicted texts, and we divide that by the total length of all of the reference texts.
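As a minimal sketch of that calculation, here word-level Levenshtein distance stands in for the distance measure; any other string distance could be substituted.

```python
def edit_distance(ref_tokens, pred_tokens):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(ref_tokens), len(pred_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == pred_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def corpus_wer(gold_texts, pred_texts):
    """Sum of distances divided by the total reference length."""
    total_dist, total_len = 0, 0
    for gold, pred in zip(gold_texts, pred_texts):
        g, p = gold.split(), pred.split()
        total_dist += edit_distance(g, p)
        total_len += len(g)
    return total_dist / total_len

print(corpus_wer(["the cat sat on the mat"], ["the cat is on the mat"]))
```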
Some properties: its bounds are zero and infinity, with zero the best; it's an error rate, so we're trying to minimize it. The value encoded is something like how aligned the predicted sequence is with the actual sequence, once you have thought through your distance metric.
Weaknesses: well, it can accommodate just one reference text. Our fundamental challenge here is that there's more than one effective way to say most things, whereas here we can accommodate only a single way of saying whatever needs to be said.

It's also a very syntactic notion by default. Most distance metrics are string edit metrics, and they're very sensitive to the actual structure of the string. For example, they will treat "it was good", "it was not good", and "it was great" as all similarly distant from each other, even though intuitively "it was good" and "it was great" are alike and both different from "it was not good".
BLEU scores build on intuitions around word error rate, but they try to be more sensitive to the fact that there is typically more than one good way to say something. BLEU is again going to be a balance of precision and recall, but now tailored to the generation space.

The notion of precision is modified N-gram precision. Consider the classic example of a candidate that is just seven tokens of "the", scored against two reference texts; these are presumed to be human-created texts. The modified N-gram precision for "the" is 2/7. There are seven occurrences of "the" in the candidate, and we clip that count using the reference text that contains the maximum number of "the"s: reference text 1 has two tokens of "the", whereas reference text 2 has just one token of "the". So the clipped count is 2, and the modified N-gram precision for "the" is 2/7.

We need to balance that with a notion of recall, otherwise we might end up with systems that make very short generations in order to be very precise. For recall, BLEU introduces what is called a brevity penalty. It penalizes candidates that are shorter than the texts I expect for my corpus; as candidates get long enough, the penalty goes to 1 and you start to rely on the modified N-gram precision. The BLEU score itself combines the brevity penalty with the weighted modified N-gram precision values for each N.
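Here is a rough sketch of those two ingredients, clipped (modified) unigram precision and the brevity penalty. It is simplified relative to full BLEU, which combines precisions over several N-gram orders, but it reproduces the 2/7 from the example; the reference strings used below are the ones from the classic BLEU paper example.

```python
import math
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped N-gram precision: each candidate N-gram counts only up to
    the maximum number of times it appears in any single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand_counts = ngrams(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def brevity_penalty(cand_len, ref_len):
    """1 when the candidate is at least as long as the reference;
    otherwise an exponential penalty for being too short."""
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)

candidate = "the the the the the the the".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision(candidate, refs))   # 2/7
print(brevity_penalty(len(candidate), len(refs[0])))
```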
Some properties: its bounds are 0 and 1, with 1 the best, but it's unlikely on naturalistic data that any system will achieve 1, since that would require matching all the relevant reference texts. The value encoded is something like an appropriate balance of N-gram precision with recall, as implemented in that brevity penalty. Like word error rate, BLEU is oriented toward reference texts, but it seeks to accommodate the fact that there are typically multiple suitable outputs for a given input.
Weaknesses: well, there's a long literature on this. There is work showing that BLEU scores can fail to correlate with human scoring for translation, which is an important application domain for BLEU. It's also very sensitive to the N-gram order of things, so superficially different phrasings of the same content can be heavily penalized even when they are obviously close in meaning. You need to be thoughtful about the domain that you're operating in; there is work arguing directly against BLEU as a metric for dialogue, for example. Think carefully about what your generations mean in your domain before settling on the metric. Again, a common refrain for this unit of the course.
Word error rate and BLEU are both heavily dependent on the surface form of the strings being compared. There are metrics that try to be more semantic and less sensitive to fine-grained details of the strings. METEOR does that with things like stemming and synonym matching. With scoring procedures like CIDEr and BERTScore, we move to vector representations at the token level to define scores between two texts. In fact, the scoring procedure looks a lot like the one that we use for the ColBERT retrieval model.
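To give a feel for that style of scoring, here is a minimal sketch assuming we already have one embedding per token from some encoder: each candidate token is matched to its most similar reference token (a MaxSim step, as in ColBERT), and the maxima are averaged. Real BERTScore adds details such as IDF weighting and precision/recall/F1 variants.

```python
import numpy as np

def max_sim_score(cand_embs, ref_embs):
    """For each candidate token embedding, take the cosine similarity
    with its best-matching reference token, then average.
    cand_embs, ref_embs: (num_tokens, dim) arrays."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = normalize(cand_embs) @ normalize(ref_embs).T  # pairwise cosines
    return sims.max(axis=1).mean()

# Toy 4-dimensional "embeddings" standing in for real encoder output.
cand = np.array([[0.1, 0.9, 0.0, 0.2], [0.7, 0.1, 0.3, 0.0]])
ref = np.array([[0.1, 0.8, 0.1, 0.2], [0.0, 0.2, 0.9, 0.1], [0.6, 0.2, 0.3, 0.1]])
print(max_sim_score(cand, ref))
```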
Finally, as a special case, I thought I'd mention image-based NLG metrics. Here we have an image as input, and then we want to ask the question of whether or not a generated text is a good description of that image. Reference-based metrics can be applied here, assuming that the human annotations exist and are structured in the right way for these reference-based metrics. Alternatively, you might think about reference-less metrics, which directly score text-image pairs with no need for human-created references. At this moment, the most popular of these is certainly CLIPScore, but there are others like UMIC and SPURTS.
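As a rough sketch of the idea behind CLIPScore: embed the image and the candidate text with a CLIP-style model and report a rescaled cosine similarity, with no reference text involved. The embeddings below are placeholders for real CLIP features; the 2.5 rescaling weight follows the published CLIPScore formulation.

```python
import numpy as np

def clipscore(image_emb, text_emb, w=2.5):
    """Reference-less score: w * max(cos(image, text), 0)."""
    cos = float(np.dot(image_emb, text_emb) /
                (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return w * max(cos, 0.0)

# Placeholder embeddings standing in for real CLIP image/text features.
image_emb = np.array([0.2, 0.7, 0.1, 0.4])
text_emb = np.array([0.25, 0.65, 0.05, 0.5])
print(clipscore(image_emb, text_emb))
```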
The vision here is to drop the need for human annotation entirely. However, there are critiques of these reference-less metrics on the grounds that they are insensitive to the purpose of the text and the context in which the image appeared. Those are crucial aspects when you think about what makes a generated description genuinely useful. It's a shame that these reference-less metrics ignore them. However, we are optimistic that we can design reference-less metrics that actually bring in these notions of quality.
To wrap up, I thought I'd offer a more high-level comment about generation evaluation. We've been very focused so far on comparisons with reference texts and other notions of intrinsic quality. But we should reflect on the fact that, by and large, we generate text in order to achieve some goal, whether that is communication or helping an agent take some action. The classical off-the-shelf reference-based metrics will reflect those goals only to the extent that the human annotations did. If the annotations weren't oriented toward your actual goal, then that won't be reflected in your evaluation. You can imagine model-based metrics that are tuned to specific tasks and are therefore task-oriented in nature, but that's currently a very rare mode of evaluation.
I think it's fruitful, though, to think about what the goal of the generated text actually is: can an agent that receives the generated text use it to solve the task? Again, we could just ask directly whether the agent receiving the message reliably extracted the information we care about. Or did the message lead the person to take a desirable action, which would be a more indirect measure of communicative success?
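As a minimal sketch of that kind of task-oriented evaluation, assuming we have some agent function (hypothetical here; it could be a model or a harness around a human study) that acts on each generated message:

```python
def task_success_rate(generated_messages, targets, agent):
    """Fraction of messages that lead the agent to the desired outcome.
    `agent` maps a message to the action/answer it extracts from it."""
    outcomes = [agent(msg) == target
                for msg, target in zip(generated_messages, targets)]
    return sum(outcomes) / len(outcomes)

# Toy example with a trivial keyword-matching "agent".
msgs = ["press the red button", "press the blue button"]
targets = ["red", "blue"]
agent = lambda msg: "red" if "red" in msg else "blue"
print(task_success_rate(msgs, targets, agent))
```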
If you can measure outcomes like those, you can imagine that this is much more tightly aligned with the goals that we actually have for our generation systems.