Stanford XCS224U: NLU | NLP Methods and Metrics, Part 3: Generation Metrics | Spring 2023

This is part 3 in our series on methods and metrics. We're going to talk about generation metrics. These seem conceptually straightforward at first, but evaluating generation is actually incredibly challenging conceptually. I would say the fundamental issue here is that there is more than one effective way to say most things. That immediately raises the question of what we are actually measuring. Fluency? Truthfulness? Communicative effectiveness? Maybe something else? These are all interestingly different questions. Those examples show that you really need to have clarity on the high-level goals before you can even think about which metrics to choose for generation.
Let's start with perplexity. Given a sequence and a model, the perplexity for that sequence according to that model is really just the geometric mean of the inverse probabilities the model assigns to the tokens in the sequence. Then, when we average over an entire corpus for the mean perplexity, we average the individual perplexity scores per sequence.

Some properties of perplexity: its bounds are one and infinity, with one the best, so we are trying to minimize it. It is equivalent to the exponentiation of the cross-entropy loss, and most modern-day language models use a cross-entropy loss. What that means is that, whether you wanted to or not, you are effectively optimizing that model for perplexity.
When we use perplexity as an evaluation metric, what that means is that we have some assessment set of sequences, we get the average perplexity across those examples, and we report that number as an estimate of system quality.
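To make that calculation concrete, here is a minimal sketch of sequence-level perplexity and its average over an assessment set. The token probabilities are placeholders standing in for whatever your model assigns.

```python
import math

def sequence_perplexity(token_probs):
    """Perplexity of one sequence: the geometric mean of the inverse
    token probabilities, i.e., exp of the average negative
    log-probability (the cross-entropy)."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

def mean_corpus_perplexity(corpus_token_probs):
    """Average the per-sequence perplexities over an assessment set."""
    scores = [sequence_perplexity(probs) for probs in corpus_token_probs]
    return sum(scores) / len(scores)

# Toy example: probabilities a hypothetical model assigns to each token.
corpus = [[0.2, 0.5, 0.9], [0.1, 0.4]]
print(mean_corpus_perplexity(corpus))
```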
But perplexity has some weaknesses that we really need to think about. First, perplexity is heavily dependent on the underlying vocabulary. One easy way to see this is just to imagine that we take every token in the sequences and map them all to a single UNK character. In that case, we will have perfect perplexity, but we will have a terrible generation system.
Perplexity also really does not allow comparisons between datasets. We have no ground truth on what counts as a good or bad perplexity separate from the data. What that means is that comparing perplexity numbers across two datasets is just comparing incomparables.

Relatedly, even comparisons between models are tricky. You can see the issue emerging here, because we really need a lot of things to be held constant to do a model comparison. Certainly the tokenization needs to be the same, and ideally many other aspects of the systems are held constant as well. Otherwise, again, we are in danger of comparing incomparables.
Next is word error rate. This is more tightly aligned with actual human-created reference texts, which differentiates it from perplexity in terms of how we think about assessment. You need to pick a distance measure between strings, and then your word error rate is parameterized by that distance metric.

Here's the full calculation for a gold text X and a predicted text pred: we compute the distance between the two according to whatever distance measure we chose, and we divide that by the length of the reference or gold text. For a corpus-level number, we sum up all of the distances between gold and predicted texts, and we divide that by the total length of all of the reference texts.
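As a minimal sketch of that calculation, here word-level Levenshtein distance stands in for the distance measure; any other string distance could be substituted.

```python
def edit_distance(ref_tokens, pred_tokens):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(ref_tokens), len(pred_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == pred_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def corpus_wer(gold_texts, pred_texts):
    """Sum of distances divided by the total reference length."""
    total_dist, total_len = 0, 0
    for gold, pred in zip(gold_texts, pred_texts):
        g, p = gold.split(), pred.split()
        total_dist += edit_distance(g, p)
        total_len += len(g)
    return total_dist / total_len

print(corpus_wer(["the cat sat on the mat"], ["the cat is on the mat"]))
```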
Some properties: its bounds are zero and infinity, with zero the best; it's an error rate, so we're trying to minimize it. The value encoded is something like how aligned the predicted sequence is with the actual sequence, once you have thought through your distance metric.
Weaknesses: well, it can accommodate just one reference text. Our fundamental challenge here is that there's more than one effective way to say most things, whereas here we can accommodate only a single way of saying whatever needs to be said.

It's also a very syntactic notion by default. Most distance metrics are string edit metrics, and they're very sensitive to the actual structure of the string. For example, they will treat "it was good", "it was not good", and "it was great" as all similarly distant from each other, even though intuitively "it was good" and "it was great" are alike and both different from "it was not good".
BLEU scores build on intuitions around word error rate, but they try to be more sensitive to the fact that there is typically more than one good way to say something. BLEU is again going to be a balance of precision and recall, but now tailored to the generation space.

The notion of precision is modified N-gram precision. Consider the classic example of a candidate that is just seven tokens of "the", scored against two reference texts; these are presumed to be human-created texts. The modified N-gram precision for "the" is 2/7. There are seven occurrences of "the" in the candidate, and we clip that count using the reference text that contains the maximum number of "the"s: reference text 1 has two tokens of "the", whereas reference text 2 has just one token of "the". So the clipped count is 2, and the modified N-gram precision for "the" is 2/7.

We need to balance that with a notion of recall, otherwise we might end up with systems that make very short generations in order to be very precise. For recall, BLEU introduces what is called a brevity penalty. It penalizes candidates that are shorter than the texts I expect for my corpus; as candidates get long enough, the penalty goes to 1 and you start to rely on the modified N-gram precision. The BLEU score itself combines the brevity penalty with the weighted modified N-gram precision values for each N.
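Here is a rough sketch of those two ingredients, clipped (modified) unigram precision and the brevity penalty. It is simplified relative to full BLEU, which combines precisions over several N-gram orders, but it reproduces the 2/7 from the example; the reference strings used below are the ones from the classic BLEU paper example.

```python
import math
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped N-gram precision: each candidate N-gram counts only up to
    the maximum number of times it appears in any single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand_counts = ngrams(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def brevity_penalty(cand_len, ref_len):
    """1 when the candidate is at least as long as the reference;
    otherwise an exponential penalty for being too short."""
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)

candidate = "the the the the the the the".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision(candidate, refs))   # 2/7
print(brevity_penalty(len(candidate), len(refs[0])))
```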
Some properties: its bounds are 0 and 1, with 1 the best, but it's unlikely on naturalistic data that any system will achieve 1, since that would require matching all the relevant reference texts. The value encoded is something like an appropriate balance of N-gram precision with recall, as implemented in that brevity penalty. Like word error rate, BLEU is oriented toward reference texts, but it seeks to accommodate the fact that there are typically multiple suitable outputs for a given input.
Weaknesses: well, there's a long literature on this. There is work showing that BLEU scores can fail to correlate with human scoring for translation, which is an important application domain for BLEU. It's also very sensitive to the N-gram order of things, so superficially different phrasings of the same content can be heavily penalized even when they are obviously close in meaning. You need to be thoughtful about the domain that you're operating in; there is work arguing directly against BLEU as a metric for dialogue, for example. Think carefully about what your generations mean in your domain before settling on the metric. Again, a common refrain for this unit of the course.
Word error rate and BLEU are both heavily dependent on the surface form of the strings being compared. There are metrics that try to be more semantic and less sensitive to fine-grained details of the strings. METEOR does that with things like stemming and synonym matching. With scoring procedures like CIDEr and BERTScore, we move to vector representations at the token level to define scores between two texts. In fact, the scoring procedure looks a lot like the one that we use for the ColBERT retrieval model.
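To give a feel for that style of scoring, here is a minimal sketch assuming we already have one embedding per token from some encoder: each candidate token is matched to its most similar reference token (a MaxSim step, as in ColBERT), and the maxima are averaged. Real BERTScore adds details such as IDF weighting and precision/recall/F1 variants.

```python
import numpy as np

def max_sim_score(cand_embs, ref_embs):
    """For each candidate token embedding, take the cosine similarity
    with its best-matching reference token, then average.
    cand_embs, ref_embs: (num_tokens, dim) arrays."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = normalize(cand_embs) @ normalize(ref_embs).T  # pairwise cosines
    return sims.max(axis=1).mean()

# Toy 4-dimensional "embeddings" standing in for real encoder output.
cand = np.array([[0.1, 0.9, 0.0, 0.2], [0.7, 0.1, 0.3, 0.0]])
ref = np.array([[0.1, 0.8, 0.1, 0.2], [0.0, 0.2, 0.9, 0.1], [0.6, 0.2, 0.3, 0.1]])
print(max_sim_score(cand, ref))
```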
Finally, as a special case, I thought I'd mention image-based NLG metrics. Here we have an image as input, and then we want to ask the question of whether or not a generated text is a good description of that image. Reference-based metrics can be applied here, assuming that the human annotations exist and are structured in the right way for these reference-based metrics. Alternatively, you might think about reference-less metrics, which directly score text-image pairs with no need for human-created references. At this moment, the most popular of these is certainly CLIPScore, but there are others like UMIC and SPURTS.
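As a rough sketch of the idea behind CLIPScore: embed the image and the candidate text with a CLIP-style model and report a rescaled cosine similarity, with no reference text involved. The embeddings below are placeholders for real CLIP features; the 2.5 rescaling weight follows the published CLIPScore formulation.

```python
import numpy as np

def clipscore(image_emb, text_emb, w=2.5):
    """Reference-less score: w * max(cos(image, text), 0)."""
    cos = float(np.dot(image_emb, text_emb) /
                (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return w * max(cos, 0.0)

# Placeholder embeddings standing in for real CLIP image/text features.
image_emb = np.array([0.2, 0.7, 0.1, 0.4])
text_emb = np.array([0.25, 0.65, 0.05, 0.5])
print(clipscore(image_emb, text_emb))
```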
The vision here is to drop the need for human annotation entirely. However, there are critiques of these reference-less metrics on the grounds that they are insensitive to the purpose of the text and the context in which the image appeared. Those are crucial aspects when you think about what makes a generated description genuinely useful. It's a shame that these reference-less metrics ignore them. However, we are optimistic that we can design reference-less metrics that actually bring in these notions of quality.
To wrap up, I thought I'd offer a more high-level comment about generation evaluation. We've been very focused so far on comparisons with reference texts and other notions of intrinsic quality. But we should reflect on the fact that, by and large, we generate text in order to achieve some goal, whether that is communication or helping an agent take some action. The classical off-the-shelf reference-based metrics will reflect those goals only to the extent that the human annotations did. If the annotations weren't oriented toward your actual goal, then that won't be reflected in your evaluation. You can imagine model-based metrics that are tuned to specific tasks and are therefore task-oriented in nature, but that's currently a very rare mode of evaluation.
I think it's fruitful, though, to think about what the goal of the generated text actually is: can an agent that receives the generated text use it to solve the task? Again, we could just ask directly whether the agent receiving the message reliably extracted the information we care about. Or did the message lead the person to take a desirable action, which would be a more indirect measure of communicative success?
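As a minimal sketch of that kind of task-oriented evaluation, assuming we have some agent function (hypothetical here; it could be a model or a harness around a human study) that acts on each generated message:

```python
def task_success_rate(generated_messages, targets, agent):
    """Fraction of messages that lead the agent to the desired outcome.
    `agent` maps a message to the action/answer it extracts from it."""
    outcomes = [agent(msg) == target
                for msg, target in zip(generated_messages, targets)]
    return sum(outcomes) / len(outcomes)

# Toy example with a trivial keyword-matching "agent".
msgs = ["press the red button", "press the blue button"]
targets = ["red", "blue"]
agent = lambda msg: "red" if "red" in msg else "blue"
print(task_success_rate(msgs, targets, agent))
```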
If you can measure outcomes like those, you can imagine that this is much more tightly aligned with the goals that we actually have for our generation systems.