Stanford XCS224U: NLU | NLP Methods and Metrics, Part 3: Generation Metrics | Spring 2023
This is part 3 in our series on methods and metrics. We're going to talk about generation metrics.
Those seem conceptually straightforward at first, but in fact generation is incredibly conceptually challenging. I would say the fundamental issue here is that there is more than one effective way to say most things.
What should we even be measuring, then? Communicative effectiveness? Maybe something else? These are all interestingly different questions, and they show that you really need to have clarity on the high-level goals before you can even think about which metrics to choose for generation.
Let's start with perplexity. For a sequence and a model that assigns probabilities to its tokens, the perplexity for that sequence according to that model is really just the geometric mean of the inverse probabilities assigned to the tokens (equivalently, the reciprocal of the geometric mean of the probabilities). Then when we average over an entire corpus for the mean perplexity, we take a geometric mean of the individual perplexity scores per sequence. To summarize its properties: its bounds are one and infinity, with one the best, and it is equivalent to the exponentiation of the cross-entropy loss.
Most modern-day language models use a cross-entropy loss, which means that, whether you wanted to or not, you are effectively optimizing that model for perplexity. When we use perplexity for assessment, that means we have some assessment set of sequences, we compute the average perplexity across those examples, and we report that number as an estimate of system quality.
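Here is a minimal sketch of these calculations in Python. The per-token probabilities below are made-up numbers purely for illustration; in practice they would come from the language model being evaluated.

```python
import math

def sequence_perplexity(token_probs):
    """Perplexity of one sequence: the geometric mean of the inverse
    probabilities the model assigns to its tokens, which equals the
    exponentiated average negative log probability (cross-entropy)."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def mean_corpus_perplexity(corpus_token_probs):
    """Geometric mean of the per-sequence perplexities."""
    ppls = [sequence_perplexity(probs) for probs in corpus_token_probs]
    return math.exp(sum(math.log(p) for p in ppls) / len(ppls))

# Made-up per-token probabilities for two short sequences:
corpus = [[0.25, 0.1, 0.5], [0.4, 0.05, 0.2, 0.3]]
print(sequence_perplexity(corpus[0]))   # ~4.31
print(mean_corpus_perplexity(corpus))   # corpus-level estimate
```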
However, perplexity has some weaknesses that we really need to think about. First, perplexity is heavily dependent on the underlying vocabulary. One easy way to see this is to imagine that we map every token in the sequences to a single UNK character. In that case, we will have perfect perplexity, but we will have a terrible generation system.
Perplexity also does not really allow comparisons between datasets. We have no ground truth on what counts as a good or bad perplexity separate from the data, which means that comparing perplexity numbers across two datasets is just comparing incomparables.
Relatedly, even comparisons between models are tricky. You can see this emerging here because we really need a lot of things to be held constant to do a model comparison. Certainly, the tokenization needs to be the same, and ideally many other aspects of the systems are held constant as well. Otherwise, again, we are in danger of comparing incomparables.
Let's turn to word error rate. This metric is grounded in comparisons with human-created reference texts, which could be an improvement over perplexity in terms of how we think about assessment. You need to pick a distance measure between strings, and then your word error rate is parameterized by that distance metric.
Here's the full calculation for a gold text X and a predicted text pred: we compute the distance between them according to whatever distance measure we chose, and we divide that by the length of the reference or gold text. For the corpus-level version, we sum up all of the distances between gold and predicted texts and divide by the total length of all of the reference texts.
To summarize its properties: its bounds are zero and infinity, with zero the best; it's an error rate, so we're trying to minimize it. The value encoded is something like how well aligned the predicted sequence is with the actual sequence, once you have thought about your distance metric.
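Here is a sketch of that calculation, with word-level Levenshtein distance standing in as the distance measure; any other string distance could be swapped in.

```python
def edit_distance(ref_toks, pred_toks):
    """Word-level Levenshtein distance: one possible distance measure."""
    m, n = len(ref_toks), len(pred_toks)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_toks[i - 1] == pred_toks[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def word_error_rate(gold, pred, distance=edit_distance):
    """distance(gold, pred) divided by the length of the gold text."""
    return distance(gold, pred) / len(gold)

def corpus_wer(golds, preds, distance=edit_distance):
    """Sum of distances divided by the total length of the gold texts."""
    total_dist = sum(distance(g, p) for g, p in zip(golds, preds))
    return total_dist / sum(len(g) for g in golds)

print(word_error_rate("it was good".split(), "it was great".split()))  # 1/3
```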
As for weaknesses, it can accommodate just one reference text per example. Our fundamental challenge is that there is more than one effective way to say most things, whereas here we can accommodate only a single way of saying them.
It's also a very syntactic notion by default. Most distance metrics are string edit metrics, and they're very sensitive to the actual structure of the string. For example, they would treat "it was good", "it was not good", and "it was great" as all similarly distant from each other, even though "it was good" and "it was great" are alike in meaning and both different from "it was not good".
BLEU scores build on intuitions around word error rate, while trying to be more sensitive to the fact that there is typically more than one good way to say something. BLEU is again going to be a balance of precision and recall, but now tailored to the generation space.
The notion of precision is modified n-gram precision. Consider a degenerate candidate consisting of seven tokens of "the", together with two reference texts, which are presumed to be human-created. The modified n-gram precision for "the" is 2/7: there are seven occurrences of "the" in the candidate, and we clip that count using the reference text that contains the maximum number of "the"s. Reference text 1 contains two tokens of "the", whereas reference text 2 has just one, so the clipped count is two, and the modified n-gram precision for "the" is 2/7.
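Here is a sketch of that clipping computation. The two reference sentences below are hypothetical stand-ins chosen only to match the counts described above (two tokens of "the" in reference 1, one in reference 2), and they reproduce the 2/7 result.

```python
from collections import Counter

def modified_ngram_precision(candidate, references, n=1):
    """Clip each candidate n-gram count by its maximum count in any single reference."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand_counts = ngrams(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# The degenerate candidate from the example: seven tokens of "the".
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),      # hypothetical reference 1: two tokens of "the"
              "there is a cat on the mat".split()]  # hypothetical reference 2: one token of "the"
print(modified_ngram_precision(candidate, references, n=1))  # 2/7, about 0.286
```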
We need to balance that with a notion of recall; otherwise we might end up with systems that do very short generations in order to be very precise.
For recall, BLEU introduces what they call a brevity penalty. The penalty kicks in when the candidate text is shorter than the text we expect for the corpus; once the candidate is long enough, the penalty goes to one and you start to rely on the modified n-gram precision. The BLEU score itself is the brevity penalty multiplied by the weighted geometric mean of the modified n-gram precision values for each n.
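A minimal sketch of how those pieces combine, using hypothetical precision and length values; production implementations such as sacreBLEU also handle details like tokenization and smoothing.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """1.0 when the candidate is at least as long as the reference;
    otherwise an exponential penalty that shrinks the score."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

def bleu(precisions, candidate_len, reference_len, weights=None):
    """Brevity penalty times the weighted geometric mean of the modified
    n-gram precisions. Assumes all precisions > 0 (real implementations
    add smoothing to handle zeros)."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)  # e.g., 1/4 each for BLEU-4
    log_avg = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(candidate_len, reference_len) * math.exp(log_avg)

# Hypothetical modified n-gram precisions for n = 1..4:
print(bleu([0.8, 0.6, 0.4, 0.3], candidate_len=18, reference_len=20))
```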
To summarize its properties: its bounds are 0 and 1, with 1 the best, though with naturalistic data we don't really expect that any system will achieve 1, since that would require matching all of the relevant reference texts. The value encoded is something like an appropriate balance of n-gram precision and recall, with recall implemented in that brevity penalty.
Like word error rate, BLEU depends on comparisons with reference texts, but it seeks to accommodate the fact that there are typically multiple suitable outputs for a given input.
As for weaknesses, well, there's a long literature on this. There is evidence that BLEU fails to correlate well with human scoring for translation, which is an important application domain for BLEU. It's also very sensitive to the n-gram order of things and to superficial details of the strings involved, so it can treat candidates as quite different even when they are obviously close in meaning.
More generally, you need to be thoughtful about the domain that you're operating in; there is work that argues directly against BLEU as a metric for dialogue, for example. Think carefully about what your generations mean before you settle on a metric. Again, a common refrain for this unit of the course.
Word error rate and BLEU are both very surface-oriented because they operate directly over strings. We might prefer metrics that are more sensitive to semantics and less sensitive to fine-grained details of the strings. METEOR moves in that direction with things like stemming and synonym matching.
With scoring procedures like CIDEr and BERTScore, we move to comparing vector representations of the two texts, down to the token level, to define scores between them. In fact, the BERTScore procedure looks a lot like the one that we used for the ColBERT retrieval model.
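To make that connection concrete, here is a sketch of a BERTScore-style computation over contextual token embeddings. The embeddings below are random stand-ins; the full metric also offers optional IDF weighting and baseline rescaling.

```python
import numpy as np

def greedy_match_f1(cand_vecs, ref_vecs):
    """BERTScore-style scoring: cosine similarities between every candidate
    and reference token vector, then greedy max-matching in both directions."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = normalize(cand_vecs) @ normalize(ref_vecs).T   # candidate x reference
    precision = sims.max(axis=1).mean()  # each candidate token finds its best reference token
    recall = sims.max(axis=0).mean()     # each reference token finds its best candidate token
    return 2 * precision * recall / (precision + recall)

# Random stand-ins for contextual token embeddings
# (in practice, outputs of a BERT-style encoder):
rng = np.random.default_rng(0)
cand_vecs = rng.normal(size=(5, 768))   # 5 candidate tokens
ref_vecs = rng.normal(size=(7, 768))    # 7 reference tokens
print(greedy_match_f1(cand_vecs, ref_vecs))
```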
Finally, I thought I'd mention image-based NLG metrics. In that setting, we have an image and a generated text, and we want to ask whether the text is a good description of the image. The reference-based metrics above can be applied here, assuming that the human annotations exist and are set up in the right way for these reference-based metrics.
Alternatively, you might think about reference-less metrics, which score text-image pairs directly, with no need for human-created references. At this moment, the most popular of these is certainly CLIPScore, but there are others like UMIC and SPURTS.
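As a sketch of the idea behind CLIPScore: the score is a rescaled cosine similarity between the image embedding and the candidate-text embedding from a CLIP-style model, clipped at zero. The embeddings below are random stand-ins; in practice they would come from a pretrained CLIP encoder.

```python
import numpy as np

def clipscore(text_emb, image_emb, w=2.5):
    """Reference-less score: rescaled cosine similarity between the text
    embedding of the candidate and the image embedding, clipped at 0."""
    cos = np.dot(text_emb, image_emb) / (np.linalg.norm(text_emb) * np.linalg.norm(image_emb))
    return w * max(cos, 0.0)

# Random stand-ins for CLIP embeddings (in practice, from a pretrained CLIP model):
rng = np.random.default_rng(1)
text_emb, image_emb = rng.normal(size=512), rng.normal(size=512)
print(clipscore(text_emb, image_emb))
```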
The vision here is to drop the need for human annotation entirely. However, we have argued against these reference-less metrics on the grounds that they are insensitive to the purpose of the text and the context in which the image appeared. Those are crucial aspects when you think about what makes a description good, and it's a shame that these reference-less metrics miss them. Still, we are optimistic that we can design reference-less metrics that actually bring in these notions of quality.
To close, I thought I'd offer a more high-level comment about task-oriented evaluation. We've been very focused so far on comparisons with reference texts and other notions of intrinsic quality. But we should reflect on the fact that, by and large, we generate text in order to communicate something or to help an agent take some action. The classical off-the-shelf reference-based metrics will reflect those goals only to the extent that the human annotations did; if the annotations weren't oriented toward the task, then that won't be reflected in your evaluation.
You can imagine model-based metrics that are tuned to specific tasks and therefore task-oriented in nature, but that's currently a rare mode of evaluation.
I think it's fruitful, though, to think about what the goal of the generation actually is. Can an agent who receives the generated text use it to solve the task? We could ask directly whether the agent receiving the message reliably extracted the information we care about, or whether the message led the person to take a desirable action, which would be a more indirect measure of communicative success.
If you can design evaluations like that, you can imagine that they are much more tightly aligned with the goals that we actually have for our generation systems.