Stanford XCS224U: NLU | NLP Methods and Metrics, Part 2: Classifier Metrics | Spring 2023
Chapters
0:00
0:29 Overview
1:41 Confusion matrices
4:53 Accuracy and the cross-entropy loss
9:52 Averaging F scores
10:01 Macro-averaged F scores
11:24 Weighted average F scores
11:58 Micro-averaged F scores
13:20 Precision-recall curves
This is part two in our series on methods and metrics. In part one, I introduced some really high-level, overarching themes around methods and metrics. We're now going to do a deep technical dive on classifier metrics, because I imagine many of you will be dealing with classifiers for your final project. Even as we get into the technical details, I'd like to keep our eyes on those high-level themes.
Carrying over a lesson from the first screencast, I would emphasize that different evaluation metrics encode different values, and you should feel empowered to adjust or move beyond existing metrics depending on what your goals are. That's something that I emphasized in part 1 of this series: we should really be thinking about what our metrics measure and whether that matches what we actually care about. For established tasks, I grant that there will be pressure to stick with the metrics everyone else reports, but you can push back if you feel that's the right thing to do. We all have to be vigilant and push back if we think our metrics are leading us astray.
Let's begin with confusion matrices. To illustrate, I'm going to use a ternary sentiment problem, with the gold labels going across the rows and predictions going down the columns. This confusion matrix is saying, for example, how many cases were gold positive and that the system predicted as positive, and how many cases were gold positive while the system predicted neutral. Before we go further, let me emphasize for you that a threshold was likely imposed to get these counts: especially if you have a probabilistic classifier, what you got out was a probability distribution over the classes, and some decision rule had to be applied to figure out which one would count as the actual prediction. Obviously, different decision rules will give different confusion matrices, and therefore different values for every metric we're about to discuss.
Another thing that's worth keeping track of is the support, that is, the number of cases in the gold data that fall into each one of the categories. Here you can see that the data are highly imbalanced, and you should have in mind that that will be an important factor in choosing a good metric.
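To make that concrete, here is a minimal sketch, with made-up examples and probabilities rather than the ones behind the slide, of how a confusion matrix and the per-class supports arise once an argmax decision rule is imposed on a probabilistic ternary sentiment classifier (using scikit-learn):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["positive", "negative", "neutral"]

# Hypothetical gold labels for six examples:
y_true = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]

# Hypothetical predicted distributions over (positive, negative, neutral):
probs = np.array([
    [0.7, 0.1, 0.2],
    [0.3, 0.2, 0.5],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.1, 0.1, 0.8],
])

# Decision rule: take the highest-probability class. A different rule
# (say, one that favors small classes) would give a different matrix.
y_pred = [labels[i] for i in probs.argmax(axis=1)]

# Rows are gold labels, columns are predictions:
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Support: number of gold examples per class (the row sums).
print(dict(zip(labels, cm.sum(axis=1))))
```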
Now let's turn to the metrics themselves, starting with accuracy. Accuracy is the number of correct predictions divided by the total number of examples; in terms of the confusion matrix, that means that we sum up all the diagonal elements and divide that by the sum of all the elements in the table. The bounds are 0 and 1, and the value encoded is, in the simplest terms: how often is the system correct?
There are two weaknesses. First, accuracy gives us no per-class information. Second, we have a complete failure to control for class size. That is, accuracy is insensitive to how the classes you care about are distributed and to the way the system makes predictions for those classes. In our example, most of the cases are neutral and essentially all of the predictions are neutral. As a result, it hardly matters what the system does in terms of positive and negative, because the accuracy number is dominated by the neutral class. That could be good. It's giving us a picture of how your system performs on the most frequent case, and it will reflect the value that I've suggested here, how often the system is correct. But it might be directly at odds with our goals of really doing well on even the smallest categories.
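As a quick illustration with hypothetical counts (not the ones from the slide), accuracy is just the diagonal of the confusion matrix over the total, so the largest class dominates the number:

```python
import numpy as np

# Hypothetical ternary confusion matrix; rows are gold, columns are predictions.
cm = np.array([
    [15,  3,  30],   # gold positive
    [ 2, 10,  20],   # gold negative
    [ 5,  5, 400],   # gold neutral
])

accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))  # dominated by the 400 correct neutral predictions
```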
I'm going to offer you some metrics that address that. But one thing you should keep in mind is that if you train your classifier with the standard cross-entropy loss, you are implicitly optimizing your model for accuracy, because accuracy is inversely related to the negative log loss the model is minimizing. You might set goals for yourself, like good macro F1, that are in tension with that. For example, optimization processes tend to favor the largest classes, and this connection gives you a picture of why that happens for classifiers.
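Here is a small sketch of that connection, using made-up probabilities: the cross-entropy loss is the mean negative log-probability assigned to the gold class, so driving it down pushes probability mass onto correct predictions, which is what accuracy counts under an argmax rule.

```python
import numpy as np

def cross_entropy_loss(probs, true_idx):
    """Mean negative log-probability of the gold class."""
    return -np.mean(np.log(probs[np.arange(len(true_idx)), true_idx]))

# Hypothetical predicted distributions and gold class indices:
probs = np.array([
    [0.7, 0.1, 0.2],
    [0.3, 0.2, 0.5],
    [0.2, 0.6, 0.2],
])
y = np.array([0, 0, 1])

print(cross_entropy_loss(probs, y))
print((probs.argmax(axis=1) == y).mean())  # accuracy under an argmax rule
```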
One other technical note that I wanted to make concerns the general form of the cross-entropy. You can think of it as accuracy for soft labels, where you have a full probability distribution over the classes rather than a single true label. The reason we often simplify this away is that, typically for classification problems, there's exactly one label dimension that is true, and that means that for all other classes, the false ones, the corresponding terms zero out. That's how you get back to the standard formulations. But this is the general formulation, and you can in principle learn from soft labels, full distributions over the classes, and that will be fundamentally the same kind of operation as learning from hard labels.
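A minimal sketch of that general formulation, with a hypothetical predicted distribution: with a one-hot gold label, every term but the true class zeroes out, and we recover the usual negative log-probability of the gold label.

```python
import numpy as np

def soft_cross_entropy(gold_dist, pred_dist):
    # General cross-entropy between a gold distribution and a prediction.
    return -np.sum(gold_dist * np.log(pred_dist))

pred = np.array([0.7, 0.1, 0.2])

one_hot = np.array([1.0, 0.0, 0.0])   # standard hard label
soft    = np.array([0.6, 0.1, 0.3])   # a genuinely soft gold label

print(soft_cross_entropy(one_hot, pred))  # == -log(0.7)
print(soft_cross_entropy(soft, pred))     # general case; same operation
```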
But we do want to move away from raw accuracy, and the first step in doing that is precision. The precision for a class K is the correct predictions for K divided by everything the system predicted to be in class K. Here I've shown you the calculation for precision for the positive class, and you would do similar calculations for the negative and neutral classes. The bounds are 0 and 1, except that precision is undefined in cases where you would need to divide by 0, that is, where the system never predicts K; that's why you see lots of warnings about metrics when you encounter this case. The value encoded is penalizing incorrect guesses, and the corresponding weakness is that you can achieve high precision for a class K simply by rarely guessing K. If you just make sure you're very cautious about this class, you will get high precision in all likelihood, but that's not necessarily the full set of values we want to encode.
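In code, using the same hypothetical matrix as above (rows are gold, columns are predictions), per-class precision is the diagonal cell divided by its column sum:

```python
import numpy as np

cm = np.array([
    [15,  3,  30],
    [ 2, 10,  20],
    [ 5,  5, 400],
])

# Precision for class K: correct K predictions / all K predictions.
precision = np.diag(cm) / cm.sum(axis=0)
print(dict(zip(["positive", "negative", "neutral"], precision.round(3))))
```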
Recall is the dual of precision. The recall for class K is the correct predictions for K divided by the sum of all true members of K. The bounds are again 0 and 1, and I've given you the sample calculation for the positive class. The value encoded is that we're going to penalize missed true cases, and the corresponding weakness is that we can achieve high recall for K simply by always guessing K. If I want to be sure I don't miss any examples, I'll just guess constantly and increase my chances of not having any misses.
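Correspondingly, per-class recall is the diagonal cell divided by its row sum; again a sketch on the hypothetical matrix:

```python
import numpy as np

cm = np.array([
    [15,  3,  30],
    [ 2, 10,  20],
    [ 5,  5, 400],
])

# Recall for class K: correct K predictions / all gold members of K.
recall = np.diag(cm) / cm.sum(axis=1)
print(dict(zip(["positive", "negative", "neutral"], recall.round(3))))
```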
Now you can see very directly that we should balance recall against precision, and that is what F scores do. We standardly use F1, but we can in principle have this weight beta, which will control the degree to which we favor precision or recall. There are scenarios where precision is more important and scenarios where recall is more important, so it's worth thinking about how to align your metrics with those high-level values that you have. With beta set to 1, we get the F1 score, and what we're doing is simply taking the harmonic mean of precision and recall. For our running example, I've given the F1 scores along the rows for each one of those classes, combining each class's precision and recall as the harmonic mean of those two. The bounds are again 0 and 1, and the value encoded is: how much do the predictions for class K align with true instances of K, with beta controlling the weight placed on precision and recall? It's like both precision and recall have been baked into this notion of aligning with the truth. The weaknesses: there's no normalization for the size of the dataset, and the score ignores all the values off the row and column for K. When I compute the F1 score for, say, the positive class, I don't pay attention to any of those other values, no matter how many examples there are in those off elements. That's a structural bias that gets built in here, a place that these metrics miss when you think per class.
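Here is a small sketch of the F-beta score as that weighted harmonic mean; the `f_beta` helper and the plugged-in precision and recall values are illustrative, not taken from the slide.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta > 1 favors recall, beta < 1 favors precision, beta = 1 gives F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Positive class from the hypothetical matrix above (P = 15/22, R = 15/48):
print(f_beta(15 / 22, 15 / 48))          # F1
print(f_beta(15 / 22, 15 / 48, beta=2))  # recall-weighted F2
```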
That brings us to averaging F scores, where we have three main options: macro averaging, weighted averaging, and micro averaging. Let's start with macro averaging. This is the most dominant choice in the field. The reason is that we as NLPers tend to care about categories no matter how large or small they are; indeed, we often care more about the small classes than the large ones, because they're interesting or hard. The macro average simply averages across the per-class scores numerically: I simply take the average of these three F1 numbers. The value encoded is the same as for the F scores, plus the assumption that all classes are equal regardless of their size. The weaknesses: a classifier that does well only on small classes might still look good, which is the dual of caring a lot about small classes. Suppose you do obsess over positive and negative in this scenario and do well on them; if your system is encountering mostly neutral cases in practice, the macro average can paint too rosy a picture. Conversely, a classifier that does well only on large classes might do poorly by this metric. And there is the case where you might really care about those small classes even more than you care about the large one, and that's not reflected in the macro average either, because it treats every class the same.
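A sketch with scikit-learn and made-up labels: the macro average is just the plain mean of the per-class F1 scores, so every class counts equally regardless of its size.

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted labels for an imbalanced ternary problem:
y_true = ["pos", "pos", "neg", "neu", "neu", "neu", "neu", "neu"]
y_pred = ["pos", "neu", "neg", "neu", "neu", "neu", "neu", "pos"]

print(f1_score(y_true, y_pred, average="macro"))
```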
Weighted-average F scores are a straightforward variant where you simply take into account the total support, so it's a straight-up weighted numerical average of the three F1 scores, with each class weighted by its number of gold examples. The value encoded is the same as for the F scores, plus the assumption that class size does in fact matter. The weakness is that large classes will dominate the result, so we're back to that same weakness that we had for accuracy.
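The weighted average uses the same per-class F1 scores but weights each by its support; a sketch on the same made-up labels:

```python
from sklearn.metrics import f1_score

y_true = ["pos", "pos", "neg", "neu", "neu", "neu", "neu", "neu"]
y_pred = ["pos", "neu", "neg", "neu", "neu", "neu", "neu", "pos"]

# Each class's F1 is weighted by its number of gold examples:
print(f1_score(y_true, y_pred, average="weighted"))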
The final way of averaging F scores is called micro averaging. What you do here is take each one of the classes and form its own binary (yes/no) confusion matrix for that class, and then you add those matrices together to get a single binary table. The properties are again bounds of 0 and 1, and the value encoded centers on the yes category in that final table that you constructed. The weaknesses are the same as for the F scores, plus you end up with a score for yes and a score for no, which is annoying because what do you do with the no category? You could report both, but the yes one is just accuracy after all, and that, as far as I can tell, is the only one that everyone pays attention to. So overall, I feel like at this point you could just ignore micro-averaged F scores.
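A quick sketch of why the micro average is redundant here: for single-label multi-class problems, the score for the pooled yes category coincides with plain accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["pos", "pos", "neg", "neu", "neu", "neu", "neu", "neu"]
y_pred = ["pos", "neu", "neg", "neu", "neu", "neu", "neu", "pos"]

# Micro-averaged F1 pools the per-class binary tables; for single-label
# multi-class data it equals accuracy.
print(f1_score(y_true, y_pred, average="micro"))
print(accuracy_score(y_true, y_pred))
```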
You still see micro-averaged F scores in the literature and in results tables from scikit-learn, I believe. But overall, it comes down to macro averaging to abstract away from class size, or weighted averaging to bring the overall class sizes into the metric. Those are the two that I would suggest going forward, and only for fully balanced problems should you fall back to accuracy, where class size won't be an interfering factor.
Finally, I wanted to return to that observation I made at the start: it's irksome that we always have to impose a decision boundary, and we have to do the same thing with precision and recall. An alternative is to use precision-recall curves, which allow us to explore the full range of possible ways that our system could make predictions as we sweep the decision threshold. This offers much more information about the trade-offs between these two pressures, and could make it much easier for people to align system choices with the underlying values that they have for their system. I know it's impractical to ask this, because the field is so focused on single summary numbers, but I think it could be interesting to think about reporting precision-recall curves to get much more information. Then, if we do need to choose a single number, average precision is a summary of this curve that, again, avoids imposing a single decision boundary on precision and recall and brings in much more information, offering a very nuanced lesson about how systems are doing.
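A minimal sketch of a precision-recall curve and its average-precision summary for one class treated as a binary problem, using scikit-learn on hypothetical scores:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                   # 1 = the class of interest
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # model scores for that class

# Precision and recall at every threshold, rather than at one fixed boundary:
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(precision.round(2), recall.round(2))))

# A single-number summary of the whole curve:
print(average_precision_score(y_true, y_score))
```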