
Stanford XCS224U: NLU I NLP Methods and Metrics, Part 2: Classifier Metrics I Spring 2023


Chapters

0:00
0:29 Overview
1:41 Confusion matrices
4:53 Accuracy and the cross-entropy loss
9:52 Averaging F scores
10:01 Macro-averaged F scores
11:24 Weighted average F scores
11:58 Micro-averaged F scores
13:20 Precision-recall curves


00:00:00.000 | Welcome back everyone.
00:00:06.040 | This is part two in our series on methods and metrics.
00:00:09.160 | In part one, I introduced some really high-level
00:00:11.760 | overarching themes around methods and metrics.
00:00:14.760 | We're now going to do a deep technical dive on
00:00:17.280 | classifier metrics because I imagine many of you
00:00:20.160 | will be dealing with classifiers for your final project.
00:00:23.160 | Even though this is a deep technical dive,
00:00:25.240 | I'd like to keep our eyes on those high-level themes.
00:00:28.940 | In fact, let's start there with
00:00:30.720 | an overview of thinking about this.
00:00:32.900 | Carrying over a lesson from the first screencast,
00:00:35.480 | I would emphasize that different evaluation metrics
00:00:38.200 | encode different values.
00:00:40.600 | That means that choosing a metric is
00:00:42.600 | a crucial aspect of your experimental work.
00:00:45.300 | You need to think about your hypotheses,
00:00:47.680 | and your data, and your models,
00:00:49.700 | and how all of those come together to inform
00:00:52.160 | good choices around metrics,
00:00:54.160 | even if you are fitting something
00:00:56.060 | as seemingly simple as a classifier.
00:00:58.680 | You should feel free to motivate
00:01:00.840 | new metrics and specific uses of
00:01:03.200 | existing metrics depending on what your goals are.
00:01:06.180 | That's something that I emphasized in part 1 of
00:01:08.560 | this series that we should really be thinking about
00:01:11.400 | how our metrics align with
00:01:13.160 | the things that we're actually trying to do.
00:01:15.520 | For established tasks, I grant that there will
00:01:18.520 | usually be pressure to use specific metrics,
00:01:21.520 | the ones that are in the leaderboards
00:01:23.280 | or in the prior literature.
00:01:24.840 | But you should feel empowered to push
00:01:27.000 | back if you feel that's the right thing to do.
00:01:29.000 | After all, areas of research
00:01:31.360 | can stagnate due to poor metrics.
00:01:33.600 | We all have to be vigilant and push back if we think
00:01:36.800 | that a metric is just leading us astray.
00:01:40.480 | In that spirit, let's start with
00:01:44.200 | confusion matrices, the basis for
00:01:46.160 | many calculations in this area.
00:01:48.300 | As a running example,
00:01:49.640 | I'm going to use a ternary sentiment problem.
00:01:52.060 | I'll have the gold labels across
00:01:54.240 | the rows and predictions going down the columns.
00:01:57.600 | This confusion matrix is saying, for example,
00:01:59.960 | that there were 15 cases that were
00:02:02.640 | gold positive and that the system predicted as positive.
00:02:06.800 | Whereas there are 100 cases that were
00:02:08.960 | gold positive and the system predicted neutral.
00:02:12.240 | In the spirit of taking nothing for granted,
00:02:14.960 | let me emphasize for you that a threshold was likely
00:02:17.840 | imposed for these categorical predictions
00:02:20.120 | coming from the model.
00:02:21.400 | Especially if you have a probabilistic classifier,
00:02:24.000 | what you got out was a probability distribution
00:02:26.560 | over the three classes in this case,
00:02:29.400 | and you applied some decision rule to figure
00:02:32.000 | out which one would count as the actual prediction.
00:02:34.960 | Obviously, different decision rules will give
00:02:37.480 | you very different tables of results.
00:02:40.080 | In the background, you should have in
00:02:41.920 | mind that that is an ingredient here.
00:02:43.720 | In fact, at the end of the slideshow,
00:02:45.720 | I'll suggest a metric that allows you
00:02:47.540 | to pull back from that assumption.
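To make the role of the decision rule concrete, here is a minimal Python sketch. The probabilities, the function names, and the 0.3 threshold are illustrative assumptions, not values from the lecture. It shows one model output turning into two different categorical predictions under two different rules.

```python
def argmax_rule(probs):
    """Predict the class with the highest probability."""
    return max(probs, key=probs.get)


def threshold_rule(probs, target="positive", threshold=0.3):
    """Predict `target` whenever its probability clears `threshold`;
    otherwise fall back to the most probable remaining class."""
    if probs[target] >= threshold:
        return target
    rest = {label: p for label, p in probs.items() if label != target}
    return max(rest, key=rest.get)


# Hypothetical model output for a single example.
probs = {"positive": 0.35, "negative": 0.25, "neutral": 0.40}

print(argmax_rule(probs))
print(threshold_rule(probs))
```

Each rule would fill in the confusion matrix differently, which is why the rule is worth reporting alongside the metric.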
00:02:50.460 | Another thing that's worth keeping track of is
00:02:53.040 | the support, that is, the number of cases that, in
00:02:56.080 | the gold data, fall into each one of the categories.
00:02:59.320 | Here you can see that it is highly imbalanced,
00:03:01.920 | and you should have in mind that that will be
00:03:03.760 | an important factor in choosing a good metric.
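As a small sketch, support can be read off a confusion matrix as the row sums when gold labels are on the rows. Only the counts 15 and 100 in the first row are mentioned in the lecture; every other count below is an assumption chosen to give a similarly imbalanced table.

```python
labels = ["positive", "negative", "neutral"]

# Rows are gold labels, columns are predictions, both in the order of `labels`.
# Only 15 and 100 in the first row come from the lecture; the rest is assumed.
confusion = [
    [15, 10, 100],    # gold positive
    [10, 15, 10],     # gold negative
    [10, 100, 1000],  # gold neutral
]

# Support for a class is the number of gold examples in that class (row sum).
support = {label: sum(row) for label, row in zip(labels, confusion)}
print(support)
```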
00:03:07.840 | Let's begin with accuracy.
00:03:10.640 | Accuracy is the correct predictions
00:03:12.800 | divided by the total number of examples.
00:03:15.560 | Given a confusion table like this,
00:03:17.680 | that means that we sum up all the diagonal elements and
00:03:20.440 | divide that by the sum of all the elements in the table.
00:03:23.640 | That's accuracy.
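The calculation just described can be sketched in a few lines of Python. The table is an assumption apart from the 15 and 100 mentioned earlier; the other counts are made up for illustration.

```python
# Rows are gold labels, columns are predictions (positive, negative, neutral).
# Only 15 and 100 in the first row come from the lecture; the rest is assumed.
confusion = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]

# Sum of the diagonal (correct predictions) over the sum of all cells.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(f"accuracy = {correct}/{total} = {accuracy:.3f}")
```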
00:03:25.040 | The bounds for accuracy are zero and one,
00:03:27.480 | with zero the worst and one the best.
00:03:30.080 | The value encoded in accuracy is, in
00:03:33.980 | the simplest terms, how often is the system correct?
00:03:38.480 | That actually relates to two weaknesses.
00:03:40.960 | First, there is no per class metric.
00:03:43.240 | We have to do this over the entire table.
00:03:45.800 | Second, we have a complete failure to control for class size.
00:03:50.160 | Think about that value encoded.
00:03:51.840 | How often is the system correct?
00:03:53.520 | That is insensitive to the different classes that you
00:03:57.160 | have in your system and
00:03:59.080 | the way it makes predictions for those classes.
00:04:01.080 | It is just looking at the raw number
00:04:03.040 | of times that you made the right guess.
00:04:05.880 | Actually, our table is a good illustration
00:04:08.280 | of how this can be problematic.
00:04:09.880 | Essentially, all of the true cases are
00:04:12.080 | neutral and essentially all of the predictions are neutral.
00:04:15.720 | As a result, it hardly matters what you do for
00:04:18.400 | the system in terms of positive and negative because
00:04:21.400 | accuracy will be totally dominated
00:04:23.840 | by performance on that neutral category.
00:04:26.800 | That could be good. It's giving us a picture of how
00:04:29.440 | your system performs on the most frequent case,
00:04:32.120 | and it will reflect the value that I've suggested here.
00:04:35.400 | But it might be directly at odds with our goals of
00:04:38.760 | really doing well on even the smallest categories.
00:04:42.560 | Suppose that you do have a goal
00:04:45.680 | of doing well even on the small categories.
00:04:48.280 | I'm going to offer you some metrics for that.
00:04:50.760 | But one thing you should keep in mind again is that if you
00:04:54.200 | are using a cross-entropy loss,
00:04:56.520 | you are implicitly optimizing your model for accuracy because
00:05:00.440 | accuracy is inversely related to the negative log loss,
00:05:05.080 | that is, the cross-entropy loss.
00:05:07.320 | You might set goals for yourself that are like good macro F1.
00:05:11.320 | That's a metric I'll introduce in a second.
00:05:13.480 | But keep in mind that your system is
00:05:15.440 | actually oriented toward accuracy,
00:05:17.720 | and that will have consequences.
00:05:19.200 | For example, optimization processes tend to favor
00:05:22.800 | the largest classes and this is
00:05:24.960 | a picture of why that happens for classifiers.
00:05:28.600 | One other technical note that I wanted to make,
00:05:31.360 | the cross-entropy loss is actually
00:05:33.240 | a special case of the KL divergence loss.
00:05:36.280 | That's the loss for soft labels, where you have
00:05:39.520 | a full probability distribution over the classes.
00:05:42.880 | The reason we often simplify this away is that typically for
00:05:45.960 | classifiers we have a one-hot vector.
00:05:48.400 | There's exactly one label dimension that is true,
00:05:51.440 | and that means that for all other classes, the false ones,
00:05:55.240 | this ends up being a total of zero,
00:05:57.800 | and that means we can simplify it down to
00:05:59.800 | the negative log of the true class.
00:06:02.640 | That's how you get back to these standard formulations.
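The simplification can be checked numerically with a short sketch. The function names and the probability vectors are assumptions made for illustration. With a one-hot target, both the cross-entropy and the KL divergence reduce to the negative log of the probability assigned to the true class; with a soft target they differ by the entropy of the target.

```python
import math


def cross_entropy(target, predicted):
    """-sum_k target_k * log(predicted_k); zero-probability targets contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def kl_divergence(target, predicted):
    """sum_k target_k * log(target_k / predicted_k), skipping zero-probability targets."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


predicted = [0.1, 0.2, 0.7]  # hypothetical model distribution over three classes
one_hot = [0.0, 0.0, 1.0]    # the third class is the true one
soft = [0.1, 0.1, 0.8]       # hypothetical soft label

# One-hot target: both losses equal -log(probability of the true class).
print(cross_entropy(one_hot, predicted), kl_divergence(one_hot, predicted))

# Soft target: KL divergence equals cross-entropy minus the target's entropy.
target_entropy = -sum(t * math.log(t) for t in soft if t > 0)
print(kl_divergence(soft, predicted), cross_entropy(soft, predicted) - target_entropy)
```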
00:06:06.680 | But this is the general formulation and you can in principle learn from
00:06:09.960 | distributions over the labels that you have,
00:06:13.280 | and that will be fundamentally the same kind of operation
00:06:16.120 | with the same in-built biases.
00:06:19.200 | But we do want to move away from raw accuracy,
00:06:22.800 | and the first step to doing that is precision.
00:06:25.520 | The precision for a class K is the correct predictions for
00:06:29.240 | K divided by the sum of all guesses for K.
00:06:32.720 | We're going to operate column-wise here.
00:06:35.720 | Here I've shown you the calculation for precision for
00:06:38.840 | the positive class and we could do
00:06:40.400 | similar calculations for the negative and neutral classes.
00:06:44.040 | The bounds of precision are 0 and 1,
00:06:46.440 | with 0 the worst and 1 the best,
00:06:48.400 | with a small caveat that precision is
00:06:50.680 | undefined for cases where you would need to divide by 0.
00:06:54.440 | We just map those to 0 typically,
00:06:56.720 | and if you're using scikit-learn,
00:06:58.600 | you will see lots of warnings about metrics when you encounter this case.
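Here is a minimal sketch of the column-wise precision calculation, including the convention of mapping the undefined case to 0. The helper name `precision` and all counts other than 15 and 100 in the first row are assumptions for illustration.

```python
# Rows are gold labels, columns are predictions (positive, negative, neutral).
# Only 15 and 100 in the first row come from the lecture; the rest is assumed.
confusion = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def precision(confusion, k):
    """Correct predictions for class k divided by all guesses for k (column k).
    Returns 0.0 when class k is never guessed, following the usual convention."""
    guesses = sum(row[k] for row in confusion)
    return confusion[k][k] / guesses if guesses else 0.0


for k, label in enumerate(["positive", "negative", "neutral"]):
    print(label, round(precision(confusion, k), 3))

# A hypothetical two-class table in which class 0 is never predicted.
never_guessed = [
    [0, 5],
    [0, 5],
]
print(precision(never_guessed, 0))
```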
00:07:03.440 | The value encoded is penalizing incorrect guesses,
00:07:08.240 | and that leads directly to the weakness.
00:07:10.360 | You can achieve high precision for a class K simply by rarely guessing K.
00:07:16.240 | If you just make sure you're very cautious about this class,
00:07:20.160 | you will get high precision in all likelihood,
00:07:22.880 | but that's not necessarily the full set of values we want to encode.
00:07:27.320 | Typically, we balance that with recall.
00:07:31.000 | The recall for class K is the correct predictions for
00:07:34.200 | K divided by the sum of all true members of K.
00:07:37.320 | Here we're going to operate row-wise,
00:07:39.440 | and I've given you the sample calculation for the positive class.
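The row-wise counterpart can be sketched the same way. The helper name `recall` and all counts other than 15 and 100 in the first row are assumptions for illustration.

```python
# Rows are gold labels, columns are predictions (positive, negative, neutral).
# Only 15 and 100 in the first row come from the lecture; the rest is assumed.
confusion = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def recall(confusion, k):
    """Correct predictions for class k divided by all true members of k (row k).
    Returns 0.0 when class k has no gold examples."""
    members = sum(confusion[k])
    return confusion[k][k] / members if members else 0.0


for k, label in enumerate(["positive", "negative", "neutral"]):
    print(label, round(recall(confusion, k), 3))
```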
00:07:43.440 | The bounds are 0 and 1,
00:07:45.080 | with 0 the worst and 1 the best.
00:07:47.120 | The value encoded is that we're going to penalize missed true cases.
00:07:52.520 | It is a dual of precision,
00:07:56.000 | and that encodes its weakness as well.
00:07:58.280 | We can achieve high recall for K simply by always guessing K.
00:08:03.040 | If I want to be sure I don't miss any examples,
00:08:05.280 | I'll just guess constantly and increase my chances of not having any misses.
00:08:09.840 | Now you can see very directly that we should balance this against precision,
00:08:13.400 | which is imposing the opposite value.
00:08:15.960 | That's the usual motivation for F scores.
00:08:19.320 | Usually F1 scores,
00:08:21.120 | but we can in principle have this weight beta,
00:08:23.520 | which will control the degree to which we favor precision or recall.
00:08:27.680 | Again, no need to go on autopilot.
00:08:30.320 | There are scenarios where precision is important,
00:08:32.720 | and scenarios where recall is important,
00:08:34.840 | and you could use beta to align
00:08:36.920 | your metrics with those high-level values that you have.
00:08:40.720 | But by default, it's one,
00:08:42.240 | which is an even balance,
00:08:43.400 | and what we're doing is simply the harmonic mean of precision and recall.
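A sketch of the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), is given below. The function name `f_beta` is an assumption, and the precision and recall values come from an illustrative table rather than from the lecture slide. With beta = 1 this is the harmonic mean; beta above 1 pulls the score toward recall, and beta below 1 pulls it toward precision.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    Returns 0.0 when both inputs are zero, to avoid dividing by zero."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical values: precision 15/35 and recall 15/125 for one class.
p, r = 15 / 35, 15 / 125

f1 = f_beta(p, r)                  # even balance
f2 = f_beta(p, r, beta=2.0)        # leans toward recall, the lower value here
f_half = f_beta(p, r, beta=0.5)    # leans toward precision, the higher value here
print(round(f1, 4), round(f2, 4), round(f_half, 4))
```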
00:08:48.000 | This can be a per class notion.
00:08:50.400 | I've given the F1 scores along the rows for each one of those classes.
00:08:55.360 | The bounds are 0 and 1,
00:08:57.160 | with 0 the worst and 1 the best,
00:08:58.760 | and this is always going to be between
00:09:00.960 | precision and recall as the harmonic mean of those two.
00:09:04.360 | The value encoded is something like this.
00:09:07.080 | How much do the predictions for class K align with true instances of K,
00:09:13.040 | with beta controlling the weight placed on precision and recall?
00:09:16.480 | It's like both precision and recall have been
00:09:19.680 | baked into this notion of aligning with the truth.
00:09:23.160 | The weaknesses: there's no normalization for the size of the dataset,
00:09:28.520 | and it ignores all the values off the row and column for K.
00:09:32.560 | If I'm doing the F1 for the positive class,
00:09:35.880 | I don't pay attention to any of these other values,
00:09:38.320 | no matter how many examples there are in those off elements.
00:09:43.280 | That's a structural bias that gets built in here,
00:09:46.280 | a place that these metrics miss when you think per class.
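This blind spot can be demonstrated with a short sketch. The helper name `f1_for_class` and all counts other than 15 and 100 in the first row are assumptions for illustration. Inflating a cell that lies outside the positive row and column leaves the positive-class F1 unchanged.

```python
# Rows are gold labels, columns are predictions (positive, negative, neutral).
# Only 15 and 100 in the first row come from the lecture; the rest is assumed.
confusion = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def f1_for_class(confusion, k):
    """F1 for class k, written as 2*TP / (column sum + row sum)."""
    tp = confusion[k][k]
    guesses = sum(row[k] for row in confusion)  # TP + FP
    members = sum(confusion[k])                 # TP + FN
    denominator = guesses + members
    return 2 * tp / denominator if denominator else 0.0


# Inflate a cell off the positive row and column:
# gold negative, predicted neutral goes from 10 to 1000.
altered = [row[:] for row in confusion]
altered[1][2] = 1000

print(f1_for_class(confusion, 0), f1_for_class(altered, 0))  # positive: unchanged
print(f1_for_class(confusion, 1), f1_for_class(altered, 1))  # negative: changes
```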
00:09:51.440 | We can average F scores in multiple ways.
00:09:55.360 | I'm going to talk about three:
00:09:56.600 | macro averaging, weighted averaging, and micro averaging.
00:10:00.520 | Let's start with macro.
00:10:02.000 | This is the most dominant choice in the field.
00:10:04.680 | The reason is that we as NLPers tend to care about
00:10:07.840 | categories no matter how large or small they are.
00:10:10.960 | If anything, we often care more about
00:10:13.800 | the small classes than the large ones because they're interesting or hard.
00:10:17.640 | The macro average is simply going to average across them numerically.
00:10:21.400 | I simply do the average of these three numbers,
00:10:23.880 | so it gives equal weight to all three.
00:10:25.840 | Bounds are 0 and 1,
00:10:27.360 | 0 the worst and 1 the best.
00:10:28.920 | The value encoded is as I said,
00:10:30.840 | same as F scores plus the assumption that all classes are equal,
00:10:34.800 | regardless of size or support.
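As a sketch, macro averaging is the unweighted mean of the per-class F1 scores. The helper name and all counts other than 15 and 100 in the first row are assumptions for illustration.

```python
# Rows are gold labels, columns are predictions (positive, negative, neutral).
# Only 15 and 100 in the first row come from the lecture; the rest is assumed.
confusion = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def f1_for_class(confusion, k):
    """F1 for class k, written as 2*TP / (column sum + row sum)."""
    tp = confusion[k][k]
    denominator = sum(row[k] for row in confusion) + sum(confusion[k])
    return 2 * tp / denominator if denominator else 0.0


per_class_f1 = [f1_for_class(confusion, k) for k in range(len(confusion))]

# Every class gets the same weight, regardless of its support.
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print([round(f, 4) for f in per_class_f1], round(macro_f1, 4))
```

In this assumed table the two small classes score far lower than the large one, and the macro average reflects that directly.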
00:10:37.280 | The weaknesses: a classifier that does well only on small classes
00:10:42.400 | might not do well in the real world.
00:10:44.400 | That's the dual of caring a lot about small classes.
00:10:47.320 | Suppose you do obsess over positive and negative in this scenario,
00:10:51.080 | and you do really well on them,
00:10:52.520 | but at the cost of neutral.
00:10:54.160 | Well, in the real world,
00:10:55.600 | your system is encountering mostly neutral cases,
00:10:58.760 | and if it's failing on them,
00:11:00.600 | it might look like a really bad system.
00:11:03.560 | A classifier that does well only on large classes might do
00:11:07.040 | poorly on small but vital classes.
00:11:09.640 | This is the case where you might really care about
00:11:13.160 | those small classes even more than you care about the large one,
00:11:16.560 | and that's not reflected in the macro average because it
00:11:19.120 | simply takes them all as equal weight.
00:11:22.600 | Weighted averaging of F scores is a straightforward way to average where you
00:11:27.640 | simply take into account the total support, so it's
00:11:30.800 | a straight-up weighted numerical average of the three F1 scores.
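A minimal sketch of the support-weighted average follows. The helper name and all counts other than 15 and 100 in the first row are assumptions for illustration.

```python
# Rows are gold labels, columns are predictions (positive, negative, neutral).
# Only 15 and 100 in the first row come from the lecture; the rest is assumed.
confusion = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def f1_for_class(confusion, k):
    """F1 for class k, written as 2*TP / (column sum + row sum)."""
    tp = confusion[k][k]
    denominator = sum(row[k] for row in confusion) + sum(confusion[k])
    return 2 * tp / denominator if denominator else 0.0


supports = [sum(row) for row in confusion]  # gold examples per class
per_class_f1 = [f1_for_class(confusion, k) for k in range(len(confusion))]

# Each class's F1 is weighted by its share of the gold data.
weighted_f1 = sum(s * f for s, f in zip(supports, per_class_f1)) / sum(supports)
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(round(weighted_f1, 4), round(macro_f1, 4))
```

With this assumed table the weighted score sits well above the macro score, because the large neutral class carries most of the weight.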
00:11:36.720 | The bounds are 0 and 1,
00:11:38.600 | 0 the worst, 1 the best.
00:11:40.360 | The value encoded is the same as the F scores,
00:11:43.640 | plus the assumption that class size does in fact matter in this case.
00:11:46.840 | So this will be more like accuracy.
00:11:48.800 | The weakness, of course,
00:11:50.240 | is that large classes will heavily dominate.
00:11:53.160 | So we're back to that same weakness that we had for accuracy.
00:11:57.440 | The final way of averaging F scores is called micro averaging.
00:12:01.840 | What you do here is take each one of the classes and form
00:12:05.240 | its own binary confusion matrix for that class,
00:12:09.360 | and then you add them together and you get a single binary table.
00:12:14.520 | The properties for this are again bounds 0 and 1,
00:12:17.600 | 0 the worst and 1 the best.
00:12:18.920 | The value encoded is exactly the same as
00:12:22.000 | accuracy if you focus on
00:12:24.240 | the yes category in that final table that you constructed.
00:12:27.840 | It is exactly the same as accuracy.
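The construction can be sketched as follows, with a check that the F1 for the yes category of the summed table matches accuracy. The helper name `binary_table` and all counts other than 15 and 100 in the first row are assumptions for illustration.

```python
# Rows are gold labels, columns are predictions (positive, negative, neutral).
# Only 15 and 100 in the first row come from the lecture; the rest is assumed.
confusion = [
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000],
]


def binary_table(confusion, k):
    """One-vs-rest counts (TP, FP, FN, TN) for class k."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fp = sum(row[k] for row in confusion) - tp  # guessed k, gold is another class
    fn = sum(confusion[k]) - tp                 # gold k, guessed another class
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


# Add the per-class binary tables cell by cell to get a single binary table.
tables = [binary_table(confusion, k) for k in range(len(confusion))]
tp, fp, fn, tn = (sum(cells) for cells in zip(*tables))

micro_f1_yes = 2 * tp / (2 * tp + fp + fn)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / sum(
    sum(row) for row in confusion
)
print((tp, fp, fn, tn), round(micro_f1_yes, 4), round(accuracy, 4))
```

Each error in the original table is counted once as a false positive and once as a false negative, so the summed FP and FN are equal and the F1 for yes collapses to accuracy.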
00:12:30.640 | So the weaknesses are the same as for F scores, plus you get a score for yes and a score for no,
00:12:36.280 | which is annoying because what do you do with the no category?
00:12:39.360 | You have to focus on yes,
00:12:40.520 | so there's no single number,
00:12:42.000 | but the yes one was just accuracy after all, and that, as far as I can tell,
00:12:45.720 | is the only one that everyone pays attention to.
00:12:48.760 | So overall, I feel like at this point you could just ignore micro average F scores.
00:12:53.960 | You still see them in the literature and in results tables from scikit-learn, I believe.
00:12:58.600 | But overall, it's basically macro average to abstract away from class size,
00:13:03.600 | or weighted average to bring in the overall class size as an element in the metric.
00:13:09.400 | Those are the two that I would suggest going forward and only for
00:13:12.560 | fully balanced problems should you fall back to accuracy,
00:13:16.400 | where class size won't be an interfering factor.
00:13:19.600 | Finally, I wanted to return to that observation I made at the start,
00:13:23.680 | that it's irksome that we need to always impose a decision boundary.
00:13:28.680 | We have to do the same thing with precision and recall.
00:13:31.680 | We could think very differently about this.
00:13:34.040 | We could have, for example,
00:13:35.360 | precision and recall curves that would allow us to explore
00:13:38.160 | the full range of possible ways that our system could make predictions,
00:13:42.040 | given a decision boundary.
00:13:44.200 | This offers much more information about trade-offs between these two pressures,
00:13:48.640 | and could make it much easier for people to align
00:13:52.040 | system choices with the underlying values that they have for their system.
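A precision-recall curve can be sketched by sweeping the decision threshold over a model's scores for one class. The gold labels, the scores, and the function name below are hypothetical, made up for illustration.

```python
# Hypothetical binary problem: 1 marks a true member of the class of interest.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # model scores for that class


def precision_recall_at(threshold, gold, scores):
    """Precision and recall when every score >= threshold counts as a 'yes'."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# One point on the curve per distinct score used as the threshold.
for threshold in sorted(set(scores), reverse=True):
    precision, recall = precision_recall_at(threshold, gold, scores)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold can only raise recall, while precision moves up and down, and that trade-off is what the curve exposes.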
00:13:57.840 | I know it's impractical to ask this because the field is
00:14:00.760 | fairly focused on single summary numbers,
00:14:03.120 | but I think it could be interesting to think about
00:14:05.120 | precision-recall curves to get much more information.
00:14:08.360 | Then if we do need to choose a single number,
00:14:11.280 | average precision is a summary of this curve that again,
00:14:14.800 | avoids the decision about how we weight
00:14:17.160 | precision and recall and brings in much more information.
00:14:20.240 | You'll recognize this as analogous to
00:14:22.440 | the average precision calculation that we
00:14:24.840 | did in the context of information retrieval,
00:14:27.360 | where again, it offered a very nuanced lesson about how systems were doing.
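Under one common definition, average precision is the mean of the precision values measured at the rank of each true positive, with examples sorted by descending score. The sketch below assumes that definition, no tied scores, and hypothetical data; the function name is also an assumption.

```python
# Hypothetical binary problem: 1 marks a true member of the class of interest.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # model scores for that class


def average_precision(gold, scores):
    """Mean of precision-at-rank over the ranks that hold a true positive.
    Assumes no tied scores; returns 0.0 if there are no positives."""
    ranked_gold = [g for _, g in sorted(zip(scores, gold), reverse=True)]
    hits = 0
    precisions = []
    for rank, g in enumerate(ranked_gold, start=1):
        if g == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this cutoff
    return sum(precisions) / len(precisions) if precisions else 0.0


ap = average_precision(gold, scores)
print(round(ap, 4))
```

No threshold and no beta appear anywhere in the calculation, which is the sense in which this summary avoids those decisions.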