Stanford XCS224U: NLU | NLP Methods and Metrics, Part 2: Classifier Metrics | Spring 2023
Chapters
0:00
0:29 Overview
1:41 Confusion matrices
4:53 Accuracy and the cross-entropy loss
9:52 Averaging F scores
10:01 Macro-averaged F scores
11:24 Weighted average F scores
11:58 Micro-averaged F scores
13:20 Precision-recall curves
This is part two in our series on methods and metrics. In part one, I introduced some really high-level, overarching themes around methods and metrics. We're now going to do a deep technical dive on classifier metrics, because I imagine many of you will be dealing with classifiers for your final project. Even as we get into the technical details, I'd like to keep our eyes on those high-level themes.
Carrying over a lesson from the first screencast, I would emphasize that different evaluation metrics encode different values, and you should feel empowered to adjust or move beyond existing metrics depending on what your goals are. That's something that I emphasized in part 1 of this series: we should really be thinking about what our metrics measure and whether that matches what we actually care about. For established tasks, I grant that there will be pressure to stick with the metrics everyone else reports, but you can push back if you feel that's the right thing to do. We all have to be vigilant and push back if we think our metrics are leading us astray.
Let's begin with confusion matrices. To illustrate, I'm going to use a ternary sentiment problem, with the gold labels going across the rows and predictions going down the columns. This confusion matrix is saying, for example, how many cases were gold positive and that the system predicted as positive, and how many cases were gold positive while the system predicted neutral. Before we go further, let me emphasize for you that a threshold was likely imposed to get these counts: especially if you have a probabilistic classifier, what you got out was a probability distribution over the classes, and some decision rule had to be applied to figure out which one would count as the actual prediction. Obviously, different decision rules will give different confusion matrices, and therefore different values for every metric we're about to discuss.
Another thing that's worth keeping track of is the support, that is, the number of cases in the gold data that fall into each one of the categories. Here you can see that the data are highly imbalanced, and you should have in mind that that will be an important factor in choosing a good metric.
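To make that concrete, here is a minimal sketch, with made-up examples and probabilities rather than the ones behind the slide, of how a confusion matrix and the per-class supports arise once an argmax decision rule is imposed on a probabilistic ternary sentiment classifier (using scikit-learn):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["positive", "negative", "neutral"]

# Hypothetical gold labels for six examples:
y_true = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]

# Hypothetical predicted distributions over (positive, negative, neutral):
probs = np.array([
    [0.7, 0.1, 0.2],
    [0.3, 0.2, 0.5],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.1, 0.1, 0.8],
])

# Decision rule: take the highest-probability class. A different rule
# (say, one that favors small classes) would give a different matrix.
y_pred = [labels[i] for i in probs.argmax(axis=1)]

# Rows are gold labels, columns are predictions:
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Support: number of gold examples per class (the row sums).
print(dict(zip(labels, cm.sum(axis=1))))
```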
Now let's turn to the metrics themselves, starting with accuracy. Accuracy is the number of correct predictions divided by the total number of examples; in terms of the confusion matrix, that means that we sum up all the diagonal elements and divide that by the sum of all the elements in the table. The bounds are 0 and 1, and the value encoded is, in the simplest terms: how often is the system correct?
There are two weaknesses. First, accuracy gives us no per-class information. Second, we have a complete failure to control for class size. That is, accuracy is insensitive to how the classes you care about are distributed and to the way the system makes predictions for those classes. In our example, most of the cases are neutral and essentially all of the predictions are neutral. As a result, it hardly matters what the system does in terms of positive and negative, because the accuracy number is dominated by the neutral class. That could be good. It's giving us a picture of how your system performs on the most frequent case, and it will reflect the value that I've suggested here, how often the system is correct. But it might be directly at odds with our goals of really doing well on even the smallest categories.
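As a quick illustration with hypothetical counts (not the ones from the slide), accuracy is just the diagonal of the confusion matrix over the total, so the largest class dominates the number:

```python
import numpy as np

# Hypothetical ternary confusion matrix; rows are gold, columns are predictions.
cm = np.array([
    [15,  3,  30],   # gold positive
    [ 2, 10,  20],   # gold negative
    [ 5,  5, 400],   # gold neutral
])

accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))  # dominated by the 400 correct neutral predictions
```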
I'm going to offer you some metrics that address that. But one thing you should keep in mind is that if you train your classifier with the standard cross-entropy loss, you are implicitly optimizing your model for accuracy, because accuracy is inversely related to the negative log loss the model is minimizing. You might set goals for yourself, like good macro F1, that are in tension with that. For example, optimization processes tend to favor the largest classes, and this connection gives you a picture of why that happens for classifiers.
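Here is a small sketch of that connection, using made-up probabilities: the cross-entropy loss is the mean negative log-probability assigned to the gold class, so driving it down pushes probability mass onto correct predictions, which is what accuracy counts under an argmax rule.

```python
import numpy as np

def cross_entropy_loss(probs, true_idx):
    """Mean negative log-probability of the gold class."""
    return -np.mean(np.log(probs[np.arange(len(true_idx)), true_idx]))

# Hypothetical predicted distributions and gold class indices:
probs = np.array([
    [0.7, 0.1, 0.2],
    [0.3, 0.2, 0.5],
    [0.2, 0.6, 0.2],
])
y = np.array([0, 0, 1])

print(cross_entropy_loss(probs, y))
print((probs.argmax(axis=1) == y).mean())  # accuracy under an argmax rule
```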
One other technical note that I wanted to make concerns the general form of the cross-entropy. You can think of it as accuracy for soft labels, where you have a full probability distribution over the classes rather than a single true label. The reason we often simplify this away is that, typically for classification problems, there's exactly one label dimension that is true, and that means that for all other classes, the false ones, the corresponding terms zero out. That's how you get back to the standard formulations. But this is the general formulation, and you can in principle learn from soft labels, full distributions over the classes, and that will be fundamentally the same kind of operation as learning from hard labels.
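A minimal sketch of that general formulation, with a hypothetical predicted distribution: with a one-hot gold label, every term but the true class zeroes out, and we recover the usual negative log-probability of the gold label.

```python
import numpy as np

def soft_cross_entropy(gold_dist, pred_dist):
    # General cross-entropy between a gold distribution and a prediction.
    return -np.sum(gold_dist * np.log(pred_dist))

pred = np.array([0.7, 0.1, 0.2])

one_hot = np.array([1.0, 0.0, 0.0])   # standard hard label
soft    = np.array([0.6, 0.1, 0.3])   # a genuinely soft gold label

print(soft_cross_entropy(one_hot, pred))  # == -log(0.7)
print(soft_cross_entropy(soft, pred))     # general case; same operation
```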
But we do want to move away from raw accuracy, and the first step in doing that is precision. The precision for a class K is the correct predictions for K divided by everything the system predicted to be in class K. Here I've shown you the calculation for precision for the positive class, and you would do similar calculations for the negative and neutral classes. The bounds are 0 and 1, except that precision is undefined in cases where you would need to divide by 0, that is, where the system never predicts K; that's why you see lots of warnings about metrics when you encounter this case. The value encoded is penalizing incorrect guesses, and the corresponding weakness is that you can achieve high precision for a class K simply by rarely guessing K. If you just make sure you're very cautious about this class, you will get high precision in all likelihood, but that's not necessarily the full set of values we want to encode.
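In code, using the same hypothetical matrix as above (rows are gold, columns are predictions), per-class precision is the diagonal cell divided by its column sum:

```python
import numpy as np

cm = np.array([
    [15,  3,  30],
    [ 2, 10,  20],
    [ 5,  5, 400],
])

# Precision for class K: correct K predictions / all K predictions.
precision = np.diag(cm) / cm.sum(axis=0)
print(dict(zip(["positive", "negative", "neutral"], precision.round(3))))
```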
Recall is the dual of precision. The recall for class K is the correct predictions for K divided by the sum of all true members of K. The bounds are again 0 and 1, and I've given you the sample calculation for the positive class. The value encoded is that we're going to penalize missed true cases, and the corresponding weakness is that we can achieve high recall for K simply by always guessing K. If I want to be sure I don't miss any examples, I'll just guess constantly and increase my chances of not having any misses.
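Correspondingly, per-class recall is the diagonal cell divided by its row sum; again a sketch on the hypothetical matrix:

```python
import numpy as np

cm = np.array([
    [15,  3,  30],
    [ 2, 10,  20],
    [ 5,  5, 400],
])

# Recall for class K: correct K predictions / all gold members of K.
recall = np.diag(cm) / cm.sum(axis=1)
print(dict(zip(["positive", "negative", "neutral"], recall.round(3))))
```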
Now you can see very directly that we should balance recall against precision, and that is what F scores do. We standardly use F1, but we can in principle have this weight beta, which will control the degree to which we favor precision or recall. There are scenarios where precision is more important and scenarios where recall is more important, so it's worth thinking about how to align your metrics with those high-level values that you have. With beta set to 1, we get the F1 score, and what we're doing is simply taking the harmonic mean of precision and recall. For our running example, I've given the F1 scores along the rows for each one of those classes, combining each class's precision and recall as the harmonic mean of those two. The bounds are again 0 and 1, and the value encoded is: how much do the predictions for class K align with true instances of K, with beta controlling the weight placed on precision and recall? It's like both precision and recall have been baked into this notion of aligning with the truth. The weaknesses: there's no normalization for the size of the dataset, and the score ignores all the values off the row and column for K. When I compute the F1 score for, say, the positive class, I don't pay attention to any of those other values, no matter how many examples there are in those off elements. That's a structural bias that gets built in here, a place that these metrics miss when you think per class.
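Here is a small sketch of the F-beta score as that weighted harmonic mean; the `f_beta` helper and the plugged-in precision and recall values are illustrative, not taken from the slide.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta > 1 favors recall, beta < 1 favors precision, beta = 1 gives F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Positive class from the hypothetical matrix above (P = 15/22, R = 15/48):
print(f_beta(15 / 22, 15 / 48))          # F1
print(f_beta(15 / 22, 15 / 48, beta=2))  # recall-weighted F2
```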
That brings us to averaging F scores, where we have three main options: macro averaging, weighted averaging, and micro averaging. Let's start with macro averaging. This is the most dominant choice in the field. The reason is that we as NLPers tend to care about categories no matter how large or small they are; indeed, we often care more about the small classes than the large ones, because they're interesting or hard. The macro average simply averages across the per-class scores numerically: I simply take the average of these three F1 numbers. The value encoded is the same as for the F scores, plus the assumption that all classes are equal regardless of their size. The weaknesses: a classifier that does well only on small classes might still look good, which is the dual of caring a lot about small classes. Suppose you do obsess over positive and negative in this scenario and do well on them; if your system is encountering mostly neutral cases in practice, the macro average can paint too rosy a picture. Conversely, a classifier that does well only on large classes might do poorly by this metric. And there is the case where you might really care about those small classes even more than you care about the large one, and that's not reflected in the macro average either, because it treats every class the same.
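A sketch with scikit-learn and made-up labels: the macro average is just the plain mean of the per-class F1 scores, so every class counts equally regardless of its size.

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted labels for an imbalanced ternary problem:
y_true = ["pos", "pos", "neg", "neu", "neu", "neu", "neu", "neu"]
y_pred = ["pos", "neu", "neg", "neu", "neu", "neu", "neu", "pos"]

print(f1_score(y_true, y_pred, average="macro"))
```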
Weighted-average F scores are a straightforward variant where you simply take into account the total support, so it's a straight-up weighted numerical average of the three F1 scores, with each class weighted by its number of gold examples. The value encoded is the same as for the F scores, plus the assumption that class size does in fact matter. The weakness is that large classes will dominate the result, so we're back to that same weakness that we had for accuracy.
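The weighted average uses the same per-class F1 scores but weights each by its support; a sketch on the same made-up labels:

```python
from sklearn.metrics import f1_score

y_true = ["pos", "pos", "neg", "neu", "neu", "neu", "neu", "neu"]
y_pred = ["pos", "neu", "neg", "neu", "neu", "neu", "neu", "pos"]

# Each class's F1 is weighted by its number of gold examples:
print(f1_score(y_true, y_pred, average="weighted"))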
The final way of averaging F scores is called micro averaging. What you do here is take each one of the classes and form its own binary (yes/no) confusion matrix for that class, and then you add those matrices together to get a single binary table. The properties are again bounds of 0 and 1, and the value encoded centers on the yes category in that final table that you constructed. The weaknesses are the same as for the F scores, plus you end up with a score for yes and a score for no, which is annoying because what do you do with the no category? You could report both, but the yes one is just accuracy after all, and that, as far as I can tell, is the only one that everyone pays attention to. So overall, I feel like at this point you could just ignore micro-averaged F scores.
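A quick sketch of why the micro average is redundant here: for single-label multi-class problems, the score for the pooled yes category coincides with plain accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["pos", "pos", "neg", "neu", "neu", "neu", "neu", "neu"]
y_pred = ["pos", "neu", "neg", "neu", "neu", "neu", "neu", "pos"]

# Micro-averaged F1 pools the per-class binary tables; for single-label
# multi-class data it equals accuracy.
print(f1_score(y_true, y_pred, average="micro"))
print(accuracy_score(y_true, y_pred))
```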
You still see micro-averaged F scores in the literature and in results tables from scikit-learn, I believe. But overall, it comes down to macro averaging to abstract away from class size, or weighted averaging to bring the overall class sizes into the metric. Those are the two that I would suggest going forward, and only for fully balanced problems should you fall back to accuracy, where class size won't be an interfering factor.
Finally, I wanted to return to that observation I made at the start: it's irksome that we always have to impose a decision boundary, and we have to do the same thing with precision and recall. An alternative is to use precision-recall curves, which allow us to explore the full range of possible ways that our system could make predictions as we sweep the decision threshold. This offers much more information about the trade-offs between these two pressures, and could make it much easier for people to align system choices with the underlying values that they have for their system. I know it's impractical to ask this, because the field is so focused on single summary numbers, but I think it could be interesting to think about reporting precision-recall curves to get much more information. Then, if we do need to choose a single number, average precision is a summary of this curve that, again, avoids imposing a single decision boundary on precision and recall and brings in much more information, offering a very nuanced lesson about how systems are doing.
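A minimal sketch of a precision-recall curve and its average-precision summary for one class treated as a binary problem, using scikit-learn on hypothetical scores:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                   # 1 = the class of interest
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # model scores for that class

# Precision and recall at every threshold, rather than at one fixed boundary:
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(precision.round(2), recall.round(2))))

# A single-number summary of the whole curve:
print(average_precision_score(y_true, y_score))
```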