Stanford XCS224U: NLU | NLP Methods and Metrics, Part 1: Overview | Spring 2023
This screencast kicks off our series on methods and metrics. The overarching goal of this series is to help you work your way through your project and make smart choices around experiments and data and metrics and everything else, and also to be a trusted companion and sounding board as you confront really hard decisions around doing this kind of research.
Let's start with an overview of the goals and the current issues. Fundamentally, we're trying to help you with your projects. This comes down to things like managing data, establishing baseline systems, comparing models, optimizing models, and, maybe most importantly, navigating the many hard decisions that come up along the way. We can provide some high-level guidance here, and then you should look to your mentor from the teaching team to really help you optimize these things and get the most out of your project.
For associated materials, we have a big notebook on evaluation metrics that will let you get hands-on with some of the detailed technical material I cover in this series. There's also a really wonderful set of pages on the scikit-learn site about how to think about model evaluation in AI more generally. And we have an extensive notebook that lets you get hands-on with the more methodological material we're going to talk about; that notebook substantiates with experiments some of the claims I'll make at a high level here.
Then there are a few things you could check out in terms of reading, although honestly, there is much less reading than you would expect. This might be a sign that the field is still maturing. In the hard sciences and the behavioral sciences, there would be an extensive methods literature, whereas we seem to assume that people will just pick this up as they go. I'm not sure that's to the credit of the field, but that is the current situation. The thinking behind this unit is that we can provide somewhat more systematic guidance on all of these really crucial things that serve as the foundation for your projects.
I wanted to start by saying one really important thing about how we think about evaluating your final projects. The fundamental point is that we will never evaluate your project based on how good the results are, where we all know what "good" means: being at the top of the leaderboard or something similar.
I recognize that publication venues in the field do this. The rationale there is that they have additional constraints on space, and that leads them to favor positive results for new developments over negative results. As an aside, I think we can all see that this has a distorting effect on the scientific literature, and it is ultimately hindering progress, because negative results can be so powerful and useful in terms of instructing people about where to invest their energy and, in particular, where not to. So this is a kind of sad commentary, but I'm just observing that it's true.
Fundamentally, for our course, we are not subject to any constraints like that, so we can do the right and good thing of valuing positive results and negative results alike. What this means is that you could be at the top of the leaderboard according to your own experimentation, but if your report is shallow and doesn't really explain how that number got there or what the thinking behind it was, you won't do well.
More importantly, and conversely, you could try something really creative and ambitious. It might fail, but it could still be an outstanding contribution because of how careful you were about the things we actually care about: the appropriateness of the metrics, the strength of the methods, and the extent to which the paper is open and clear-sighted about the limitations of the ideas.
There is a wonderful movement in the field toward increasingly emphasizing finding the outer limits of our ideas and describing those limitations in our papers. I would exhort you to do similar things, and I think that, on average, this kind of work will lead you to more fruitful systems and more rewarding research experiences.
So I feel really good about this, and we can talk in the next unit about how to navigate tricky things related to this publication bias. I think fundamentally we can do right there as well. The shift in perspective you need to make is to move away from papers that are a competition favoring your chosen system and toward papers that openly evaluate scientific hypotheses and muster as much evidence as they can. Once you move into that mode, you're no longer thinking about how to pick a winner, but rather about strength of evidence and the quality of the argument. We can get there, and everyone will be happier as a result, but it does require a shift in perspective away from the norms we often hear about, which center on winning competitions.
For methods, I've put this under the heading "How Times Have Changed," and unfortunately I don't have very many happy lessons to teach here. Think back to an earlier era of the field, when models were small and training was cheap. In that era, you could develop your complete system on tiny samples of your data. Once you had it working, you would do regular cross-validation using only the training data, a nice pristine experimental setting. You would evaluate only very occasionally on your held-out dev set, in an effort to avoid hill-climbing on that dev set and ultimately overfitting to it. Then, in the final stages of your project, you would do a complete round of hyperparameter optimization, select the best model, and run your final test evaluation exactly once.
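To make that protocol concrete, here is a minimal sketch in scikit-learn. It is not from the course materials; the dataset, model, and hyperparameter grid are placeholders chosen only to illustrate the shape of the workflow.

```python
# A minimal sketch of the classical protocol described above, using placeholder
# data, a placeholder model, and a placeholder grid: cross-validate on the
# training data only, check the dev set sparingly, and run the test evaluation
# exactly once at the end.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Carve out held-out dev and test splits up front.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Regular cross-validation using only the training data.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# Occasional checks against the held-out dev set.
print("dev accuracy:", search.score(X_dev, y_dev))

# One and only one evaluation on the test set, at the very end.
print("test accuracy:", search.score(X_test, y_test))
```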
There is something deeply right about this as a scientific picture for our field.
In 2023, you could develop your system on tiny samples of your data, as before. However, for step two, either there is no training data for your task, or cross-validation on it would cost $20,000 and take six months. So already we are off course in terms of our ideal scientific picture. Relatedly, the dev set is frequently and crucially important to optimization, so you have to keep your fingers crossed that it really is a superb proxy for test, because you're going to orient all of your optimization processes around it. And then, for the final stage, either hyperparameter tuning would cost $100,000 and take 10 years, or there are no hyperparameters but test runs cost $4,000 because you're calling an OpenAI model or something like that.
The core tenets of the previous era remain sound. There's something really good about them, but we cannot enforce them. If we did, only the richest organizations could follow them, and that would restrict participation in the field in a way that would be terrible. If you're creating a ranked list of priorities, you have to put broad participation way above those very pristine and idealized methods that we used to be able to get away with back when every model trained quickly on a consumer laptop.
So what you have to do is articulate your methods and the rationale behind them, including practical details like resource constraints and heuristics that you had to adopt along the way.
However, two rules should remain absolutely fixed. First, you never do any model selection, even informally, based on test-set evaluations. I know there are people out in the field violating this rule, but don't follow their example. It's really important for us, especially when we think about the high-stakes scenarios our systems could be deployed in, to have pristine test evaluations that give us an honest look at how our systems will behave on unseen examples. You compromise that entirely the moment you choose a model based on test-set numbers.
Relatedly, as you construct baselines, ablations, and comparisons with the literature, you have to strive to give every system you evaluate the best chance of succeeding. You should never, ever stack the deck in favor of a system you are advocating for. All of these models have hyperparameters; you could pick really bad settings for models you disfavor and work really hard to find optimal settings for models you favor. In that way, you would appear to have won some kind of competition, but you would have compromised the very foundations of your project. What you need to do instead is give every system its best chance: work really hard to make all of them competitive. The result will be better science, results you can trust, and, ultimately, you will go farther in the field if you are rigorous about this rule as well.
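Here is a minimal sketch of those two rules in code; the data, models, and hyperparameter grids are invented placeholders. The point is that every system, including the baseline, gets the same tuning budget, selection happens on the dev set only, and the test set is touched exactly once at the end.

```python
# A sketch of the two fixed rules: identical tuning effort for every system,
# model selection on dev only, and a single test-set run at the very end.
# The data, models, and grids are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Every system, including the baselines, gets the same search budget.
systems = {
    "favored_model": (LogisticRegression(max_iter=1000),
                      [{"C": c} for c in (0.01, 0.1, 1.0, 10.0)]),
    "baseline":      (DecisionTreeClassifier(random_state=0),
                      [{"max_depth": d} for d in (2, 4, 8, 16)]),
}

selected = {}
for name, (model, grid) in systems.items():
    best_score, best_params = -1.0, None
    for params in grid:
        score = model.set_params(**params).fit(X_train, y_train).score(X_dev, y_dev)
        if score > best_score:  # selection uses the dev set only
            best_score, best_params = score, params
    selected[name] = (best_score, best_params)

# Only after every dev-based decision is frozen do we touch the test set, once.
for name, (dev_score, params) in selected.items():
    model, _ = systems[name]
    final = model.set_params(**params).fit(X_train, y_train)
    print(name, "dev:", round(dev_score, 3), "test:", round(final.score(X_test, y_test), 3))
```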
I have put this next part under the heading "How Times Should Change," and I do feel like they are changing. The overarching idea to have in mind here is Strathern's law: when a measure becomes a target, it ceases to be a good measure. We have to be vigilant and make sure we don't fall into this trap.
With this in mind, I'm always reminded of leaderboards. Leaderboards are central to the way the field works; we all think about them and use them as markers of progress. The good: leaderboards can be an objective basis for comparison, and that creates opportunities for even wild-seeming ideas to get a hearing. In fields without leaderboards, these wild ideas are very often rejected out of hand by the community with no evaluation, whereas leaderboards at least give people in our field a fair, quantitative hearing. The bad, though, and this can get really bad: with leaderboards we have a constant conflation of benchmark improvements with actual progress, when we know, in fact, that benchmarks are at best partial proxies for the problems we care about.
Relatedly, we have a conflation of benchmarks with empirical domains. People say things like "OCR is solved" or "question answering is solved." What they really mean is that certain benchmarks have been solved, and we are all aware by now that claims about a benchmark and claims about the underlying capability are radically different. Even in the way we talk, I find we're often guilty of a third thing: conflating benchmark performance with a capability. We see that a system does well at question answering on SQuAD and we assume it's a good question answerer, even though we know in our hearts that these two things are very different.
Moving forward, I think we should bring in more of the good, more dimensions of good, and remove our dependence on these bad assumptions that we often make. The fundamental issue, I would say, is that the metrics you choose, including the ones that get embedded in leaderboards, are tied up with the goals you're actually trying to achieve out in the world. Too often in the field, we don't actually make that connection.
Let me offer some scenarios that should get you thinking about how you would approach evaluation differently with different metrics in mind. Suppose you're in a scenario where missing a safety signal costs lives and human review of the system's predictions is feasible. What kind of metric would you favor for such a system? Conversely, suppose exemplars need to be found in a massive dataset. What kind of metrics would you use to evaluate systems in that context? I think they would be very different from the first case. In the second scenario, with the exemplars, we can afford to miss a lot of cases, whereas in the first scenario every missed case costs lives and we have the opportunity to review flagged cases by hand. So our values are oriented very differently.
Suppose specific mistakes are deal-breakers while others hardly matter. Now you want a metric that gives credit and demerits to different kinds of mistakes and good predictions in different ways, to capture those underlying ideals. Or suppose you're not really doing classification anymore but ranking; again, you should reach for good metrics designed for ranking.
Suppose the solution needs to work over an aging cell network. Now your obsession with accuracy should largely go out the window in favor of systems that can run on very constrained hardware: low energy, low power, very fast, all of those things. Suppose the solution cannot provide worse service to specific groups. Standard machine learning models will often favor majority groups, and if your ultimate allegiance is to making sure the system is equitable across groups, you will have to change your metrics from the norm, and maybe even your underlying modeling practices. Suppose specific predictions need to be absolutely blocked. Now you're in totally different territory, where some kinds of error cost you infinitely much and others hardly matter. Again, a very different scenario from the norm.
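To make the contrast concrete, here is a small, hedged sketch of how three of these scenarios might map onto different metrics: a recall-weighted F-beta score for the safety setting, precision@k for exemplar finding, and a per-group recall gap for the equity setting. The arrays are tiny invented examples, not data from any benchmark.

```python
# A sketch of tailoring the metric to the scenario rather than defaulting to F1.
# The label arrays, scores, and group assignments here are tiny invented examples.
import numpy as np
from sklearn.metrics import fbeta_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# Missed safety signals cost lives and human review is feasible: weight recall heavily.
print("F2 (recall-weighted):", fbeta_score(y_true, y_pred, beta=2.0))

# Finding exemplars in a massive dataset: what matters is precision at the top
# of a ranked list, not coverage of every positive case.
scores = np.array([0.9, 0.1, 0.4, 0.8, 0.2, 0.7, 0.95, 0.3, 0.6, 0.05])
k = 3
top_k = np.argsort(-scores)[:k]
print("precision@3:", y_true[top_k].mean())

# Service must not be worse for specific groups: report the per-group gap,
# not just the overall number.
group = np.array(["a", "a", "a", "b", "b", "b", "a", "b", "a", "b"])
recalls = {g: recall_score(y_true[group == g], y_pred[group == g]) for g in ("a", "b")}
print("per-group recall:", recalls, "gap:", abs(recalls["a"] - recalls["b"]))
```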
In the field, tragically, the scientific literature seems to offer one answer to essentially every scenario: use F1 and related accuracy-style metrics as your measure of system quality. If you review this list of scenarios, you can see that F1 is not really appropriate for any of them. F1 is just what we as researchers choose when we have no information about the application setting. When we do have that information, we should be tailoring our metrics to the specific scenario. I worry that the lesson we project out to the world is that you needn't bother: we don't do it, and we are purported to be experts, so why would anyone else do it? Even though, as experts, we can see that these different scenarios call for very different metrics.
Relatedly, if you do survey the scientific literature, you find a kind of overarching obsession with performance: accuracy, F1, all of those things. This is really nicely documented in a lovely and creative paper, "The Values Encoded in Machine Learning Research." What I've done here is distill their evidence down into a kind of cartoonish picture in which font size conveys how prominent each value is in our literature. In the largest font, unsurprisingly, is performance, dominating every other value we might want reflected in our research.
Close behind in second place, but actually pretty distant, is efficiency. Then you get interpretability, but notice that's interpretability for researchers; the interpretability work we talked about in the previous unit is very much oriented toward technical audiences. Applicability, robustness, and scalability are pretty well represented. And then I used a different, lighter font for values that are very distant in this ranking. We all recognize that these are crucial aspects of successful NLP systems, but they are hardly ever reflected in our practices around hypotheses and system evaluation. Really, if someone were just consuming our literature, what they would take away is, again, this obsession with accuracy and related notions of performance.
So we should push back; we should elevate some of those other values in the metrics and leaderboards we use. I've put this under the heading of multi-dimensional leaderboards. I've been involved with one such effort, Dynaboard; DAWNBench and ExplainaBoard are others. These are all efforts to provide many more dimensions of evaluation for our systems and to get much richer pictures of what's actually happening.
In this context, I would like to mention Dynascoring. I think this is a really powerful way to bring in multiple metrics and even let the person behind the system decide which metrics to favor and to what degree. It's such a powerful idea that I've offered you a notebook that implements Dynascoring and offers tips on how to use it, so that you too can explore using Dynascoring to synthesize across the multiple things you measure.
Let me give you a sense for why this could be so powerful. I have here a real leaderboard for question answering systems. The DeBERTa model is in first place according to my Dynascore, which was created by giving a lot of weight to performance and then equal weight to throughput, memory, fairness, and robustness. However, with Dynascoring I can adjust those weights. Suppose I decide that I really want a system that is highly performant but also fair according to my fairness measure. I adjust the Dynascore to put a weight of five on fairness, and I reduce the weights on throughput, memory, and robustness. Now the previously first-place system is in second place, and ELECTRA-large has become the top system. Of course, different weightings of the metrics will adjust the ranking in different ways. That shows you that there is no one true ranking, but rather rankings only with respect to the particular priorities, values, and measurements that I bring. That is the essence of Dynascoring: to be transparent about those values and also to reflect them in good old-fashioned leaderboards, as we're doing here.
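As a rough illustration of the idea, and not the official Dynascore formula (which normalizes metrics more carefully), here is a simple weighted-average sketch showing how re-weighting the same measurements can flip a ranking. The systems and their numbers are invented.

```python
# A simplified, weighted-average stand-in for Dynascore-style aggregation.
# The two systems and their metric values below are invented for illustration.
import numpy as np

metrics = ["performance", "throughput", "memory", "fairness", "robustness"]
systems = {
    "system_a": np.array([0.85, 0.60, 0.55, 0.55, 0.65]),
    "system_b": np.array([0.80, 0.35, 0.40, 0.90, 0.60]),
}

def rank(weights):
    w = np.array([weights[m] for m in metrics], dtype=float)
    w = w / w.sum()  # normalize the weights so scores stay comparable
    scores = {name: float(vals @ w) for name, vals in systems.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Heavy weight on performance, equal weight on everything else: system_a leads.
print(rank({"performance": 4, "throughput": 1, "memory": 1,
            "fairness": 1, "robustness": 1}))

# Re-weight to prioritize fairness: the ranking flips and system_b leads.
print(rank({"performance": 4, "throughput": 0.5, "memory": 0.5,
            "fairness": 5, "robustness": 0.5}))
```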
In this context, when we talk about evaluation and about different metrics, people sometimes push back: this is all too technical, too customized, too intricate; what we should do is something more like the Turing test. After all, that was the ultimate test in some sense. The idea is that a human and a computer are interacting: the human is trying to figure out whether it's a computer, and the computer is doing its best to seem human. In that way, we're supposed to get a good diagnosis of general system quality and intelligence.
The first Turing test was reported in Shieber 1994. In that test, the Shakespeare expert Cynthia Clay was thrice misclassified as a computer, on the grounds that no human could know that much about Shakespeare. That's an instance of people not really knowing what human expertise looks like in its full variety. In 2014, a very simple AI called Eugene Goostman passed the Turing test. It did so by adopting the persona of a 13-year-old boy: when it was rude, or appeared distracted because it was confused about what the human was trying to do, people just chalked that up to the fact that 13-year-old boys are often rude and distracted.
Google Duplex is a real, sophisticated AI system that routinely interacts with people over the phone. Even though it announces itself as an AI right at the start of the conversation, because it does this by law, people often lose track of that information and come to believe they are talking with a human. Relatedly, now that we've moved into this mode of doing a lot of natural language generation, we're all discovering that people are not good at distinguishing human-written texts from texts that come from our best large language models. Especially with the stories of Duplex and the LLMs, we should reflect on the fact that all of us are probably constantly failing Turing tests, in some cases with sophisticated AIs, but in some cases with ones that are actually pretty simple. There are cognitive biases around social interaction that make this not such a reliable test of machine intelligence.
There's another dimension to this that we should think about in the context of evaluation, and that is how we estimate human performance. My summary is that we estimate human performance by forcing humans to do machine tasks and then saying that that's how humans actually perform.
Let me give you an example in the context of natural language inference. Imagine you're a crowd worker who has been asked to label premise/hypothesis pairs as neutral, entailment, or contradiction. You get a little training, and after the training you see that "a dog jumping" and "a dog wearing a sweater" are neutral with respect to each other, because neither can be inferred from the other. Then you're given the pair "turtle" and "linguist," and you think, "Well, I can imagine a turtle linguist somewhere in some possible world, but I was told this was a common-sense reasoning situation, so I'll say contradiction, because no actual turtles are linguists."
But then you come to a photo of a racehorse and a photo of an athlete, and you're asked to assign a label, and you think, "Huh, I haven't really thought about this before. Is a racehorse an athlete?" You might decide that you have a fixed view on this. You say, "Of course a racehorse could be an athlete," or "Of course not." But the really fundamental thing is that you might be unsure what other people think about this, and in turn you might feel unsure about what label you're supposed to assign.
The human thing is to discuss and debate, to figure out why the question is being asked and what people are thinking about related to the issues. But what we do instead is block all of that interaction and simply force crowd workers to choose a label, and then we penalize them, in effect, to the extent that they don't choose the label everyone else chose, even though all of us feel uncertainty.
Here's another pair: a chef using a barbecue, and a person using a machine. Is that entailment? I think it probably depends on the situation, the goals, the assumptions, all of that. The human thing is to discuss those points of uncertainty and then assign a label, but we simply block that when we do crowdsourcing.
So when you hear an estimate of human performance, you should remember that the humans were probably not allowed to do most human things, like saying, "Let's discuss this." Human performance in these contexts really means the average performance of harried crowd workers doing a machine task repeatedly. We can all do that mental shorthand, but out in the world, people hear "human performance" and they think of human performance in the most significant sense. We should be aware that that's not what was measured, and we should push back against the assumption that the grand interpretation is what we mean, when in fact the narrow procedure is what we did.
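For concreteness, here is a sketch of one common recipe behind such "human performance" numbers: score each worker's forced-choice label against the majority vote of the remaining workers, with no discussion allowed. The annotations below are invented, and real projects differ in the details.

```python
# A sketch of one common recipe for "human performance": each crowd worker's
# forced-choice label is scored against the majority of the remaining workers'
# labels. The annotations are invented (E=entailment, N=neutral, C=contradiction).
from collections import Counter

# Five workers label four premise/hypothesis pairs.
annotations = [
    ["E", "E", "E", "N", "E"],   # broad agreement
    ["N", "N", "C", "N", "N"],
    ["C", "N", "C", "E", "C"],   # a "racehorse/athlete"-style unclear case
    ["N", "C", "E", "N", "C"],   # genuine disagreement drags the estimate down
]

correct, total = 0, 0
for labels in annotations:
    for i, label in enumerate(labels):
        others = labels[:i] + labels[i + 1:]
        majority = Counter(others).most_common(1)[0][0]
        correct += int(label == majority)
        total += 1

print(f"estimated 'human performance': {correct / total:.2f}")
```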
I would say that we're looking for evaluations that sit somewhere between two poles. On one side: can a system perform more accurately on a friendly test set than a human forced to do that same machine task? That is my somewhat cynical paraphrase of standard evaluations. On the other, we don't want to swing all the way to: can a system perform like a human in open-ended interaction? That is a very particular question, and it's very thorny. In the middle, there is lots of fruitful territory.
In the spirit of our previous units, we could ask: can a system behave systematically, even on inputs unlike those it was trained on? A system that can might be on its way to being one we can trust, even if it currently makes plenty of mistakes. Can a system assess its own confidence and know when not to make a prediction? Our systems in AI used to fail on every unanticipated input; now they give an answer, seemingly with confidence, no matter what you throw at them. We need systems to withhold information when they're just not sure it's good information.
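Here is a minimal sketch of that kind of selective prediction: the model abstains whenever its own confidence falls below a threshold, and we report accuracy only on the cases it chooses to answer, alongside coverage. The data, model, and threshold are placeholders.

```python
# A minimal sketch of selective prediction: abstain when confidence is low,
# trading coverage for accuracy. Data, model, and threshold are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)
confidence = probs.max(axis=1)
preds = probs.argmax(axis=1)

threshold = 0.75
answered = confidence >= threshold   # abstain on everything below the threshold

coverage = answered.mean()
accuracy_when_answering = (preds[answered] == y_test[answered]).mean()
print(f"coverage: {coverage:.2f}, accuracy when answering: {accuracy_when_answering:.2f}")
print(f"accuracy with no abstention: {(preds == y_test).mean():.2f}")
```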
And maybe most fundamentally, we should ask: can a system make people happier and more productive? This would move us far away from automatic evaluation and toward things like studies with real users. But ultimately, I feel like this is our goal, and we might as well design evaluations that target it directly.
I think that times should change, and they are changing. Assessment today, or maybe yesterday, is one-dimensional, centered on accuracy, and largely insensitive to context or use case. The terms are set by the research community, whether we realize it or not. The metrics are often opaque, the assessments are often hard to understand deeply, and they are tailored to machine tasks right from the get-go in the way they are constructed. Assessment tomorrow, or maybe today, depending on the work you all do, could be multidimensional. It could be highly sensitive to context and use case. The terms could be set by the stakeholders and system designers out in the world, or by the users themselves. Ultimately, the judgments should be made by users, and the tasks should look much more like the things people actually do. We have entered an era in which I think we can start to implement all of these visionary ideas, and so I would encourage you to think about how you could push forward in these directions with the research you do for this course.