
Stanford XCS224U: NLU | NLP Methods and Metrics, Part 6: Model Evaluation & Conclusion | Spring 2023


Transcript

Welcome back everyone. This is the sixth and final screencast in our series on methods and metrics. We're going to talk about model evaluation. This is a high-level discussion that is directly oriented toward helping you with your final project work. Here's an overview. We're going to talk about baselines. What are they?

Why are they important? We'll talk about the trials and tribulations of hyperparameter optimization and why it's important. We'll think about classifier comparison, a common mode to be in as you're evaluating systems and hypotheses. Then we'll talk about two things that are really particular to the deep learning era: how to assess models that don't converge in any strict sense, and the role of random parameter initialization in the performance of our biggest models.

Let's start with baselines. We take this for granted, but it's actually pretty important conceptually. Here's a fundamental observation about baselines: evaluation numbers in our field can never be understood properly in isolation. Suppose your system gets 0.95 F1. You feel overjoyed, but the first question reviewers will ask is: is the task too easy?

How do simple baselines do on the problem? Conversely, suppose your system gets 0.6 and you feel despair because you feel like you haven't had a success here. But the next question should be: how do humans do? They're presumably an upper bound. If it's a hard task or a noisy task, human performance might be close to 0.61, and you might really have achieved something meaningful there.

In that case, it's baseline models, and oracle models, that help us understand what was achieved. Baselines are also crucial for strong experimental design. Defining your baselines should not be an afterthought, but rather central to how you define your overall hypotheses. Think about simple systems, think about ablations of your target system, and incorporate those into your thinking about the comparisons you'll make.

Baselines are really just one aspect of the comparisons we want to offer. Baselines are essential for building a persuasive case; we saw that in my two examples there. To really understand and calibrate what you achieved, we need some baselines. They can also be used to illuminate specific aspects of the problem and specific virtues of your proposed system.

That often falls under the heading of ablations of your system. Those are baselines that remove crucial features or components and test the model with the same protocol. The distance between your chosen model and the ablated model is then an estimate of the importance of the ablated component to overall system performance.

That's a crucial aspect of arguing for and supporting your hypotheses. Random baselines are really useful for many purposes. First, they provide a true lower bound on how systems can do on your problem. Sometimes they are surprisingly robust, so it's worth running them early. They can also help you fully debug your system.

These are lightweight models that do relatively little processing, so they let you make sure that everything is functioning and makes sense. Scikit-learn again has you covered: it has DummyClassifier and DummyRegressor, which offer different strategies for acting as random models, and I think it's really useful to set these up early in your process.
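
For example, here's a minimal sketch of a random baseline with scikit-learn's DummyClassifier. The data here is a synthetic stand-in; substitute your own feature matrix and labels.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data; substitute your own X and y.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=0)

# "stratified" guesses randomly according to the training label distribution;
# "most_frequent" and "uniform" are other useful strategies.
baseline = DummyClassifier(strategy="stratified", random_state=0)
scores = cross_val_score(baseline, X, y, cv=5, scoring="f1_macro")
print(f"Random baseline macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```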

You could also think about task-specific baselines. This might require real thought and real study of the literature. Does your problem suggest a baseline that will reveal something about the problem or the way it's modeled? If so, you should have one of these task-specific baselines. Here are two recent examples from NLU.

The first is natural language inference. People discovered that so-called hypothesis-only baselines can be very strong. The reason this happens is that, in the underlying crowdsourcing effort, crowd workers were given premise sentences and asked to construct three hypotheses: one for neutral, one for contradiction, and one for entailment.

In that process of construction, they did some systematic things that inadvertently convey information about the label through the hypothesis. For example, many contradictions involve negation, and many entailment pairs involve very general terms in the hypothesis. What that means is that the hypothesis actually carries information about the label, and a hypothesis-only baseline quantifies that.

You simply fit a model without any premise information and see how you do. The finding in the literature is that, for many of our benchmarks, the hypothesis-only baseline is way above chance. What that shows you is that the random baseline is not so informative anymore.
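
Here's a minimal sketch of that idea. The toy examples and the simple bag-of-words classifier are hypothetical stand-ins; in practice you would load a real NLI dataset like SNLI or MultiNLI. The key point is that the premises never enter the model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy NLI data; in practice, load SNLI/MultiNLI train and dev splits.
train_hypotheses = [
    "A man is not sleeping.", "Nobody is outside.",
    "An animal is moving.", "Someone is eating food.",
]
train_labels = ["contradiction", "contradiction", "entailment", "entailment"]
dev_hypotheses = ["The dog is not barking.", "A person is playing a sport."]
dev_labels = ["contradiction", "entailment"]

# Only hypotheses are featurized: that's what makes this a hypothesis-only baseline.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(train_hypotheses, train_labels)
preds = model.predict(dev_hypotheses)
print("Hypothesis-only macro-F1:", f1_score(dev_labels, preds, average="macro"))
```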

There's a similar story for the Story Cloze task, where the goal is to distinguish between a coherent and an incoherent ending for a story. Again, systems that look only at the ending often do really well, I think for the same reason: coherent versus incoherent is often inferable just from the ending, neglecting the story. It's not that the task is necessarily broken, but rather, again, that you should think about this as a baseline to compare against.

Real progress is progress beyond this very specialized baseline. Next topic: hyperparameter optimization. This is discussed extensively in one of our background units on sentiment analysis; you might go there for a refresher. Here, I'll just briefly review the rationale. You might want to obtain the best version of your model, and that might mean exploring different hyperparameters to find an optimal setting.

Another motivation is about comparison between models. Suppose you do have a results table full of different systems you're comparing. It makes no sense to compare them under arbitrarily chosen hyperparameter settings, because you really want to give every model its best chance to shine. Otherwise, there's an arbitrariness to the evaluation that might not translate into robust results.

What you really do is give every system a chance by exploring a wide range of hyperparameters and reporting the optimal results according to that exploration. That's a fair comparison and it implies a lot of search over hyperparameters. You might want to understand the stability of your architecture. This is interestingly different.

This is where you're not interested in the best parameters, but rather in how stable system performance is under the various choices people might make. That gives you a sense of how robustly the system will perform if people are, say, not attentive to these hyperparameters, or set them incorrectly, whether by accident or in an adversarial setting.

Crucial to all of this, no matter what your goals are: hyperparameter tuning must be done only on train and development data. You never do model selection of any kind based on the test data. This is a special case of the rule that I've been repeating throughout the course. It's really fundamental to how we think about testing and generalization, and it applies with real force in the context of the kind of model selection we're doing here.

Now, hyperparameter optimization has gotten really challenging in the era of long-running expensive training regimes. Let me give you a sense for what the problem is by way of an example. For each hyperparameter, you identify a large set of values for it in some range. Then you create a list of all the combinations of all the hyperparameters.

This is the cross product of all the values for all the hyperparameters you identified in step 1. What you can hear in that description is exponential growth in the number of settings. For each setting, you cross-validate it on the available training data, which might imply 5, or 10, or 20 experiments.

Then you choose the settings that did best in step 3, you train on all the training data using that setting, and then you evaluate that model on the test set. That is a pristine version of the protocol that we might be implementing. But here's the problem: suppose hyperparameter h1 has 5 values and h2 has 10; then the total number of settings is 50.

Suppose we add a third hyperparameter h3 with just two values; now we're at 100. And suppose we do five-fold cross-validation to select optimal parameters; now we're at 500 runs. Very quickly, the number of experiments has exploded. If each of these runs takes a day, you're pretty much out of contention in terms of implementing this protocol completely.
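
To make the combinatorics concrete, here's what that pristine protocol might look like with scikit-learn's GridSearchCV. The model, grid values, and data are hypothetical stand-ins, a sketch rather than a recipe: 5 x 10 x 2 settings with 5-fold cross-validation is 500 fits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5 x 10 x 2 = 100 settings; with 5-fold cross-validation, that's 500 fits.
grid = {
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "eta0": [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0],
    "learning_rate": ["constant", "optimal"],
}
search = GridSearchCV(SGDClassifier(random_state=0), grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)

# Retraining on all the training data with the best setting happens inside
# GridSearchCV (refit=True by default); evaluate that refit model once on the test set.
print(search.best_params_)
print("Test macro-F1:", search.score(X_test, y_test))
```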

Something has to change. The above is untenable as a set of laws for the scientific community. We cannot insist on this level of hyperparameter optimization. If we adopted it, complex models trained on large datasets would end up disfavored and only the very wealthy would be able to participate. To give you a glimpse of this, here's a quote from a paper from a team at Google doing NLP for healthcare.

"The performance of all above neural networks were tuned automatically using Google Vizier with a total of over 200,000 GPU hours." That is a lot of money spent on a lot of compute. Obviously, we cannot insist on a similar level of investment for experiments for this course, or frankly for most contributions in the field. We have to make compromises.

Here are some reasonable compromises. These are pragmatic steps you can take to alleviate this resource problem. I've given them in descending order of attractiveness and I find that as the days go by, we need to go lower and lower on this list. You could do random sampling and guided sampling to explore a large space on a fixed budget.

This is nice because the full cross product of settings is too large, so you simply sample randomly in that space, maybe with some guidance from a model, and then, on a fixed budget of, say, 5 or 10 or 100 runs, you do an approximation of the full grid search.
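
A minimal sketch of random sampling on a fixed budget, again with a hypothetical model and data, using scikit-learn's RandomizedSearchCV: 20 sampled settings instead of the full cross product, with continuous hyperparameters drawn from distributions over a wide range.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

# Sample a fixed budget of 20 settings rather than enumerating the full grid.
search = RandomizedSearchCV(
    SGDClassifier(random_state=0),
    param_distributions={
        "alpha": loguniform(1e-6, 1e-1),
        "penalty": ["l2", "l1", "elasticnet"],
    },
    n_iter=20, cv=5, scoring="f1_macro", random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```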

You could also search based on just a few epochs of training. The expense comes from running many epochs, so maybe you do one or two and then pick the hyperparameters that were best at that point. If the learning curves are familiar and consistent, this will be a pretty strong approach.

You could also search based on subsets of the data. This is fine, but it can be risky, because we know some hyperparameters are very dependent on dataset size: you're selecting based on small data and applying the result to large data, even though you know that's probably a risky assumption. You could also do heuristic search: decide which hyperparameters matter less, set them by hand, and justify that in the paper.

That's increasingly common. People write things like, "We determined in our initial experiments that these hyperparameters had these optimal values or didn't matter that much, and so we chose these reasonable values." Then the actual search happens only over the ones that you can tell are important. We have to take your word for it that you've done the heuristic search responsibly, but this is obviously a really good way to balance exploration with constrained resources.

You could find the optimal hyperparameters via a single split and use them for all subsequent splits and then justify that based on the fact that the splits are similar. That would automatically cut down substantially on the number of runs you need to do because you don't need to do so much cross-validation in this mode.

Then finally, you could adopt others' choices. The skeptics will complain that these findings don't translate to new datasets, but it could be the only option. As I say, a few years ago, this was frowned upon, but now in the modern era with these massive models, it's basically the only option.

I think increasingly, people are simply carrying forward others' hyperparameters. It means less exploration, and we might not be seeing the best versions of these models, but then again, it might be the only option. In terms of tools for hyperparameter search, as always, scikit-learn is really rich with these things; it has a lot of this tooling (for example, GridSearchCV and RandomizedSearchCV, sketched above).

In addition, scikit-optimize is one level up in terms of sophistication: it lets you do model-guided search through a hyperparameter space, intelligently selecting good settings on a fixed budget to lead to a good model. Here's what that can look like.
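
This is a minimal sketch assuming scikit-optimize (skopt) is installed; the estimator and search space are hypothetical stand-ins. A surrogate model proposes promising settings, so a budget of 20 evaluations can cover a wide space more intelligently than pure random sampling.

```python
from skopt import BayesSearchCV
from skopt.space import Categorical, Real
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Model-guided search: each new setting is chosen based on the results so far.
search = BayesSearchCV(
    SGDClassifier(random_state=0),
    search_spaces={
        "alpha": Real(1e-6, 1e-1, prior="log-uniform"),
        "penalty": Categorical(["l2", "l1", "elasticnet"]),
    },
    n_iter=20, cv=5, scoring="f1_macro", random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```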

Next topic: classifier comparison. This is a short one, but it can be important. Suppose you've assessed two classifier models and their performance differs in some way. What can you do to establish whether these models are different in any meaningful sense? I think there are a few options. The first would be practical differences: if they obviously make a large number of different predictions, you might be able to quantify that difference in terms of some actual external outcome.

That's a really good scenario to be in. You could also think about confidence intervals to further bolster the argument that you're making. This will give us a picture of how consistently different the two systems are. If they are consistently different, then you have a very clear argument in favor of one over the other.

The Wilcoxon signed-rank test is an accepted method in the field for assessing classifiers, using a methodology similar to standard t-tests; I guess the consensus is that the assumptions behind the Wilcoxon test are somewhat more aligned with classifier comparison. To use it, as well as confidence intervals, you will have had to run your models under lots of different conditions to get a vector of 10-20 scores per model to use as the basis for the stats testing.
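
A minimal sketch with scipy, assuming you already have paired scores for the two systems from, say, 15 matched runs over different splits or seeds; the numbers here are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical paired F1 scores from 15 matched runs (same splits/seeds for both systems).
scores_a = np.array([0.71, 0.74, 0.69, 0.73, 0.72, 0.70, 0.75, 0.71,
                     0.73, 0.72, 0.74, 0.70, 0.72, 0.73, 0.71])
scores_b = np.array([0.69, 0.72, 0.70, 0.71, 0.70, 0.69, 0.73, 0.70,
                     0.71, 0.70, 0.72, 0.68, 0.70, 0.71, 0.69])

# 95% confidence interval for the mean difference, via the t distribution.
diffs = scores_a - scores_b
ci = stats.t.interval(0.95, len(diffs) - 1, loc=diffs.mean(), scale=stats.sem(diffs))
print("Mean difference:", diffs.mean(), "95% CI:", ci)

# Wilcoxon signed-rank test on the paired scores.
stat, p = stats.wilcoxon(scores_a, scores_b)
print("Wilcoxon statistic:", stat, "p-value:", p)
```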

If that is too expensive, you could opt for McNemar's test. This is a comparison of two individual trained classifiers based on a contingency table of where each one's predictions are right and wrong, so you only need one run per model. It will be unstable if the models are unstable, but it is a way of doing a stats test, in the mode of the chi-squared test, that gives you some information about how two fixed artifacts compare to each other.
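
Here's a minimal sketch, with made-up gold labels and predictions, using the McNemar implementation in statsmodels. The test focuses on the off-diagonal cells of the table, that is, the examples where exactly one of the two models is right.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical gold labels and predictions from two trained classifiers.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
preds_a = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)  # ~85% accurate
preds_b = np.where(rng.random(500) < 0.80, y_true, 1 - y_true)  # ~80% accurate

# 2x2 table over correctness for the two models.
a_right, b_right = preds_a == y_true, preds_b == y_true
table = [[np.sum(a_right & b_right), np.sum(a_right & ~b_right)],
         [np.sum(~a_right & b_right), np.sum(~a_right & ~b_right)]]
result = mcnemar(table, exact=False, correction=True)
print("McNemar statistic:", result.statistic, "p-value:", result.pvalue)
```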

Not as strong as the previous methods, but nonetheless useful and arguably better than nothing. A special topic for deep learning: how do you assess models without convergence? This never used to arise. Back in the days of linear models, these models would converge more or less instantly to epsilon loss, and that was the point at which you'd assess them.

But now, with neural networks, convergence has really taken center stage, and in a complicated way. First, these models rarely converge to epsilon loss, so that can't really serve as your stopping criterion. In addition, they might converge at different rates between runs, and their performance on the test set might not even be especially related to how small the loss got.

You need to be thoughtful about exactly what your stopping criteria will be. Yes, sometimes a model with low final error turns out to be great, and sometimes it's worse than the one that finished with a higher error. This might have something to do with overfitting and regularization, but the bottom line here is, we don't know a priori what's going on.

This is very experiment-driven. One thing to think about for stopping criteria in general is what we call incremental dev set testing. To address the uncertainty that I just reviewed, you regularly collect information about dev set performance as part of the training that you're doing. For example, at every 100th iteration, you could make predictions on the dev set and store the resulting vector of predictions.
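
Here's a minimal sketch of that pattern in PyTorch. The model, data loaders, and scoring function are hypothetical stand-ins, and the course models wrap up this logic for you; this is just to show the shape of the loop: evaluate on the dev set at regular intervals, keep the best checkpoint, and stop when dev performance stalls.

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, dev_loader, score_fn,
                              n_epochs=100, eval_every=1, patience=10):
    """Track dev-set performance during training and keep the best checkpoint."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_score, best_state, epochs_since_best = -float("inf"), None, 0
    for epoch in range(n_epochs):
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()
        if epoch % eval_every == 0:
            model.eval()
            preds, gold = [], []
            with torch.no_grad():
                for X_dev, y_dev in dev_loader:
                    preds.append(model(X_dev).argmax(dim=1))
                    gold.append(y_dev)
            score = score_fn(torch.cat(gold), torch.cat(preds))  # e.g., macro-F1
            if score > best_score:
                best_score, best_state = score, copy.deepcopy(model.state_dict())
                epochs_since_best = 0
            else:
                epochs_since_best += 1
            if epochs_since_best >= patience:  # stop when dev performance stalls
                break
    model.load_state_dict(best_state)  # return the best model, not the last one
    return model, best_score
```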

All the PyTorch models for this course have an early stopping parameter and a bunch of related parameters that will help you set this up in a way that will allow you to do this incremental testing, and with luck, get the best model. If you need a little bit of motivation for this, here are some plots from an actual model.

You can see the loss going down very steadily across iterations of training, but the performance on the dev set tells a very different story. At a certain very early point in this process, around iteration 10, performance was actually getting worse even though the loss was going down.

That just shows you that sometimes the steady loss curve is a picture of overfitting and not of your model actually getting better at the thing that you care about. Think carefully about your stopping criteria. In general though, I think we might want to take a more expansive view of how we do evaluation in this mode.

Here what I'm pitching is that we look at the entire performance curve, maybe with confidence intervals, so we can make some confident distinctions. All these plots, for different conditions across the models we were comparing, have epochs along the x-axis and F1 along the y-axis. If you step back, what I think you see is that our Mittens model, the one that we were advocating for, is the best model on average, but largely in early parts of training.

If you train for long enough, a lot of the distinctions disappear. If you have a fixed budget of epochs, Mittens is a good choice. If you don't care about the resources, it might not be so clear which one you should choose; maybe it doesn't matter. That's a nuanced lesson that I think is really powerful to teach, and we can't really teach it if all we do is offer point estimates of model performance.
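
One way to produce such a figure, a minimal sketch assuming you have per-epoch dev F1 scores from several runs of each model; the curves below are simulated stand-ins, plotted as a mean with a 95% confidence band.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
epochs = np.arange(1, 51)

def plot_with_ci(scores, label):
    """Plot the mean curve across runs with a 95% confidence band."""
    mean = scores.mean(axis=0)
    sem = scores.std(axis=0, ddof=1) / np.sqrt(scores.shape[0])
    plt.plot(epochs, mean, label=label)
    plt.fill_between(epochs, mean - 1.96 * sem, mean + 1.96 * sem, alpha=0.2)

# Simulated F1 curves for two hypothetical models, 10 runs each.
model_a = 0.75 * (1 - np.exp(-epochs / 5)) + rng.normal(0, 0.02, (10, 50))
model_b = 0.75 * (1 - np.exp(-epochs / 15)) + rng.normal(0, 0.02, (10, 50))
plot_with_ci(model_a, "Model A")
plot_with_ci(model_b, "Model B")
plt.xlabel("Epochs"); plt.ylabel("F1"); plt.legend()
plt.show()
```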

We really need to see the full curve to see that level of nuance. I know that NLPers love their results tables, and you should have results tables, but maybe you could supplement them with some figures that give a fuller picture of what was going on. Final topic: the role of random parameter initialization.

Most deep learning models have parameters that are random at the start. Even if they're pre-trained, there are usually some random parameters in the mix there. This is clearly meaningful for the non-convex problems that we're posing. Simpler models can be impacted as well, but it's especially pressing in the deep learning era.

Here is a relatively recent paper showing that different initializations for neural sequence models led to statistically significant differences in performance, and that a number of recent systems were actually indistinguishable in terms of their raw performance once we took this source of variation into account. There's a related issue here of catastrophic failure from unlucky initializations.

Sometimes that happens; sometimes you see it, and sometimes it's hard to notice. There's a question of how to report that as part of overall system performance. We need to be reflective about this. Maybe the bottom line here comes from the associated notebook for this unit, evaluation methods. There, using the classic XOR problem, which has always been used to motivate the powerful models that we work with now, I showed that a simple feed-forward network doesn't reliably succeed on that problem.

Eight out of 10 times it succeeds, and two out of 10 times it's a colossal failure. That is a glimpse of just how important initialization can be. Since we don't analytically understand why we're seeing this variation, the best response, if you can afford it, is a bunch more experiments.
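
Here's a minimal sketch of that kind of experiment, not the notebook's exact code: train a small feed-forward network on XOR from several different random initializations and count how many runs solve the problem.

```python
import torch

# The four XOR input/output pairs.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([0, 1, 1, 0])

def run_xor(seed, n_steps=2000):
    torch.manual_seed(seed)  # the only thing varying across runs is the initialization
    model = torch.nn.Sequential(
        torch.nn.Linear(2, 2), torch.nn.Tanh(), torch.nn.Linear(2, 2))
    optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    # Success means all four XOR cases are classified correctly.
    return (model(X).argmax(dim=1) == y).all().item()

successes = sum(run_xor(seed) for seed in range(10))
print(f"{successes}/10 runs solved XOR")
```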

Let's wrap up. A lot of this, in the back of my mind, is oriented toward helping you with the protocol, a document associated with your final project where you give us the nuts and bolts of the project and try to identify any obstacles to success. All the lessons we've been teaching throughout this series are oriented toward helping you think critically about this protocol and ultimately set up a firm foundation for your project.

With that out of the way, I thought I would look ahead to this moment and the future. I think this is an ideal moment for innovation in surprising new places. Architecture innovation: way overrated at this point. I mean, it's still important, but it is overrated relative to the amount of energy people pour into it.

Metric innovation: way underrated. It's been a theme of these lectures that we need to think very carefully about our metrics, because they are our guideposts for whether we're succeeding or not. Relatedly, evaluations in general: we need innovation there, and it's way underrated by the community at this point. Task innovation: underrated.

We are seeing some things, so it's not as bad as the previous two, but still, we should all be participating in this area. Then finally, exhaustive hyperparameter search: you need to weigh this against other factors. There is more at play here than just that pristine scientific paradigm. We need to think about costs in every sense and how they relate to the innovations that we're likely to see.

I'm pitching a pragmatic approach, but I'm also exhorting you to think expansively about how you might participate in pushing the field forward.