Stanford XCS224U: NLU | NLP Methods and Metrics, Part 6: Model Evaluation & Conclusion | Spring 2023
This is a high-level discussion that is directly oriented toward helping you with your final project work.
Here's an overview. We're going to talk about baselines; that might sound like a mundane topic, but it is actually pretty important conceptually.
Here's a fundamental observation about baselines: evaluation numbers mean very little in isolation. Suppose your system scores well on some metric. That might sound impressive, but the first question reviewers will ask you is how it compares with simple baselines. Conversely, suppose your system gets 0.6 and you feel discouraged by that number; if simple baselines do far worse, you might really have achieved something meaningful there.
The same reasoning applies to oracle models, which help us understand what the ceiling for a task might be.
Baselines are also crucial for strong experimental design. 00:01:51.180 |
Defining your baselines should not be some afterthought; it should be part of your experimental design from the start. In particular, think about ablations of your target system and what each of them can reveal.
Baselines are essential for building a persuasive case. To really understand and calibrate what you achieved, we need some baselines to put the numbers in context. Ablations play a similar role: they help surface the specific virtues of your proposed system. The difference in performance between your chosen model and an ablated model is an estimate of the contribution of the ablated component to the overall system performance.
Random baselines are really useful for many purposes. I think they can also help you fully debug your system, since these are lightweight models that are easy to run end to end. Scikit-learn makes this easy: it has DummyClassifier and DummyRegressor, which offer different strategies for acting as random models.
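As a minimal sketch of what that looks like (assuming scikit-learn and one of its toy datasets), a random baseline is only a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "stratified" samples labels according to the training distribution;
# "most_frequent" and "uniform" are other useful strategies.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train, y_train)
print("Random baseline accuracy:", baseline.score(X_test, y_test))

# A real (still lightweight) model to calibrate against:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Logistic regression accuracy:", model.score(X_test, y_test))
```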
You could also think about task-specific baselines. Does your problem suggest a baseline that will reveal something about the problem or the way it's modeled? If so, you should have one of these task-specific baselines.
A well-known example comes from natural language inference, where hypothesis-only baselines can be very strong. Because of how the datasets were constructed, annotators often signal the label with negation or other very general terms as part of the hypothesis. In other words, the hypothesis alone carries information about the label, and a hypothesis-only baseline quantifies that.
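Here is a minimal illustration of the idea; the premise-hypothesis pairs are toy examples I made up, and in a real experiment you would of course use actual NLI train and dev splits:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy (premise, hypothesis, label) triples standing in for a real NLI dataset.
train = [
    ("A man is playing a guitar.", "A person is making music.", "entailment"),
    ("A dog runs in the park.", "The dog is sleeping.", "contradiction"),
    ("Two kids are outside.", "The children are playing a game.", "neutral"),
]
dev = [("A woman reads a book.", "Someone is reading.", "entailment")]

# The hypothesis-only baseline simply ignores the premise.
X_train = [hyp for _, hyp, _ in train]
y_train = [label for _, _, label in train]
X_dev = [hyp for _, hyp, _ in dev]
y_dev = [label for _, _, label in dev]

baseline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("Hypothesis-only dev accuracy:", baseline.score(X_dev, y_dev))
```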
The finding in the literature is that, very often, hypothesis-only systems perform well above chance, which means the random baseline is not so informative anymore.
There's a similar story for the story cloze task, where systems choose between a coherent and an incoherent ending for a story. It turns out that the coherent-versus-incoherent distinction can often be made from the endings alone, without looking at the story itself. It's not that the task is broken here necessarily, but rather, again, that you should treat this as a baseline to compare against. Real progress is progress beyond this very specialized baseline.
The next topic is hyperparameter optimization, which we discussed in one of our background units on sentiment analysis. Here, I'll just briefly review the rationale.
One motivation is that you want to obtain the best version of your model, and hyperparameter tuning is how you find it.
Another motivation is about comparison between models. To compare systems fairly, what you really do is give every system a chance by exploring a wide range of hyperparameters and reporting each system at its best. Sometimes, though, the question is not peak performance but rather how stable system performance is under the various choices people might make, in order to get a sense for how robustly it will perform if people are, say, not attentive to these hyperparameters or set them poorly.
Crucial to all of this, no matter what your goals are, is that hyperparameter tuning happens only on train and development data, never on the test set. That is the rule that I've been repeating throughout the course. It is really fundamental to how we think about testing and generalization, and it applies with full force to the kind of model selection we're doing here.
Now, hyperparameter optimization has gotten really challenging as models have grown. Let me give you a sense for what the problem is by walking through the standard protocol. In step 1, you identify the hyperparameters that might matter and the values you want to consider for each. Then you create a list of all the combinations of values for all the features you identified in step 1; notice that this means an exponential growth in the number of settings as you add hyperparameters. In step 3, you evaluate each of those settings, typically via cross-validation on the training data, which might imply 5, or 10, or 20 experiments per setting. Then you choose the setting that did best in step 3, you train on all the training data using that setting, and then you evaluate that model on the test set.
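As a rough sketch of that protocol (assuming scikit-learn and one of its small built-in datasets), the whole loop might look like this:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-2: the hyperparameters we care about and all their combinations.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01, 0.001]}

# Step 3: cross-validation over the training data to score each setting.
# With refit=True (the default), the best setting is then retrained on
# all of the training data.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# Final steps: report the chosen setting and evaluate once on the test set.
print("Best setting:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```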
To see the cost, consider even a modest instance of the protocol that we might be implementing: a small grid of values combined with five-fold cross-validation to select optimal parameters. Very quickly, the number of experiments explodes, and once each individual run is expensive, you're pretty much out of contention in terms of actually implementing this protocol completely.
We cannot insist on this level of hyperparameter optimization across the board. Complex models trained on large datasets would end up requiring enormous resources: when papers report that the performance of all of their neural networks was the result of extensive tuning runs, that is a lot of money spent on a lot of compute. Obviously, we cannot insist on a similar level of investment for experiments, say, for this course, but frankly, the same is true for most contributions in the field, so we need more pragmatic strategies.
The first of these has a lot of attractiveness, and I find that as the days go by I lean on it more and more: random or guided sampling to explore a large space on a fixed budget.
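As a minimal sketch (assuming scikit-learn and SciPy), RandomizedSearchCV implements exactly this fixed-budget sampling idea:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)

# Instead of enumerating a full grid, sample a fixed number of settings.
param_distributions = {
    "alpha": loguniform(1e-6, 1e-1),
    "penalty": ["l2", "l1", "elasticnet"],
}
search = RandomizedSearchCV(
    SGDClassifier(random_state=0),
    param_distributions,
    n_iter=20,        # the fixed budget: 20 sampled settings
    cv=5,
    random_state=0,
)
search.fit(X, y)
print("Best sampled setting:", search.best_params_)
```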
You could also search based on just a few epochs of training. Much of the expense comes from running multiple epochs per setting, so you train each setting only briefly and extrapolate. If the learning curves are familiar and consistent, then this will be a pretty strong approach.
You could also search based on subsets of the data. This is fine, but it could be risky, because we know some hyperparameters are very dependent on dataset size. You search on the subset and carry the chosen values over to the full dataset, even though you know that's probably a risky assumption.
Another option is a more heuristic search: based on prior experience, you decide which hyperparameters matter less and then set those to values that you found optimal before or that didn't matter that much. The actual search then happens only over the ones that you believe are important. We have to take your word for it that you've done this responsibly, but it is a reasonable way to balance exploration with constrained resources.
You could also find the optimal hyperparameters via a single split and reuse them for the other splits, justifying that based on the fact that the splits are similar. That cuts down substantially on the number of runs you need to do, because you don't need to do so much cross-validation in this mode.
Then finally, you could adopt others' choices. The skeptics will complain that those findings might not transfer to your task or data, but now, in the modern era with these massive models, this may be the only viable option. We might not be seeing the best versions of these models as a result, but that may be a price we have to pay.
In terms of tooling, as always, the scikit ecosystem is really rich with these things. That's where you could do model-guided search through a hyperparameter grid in order to intelligently concentrate your budget on the most promising settings.
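One concrete option (my example, not necessarily the specific tool meant here) is BayesSearchCV from the scikit-optimize package, which fits a surrogate model to the results so far and uses it to choose the next settings to try; this sketch assumes scikit-optimize is installed and compatible with your scikit-learn version:

```python
from skopt import BayesSearchCV
from skopt.space import Categorical, Real
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# A surrogate model of the score surface decides which setting to try next,
# so promising regions of the space get sampled more heavily.
search = BayesSearchCV(
    SVC(),
    {"C": Real(1e-3, 1e3, prior="log-uniform"),
     "gamma": Real(1e-4, 1e0, prior="log-uniform"),
     "kernel": Categorical(["rbf", "poly"])},
    n_iter=25,   # total budget of settings to evaluate
    cv=5,
    random_state=0,
)
search.fit(X, y)
print("Best setting found:", search.best_params_)
```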
The next topic is classifier comparison. This is a short one, but it can be important. Suppose you've assessed two classifier models, and their performance is different in some way. How do you decide whether these models are different in any meaningful sense?
If you have enough runs, you might be able to quantify that difference in terms of a statistical test. You could also think about confidence intervals to further bolster the argument that you're making.
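Here is a minimal sketch of a bootstrap confidence interval on the difference between two classifiers; the per-example correctness vectors are simulated here just to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-example correctness (1 = correct) for two classifiers
# evaluated on the same 1,000 test examples.
model_a = rng.binomial(1, 0.78, size=1000)
model_b = rng.binomial(1, 0.75, size=1000)

# Bootstrap the accuracy difference by resampling test examples with replacement.
n = len(model_a)
diffs = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)
    diffs.append(model_a[idx].mean() - model_b[idx].mean())

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for the accuracy difference: [{lo:.3f}, {hi:.3f}]")
```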
The Wilcoxon signed-rank test is an accepted method in the field. It lets you argue using methodologies that are similar to standard t-tests, but its nonparametric assumptions are somewhat more aligned with classifier comparison. The idea is to run your systems under lots of different settings to get a vector of 10-20 scores to use as the basis for the stats testing. It will be unstable if the models are unstable, but it is a way of doing a stats test in the mode of classical hypothesis testing: imperfect, but nonetheless useful and arguably better than nothing.
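A minimal sketch with SciPy, where the paired scores are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical F1 scores for two systems run under the same 12 settings
# (e.g., different splits and seeds), paired position by position.
system_a = np.array([0.82, 0.79, 0.81, 0.84, 0.80, 0.83,
                     0.78, 0.82, 0.85, 0.81, 0.79, 0.83])
system_b = np.array([0.80, 0.78, 0.79, 0.82, 0.79, 0.81,
                     0.77, 0.80, 0.83, 0.80, 0.78, 0.81])

# The signed-rank test works on the paired differences and makes no
# normality assumption about them.
stat, p_value = stats.wilcoxon(system_a, system_b)
print(f"Wilcoxon statistic = {stat:.2f}, p = {p_value:.4f}")
```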
The next question is how to assess models without convergence. In the deep learning era, convergence takes center stage, and it does so in a complicated way. First, these models rarely converge to epsilon loss, and their performance on the test set might not even be especially related to how small the loss got.
One thing to think about for stopping criteria in general is what we call incremental dev set testing. To address the uncertainty that I just reviewed, you monitor dev set performance as part of the training that you're doing: at regular intervals, you make predictions on the dev set and track how those scores evolve. Most modern toolkits expose early stopping and related parameters that will help you set this up in a way that will allow you to do this incremental testing.
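Here is a generic sketch of that loop in PyTorch; the function and its arguments are illustrative rather than the course's own training code:

```python
import copy
import torch

def train_with_dev_checks(model, loss_fn, optimizer, train_loader,
                          X_dev, y_dev, score_fn, max_epochs=100, patience=5):
    """Evaluate on the dev set every epoch; stop when the dev score has not
    improved for `patience` epochs, and restore the best parameters."""
    best_score, best_state, since_best = -float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss_fn(model(X_batch), y_batch).backward()
            optimizer.step()
        # Incremental dev-set testing: the stopping signal comes from
        # dev performance, not from the raw training loss.
        model.eval()
        with torch.no_grad():
            preds = model(X_dev).argmax(dim=1)
        score = score_fn(y_dev, preds)
        if score > best_score:
            best_score, since_best = score, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            since_best += 1
            if since_best >= patience:
                break
    model.load_state_dict(best_state)
    return model
```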
If you need a little bit of motivation for this, consider a case where the training loss goes down steadily across different iterations of training, but the performance on the dev set tells a very different story. You can see, based on that dev performance, that at a certain very early point in the process, around iteration 10, the model stops getting better at the actual task. That just shows you that sometimes the steady loss curve is not a reliable signal of your model actually getting better at the thing that you care about.
Think carefully about your stopping criteria. 00:17:16.580 |
In general, though, I think we might want to take a more expansive view of how we do evaluation in this mode. For example, you can report full performance curves with confidence intervals so that we can make some confident distinctions. In one case from our own work, the plots for the different conditions across the models we were comparing have epochs along the x-axis and F1 along the y-axis. If you step back, what I think you see is that our Mittens model is the best model on average, but largely in the early parts of training.
Later in training, it might not be so clear which one you should choose. That's a nuanced lesson that I think is really powerful, and it's lost if all you do is offer point estimates of model performance. We really need to see the full curve to see that level of nuance. I know that NLPers love their results tables, but maybe you could supplement them with some figures that would give us a fuller picture of what was going on.
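As a sketch of the kind of figure I have in mind (with simulated scores standing in for real runs), you can plot mean F1 across epochs with a simple confidence band:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Simulated F1 trajectories: 5 runs x 50 epochs for two models being compared.
epochs = np.arange(1, 51)
runs_a = 0.70 + 0.15 * (1 - np.exp(-epochs / 10)) + rng.normal(0, 0.01, (5, 50))
runs_b = 0.70 + 0.12 * (1 - np.exp(-epochs / 25)) + rng.normal(0, 0.01, (5, 50))

for runs, label in [(runs_a, "Model A"), (runs_b, "Model B")]:
    mean = runs.mean(axis=0)
    # Simple normal-approximation 95% band over runs; a bootstrap would also work.
    half_width = 1.96 * runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])
    plt.plot(epochs, mean, label=label)
    plt.fill_between(epochs, mean - half_width, mean + half_width, alpha=0.3)

plt.xlabel("Epochs")
plt.ylabel("F1")
plt.legend()
plt.show()
```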
Final topic: the role of random parameter initialization. Most deep learning models have parameters that are random at the start of training, and even if you start from a pretrained model, there are usually some random parameters in the mix there. This is not a new issue, but it's especially pressing in the deep learning era.
Here is a relatively recent paper showing that different initializations for neural sequence models led to statistically significant differences in performance, and that a number of recent systems were actually indistinguishable in terms of their raw performance once this source of variation was taken into account. There is also the problem of catastrophic failure from unlucky initializations. Sometimes that happens, and when you see it, there's a question of how to report it as part of overall system performance.
This is explored in the associated notebook for this unit on evaluation methods. As I just showed you with the classic XOR problem, the problem that helped motivate the powerful models that we work with now, if you train a simple feed-forward network for that problem ten times, two out of the ten times it's a colossal failure. That is a glimpse of just how important initialization can be. Since we don't analytically understand why we're seeing this variation, the best response, if you can afford it, is a bunch more experiments.
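Here is a sketch in the spirit of that demonstration (not the notebook's own code): the same data, architecture, and optimizer, with only the random seed varying across runs:

```python
import torch
import torch.nn as nn

# The classic XOR problem.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([0, 1, 1, 0])

def run_once(seed, hidden=2, epochs=2000, lr=0.1):
    torch.manual_seed(seed)  # only the parameter initialization varies
    model = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return (model(X).argmax(dim=1) == y).float().mean().item()

accuracies = [run_once(seed) for seed in range(10)]
print("Per-seed accuracies:", accuracies)
print("Mean accuracy over 10 runs:", sum(accuracies) / len(accuracies))
```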
Let's wrap up. A lot of this, in the back of my mind, is oriented toward helping you with the protocol, which is a document associated with your final project where you give us the nuts and bolts of the project and try to identify any obstacles to success. All the lessons we've been teaching throughout this series are oriented toward helping you think critically about this protocol and ultimately set up a firm foundation for your project.
With that out of the way, I thought I would look ahead a little. I think this is an ideal moment for innovation in surprising new places. Architecture innovation? Way overrated at this point. It isn't worthless, but it is overrated relative to the amount of energy people pour into it.
It's been a theme of these lectures that we need to think very carefully about how we evaluate our systems. That kind of work is way underrated by the community at this point, but still, we should all be participating in this area.
Then finally, exhaustive hyperparameter search: you need to weigh this against other factors. There is more at play here than just that pristine scientific paradigm. We need to think about costs in every sense and how they relate to the innovations that we're likely to see. I hope all of this helps you think expansively about how you might participate in pushing the field forward.