
Stanford XCS224U: NLU | NLP Methods and Metrics, Part 5: Data Organization | Spring 2023


Chapters

0:00
0:15 Train/Dev/Test
1:27 No fixed splits
3:19 Cross-validation: Random splits
5:17 Cross-validation: K-folds

Transcript

Welcome back everyone. This is part five in our series on methods and metrics. This will be a short, focused, technical screencast about data organization. Within the field of NLP and indeed all of AI, we're all accustomed to having datasets that have train, dev, and test portions. This is common in our largest publicly available datasets.

It does presuppose a fairly large dataset, and that's in virtue of the fact that we hardly ever get to use the test set. As I've said repeatedly, in the field we're all on the honor system to do test-set runs only when all of system development is complete. That test set is under lock and key most of the time, which means it is hardly ever used during the course of scientific inquiry.

Having these fixed test sets is good because it ensures consistent evaluations: it's much easier to compare two models if they were evaluated according to exactly the same protocol. But it does have a downside. Because we always use the same test set, we get community-wide hill-climbing on that test set, as later papers learn indirect lessons about it from earlier papers in the literature, and that ends up inflating performance.

But on balance, I think train-dev-test has been good for the field of NLP. However, if you're doing work outside of NLP, you might encounter datasets that don't have predefined splits. That could be because they're small or because they're from a different field. For example, in psychology, you hardly ever get this train-dev-test methodology, and so datasets from that field, which you might want to make use of, are unlikely to have the predefined splits.

This poses a challenge for assessment because, as I said, for robust comparisons we really want all our models run under the same assessment regime, and that means using the same splits for all of your experimental runs. Now, for large datasets, you could just impose the splits yourself and then use them for the entire project.

That will simplify your experimental design, and it will also reduce the amount of hyperparameter optimization that you need to do. If you can get away with it, just impose the splits and maybe bake them into how people think about the dataset, the way we do in NLP now.
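
To make that concrete, here is a minimal sketch of imposing your own fixed splits with scikit-learn's train_test_split. The texts and labels arrays, the 60/20/20 proportions, and the random seed are all hypothetical choices for illustration, not something prescribed here.

```python
# Minimal sketch: imposing fixed train/dev/test splits on a dataset that
# ships without them. `texts` and `labels` are hypothetical stand-ins.
from sklearn.model_selection import train_test_split

texts = [f"example {i}" for i in range(1000)]   # placeholder inputs
labels = [i % 2 for i in range(1000)]           # placeholder binary labels

# First carve off a held-out test set, then split the rest into train/dev.
# Fixing random_state makes the splits reproducible across every run of the
# project; stratifying keeps the label distribution similar in each portion.
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

print(len(X_train), len(X_dev), len(X_test))  # 600 200 200
```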

But for small datasets, imposing fixed splits might simply leave you with too little data in each portion, and that can lead to highly variable system assessments. Either you train on too few examples in order to leave enough for assessment, which makes the trained systems noisy, or you leave too few examples to assess on, which makes the resulting assessments noisy and highly variable. It's hard to get that right. In these situations, I think what you should do is think about cross-validation.

In cross-validation, we take a set of examples and partition them into two or more train/test splits. We run a bunch of system evaluations, and then we aggregate over those scores in some way, usually by taking an average, and we report that as a measure of system performance. There are two broad methods that you can use for this kind of cross-validation.

The first is very simple; I've called it random splits here. The idea is that, for k splits, that is, k times, you shuffle your dataset, split it into T percent train and, usually, 100 minus T percent test so that all the data gets used, and then you conduct an evaluation.

You repeat that k times, which gives you a vector of scores, and then you aggregate those scores in some way. Usually you would take an average, but you could also report an average plus a confidence interval, or run some kind of statistical test that would tell you how two systems differ according to this regime.
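
As a small illustration of that aggregation step, here is a sketch that computes a mean and a t-based 95 percent confidence interval over a vector of scores; the scores themselves are made up.

```python
# Sketch: aggregating per-split scores into a mean plus a 95% confidence
# interval. The scores below are invented for illustration.
import numpy as np
from scipy import stats

scores = np.array([0.71, 0.74, 0.69, 0.73, 0.72])  # one hypothetical score per split

mean = scores.mean()
# t-based interval around the mean, using the standard error of the mean.
ci_low, ci_high = stats.t.interval(
    0.95, len(scores) - 1, loc=mean, scale=stats.sem(scores))

print(f"mean = {mean:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```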

Usually, but not always, you want these splits to be stratified, in the sense that the train and test splits have approximately the same distribution over the classes or output values, to give you consistent evaluations. Now for the trade-offs. The good part is that you can create as many experiments as you want without having that impact the ratio of training to testing examples.

The value of k here is separate from the values of T and 100 minus T. What that means is that you can run lots of experiments and independently set the number of train examples and the number of assessment examples. That's certainly to the good. The bad part is that you get no guarantee that every example will be used the same number of times for training and testing, because the shuffling introduces a lot of randomness.

Frankly, for reasonably sized datasets, this downside is minimal. I really like random splits, and I would worry about this only in situations in which you have a very small dataset. Finally, scikit-learn has lots of utilities for doing these random splits, and I would encourage you to use them; they've worked out nice, reliable code that will help you with these protocols.
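
For instance, here is a minimal sketch of the random-splits protocol using scikit-learn's StratifiedShuffleSplit; the logistic regression classifier, toy dataset, and accuracy metric are stand-ins for whatever system and evaluation you are actually working with.

```python
# Minimal sketch: k = 10 random, stratified 80/20 splits, each used for one
# train/evaluate cycle, with the per-split scores averaged at the end.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit

X, y = make_classification(n_samples=500, random_state=0)  # toy data

# Stratification keeps the label distribution roughly the same in train and test.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

scores = []
for train_idx, test_idx in splitter.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], preds))

print(f"mean accuracy over {len(scores)} random splits: {np.mean(scores):.3f}")
```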

Now, in some situations, you might instead want to do what's called k-fold cross-validation, which is somewhat different. Let's imagine we have a dataset and we have divided it ahead of time into three folds, that is, three disjoint parts.

Then, in experiment 1, our test fold is fold 1 and we train on folds 2 and 3 together. In experiment 2, we test on fold 2 and train on folds 1 and 3. In experiment 3, we test on fold 3 and train on folds 1 and 2. We've covered all of the combinations.

Our three folds give us three separate experiments, and then we aggregate results across all three of them. Let's think about our trade-offs again. The good part is that every example appears in a train set exactly k minus 1 times and in a test set exactly once. We get a nice, pristine experimental setting in that regard.

The bad, though, is really bad to my mind: the size of k determines the size of the train set. If I do 3-fold cross-validation, I get to train on 67 percent of the data and test on 33 percent. But if I want to do 10-fold cross-validation, now I have to train on 90 percent and test on 10 percent.

The number of experiments has gotten entwined with the train/test percentages that I want to have, and that's really problematic. You might want a lot of folds, that is, a lot of experiments, but nonetheless train on only 80 percent of the data in each case.

That leads me to prefer the random-splits approach in almost all settings, because its downside is small relative to the confound that k-fold cross-validation introduces. Finally, I'll just note that scikit-learn again has you covered: it has lots of great utilities for doing k-fold cross-validation in various ways.

Do make use of them to make sure that your protocols are the ones that you wanted.
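
To close, here is a minimal sketch of the 3-fold protocol described above, using scikit-learn's StratifiedKFold together with cross_val_score; as before, the classifier and toy dataset are hypothetical stand-ins.

```python
# Minimal sketch: 3-fold cross-validation, where every example is tested on
# exactly once and trained on exactly k - 1 = 2 times.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)  # toy data

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print(f"per-fold accuracy: {np.round(scores, 3)}, mean: {scores.mean():.3f}")
```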