Stanford XCS224U: NLU I NLP Methods and Metrics, Part 5: Data Organization I Spring 2023

Chapters
0:15 Train/Dev/Test
1:27 No fixed splits
3:19 Cross-validation: Random splits
5:17 Cross-validation: K-folds
00:00:06.000 | 
This is part five in our series on methods and metrics. 00:00:11.180 | 
technical screencast about data organization. 00:00:15.080 | 
Within the field of NLP and indeed all of AI, 00:00:18.140 | 
we're all accustomed to having datasets that have 00:00:23.440 | 
This is common in our largest publicly available datasets. 00:00:37.080 | 
we're all on the honor system to do test set runs 00:00:40.020 | 
only when all of system development is complete. 00:00:43.400 | 
That test set is under lock and key most of the time, 00:00:49.320 | 
used during the course of scientific inquiry. 00:00:53.320 | 
Having these fixed test sets is good because it ensures 00:00:58.480 | 
It's much easier to compare two models if they were 00:01:01.200 | 
evaluated according to exactly the same protocol. 00:01:08.520 | 
we get a community-wide hill climbing on that test set as 00:01:16.000 | 
the test set from earlier papers in the literature, 00:01:23.040 | 
train-dev-test has been good for the field of NLP. 00:01:26.920 | 
However, if you're doing work outside of NLP, 00:01:40.000 | 
you hardly ever get this train-dev-test methodology, 00:01:55.900 | 
we really want to have all our models run using 00:02:01.200 | 
using the same splits for all of your experimental runs. 00:02:15.960 | 
hyperparameter optimization that you need to do. 00:02:19.880 | 
just impose the splits and maybe bake that into 00:02:22.680 | 
how people think about the dataset in NLP now. 00:02:28.160 | 
imposing these splits might simply leave you with 00:02:35.880 | 
Either you're training on too few examples to have a lot of 00:02:39.640 | 
examples for assessment and that causes some noise, 00:02:42.560 | 
or you're leaving too few examples to assess on, 00:02:53.000 | 
I think what you should do is think about cross-validation. 00:02:56.940 | 
In cross-validation, we take a set of examples 00:03:00.240 | 
and partition them into two or more train test splits. 00:03:04.260 | 
We run a bunch of system evaluations and then we aggregate over 00:03:08.560 | 
those scores in some way usually by taking an average and we 00:03:11.860 | 
report that as a measure of system performance. 00:03:28.240 | 
you shuffle your dataset and then you split it into 00:03:38.800 | 
You repeat that k times and you get a vector of 00:03:41.320 | 
scores and then you aggregate those scores in some way. 00:03:46.680 | 
but you could also think about an average plus 00:03:48.760 | 
a confidence interval or some kind of stats test that would 00:03:51.520 | 
tell you about how two systems differ according to this regime. 00:03:58.380 | 
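As a rough illustration of the random-splits procedure just described, here is a minimal sketch in plain Python. The function names, the toy dataset, and the majority-class "system" are all invented for illustration. The interval uses a simple normal approximation, which is an assumption: random splits overlap, so the k scores are not independent, and the interval should be read as a rough guide only.

```python
import random
import statistics


def random_split_scores(examples, evaluate, k=10, train_frac=0.8, seed=0):
    """Run k shuffle-and-split experiments and return the list of k scores.

    `evaluate(train, test)` is a hypothetical callback that trains a system
    on `train` and returns a single score on `test`.
    """
    rng = random.Random(seed)
    n_train = int(len(examples) * train_frac)
    scores = []
    for _ in range(k):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        scores.append(evaluate(train, test))
    return scores


def mean_with_interval(scores, z=1.96):
    """Mean plus a normal-approximation interval (an assumption, see above)."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - half_width, mean + half_width)


def majority_baseline(train, test):
    """Toy system: predict the most common training label, return accuracy."""
    labels = [label for _, label in train]
    guess = max(set(labels), key=labels.count)
    return sum(label == guess for _, label in test) / len(test)


# Toy dataset (invented): 90 examples, two thirds labeled "pos".
data = [(i, "pos" if i % 3 else "neg") for i in range(90)]
scores = random_split_scores(data, majority_baseline, k=10, train_frac=0.8)
mean, (low, high) = mean_with_interval(scores)
print(f"mean accuracy {mean:.3f}, interval ({low:.3f}, {high:.3f})")
```

Note that `k` and `train_frac` are separate arguments, so the number of experiments can grow without changing how much data each one trains on.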
you want these splits to be stratified in the sense that 00:04:06.280 | 
output values to give you consistent evaluations. 00:04:10.720 | 
Trade-offs. Well, the good part of this is that you can create 00:04:14.680 | 
as many experiments as you want without having 00:04:17.880 | 
this impact the ratio of training to testing examples. 00:04:21.320 | 
The value of k here is separate from the value of T and one minus T. 00:04:26.600 | 
What that means is that you can run lots of experiments and 00:04:32.200 | 
train examples or the number of assessment examples. 00:04:37.240 | 
The bad here is that you don't get a guarantee that 00:04:40.080 | 
every example will be used the same number of 00:04:53.960 | 
I really like random splits and I would worry about 00:04:56.960 | 
the bad only in situations in which you have a very small dataset. 00:05:10.560 | 
nice reliable code that will help you with these protocols. 00:05:17.440 | 
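For instance, assuming the library in question is scikit-learn, stratified random splits might be set up roughly as follows. The toy data and the `DummyClassifier` placeholder system are invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Toy data (invented for illustration): 100 examples, 20% positive.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

# 10 random splits, each holding out 20% while preserving the label ratio.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Count the positives in each test set to check the stratification.
positives_per_test = [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]
print(positives_per_test)

# The splitter can be passed as `cv` to get one score per split.
model = DummyClassifier(strategy="most_frequent")  # placeholder system
scores = cross_val_score(model, X, y, cv=splitter)
print(scores.mean())
```

`ShuffleSplit` takes the same arguments and does the same thing without stratification.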
you might instead want to do what's called k-fold cross-validation, 00:05:22.640 | 
Let's imagine we have a dataset and we have divided it ahead of time into 00:05:30.640 | 
Then we have experiment 1 where our test fold is 00:05:34.560 | 
fold 1 and we train on folds 2 and 3 together. 00:05:38.280 | 
Experiment 2, we test on fold 2 and train on 1 and 3. 00:05:45.880 | 
For experiment 3, we test on fold 3 and train on 1 and 2. 00:05:51.880 | 
Our three folds give us three separate experiments, 00:05:54.920 | 
and then we aggregate results across all three of the experiments. 00:06:01.480 | 
The good part is that every example appears in a train set 00:06:05.320 | 
exactly k minus 1 times and in a test set exactly once. 00:06:09.440 | 
We get a nice pristine experimental setting in that regard. 00:06:17.700 | 
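The three-fold walkthrough and the k minus 1 property can be checked with a small sketch in plain Python. The helper `k_fold_splits` is a hypothetical name, and the round-robin fold assignment is just one simple choice.

```python
from collections import Counter


def k_fold_splits(examples, k):
    """Yield (train, test) pairs, using each of the k folds as the test set once."""
    # Assign examples to folds round-robin; shuffle beforehand if order matters.
    folds = [examples[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test


examples = list(range(9))
train_counts, test_counts = Counter(), Counter()
for train, test in k_fold_splits(examples, k=3):
    train_counts.update(train)
    test_counts.update(test)

# Every example is trained on k - 1 = 2 times and tested on exactly once.
print(set(train_counts.values()), set(test_counts.values()))
```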
The size of k determines the size of the train set. 00:06:24.360 | 
I get to train on 67 percent of the data and test on 33. 00:06:28.920 | 
But if I want to do 10-fold cross-validation, 00:06:31.640 | 
now I have to train on 90 percent and test on 10. 00:06:34.960 | 
It feels like the number of experiments has gotten 00:06:40.560 | 
the percentage of train and test that I want to have, 00:06:48.340 | 
but nonetheless train on only 80 percent of the data in each case. 00:06:54.880 | 
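The coupling between k and the train/test ratio is simple arithmetic: with k equal-sized folds, each experiment trains on (k - 1)/k of the data and tests on 1/k. A tiny sketch, with `fold_fractions` as a hypothetical helper name:

```python
def fold_fractions(k):
    """Return (train, test) fractions of the data for each k-fold experiment."""
    return (k - 1) / k, 1 / k


for k in (3, 5, 10):
    train_frac, test_frac = fold_fractions(k)
    print(f"k={k}: train {train_frac:.0%}, test {test_frac:.0%}")
```

Training on 80 percent therefore pins k at 5, so k-fold alone cannot give 10 experiments at that ratio.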
the random splits approach in almost all settings, 00:06:57.800 | 
because the bad there was relatively small relative to 00:07:06.080 | 
Finally, I'll just note that Scikit again has you covered. 00:07:11.560 | 
this k-folds cross-validation in various ways.
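As one possible illustration using scikit-learn, `KFold` and `StratifiedKFold` cover the plain and stratified variants. The toy data and the `DummyClassifier` placeholder system below are invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Toy data (invented for illustration): 90 examples, one third positive.
X = np.arange(90).reshape(-1, 1)
y = np.array([0, 0, 1] * 30)

# Plain 3-fold: each experiment trains on two folds and tests on the third.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kfold.split(X)]
print(sizes)

# Stratified 3-fold keeps the label ratio the same in every fold.
stratified = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
model = DummyClassifier(strategy="most_frequent")  # placeholder system
scores = cross_val_score(model, X, y, cv=stratified)
print(scores, scores.mean())
```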