Stanford XCS224U: NLU | NLP Methods and Metrics, Part 5: Data Organization | Spring 2023
Chapters
0:00
0:15 Train/Dev/Test
1:27 No fixed splits
3:19 Cross-validation: Random splits
5:17 Cross-validation: K-folds
This is part five in our series on methods and metrics. It is a technical screencast about data organization.
Within the field of NLP, and indeed all of AI, we're all accustomed to having datasets that come with fixed train/dev/test splits. This is common in our largest publicly available datasets. We're all on the honor system to do test set runs only when all of system development is complete. That test set is under lock and key most of the time, only rarely used during the course of scientific inquiry.

Having these fixed test sets is good because it ensures consistent evaluations: it's much easier to compare two models if they were evaluated according to exactly the same protocol. A downside is that we get community-wide hill climbing on that test set, as later work absorbs indirect lessons about the test set from earlier papers in the literature. On balance, though, train-dev-test has been good for the field of NLP.
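As a concrete illustration (my own sketch, not code from the screencast), here is one way you might impose a fixed train/dev/test split with scikit-learn. The synthetic data, the 80/10/10 ratio, and the variable names are illustrative assumptions.

# Sketch: imposing a fixed train/dev/test split with scikit-learn.
# The 80/10/10 ratio and the synthetic data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# Carve off 20% for dev+test, then split that portion in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

print(len(X_train), len(X_dev), len(X_test))  # 800 100 100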
However, if you're doing work outside of NLP, you hardly ever get this train-dev-test methodology, and if you're building a new dataset, you have to decide how to organize it yourself. For comparability, we really want to have all our models run under the same assessment protocol, which means using the same splits for all of your experimental runs, including any hyperparameter optimization that you need to do.

If your dataset is large enough, you can just impose the splits and maybe bake that into how people think about the dataset, as is standard in NLP now. But if your dataset is small, imposing these splits might simply leave you with too little data on one side or the other. Either you're training on too few examples in order to have a lot of examples for assessment, and that causes some noise, or you're leaving too few examples to assess on, which is also a source of noise. In that situation, I think what you should do is think about cross-validation.
In cross-validation, we take a set of examples and partition them into two or more train/test splits. We run a bunch of system evaluations, then we aggregate over those scores in some way, usually by taking an average, and we report that as a measure of system performance.
The first approach is cross-validation with random splits. You shuffle your dataset and then you split it into a training portion and a testing portion at whatever ratio you choose. You repeat that k times, which gives you a vector of scores, and then you aggregate those scores in some way. The simplest aggregation is an average, but you could also think about an average plus a confidence interval, or some kind of statistical test that would tell you how two systems differ according to this regime. Ideally, you want these splits to be stratified, in the sense that each split has roughly the same distribution of output values, to give you consistent evaluations.
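Here is a sketch of that protocol (my own illustration, not code from the screencast), using scikit-learn's StratifiedShuffleSplit. The classifier, the 70/30 ratio, k = 10, and the macro-F1 metric are all assumptions for the example.

# Sketch: cross-validation with random, stratified splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit

X, y = make_classification(n_samples=500, random_state=0)

# k = 10 random splits, each 70% train / 30% test, stratified by label.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.30, random_state=0)

scores = []
for train_idx, test_idx in splitter.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    scores.append(f1_score(y[test_idx], preds, average="macro"))

scores = np.array(scores)
# Aggregate: mean plus a simple normal-approximation 95% confidence interval.
mean = scores.mean()
halfwidth = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print(f"macro-F1: {mean:.3f} +/- {halfwidth:.3f}")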
Trade-offs. Well, the good part of this is that you can create as many experiments as you want without having this impact the ratio of training to testing examples. The value of k here is separate from the value of T and 1 - T, the proportions of the data given to training and testing. What that means is that you can run lots of experiments without shrinking the number of train examples or the number of assessment examples.
The bad here is that you don't get a guarantee that every example will be used the same number of times for training and for testing. I really like random splits, and I would worry about the bad only in situations in which you have a very small dataset. Scikit-learn has nice, reliable code that will help you with these protocols.
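For instance (again my sketch, not the screencast's code), ShuffleSplit together with cross_val_score runs the whole random-splits protocol in a few lines; the model and scoring choice are illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# k = 10 random 70/30 splits; one score per split.
cv = ShuffleSplit(n_splits=10, test_size=0.30, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1_macro")
print(scores.mean(), scores.std())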
If you do want that guarantee, you might instead want to do what's called k-folds cross-validation. Let's imagine we have a dataset and we have divided it ahead of time into three folds. Then we have experiment 1, where our test fold is fold 1 and we train on folds 2 and 3 together. In experiment 2, we test on fold 2 and train on folds 1 and 3. For experiment 3, we test on fold 3 and train on folds 1 and 2. Our three folds give us three separate experiments, and then we aggregate results across all three of the experiments.
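As a small illustration of that fold structure (a sketch with toy indices, not from the screencast):

import numpy as np

examples = np.arange(9)              # a toy dataset of nine example indices
folds = np.array_split(examples, 3)  # divide it ahead of time into 3 folds

# In experiment i, fold i is the test set and the remaining folds form the train set.
for i, test_fold in enumerate(folds, start=1):
    train = np.concatenate(
        [fold for j, fold in enumerate(folds, start=1) if j != i])
    print(f"Experiment {i}: test on fold {i} {test_fold.tolist()}, "
          f"train on {train.tolist()}")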
The good part is that every example appears in a train set exactly k minus 1 times and in a test set exactly once. We get a nice, pristine experimental setting in that regard.
The bad is that the size of k determines the size of the train set. With 3 folds, I get to train on 67 percent of the data and test on 33. But if I want to do 10-folds cross-validation, now I have to train on 90 percent and test on 10. It feels like the number of experiments has gotten tangled up with the percentage of train and test data that I want to have. I might want to run many experiments but nonetheless train on only 80 percent of the data in each case, and k-folds doesn't allow for that.
That's why I favor the random splits approach in almost all settings, because the bad there was relatively small relative to this limitation of k-folds cross-validation.
Finally, I'll just note that scikit-learn again has you covered: it has utilities that will run this k-folds cross-validation in various ways for you.
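For example (my sketch; the specific classifier and metric are assumptions), StratifiedKFold with cross_val_score runs the three-fold protocol described above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Three stratified folds => three experiments, one score per held-out fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1_macro")
print(scores, scores.mean())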