
Stanford XCS224U: NLU | NLP Methods and Metrics, Part 5: Data Organization | Spring 2023


Chapters

0:00 Intro
0:15 Train/Dev/Test
1:27 No fixed splits
3:19 Cross-validation: Random splits
5:17 Cross-validation: K-folds


00:00:00.000 | Welcome back everyone.
00:00:06.000 | This is part five in our series on methods and metrics.
00:00:09.160 | This will be a short, focused,
00:00:11.180 | technical screencast about data organization.
00:00:15.080 | Within the field of NLP and indeed all of AI,
00:00:18.140 | we're all accustomed to having datasets that have
00:00:20.440 | train, dev, and test portions.
00:00:23.440 | This is common in our largest publicly available datasets.
00:00:27.480 | It does presuppose a fairly large dataset,
00:00:30.760 | and that's in virtue of the fact that we
00:00:32.440 | hardly ever get to use the test set.
00:00:34.480 | As I've said repeatedly, in the field
00:00:37.080 | we're all on the honor system to do test set runs
00:00:40.020 | only when all system development is complete.
00:00:43.400 | That test set is under lock and key most of the time,
00:00:46.480 | and that does mean that it is hardly ever
00:00:49.320 | used during the course of scientific inquiry.
00:00:53.320 | Having these fixed test sets is good because it ensures
00:00:57.080 | consistent evaluations.
00:00:58.480 | It's much easier to compare two models if they were
00:01:01.200 | evaluated according to exactly the same protocol.
00:01:04.240 | But it does have a downside that
00:01:06.120 | because we always use the same test set,
00:01:08.520 | we get community-wide hill climbing on that test set as
00:01:12.660 | later papers learn indirect lessons about
00:01:16.000 | the test set from earlier papers in the literature,
00:01:18.840 | and that ends up inflating performance.
00:01:21.320 | But on balance, I think
00:01:23.040 | train-dev-test has been good for the field of NLP.
00:01:26.920 | However, if you're doing work outside of NLP,
00:01:30.360 | you might encounter datasets that
00:01:32.040 | don't have predefined splits.
00:01:34.480 | That could be because they're small or
00:01:36.340 | because they're from a different field.
00:01:38.280 | For example, in psychology,
00:01:40.000 | you hardly ever get this train-dev-test methodology,
00:01:43.340 | and so datasets from that field,
00:01:45.200 | which you might want to make use of,
00:01:46.760 | are unlikely to have the predefined splits.
00:01:50.280 | This poses a challenge for assessment,
00:01:53.240 | because as I said, for robust comparisons,
00:01:55.900 | we really want to have all our models run using
00:01:58.480 | the same assessment regime and that means
00:02:01.200 | using the same splits for all of your experimental runs.
00:02:04.960 | Now, for large datasets,
00:02:06.760 | you could just impose the splits yourself
00:02:08.860 | and then use them for the entire project.
00:02:11.160 | That will simplify your experimental design,
00:02:14.200 | and it will also reduce the amount of
00:02:15.960 | hyperparameter optimization that you need to do.
00:02:18.280 | If you can get away with it,
00:02:19.880 | just impose the splits, and maybe they will even become
00:02:22.680 | baked into how people in NLP think about the dataset.
00:02:26.060 | But for small datasets,
00:02:28.160 | imposing these splits might simply leave you with
00:02:30.720 | too little data and that could lead to
00:02:32.380 | very highly variable system assessments.
00:02:35.880 | Either you're training on too few examples in order to have a lot of
00:02:39.640 | examples for assessment, and that introduces noise,
00:02:42.560 | or you're leaving too few examples to assess on,
00:02:46.020 | and then the resulting assessments
00:02:47.900 | are very noisy and highly variable.
00:02:50.240 | It's hard to get that right.
00:02:51.600 | In these situations,
00:02:53.000 | I think what you should do is think about cross-validation.
00:02:56.940 | In cross-validation, we take a set of examples
00:03:00.240 | and partition them into two or more train/test splits.
00:03:04.260 | We run a bunch of system evaluations and then we aggregate over
00:03:08.560 | those scores in some way, usually by taking an average, and we
00:03:11.860 | report that as a measure of system performance.
00:03:15.880 | There are two broad methods that you
00:03:18.740 | can use for this kind of cross-validation.
00:03:20.960 | The first is very simple.
00:03:22.560 | I've called it random splits here.
00:03:24.920 | The idea is for k splits,
00:03:27.160 | that is k times,
00:03:28.240 | you shuffle your dataset and then you split it into
00:03:31.720 | T percent train and usually one minus
00:03:34.160 | T percent test so that all the data gets used,
00:03:36.560 | and then you conduct an evaluation.
00:03:38.800 | You repeat that k times and you get a vector of
00:03:41.320 | scores and then you aggregate those scores in some way.
00:03:44.640 | Usually, you would take an average,
00:03:46.680 | but you could also think about an average plus
00:03:48.760 | a confidence interval or some kind of stats test that would
00:03:51.520 | tell you about how two systems differ according to this regime.
00:03:56.020 | Usually, but not always,
00:03:58.380 | you want these splits to be stratified in the sense that
00:04:01.160 | the train and test splits have approximately
00:04:03.840 | the same distribution over the classes or
00:04:06.280 | output values to give you consistent evaluations.
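
To make the random-splits protocol concrete, here is a minimal sketch, assuming scikit-learn is available and that `X`, `y` form a small labeled classification dataset; the model, the number of splits, and the train percentage are illustrative choices, not part of the method itself.

```python
# Minimal sketch of cross-validation with random, stratified splits.
# X, y, the model, k, and the train fraction are all illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

k = 10            # number of random splits (independent of the train/test ratio)
train_size = 0.8  # T percent train; the remaining 1 - T percent is the test set

scores = []
for i in range(k):
    # Shuffle and split, stratifying so train and test have similar label distributions.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_size, stratify=y, random_state=i)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

scores = np.array(scores)
# Aggregate: mean accuracy plus a simple 95% confidence interval over the k runs.
mean = scores.mean()
ci = 1.96 * scores.std(ddof=1) / np.sqrt(k)
print(f"accuracy: {mean:.3f} +/- {ci:.3f}")
```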
00:04:10.720 | Trade-offs. Well, the good part of this is that you can create
00:04:14.680 | as many experiments as you want without having
00:04:17.880 | this impact the ratio of training to testing examples.
00:04:21.320 | The value of k here is separate from the value of T and one minus T.
00:04:26.600 | What that means is that you can run lots of experiments and
00:04:30.600 | independently set the number of
00:04:32.200 | train examples or the number of assessment examples.
00:04:35.320 | That's certainly to the good.
00:04:37.240 | The bad here is that you don't get a guarantee that
00:04:40.080 | every example will be used the same number of
00:04:42.320 | times for training and testing, because of
00:04:44.480 | the shuffling that you do here,
00:04:46.560 | which introduces a lot of randomness.
00:04:48.720 | Frankly, for reasonably sized datasets,
00:04:51.440 | this bad here is very minimal indeed.
00:04:53.960 | I really like random splits and I would worry about
00:04:56.960 | the bad only in situations in which you have a very small dataset.
00:05:01.960 | Finally, Scikit-learn has lots of
00:05:04.640 | utilities for doing these random splits.
00:05:07.640 | I would encourage you to use them.
00:05:09.320 | They've worked out nice,
00:05:10.560 | reliable code that will help you with these protocols.
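
The lecture doesn't name particular utilities, but one natural fit for this protocol is a shuffle-split splitter combined with `cross_val_score`; the sketch below uses `StratifiedShuffleSplit`, with the same illustrative dataset and model as above.

```python
# Sketch of the same random-splits protocol using scikit-learn's built-in splitters.
# StratifiedShuffleSplit draws repeated stratified train/test splits;
# the dataset and model are again illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

splitter = StratifiedShuffleSplit(n_splits=10, train_size=0.8, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splitter)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```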
00:05:14.400 | Now, in some situations,
00:05:17.440 | you might instead want to do what's called k-fold cross-validation,
00:05:21.200 | and this is somewhat different.
00:05:22.640 | Let's imagine we have a dataset and we have divided it ahead of time into
00:05:26.600 | three folds that is three disjoint parts.
00:05:30.640 | Then we have experiment 1 where we have our test fold is
00:05:34.560 | fold 1 and we train on folds 2 and 3 together.
00:05:38.280 | Experiment 2, we test on fold 2 and train on 1 and 3.
00:05:45.880 | For experiment 3, we test on fold 3 and train on 1 and 2.
00:05:49.840 | We've covered all of the combinations.
00:05:51.880 | Our three folds give us three separate experiments,
00:05:54.920 | and then we aggregate results across all three of the experiments.
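
As a rough illustration of those three experiments, here is a sketch using scikit-learn's `KFold` on a toy array; the fold count and data are illustrative, and shuffling before folding is optional.

```python
# Sketch of 3-fold cross-validation: each fold is the test set exactly once,
# and the remaining folds form the training set for that experiment.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)   # a toy dataset with 6 examples
kf = KFold(n_splits=3, shuffle=True, random_state=0)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Experiment {i}: train on examples {train_idx}, test on examples {test_idx}")
```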
00:05:59.520 | Let's think about our trade-offs again.
00:06:01.480 | The good part is that every example appears in a train set
00:06:05.320 | exactly k minus 1 times and in a test set exactly once.
00:06:09.440 | We get a nice pristine experimental setting in that regard.
00:06:13.840 | The bad though is really bad to my mind.
00:06:17.700 | The value of k determines the size of the train set.
00:06:22.160 | If I do 3-fold cross-validation,
00:06:24.360 | I get to train on 67 percent of the data and test on 33 percent.
00:06:28.920 | But if I want to do 10-fold cross-validation,
00:06:31.640 | now I have to train on 90 percent and test on 10 percent.
00:06:34.960 | It feels like the number of experiments has gotten
00:06:37.880 | entwined with
00:06:40.560 | the train and test percentages that I want to have,
00:06:43.080 | and that's really problematic.
00:06:44.320 | You might want to have a lot of folds,
00:06:46.680 | that is a lot of experiments,
00:06:48.340 | but nonetheless train on only 80 percent of the data in each case.
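
To spell out the arithmetic behind this trade-off: in k-fold cross-validation, each experiment trains on k - 1 of the k folds, so

$$\text{train fraction} = \frac{k-1}{k}, \qquad \text{test fraction} = \frac{1}{k},$$

which gives roughly 67/33 for k = 3 and 90/10 for k = 10. There is no way to run, say, 10 experiments while keeping an 80/20 split within this scheme.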
00:06:52.680 | That leads me to prefer
00:06:54.880 | the random splits approach in almost all settings,
00:06:57.800 | because the bad there was relatively small compared to
00:07:01.400 | the confound that this introduces for
00:07:03.820 | k-fold cross-validation.
00:07:06.080 | Finally, I'll just note that Scikit again has you covered.
00:07:09.160 | They have lots of great utilities for doing
00:07:11.560 | this k-fold cross-validation in various ways.
00:07:14.720 | Do make use of them to make sure that
00:07:17.000 | your protocols are the ones that you wanted.
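
As a usage sketch of those utilities (again under the same illustrative assumptions about the dataset and model), `cross_val_score` with a `KFold` splitter handles the fold bookkeeping and returns the per-fold scores for aggregation.

```python
# Sketch: letting scikit-learn manage the k-fold protocol end to end.
# The dataset and model are illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=KFold(n_splits=3, shuffle=True, random_state=0))
print(scores, scores.mean())
```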