
Stanford XCS224U: NLU | NLP Methods and Metrics, Part 5: Data Organization | Spring 2023


Chapters

0:00 Intro
0:15 Train/Dev/Test
1:27 No fixed splits
3:19 Cross-validation: Random splits
5:17 Cross-validation: K-folds


00:00:00.000 | Welcome back everyone.
00:00:06.000 | This is part five in our series on methods and metrics.
00:00:09.160 | This will be a short, focused,
00:00:11.180 | technical screencast about data organization.
00:00:15.080 | Within the field of NLP and indeed all of AI,
00:00:18.140 | we're all accustomed to having datasets that have
00:00:20.440 | train, dev, and test portions.
00:00:23.440 | This is common in our largest publicly available datasets.
00:00:27.480 | It does presuppose a fairly large dataset,
00:00:30.760 | and that's in virtue of the fact that we
00:00:32.440 | hardly ever get to use the test set.
00:00:34.480 | As I've said repeatedly, in the field
00:00:37.080 | we're all on the honor system to do test set runs
00:00:40.020 | only when all system development is complete.
00:00:43.400 | That test set is under lock and key most of the time,
00:00:46.480 | and that does mean that it is hardly ever
00:00:49.320 | used during the course of scientific inquiry.
00:00:53.320 | Having these fixed test sets is good because it ensures
00:00:57.080 | consistent evaluations.
00:00:58.480 | It's much easier to compare two models if they were
00:01:01.200 | evaluated according to exactly the same protocol.
00:01:04.240 | But it does have a downside that
00:01:06.120 | because we always use the same test set,
00:01:08.520 | we get community-wide hill climbing on that test set as
00:01:12.660 | later papers learn indirect lessons about
00:01:16.000 | the test set from earlier papers in the literature,
00:01:18.840 | and that ends up inflating performance.
00:01:21.320 | But on balance, I think
00:01:23.040 | train-dev-test has been good for the field of NLP.
00:01:26.920 | However, if you're doing work outside of NLP,
00:01:30.360 | you might encounter datasets that
00:01:32.040 | don't have predefined splits.
00:01:34.480 | That could be because they're small or
00:01:36.340 | because they're from a different field.
00:01:38.280 | For example, in psychology,
00:01:40.000 | you hardly ever get this train-dev-test methodology,
00:01:43.340 | and so datasets from that field,
00:01:45.200 | which you might want to make use of,
00:01:46.760 | are unlikely to have the predefined splits.
00:01:50.280 | This poses a challenge for assessment,
00:01:53.240 | because as I said, for robust comparisons,
00:01:55.900 | we really want to have all our models run using
00:01:58.480 | the same assessment regime and that means
00:02:01.200 | using the same splits for all of your experimental runs.
00:02:04.960 | Now, for large datasets,
00:02:06.760 | you could just impose the splits yourself
00:02:08.860 | and then use them for the entire project.
00:02:11.160 | That will simplify your experimental design,
00:02:14.200 | and it will also reduce the amount of
00:02:15.960 | hyperparameter optimization that you need to do.
00:02:18.280 | If you can get away with it,
00:02:19.880 | just impose the splits, and maybe they will even become
00:02:22.680 | baked into how people in NLP think about the dataset.
00:02:26.060 | But for small datasets,
00:02:28.160 | imposing these splits might simply leave you with
00:02:30.720 | too little data and that could lead to
00:02:32.380 | very highly variable system assessments.
00:02:35.880 | Either you're training on too few examples in order to have a lot of
00:02:39.640 | examples for assessment, and that introduces noise,
00:02:42.560 | or you're leaving too few examples to assess on,
00:02:46.020 | and then the resulting assessments
00:02:47.900 | are very noisy and highly variable.
00:02:50.240 | It's hard to get that right.
00:02:51.600 | In these situations,
00:02:53.000 | I think what you should do is think about cross-validation.
00:02:56.940 | In cross-validation, we take a set of examples
00:03:00.240 | and partition them into two or more train/test splits.
00:03:04.260 | We run a bunch of system evaluations and then we aggregate over
00:03:08.560 | those scores in some way, usually by taking an average, and we
00:03:11.860 | report that as a measure of system performance.
00:03:15.880 | There are two broad methods that you
00:03:18.740 | can use for this kind of cross-validation.
00:03:20.960 | The first is very simple.
00:03:22.560 | I've called it random splits here.
00:03:24.920 | The idea is for k splits,
00:03:27.160 | that is k times,
00:03:28.240 | you shuffle your dataset and then you split it into
00:03:31.720 | T percent train and usually one minus
00:03:34.160 | T percent test so that all the data gets used,
00:03:36.560 | and then you conduct an evaluation.
00:03:38.800 | You repeat that k times and you get a vector of
00:03:41.320 | scores and then you aggregate those scores in some way.
00:03:44.640 | Usually, you would take an average,
00:03:46.680 | but you could also think about an average plus
00:03:48.760 | a confidence interval or some kind of stats test that would
00:03:51.520 | tell you about how two systems differ according to this regime.
00:03:56.020 | Usually, but not always,
00:03:58.380 | you want these splits to be stratified in the sense that
00:04:01.160 | the train and test splits have approximately
00:04:03.840 | the same distribution over the classes or
00:04:06.280 | output values to give you consistent evaluations.
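
To make the random-splits protocol concrete, here is a minimal sketch, assuming scikit-learn is available and that `X`, `y` form a small labeled classification dataset; the model, the number of splits, and the train percentage are illustrative choices, not part of the method itself.

```python
# Minimal sketch of cross-validation with random, stratified splits.
# X, y, the model, k, and the train fraction are all illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

k = 10            # number of random splits (independent of the train/test ratio)
train_size = 0.8  # T percent train; the remaining 1 - T percent is the test set

scores = []
for i in range(k):
    # Shuffle and split, stratifying so train and test have similar label distributions.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_size, stratify=y, random_state=i)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

scores = np.array(scores)
# Aggregate: mean accuracy plus a simple 95% confidence interval over the k runs.
mean = scores.mean()
ci = 1.96 * scores.std(ddof=1) / np.sqrt(k)
print(f"accuracy: {mean:.3f} +/- {ci:.3f}")
```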
00:04:10.720 | Trade-offs. Well, the good part of this is that you can create
00:04:14.680 | as many experiments as you want without having
00:04:17.880 | this impact the ratio of training to testing examples.
00:04:21.320 | The value of k here is separate from the value of T and one minus T.
00:04:26.600 | What that means is that you can run lots of experiments and
00:04:30.600 | independently set the number of
00:04:32.200 | train examples or the number of assessment examples.
00:04:35.320 | That's certainly to the good.
00:04:37.240 | The bad here is that you don't get a guarantee that
00:04:40.080 | every example will be used the same number of
00:04:42.320 | times for training and testing, because of
00:04:44.480 | the shuffling that you do here,
00:04:46.560 | which introduces a lot of randomness.
00:04:48.720 | Frankly, for reasonably sized datasets,
00:04:51.440 | this bad here is very minimal indeed.
00:04:53.960 | I really like random splits and I would worry about
00:04:56.960 | the bad only in situations in which you have a very small dataset.
00:05:01.960 | Finally, Scikit-learn has lots of
00:05:04.640 | utilities for doing these random splits.
00:05:07.640 | I would encourage you to use them.
00:05:09.320 | They've worked out nice,
00:05:10.560 | reliable code that will help you with these protocols.
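
The lecture doesn't name particular utilities, but one natural fit for this protocol is a shuffle-split splitter combined with `cross_val_score`; the sketch below uses `StratifiedShuffleSplit`, with the same illustrative dataset and model as above.

```python
# Sketch of the same random-splits protocol using scikit-learn's built-in splitters.
# StratifiedShuffleSplit draws repeated stratified train/test splits;
# the dataset and model are again illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

splitter = StratifiedShuffleSplit(n_splits=10, train_size=0.8, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splitter)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```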
00:05:14.400 | Now, in some situations,
00:05:17.440 | you might instead want to do what's called k-fold cross-validation,
00:05:21.200 | and this is somewhat different.
00:05:22.640 | Let's imagine we have a dataset and we have divided it ahead of time into
00:05:26.600 | three folds that is three disjoint parts.
00:05:30.640 | Then we have experiment 1 where we have our test fold is
00:05:34.560 | fold 1 and we train on folds 2 and 3 together.
00:05:38.280 | Experiment 2, we test on fold 2 and train on 1 and 3.
00:05:45.880 | For experiment 3, we test on fold 3 and train on 1 and 2.
00:05:49.840 | We've covered all of the combinations.
00:05:51.880 | Our three folds give us three separate experiments,
00:05:54.920 | and then we aggregate results across all three of the experiments.
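
As a rough illustration of those three experiments, here is a sketch using scikit-learn's `KFold` on a toy array; the fold count and data are illustrative, and shuffling before folding is optional.

```python
# Sketch of 3-fold cross-validation: each fold is the test set exactly once,
# and the remaining folds form the training set for that experiment.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)   # a toy dataset with 6 examples
kf = KFold(n_splits=3, shuffle=True, random_state=0)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Experiment {i}: train on examples {train_idx}, test on examples {test_idx}")
```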
00:05:59.520 | Let's think about our trade-offs again.
00:06:01.480 | The good part is that every example appears in a train set
00:06:05.320 | exactly k minus 1 times and in a test set exactly once.
00:06:09.440 | We get a nice pristine experimental setting in that regard.
00:06:13.840 | The bad though is really bad to my mind.
00:06:17.700 | The value of k determines the size of the train set.
00:06:22.160 | If I do 3-fold cross-validation,
00:06:24.360 | I get to train on 67 percent of the data and test on 33 percent.
00:06:28.920 | But if I want to do 10-fold cross-validation,
00:06:31.640 | now I have to train on 90 percent and test on 10 percent.
00:06:34.960 | It feels like the number of experiments has gotten
00:06:37.880 | entwined with
00:06:40.560 | the train and test percentages that I want to have,
00:06:43.080 | and that's really problematic.
00:06:44.320 | You might want to have a lot of folds,
00:06:46.680 | that is a lot of experiments,
00:06:48.340 | but nonetheless train on only 80 percent of the data in each case.
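
To spell out the arithmetic behind this trade-off: in k-fold cross-validation, each experiment trains on k - 1 of the k folds, so

$$\text{train fraction} = \frac{k-1}{k}, \qquad \text{test fraction} = \frac{1}{k},$$

which gives roughly 67/33 for k = 3 and 90/10 for k = 10. There is no way to run, say, 10 experiments while keeping an 80/20 split within this scheme.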
00:06:52.680 | That leads me to prefer
00:06:54.880 | the random splits approach in almost all settings,
00:06:57.800 | because the bad there was relatively small compared to
00:07:01.400 | the confound that this introduces for
00:07:03.820 | k-fold cross-validation.
00:07:06.080 | Finally, I'll just note that Scikit again has you covered.
00:07:09.160 | They have lots of great utilities for doing
00:07:11.560 | this k-fold cross-validation in various ways.
00:07:14.720 | Do make use of them to make sure that
00:07:17.000 | your protocols are the ones that you wanted.
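
As a usage sketch of those utilities (again under the same illustrative assumptions about the dataset and model), `cross_val_score` with a `KFold` splitter handles the fold bookkeeping and returns the per-fold scores for aggregation.

```python
# Sketch: letting scikit-learn manage the k-fold protocol end to end.
# The dataset and model are illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=KFold(n_splits=3, shuffle=True, random_state=0))
print(scores, scores.mean())
```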