Stanford XCS224U: NLU | NLP Methods and Metrics, Part 6: Model Evaluation & Conclusion | Spring 2023
This is a high-level discussion that is directly oriented toward helping you with your final project work.
Here's an overview. We're going to talk about baselines; that might sound like a mundane topic, but it is actually pretty important conceptually.
Here's a fundamental observation about baselines: evaluation numbers mean very little in isolation. Suppose your system scores well on some metric. That might sound impressive, but the first question reviewers will ask you is how it compares with simple baselines. Conversely, suppose your system gets 0.6 and you feel discouraged by that number; if simple baselines do far worse, you might really have achieved something meaningful there.
The same reasoning applies to oracle models, which help us understand what the ceiling for a task might be.
Baselines are also crucial for strong experimental design. 00:01:51.180 |
Defining your baselines should not be some afterthought; it should be part of your experimental design from the start. In particular, think about ablations of your target system and what each of them can reveal.
Baselines are essential for building a persuasive case. To really understand and calibrate what you achieved, we need some baselines to put the numbers in context. Ablations play a similar role: they help surface the specific virtues of your proposed system. The difference in performance between your chosen model and an ablated model is an estimate of the contribution of the ablated component to the overall system performance.
Random baselines are really useful for many purposes. I think they can also help you fully debug your system, since these are lightweight models that are easy to run end to end. Scikit-learn makes this easy: it has DummyClassifier and DummyRegressor, which offer different strategies for acting as random models.
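As a minimal sketch of what that looks like (assuming scikit-learn and one of its toy datasets), a random baseline is only a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "stratified" samples labels according to the training distribution;
# "most_frequent" and "uniform" are other useful strategies.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train, y_train)
print("Random baseline accuracy:", baseline.score(X_test, y_test))

# A real (still lightweight) model to calibrate against:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Logistic regression accuracy:", model.score(X_test, y_test))
```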
You could also think about task-specific baselines. Does your problem suggest a baseline that will reveal something about the problem or the way it's modeled? If so, you should have one of these task-specific baselines.
A well-known example comes from natural language inference, where hypothesis-only baselines can be very strong. Because of how the datasets were constructed, annotators often signal the label with negation or other very general terms as part of the hypothesis. In other words, the hypothesis alone carries information about the label, and a hypothesis-only baseline quantifies that.
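Here is a minimal illustration of the idea; the premise-hypothesis pairs are toy examples I made up, and in a real experiment you would of course use actual NLI train and dev splits:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy (premise, hypothesis, label) triples standing in for a real NLI dataset.
train = [
    ("A man is playing a guitar.", "A person is making music.", "entailment"),
    ("A dog runs in the park.", "The dog is sleeping.", "contradiction"),
    ("Two kids are outside.", "The children are playing a game.", "neutral"),
]
dev = [("A woman reads a book.", "Someone is reading.", "entailment")]

# The hypothesis-only baseline simply ignores the premise.
X_train = [hyp for _, hyp, _ in train]
y_train = [label for _, _, label in train]
X_dev = [hyp for _, hyp, _ in dev]
y_dev = [label for _, _, label in dev]

baseline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("Hypothesis-only dev accuracy:", baseline.score(X_dev, y_dev))
```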
The finding in the literature is that, very often, hypothesis-only systems perform well above chance, which means the random baseline is not so informative anymore.
There's a similar story for the story cloze task, where systems choose between a coherent and an incoherent ending for a story. It turns out that the coherent-versus-incoherent distinction can often be made from the endings alone, without looking at the story itself. It's not that the task is broken here necessarily, but rather, again, that you should treat this as a baseline to compare against. Real progress is progress beyond this very specialized baseline.
The next topic is hyperparameter optimization, which we discussed in one of our background units on sentiment analysis. Here, I'll just briefly review the rationale.
One motivation is that you want to obtain the best version of your model, and hyperparameter tuning is how you find it.
Another motivation is about comparison between models. To compare systems fairly, what you really do is give every system a chance by exploring a wide range of hyperparameters and reporting each system at its best. Sometimes, though, the question is not peak performance but rather how stable system performance is under the various choices people might make, in order to get a sense for how robustly it will perform if people are, say, not attentive to these hyperparameters or set them poorly.
Crucial to all of this, no matter what your goals are, is that hyperparameter tuning happens only on train and development data, never on the test set. That is the rule that I've been repeating throughout the course. It is really fundamental to how we think about testing and generalization, and it applies with full force to the kind of model selection we're doing here.
Now, hyperparameter optimization has gotten really challenging as models have grown. Let me give you a sense for what the problem is by walking through the standard protocol. In step 1, you identify the hyperparameters that might matter and the values you want to consider for each. Then you create a list of all the combinations of values for all the features you identified in step 1; notice that this means an exponential growth in the number of settings as you add hyperparameters. In step 3, you evaluate each of those settings, typically via cross-validation on the training data, which might imply 5, or 10, or 20 experiments per setting. Then you choose the setting that did best in step 3, you train on all the training data using that setting, and then you evaluate that model on the test set.
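As a rough sketch of that protocol (assuming scikit-learn and one of its small built-in datasets), the whole loop might look like this:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-2: the hyperparameters we care about and all their combinations.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01, 0.001]}

# Step 3: cross-validation over the training data to score each setting.
# With refit=True (the default), the best setting is then retrained on
# all of the training data.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# Final steps: report the chosen setting and evaluate once on the test set.
print("Best setting:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```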
To see the cost, consider even a modest instance of the protocol that we might be implementing: a small grid of values combined with five-fold cross-validation to select optimal parameters. Very quickly, the number of experiments explodes, and once each individual run is expensive, you're pretty much out of contention in terms of actually implementing this protocol completely.
We cannot insist on this level of hyperparameter optimization across the board. Complex models trained on large datasets would end up requiring enormous resources: when papers report that the performance of all of their neural networks was the result of extensive tuning runs, that is a lot of money spent on a lot of compute. Obviously, we cannot insist on a similar level of investment for experiments, say, for this course, but frankly, the same is true for most contributions in the field, so we need more pragmatic strategies.
The first of these has a lot of attractiveness, and I find that as the days go by I lean on it more and more: random or guided sampling to explore a large space on a fixed budget.
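As a minimal sketch (assuming scikit-learn and SciPy), RandomizedSearchCV implements exactly this fixed-budget sampling idea:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)

# Instead of enumerating a full grid, sample a fixed number of settings.
param_distributions = {
    "alpha": loguniform(1e-6, 1e-1),
    "penalty": ["l2", "l1", "elasticnet"],
}
search = RandomizedSearchCV(
    SGDClassifier(random_state=0),
    param_distributions,
    n_iter=20,        # the fixed budget: 20 sampled settings
    cv=5,
    random_state=0,
)
search.fit(X, y)
print("Best sampled setting:", search.best_params_)
```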
You could also search based on just a few epochs of training. Much of the expense comes from running multiple epochs per setting, so you train each setting only briefly and extrapolate. If the learning curves are familiar and consistent, then this will be a pretty strong approach.
You could also search based on subsets of the data. This is fine, but it could be risky, because we know some hyperparameters are very dependent on dataset size. You search on the subset and carry the chosen values over to the full dataset, even though you know that's probably a risky assumption.
Another option is a more heuristic search: based on prior experience, you decide which hyperparameters matter less and then set those to values that you found optimal before or that didn't matter that much. The actual search then happens only over the ones that you believe are important. We have to take your word for it that you've done this responsibly, but it is a reasonable way to balance exploration with constrained resources.
You could also find the optimal hyperparameters via a single split and reuse them for the other splits, justifying that based on the fact that the splits are similar. That cuts down substantially on the number of runs you need to do, because you don't need to do so much cross-validation in this mode.
Then finally, you could adopt others' choices. The skeptics will complain that those findings might not transfer to your task or data, but now, in the modern era with these massive models, this may be the only viable option. We might not be seeing the best versions of these models as a result, but that may be a price we have to pay.
In terms of tooling, as always, the scikit ecosystem is really rich with these things. That's where you could do model-guided search through a hyperparameter grid in order to intelligently concentrate your budget on the most promising settings.
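One concrete option (my example, not necessarily the specific tool meant here) is BayesSearchCV from the scikit-optimize package, which fits a surrogate model to the results so far and uses it to choose the next settings to try; this sketch assumes scikit-optimize is installed and compatible with your scikit-learn version:

```python
from skopt import BayesSearchCV
from skopt.space import Categorical, Real
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# A surrogate model of the score surface decides which setting to try next,
# so promising regions of the space get sampled more heavily.
search = BayesSearchCV(
    SVC(),
    {"C": Real(1e-3, 1e3, prior="log-uniform"),
     "gamma": Real(1e-4, 1e0, prior="log-uniform"),
     "kernel": Categorical(["rbf", "poly"])},
    n_iter=25,   # total budget of settings to evaluate
    cv=5,
    random_state=0,
)
search.fit(X, y)
print("Best setting found:", search.best_params_)
```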
The next topic is classifier comparison. This is a short one, but it can be important. Suppose you've assessed two classifier models, and their performance is different in some way. How do you decide whether these models are different in any meaningful sense?
If you have enough runs, you might be able to quantify that difference in terms of a statistical test. You could also think about confidence intervals to further bolster the argument that you're making.
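Here is a minimal sketch of a bootstrap confidence interval on the difference between two classifiers; the per-example correctness vectors are simulated here just to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-example correctness (1 = correct) for two classifiers
# evaluated on the same 1,000 test examples.
model_a = rng.binomial(1, 0.78, size=1000)
model_b = rng.binomial(1, 0.75, size=1000)

# Bootstrap the accuracy difference by resampling test examples with replacement.
n = len(model_a)
diffs = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)
    diffs.append(model_a[idx].mean() - model_b[idx].mean())

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for the accuracy difference: [{lo:.3f}, {hi:.3f}]")
```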
The Wilcoxon signed-rank test is an accepted method in the field. It lets you argue using methodologies that are similar to standard t-tests, but its nonparametric assumptions are somewhat more aligned with classifier comparison. The idea is to run your systems under lots of different settings to get a vector of 10-20 scores to use as the basis for the stats testing. It will be unstable if the models are unstable, but it is a way of doing a stats test in the mode of classical hypothesis testing: imperfect, but nonetheless useful and arguably better than nothing.
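A minimal sketch with SciPy, where the paired scores are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical F1 scores for two systems run under the same 12 settings
# (e.g., different splits and seeds), paired position by position.
system_a = np.array([0.82, 0.79, 0.81, 0.84, 0.80, 0.83,
                     0.78, 0.82, 0.85, 0.81, 0.79, 0.83])
system_b = np.array([0.80, 0.78, 0.79, 0.82, 0.79, 0.81,
                     0.77, 0.80, 0.83, 0.80, 0.78, 0.81])

# The signed-rank test works on the paired differences and makes no
# normality assumption about them.
stat, p_value = stats.wilcoxon(system_a, system_b)
print(f"Wilcoxon statistic = {stat:.2f}, p = {p_value:.4f}")
```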
The next question is how to assess models without convergence. In the deep learning era, convergence takes center stage, and it does so in a complicated way. First, these models rarely converge to epsilon loss, and their performance on the test set might not even be especially related to how small the loss got.
One thing to think about for stopping criteria in general is what we call incremental dev set testing. To address the uncertainty that I just reviewed, you monitor dev set performance as part of the training that you're doing: at regular intervals, you make predictions on the dev set and track how those scores evolve. Most modern toolkits expose early stopping and related parameters that will help you set this up in a way that will allow you to do this incremental testing.
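Here is a generic sketch of that loop in PyTorch; the function and its arguments are illustrative rather than the course's own training code:

```python
import copy
import torch

def train_with_dev_checks(model, loss_fn, optimizer, train_loader,
                          X_dev, y_dev, score_fn, max_epochs=100, patience=5):
    """Evaluate on the dev set every epoch; stop when the dev score has not
    improved for `patience` epochs, and restore the best parameters."""
    best_score, best_state, since_best = -float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss_fn(model(X_batch), y_batch).backward()
            optimizer.step()
        # Incremental dev-set testing: the stopping signal comes from
        # dev performance, not from the raw training loss.
        model.eval()
        with torch.no_grad():
            preds = model(X_dev).argmax(dim=1)
        score = score_fn(y_dev, preds)
        if score > best_score:
            best_score, since_best = score, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            since_best += 1
            if since_best >= patience:
                break
    model.load_state_dict(best_state)
    return model
```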
If you need a little bit of motivation for this, consider a case where the training loss goes down steadily across different iterations of training, but the performance on the dev set tells a very different story. You can see, based on that dev performance, that at a certain very early point in the process, around iteration 10, the model stops getting better at the actual task. That just shows you that sometimes the steady loss curve is not a reliable signal of your model actually getting better at the thing that you care about.
Think carefully about your stopping criteria. 00:17:16.580 |
In general, though, I think we might want to take a more expansive view of how we do evaluation in this mode. For example, you can report full performance curves with confidence intervals so that we can make some confident distinctions. In one case from our own work, the plots for the different conditions across the models we were comparing have epochs along the x-axis and F1 along the y-axis. If you step back, what I think you see is that our Mittens model is the best model on average, but largely in the early parts of training.
Later in training, it might not be so clear which one you should choose. That's a nuanced lesson that I think is really powerful, and it's lost if all you do is offer point estimates of model performance. We really need to see the full curve to see that level of nuance. I know that NLPers love their results tables, but maybe you could supplement them with some figures that would give us a fuller picture of what was going on.
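As a sketch of the kind of figure I have in mind (with simulated scores standing in for real runs), you can plot mean F1 across epochs with a simple confidence band:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Simulated F1 trajectories: 5 runs x 50 epochs for two models being compared.
epochs = np.arange(1, 51)
runs_a = 0.70 + 0.15 * (1 - np.exp(-epochs / 10)) + rng.normal(0, 0.01, (5, 50))
runs_b = 0.70 + 0.12 * (1 - np.exp(-epochs / 25)) + rng.normal(0, 0.01, (5, 50))

for runs, label in [(runs_a, "Model A"), (runs_b, "Model B")]:
    mean = runs.mean(axis=0)
    # Simple normal-approximation 95% band over runs; a bootstrap would also work.
    half_width = 1.96 * runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])
    plt.plot(epochs, mean, label=label)
    plt.fill_between(epochs, mean - half_width, mean + half_width, alpha=0.3)

plt.xlabel("Epochs")
plt.ylabel("F1")
plt.legend()
plt.show()
```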
Final topic: the role of random parameter initialization. Most deep learning models have parameters that are random at the start of training, and even if you start from a pretrained model, there are usually some random parameters in the mix there. This is not a new issue, but it's especially pressing in the deep learning era.
Here is a relatively recent paper showing that different initializations for neural sequence models led to statistically significant differences in performance, and that a number of recent systems were actually indistinguishable in terms of their raw performance once this source of variation was taken into account. There is also the problem of catastrophic failure from unlucky initializations. Sometimes that happens, and when you see it, there's a question of how to report it as part of overall system performance.
This is explored in the associated notebook for this unit on evaluation methods. As I just showed you with the classic XOR problem, the problem that helped motivate the powerful models that we work with now, if you train a simple feed-forward network for that problem ten times, two out of the ten times it's a colossal failure. That is a glimpse of just how important initialization can be. Since we don't analytically understand why we're seeing this variation, the best response, if you can afford it, is a bunch more experiments.
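Here is a sketch in the spirit of that demonstration (not the notebook's own code): the same data, architecture, and optimizer, with only the random seed varying across runs:

```python
import torch
import torch.nn as nn

# The classic XOR problem.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([0, 1, 1, 0])

def run_once(seed, hidden=2, epochs=2000, lr=0.1):
    torch.manual_seed(seed)  # only the parameter initialization varies
    model = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return (model(X).argmax(dim=1) == y).float().mean().item()

accuracies = [run_once(seed) for seed in range(10)]
print("Per-seed accuracies:", accuracies)
print("Mean accuracy over 10 runs:", sum(accuracies) / len(accuracies))
```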
Let's wrap up. A lot of this, in the back of my mind, is oriented toward helping you with the protocol, which is a document associated with your final project where you give us the nuts and bolts of the project and try to identify any obstacles to success. All the lessons we've been teaching throughout this series are oriented toward helping you think critically about this protocol and ultimately set up a firm foundation for your project.
With that out of the way, I thought I would look ahead a little. I think this is an ideal moment for innovation in surprising new places. Architecture innovation? Way overrated at this point. It isn't worthless, but it is overrated relative to the amount of energy people pour into it.
It's been a theme of these lectures that we need to think very carefully about how we evaluate our systems. That kind of work is way underrated by the community at this point, but still, we should all be participating in this area.
Then finally, exhaustive hyperparameter search: you need to weigh this against other factors. There is more at play here than just that pristine scientific paradigm. We need to think about costs in every sense and how they relate to the innovations that we're likely to see. I hope all of this helps you think expansively about how you might participate in pushing the field forward.