
Stanford XCS224U: NLU | NLP Methods and Metrics, Part 6: Model Evaluation & Conclusion | Spring 2023



00:00:00.000 | Welcome back everyone.
00:00:06.280 | This is the sixth and final screencast
00:00:08.820 | in our series on methods and metrics.
00:00:10.680 | We're going to talk about model evaluation.
00:00:12.900 | This is a high-level discussion that is directly
00:00:15.300 | oriented toward helping you with your final project work.
00:00:18.900 | Here's an overview. We're going to talk about baselines.
00:00:21.940 | What are they? Why are they important?
00:00:24.260 | We'll talk about the trials and
00:00:26.300 | tribulations of hyperparameter optimization
00:00:28.620 | and why it's important.
00:00:30.120 | We'll think about classifier comparison,
00:00:32.720 | a common mode to be in as
00:00:34.100 | you're evaluating systems and hypotheses.
00:00:36.620 | Then we'll talk about two things that are
00:00:38.520 | really particular to the deep learning era.
00:00:40.960 | How to assess models that don't converge in
00:00:43.460 | any strict sense and also the role of
00:00:46.460 | random parameter initialization in
00:00:48.660 | the performance of our biggest models.
00:00:51.800 | Let's start with baselines.
00:00:54.180 | We take this for granted,
00:00:55.380 | but this is actually pretty important conceptually.
00:00:58.500 | Here's a fundamental observation about baselines.
00:01:01.520 | Evaluation numbers in our field can
00:01:03.980 | never be understood properly in isolation.
00:01:06.820 | Suppose your system gets 0.95 F1,
00:01:09.740 | you feel overjoyed,
00:01:11.520 | but the first question reviewers will ask you is,
00:01:14.700 | is the task too easy?
00:01:16.280 | How do simple baselines do on the problem?
00:01:19.600 | Conversely, suppose your system gets 0.6 and you feel
00:01:24.260 | in despair because you
00:01:26.500 | feel like you haven't had a success here,
00:01:28.220 | but the next question should be,
00:01:30.020 | how do humans do?
00:01:31.180 | They're presumably an upper bound.
00:01:33.780 | If it's a hard task or a noisy task,
00:01:36.180 | human performance might be close to 0.61 and
00:01:38.540 | you might really have achieved something meaningful there.
00:01:41.380 | It's baseline models, and in that case,
00:01:43.520 | oracle models, that are helping us to understand.
00:01:46.660 | Baselines are also crucial for strong experimental design.
00:01:51.180 | Defining your baseline should not be some afterthought,
00:01:54.940 | but rather central to how you
00:01:56.640 | define your overall hypotheses.
00:01:58.800 | Think about simple systems,
00:02:01.160 | think about ablations of your target system and
00:02:03.860 | incorporate those into your thinking
00:02:05.780 | about the comparisons that you'll make.
00:02:07.580 | Baselines are really just one aspect
00:02:10.060 | of the comparisons we want to offer.
00:02:12.820 | Baselines are essential for building a persuasive case.
00:02:16.640 | We saw that in my two examples there.
00:02:18.660 | To really understand and calibrate what you achieved,
00:02:22.020 | we need some baselines to compare against.
00:02:26.100 | They can also be used to illuminate
00:02:28.640 | specific aspects of the problem
00:02:30.860 | and specific virtues of your proposed system.
00:02:33.720 | That often falls under the heading
00:02:35.180 | of ablations of your system.
00:02:36.740 | Those are baselines that remove
00:02:38.920 | crucial features or components
00:02:41.220 | and test the model with the same protocol.
00:02:43.420 | Then the distance between
00:02:45.180 | your chosen model and the ablated model is an estimate of
00:02:49.420 | the importance of
00:02:50.920 | the ablated component to the overall system performance.
00:02:54.580 | This is a crucial aspect of arguing for
00:02:56.940 | and supporting hypotheses.
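To make the ablation idea concrete, here is a minimal sketch in scikit-learn, assuming hypothetical feature matrices X_full and X_ablated that differ only in the component under study; the gap between the two cross-validated scores estimates that component's contribution.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def ablation_score(X, y):
    """Same protocol for both conditions: 5-fold cross-validated macro-F1."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()

# X_full uses all features; X_ablated drops the component being studied.
# full = ablation_score(X_full, y)
# ablated = ablation_score(X_ablated, y)
# print(f"Estimated contribution of the ablated component: {full - ablated:.3f}")
```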
00:03:00.140 | Random baselines are really useful for many purposes.
00:03:05.340 | First, they can provide a true lower
00:03:07.700 | bound on how systems can do on your problem.
00:03:10.460 | Sometimes they are surprisingly robust,
00:03:12.780 | and so it's worth running these early.
00:03:14.780 | I think also they can help you fully debug your system.
00:03:18.340 | These are probably lightweight models that do
00:03:20.300 | relatively little processing and can make
00:03:22.740 | sure that everything is functioning
00:03:24.660 | and makes sense and all that other stuff.
00:03:26.700 | Scikit-learn again has you covered.
00:03:28.820 | It has DummyClassifier and DummyRegressor,
00:03:31.620 | which offer different strategies for acting as random models,
00:03:35.100 | and I think this is really useful to
00:03:36.740 | set up early in your process.
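Here is a minimal sketch of how you might set these up, using scikit-learn's DummyClassifier on synthetic data purely so the snippet runs end to end; the strategies shown are just a few of the ones the class supports.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

for strategy in ("uniform", "stratified", "most_frequent"):
    baseline = DummyClassifier(strategy=strategy, random_state=42)
    baseline.fit(X_train, y_train)  # the features are ignored
    preds = baseline.predict(X_dev)
    print(strategy, round(f1_score(y_dev, preds, average="macro"), 3))
```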
00:03:39.860 | You could also think about task specific baselines.
00:03:43.620 | This might require real thought
00:03:45.220 | and real study in the literature.
00:03:47.420 | Does your problem suggest a baseline that will reveal
00:03:50.500 | something about the problem or the way it's modeled?
00:03:53.500 | If so, you should have one of these task specific baselines.
00:03:57.100 | Here are two recent examples from NLU.
00:03:59.860 | The first one is natural language inference.
00:04:02.220 | People discovered that so-called
00:04:04.500 | hypothesis only baselines can be very strong.
00:04:08.100 | The reason this happens is that
00:04:09.900 | in the underlying crowdsourcing effort,
00:04:12.100 | crowd workers were given premise sentences
00:04:14.540 | and asked to construct three hypotheses,
00:04:16.700 | one for neutral, one for
00:04:18.420 | contradiction, and one for entailment.
00:04:20.780 | In that process of construction,
00:04:22.860 | they did some systematic things that convey
00:04:25.940 | information about the label
00:04:27.940 | inadvertently through the hypothesis.
00:04:30.620 | For example, many contradictions involve
00:04:33.540 | negation and many entailment pairs involve
00:04:36.420 | very general terms as part of the hypothesis.
00:04:39.780 | What that means is that the hypothesis
00:04:41.620 | actually carries information about
00:04:43.180 | the label and a hypothesis only baseline quantifies that.
00:04:46.700 | You simply fit a model without
00:04:48.580 | any premise information and see how you do.
00:04:51.540 | The finding of the literature is that very often,
00:04:54.540 | for our benchmarks, the hypothesis
00:04:56.700 | only baseline is way above chance.
00:04:59.140 | What that shows you is that
00:05:00.340 | the random baseline is not so informative anymore.
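To make that concrete, here is a minimal sketch of a hypothesis-only baseline, assuming hypothetical lists of (premise, hypothesis, label) triples; the only essential move is that the premise is dropped before fitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

def hypothesis_only_baseline(train_examples, dev_examples):
    """Fit a classifier on hypotheses alone and report dev-set performance."""
    X_train = [hyp for _, hyp, _ in train_examples]   # premises are dropped
    y_train = [label for _, _, label in train_examples]
    X_dev = [hyp for _, hyp, _ in dev_examples]
    y_dev = [label for _, _, label in dev_examples]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return classification_report(y_dev, model.predict(X_dev))
```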
00:05:04.460 | There's a similar story for the Story Cloze task.
00:05:07.860 | This is to distinguish between
00:05:09.700 | a coherent and incoherent ending for a story.
00:05:12.500 | Again, systems that look only at
00:05:14.340 | the ending often do really well.
00:05:16.340 | I think for the same reason,
00:05:17.660 | the coherent versus incoherent thing is often
00:05:20.580 | actually inferable just from the ending,
00:05:23.900 | neglecting the story.
00:05:25.460 | It's not that the task is broken here necessarily,
00:05:28.580 | but rather, again, that you should think about
00:05:30.460 | this as a baseline to compare against, and that progress
00:05:33.860 | is really progress beyond this very specialized baseline.
00:05:38.780 | Next topic, hyperparameter optimization.
00:05:42.860 | This is discussed extensively in
00:05:44.660 | one of our background units on sentiment analysis.
00:05:47.460 | You might go there for a refresher.
00:05:49.580 | Here, I'll just briefly review the rationale.
00:05:52.300 | You want maybe to obtain the best version of your model,
00:05:55.860 | and that might mean exploring over
00:05:57.700 | different hyperparameters to find
00:05:59.460 | an optimal setting for it.
00:06:01.660 | Another motivation is about comparison between models.
00:06:05.340 | Suppose you do have a results table
00:06:07.220 | full of different systems you're comparing.
00:06:09.580 | It makes no sense to compare them against
00:06:12.420 | randomly chosen parameter settings
00:06:14.500 | because you really want to give
00:06:16.260 | every model the best chance to shine.
00:06:18.780 | Otherwise, there's an arbitrariness to
00:06:20.820 | the evaluation that might
00:06:22.580 | not translate into robust results.
00:06:24.980 | What you really do is give every system a chance by
00:06:27.780 | exploring a wide range of hyperparameters and
00:06:31.060 | reporting the optimal results
00:06:33.100 | according to that exploration.
00:06:34.940 | That's a fair comparison and it implies
00:06:37.300 | a lot of search over hyperparameters.
00:06:40.300 | You might want to understand
00:06:41.620 | the stability of your architecture.
00:06:43.100 | This is interestingly different.
00:06:44.420 | This is where you're not
00:06:45.580 | interested in the best parameters,
00:06:47.500 | but rather how stable system performance is under
00:06:50.540 | various choices people might make, in order to get
00:06:53.820 | a sense for how robustly it will perform if people are,
00:06:57.020 | say, not attentive to these hyperparameters or set them
00:07:00.340 | incorrectly, whether through an inadvertent accident
00:07:03.500 | or in an adversarial setting.
00:07:06.100 | Crucial to all of this no matter what your goals,
00:07:09.980 | hyperparameter tuning must be done
00:07:12.100 | only on train and development data.
00:07:14.340 | You never do model selection of
00:07:16.700 | any kind based on the test data.
00:07:19.100 | This is a special case of
00:07:20.700 | the rule that I've been repeating throughout the course.
00:07:22.860 | This is really fundamental to how we think about
00:07:25.020 | testing and generalization and it applies with
00:07:27.740 | real force in the context of
00:07:29.900 | the kind of model selection we're doing here.
00:07:33.020 | Now, hyperparameter optimization has gotten really
00:07:37.780 | challenging in the era of
00:07:39.460 | long-running expensive training regimes.
00:07:42.100 | Let me give you a sense for what the problem is
00:07:44.260 | by way of an example.
00:07:46.020 | For each hyperparameter, you identify
00:07:48.620 | a large set of values for it in some range.
00:07:51.860 | Then you create a list of all the combinations
00:07:54.740 | of all the hyperparameters.
00:07:56.580 | This is the cross product of all the values
00:07:59.180 | for all the hyperparameters you identified in step 1.
00:08:01.860 | What you can hear in that description is
00:08:03.860 | an exponential growth in the number of settings.
00:08:07.180 | For each setting, you cross-validate it
00:08:09.740 | on the available training data,
00:08:11.180 | which might imply 5, or 10, or 20 experiments.
00:08:15.540 | Then you choose the setting that did best in step 3,
00:08:18.780 | and you train on all the training data using that setting,
00:08:21.820 | and then you evaluate that model on the test set.
00:08:24.220 | That is a pristine version
00:08:26.580 | of the protocol that we might be implementing.
00:08:29.580 | But here's the problem.
00:08:31.420 | Suppose parameter h1 has
00:08:33.180 | five values and parameter h2 has 10;
00:08:36.060 | then the total number of settings is now 50.
00:08:39.100 | Suppose we add a third hyperparameter with two values; then it goes to 100.
00:08:43.180 | Now, suppose we're going to do
00:08:44.620 | five-fold cross-validation to select optimal parameters.
00:08:47.620 | Now we are at 500 runs.
00:08:49.940 | Very quickly, the number of experiments exploded.
00:08:54.060 | If each one of these runs takes a day,
00:08:56.300 | you're pretty much out of contention in terms of
00:08:58.620 | actually implementing this protocol completely.
00:09:01.660 | Something has to change.
00:09:03.820 | The above is untenable as
00:09:06.540 | a set of laws for the scientific community.
00:09:08.500 | We cannot insist on this level of hyperparameter optimization.
00:09:12.660 | If we adopted it,
00:09:14.180 | complex models trained on large datasets would end up
00:09:17.020 | disfavored and only the very
00:09:19.140 | wealthy would be able to participate.
00:09:20.980 | To give you a glimpse of this,
00:09:22.180 | here's a quote from a paper from a team at
00:09:24.300 | Google doing NLP for healthcare.
00:09:27.060 | The performance of all above neural networks were
00:09:29.700 | tuned automatically using Google Vizier with
00:09:32.700 | a total of over 200,000 GPU hours.
00:09:36.700 | That is a lot of money spent on a lot of compute.
00:09:40.380 | Obviously, we cannot insist on a similar level of
00:09:43.740 | investment for experiments say for this course,
00:09:46.700 | but frankly, for any contribution in the field,
00:09:49.420 | we have to have compromises.
00:09:51.420 | Here are some reasonable compromises.
00:09:53.620 | These are pragmatic steps you can take to
00:09:55.740 | alleviate this resource problem.
00:09:58.460 | I've given them in descending order of
00:10:00.700 | attractiveness and I find that as the days go by,
00:10:04.260 | we need to go lower and lower on this list.
00:10:08.180 | You could do random sampling and
00:10:10.500 | guided sampling to explore a large space on a fixed budget.
00:10:13.900 | This is nice because you have
00:10:15.460 | the cross product of all of the settings.
00:10:17.820 | That's too large.
00:10:19.060 | You simply randomly sample in the space,
00:10:21.480 | maybe with some guidance from a model,
00:10:23.700 | and you can then on a fixed budget of say,
00:10:25.980 | five or 10 or 100 runs,
00:10:27.860 | do a version of the full grid search.
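Here is a minimal sketch of that fixed-budget approach with scikit-learn's RandomizedSearchCV; the estimator and the sampling distributions are illustrative placeholders.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

param_distributions = {
    "C": loguniform(1e-3, 1e3),       # continuous ranges, sampled randomly
    "gamma": loguniform(1e-4, 1e1),
    "kernel": ["rbf", "linear"],
}

search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=20, cv=5, random_state=0)  # fixed budget
search.fit(X, y)
print(search.best_params_)
```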
00:10:30.980 | You could also search based on a few epochs of
00:10:33.980 | training. The expense comes from multiple epochs,
00:10:36.940 | maybe you do one or two,
00:10:38.500 | and then you pick the hyperparameters
00:10:40.280 | that were best at that point.
00:10:42.340 | If the learning curves are familiar and consistent,
00:10:47.220 | then this will be a pretty strong approach here.
00:10:50.540 | You could also search based on subsets of the data.
00:10:53.780 | This is fine, but it could be risky because we
00:10:56.540 | know some parameters are very dependent on dataset size.
00:11:00.180 | You're selecting based on small data
00:11:02.300 | and applying it to large data,
00:11:03.700 | even though you know that's probably a risky assumption.
00:11:07.740 | You could do heuristic search and define
00:11:10.500 | which hyperparameters matter less and then set
00:11:12.980 | them by hand and justify that in the paper.
00:11:15.940 | That's increasingly common.
00:11:17.540 | People describe things like we
00:11:18.980 | determined in our initial experiments
00:11:21.420 | that these hyperparameters had
00:11:23.180 | this optimal value or didn't matter that much,
00:11:25.940 | and so we chose these reasonable values.
00:11:29.020 | Then the actual search happens only over the ones that you
00:11:32.260 | can tell are important.
00:11:34.580 | We have to take your word for it that you've done
00:11:36.620 | the heuristic search responsibly,
00:11:38.260 | but this is obviously a really good way to
00:11:40.780 | balance exploration with constrained resources.
00:11:44.820 | You could find the optimal hyperparameters via
00:11:47.620 | a single split and use them for
00:11:49.500 | all subsequent splits and then justify
00:11:51.500 | that based on the fact that the splits are similar.
00:11:54.040 | That would automatically cut down
00:11:55.900 | substantially on the number of runs you need to do because you
00:11:58.780 | don't need to do so much cross-validation in this mode.
00:12:03.020 | Then finally, you could adopt others' choices.
00:12:05.900 | The skeptics will complain that these findings
00:12:08.060 | don't translate to new datasets,
00:12:09.980 | but it could be the only option.
00:12:11.780 | As I say, a few years ago,
00:12:14.180 | this was frowned upon,
00:12:15.840 | but now in the modern era with these massive models,
00:12:18.720 | it's basically the only option.
00:12:20.460 | I think increasingly, people are simply
00:12:22.840 | carrying forward other hyperparameters.
00:12:25.260 | It means less exploration.
00:12:26.980 | We might not be seeing the best versions of these models,
00:12:29.860 | but it might then again be the only option.
00:12:33.420 | In terms of tools for hyperparameter search,
00:12:36.300 | as always, scikit-learn is really rich with these things.
00:12:38.660 | It has a lot of this tooling.
00:12:40.980 | In addition, scikit-optimize is
00:12:43.460 | one level up in terms of sophistication.
00:12:46.140 | That's where you could do model-guided search through
00:12:48.980 | a hyperparameter grid in order to intelligently
00:12:52.380 | select good settings to
00:12:54.460 | lead to a good model on a fixed budget.
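Here is a minimal sketch of that kind of model-guided search with scikit-optimize's BayesSearchCV, which fits a surrogate model to past trials to pick promising settings; the estimator and search space are again illustrative placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Categorical, Real

X, y = make_classification(n_samples=500, random_state=0)

search = BayesSearchCV(
    SVC(),
    {
        "C": Real(1e-3, 1e3, prior="log-uniform"),
        "gamma": Real(1e-4, 1e1, prior="log-uniform"),
        "kernel": Categorical(["rbf", "linear"]),
    },
    n_iter=20,   # fixed budget of settings to evaluate
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```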
00:12:57.700 | Next topic, classifier comparison.
00:13:00.940 | This is a short one, but this can be important.
00:13:02.780 | Suppose you've assessed two classifier models.
00:13:05.580 | Their performance is probably different in some way.
00:13:09.000 | What can you do to establish whether
00:13:11.180 | these models are different in any meaningful sense?
00:13:13.900 | I think there are a few options.
00:13:15.220 | The first would be practical differences.
00:13:17.660 | If they obviously make
00:13:19.160 | a large number of different predictions,
00:13:21.080 | you might be able to quantify that difference in terms of
00:13:24.140 | some actual external outcome.
00:13:26.420 | That's a really good scenario to be in.
00:13:29.320 | You could also think about confidence intervals
00:13:32.100 | to further bolster the argument that you're making.
00:13:34.940 | This will give us a picture of how
00:13:36.580 | consistently different the two systems are.
00:13:39.860 | If they are consistently different,
00:13:41.680 | then you have a very clear argument
00:13:43.360 | in favor of one over the other.
00:13:45.540 | The Wilcoxon signed-rank test is an accepted method in
00:13:49.460 | the field for assessing classifiers
00:13:51.900 | using methodologies that are similar to standard t-tests.
00:13:55.380 | I guess the consensus is just that
00:13:57.180 | the assumptions behind the Wilcoxon test
00:13:59.620 | are somewhat more aligned with classifier comparison.
00:14:02.940 | To do that, as well as confidence intervals,
00:14:05.380 | you will have had to run your model on
00:14:07.260 | lots of different settings to get a long vector of
00:14:10.860 | 10-20 scores to use as the basis for the stats testing.
00:14:15.580 | If that is too expensive,
00:14:17.260 | you could opt for McNemar's test.
00:14:19.820 | This is a comparison that you do over
00:14:22.300 | two single trained classifiers
00:14:25.900 | based on their confusion matrices.
00:14:27.980 | You only need one run.
00:14:29.540 | It will be unstable if the models are unstable,
00:14:32.440 | but it is a way of doing a stats test in the mode of
00:14:36.460 | the chi-squared test to give you
00:14:38.460 | some information about how
00:14:39.860 | two fixed artifacts compare to each other.
00:14:42.420 | Not as strong as the previous methods,
00:14:44.860 | but nonetheless useful and arguably better than nothing.
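Here is a minimal sketch of both tests; the score vectors and the label/prediction arrays are hypothetical placeholders standing in for your own experimental results.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Wilcoxon signed-rank test: paired scores from repeated runs or splits.
scores_a = np.array([0.71, 0.73, 0.70, 0.74, 0.72, 0.75, 0.71, 0.73, 0.72, 0.74])
scores_b = np.array([0.69, 0.72, 0.68, 0.73, 0.70, 0.72, 0.69, 0.71, 0.70, 0.72])
print(wilcoxon(scores_a, scores_b))

# McNemar's test: one run each, compared example by example on the test set.
def mcnemar_test(y_true, preds_a, preds_b):
    """Build the 2x2 agreement table and run the chi-squared-style test."""
    a_right = preds_a == y_true
    b_right = preds_b == y_true
    table = [[np.sum(a_right & b_right), np.sum(a_right & ~b_right)],
             [np.sum(~a_right & b_right), np.sum(~a_right & ~b_right)]]
    return mcnemar(table, exact=False, correction=True)
```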
00:14:50.340 | A special topic for deep learning.
00:14:54.180 | How do you assess models without convergence?
00:14:56.700 | This never used to arise.
00:14:58.300 | Back in the days of all linear models,
00:15:00.900 | all these models would converge more or less
00:15:02.740 | instantly to epsilon loss,
00:15:05.180 | and then you could feel like that was how
00:15:07.180 | you'd move forward with assessing them.
00:15:09.660 | But now with neural networks,
00:15:11.980 | convergence has really taken center stage,
00:15:14.460 | and it's in a complicated way that it takes center stage.
00:15:17.140 | First, these models rarely converge to epsilon loss,
00:15:21.040 | and therefore it's a non-issue whether or
00:15:23.740 | not that would be your stopping criterion.
00:15:25.900 | In addition, they might converge at
00:15:27.700 | different rates between runs,
00:15:29.500 | and their performance on the test set might not
00:15:32.540 | even be especially related to how small the loss got.
00:15:36.620 | You need to be thoughtful about
00:15:39.220 | exactly what your stopping criteria will be.
00:15:41.820 | Yes, sometimes a model with
00:15:43.340 | low final error turns out to be great,
00:15:45.540 | and sometimes it's worse than the one
00:15:47.220 | that finished with a higher error.
00:15:48.700 | This might have something to do with
00:15:50.060 | overfitting and regularization,
00:15:52.380 | but the bottom line here is,
00:15:53.920 | we don't know a priori what's going on.
00:15:56.120 | This is very experiment-driven.
00:15:59.100 | One thing to think about for stopping criteria in
00:16:03.060 | general is what we call incremental dev set testing.
00:16:06.700 | To address the uncertainty that I just reviewed,
00:16:09.640 | you regularly collect information about
00:16:12.140 | dev set performance as part of the training that you're doing.
00:16:15.760 | For example, at every 100th iteration,
00:16:18.260 | you could make predictions on the dev set and
00:16:20.860 | store the resulting vector of predictions.
00:16:24.140 | All the PyTorch models for this course have
00:16:26.780 | an early stopping parameter and a bunch of
00:16:28.960 | related parameters that will help you set this up in
00:16:31.900 | a way that will allow you to do this incremental testing,
00:16:34.960 | and with luck, get the best model.
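Here is a generic sketch of the idea in PyTorch, not the course's own implementation; model, optimizer, train_loader, the dev data, and score_fn are all placeholders.

```python
import copy
import torch

def train_with_early_stopping(model, optimizer, loss_fn, train_loader,
                              X_dev, y_dev, score_fn,
                              eval_every=100, patience=5):
    best_score, best_state, since_best, step = -float("inf"), None, 0, 0
    for batch_X, batch_y in train_loader:
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(batch_X), batch_y)
        loss.backward()
        optimizer.step()
        step += 1
        if step % eval_every == 0:                  # incremental dev-set check
            model.eval()
            with torch.no_grad():
                score = score_fn(y_dev, model(X_dev))
            if score > best_score:
                best_score = score
                best_state = copy.deepcopy(model.state_dict())
                since_best = 0
            else:
                since_best += 1
            if since_best >= patience:              # stop on dev performance, not loss
                break
    if best_state is not None:
        model.load_state_dict(best_state)           # restore the best dev-set model
    return model, best_score
```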
00:16:37.580 | If you need a little bit of motivation for this,
00:16:39.980 | here are some plots from an actual model.
00:16:42.320 | You can see the loss going down very
00:16:44.460 | steadily across different iterations of training.
00:16:47.700 | But the performance on the dev set tells a very different story.
00:16:51.780 | You can see based on this performance that at
00:16:54.160 | a certain very early point in this process around iteration 10,
00:16:58.420 | our performance was actually getting worse
00:17:00.560 | even though the loss was going down.
00:17:02.620 | That just shows you that sometimes the steady loss curve is
00:17:06.340 | a picture of overfitting and not of
00:17:09.220 | your model actually getting better at the thing that you care about.
00:17:13.140 | Think carefully about your stopping criteria.
00:17:16.580 | In general though, I think we might want to take
00:17:20.100 | a more expansive view of how we do evaluation in this mode.
00:17:24.380 | Here what I'm pitching is that we look at
00:17:26.900 | the entire performance curve maybe with
00:17:29.740 | confidence intervals so we can make some confident distinctions.
00:17:33.180 | All these plots for different conditions across models we were
00:17:36.420 | comparing have epochs along the x-axis and F1 along the y-axis.
00:17:42.060 | If you step back, what I think you see is that our Mittens model,
00:17:46.020 | the one that we were advocating for,
00:17:48.400 | is the best model on average but largely in early parts of training.
00:17:53.240 | If you train for long enough,
00:17:54.740 | a lot of the distinctions disappear.
00:17:57.820 | If you have a fixed budget of epochs,
00:18:00.500 | Mittens is a good choice.
00:18:01.760 | If you don't care about the resources,
00:18:03.780 | it might not be so clear which one you should choose,
00:18:06.300 | maybe it doesn't matter.
00:18:07.560 | That's a nuanced lesson that I think is really powerful to
00:18:10.960 | teach and we can't really teach it if all we
00:18:13.740 | do is offer point estimates of model performance.
00:18:16.380 | We really need to see the full curve to see that level of nuance.
00:18:20.780 | I know that NLPers love their results tables,
00:18:23.980 | you should have results tables,
00:18:25.660 | but maybe you could supplement them with some figures that would
00:18:28.700 | give us a fuller picture of what was going on.
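Here is a minimal sketch of the kind of figure I have in mind, assuming a hypothetical array of dev-set F1 scores with one row per run and one column per epoch; the confidence band is a simple normal approximation across runs.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_curve(runs, label):
    """Plot the mean performance curve with a 95% band across repeated runs."""
    runs = np.asarray(runs)                      # shape: (n_runs, n_epochs)
    epochs = np.arange(1, runs.shape[1] + 1)
    mean = runs.mean(axis=0)
    half_width = 1.96 * runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])
    plt.plot(epochs, mean, label=label)
    plt.fill_between(epochs, mean - half_width, mean + half_width, alpha=0.2)

# plot_curve(model_a_runs, "model A"); plot_curve(model_b_runs, "model B")
# plt.xlabel("Epochs"); plt.ylabel("F1"); plt.legend(); plt.show()
```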
00:18:32.620 | Final topic, the role of random parameter initialization.
00:18:37.180 | Most deep learning models have parameters that are random at the start.
00:18:40.880 | Even if they're pre-trained,
00:18:42.100 | there are usually some random parameters in the mix there.
00:18:45.500 | This is clearly meaningful for
00:18:47.700 | the non-convex problems that we're posing.
00:18:50.420 | Simpler models can be impacted as well,
00:18:52.560 | but it's especially pressing in the deep learning era.
00:18:55.100 | Here is a relatively recent paper showing actually that
00:18:57.980 | different initializations for neural sequence models
00:19:00.820 | led to statistically significant differences in performance.
00:19:05.280 | A number of recent systems were actually indistinguishable in terms of
00:19:09.340 | their raw performance once we took this source of variation into account.
00:19:13.920 | There's a related issue here of
00:19:15.780 | catastrophic failure from unlucky initializations.
00:19:19.100 | Sometimes that happens, sometimes you see it,
00:19:21.920 | sometimes it's hard to notice.
00:19:23.880 | There's a question of how to report that as part of overall system performance.
00:19:28.300 | We need to be reflective about this.
00:19:30.540 | Maybe the bottom line here just comes from
00:19:32.940 | the associated notebook for this unit, evaluation methods.
00:19:36.160 | There, I showed you, with the classic XOR problem,
00:19:39.880 | which has always been used to
00:19:41.220 | motivate the powerful models that we work with now,
00:19:44.200 | that you don't reliably get success from
00:19:46.580 | a simple feed-forward network on that problem.
00:19:49.320 | Eight out of 10 times it succeeds,
00:19:51.620 | and two out of the 10 times it's a colossal failure.
00:19:54.680 | That is a glimpse of just how important initialization can be.
00:19:59.060 | Since we don't analytically understand why we're seeing this variation,
00:20:03.120 | the best response if you can afford it is a bunch more experiments.
00:20:08.620 | Let's wrap up. A lot of this in the back of my mind is
00:20:13.460 | oriented toward helping you with the protocols,
00:20:15.780 | which is a document associated with your final project,
00:20:19.140 | where you give us the nuts and bolts of the project,
00:20:21.800 | and try to identify any obstacles to success.
00:20:25.180 | All the lessons we've been teaching throughout this series are
00:20:28.460 | oriented toward helping you think critically about
00:20:30.840 | this protocol and ultimately set up a firm foundation for your project.
00:20:36.020 | With that out of the way, I thought I would look
00:20:38.060 | ahead to this moment and the future.
00:20:41.340 | I think this is an ideal moment for innovation in surprising new places.
00:20:46.260 | Architecture innovation, way overrated at this point.
00:20:50.500 | I mean, it's still important,
00:20:51.900 | but it is overrated relative to how much of it people do.
00:20:55.440 | Metric innovation, way underrated.
00:20:58.760 | It's been a theme of these lectures that we need to think very carefully about
00:21:02.880 | our metrics because they are our guideposts
00:21:05.360 | toward whether we're succeeding or not.
00:21:08.040 | Relatedly, evaluations in general.
00:21:10.420 | We need innovation there.
00:21:11.620 | This is way underrated by the community at this point.
00:21:14.840 | Task innovation, underrated.
00:21:16.900 | We are seeing some things,
00:21:18.080 | so it's not as underrated as metrics and evaluations,
00:21:19.840 | but still we should all be participating in this area.
00:21:23.280 | Then finally, exhaustive hyperparameter search.
00:21:26.520 | You need to weigh this against other factors.
00:21:29.120 | There is more at play here than just that pristine scientific paradigm.
00:21:33.440 | We need to think about costs in every sense and
00:21:36.600 | how it relates to the innovations that we're likely to see.
00:21:40.000 | I'm pitching a pragmatic approach,
00:21:41.980 | but I'm also exhorting you to think
00:21:44.360 | expansively about how you might participate in pushing the field forward.