Stanford XCS224U: NLU | NLP Methods and Metrics, Part 1: Overview | Spring 2023
This screencast kicks off our series on methods and metrics. The overarching goal of this series is to help you work your way through your project and make smart choices around experiments and data and metrics and everything else, and also to be a trusted companion and sounding board as you confront really hard decisions around doing this kind of research.
Let's start with an overview of the goals and the current issues. Fundamentally, we're trying to help you with your projects. This comes down to things like managing data, establishing baseline systems, comparing models, optimizing models, and, maybe most importantly, navigating the many hard decisions that come up along the way. We can provide some high-level guidance here, and then you should look to your mentor from the teaching team to really help you optimize these things and get the most out of your project.
For associated materials, we have a big notebook on evaluation metrics that will let you get hands-on with some of the detailed technical material I cover in this series. There's also a really wonderful set of pages on the scikit-learn site about how to think about model evaluation in AI more generally. And we have an extensive notebook that lets you get hands-on with the more methodological material we're going to talk about; that notebook substantiates with experiments some of the claims I'll make at a high level here.
Then there are a few things you could check out in terms of reading, although honestly, there is much less reading than you would expect. This might be a sign that the field is still maturing. In the hard sciences and the behavioral sciences, there would be an extensive methods literature, whereas we seem to assume that people will just pick this up as they go. I'm not sure that's to the credit of the field, but that is the current situation. The thinking behind this unit is that we can provide somewhat more systematic guidance on all of these really crucial things that serve as the foundation for your projects.
I wanted to start by saying one really important thing about how we think about evaluating your final projects. The fundamental point is that we will never evaluate your project based on how good the results are, where we all know what "good" means: being at the top of the leaderboard or something similar.
I recognize that publication venues in the field do this. The rationale there is that they have additional constraints on space, and that leads them to favor positive results for new developments over negative results. As an aside, I think we can all see that this has a distorting effect on the scientific literature, and it is ultimately hindering progress, because negative results can be so powerful and useful in terms of instructing people about where to invest their energy and, in particular, where not to. So this is a kind of sad commentary, but I'm just observing that it's true.
Fundamentally, for our course, we are not subject to any constraints like that, so we can do the right and good thing of valuing positive results and negative results alike. What this means is that you could be at the top of the leaderboard according to your own experimentation, but if your report is shallow and doesn't really explain how that number got there or what the thinking behind it was, you won't do well.
More importantly, and conversely, you could try something really creative and ambitious. It might fail, but it could still be an outstanding contribution because of how careful you were about the things we actually care about: the appropriateness of the metrics, the strength of the methods, and the extent to which the paper is open and clear-sighted about the limitations of the ideas.
There is a wonderful movement in the field toward increasingly emphasizing finding the outer limits of our ideas and describing those limitations in our papers. I would exhort you to do similar things, and I think that, on average, this kind of work will lead you to more fruitful systems and more rewarding research experiences.
So I feel really good about this, and we can talk in the next unit about how to navigate tricky things related to this publication bias. I think fundamentally we can do right there as well. The shift in perspective you need to make is to move away from papers that are a competition favoring your chosen system and toward papers that openly evaluate scientific hypotheses and muster as much evidence as they can. Once you move into that mode, you're no longer thinking about how to pick a winner, but rather about strength of evidence and the quality of the argument. We can get there, and everyone will be happier as a result, but it does require a shift in perspective away from the norms we often hear about, which center on winning competitions.
For methods, I've put this under the heading "How Times Have Changed," and unfortunately I don't have very many happy lessons to teach here. Think back to an earlier era of the field, when models were small and training was cheap. In that era, you could develop your complete system on tiny samples of your data. Once you had it working, you would do regular cross-validation using only the training data, a nice pristine experimental setting. You would evaluate only very occasionally on your held-out dev set, in an effort to avoid hill-climbing on that dev set and ultimately overfitting to it. Then, in the final stages of your project, you would do a complete round of hyperparameter optimization, select the best model, and run your final test evaluation exactly once.
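To make that protocol concrete, here is a minimal sketch in scikit-learn. It is not from the course materials; the dataset, model, and hyperparameter grid are placeholders chosen only to illustrate the shape of the workflow.

```python
# A minimal sketch of the classical protocol described above, using placeholder
# data, a placeholder model, and a placeholder grid: cross-validate on the
# training data only, check the dev set sparingly, and run the test evaluation
# exactly once at the end.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Carve out held-out dev and test splits up front.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Regular cross-validation using only the training data.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# Occasional checks against the held-out dev set.
print("dev accuracy:", search.score(X_dev, y_dev))

# One and only one evaluation on the test set, at the very end.
print("test accuracy:", search.score(X_test, y_test))
```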
There is something deeply right about this as a scientific picture for our field.
In 2023, you could develop your system on tiny samples of your data, as before. However, for step two, either there is no training data for your task, or cross-validation on it would cost $20,000 and take six months. So already we are off course in terms of our ideal scientific picture. Relatedly, the dev set is frequently and crucially important to optimization, so you have to keep your fingers crossed that it really is a superb proxy for test, because you're going to orient all of your optimization processes around it. And then, for the final stage, either hyperparameter tuning would cost $100,000 and take 10 years, or there are no hyperparameters but test runs cost $4,000 because you're calling an OpenAI model or something like that.
The core tenets of the previous era remain sound. There's something really good about them, but we cannot enforce them. If we did, only the richest organizations could follow them, and that would restrict participation in the field in a way that would be terrible. If you're creating a ranked list of priorities, you have to put broad participation way above those very pristine and idealized methods that we used to be able to get away with back when every model trained quickly on a consumer laptop.
So what you have to do is articulate your methods and the rationale behind them, including practical details like resource constraints and heuristics that you had to adopt along the way.
However, two rules should remain absolutely fixed. First, you never do any model selection, even informally, based on test-set evaluations. I know there are people out in the field violating this rule, but don't follow their example. It's really important for us, especially when we think about the high-stakes scenarios our systems could be deployed in, to have pristine test evaluations that give us an honest look at how our systems will behave on unseen examples. You compromise that entirely the moment you choose a model based on test-set numbers.
Relatedly, as you construct baselines, ablations, and comparisons with the literature, you have to strive to give every system you evaluate the best chance of succeeding. You should never, ever stack the deck in favor of a system you are advocating for. All of these models have hyperparameters; you could pick really bad settings for models you disfavor and work really hard to find optimal settings for models you favor. In that way, you would appear to have won some kind of competition, but you would have compromised the very foundations of your project. What you need to do instead is give every system its best chance: work really hard to make all of them competitive. The result will be better science, results you can trust, and, ultimately, you will go farther in the field if you are rigorous about this rule as well.
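Here is a minimal sketch of those two rules in code; the data, models, and hyperparameter grids are invented placeholders. The point is that every system, including the baseline, gets the same tuning budget, selection happens on the dev set only, and the test set is touched exactly once at the end.

```python
# A sketch of the two fixed rules: identical tuning effort for every system,
# model selection on dev only, and a single test-set run at the very end.
# The data, models, and grids are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Every system, including the baselines, gets the same search budget.
systems = {
    "favored_model": (LogisticRegression(max_iter=1000),
                      [{"C": c} for c in (0.01, 0.1, 1.0, 10.0)]),
    "baseline":      (DecisionTreeClassifier(random_state=0),
                      [{"max_depth": d} for d in (2, 4, 8, 16)]),
}

selected = {}
for name, (model, grid) in systems.items():
    best_score, best_params = -1.0, None
    for params in grid:
        score = model.set_params(**params).fit(X_train, y_train).score(X_dev, y_dev)
        if score > best_score:  # selection uses the dev set only
            best_score, best_params = score, params
    selected[name] = (best_score, best_params)

# Only after every dev-based decision is frozen do we touch the test set, once.
for name, (dev_score, params) in selected.items():
    model, _ = systems[name]
    final = model.set_params(**params).fit(X_train, y_train)
    print(name, "dev:", round(dev_score, 3), "test:", round(final.score(X_test, y_test), 3))
```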
I have put this next part under the heading "How Times Should Change," and I do feel like they are changing. The overarching idea to have in mind here is Strathern's law: when a measure becomes a target, it ceases to be a good measure. We have to be vigilant and make sure we don't fall into this trap.
With this in mind, I'm always reminded of leaderboards. Leaderboards are central to the way the field works; we all think about them and use them as markers of progress. The good: leaderboards can be an objective basis for comparison, and that creates opportunities for even wild-seeming ideas to get a hearing. In fields without leaderboards, these wild ideas are very often rejected out of hand by the community with no evaluation, whereas leaderboards at least give people in our field a fair, quantitative hearing. The bad, though, and this can get really bad: with leaderboards we have a constant conflation of benchmark improvements with actual progress, when we know, in fact, that benchmarks are at best partial proxies for the problems we care about.
Relatedly, we have a conflation of benchmarks with empirical domains. People say things like "OCR is solved" or "question answering is solved." What they really mean is that certain benchmarks have been solved, and we are all aware by now that claims about a benchmark and claims about the underlying capability are radically different. Even in the way we talk, I find we're often guilty of a third thing: conflating benchmark performance with a capability. We see that a system does well at question answering on SQuAD and we assume it's a good question answerer, even though we know in our hearts that these two things are very different.
Moving forward, I think we should bring in more of the good, more dimensions of good, and remove our dependence on these bad assumptions that we often make. The fundamental issue, I would say, is that the metrics you choose, including the ones that get embedded in leaderboards, are tied up with the goals you're actually trying to achieve out in the world. Too often in the field, we don't actually make that connection.
Let me offer some scenarios that should get you thinking about how you would approach evaluation differently with different metrics in mind. Suppose you're in a scenario where missing a safety signal costs lives and human review of the system's predictions is feasible. What kind of metric would you favor for such a system? Conversely, suppose exemplars need to be found in a massive dataset. What kind of metrics would you use to evaluate systems in that context? I think they would be very different from the first case. In the second scenario, with the exemplars, we can afford to miss a lot of cases, whereas in the first scenario every missed case costs lives and we have the opportunity to review flagged cases by hand. So our values are oriented very differently.
Suppose specific mistakes are deal-breakers while others hardly matter. Now you want a metric that gives credit and demerits to different kinds of mistakes and good predictions in different ways, to capture those underlying ideals. Or suppose you're not really doing classification anymore but ranking; again, you should reach for good metrics designed for ranking.
Suppose the solution needs to work over an aging cell network. Now your obsession with accuracy should largely go out the window in favor of systems that can run on very constrained hardware: low energy, low power, very fast, all of those things. Suppose the solution cannot provide worse service to specific groups. Standard machine learning models will often favor majority groups, and if your ultimate allegiance is to making sure the system is equitable across groups, you will have to change your metrics from the norm, and maybe even your underlying modeling practices. Suppose specific predictions need to be absolutely blocked. Now you're in totally different territory, where some kinds of error cost you infinitely much and others hardly matter. Again, a very different scenario from the norm.
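To make the contrast concrete, here is a small, hedged sketch of how three of these scenarios might map onto different metrics: a recall-weighted F-beta score for the safety setting, precision@k for exemplar finding, and a per-group recall gap for the equity setting. The arrays are tiny invented examples, not data from any benchmark.

```python
# A sketch of tailoring the metric to the scenario rather than defaulting to F1.
# The label arrays, scores, and group assignments here are tiny invented examples.
import numpy as np
from sklearn.metrics import fbeta_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# Missed safety signals cost lives and human review is feasible: weight recall heavily.
print("F2 (recall-weighted):", fbeta_score(y_true, y_pred, beta=2.0))

# Finding exemplars in a massive dataset: what matters is precision at the top
# of a ranked list, not coverage of every positive case.
scores = np.array([0.9, 0.1, 0.4, 0.8, 0.2, 0.7, 0.95, 0.3, 0.6, 0.05])
k = 3
top_k = np.argsort(-scores)[:k]
print("precision@3:", y_true[top_k].mean())

# Service must not be worse for specific groups: report the per-group gap,
# not just the overall number.
group = np.array(["a", "a", "a", "b", "b", "b", "a", "b", "a", "b"])
recalls = {g: recall_score(y_true[group == g], y_pred[group == g]) for g in ("a", "b")}
print("per-group recall:", recalls, "gap:", abs(recalls["a"] - recalls["b"]))
```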
In the field, tragically, the scientific literature seems to offer one answer to essentially every scenario: use F1 and related accuracy-style metrics as your measure of system quality. If you review this list of scenarios, you can see that F1 is not really appropriate for any of them. F1 is just what we as researchers choose when we have no information about the application setting. When we do have that information, we should be tailoring our metrics to the specific scenario. I worry that the lesson we project out to the world is that you needn't bother: we don't do it, and we are purported to be experts, so why would anyone else do it? Even though, as experts, we can see that these different scenarios call for very different metrics.
Relatedly, if you do survey the scientific literature, you find a kind of overarching obsession with performance: accuracy, F1, all of those things. This is really nicely documented in a lovely and creative paper, "The Values Encoded in Machine Learning Research." What I've done here is distill their evidence down into a kind of cartoonish picture in which font size conveys how prominent each value is in our literature. In the largest font, unsurprisingly, is performance, dominating every other value we might want reflected in our research.
Close behind in second place, but actually pretty distant, is efficiency. Then you get interpretability, but notice that's interpretability for researchers; the interpretability work we talked about in the previous unit is very much oriented toward technical audiences. Applicability, robustness, and scalability are pretty well represented. And then I used a different, lighter font for values that are very distant in this ranking. We all recognize that these are crucial aspects of successful NLP systems, but they are hardly ever reflected in our practices around hypotheses and system evaluation. Really, if someone were just consuming our literature, what they would take away is, again, this obsession with accuracy and related notions of performance.
So we should push back; we should elevate some of those other values in the metrics and leaderboards we use. I've put this under the heading of multi-dimensional leaderboards. I've been involved with one such effort, Dynaboard; DAWNBench and ExplainaBoard are others. These are all efforts to provide many more dimensions of evaluation for our systems and to get much richer pictures of what's actually happening.
In this context, I would like to mention Dynascoring. I think this is a really powerful way to bring in multiple metrics and even let the person behind the system decide which metrics to favor and to what degree. It's such a powerful idea that I've offered you a notebook that implements Dynascoring and offers tips on how to use it, so that you too can explore using Dynascoring to synthesize across the multiple things you measure.
Let me give you a sense for why this could be so powerful. I have here a real leaderboard for question answering systems. The DeBERTa model is in first place according to my Dynascore, which was created by giving a lot of weight to performance and then equal weight to throughput, memory, fairness, and robustness. However, with Dynascoring I can adjust those weights. Suppose I decide that I really want a system that is highly performant but also fair according to my fairness measure. I adjust the Dynascore to put a weight of five on fairness, and I reduce the weights on throughput, memory, and robustness. Now the previously first-place system is in second place, and ELECTRA-large has become the top system. Of course, different weightings of the metrics will adjust the ranking in different ways. That shows you that there is no one true ranking, but rather rankings only with respect to the particular priorities, values, and measurements that I bring. That is the essence of Dynascoring: to be transparent about those values and also to reflect them in good old-fashioned leaderboards, as we're doing here.
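As a rough illustration of the idea, and not the official Dynascore formula (which normalizes metrics more carefully), here is a simple weighted-average sketch showing how re-weighting the same measurements can flip a ranking. The systems and their numbers are invented.

```python
# A simplified, weighted-average stand-in for Dynascore-style aggregation.
# The two systems and their metric values below are invented for illustration.
import numpy as np

metrics = ["performance", "throughput", "memory", "fairness", "robustness"]
systems = {
    "system_a": np.array([0.85, 0.60, 0.55, 0.55, 0.65]),
    "system_b": np.array([0.80, 0.35, 0.40, 0.90, 0.60]),
}

def rank(weights):
    w = np.array([weights[m] for m in metrics], dtype=float)
    w = w / w.sum()  # normalize the weights so scores stay comparable
    scores = {name: float(vals @ w) for name, vals in systems.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Heavy weight on performance, equal weight on everything else: system_a leads.
print(rank({"performance": 4, "throughput": 1, "memory": 1,
            "fairness": 1, "robustness": 1}))

# Re-weight to prioritize fairness: the ranking flips and system_b leads.
print(rank({"performance": 4, "throughput": 0.5, "memory": 0.5,
            "fairness": 5, "robustness": 0.5}))
```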
In this context, when we talk about evaluation and about different metrics, people sometimes push back: this is all too technical, too customized, too intricate; what we should do is something more like the Turing test. After all, that was the ultimate test in some sense. The idea is that a human and a computer are interacting: the human is trying to figure out whether it's a computer, and the computer is doing its best to seem human. In that way, we're supposed to get a good diagnosis of general system quality and intelligence.
The first Turing test was reported in Shieber 1994. In that test, the Shakespeare expert Cynthia Clay was thrice misclassified as a computer, on the grounds that no human could know that much about Shakespeare. That's an instance of people not really knowing what human expertise looks like in its full variety. In 2014, a very simple AI called Eugene Goostman passed the Turing test. It did so by adopting the persona of a 13-year-old boy: when it was rude, or appeared distracted because it was confused about what the human was trying to do, people just chalked that up to the fact that 13-year-old boys are often rude and distracted.
Google Duplex is a real, sophisticated AI system that routinely interacts with people over the phone. Even though it announces itself as an AI right at the start of the conversation, because it does this by law, people often lose track of that information and come to believe they are talking with a human. Relatedly, now that we've moved into this mode of doing a lot of natural language generation, we're all discovering that people are not good at distinguishing human-written texts from texts that come from our best large language models. Especially with the stories of Duplex and the LLMs, we should reflect on the fact that all of us are probably constantly failing Turing tests, in some cases with sophisticated AIs, but in some cases with ones that are actually pretty simple. There are cognitive biases around social interaction that make this not such a reliable test of machine intelligence.
There's another dimension to this that we should think about in the context of evaluation, and that is how we estimate human performance. My summary is that we estimate human performance by forcing humans to do machine tasks and then saying that that's how humans actually perform.
Let me give you an example in the context of natural language inference. Imagine you're a crowd worker who has been asked to label premise/hypothesis pairs as neutral, entailment, or contradiction. You get a little training, and after the training you see that "a dog jumping" and "a dog wearing a sweater" are neutral with respect to each other, because neither can be inferred from the other. Then you're given the pair "turtle" and "linguist," and you think, "Well, I can imagine a turtle linguist somewhere in some possible world, but I was told this was a common-sense reasoning situation, so I'll say contradiction, because no actual turtles are linguists."
But then you come to a photo of a racehorse and a photo of an athlete, and you're asked to assign a label, and you think, "Huh, I haven't really thought about this before. Is a racehorse an athlete?" You might decide that you have a fixed view on this. You say, "Of course a racehorse could be an athlete," or "Of course not." But the really fundamental thing is that you might be unsure what other people think about this, and in turn you might feel unsure about what label you're supposed to assign.
The human thing is to discuss and debate, to figure out why the question is being asked and what people are thinking about related to the issues. But what we do instead is block all of that interaction and simply force crowd workers to choose a label, and then we penalize them, in effect, to the extent that they don't choose the label everyone else chose, even though all of us feel uncertainty.
Here's another pair: a chef using a barbecue, and a person using a machine. Is that entailment? I think it probably depends on the situation, the goals, the assumptions, all of that. The human thing is to discuss those points of uncertainty and then assign a label, but we simply block that when we do crowdsourcing.
So when you hear an estimate of human performance, you should remember that the humans were probably not allowed to do most human things, like saying, "Let's discuss this." Human performance in these contexts really means the average performance of harried crowd workers doing a machine task repeatedly. We can all do that mental shorthand, but out in the world, people hear "human performance" and they think of human performance in the most significant sense. We should be aware that that's not what was measured, and we should push back against the assumption that the grand interpretation is what we mean, when in fact the narrow procedure is what we did.
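For concreteness, here is a sketch of one common recipe behind such "human performance" numbers: score each worker's forced-choice label against the majority vote of the remaining workers, with no discussion allowed. The annotations below are invented, and real projects differ in the details.

```python
# A sketch of one common recipe for "human performance": each crowd worker's
# forced-choice label is scored against the majority of the remaining workers'
# labels. The annotations are invented (E=entailment, N=neutral, C=contradiction).
from collections import Counter

# Five workers label four premise/hypothesis pairs.
annotations = [
    ["E", "E", "E", "N", "E"],   # broad agreement
    ["N", "N", "C", "N", "N"],
    ["C", "N", "C", "E", "C"],   # a "racehorse/athlete"-style unclear case
    ["N", "C", "E", "N", "C"],   # genuine disagreement drags the estimate down
]

correct, total = 0, 0
for labels in annotations:
    for i, label in enumerate(labels):
        others = labels[:i] + labels[i + 1:]
        majority = Counter(others).most_common(1)[0][0]
        correct += int(label == majority)
        total += 1

print(f"estimated 'human performance': {correct / total:.2f}")
```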
I would say that we're looking for evaluations that sit somewhere between two poles. On one side: can a system perform more accurately on a friendly test set than a human forced to do that same machine task? That is my somewhat cynical paraphrase of standard evaluations. On the other, we don't want to swing all the way to: can a system perform like a human in open-ended interaction? That is a very particular question, and it's very thorny. In the middle, there is lots of fruitful territory.
In the spirit of our previous units, we could ask: can a system behave systematically, even on inputs unlike those it was trained on? A system that can might be on its way to being one we can trust, even if it currently makes plenty of mistakes. Can a system assess its own confidence and know when not to make a prediction? Our systems in AI used to fail on every unanticipated input; now they give an answer, seemingly with confidence, no matter what you throw at them. We need systems to withhold information when they're just not sure it's good information.
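Here is a minimal sketch of that kind of selective prediction: the model abstains whenever its own confidence falls below a threshold, and we report accuracy only on the cases it chooses to answer, alongside coverage. The data, model, and threshold are placeholders.

```python
# A minimal sketch of selective prediction: abstain when confidence is low,
# trading coverage for accuracy. Data, model, and threshold are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)
confidence = probs.max(axis=1)
preds = probs.argmax(axis=1)

threshold = 0.75
answered = confidence >= threshold   # abstain on everything below the threshold

coverage = answered.mean()
accuracy_when_answering = (preds[answered] == y_test[answered]).mean()
print(f"coverage: {coverage:.2f}, accuracy when answering: {accuracy_when_answering:.2f}")
print(f"accuracy with no abstention: {(preds == y_test).mean():.2f}")
```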
And maybe most fundamentally, we should ask: can a system make people happier and more productive? This would move us far away from automatic evaluation and toward things like studies with real users. But ultimately, I feel like this is our goal, and we might as well design evaluations that target it directly.
I think that times should change, and they are changing. Assessment today, or maybe yesterday, is one-dimensional, centered on accuracy, and largely insensitive to context or use case. The terms are set by the research community, whether we realize it or not. The metrics are often opaque, the assessments are often hard to understand deeply, and they are tailored to machine tasks right from the get-go in the way they are constructed. Assessment tomorrow, or maybe today, depending on the work you all do, could be multidimensional. It could be highly sensitive to context and use case. The terms could be set by the stakeholders and system designers out in the world, or by the users themselves. Ultimately, the judgments should be made by users, and the tasks should look much more like the things people actually do. We have entered an era in which I think we can start to implement all of these visionary ideas, and so I would encourage you to think about how you could push forward in these directions with the research you do for this course.