
Stanford XCS224U | Behavioral Eval of NLU Models, Pt 2: Analytical Considerations | Spring 2023



00:00:00.000 | Welcome back everyone.
00:00:06.440 | This is part 2 in our series on
00:00:08.360 | advanced behavioral methods for NLU.
00:00:10.640 | We're going to talk about
00:00:11.800 | some analytic considerations that
00:00:13.580 | surround this kind of assessment and analysis.
00:00:16.520 | The key questions for us are,
00:00:18.720 | what can behavioral testing tell us?
00:00:21.520 | Just as crucially, what can't it
00:00:23.680 | tell us about the nature of our systems?
00:00:27.040 | I said this in the first part of
00:00:29.080 | these screencasts, but it bears repeating.
00:00:31.480 | We talk about adversarial testing as a shorthand,
00:00:34.360 | but there is no intrinsic need to
00:00:36.160 | be adversarial in all cases.
00:00:38.140 | We could just be trying to
00:00:39.600 | explore what our systems are capable of.
00:00:42.280 | Here are some example questions
00:00:44.280 | that you might be thinking of in this mode.
00:00:46.160 | They start off purely exploratory
00:00:48.520 | and they end up being quite adversarial.
00:00:50.920 | For example, has my system
00:00:52.960 | learned anything about numerical terms?
00:00:54.920 | You could ask this in an open-ended way and
00:00:57.440 | construct a test dataset that
00:00:59.120 | would help you address this question.
00:01:01.380 | Does my system understand how negation works?
00:01:03.760 | Same thing, maybe you did an audit of
00:01:05.440 | the training data and you found that
00:01:07.280 | negation was sparsely represented and you want to ask,
00:01:10.560 | given what it did see,
00:01:12.040 | was it able to induce
00:01:13.520 | a reasonable theory of how negation works?
00:01:15.960 | Behavioral testing could help you address that question.
00:01:19.160 | Does my system work with a new style or genre?
00:01:22.600 | That's an important, subtle domain shift.
00:01:25.320 | Maybe it was trained on text from newspapers and you would
00:01:28.200 | just want to find out whether
00:01:30.120 | its behavior is accurate on Twitter data.
00:01:33.480 | This system is supposed to know about numerical terms,
00:01:36.840 | but here are some test cases that are
00:01:38.880 | outside of its training experiences for such terms.
00:01:42.280 | What will happen? We are now moving into
00:01:44.760 | a mode of being more thoroughly adversarial.
00:01:47.280 | We might have discovered that
00:01:49.280 | the system is not good at numerical terms,
00:01:51.980 | and now we're trying to expose
00:01:53.880 | that gap in its abilities.
00:01:56.040 | When applied to invented genres,
00:01:58.380 | that is, very unfamiliar kinds of inputs,
00:02:00.640 | does the system produce socially problematic,
00:02:03.120 | say stereotyped outputs?
00:02:04.800 | At this point, you're actively trying to construct examples
00:02:08.560 | that you know will be outside of
00:02:10.920 | standard experiences for the model in an effort to
00:02:13.760 | discover what its behaviors are like in
00:02:15.760 | those unusual tail situations.
00:02:18.400 | That is probably more thoroughly adversarial.
00:02:21.560 | Maybe the most adversarial of all would be exploring
00:02:25.440 | random inputs that lead
00:02:27.160 | the system to produce problematic outputs.
00:02:29.280 | This is the mode where you take
00:02:30.980 | normal examples and append a certain sequence of
00:02:33.360 | Unicode characters onto the end and see
00:02:35.840 | something very surprising happen as a result,
00:02:38.680 | as a way of auditing the system
00:02:41.260 | for gaps in its security in some general sense.
00:02:45.520 | All these things are on the table for us.
00:02:47.360 | These are all interesting behavioral tests to run.
00:02:51.160 | But behavioral testing has its limits.
00:02:55.040 | I think we all know this in our hearts,
00:02:57.200 | but it's worth dwelling on it.
00:02:59.220 | No amount of behavioral testing can truly offer
00:03:02.600 | you a guarantee about what our systems will be like.
00:03:06.400 | You are at the mercy of
00:03:08.500 | the set of examples that you decided to construct.
00:03:11.600 | I think this is pretty clear.
00:03:13.160 | This often goes under the heading of
00:03:14.760 | the limits of scientific induction.
00:03:17.880 | But let's linger over an example to see
00:03:20.200 | just how this can become pressing.
00:03:22.820 | For an illustration, I've got
00:03:24.320 | an even-odd model in the middle here,
00:03:26.120 | and I've drawn it as a big opaque rectangle.
00:03:28.520 | We don't know how this model works.
00:03:30.520 | But the promise of this model is that it takes in
00:03:33.360 | strings like four and predicts whether
00:03:36.120 | those strings refer to even or odd numbers.
00:03:39.120 | Here four has come in and it has predicted even.
00:03:42.200 | So far so good. 21 comes in and it predicts odd.
00:03:46.020 | Also good. 32, prediction of even.
00:03:50.000 | 36, prediction of even.
00:03:52.620 | 63, prediction of odd.
00:03:55.440 | This behavioral testing is going great and suggests that we
00:03:58.880 | have a very solid model of even-odd detection.
00:04:02.480 | But suppose I now expose for you how this model works.
00:04:06.720 | I show you the insides and what that reveals for
00:04:09.440 | you is that this model is just a simple lookup,
00:04:12.560 | and we got lucky on those five inputs because those are
00:04:15.400 | exactly the five inputs that the model was prepared for.
00:04:18.640 | It has this else clause that says odd.
00:04:21.680 | Now you know exactly how to expose a weakness of the system.
00:04:25.280 | When you input 22,
00:04:26.980 | that is not in the list and it defaults to
00:04:29.560 | its elsewhere case and says odd,
00:04:31.760 | and that is an incorrect prediction.
00:04:34.100 | Notice that it was not the behavioral test that gave us this insight,
00:04:38.160 | but rather being able to peek under the hood.
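To make the lookup concrete, here is a minimal Python sketch of the kind of model being described; the particular table entries and the "odd" default are assumptions for illustration, not the lecture's actual code.

```python
# A hard-coded lookup over exactly the five test inputs, with an else clause
# (the dictionary default) that predicts "odd" for everything else.
LOOKUP = {"4": "even", "21": "odd", "32": "even", "36": "even", "63": "odd"}

def even_odd_model_1(text: str) -> str:
    # Memorized answer if available; otherwise fall through to "odd".
    return LOOKUP.get(text.strip(), "odd")

print(even_odd_model_1("21"))  # "odd"  -- correct, but only because it was memorized
print(even_odd_model_1("22"))  # "odd"  -- incorrect: 22 falls to the default case
```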
00:04:41.240 | Now we move to even-odd model 2.
00:04:43.720 | It gets 22 right, it says even,
00:04:45.840 | and let's assume it gets all those green cases correct as well.
00:04:49.280 | Five comes in and it says odd,
00:04:51.720 | 89 comes in and it says odd,
00:04:54.280 | 56 comes in and it says even,
00:04:56.640 | and you think now we really have an excellent model of even-odd detection.
00:05:02.100 | But once again, now I let you look inside.
00:05:06.200 | What's revealed when you look inside is that this is
00:05:08.600 | a more sophisticated version of the same lookup.
00:05:11.960 | Now what this model does is look at
00:05:13.980 | the final token by splitting on white space and
00:05:16.760 | use that as the basis for classification decisions.
00:05:20.560 | If that final word is not in its lookup table,
00:05:23.740 | it defaults to predicting odd.
00:05:26.180 | Having seen that, we can now be adversarial.
00:05:29.020 | We feed in 16,
00:05:30.520 | it predicts odd by following that elsewhere case,
00:05:33.540 | and we have shown that the model has a flaw.
00:05:36.600 | But again, we saw the flaw not from our behavioral test,
00:05:40.500 | but rather from looking inside the model.
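A corresponding sketch of the second model might look like this; the specific final-word table is invented for illustration.

```python
# Classify by the final whitespace-separated token, via a lookup over final
# words; anything not in the table defaults to "odd".
FINAL_WORD_LOOKUP = {
    "one": "odd", "two": "even", "three": "odd", "four": "even", "five": "odd",
    "six": "even", "seven": "odd", "eight": "even", "nine": "odd",
}

def even_odd_model_2(text: str) -> str:
    final_word = text.strip().split()[-1]
    return FINAL_WORD_LOOKUP.get(final_word, "odd")

print(even_odd_model_2("twenty two"))  # "even" -- final word "two" is in the table
print(even_odd_model_2("sixteen"))     # "odd"  -- incorrect: falls to the default case
```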
00:05:42.900 | Now we move to model 3,
00:05:44.540 | it gets 16 right.
00:05:45.840 | Is this the one true model of even-odd?
00:05:48.380 | Well, we can keep our behavioral testing up,
00:05:50.900 | but we will always have doubts in the back of our minds that we have missed
00:05:54.260 | some important cases in our test that are
00:05:56.740 | hiding significant problems for the system.
00:05:59.720 | That is a simple illustration of
00:06:02.500 | the actual situation that you are in if you deploy a model out into the world.
00:06:07.180 | You have done limited testing,
00:06:08.980 | and now you have to see what happens for the unfamiliar cases.
00:06:13.340 | Another more incidental limitation of behavioral testing to keep in mind,
00:06:19.600 | as we're going to discuss it,
00:06:20.600 | is that by and large,
00:06:22.200 | we set aside the question of metrics.
00:06:24.460 | When you look through the literature on challenge and adversarial tests,
00:06:27.820 | you find that mostly people are adopting the metrics that are
00:06:31.260 | familiar from the underlying tasks
00:06:34.060 | and simply probing the models within those guardrails.
00:06:37.420 | I think that's fine, but in the fullness of adversarial testing,
00:06:41.260 | we should feel free to break out of the confines of these tasks and assess
00:06:45.260 | models in new ways to expose new limitations and so forth.
00:06:50.620 | I'm going to play by the rules by and large in this lecture,
00:06:53.580 | but have in mind that one way to be adversarial would be to put
00:06:57.660 | models in entirely unfamiliar situations and ask new things of them.
00:07:02.740 | Here's another really crucial analytical point
00:07:07.020 | that we need to think about when we do behavioral testing.
00:07:09.620 | When we see a failure,
00:07:11.420 | is this a failure of the model or is it a failure of the underlying dataset?
00:07:17.260 | A lovely paper that provides a framework for thinking about this is Liu et al.
00:07:22.020 | 2019, which has the heading,
00:07:24.140 | inoculation by fine-tuning.
00:07:26.500 | We're going to talk about that idea in a second,
00:07:28.700 | but the guiding idea behind the paper is embodied in this quote.
00:07:33.540 | What should we conclude when a system fails on a challenge dataset?
00:07:37.980 | In some cases, a challenge might exploit
00:07:40.740 | blind spots in the design of the original dataset,
00:07:43.620 | call that a dataset weakness.
00:07:45.820 | In others, the challenge might expose an inherent inability of
00:07:49.180 | a particular model family to handle certain kinds of natural language phenomena.
00:07:53.860 | That's the model weakness.
00:07:55.660 | These are, of course, not mutually exclusive.
00:07:58.500 | Dataset weakness and model weakness.
00:08:01.220 | The thing to watch out for is that people
00:08:03.740 | want to claim they have found model weaknesses.
00:08:06.620 | That's where the action is.
00:08:07.860 | If you can show that the transformer architecture is
00:08:10.780 | fundamentally incapable of capturing some phenomenon,
00:08:14.860 | then you have a real headline result.
00:08:17.020 | That is important. It might mean that the transformer is
00:08:19.860 | a non-starter when it comes to modeling language.
00:08:23.420 | But frankly, it's more likely that you have found a dataset weakness.
00:08:28.860 | There is something about the available training data
00:08:31.500 | that means the model has not hit your learning targets.
00:08:34.780 | That is a much less interesting result because it often
00:08:37.580 | means that we just need to supplement with more data.
00:08:40.820 | We need to be careful about this because we don't want to
00:08:43.500 | mistake dataset weaknesses for model weaknesses.
00:08:48.260 | We made a similar point in a paper that we did
00:08:51.580 | about posing fair but challenging evaluation tasks.
00:08:55.340 | We write, however, for any evaluation method,
00:08:58.660 | we should ask whether it is fair.
00:09:01.140 | Fair in the following sense,
00:09:02.860 | has the model been shown data sufficient to
00:09:05.260 | support the generalization we are asking of it?
00:09:08.700 | Unless we can say yes with complete certainty,
00:09:12.020 | we can't be sure whether a failed evaluation traces to
00:09:15.060 | a model limitation or a data limitation that no model could overcome.
00:09:19.940 | This is an important point.
00:09:21.860 | When we say fair to our models,
00:09:24.260 | we don't mean that we're particularly worried
00:09:26.380 | that they might be mistreated or something.
00:09:28.740 | Rather, we are worried about an analytic mistake where we blame a model for
00:09:33.780 | a failing when in fact the failing is on us because something about
00:09:37.580 | the specification that we gave didn't
00:09:39.900 | fully disambiguate the learning targets that we had in mind.
00:09:43.340 | This can easily happen and it can lead to misdiagnosis of problems.
00:09:48.220 | Here's an example that's just at
00:09:49.980 | a very human level that can show that any agent
00:09:52.540 | could feel stumped by a misspecified problem.
00:09:55.620 | Suppose I begin the numerical sequence 3, 5, 7,
00:09:59.180 | and I ask you to guess what the next number is.
00:10:02.300 | Well, even within human expectations here,
00:10:07.180 | it seems reasonable to assume that I was listing out odd numbers,
00:10:10.860 | in which case you should say 9,
00:10:12.740 | or prime numbers, in which case you should say 11.
00:10:16.820 | It's absolutely unfair if I was imagining
00:10:19.380 | the prime case for me to scold you for saying 9 in this context.
00:10:23.880 | But that is exactly the mistake that we are at risk of
00:10:26.940 | making when we pose challenge problems to our systems.
00:10:30.740 | Here's another case in which this could happen that's more
00:10:33.540 | oriented toward natural language understanding.
00:10:36.020 | Suppose I want to probe systems to see whether they can
00:10:38.420 | learn basic aspects of Boolean logic.
00:10:42.060 | What I do is show the system cases of combinations of p and q,
00:10:46.300 | where they're both true and where p is false and q is true.
00:10:50.660 | Now, I ask the system whether it can
00:10:53.140 | generalize by filling out this entire truth table.
00:10:56.480 | Well, even within the bounds of
00:10:59.260 | the hypothesis space for normal Boolean logic,
00:11:01.860 | there are two reasonable hypotheses here.
00:11:04.500 | I might have in mind the material conditional,
00:11:08.180 | as symbolized by the arrow here, or disjunction,
00:11:11.240 | inclusive disjunction, as symbolized by this V symbol down here.
00:11:15.200 | My training data as depicted on the left here simply did
00:11:18.880 | not disambiguate what my learning target was.
00:11:21.440 | Again, it is not fair to scold systems if they arrive at
00:11:25.580 | the conclusion that I meant the conditional
00:11:27.500 | when secretly I meant disjunction.
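A small sketch, invented for illustration, makes the ambiguity concrete: both hypotheses agree on the two observed rows and diverge only on the held-out ones.

```python
# Both candidate hypotheses -- material conditional and inclusive disjunction --
# agree on the two observed rows, so the training data cannot tell them apart.
def conditional(p: bool, q: bool) -> bool:
    return (not p) or q          # p -> q

def disjunction(p: bool, q: bool) -> bool:
    return p or q                # p v q

observed = [(True, True), (False, True)]     # rows shown during training
held_out = [(True, False), (False, False)]   # rows where the hypotheses diverge

for p, q in observed + held_out:
    print(p, q, "conditional:", conditional(p, q), "disjunction:", disjunction(p, q))
```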
00:11:30.660 | The paper that I mentioned before,
00:11:34.300 | Liu et al. 2019 is a lovely framework for thinking about how to
00:11:38.060 | get over this analytic hurdle and
00:11:40.220 | distinguish between dataset weaknesses and model weaknesses.
00:11:43.620 | This is the framework that they call inoculation by fine-tuning.
00:11:47.280 | This is a diagram from their paper. Let's walk through it.
00:11:49.920 | Suppose we train our system on our original data,
00:11:54.140 | and then we test it on
00:11:55.700 | the original test set and some challenge set that we're interested in.
00:12:00.200 | We observe that the system does well on that original test
00:12:04.160 | and very poorly on the challenge dataset.
00:12:07.660 | The question is, why is that happening?
00:12:10.600 | I've already presented to you the major choice point here.
00:12:14.120 | Is this a dataset weakness or
00:12:16.240 | a model weakness that we are seeing?
00:12:18.400 | The proposed method for resolving that question works as follows.
00:12:22.260 | We're going to fine-tune on a few challenge examples.
00:12:26.040 | We're going to update the model and then retest on
00:12:29.120 | both the original and the challenge datasets.
00:12:32.640 | We have three possible general outcomes here.
00:12:36.120 | The dataset weakness case is the case where now,
00:12:39.560 | having done this fine-tuning,
00:12:40.840 | we see good performance on both the original and our challenge dataset.
00:12:45.460 | In particular, the challenge performance has gone way up.
00:12:48.840 | That is an indication to us that there were simply some gaps in
00:12:52.320 | the available training experiences of our model that
00:12:55.200 | were quickly overcome by our fine-tuning.
00:12:58.240 | That's a data weakness.
00:13:00.200 | Conversely, a model weakness would be
00:13:02.640 | a situation where even after doing this fine-tuning,
00:13:05.360 | we still see poor performance on our challenge dataset,
00:13:09.160 | even though we have maintained performance on the original.
00:13:12.600 | That might mean that there is something about
00:13:15.060 | our new examples from our challenge set
00:13:17.620 | that are fundamentally difficult for this model.
00:13:20.700 | Call that a model weakness.
00:13:22.700 | Then the third outcome, also important, annotation artifacts.
00:13:26.900 | This is where having done this fine-tuning,
00:13:28.780 | we have now hurt the model in the sense that
00:13:31.540 | performance on the original test set has plummeted.
00:13:35.860 | That's a case where we might discover that our challenge dataset
00:13:39.540 | is doing something unusual and problematic to the model.
00:13:43.700 | That might cause us to reflect again
00:13:45.700 | on the nature of the challenge we've posed.
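Schematically, the recipe could be implemented along these lines; the `fine_tune` and `evaluate` callables are placeholders for whatever training and evaluation code you use, and this is only a sketch of the procedure, not the authors' released implementation.

```python
from typing import Callable, List, Sequence, Tuple

def inoculation_curve(
    base_model,
    fine_tune: Callable,        # (model, examples) -> fine-tuned copy of the model
    evaluate: Callable,         # (model, examples) -> accuracy
    original_test: Sequence,
    challenge: Sequence,
    k_values: List[int],
) -> List[Tuple[int, float, float]]:
    """For each k, fine-tune on the first k challenge examples, then retest
    on the original test set and on held-out challenge examples."""
    results = []
    for k in k_values:
        tuned = fine_tune(base_model, challenge[:k])
        results.append((k, evaluate(tuned, original_test), evaluate(tuned, challenge[k:])))
    return results

# Reading the curve:
#   challenge accuracy recovers while original accuracy holds  -> dataset weakness
#   challenge accuracy stays low while original accuracy holds -> model weakness
#   original accuracy drops after fine-tuning                  -> artifacts in the challenge data
```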
00:13:48.620 | Here's a diagram from the paper using
00:13:51.380 | an adversarial test that they study in detail,
00:13:54.700 | and that was released in relation to NLI models.
00:13:58.120 | They're organized by the three outcomes that they see.
00:14:01.240 | Outcome 1 is the dataset weakness case.
00:14:04.240 | This is the characteristic process for this.
00:14:06.360 | Let's focus on these green lines here.
00:14:08.720 | The dots here indicate performance on
00:14:13.040 | the original set and the crosses
00:14:15.560 | here are performance on the new challenge set.
00:14:18.140 | This is a dataset weakness in that you see that as we
00:14:20.880 | fine-tune across this x-axis on more and more challenge examples,
00:14:25.360 | we see performance on that challenge set go up,
00:14:28.280 | and we maintain performance throughout
00:14:30.400 | that fine-tuning process on the original dataset.
00:14:33.600 | That is a characteristic picture of
00:14:35.580 | something we could call data weakness.
00:14:38.520 | The model weakness case is also pretty clear to see.
00:14:41.840 | Here again, we have the original dataset with these dots.
00:14:44.600 | We maintain performance on that
00:14:46.500 | across all of the different levels of fine-tuning.
00:14:49.460 | But well below that is
00:14:51.580 | the corresponding line for the challenge dataset,
00:14:54.000 | also pretty flat.
00:14:55.360 | No matter how many examples we fine-tune on,
00:14:58.040 | we never really budge on performance on those examples,
00:15:01.260 | suggesting that there's a real problem with the underlying model.
00:15:05.080 | Then outcome 3 is the dataset artifacts,
00:15:07.760 | and this is the case where our fine-tuning actually
00:15:10.560 | introduces something chaotic into the mix by disturbing the model.
00:15:14.960 | The net effect there is that for the original dataset,
00:15:18.200 | pick this one here,
00:15:19.480 | we see variable performance.
00:15:21.720 | We see some gains on the challenge dataset,
00:15:24.200 | but really at a cost to the overall performance of the system.
00:15:28.160 | That would suggest to us that the data
00:15:30.280 | in the challenge set are somehow problematic.
00:15:35.480 | Those are general lessons here.
00:15:38.560 | I have one more story that I thought I would
00:15:40.760 | tell you that comes from work that we did in my group,
00:15:43.460 | and this relates to having negation as a learning target.
00:15:46.720 | Again, this is in the spirit of helping you avoid what could
00:15:49.280 | be a serious analytic mistake for behavioral testing.
00:15:53.240 | We have this intuitive learning target related to negation.
00:15:56.680 | If A entails B,
00:15:58.160 | then not B entails not A.
00:16:00.080 | That is the classic entailment reversing property of negation.
00:16:03.480 | It applies at all levels in language and is responsible for why,
00:16:06.960 | for example, given that pizza entails food,
00:16:10.280 | not food entails not pizza.
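Schematically, in standard entailment notation:

$$\text{If } A \models B, \text{ then } \neg B \models \neg A; \quad \text{e.g., } \textit{pizza} \models \textit{food} \;\Rightarrow\; \neg\textit{food} \models \neg\textit{pizza}.$$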
00:16:13.280 | Simple intuitive learning target with lots of consequences for language,
00:16:17.120 | and then we have this observation through many papers in the literature,
00:16:21.280 | that our top performing natural language inference models
00:16:24.660 | fail to hit that learning target.
00:16:27.360 | Of course, the tempting conclusion there is that
00:16:29.940 | our top performing models are incapable of learning negation.
00:16:34.300 | We want to make that conclusion because it's a headline result
00:16:38.320 | that will mean we have a really fundamental limitation that we have discovered.
00:16:42.600 | But we have to pair that with the observation that negation is severely
00:16:48.180 | underrepresented in the NLI benchmarks that are driving these models.
00:16:54.180 | That should introduce doubt in our minds that we've really found a model weakness;
00:16:58.660 | we might be confronting a dataset weakness instead.
00:17:02.340 | To address that question,
00:17:04.420 | we followed the inoculation by fine-tuning template and constructed
00:17:08.900 | a slightly synthetic dataset that we call MoNLI from monotonicity NLI.
00:17:14.180 | It has two parts.
00:17:15.380 | In positive MoNLI, there are about 1,500 examples.
00:17:19.300 | We took actual hypotheses from the SNLI benchmark,
00:17:23.500 | like food was served,
00:17:25.220 | and we used WordNet to find a special case of food, like pizza,
00:17:30.340 | an entailment case, and then we created a new example,
00:17:33.300 | pizza was served.
00:17:34.420 | Having constructed that new example,
00:17:36.120 | we now have two new positive MoNLI cases.
00:17:39.620 | A is neutral with respect to B,
00:17:41.740 | and B entails A. We also have negative MoNLI,
00:17:46.820 | which has a similar number of examples and follows the same protocol,
00:17:50.100 | except now we begin from
00:17:52.300 | negated examples like the children are not holding plants.
00:17:55.820 | Again, use WordNet for a lookup.
00:17:57.700 | We have flowers entails plants,
00:17:59.300 | and that creates a new example,
00:18:00.940 | the children are not holding flowers.
00:18:03.380 | Because of the entailment reversing property of negation,
00:18:06.320 | we get our two examples again,
00:18:07.740 | but now the labels are reversed.
00:18:09.780 | A entails B, and B is neutral with respect to A,
00:18:13.320 | the converse of the pattern we saw up here.
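A rough sketch of the construction idea, using NLTK's WordNet interface; this is an illustrative approximation of the protocol just described, not the authors' actual MoNLI pipeline, and the helper names are hypothetical.

```python
# Requires NLTK with the WordNet data installed: nltk.download("wordnet").
from typing import List, Optional, Tuple

from nltk.corpus import wordnet as wn

def first_hyponym(word: str) -> Optional[str]:
    """Return one hyponym lemma of the word's first noun synset, if any."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets or not synsets[0].hyponyms():
        return None
    return synsets[0].hyponyms()[0].lemmas()[0].name().replace("_", " ")

def monli_pairs(sentence: str, target: str, negated: bool) -> List[Tuple[str, str, str]]:
    """Substitute a WordNet hyponym for `target` and label both directions."""
    hyponym = first_hyponym(target)
    if hyponym is None:
        return []
    specific = sentence.replace(target, hyponym)
    if not negated:
        # e.g. "food was served" vs. "pizza was served":
        # general -> specific is neutral, specific -> general is entailment.
        return [(sentence, specific, "neutral"), (specific, sentence, "entailment")]
    # Under negation the labels reverse.
    return [(sentence, specific, "entailment"), (specific, sentence, "neutral")]

print(monli_pairs("food was served", "food", negated=False))
```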
00:18:16.740 | We did our level best to pose this as a very hard generalization task.
00:18:22.060 | In the sense that we held out entire words for testing,
00:18:26.900 | to be sure that we were getting a look at whether or not systems had truly
00:18:30.860 | acquired a theory of lexical relations in addition to acquiring a theory of negation.
00:18:36.820 | We're making this as hard as we can,
00:18:38.740 | but we're also trying to be sure that we have good coverage
00:18:41.540 | over the relevant phenomena for negation.
00:18:44.860 | One thing we did with this dataset is use MoNLI as a challenge dataset.
00:18:50.380 | The initial results are quite worrisome.
00:18:53.240 | Let's look at the BERT row of this table here.
00:18:55.780 | It was trained on SNLI,
00:18:57.420 | it does great on SNLI,
00:18:59.580 | and it does extremely well on the positive part of the MoNLI split,
00:19:04.500 | but it has essentially zero accuracy on the negative part of MoNLI.
00:19:11.220 | The strategy seems clear here,
00:19:13.440 | the model is simply ignoring negations and therefore getting
00:19:17.180 | every single one of these examples wrong because they look like positive cases to the model.
00:19:22.060 | You might think, "Aha, we have found a fundamental limitation of BERT,"
00:19:26.020 | but I think that's incorrect.
00:19:27.720 | If we do a little bit of inoculation by fine-tuning on negative MoNLI cases,
00:19:33.180 | performance on that split immediately goes up.
00:19:36.460 | Now we have maintained performance on SNLI,
00:19:39.460 | and we have excellent performance on the negative split for MoNLI,
00:19:43.260 | and this strongly suggests that we had found not a model weakness,
00:19:47.260 | but rather a dataset weakness.
00:19:50.260 | Final thing I want to say here by way of wrapping up,
00:19:56.660 | is that I have emphasized fairness for our systems.
00:20:00.860 | I think that is important to have in mind so that we don't confuse ourselves.
00:20:05.340 | But I couldn't resist pointing out that biological creatures are amazing,
00:20:10.560 | and we now know that they often solve
00:20:13.820 | tasks that are unfair in the sense that I just described.
00:20:17.180 | Here is a classic case.
00:20:18.420 | This is called relational match to sample,
00:20:21.580 | and this is the observation that even very,
00:20:24.140 | very young humans and some animals,
00:20:27.140 | including crows and non-human primates,
00:20:29.740 | are able to solve tasks like this.
00:20:31.780 | I show you two red squares,
00:20:33.700 | and then ask you to pick from these two options here,
00:20:36.860 | and people go for the two same ones,
00:20:39.820 | matching with the original prompt.
00:20:42.120 | You don't need training instances for this,
00:20:44.360 | people naturally gravitate to it.
00:20:46.700 | Whereas if I show you two different shapes and ask you to make a similar choice,
00:20:50.600 | now what people do is go for the two different ones.
00:20:53.820 | This is same/different reasoning that we
00:20:56.960 | do consistently with essentially no training data.
00:21:00.640 | As post here, I maintain that these tasks are unfair,
00:21:04.640 | and yet nonetheless, humans and many biological entities
00:21:08.440 | are able to systematically solve these tasks.
00:21:11.300 | That is a puzzle about human and other biological creature cognition,
00:21:16.580 | and it's something that we should keep in mind.
00:21:18.920 | People solve unfair tasks,
00:21:20.920 | and the question is,
00:21:21.940 | how would we get our machine learning models to solve such tasks,
00:21:25.320 | if indeed that's what we want them to do?
00:21:27.640 | These are just the simpler cases.
00:21:29.440 | For example, we can do hierarchical versions of equality,
00:21:33.360 | and here with some training,
00:21:34.960 | even crows can do problems like this one,
00:21:37.880 | and people solve them out of the box, so to speak,
00:21:40.200 | with essentially no training instances or
00:21:42.720 | not enough training instances to fully disambiguate the task.
00:21:46.400 | Again, pointing out that biological creatures are amazing.
00:21:51.320 | We should pose fair tasks to our systems while keeping in
00:21:55.120 | mind that there are scenarios in which we might have
00:21:58.720 | an expectation for a solution that is not supported by the data,
00:22:03.000 | but nonetheless, the one that all of us arrive at with seemingly no effort.