Stanford XCS224U | Behavioral Eval of NLU Models, Pt 2: Analytical Considerations | Spring 2023

This screencast is about the analytical considerations that surround this kind of assessment and analysis.
We talk about adversarial testing as a shorthand, but there is really a spectrum of motivations here, running from the fairly benign to the truly adversarial. At the benign end, you might ask: does my system understand how negation works? Maybe in its training data negation was sparsely represented, and you want to ask whether the system nonetheless learned what negation means. Behavioral testing could help you address that question. Does my system work with a new style or genre? Maybe it was trained on text from newspapers and you would like to know how it fares on a different kind of text. Moving along the spectrum: this system is supposed to know about numerical terms, so you probe it with examples outside of its training experiences for such terms, perhaps in an effort to show that the system is not good at numerical terms. More adversarial still: does the system produce socially problematic outputs? At this point, you're actively trying to construct examples that fall outside the standard experiences for the model in an effort to expose bad behavior. That is probably more thoroughly adversarial. Maybe the most adversarial of all would be exploring the system for security holes. You might take normal examples and append a certain sequence of tokens, and something very surprising happens as a result; you're probing the system for gaps in its security in some general sense.
These are all interesting behavioral tests to run. But we should keep the fundamental limitation in view: no amount of behavioral testing can truly offer you a guarantee about what our systems will be like. You only ever learn about the behaviors you happened to probe, that is, the set of examples that you decided to construct.
Consider a simple example: a model that is supposed to do even-odd detection. The promise of this model is that it takes in numbers and classifies them as even or odd. Here four has come in and it has predicted even. So far so good. 21 comes in and it predicted odd. This behavioral testing is going great and suggests that we have a very solid model of even-odd detection. But suppose I now expose for you how this model works. I show you the insides, and what that reveals for you is that this model is just a simple lookup, and we got lucky on those five inputs, because those are exactly the five inputs that the model was prepared for. Now you know exactly how to expose a weakness of the system: give it any input that is not in that lookup table. Notice that it was not the behavioral test that gave us this insight, but rather being able to peek under the hood.
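To make the lookup picture concrete, here is a minimal hypothetical sketch of such a model and the kind of behavioral test it would pass. I'm treating the inputs as number words, and the table entries beyond "four" and "21" are invented for illustration, since the other slide inputs aren't in the transcript:

```python
# Hypothetical sketch: an "even-odd model" that is secretly a lookup table.
# Only "four" and "twenty one" come from the lecture; the rest are invented here.
LOOKUP = {
    "four": "even",
    "twenty one": "odd",
    "thirty two": "even",   # invented
    "fifty seven": "odd",   # invented
    "one hundred": "even",  # invented
}

def even_odd_model(text):
    # No arithmetic anywhere: the model only succeeds on inputs it was "prepared for".
    return LOOKUP[text]

# A behavioral test suite that, by bad luck, draws only on those prepared inputs:
tests = [("four", "even"), ("twenty one", "odd"), ("thirty two", "even")]
assert all(even_odd_model(x) == y for x, y in tests)  # passes: looks like a solid model

# Any input outside the table exposes the weakness immediately:
# even_odd_model("six")  ->  KeyError
```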
Suppose we now test a new model on a richer set of cases, and let's assume it gets all those green cases correct as well, and you think now we really have an excellent model of even-odd detection. What's revealed when you look inside is that this is a more sophisticated version of the same lookup. The model finds the final token by splitting on white space and uses that as the basis for its classification decisions. If that final word is not in its lookup table, it predicts odd by following that elsewhere case.
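Here is the same idea as a hypothetical sketch; the table contents are again my own invention, but the mechanism, split on whitespace, keep the final word, and fall back to odd for anything unseen, is the one just described:

```python
# Hypothetical sketch of the more sophisticated lookup: split on whitespace,
# keep the final word, look that word up, and default to "odd" elsewhere.
# The table contents are invented for illustration.
FINAL_WORD_TABLE = {"one": "odd", "two": "even", "three": "odd",
                    "four": "even", "seven": "odd", "nine": "odd"}

def even_odd_model_v2(text):
    final_word = text.split()[-1]                    # "twenty one" -> "one"
    return FINAL_WORD_TABLE.get(final_word, "odd")   # the "elsewhere" case: predict odd

# This passes many more behavioral tests than the plain lookup did:
assert even_odd_model_v2("twenty one") == "odd"
assert even_odd_model_v2("fifty four") == "even"
# ...and yet it still has nothing to do with parity in general:
assert even_odd_model_v2("one hundred") == "odd"     # wrong: one hundred is even
```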
But again, we saw the flaw not from our behavioral test, but from being able to look inside the model. We can keep adding behavioral tests, but we will always have doubts in the back of our minds that we have missed something important. And that is essentially the actual situation that you are in if you deploy a model out into the world: it will encounter inputs unlike the ones you constructed, and now you have to see what happens for the unfamiliar cases.
Another, more incidental, limitation of behavioral testing to keep in mind: when you look through the literature on challenge and adversarial tests, you find that mostly people are adopting the metrics that are already standard for the underlying tasks and simply probing the models within those guardrails.
I think that's fine, but in the fullness of adversarial testing, we should feel free to break out of the confines of these tasks and assess models in new ways to expose new limitations and so forth. I'm going to play by the rules by and large in this lecture, but have in mind that one way to be adversarial would be to put models in entirely unfamiliar situations and ask new things of them.
Here's another really crucial analytical point that we need to think about when we do behavioral testing. When a system fails on one of our challenge examples, is this a failure of the model or is it a failure of the underlying dataset?
A lovely paper that provides a framework for thinking about this is Liu et al. 2019. We're going to talk about that idea in a second, but the guiding idea behind the paper is embodied in this quote: What should we conclude when a system fails on a challenge dataset? In some cases, the challenge might exploit blind spots in the design of the original dataset. In others, the challenge might expose an inherent inability of a particular model family to handle certain kinds of natural language phenomena.
These are, of course, not mutually exclusive. Researchers who pose these challenge tests usually want to claim they have found model weaknesses. If you can show that the transformer architecture is fundamentally incapable of capturing some phenomenon, you will have discovered something fundamental. That is important. It might mean that the transformer is a non-starter when it comes to modeling language.
But frankly, it's more likely that you have found a dataset weakness. There is something about the available training data that means the model has not hit your learning targets. That is a much less interesting result because it often means that we just need to supplement with more data. We need to be careful about this because we don't want to mistake dataset weaknesses for model weaknesses.
We made a similar point in a paper that we did about posing fair but challenging evaluation tasks.
We write: however, for any evaluation method, we should ask whether the available data support the generalization we are asking of it. Unless we can say yes with complete certainty, we can't be sure whether a failed evaluation traces to a model limitation or a data limitation that no model could overcome. When we talk about being fair to our systems, we don't mean that we're particularly worried about them, that they might be mistreated or something. Rather, we are worried about an analytic mistake where we blame a model for a failing when in fact the failing is on us, because something about the data we provided did not fully disambiguate the learning targets that we had in mind. This can easily happen and it can lead to misdiagnosis of problems. Let me give an example at a very human level that can show that any agent could feel stumped by a misspecified problem.
Suppose I begin the numerical sequence 3, 5, 7, and I ask you to guess what the next number is. If you say 9, it seems reasonable to assume that I was listing out odd numbers. But I might have had in mind prime numbers, in which case you should say 11. It would hardly be fair, if I had intended the prime case, for me to scold you for saying 9 in this context. But that is exactly the mistake that we are at risk of making when we pose challenge problems to our systems.
Here's another case in which this could happen that's more oriented toward natural language understanding.
Suppose I want to probe systems to see whether they can learn a simple Boolean connective. What I do is show the system cases of combinations of p and q, where they're both true and where p is false and q is true, and then I ask it to generalize by filling out this entire truth table. Within the hypothesis space for normal Boolean logic, I might have in mind the material conditional, symbolized by the arrow here, or inclusive disjunction, as symbolized by this ∨ symbol down here. My training data, as depicted on the left here, simply did not disambiguate what my learning target was. Again, it is not fair to scold systems if they arrive at a generalization other than the one I happened to intend.
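As a small illustration of that underdetermination, here is a hypothetical sketch. I'm assuming the two observed rows were labeled true, which is what both of the connectives under discussion predict:

```python
# Hypothetical sketch of the underdetermination problem. Assumption: the two
# observed rows were labeled True, as both candidate connectives would predict.
TRAIN = [((True, True), True), ((False, True), True)]   # the only observed cases

CANDIDATES = {
    "material conditional (p -> q)": lambda p, q: (not p) or q,
    "inclusive disjunction (p or q)": lambda p, q: p or q,
}

for name, op in CANDIDATES.items():
    fits = all(op(p, q) == label for (p, q), label in TRAIN)
    print(name, "is consistent with the training data:", fits)   # True for both

# The two hypotheses diverge only on the unseen row p=False, q=False:
# the conditional yields True there, the disjunction yields False, so the
# training data alone cannot tell us which connective was intended.
```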
Liu et al. 2019 provides a lovely framework for thinking about how to distinguish between dataset weaknesses and model weaknesses. This is the framework that they call inoculation by fine-tuning. This is a diagram from their paper; let's walk through it. Suppose we train our system on our original data, and then we evaluate it on both the original test set and some challenge set that we're interested in. We observe that the system does well on that original test set but poorly on the challenge set. I've already presented to you the major choice point here: is that a dataset weakness or a model weakness? The proposed method for resolving that question works as follows. We're going to fine-tune on a few challenge examples, update the model, and then retest on both the original and the challenge datasets.
We have three possible general outcomes here. The dataset weakness case is the case where now, after the fine-tuning, we see good performance on both the original and our challenge dataset; in particular, the challenge performance has gone way up. That is an indication to us that there were simply some gaps in the available training experiences of our model, gaps that a few challenge examples sufficed to fill. The model weakness case is a situation where, even after doing this fine-tuning, we still see poor performance on our challenge dataset, even though we have maintained performance on the original. That might mean that there is something about these challenge cases that is fundamentally difficult for this model. Then there is the third outcome, also important: annotation artifacts. This is the case where, after the fine-tuning, performance on the original test set has plummeted. That's a case where we might discover that our challenge dataset is doing something unusual and problematic to the model.
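Here is the recipe as a hypothetical sketch. The function names, dataset handles, and sample sizes are placeholders of mine, not an API from the paper or from any particular library:

```python
# Hypothetical sketch of the inoculation-by-fine-tuning recipe (Liu et al. 2019).
# `fine_tune`, `evaluate`, and the dataset arguments are placeholders.

def inoculate(model, challenge_train, original_test, challenge_test,
              fine_tune, evaluate, sample_sizes=(100, 500, 1000)):
    """Fine-tune on small samples of challenge data, then retest on both test sets."""
    results = []
    for k in sample_sizes:
        tuned = fine_tune(model, challenge_train[:k])   # a few challenge examples
        results.append({
            "num_challenge_examples": k,
            "original_acc": evaluate(tuned, original_test),
            "challenge_acc": evaluate(tuned, challenge_test),
        })
    return results

# Reading the outcomes, informally:
#   challenge accuracy rises and original accuracy holds  -> dataset weakness
#   challenge accuracy stays low while original holds     -> model weakness
#   original accuracy plummets after fine-tuning          -> artifacts in the challenge set
```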
Here are their results for one challenge dataset, an adversarial test that they study in detail and that was released in relation to NLI models. The plots are organized by the three outcomes that they see. The dots show performance on the original test set, and the other markers here show performance on the new challenge set. This is a dataset weakness, in that you see that, as we fine-tune across this x-axis on more and more challenge examples, performance on that challenge set goes up, with essentially no cost from that fine-tuning process on the original dataset. The model weakness case is also pretty clear to see. Here again, we have the original dataset with these dots, holding steady across all of the different levels of fine-tuning. But when we look at the corresponding line for the challenge dataset, we never really budge on performance on those examples, suggesting that there's a real problem with the underlying model. Then there is the annotation artifacts case, and this is the case where our fine-tuning actually introduces something chaotic into the mix by disturbing the model. The net effect there is that, for the original dataset, performance drops as we fine-tune on more challenge examples; whatever we gain on the challenge set comes at a cost to the overall performance of the system. That suggests that the examples in the challenge set are somehow problematic.
The next example I want to tell you about comes from work that we did in my group, and it relates to having negation as a learning target. Again, this is in the spirit of helping you avoid what could be a serious analytic mistake in behavioral testing.
We have this intuitive learning target related to negation: if A entails B, then not-B entails not-A. That is the classic entailment-reversing property of negation. It applies at all levels in language and is responsible for why, for example, where we have that pizza entails food, we also have that not food entails not pizza.
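Schematically, in my own notation rather than anything from the lecture slides:

```latex
% Entailment reversal under negation (schematic):
\text{If } A \models B \text{, then } \neg B \models \neg A.
\qquad
\text{``we ate pizza''} \models \text{``we ate food''}
\;\Longrightarrow\;
\text{``we did not eat food''} \models \text{``we did not eat pizza''}
```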
It is a simple, intuitive learning target with lots of consequences for language. And then we have this observation, through many papers in the literature, that our top-performing natural language inference models struggle with examples involving negation.
Of course, the tempting conclusion there is that our top-performing models are incapable of learning negation. We want to draw that conclusion because it's a headline result: it would mean we have discovered a really fundamental limitation. But we have to pair that with the observation that negation is severely underrepresented in the NLI benchmarks that are driving these models. That should introduce doubt in our minds that we've really found a model weakness; we might be confronting a dataset weakness instead.
To tease these apart, we followed the inoculation by fine-tuning template and constructed a slightly synthetic dataset that we call MoNLI, for monotonicity NLI. In positive MoNLI, there are about 1,500 examples. We took actual hypotheses from the SNLI benchmark containing a general term like food, and we used WordNet to find a special case of food, like pizza. Substituting one word for the other creates a new example pair: the sentence A with the general term is neutral with respect to the sentence B with the specific term, and B entails A. We also have negative MoNLI, which has a similar number of examples and follows the same protocol, except that it is built from negated examples like "The children are not holding plants." Because of the entailment-reversing property of negation, the relations flip: now A entails B, and B is neutral with respect to A. We did our level best to pose this as a very hard generalization task, in the sense that we held out entire words for testing, to be sure that we were getting a look at whether or not systems had truly acquired a theory of lexical relations in addition to acquiring a theory of negation. That makes the task hard, but we're also trying to be sure that we have good coverage, so that the task remains a fair one.
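To give a rough sense of the construction, here is a hypothetical simplification using NLTK's WordNet interface. The template, helper names, and word pair are mine, not the actual MoNLI pipeline:

```python
# Hypothetical simplification of a MoNLI-style construction, not the authors' code.
# Requires NLTK with the WordNet data installed: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def a_hypernym(word):
    """Return one WordNet hypernym lemma for a noun, i.e. a more general term."""
    for synset in wn.synsets(word, pos=wn.NOUN):
        for hyper in synset.hypernyms():
            return hyper.lemma_names()[0].replace("_", " ")
    return None

def monli_style_pairs(template, specific, general):
    """Build positive and negative pairs by lexical substitution.
    A = sentence with the general term, B = sentence with the specific term."""
    A = template.format("", general)        # e.g. "The children are eating food."
    B = template.format("", specific)       # e.g. "The children are eating pizza."
    notA = template.format("not ", general)
    notB = template.format("not ", specific)
    positive = [(B, A, "entailment"), (A, B, "neutral")]
    # Negation reverses the direction: "not ... food" entails "not ... pizza".
    negative = [(notA, notB, "entailment"), (notB, notA, "neutral")]
    return positive, negative

general = a_hypernym("pizza") or "food"   # WordNet may return e.g. "dish"
pos, neg = monli_style_pairs("The children are {}eating {}.", "pizza", general)
```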
One thing we did with this dataset is use MoNLI as a challenge dataset. Let's look at the BERT row of this table here. This is a BERT-based NLI model that performs well on its original benchmark, and it does extremely well on the positive part of the MoNLI split, but it has essentially zero accuracy on the negative part of MoNLI. What seems to be happening is that the model is simply ignoring negations and therefore getting every single one of these examples wrong, because they look like positive cases to the model. You might think, "Aha, we have found a fundamental limitation of BERT," but that would be too hasty. If we do a little bit of inoculation by fine-tuning on negative MoNLI cases, performance on that split immediately goes up. Performance on the original benchmark is maintained, we have excellent performance on the negative split for MoNLI, and this strongly suggests that we had found not a model weakness, but a dataset weakness.
The final thing I want to say, by way of wrapping up, is that I have emphasized fairness for our systems. I think that is important to have in mind so that we don't confuse ourselves. But I couldn't resist pointing out that biological creatures are amazing at solving tasks that are unfair in the sense that I just described. For example, if I show you two identical shapes and then ask you to pick from these two options here, people reliably go for the pair of identical ones. Whereas if I show you two different shapes and ask you to make a similar choice, now what people do is go for the two different ones. This is something that people do consistently with essentially no training data.
As posed here, I maintain that these tasks are unfair, and yet, nonetheless, humans and many other biological entities are able to systematically solve them. That is a puzzle about the cognition of humans and other biological creatures, and it's something that we should keep in mind: how would we get our machine learning models to solve such tasks? For example, we can do hierarchical versions of these equality tasks, and people solve them out of the box, so to speak, even when there are not enough training instances to fully disambiguate the task. Again, this points out that biological creatures are amazing. We should pose fair tasks to our systems while keeping in mind that there are scenarios in which we might have an expectation for a solution that is not supported by the data, but that is nonetheless the one that all of us arrive at with seemingly no effort.