Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 17 - Model Analysis and Explanation
Chapters
0:00 Introduction
4:07 Why Care
6:19 Model biases
8:28 Deep model analysis
12:47 Natural language inference
14:48 HANS
17:40 How do models perform
19:27 Linguistic properties
24:38 Error rates
26:23 Examples
27:40 Questions
31:50 Unit Testing
36:19 Language Models
37:31 Long Term Memory
40:05 Saliency Maps
42:02 Simple Gradient Method
47:29 Example from SQuAD
48:52 Example from Quest
50:34 Breaking Models
53:22 Robust to Noise
56:23 Attention
00:00:00.000 |
Welcome to CS224N, lecture 17, Model Analysis and Explanation. 00:00:21.760 |
We have updated the policy on the guest lecture reactions. 00:00:30.600 |
You can't use late days for this, so please get them in. 00:00:39.680 |
And you get something like half a point for each of them. 00:00:42.360 |
And yeah, all three can be submitted up through Friday. 00:00:55.120 |
And let me emphasize that there's a hard deadline 00:01:10.920 |
are submitted after the 4:30 deadline on Friday. 00:01:15.600 |
We need to get these graded and get grades in. 00:01:24.320 |
are us giving you help on the final projects. 00:01:32.600 |
awesome questions in lecture and in office hours and on Ed. 00:01:40.420 |
of my favorite subjects in natural language processing. 00:01:47.040 |
So first, we're going to do what I love doing, 00:01:49.080 |
which is motivating why we want to talk about the topic at all. 00:01:59.040 |
to perform different kinds of analysis on it. 00:02:02.000 |
We'll talk about out-of-domain evaluation sets. 00:02:05.080 |
So this will feel familiar to the robust QA folks. 00:02:13.680 |
for a given example, why did it make the decision that it made? 00:02:19.520 |
Can we come up with some sort of interpretable explanation for it? 00:02:30.880 |
the vectors that are being built throughout the processing 00:02:34.200 |
of the model, try to figure out if we can understand 00:02:41.400 |
And then we'll actually come back to one of the default 00:02:50.160 |
removing things from models, seeing how it performs, 00:02:55.280 |
doing in this lecture, show how it's not all that different. 00:02:58.360 |
So if you haven't seen this XKCD, now you have. 00:03:09.600 |
So person A says, this is your machine learning system. 00:03:19.120 |
and then collect the answers on the other side. 00:03:46.240 |
But frequently, what have the TAs told everyone in office hours? 00:03:51.040 |
have to try it and see if it's going to work out 00:03:55.080 |
It's very, very difficult to understand our models 00:04:08.040 |
So beyond it being important because it's in an XKCD 00:04:13.640 |
comic, why should we care about understanding our models? 00:04:18.360 |
One is that we want to know what our models are doing. 00:04:33.440 |
You have an input sentence, say, and then some output 00:04:37.680 |
Maybe this black box is actually your final project model, 00:04:48.840 |
And in your final projects, you'll summarize your model 00:04:51.040 |
with sort of one or a handful of summary metrics of accuracy 00:05:09.520 |
So we want to sort of know what our models are doing, OK. 00:05:17.080 |
So today, when you're building models in this class 00:05:21.120 |
at the company, you start out with some kind of recipe 00:05:27.440 |
or because you have experience from this class. 00:05:33.440 |
And then over time, you take what works, maybe, 00:05:39.520 |
So it seems like maybe adding another layer to the model 00:05:43.760 |
And maybe that's a nice tweak, and the model performance 00:05:48.880 |
And incremental progress doesn't always feel exciting. 00:05:53.920 |
But I want to pitch to you that it's actually 00:05:58.320 |
how much incremental progress can kind of get us 00:06:03.080 |
So that we can have a better job of evaluating 00:06:06.720 |
when we need big leaps, when we need major changes, 00:06:12.080 |
attacking with our incremental sort of progress, 00:06:20.320 |
Another thing that is, I think, very related to 00:06:29.480 |
So let's say you take your Word2Vec analogies solver 00:06:34.040 |
from GloVe or Word2Vec, that is, from assignment one, 00:06:39.720 |
and you give it the analogy, man is to computer programmer 00:06:50.840 |
You should be like, wow, well, I'm glad I know that now. 00:06:55.160 |
And of course, you saw the lecture from Yulia Tsvetkov 00:07:10.400 |
So that's the kind of thing that you can also 00:07:12.600 |
do with model analysis beyond just making models better 00:07:15.320 |
according to some sort of summary metric as well. 00:07:22.800 |
And this is something that I think is super important. 00:07:25.120 |
We don't just want to look at that time scale. 00:07:30.200 |
We want to say, what about 10, 15, 25 years from now? 00:07:37.640 |
What can be learned by language model pre-training? 00:07:41.080 |
What's the model that will replace the transformer? 00:07:43.920 |
What's the model that will replace that model? 00:07:48.000 |
What are we sort of attacking over and over again 00:07:52.880 |
What do neural models tell us about language potentially? 00:08:12.720 |
What can't be learned via language model pre-training? 00:08:15.140 |
So that's sort of the complementary question there. 00:08:19.680 |
you can learn via language model pre-training, 00:08:22.240 |
is there stuff that we need total paradigm shifts 00:08:32.160 |
of trying to really deeply understand our models 00:08:40.440 |
And one thing that I want you to take away from it 00:08:49.440 |
some kind of intuition or something, but none of them 00:08:54.640 |
understand 100% about what this model is doing now. 00:09:06.420 |
I think you should sort of start out by thinking about is, 00:09:20.600 |
to estimate the probabilities of start and end indices 00:09:25.960 |
or you've trained a language model that assigns probabilities 00:09:30.480 |
You can just look at the model as that object. 00:09:37.480 |
You are not looking into it any further than the fact 00:09:45.000 |
So that's like, who even cares if it's a neural network? 00:09:53.240 |
Another level of abstraction that you can look at, 00:10:02.000 |
You've got sort of maybe your transformer encoder 00:10:23.300 |
You've got the connections in the computation graph. 00:10:26.120 |
So now you're sort of trying to remove all of the abstraction 00:10:29.560 |
that you can and look at as many details as possible. 00:10:33.640 |
of looking at your model and performing analysis 00:10:38.400 |
sort of travel slowly from one to two to three 00:10:47.080 |
So we haven't actually talked about any analyses yet. 00:10:59.520 |
So would we want to see, will my model perform well? 00:11:10.000 |
And so we don't really care about mechanisms yet. 00:11:17.400 |
Instead, we're just interested in sort of the more higher 00:11:28.540 |
that we are already doing and sort of recast it 00:11:37.340 |
So you've got input/output pairs of some kind. 00:11:53.900 |
And this is just what we've been doing this whole time. 00:11:56.300 |
It's your test set accuracy, or F1, or BLEU score. 00:11:59.940 |
And so you've got some model with some accuracy. 00:12:04.620 |
And maybe it's better than some model with some other accuracy 00:12:10.640 |
iterating on your models in your final project as well. 00:12:22.020 |
And so maybe I'll choose model A to keep working on. 00:12:24.740 |
Maybe I'll choose it if you were putting something 00:12:28.620 |
But remember back to this idea that it's just one number 00:12:37.940 |
how it's going to perform in a wide variety of settings. 00:12:52.020 |
testing on exactly the same type of data that we trained on? 00:12:56.300 |
So now we're asking, did the model learn something 00:12:58.900 |
such that it's able to sort of extrapolate or perform 00:13:02.340 |
how I want it to on data that looks a little bit different 00:13:06.100 |
And we're going to take the example of natural language 00:13:08.980 |
So to recall the task of natural language inference-- 00:13:16.860 |
He turned and saw John sleeping in his half tent. 00:13:31.780 |
whether the hypothesis is sort of implied by the premise 00:13:39.500 |
if the hypothesis is John was awake, for example, 00:13:45.840 |
Neutral, if sort of both could be true at the same time, 00:13:51.900 |
it seems like they're saying that the premise implies 00:14:11.780 |
we think we want it to be doing in order to perform 00:14:19.580 |
who gathered the data set will have asked humans 00:14:38.820 |
as actually performing the task more broadly in the right way. 00:14:45.460 |
So what if the model is not doing something smart, 00:14:54.740 |
seem like things the model should be able to do to test 00:15:03.100 |
So HANS is the Heuristic Analysis for NLI Systems data set, 00:15:11.860 |
and test whether they're using some simple syntactic 00:15:16.140 |
What we'll have in each of these cases, we'll have some heuristic. 00:15:30.260 |
all hypotheses constructed from words in the premise. 00:15:40.820 |
And then the hypothesis is the doctor paid the actor. 00:15:43.740 |
And you'll notice that in bold here, we've got the doctor, 00:15:52.460 |
you will think that the doctor was paid by the actor 00:16:10.500 |
So here, if the model assumes that the premise entails 00:16:19.100 |
So this example is the doctor near the actor danced. 00:16:28.300 |
The doctor is doing the dancing near the actor 00:16:42.420 |
And here's another one that's a lot like subsequence. 00:16:45.940 |
But so if the model thinks that the premise entails 00:16:50.660 |
all complete subtrees-- so this is like fully formed phrases. 00:16:55.660 |
So the artist slept here is a fully formed subtree. 00:17:03.960 |
Does it entail the hypothesis the artist slept? 00:17:10.980 |
That does not entail it, because this is inside the conditional. 00:17:20.460 |
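To make that lexical overlap heuristic concrete, here is a rough sketch (my own toy illustration, not code from the HANS paper) of the shortcut that HANS is designed to expose:

```python
def lexical_overlap_heuristic(premise: str, hypothesis: str) -> str:
    """Toy shortcut: predict 'entailment' whenever every hypothesis word
    also appears in the premise -- the heuristic HANS is built to catch."""
    premise_words = {w.strip(".").lower() for w in premise.split()}
    hypothesis_words = {w.strip(".").lower() for w in hypothesis.split()}
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"

# Every hypothesis word appears in the premise, so the shortcut says "entailment",
# even though the premise does not actually entail the hypothesis.
print(lexical_overlap_heuristic("The doctor was paid by the actor.",
                                "The doctor paid the actor."))
```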
Anyone unclear about how this sort of evaluation 00:17:49.340 |
from the same paper that released the data set. 00:17:51.860 |
So they took four strong MultiNLI models 00:17:57.660 |
So the accuracies here are something between 60% and 80% 00:18:04.660 |
And in domain, in that first setting that we talked about, 00:18:13.820 |
And that is sort of what we said before about it 00:18:19.820 |
And when we evaluate on HANS, in this setting 00:18:24.740 |
here, we have examples where the heuristics we talked about 00:18:37.020 |
And then if we evaluate the model in the settings 00:18:40.700 |
where if it uses the heuristic, it gets the examples wrong, 00:18:55.540 |
They're not complex in our own idea of complexity. 00:19:03.180 |
And so this is why it feels like a clear failure of the system. 00:19:08.420 |
Now, you can say, though, that, well, maybe the training data 00:19:11.780 |
sort of didn't have any of those sort of phenomena. 00:19:14.740 |
So the model couldn't have learned not to do that. 00:19:18.060 |
And that's sort of a reasonable argument, except, well, 00:19:20.700 |
Bert is pre-trained on a bunch of language text. 00:19:23.540 |
So you might expect, you might hope that it does better. 00:19:26.380 |
So we saw that example of models performing well 00:19:37.380 |
on examples that are like those that it was trained on, 00:19:49.380 |
Now we're going to take this idea of having a test 00:19:52.340 |
set that we've carefully crafted and go in a slightly 00:19:57.260 |
mean to try to understand the linguistic properties 00:20:03.380 |
was one thing for natural language inference. 00:20:08.260 |
whether they think certain things are sort of right 00:20:14.300 |
And the first way that we'll do this is we'll ask, well, 00:20:21.260 |
How do we evaluate their sort of preferences about language? 00:20:30.580 |
that you've got one sentence that sounds OK to a speaker. 00:20:34.740 |
So this sentence is, the chef who made the pizzas is here. 00:20:39.660 |
We'd call it an acceptable sentence, at least to me. 00:20:43.700 |
And then with a small change, a minimal change, 00:21:01.260 |
In English, present tense verbs agree in number 00:21:03.460 |
with their subject when they are third person. 00:21:33.180 |
that there's this noun pizzas, which is plural, 00:21:36.580 |
closer linearly, comes back to dependency parsing. 00:21:42.140 |
And what this looks like in the tree structure 00:21:45.060 |
is, well, chef and is are attached in the tree. 00:22:02.500 |
So this is a pretty sort of basic and interesting property 00:22:05.660 |
of language that also reflects the syntactic sort 00:22:11.060 |
So we've been training these language models, 00:22:12.900 |
sampling from them, seeing that they get interesting things. 00:22:15.740 |
And they tend to seem to generate syntactic content. 00:22:21.980 |
behave as if it understands this idea of agreement more broadly? 00:22:28.380 |
so that it matches the subjects and the verbs? 00:22:33.860 |
exactly whether they think that a sentence is good or bad. 00:22:36.980 |
They just tell us the probability of a sentence. 00:22:40.300 |
So before, we had acceptable and unacceptable. 00:22:49.780 |
to the acceptable sentence in the minimal pair? 00:22:52.180 |
So you have the probability under the model of the chef who made the pizzas is here, versus the probability 00:22:59.980 |
under the model of the chef who made the pizzas are here. 00:23:03.740 |
And you want this probability here to be higher. 00:23:08.020 |
And if it is, that's sort of like a simple way 00:23:10.500 |
to test whether the model got it right effectively. 00:23:15.460 |
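As a rough sketch of what that comparison looks like in code (here with a GPT-2 checkpoint from Hugging Face purely as an illustrative stand-in; the work discussed next trains its own LSTM language models):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of the sentence under the language model."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # .loss is the mean per-token cross-entropy; scale back to a total.
        loss = lm(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

acceptable = "The chef who made the pizzas is here."
unacceptable = "The chef who made the pizzas are here."
# The minimal-pair test: the acceptable sentence should get higher probability.
print(sentence_logprob(acceptable) > sentence_logprob(unacceptable))
```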
And just like in HANS, we can develop a test set 00:23:15.460 |
like pizzas, that are going to make it difficult. 00:23:39.340 |
to get it wrong, because this example is very simple. 00:23:44.860 |
So we can create, or we can look for sentences that have-- 00:23:49.940 |
these are the things called attractors in the sentence. 00:24:03.940 |
Can language models sort of very generally handle 00:24:08.500 |
So we can take examples with zero attractors, 00:24:11.340 |
see whether the model gets the minimal pairs evaluation right. 00:24:14.540 |
We can take examples with one attractor, two attractors. 00:24:18.340 |
You can see how people would still reasonably understand 00:24:21.820 |
Chef who made the pizzas and prepped the ingredients is. 00:24:26.460 |
And then on and on and on, it gets rarer, obviously. 00:24:34.180 |
that's intended to evaluate this very specific linguistic 00:24:43.140 |
trained an LSTM language model on a subset of Wikipedia 00:24:47.900 |
And they evaluated it sort of in these buckets 00:24:50.540 |
that are specified by the paper that sort of introduced 00:25:06.140 |
And so in this table here that you're about to see, 00:25:19.660 |
So if you were just to do random or majority class, 00:25:23.220 |
Oh, sorry, it's the percent of times that you get it wrong. 00:25:29.780 |
And so with no attractors, you get very low error rates. 00:25:33.460 |
So this is 1.3 error rate with a 350-dimensional LSTM. 00:25:38.940 |
And with one attractor, your error rate is higher. 00:25:50.220 |
The larger the LSTM, it looks like in general, 00:25:56.460 |
And then even on very difficult examples with four attractors, 00:26:00.220 |
which try to think of an example in your head, 00:26:02.420 |
like the chef made the pizzas and took out the trash. 00:26:10.340 |
so it gets more difficult. But it's still relatively low. 00:26:16.900 |
models are actually performing subject-verb number agreement 00:26:31.960 |
But I think, actually, the errors are quite interesting. 00:26:35.900 |
The ship that the player drives has a very high speed. 00:26:41.320 |
Now, this model thought that was less probable than the ship 00:26:45.100 |
that the player drives have a very high speed. 00:26:50.940 |
My hypothesis is that it sort of misanalyzes drives 00:27:12.520 |
So here, five paragraphs acts as a singular noun phrase. 00:27:20.340 |
But the model thought that it was more likely to say 00:27:26.380 |
because it's referring to this sort of five paragraphs 00:27:33.380 |
as opposed to a single unit of length describing the lead. 00:27:59.180 |
for other tasks, such as Q&A, classification? 00:28:07.580 |
So yes, I think that it's easier to do this kind of analysis 00:28:11.260 |
for the HANS-style analysis with question answering 00:28:18.340 |
and other sorts of tasks, because you can construct 00:28:22.140 |
examples that similarly have these heuristics 00:28:32.820 |
and then have the answer depend on the syntax or not. 00:28:43.300 |
But the idea that you can develop bespoke test 00:28:48.700 |
sets for various tasks, I think, is very, very general 00:28:54.140 |
and something I think is actually quite interesting. 00:28:59.860 |
So I won't go on further, but I think the answer is just yes. 00:29:07.380 |
How do you know where to find these failure cases? 00:29:10.180 |
Maybe that's the right time to advertise linguistics classes. 00:29:19.740 |
How do you know where to find these failure cases? 00:29:24.100 |
Yes, how do we know where to find the failure cases? 00:29:33.740 |
is interesting about things in language is one way to do it. 00:29:39.500 |
I mean, the heuristics that we saw in our language model-- 00:29:53.620 |
if the model was sort of ignoring facts about language 00:29:56.780 |
and sort of just doing this sort of rough bag of words 00:29:59.540 |
with some extra magic, then it would do about as badly 00:30:10.540 |
that this statement, if the artist slept, the actor ran, 00:30:13.260 |
does not imply the artist slept, is the kind of thing 00:30:18.380 |
but also you'd spend time sort of pondering about and thinking 00:30:22.760 |
broad thoughts about in linguistics curricula as well. 00:30:35.940 |
So there's also-- well, I guess someone was also saying-- 00:30:41.020 |
I think it's about the sort of intervening verbs example-- 00:30:46.660 |
But the data set itself probably includes mistakes 00:30:55.540 |
Yeah, because humans make more and more mistakes 00:31:03.880 |
On the other hand, I think that the mistakes are fewer 00:31:13.560 |
But yeah, it would be interesting to actually go 00:31:15.520 |
through that test set and see how many of the errors 00:31:22.360 |
due to the sort of observed form being incorrect. 00:32:03.500 |
And in fact, this sort of idea has been brought to the fore, 00:32:18.380 |
that I'm citing at the bottom suggests this minimum functionality test idea. 00:32:23.220 |
You want a small test set that targets a specific behavior. 00:32:30.820 |
But in this case, we're going to get even more specific. 00:32:36.820 |
We're going to have an expected label, what was actually 00:32:40.220 |
predicted, whether the model passed this unit test. 00:32:43.660 |
And the labels are going to be sentiment analysis here. 00:32:57.780 |
I, then a negation, a positive verb, and then the thing. 00:33:14.660 |
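A minimal sketch of that kind of templated unit test; `predict_sentiment` below is a hypothetical stand-in for whatever model you are testing:

```python
from itertools import product

negations = ["can't say I", "don't think I", "would never say I"]
positive_verbs = ["love", "like", "recommend"]
things = ["the food", "the flight", "the service"]

def negated_positive_failure_rate(predict_sentiment) -> float:
    """Minimum functionality test: 'I {negation} {positive verb} {thing}.'
    should be labeled negative; return the fraction of failures."""
    failures, total = 0, 0
    for neg, verb, thing in product(negations, positive_verbs, things):
        sentence = f"I {neg} {verb} {thing}."
        total += 1
        if predict_sentiment(sentence) != "negative":
            failures += 1
    return failures / total

# A keyword-counting baseline fails every single case in this test.
naive = lambda s: "positive" if any(w in s for w in positive_verbs) else "negative"
print(negated_positive_failure_rate(naive))  # 1.0
```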
and this is, I think, a commercial sentiment analysis 00:33:29.820 |
And this commercial sentiment analysis system 00:33:41.500 |
showed is that they could actually provide a system that 00:33:44.700 |
sort of had this framework of building test cases for NLP 00:33:48.300 |
models to ML engineers working on these products 00:34:01.900 |
find bugs in their models that they could then 00:34:08.380 |
of trying to find things that were simple and still wrong 00:34:11.300 |
with what should be pretty sophisticated neural systems. 00:34:17.660 |
And it's sort of a nice way of thinking more specifically 00:34:21.180 |
about what are the capabilities in sort of precise terms 00:34:33.380 |
You've seen language models actually perform pretty well 00:34:38.860 |
you just saw an example of a commercial sentiment analysis 00:34:41.740 |
system that sort of should do better and doesn't. 00:34:50.180 |
is if you get high accuracy on the in-domain test set, 00:34:58.980 |
what you might consider to be reasonable out-of-domain 00:35:08.180 |
And if you're building a system that will be given to users, 00:35:11.980 |
it's immediately out of domain, at the very least 00:35:15.620 |
now older than the things that the users are now saying. 00:35:23.300 |
is a single number that does not guarantee good performance 00:35:28.060 |
And from a what are our neural networks doing perspective, 00:35:32.100 |
one way to think about it is that models seem 00:35:36.300 |
sort of the fine-grained heuristics and statistics that 00:35:44.580 |
So humans can perform natural language inference. 00:35:46.980 |
If you give them examples from whatever data set, 00:35:55.260 |
But you take your MNLI model, and you test it on HANS, 00:35:55.260 |
and it got whatever that was, below chance accuracy. 00:36:03.100 |
That's not the kind of thing that you want to see. 00:36:23.700 |
that they encode about the world through pre-training. 00:36:26.380 |
And one of the ways that we saw it interact with language 00:36:39.220 |
requires you to access knowledge about where Dante was born. 00:36:45.900 |
but this fits into the set of behavioral studies 00:36:57.140 |
You could swap out born in for, I don't know, 00:37:16.580 |
of interacting with your models, this sort of behavioral study 00:37:20.980 |
can be very, very general, even though, remember, 00:37:23.820 |
we're at still this highest level of abstraction, 00:37:26.900 |
where we're just looking at the probability distributions that 00:37:29.860 |
All right, so now we'll go into-- so we've sort of looked 00:37:41.540 |
What about sort of why for an individual input 00:37:55.980 |
So one study that I love to reference that really draws 00:38:00.000 |
back into our original motivation of using LSTM 00:38:04.380 |
networks instead of simple recurrent neural networks 00:38:10.420 |
But how long is your long short-term memory? 00:38:23.140 |
that are farther than some k words away, changing k. 00:38:29.140 |
And if the accuracy, if the predictive ability 00:38:37.820 |
means the model wasn't actually using that context. 00:38:42.100 |
So on the x-axis, we've got how far away from the word 00:38:48.260 |
Are you actually sort of corrupting, shuffling, 00:38:54.140 |
And then on the y-axis is the increase in loss. 00:39:00.460 |
it means that the model was not using the thing 00:39:11.460 |
if you shuffle the history that's farther away than 50 words, 00:39:20.080 |
One, it says everything past 50 words of this LSTM language 00:39:23.620 |
model, you could have given it in random order, 00:39:28.500 |
And then two, it says that if you're closer than that, 00:39:36.740 |
And then if you actually remove the words entirely, 00:39:45.660 |
So you don't care about the order they're in, 00:39:54.800 |
Well, this model at least has effectively no longer-term memory than about 50 words of context. 00:40:03.860 |
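A rough sketch of that kind of ablation, written for a Hugging Face-style causal LM purely for illustration (the original study used its own LSTM language models; `lm` and `token_ids` are stand-ins):

```python
import random
import torch

def loss_with_shuffled_far_context(lm, token_ids, k):
    """Negative log-probability of the final token when all context farther
    than k tokens back has been shuffled. If this doesn't rise relative to the
    intact context, the model wasn't really using that far-away history.
    `token_ids` is a plain list of vocabulary ids; `lm` is a causal LM."""
    context, target = token_ids[:-1], token_ids[-1]
    far, near = context[:-k], context[-k:]
    random.shuffle(far)                       # corrupt only the distant history
    ids = torch.tensor([far + near + [target]])
    with torch.no_grad():
        logits = lm(ids).logits               # (1, T, vocab)
    # Logits at position -2 are the model's prediction for the final token.
    return -torch.log_softmax(logits[0, -2], dim=-1)[target].item()
```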
So that's a general study for a single model. 00:40:14.580 |
But we want to talk about individual predictions 00:40:19.340 |
So one way of interpreting why did my model make 00:40:23.860 |
this decision that's very popular is, for a single 00:40:27.180 |
example, what parts of the input actually led to the decision? 00:40:31.340 |
And this is where we come in with saliency maps. 00:40:47.580 |
The [MASK] rushed to the emergency room to see her patient. 00:40:57.300 |
It's going to be nurse that's here at the [MASK] instead, 00:41:01.060 |
or maybe woman, or doctor, or mother, or girl. 00:41:04.580 |
And then the saliency map is being visualized here in orange. 00:41:09.740 |
called simple gradients, which we'll get into, 00:41:17.900 |
But emergency and her are the important words, apparently. 00:41:21.860 |
And the SEP token shows up in every sentence. 00:41:25.820 |
and so these two together are, according to this method, 00:41:29.380 |
what's important for the model to make this prediction at the [MASK]. 00:41:33.420 |
And you can see maybe some statistics, biases, et cetera, 00:41:39.100 |
and then have it mapped out onto the sentence. 00:41:41.820 |
And this is-- well, it seems like it's really 00:41:47.060 |
And yeah, I think that this is a very useful tool. 00:41:52.580 |
Actually, this is part of a demo from AllenNLP 00:41:56.300 |
that allows you to do this yourself for any sentence 00:42:05.660 |
We're not going to go-- there's so many ways to do it. 00:42:12.660 |
So the issue is, how do you define importance? 00:42:17.420 |
What does it mean to be important to the model's 00:42:28.300 |
And then you've got a model score for a given output class. 00:42:38.740 |
And then you take the norm of the gradient of the score, 00:42:48.620 |
is the unnormalized probability for that class. 00:43:05.340 |
if I move it a little bit in one direction or another? 00:43:08.380 |
And then you take the norm to get a scalar from a vector. 00:43:12.260 |
The salience of word i, you have the norm bars on the outside, 00:43:18.900 |
So that's if I change a little bit locally xi, 00:43:32.060 |
And that means it was very important to the decision. 00:43:46.980 |
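Here's a rough sketch of the simple gradients method on a Hugging Face sentiment classifier (the particular checkpoint is just an illustrative choice):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"   # illustrative model choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

inputs = tok("The movie was not good at all.", return_tensors="pt")
embeds = model.get_input_embeddings()(inputs["input_ids"])
embeds = embeds.detach().requires_grad_(True)      # treat word embeddings as the inputs x_i

logits = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"]).logits
score = logits[0, logits[0].argmax()]              # unnormalized score s_c for the predicted class
score.backward()

saliency = embeds.grad[0].norm(dim=-1)             # salience(x_i) = || d s_c / d x_i ||
for token, s in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), saliency):
    print(f"{token:>12s}  {s.item():.3f}")
```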
The word space is like sort of a flattening of the ability 00:43:51.700 |
to move your word embedding in 1,000 dimensional space. 00:43:54.740 |
So I've just plotted it here in one dimension. 00:44:00.880 |
can see that the relationship between what should be score 00:44:13.740 |
Low saliency, you move the word around locally, 00:44:27.680 |
because I could have changed it, and the score 00:44:46.620 |
It's not perfect, because, well, maybe your linear approximation 00:44:51.860 |
that the gradient gives you holds only very, very locally. 00:45:06.340 |
in either direction, the score would shoot up. 00:45:15.980 |
as opposed to anywhere else even sort of nearby in order 00:45:22.060 |
But the simple gradients method won't capture this, 00:45:36.420 |
And I think that is a good tool for the toolbox. 00:45:42.540 |
OK, so that is one way of explaining a prediction. 00:45:47.260 |
And it has some issues, like why are individual words being 00:45:53.100 |
scored, as opposed to phrases or something like that. 00:45:56.980 |
But for now, we're going to move on to another type 00:46:07.620 |
I mean, earlier on, there were a couple of questions. 00:46:21.520 |
is a methodologically rigorous way of determining 00:46:24.540 |
the importance that the model places on certain tokens? 00:46:27.960 |
It seems like there's some back and forth in the literature. 00:46:34.820 |
And I probably won't engage with that question 00:46:36.900 |
as much as I could if we had a second lecture on this. 00:46:40.660 |
I actually will provide some attention analyses 00:46:46.900 |
about why they can be interesting without being 00:46:53.380 |
sort of maybe the end all of analysis of where information 00:47:08.420 |
would have to get into in a much longer period of time. 00:47:11.580 |
But look at the slides that I show about attention 00:47:15.740 |
And let me know if that answers your question first, 00:47:17.900 |
because we have quite a number of slides on it. 00:47:28.340 |
So I think this is a really fascinating question, which 00:47:31.820 |
also gets at what was important about the input, 00:47:35.220 |
but in actually kind of an even more direct way, which 00:47:38.260 |
is, could I just keep some minimal part of the input 00:47:47.140 |
John Jacob Astor IV invested $100,000 for Tesla. 00:47:51.940 |
And then the answer that is being predicted by the model 00:47:54.220 |
is going to always be in blue in these examples, Colorado 00:47:59.860 |
And the question is, what did Tesla spend Astor's money on? 00:48:03.660 |
That's why the prediction is Colorado Springs Experiments. 00:48:06.020 |
The model gets the answer right, which is nice. 00:48:10.300 |
And we would like to think it's because it's doing 00:48:16.460 |
It turns out, based on this fascinating paper, 00:48:33.140 |
the model had sort of a 0.78 confidence probability 00:48:46.100 |
would not be able to know really what you're trying to ask about. 00:48:49.720 |
So it seems like some things are going really wonky here. 00:48:58.980 |
In fact, it actually references our input saliency methods. 00:49:03.180 |
So you iteratively remove non-salient or unimportant 00:49:08.980 |
So here's a passage again talking about football, 00:49:16.660 |
OK, so the question is, where did the Broncos practice 00:49:19.060 |
for the Super Bowl, with the prediction of Stanford 00:49:33.820 |
change this question such that I still get the answer right? 00:49:38.940 |
was least important according to a saliency method. 00:49:41.780 |
So now, it's where did the practice for the Super Bowl? 00:49:48.700 |
You don't even know which one you're asking about. 00:49:52.620 |
so confident in Stanford University makes no sense. 00:50:03.220 |
stops being confident in the answer, Stanford University. 00:50:16.620 |
reflecting the uncertainty that really should be there 00:50:19.660 |
because you can't know what you're even asking about. 00:50:23.420 |
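The input reduction loop itself is simple. Here's a hedged sketch, where `predict` and `saliency` are hypothetical stand-ins for the QA model's answer function and a per-token importance score like the gradient method above:

```python
def input_reduction(question_tokens, predict, saliency):
    """Repeatedly delete the least important question token as long as the
    model's prediction doesn't change. `predict` maps tokens -> answer string;
    `saliency` maps tokens -> one importance score per token."""
    original_answer = predict(question_tokens)
    tokens = list(question_tokens)
    while len(tokens) > 1:
        scores = saliency(tokens)
        least = min(range(len(tokens)), key=lambda i: scores[i])
        candidate = tokens[:least] + tokens[least + 1:]
        if predict(candidate) != original_answer:
            break                      # removing more would flip the prediction
        tokens = candidate
    return tokens                      # often a nonsensical remnant of the question
```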
OK, so what was important to make this answer? 00:50:35.900 |
All right, so that's sort of the end of the admittedly brief 00:50:45.340 |
Now, we're going to talk about actually breaking models 00:50:58.500 |
Super Bowl, age 39, past record held by John Elway. 00:51:12.060 |
Now, we're not going to change the question to try to sort 00:51:15.040 |
of make the question nonsensical while keeping the same answer. 00:51:22.540 |
by adding the sentence at the end, which really 00:51:25.540 |
This is quarterback, well-known quarterback, Jeff Dean, 00:51:34.700 |
But now, the prediction is Jeff Dean for our nice QA model. 00:51:44.020 |
seems like maybe there's this end of the passage bias 00:51:47.260 |
as to where the answer should be, for example. 00:51:52.900 |
where we flipped the prediction by adding something 00:52:01.700 |
that we had that seemed good is not actually performing QA 00:52:04.740 |
how we want it to, even though its in-domain accuracy was 00:52:12.220 |
So you've got this paragraph with a question, 00:52:19.620 |
The answer is increased scrutiny on teacher misconduct. 00:52:25.100 |
we're going to change the question in really, 00:52:32.740 |
So first, what HA-- now you've got this typo, L-- 00:52:42.420 |
Likely, a human would sort of ignore this typo or something 00:52:49.420 |
Instead of asking, what has been the result of this publicity, 00:52:52.700 |
if you ask, what's been the result of this publicity, 00:52:59.380 |
And this is-- the authors call this a semantically equivalent 00:53:05.700 |
And in general, swapping 'what has' for 'what's' in this QA model 00:53:13.100 |
And so again, when you go back and sort of re-tinker 00:53:31.060 |
Are humans robust to noise is another question we can ask. 00:53:34.100 |
And so you can kind of go to this popular sort of meme 00:53:38.740 |
passed around the internet from time to time, 00:53:41.620 |
where you have all the letters in these words scrambled. 00:53:44.900 |
You say, according to research at Cambridge University, 00:53:49.140 |
it doesn't matter in what order the letters in a word are. 00:54:01.380 |
And we can be robust as humans to reading and processing 00:54:05.060 |
the language without actually all that much of a difficulty. 00:54:10.140 |
So that's maybe something that we might want our models 00:54:19.020 |
Noise is a part of all NLP systems inputs at all times. 00:54:25.380 |
as having users, for example, and not having any noise. 00:54:32.540 |
on some popular machine translation models, where 00:54:36.300 |
you train machine translation models in French, German, 00:54:48.660 |
The idea is these are actually pretty strong machine 00:54:56.100 |
Now, if you add character swaps, like the ones 00:55:09.620 |
And even if you take a somewhat more natural typo noise 00:55:15.220 |
distribution here, you'll see that you're still 00:55:18.020 |
getting 20-ish, yeah, very high drops in BLEU score 00:55:27.900 |
And so maybe you'll go back and retrain the model on more typos 00:55:32.620 |
is it robust to even different kinds of noise? 00:55:34.820 |
These are the questions that are going to be really important. 00:55:39.120 |
able to break your model really easily so that you can then 00:55:57.580 |
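Here's a quick toy version of that kind of synthetic character-swap noise (my own sketch, not the exact noise model from the paper):

```python
import random

def swap_noise(sentence: str, p: float = 0.5) -> str:
    """With probability p per word, swap one random pair of adjacent inner
    characters -- a crude stand-in for typo-style noise."""
    noised = []
    for word in sentence.split():
        if len(word) > 3 and random.random() < p:
            i = random.randint(1, len(word) - 3)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noised.append(word)
    return " ".join(noised)

random.seed(0)
print(swap_noise("according to research at Cambridge University"))
```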
So now we're going to look at the representations 00:56:12.980 |
look more at the actual vector representations that 00:56:17.380 |
And we can answer a different kind of question, 00:56:20.460 |
at the very least, than with the other studies. 00:56:28.740 |
is that some modeling components lend themselves to inspection. 00:56:33.660 |
Now this is a sentence that I chose somewhat carefully, 00:56:43.220 |
But they lend themselves to inspection in the following way. 00:56:46.580 |
You can visualize them well, and you can correlate them easily 00:56:51.660 |
So let's say you have attention heads in BERT. 00:56:53.860 |
This is from a really nice study that was done here, 00:57:00.580 |
and you say, on most sentences, this attention head, head 1, 00:57:08.380 |
Simple kind of operation does this pretty consistently. 00:57:18.460 |
So it's the first layer, which means that this word found 00:57:29.300 |
is that once you do some rounds of attention, 00:57:32.820 |
you've had information mixing and flowing between words. 00:57:36.820 |
And how do you know exactly what information you're combining, 00:57:52.060 |
at sort of a local mechanistic point of view, 00:57:59.580 |
Some attention heads seem to perform simple operations. 00:58:05.500 |
Others seem to attend pretty robustly to the next token. 00:58:18.760 |
Maybe that's sort of splitting sentences together and things 00:58:25.340 |
that some attention heads seem to pretty robustly perform. 00:58:32.460 |
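You can poke at this yourself in a few lines. Here's a sketch that asks, for one arbitrarily chosen head, which token each position attends to most:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

inputs = tok("The chef who ran to the store was out of food.", return_tensors="pt")
with torch.no_grad():
    # One (1, num_heads, T, T) attention tensor per layer.
    attentions = model(**inputs, output_attentions=True).attentions

layer, head = 2, 0                                  # arbitrary head to inspect
attn = attentions[layer][0, head]                   # rows: attending token, cols: attended-to token
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
for i, token in enumerate(tokens):
    print(f"{token:>10s} -> {tokens[attn[i].argmax().item()]}")   # most-attended-to token
```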
what's actually represented at this period at layer 11? 00:58:43.900 |
with really interesting linguistic properties. 00:58:46.060 |
So this head is actually attending to noun modifiers. 00:58:59.980 |
Even if the model is not doing this as a causal mechanism 00:59:11.720 |
is we've got sort of an approximate interpretation 00:59:18.380 |
us to reason about very complicated model behavior. 00:59:29.600 |
And it seems like this head does a pretty OK job of actually 00:59:45.520 |
And so it does so with some percentage of the time. 00:59:49.960 |
And again, it's sort of connecting very complex model 00:59:52.240 |
behavior to these sort of interpretable summaries 01:00:00.240 |
Other cases, you can have individual hidden units 01:00:04.480 |
So here, you've got a character level LSTM language model. 01:00:20.640 |
very negative to very positive or very positive 01:00:31.760 |
and pretty robustly doing so across all of these sentences. 01:00:41.920 |
Here's another cell from that same LSTM language model 01:00:57.160 |
Here, you start with no quote, negative in the red, 01:01:05.560 |
Also, potentially a very useful feature to keep in mind. 01:01:08.000 |
And this is just an individual unit in the LSTM 01:01:10.200 |
that you can just look at and see that it does this. 01:01:19.080 |
and this is actually a study by some AI and neuroscience 01:01:25.120 |
is we saw the LSTMs were good at subject verb number agreement. 01:01:29.560 |
Can we figure out the mechanisms by which the LSTM is 01:01:40.400 |
But you have a sentence, "the boy gently and kindly 01:01:47.840 |
so it's an individual hidden unit, one dimension-- 01:01:52.320 |
is actually, after it sees boy, it sort of starts to go higher. 01:01:57.800 |
And then it goes down to something very small 01:02:02.360 |
And this cell seems to correlate with the scope of a subject 01:02:09.560 |
So here, "the boy that watches the dog that watches the cat 01:02:16.520 |
maintaining the scope of subject until greets, 01:02:23.480 |
Probably some complex other dynamics in the network. 01:02:27.320 |
But it's still a fascinating, I think, insight. 01:02:31.000 |
And yeah, this is just neuron 1,150 in this LSTM. 01:02:42.440 |
that you could do by picking out individual components 01:02:47.040 |
of the model that you can just take each one of 01:02:53.240 |
Now, we'll look at a general class of methods 01:03:04.160 |
of the type of coreference that we're looking for. 01:03:06.840 |
But instead of seeing if it correlates with something 01:03:13.920 |
to look into the vector representations of the model 01:03:19.120 |
by some simple function to say, oh, maybe this property was 01:03:23.440 |
made very easily accessible by my neural network. 01:03:30.720 |
got language data that goes into some big pre-trained 01:03:40.960 |
And so the question for the probing methodology 01:03:44.240 |
is, if it's providing these general purpose language 01:03:47.320 |
representations, what does it actually encode about language? 01:03:56.100 |
is learning about language that we seemingly now 01:04:00.480 |
And so you might have something like a sentence, 01:04:13.840 |
maybe some layers of self-attention and stuff. 01:04:22.560 |
So it's a vector per word or subword for every layer. 01:04:27.020 |
And the question is, can we use these linguistic properties, 01:04:32.380 |
had way back in the early part of the course, 01:04:35.380 |
to understand correlations between properties 01:04:41.040 |
in the vectors and these things that we can interpret? 01:04:53.500 |
So here in this sentence, I record the record. 01:05:19.000 |
Does the model encode that one is a verb and the other a noun? 01:05:26.200 |
So we're going to decide on a layer that we want to analyze. 01:05:47.680 |
to decode a property that I'm interested in really 01:05:53.800 |
So it's indicating that this property is easily 01:06:00.000 |
So maybe I train a linear classifier right on top of BERT. 01:06:08.880 |
And that's sort of interesting already, because you know, 01:06:13.920 |
that if you run a linear classifier on simpler features 01:06:34.440 |
And now I can say, oh, wow, look, by layer 2, 01:06:38.260 |
part of speech is more easily accessible to linear functions 01:06:45.200 |
Well, the self-attention and feed-forward stuff 01:06:49.360 |
That's interesting, because it's a statement about the information 01:07:05.960 |
So if you have the model's representations, h1 to ht, 01:07:13.800 |
So maybe you have a feed-forward neural network, 01:07:21.520 |
so you get some predictions for part of speech tagging 01:07:24.880 |
That's just the probe applied to the hidden state of the model. 01:07:34.760 |
So that's just written out, not as pictorially. 01:07:38.560 |
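A minimal probing sketch: frozen BERT features with a linear classifier on top. The layer choice and the four-sentence "dataset" are purely illustrative:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
LAYER = 7                                      # which layer's representations to probe

def word_features(sentence: str, word_index: int):
    """Frozen hidden state at LAYER for one whitespace-separated word."""
    enc = tok(sentence.split(), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        states = bert(**enc, output_hidden_states=True).hidden_states[LAYER][0]
    piece = enc.word_ids().index(word_index)   # first subword of that word
    return states[piece].numpy()

# Toy probe: is part of speech linearly accessible from these vectors?
data = [("I record the record", 1, "VERB"), ("I record the record", 3, "NOUN"),
        ("They permit the permit", 1, "VERB"), ("They permit the permit", 3, "NOUN")]
X = [word_features(s, i) for s, i, _ in data]
y = [label for _, _, label in data]
probe = LogisticRegression(max_iter=1000).fit(X, y)    # the probe is just a linear model
print(probe.score(X, y))
```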
So I'm not going to stay on this for too much longer. 01:07:44.200 |
And it may help in the search for causal mechanisms, 01:07:48.480 |
but it sort of just gives us a rough understanding 01:07:57.000 |
So one result is that BERT, if you run linear probes on it, 01:08:07.600 |
Actually, in some cases, approximately as well as just 01:08:10.600 |
doing the very best thing you could possibly do without BERT. 01:08:15.440 |
So it just makes easily accessible, amazingly strong 01:08:19.920 |
And that's an interesting sort of emergent quality of BERT, 01:08:26.000 |
It seems like as well that the layers of BERT 01:08:31.200 |
so if you look at the columns of this plot here, 01:08:37.000 |
You've got input words at the sort of layer 0 of BERT here. 01:08:50.240 |
but consistently, the best place to read out these properties 01:08:53.880 |
is somewhere a bit past the middle of the model, which 01:08:57.240 |
is this very consistent rule, which is fascinating. 01:09:04.160 |
look at this function of increasingly abstract 01:09:11.400 |
an increasing depth in the network on that axis. 01:09:19.360 |
can access more and more abstract linguistic properties, 01:09:26.700 |
constructed over time by the layers of processing of BERT. 01:09:30.080 |
So it's building more and more abstract features, which 01:09:33.160 |
I think is, again, a really interesting result. 01:09:41.440 |
to mind that really brings us back right to day one 01:09:48.840 |
We were asking, what does each dimension of Word2Vec mean? 01:09:56.840 |
and think about properties of it through these connections 01:10:00.640 |
between simple mathematical properties of Word2Vec 01:10:04.320 |
and linguistic properties that we could understand. 01:10:08.040 |
So we had this approximation, which is not 100% true. 01:10:11.400 |
But it's an approximation that says cosine similarity is 01:10:15.760 |
effectively correlated with semantic similarity. 01:10:23.560 |
to do at the end of the day is fine tune these word 01:10:27.720 |
Likewise, we had this idea about the analogies being 01:10:47.520 |
interpret the individual dimensions of Word2Vec, 01:10:56.840 |
and simple math on these objects is fascinating. 01:11:00.520 |
And so one piece of work that extends this idea 01:11:14.560 |
we showed that actually BERTs and models like it 01:11:17.840 |
make dependency parse tree structure emergent, 01:11:28.400 |
the chef who ran to the store was out of food, what you can 01:11:38.920 |
So you've got the number of edges in the tree between two 01:11:44.160 |
So you've got that the distance between chef and was is 1. 01:11:48.240 |
And we're going to use this interpretation of a tree 01:11:50.320 |
as a distance to make a connection with BERT's 01:11:54.840 |
And what we were able to show is that under a single linear 01:11:57.800 |
transformation, the squared Euclidean distance between BERT 01:12:02.000 |
vectors for the same sentence actually correlates well with the distances in the parse tree. 01:12:12.280 |
So here in this Euclidean space that we've transformed, 01:12:16.440 |
the approximate distance between chef and was is also 1. 01:12:20.960 |
Likewise, the distance between was and store in the tree is 4. 01:12:25.960 |
And in my simple transformation of BERT space, 01:12:29.560 |
the distance between store and was is also approximately 4. 01:12:33.440 |
And this is true across a wide range of sentences. 01:12:36.480 |
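In code, that probe's distance is just a learned linear map B followed by a squared Euclidean distance. Here's a sketch of the distance computation (training B to match tree distances is omitted; shapes are illustrative):

```python
import torch

def probe_distances(hidden_states: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Pairwise squared distances d(i, j) = || B (h_i - h_j) ||^2 for one sentence.
    hidden_states: (T, d) BERT vectors; B: (k, d) learned linear transformation."""
    transformed = hidden_states @ B.T                              # (T, k)
    diffs = transformed.unsqueeze(1) - transformed.unsqueeze(0)    # (T, T, k)
    return (diffs ** 2).sum(-1)                                    # predicted tree distances

# Toy shapes: 9 words, 768-dim vectors, rank-64 probe.
h, B = torch.randn(9, 768), torch.randn(64, 768)
print(probe_distances(h, B).shape)                                 # torch.Size([9, 9])
```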
And this is, to me, a fascinating example of, 01:12:39.880 |
again, emergent approximate structure in these very 01:12:43.400 |
nonlinear models that don't necessarily need to encode 01:12:56.640 |
are, I think, interesting and point us in directions 01:13:01.680 |
But they're not arguments that the model is actually 01:13:03.800 |
using the thing that you're finding to make a decision. 01:13:12.000 |
So in some work that I did around the same time, 01:13:15.960 |
we showed actually that certain conditions on probes 01:13:19.440 |
allow you to achieve high accuracy on a task that's 01:13:31.000 |
be doing with this thing that is somehow easily accessible. 01:13:34.800 |
It's interesting that this property is easily accessible. 01:13:37.520 |
But the model might not be doing anything with it, for example, 01:13:46.800 |
even if the model is trained to know that thing that you're 01:13:52.160 |
And there's causal studies that try to extend this work. 01:14:04.680 |
to talk about recasting model tweaks and ablations 01:14:11.240 |
where we had a network that was going to work OK. 01:14:17.640 |
And then you could see whether you could remove anything 01:14:24.480 |
is it going to be better if it's more complicated? 01:14:30.320 |
And so one example of some folks who did this 01:14:33.160 |
is they took this idea of multi-headed attention 01:14:50.880 |
not retraining at all, without some of the attention heads, 01:14:56.000 |
You could just get rid of them after training. 01:14:58.480 |
And likewise, you can do the same thing for-- 01:15:03.120 |
You can actually get away without a large, large 01:15:12.040 |
Yeah, so another thing that you could think about 01:15:15.040 |
is questioning sort of the basics of the models 01:15:20.720 |
are sort of self-attention, feedforward, self-attention, 01:15:23.840 |
But why in that order with some of the things omitted here? 01:15:30.960 |
if this is my transformer, self-attention, feedforward, 01:15:33.760 |
self-attention, feedforward, et cetera, et cetera, et cetera, 01:15:45.760 |
So this achieves a lower perplexity on a benchmark. 01:15:51.040 |
important about the architectures that I'm building 01:15:53.320 |
and how can they be changed in order to perform better. 01:15:59.960 |
and impossible to characterize with a single sort 01:16:02.560 |
of statistic, I think, for your test set accuracy, 01:16:07.400 |
And we want to find intuitive descriptions of model 01:16:11.440 |
But we should look at multiple levels of abstraction. 01:16:16.760 |
When someone tells you that their neural network is 01:16:19.160 |
interpretable, I encourage you to engage critically with that. 01:16:30.840 |
Because it's going to be opaque in some ways, 01:16:35.160 |
And then bring this lens to your model building 01:16:39.500 |
as you try to think about how to build better models, 01:16:41.720 |
even if you're not going to be doing analysis as sort of one 01:16:46.880 |
And with that, good luck on your final projects. 01:16:52.120 |
The teaching staff is really appreciative of your efforts 01:16:57.360 |
And yeah, hope-- yeah, there's a lecture left on Thursday.