
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 17 - Model Analysis and Explanation


Chapters

0:00 Introduction
4:07 Why Care
6:19 Model biases
8:28 Deep model analysis
12:47 Natural language inference
14:48 HANS
17:40 How do models perform
19:27 Linguistic properties
24:38 Error rates
26:23 Examples
27:40 Questions
31:50 Unit Testing
36:19 Language Models
37:31 Long Term Memory
40:05 Saliency Maps
42:02 Simple Gradient Method
47:29 Example from SQuAD
48:52 Example from Quest
50:34 Breaking Models
53:22 Robust to Noise
56:23 Attention

Whisper Transcript | Transcript Only Page

00:00:00.000 | Welcome to CS224N, lecture 17, Model Analysis and Explanation.
00:00:14.840 | OK, look at us.
00:00:16.560 | We're here.
00:00:19.080 | Start with some course logistics.
00:00:21.760 | We have updated the policy on the guest lecture reactions.
00:00:26.440 | They're all due Friday, all at 11:59 PM.
00:00:30.600 | You can't use late days for this, so please get them in.
00:00:35.520 | Watch the lectures.
00:00:36.280 | They're awesome lectures.
00:00:37.280 | They're awesome guests.
00:00:39.680 | And you get something like half a point for each of them.
00:00:42.360 | And yeah, all three can be submitted up through Friday.
00:00:46.760 | OK, so final projects.
00:00:48.520 | Remember that the due date is Tuesday.
00:00:51.680 | It's Tuesday at 4:30 PM, March 16.
00:00:55.120 | And let me emphasize that there's a hard deadline
00:00:59.960 | three days after that, on Friday.
00:01:05.000 | We won't be accepting, even for additional points
00:01:07.440 | off, assignments-- sorry, final projects that
00:01:10.920 | are submitted after the 4:30 deadline on Friday.
00:01:15.600 | We need to get these graded and get grades in.
00:01:17.920 | So it's the end stretch, week nine.
00:01:21.680 | Our week 10 is really the lectures
00:01:24.320 | are us giving you help on the final projects.
00:01:27.400 | So this is really the last week of lectures.
00:01:29.200 | Thanks for all your hard work and for asking
00:01:32.600 | awesome questions in lecture and in office hours and on Ed.
00:01:35.680 | And let's get right into it.
00:01:37.560 | So today, we get to talk about one
00:01:40.420 | of my favorite subjects in natural language processing.
00:01:43.880 | It's model analysis and explanation.
00:01:47.040 | So first, we're going to do what I love doing,
00:01:49.080 | which is motivating why we want to talk about the topic at all.
00:01:54.280 | We'll talk about how we can look at a model
00:01:57.440 | at different levels of abstraction
00:01:59.040 | to perform different kinds of analysis on it.
00:02:02.000 | We'll talk about out-of-domain evaluation sets.
00:02:05.080 | So this will feel familiar to the robust QA folks.
00:02:10.400 | Then we'll talk about trying to figure out,
00:02:13.680 | for a given example, why did it make the decision that it made?
00:02:17.160 | It had some input.
00:02:17.960 | It produced some output.
00:02:19.520 | Can we come up with some sort of interpretable explanation for it?
00:02:23.400 | And then we'll look at, actually,
00:02:26.960 | the representations of the models.
00:02:29.160 | So these are the sort of hidden states,
00:02:30.880 | the vectors that are being built throughout the processing
00:02:34.200 | of the model, try to figure out if we can understand
00:02:37.120 | some of the representations and mechanisms
00:02:39.480 | that the model is performing.
00:02:41.400 | And then we'll actually come back to one of the default
00:02:45.240 | states that we've been in this course, which
00:02:47.080 | is trying to look at model improvements,
00:02:50.160 | removing things from models, seeing how it performs,
00:02:52.800 | and relate that to the analysis that we're
00:02:55.280 | doing in this lecture, show how it's not all that different.
00:02:58.360 | So if you haven't seen this XKCD, now you have.
00:03:06.440 | And it's one of my favorites.
00:03:07.880 | I'm going to say all the words.
00:03:09.600 | So person A says, this is your machine learning system.
00:03:14.280 | Person B says, yup, you pour the data
00:03:16.400 | into this big pile of linear algebra
00:03:19.120 | and then collect the answers on the other side.
00:03:21.560 | Person A, what if the answers are wrong?
00:03:24.400 | And person B, just stir the pile until they
00:03:26.480 | start looking right.
00:03:28.520 | And I feel like, at its worst, deep learning
00:03:31.080 | can feel like this from time to time.
00:03:32.880 | You have a model.
00:03:34.360 | Maybe it works for some things.
00:03:35.840 | Maybe it doesn't work for other things.
00:03:37.680 | You're not sure why it works for some things
00:03:39.560 | and doesn't work for others.
00:03:40.960 | And the changes that we make to our models,
00:03:45.080 | they're based on intuition.
00:03:46.240 | But frequently, what have the TAs told everyone in office hours?
00:03:49.600 | I was like, ah, sometimes you just
00:03:51.040 | have to try it and see if it's going to work out
00:03:53.000 | because it's very hard to tell.
00:03:55.080 | It's very, very difficult to understand our models
00:03:59.960 | on sort of any level.
00:04:01.320 | And so today, we'll go through a number
00:04:03.040 | of ways for trying to sort of carve out
00:04:05.080 | little bits of understanding here and there.
00:04:08.040 | So beyond it being important because it's in an XKCD
00:04:13.640 | comic, why should we care about understanding our models?
00:04:18.360 | One is that we want to know what our models are doing.
00:04:23.640 | So here, you have a black box.
00:04:27.280 | Black box functions are sort of this idea
00:04:29.600 | that you can't look into them and interpret
00:04:31.320 | what they're doing.
00:04:33.440 | You have an input sentence, say, and then some output
00:04:36.520 | prediction.
00:04:37.680 | Maybe this black box is actually your final project model,
00:04:42.720 | and it gets some accuracy.
00:04:45.960 | Now, we summarize our models.
00:04:48.840 | And in your final projects, you'll summarize your model
00:04:51.040 | with sort of one or a handful of summary metrics of accuracy
00:04:55.680 | or F1 score or BLEU score or something.
00:04:58.800 | But it's a lot of model to explain
00:05:01.200 | with just a small number of metrics.
00:05:03.640 | So what do they learn?
00:05:05.040 | Why do they succeed, and why do they fail?
00:05:08.440 | What's another motivation?
00:05:09.520 | So we want to sort of know what our models are doing, OK.
00:05:12.800 | But maybe that's because we want to be
00:05:15.280 | able to make tomorrow's model.
00:05:17.080 | So today, when you're building models in this class
00:05:21.120 | or at a company, you start out with some kind of recipe
00:05:24.480 | that is known to work either at the company
00:05:27.440 | or because you have experience from this class.
00:05:30.320 | And it's not perfect.
00:05:31.320 | It makes mistakes.
00:05:32.160 | You look at the errors.
00:05:33.440 | And then over time, you take what works, maybe,
00:05:37.920 | and then you find what needs changing.
00:05:39.520 | So it seems like maybe adding another layer to the model
00:05:42.560 | helped.
00:05:43.760 | And maybe that's a nice tweak, and the model performance
00:05:46.680 | gets better, et cetera.
00:05:48.880 | And incremental progress doesn't always feel exciting.
00:05:53.920 | But I want to pitch to you that it's actually
00:05:56.000 | very important for us to understand
00:05:58.320 | how much incremental progress can kind of get us
00:06:01.400 | towards some of our goals.
00:06:03.080 | So that we can have a better job of evaluating
00:06:06.720 | when we need big leaps, when we need major changes,
00:06:10.560 | because there are problems that we're
00:06:12.080 | attacking with our incremental sort of progress,
00:06:14.400 | and we're not getting very far.
00:06:16.680 | OK, so we want to make tomorrow's model.
00:06:20.320 | Another thing that is, I think, very related to
00:06:23.200 | and sort of both a part of and bigger
00:06:25.680 | than this field of analysis is model biases.
00:06:29.480 | So let's say you take your Word2Vec analogies solver
00:06:34.040 | from GloVe or Word2Vec, that is, from assignment one,
00:06:39.720 | and you give it the analogy, man is to computer programmer
00:06:43.200 | as woman is to--
00:06:44.480 | and it gives you the output, homemaker.
00:06:47.040 | This is a real example from the paper below.
00:06:50.840 | You should be like, wow, well, I'm glad I know that now.
00:06:55.160 | And of course, you saw the lecture from Yulia Tsvetkov
00:07:00.040 | last week.
00:07:00.680 | You say, wow, I'm glad I know that now.
00:07:02.680 | And that's a huge problem.
00:07:04.800 | What did the model use in its decision?
00:07:06.680 | What biases is it learning from data
00:07:08.280 | and possibly making even worse?
00:07:10.400 | So that's the kind of thing that you can also
00:07:12.600 | do with model analysis beyond just making models better
00:07:15.320 | according to some sort of summary metric as well.
00:07:19.640 | And then another thing, we don't just
00:07:21.440 | want to make tomorrow's model.
00:07:22.800 | And this is something that I think is super important.
00:07:25.120 | We don't just want to look at that time scale.
00:07:30.200 | We want to say, what about 10, 15, 25 years from now?
00:07:34.040 | What kinds of things will we be doing?
00:07:36.280 | What are the limits?
00:07:37.640 | What can be learned by language model pre-training?
00:07:41.080 | What's the model that will replace the transformer?
00:07:43.920 | What's the model that will replace that model?
00:07:46.320 | What does deep learning struggle to do?
00:07:48.000 | What are we sort of attacking over and over again
00:07:50.600 | and failing to make significant progress on?
00:07:52.880 | What do neural models tell us about language potentially?
00:07:55.600 | There's some people who are primarily
00:07:57.140 | interested in understanding language better
00:07:59.640 | using neural networks.
00:08:00.960 | Cool.
00:08:02.840 | How are our models affecting people,
00:08:06.400 | transferring power between groups of people,
00:08:08.920 | governments, et cetera?
00:08:10.640 | That's an excellent type of analysis.
00:08:12.720 | What can't be learned via language model pre-training?
00:08:15.140 | So that's sort of the complementary question there.
00:08:17.640 | If you sort of come to the edge of what
00:08:19.680 | you can learn via language model pre-training,
00:08:22.240 | is there stuff that we need total paradigm shifts
00:08:24.280 | in order to do well?
00:08:28.080 | So all of this falls under some category
00:08:32.160 | of trying to really deeply understand our models
00:08:34.240 | and their capabilities.
00:08:36.520 | And there's a lot of different methods
00:08:38.680 | here that we'll go over today.
00:08:40.440 | And one thing that I want you to take away from it
00:08:42.520 | is that each of them is going to tell us about
00:08:47.520 | some aspect of the model, elucidate
00:08:49.440 | some kind of intuition or something, but with none of them
00:08:52.160 | are we going to say, aha, I really
00:08:54.640 | understand 100% about what this model is doing now.
00:08:57.600 | So they're going to provide some clarity,
00:08:59.360 | but never total clarity.
00:09:01.000 | And one way, if you're trying to decide
00:09:04.080 | how you want to understand your model more,
00:09:06.420 | I think you should sort of start out by thinking about is,
00:09:09.080 | at what level of abstraction do I
00:09:11.160 | want to be looking at my model?
00:09:12.960 | So the sort of very high level abstraction,
00:09:17.280 | let's say you've trained a QA model
00:09:20.600 | to estimate the probabilities of start and end indices
00:09:23.880 | in a reading comprehension problem,
00:09:25.960 | or you've trained a language model that assigns probabilities
00:09:28.980 | to words in context.
00:09:30.480 | You can just look at the model as that object.
00:09:33.480 | So it's just a probability distribution
00:09:35.800 | defined by your model.
00:09:37.480 | You are not looking into it any further than the fact
00:09:40.080 | that you can sort of give it inputs
00:09:41.760 | and see what outputs it provides.
00:09:45.000 | So that's like, who even cares if it's a neural network?
00:09:49.480 | It could be anything.
00:09:50.960 | But it's a way to understand its behavior.
00:09:53.240 | Another level of abstraction that you can look at,
00:09:55.360 | you can dig a little deeper.
00:09:56.520 | You can say, well, I know that my network is
00:09:59.360 | a bunch of layers that are kind of stacked
00:10:01.100 | on top of each other.
00:10:02.000 | You've got sort of maybe your transformer encoder
00:10:05.280 | with one layer, two layer, three layer.
00:10:07.440 | You can try to see what it's doing as it
00:10:09.120 | goes deeper in the layers.
00:10:11.680 | So maybe your neural model is a sequence
00:10:13.520 | of these vector representations.
00:10:15.200 | A third option of sort of specificity
00:10:17.840 | is to look at as much detail as you can.
00:10:21.800 | You've got these parameters in there.
00:10:23.300 | You've got the connections in the computation graph.
00:10:26.120 | So now you're sort of trying to remove all of the abstraction
00:10:29.560 | that you can and look at as many details as possible.
00:10:32.160 | And all three of these sort of ways
00:10:33.640 | of looking at your model and performing analysis
00:10:36.200 | are going to be useful and will actually
00:10:38.400 | sort of travel slowly from one to two to three
00:10:42.220 | as we go through this lecture.
00:10:47.080 | So we haven't actually talked about any analyses yet.
00:10:51.120 | So we're going to get started on that now.
00:10:55.360 | And we're starting with the sort of testing
00:10:57.720 | our model's behaviors.
00:10:59.520 | So say we want to see, will my model perform well?
00:11:02.720 | I mean, the natural thing to ask is, well,
00:11:05.280 | how does it behave on some sort of test set?
00:11:10.000 | And so we don't really care about mechanisms yet.
00:11:13.100 | Why is it performing like this?
00:11:14.860 | By what method is it making its decision?
00:11:17.400 | Instead, we're just interested in sort of the more higher
00:11:19.980 | level abstraction of, does it perform
00:11:23.020 | the way I want it to perform?
00:11:24.940 | So let's take our model evaluation
00:11:28.540 | that we are already doing and sort of recast it
00:11:31.500 | in the framework of analysis.
00:11:33.700 | So you've trained your model on some samples
00:11:36.180 | from some distribution.
00:11:37.340 | So you've got input/output pairs of some kind.
00:11:40.440 | So how does the model behave on samples
00:11:42.420 | from the same distribution?
00:11:43.980 | It's a simple question.
00:11:45.180 | And it's sort of--
00:11:47.300 | it's known as in-domain accuracy.
00:11:49.620 | Or you can say that the samples are IID,
00:11:52.420 | and that's what you're testing on.
00:11:53.900 | And this is just what we've been doing this whole time.
00:11:56.300 | It's your test set accuracy, or F1, or blue score.
00:11:59.940 | And so you've got some model with some accuracy.
00:12:04.620 | And maybe it's better than some model with some other accuracy
00:12:08.140 | on this test set.
00:12:09.060 | So this is what you're doing as you're
00:12:10.640 | iterating on your models in your final project as well.
00:12:14.420 | You say, well, on my test set, which
00:12:16.580 | is what I've decided to care about for now,
00:12:18.700 | model A does better.
00:12:20.220 | They both seem pretty good.
00:12:22.020 | And so maybe I'll choose model A to keep working on.
00:12:24.740 | Maybe I'll choose it if you were putting something
00:12:26.900 | into production.
00:12:28.620 | But remember back to this idea that it's just one number
00:12:33.060 | to summarize a very complex system.
00:12:36.100 | It's not going to be sufficient to tell you
00:12:37.940 | how it's going to perform in a wide variety of settings.
00:12:42.540 | So we've been doing this.
00:12:44.060 | This is model evaluation as model analysis.
00:12:48.380 | Now we are going to say, what if we are not
00:12:52.020 | testing on exactly the same type of data that we trained on?
00:12:56.300 | So now we're asking, did the model learn something
00:12:58.900 | such that it's able to sort of extrapolate or perform
00:13:02.340 | how I want it to on data that looks a little bit different
00:13:04.860 | from what it was trained on?
00:13:06.100 | And we're going to take the example of natural language
00:13:08.400 | inference.
00:13:08.980 | So to recall the task of natural language inference--
00:13:11.540 | and this is through the multi-NLI data set
00:13:13.340 | that we're just pulling our definition--
00:13:15.540 | you have a premise.
00:13:16.860 | He turned and saw John sleeping in his half tent.
00:13:19.940 | And you have a hypothesis.
00:13:21.700 | He saw John was asleep.
00:13:23.940 | And then you give them both to a model.
00:13:26.300 | And this is the model that we had before
00:13:27.980 | that gets some good accuracy.
00:13:29.740 | And the model is supposed to tell
00:13:31.780 | whether the hypothesis is sort of implied by the premise
00:13:35.900 | or contradicting.
00:13:37.740 | So it could be contradicting, maybe,
00:13:39.500 | if the hypothesis is John was awake, for example,
00:13:43.260 | or he saw John was awake.
00:13:44.340 | Maybe that would be a contradiction.
00:13:45.840 | Neutral, if sort of both could be true at the same time,
00:13:49.220 | so to speak.
00:13:50.300 | And then entailment, in this case,
00:13:51.900 | it seems like they're saying that the premise implies
00:13:54.340 | the hypothesis.
00:13:55.860 | And so you would say, probably, this
00:13:58.780 | is likely to get the right answer,
00:14:00.240 | since the accuracy of the model is 95%.
00:14:02.740 | 95% of the time, it gets the right answer.
00:14:06.300 | And we're going to dig deeper into that.
00:14:09.100 | What if the model is not doing what
00:14:11.780 | we think we want it to be doing in order to perform
00:14:15.060 | natural language inference?
00:14:16.740 | So in a data set like multi-NLI, the authors
00:14:19.580 | who gathered the data set will have asked humans
00:14:22.580 | to perform the task and gotten the accuracy
00:14:25.420 | that the humans achieved.
00:14:27.260 | And models nowadays are achieving accuracies
00:14:29.940 | that are around where humans are achieving,
00:14:33.980 | which sounds great at first.
00:14:36.700 | But as we'll see, it's not the same
00:14:38.820 | as actually performing the task more broadly in the right way.
00:14:45.460 | So what if the model is not doing something smart,
00:14:47.540 | effectively?
00:14:49.260 | We're going to use a diagnostic test
00:14:51.580 | set of carefully constructed examples that
00:14:54.740 | seem like things the model should be able to do to test
00:14:58.020 | for a specific skill or capacity.
00:15:01.300 | In this case, we'll use HANS.
00:15:03.100 | So HANS is the Heuristic Analysis for NLI Systems data set.
00:15:07.380 | And it's intended to take systems
00:15:10.020 | that do natural language inference
00:15:11.860 | and test whether they're using some simple syntactic
00:15:14.620 | heuristics.
00:15:16.140 | In each of these cases, we'll have some heuristic.
00:15:19.540 | We'll talk through the definition.
00:15:21.180 | We'll get an example.
00:15:22.020 | So the first thing is lexical overlap.
00:15:24.320 | So the model might do this thing where
00:15:28.700 | it assumes that a premise entails
00:15:30.260 | all hypotheses constructed from words in the premise.
00:15:32.900 | So in this example, you have the premise,
00:15:36.980 | the doctor was paid by the actor.
00:15:40.820 | And then the hypothesis is the doctor paid the actor.
00:15:43.740 | And you'll notice that in bold here, you've got the doctor,
00:15:46.860 | and then paid, and then the actor.
00:15:49.900 | And so if you use this heuristic,
00:15:52.460 | you will think that the doctor was paid by the actor
00:15:54.660 | implies the doctor paid the actor.
00:15:56.380 | That does not imply it, of course.
00:15:58.580 | And so you could expect a model.
00:16:00.500 | You want the model to be able to do this.
00:16:02.160 | It's somewhat simple.
00:16:03.420 | But if it's using this heuristic,
00:16:04.900 | it won't get this example right.
00:16:07.900 | Next is subsequence heuristics.
00:16:10.500 | So here, if the model assumes that the premise entails
00:16:15.500 | all of its contiguous subsequences,
00:16:17.620 | it will get this one wrong as well.
00:16:19.100 | So this example is the doctor near the actor danced.
00:16:23.300 | That's the premise.
00:16:24.140 | The hypothesis is the actor danced.
00:16:26.700 | Now, this is a simple syntactic thing.
00:16:28.300 | The doctor is doing the dancing near the actor
00:16:31.460 | is this prepositional phrase.
00:16:33.860 | And so the model uses this heuristic.
00:16:36.140 | Oh, look, the actor danced.
00:16:37.300 | That's a subsequence entailed.
00:16:38.980 | Awesome.
00:16:39.660 | Then it'll get this one wrong as well.
00:16:42.420 | And here's another one that's a lot like subsequence.
00:16:45.940 | But so if the model thinks that the premise entails
00:16:50.660 | all complete subtrees-- so this is like fully formed phrases.
00:16:55.660 | So the artist slept here is a fully formed subtree.
00:17:00.420 | If the artist slept, the actor ran.
00:17:02.540 | And then that's the premise.
00:17:03.960 | Does it entail the hypothesis the actor slept?
00:17:09.740 | Sorry, the artist slept.
00:17:10.980 | That does not entail it because this is in that conditional.
00:17:15.180 | Let me pause here for some questions
00:17:16.660 | before I move on to see how these models do.
00:17:20.460 | Anyone unclear about how this sort of evaluation
00:17:25.500 | is being set up?
00:17:34.260 | Cool.
00:17:39.740 | OK, so how do models perform?
00:17:42.820 | That's sort of the question of the hour.
00:17:46.220 | What we'll do is we'll look at these results
00:17:49.340 | from the same paper that released the data set.
00:17:51.860 | So they took four strong MultiNLI models
00:17:56.380 | with the following accuracies.
00:17:57.660 | So the accuracies here are something between 60% and 80%.
00:18:01.900 | Bert over here is doing the best.
00:18:04.660 | And in domain, in that first setting that we talked about,
00:18:10.900 | you get these reasonable accuracies.
00:18:13.820 | And that is sort of what we said before about it
00:18:16.900 | like looking pretty good.
00:18:19.820 | And when we evaluate on Hans, in this setting
00:18:24.740 | here, we have examples where the heuristics we talked about
00:18:29.580 | actually work.
00:18:30.420 | So if the model is using the heuristic,
00:18:32.120 | it will get this right.
00:18:34.020 | And it gets very high accuracies.
00:18:37.020 | And then if we evaluate the model in the settings
00:18:40.700 | where if it uses the heuristic, it gets the examples wrong,
00:18:45.500 | maybe Bert's doing epsilon better than some
00:18:48.060 | of the other stuff here.
00:18:49.260 | But it's a very different story.
00:18:53.700 | And you saw those examples.
00:18:55.540 | They're not complex in our own idea of complexity.
00:19:03.180 | And so this is why it feels like a clear failure of the system.
00:19:08.420 | Now, you can say, though, that, well, maybe the training data
00:19:11.780 | sort of didn't have any of those sort of phenomena.
00:19:14.740 | So the model couldn't have learned not to do that.
00:19:18.060 | And that's sort of a reasonable argument, except, well,
00:19:20.700 | Bert is pre-trained on a bunch of language text.
00:19:23.540 | So you might expect, you might hope that it does better.
00:19:26.380 | So we saw that example of models performing well
00:19:37.380 | on examples that are like those that it was trained on,
00:19:40.420 | and then performing not very well at all
00:19:42.420 | on examples that seem reasonable but are
00:19:46.540 | sort of a little bit tricky.
00:19:49.380 | Now we're going to take this idea of having a test
00:19:52.340 | set that we've carefully crafted and go in a slightly
00:19:54.540 | different direction.
00:19:55.740 | So we're going to have, what does it
00:19:57.260 | mean to try to understand the linguistic properties
00:20:00.040 | of our models?
00:20:01.180 | So that syntactic heuristics question
00:20:03.380 | was one thing for natural language inference.
00:20:05.260 | But can we sort of test how the models,
00:20:08.260 | whether they think certain things are sort of right
00:20:10.740 | or wrong as language models?
00:20:14.300 | And the first way that we'll do this is we'll ask, well,
00:20:16.740 | how do we think about sort of what humans
00:20:18.900 | think of as good language?
00:20:21.260 | How do we evaluate their sort of preferences about language?
00:20:26.700 | And one answer is minimal pairs.
00:20:29.020 | And the idea of a minimal pair is
00:20:30.580 | that you've got one sentence that sounds OK to a speaker.
00:20:34.740 | So this sentence is, the chef who made the pizzas is here.
00:20:39.660 | It's called acceptable-- it's an acceptable sentence, at least to me.
00:20:43.700 | And then with a small change, a minimal change,
00:20:47.500 | the sentence is no longer OK to the speaker.
00:20:50.020 | So the chef who made the pizzas are here.
00:20:53.380 | And this-- whoops.
00:20:57.340 | This should be present tense verbs.
00:21:01.260 | In English, present tense verbs agree in number
00:21:03.460 | with their subject when they are third person.
00:21:07.020 | So chef, pizzas, OK.
00:21:10.540 | And this is sort of a pretty general thing.
00:21:14.740 | Most people don't like this.
00:21:15.980 | It's a misconjugated verb.
00:21:18.580 | And so the syntax here looks like you
00:21:21.220 | have the chef who made the pizzas.
00:21:23.180 | And then this arc of agreement in number
00:21:26.740 | is requiring the word is here to be singular
00:21:30.260 | is instead of plural are, despite the fact
00:21:33.180 | that there's this noun pizzas, which is plural,
00:21:36.580 | closer linearly, comes back to dependency parsing.
00:21:39.860 | We're back.
00:21:42.140 | And what this looks like in the tree structure
00:21:45.060 | is, well, chef and is are attached in the tree.
00:21:52.140 | Chef is the subject of is.
00:21:54.140 | Pizzas is down here in this subtree.
00:21:56.900 | And so that subject-verb relationship
00:21:59.300 | has this sort of agreement thing.
00:22:02.500 | So this is a pretty sort of basic and interesting property
00:22:05.660 | of language that also reflects the syntactic sort
00:22:09.180 | of hierarchical structure of language.
00:22:11.060 | So we've been training these language models,
00:22:12.900 | sampling from them, seeing that they get interesting things.
00:22:15.740 | And they tend to seem to generate syntactic content.
00:22:19.340 | But does it really understand, or does it
00:22:21.980 | behave as if it understands this idea of agreement more broadly?
00:22:26.140 | And does it sort of get the syntax right
00:22:28.380 | so that it matches the subjects and the verbs?
00:22:31.820 | But language models can't tell us
00:22:33.860 | exactly whether they think that a sentence is good or bad.
00:22:36.980 | They just tell us the probability of a sentence.
00:22:40.300 | So before, we had acceptable and unacceptable.
00:22:43.380 | That's what we get from humans.
00:22:45.780 | And the language model's analog is just,
00:22:47.900 | does it assign higher probability
00:22:49.780 | to the acceptable sentence in the minimal pair?
00:22:52.180 | So you have the probability under the model of the chef who
00:22:56.020 | made the pizzas is here.
00:22:58.140 | And then you have the probability
00:22:59.980 | under the model of the chef who made the pizzas are here.
00:23:03.740 | And you want this probability here to be higher.
00:23:08.020 | And if it is, that's sort of like a simple way
00:23:10.500 | to test whether the model got it right effectively.
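(Here is a minimal sketch of that minimal-pair check using an off-the-shelf autoregressive language model from Hugging Face transformers; GPT-2 is just a stand-in for whichever language model you are analyzing.)

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # .loss is the mean token-level cross-entropy; multiply by the
        # number of predicted tokens to get the total log-probability.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

acceptable = "The chef who made the pizzas is here."
unacceptable = "The chef who made the pizzas are here."
# The model "passes" the minimal pair if it prefers the acceptable sentence.
print(sentence_logprob(acceptable) > sentence_logprob(unacceptable))
```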
00:23:15.460 | And just like in Hans, we can develop a test set
00:23:19.940 | with very carefully chosen properties.
00:23:22.100 | So most sentences in English don't
00:23:24.900 | have terribly complex subject-verb agreement
00:23:29.300 | structure with a lot of words in the middle,
00:23:31.180 | like pizzas, that are going to make it difficult.
00:23:34.220 | So if I say, the dog runs, sort of no way
00:23:39.340 | to get it wrong, because this index is very simple.
00:23:44.860 | So we can create, or we can look for sentences that have--
00:23:49.940 | these are the things called attractors in the sentence.
00:23:53.620 | So pizzas is an attractor, because the model
00:23:56.500 | might be attracted to the plurality here
00:23:59.300 | and get the conjugation wrong.
00:24:02.980 | So this is our question.
00:24:03.940 | Can language models sort of very generally handle
00:24:06.580 | these examples with attractors?
00:24:08.500 | So we can take examples with zero attractors,
00:24:11.340 | see whether the model gets the minimal pairs evaluation right.
00:24:14.540 | We can take examples with one attractor, two attractors.
00:24:18.340 | You can see how people would still reasonably understand
00:24:20.660 | these sentences, right?
00:24:21.820 | Chef who made the pizzas and prepped the ingredients is.
00:24:24.700 | It's still the chef who is.
00:24:26.460 | And then on and on and on, it gets rarer, obviously.
00:24:29.980 | But you can have more and more attractors.
00:24:32.620 | And so now we've created this test set
00:24:34.180 | that's intended to evaluate this very specific linguistic
00:24:36.840 | phenomenon.
00:24:39.260 | So in this paper here, Kuncoro et al.
00:24:43.140 | trained an LSTM language model on a subset of Wikipedia
00:24:46.540 | back in 2018.
00:24:47.900 | And they evaluated it sort of in these buckets
00:24:50.540 | that are specified by the paper that sort of introduced
00:24:55.660 | subject-verb agreement to the NLP field,
00:25:00.220 | more recently at least.
00:25:02.340 | And they evaluated it in buckets based
00:25:04.700 | on the number of attractors.
00:25:06.140 | And so in this table here that you're about to see,
00:25:09.620 | the numbers are sort of the percent of times
00:25:12.580 | that you assign higher probability
00:25:14.780 | to the correct sentence in the minimal pair.
00:25:19.660 | So if you were just to do random or majority class,
00:25:21.780 | you get these errors.
00:25:23.220 | Oh, sorry, it's the percent of times that you get it wrong.
00:25:26.380 | Sorry about that.
00:25:27.260 | So lower is better.
00:25:29.780 | And so with no attractors, you get very low error rates.
00:25:33.460 | So this is 1.3 error rate with a 350-dimensional LSTM.
00:25:38.940 | And with one attractor, your error rate is higher.
00:25:43.020 | But actually, humans start to get errors
00:25:45.480 | with more attractors too.
00:25:47.300 | So zero attractors is easy.
00:25:50.220 | The larger the LSTM, it looks like in general,
00:25:52.400 | the better you're doing.
00:25:53.580 | So the smaller model's doing worse.
00:25:56.460 | And then even on very difficult examples with four attractors,
00:26:00.220 | which try to think of an example in your head,
00:26:02.420 | like the chef made the pizzas and took out the trash.
00:26:06.860 | It sort of has to be this long sentence.
00:26:08.820 | The error rate is definitely higher,
00:26:10.340 | so it gets more difficult. But it's still relatively low.
00:26:15.260 | And so even on these very hard examples,
00:26:16.900 | models are actually performing subject-verb number agreement
00:26:19.740 | relatively well.
00:26:21.300 | Very cool.
00:26:25.500 | Here's some examples that a model got wrong.
00:26:28.540 | This is actually a worse model than the ones
00:26:30.300 | from the paper that was just there.
00:26:31.960 | But I think, actually, the errors are quite interesting.
00:26:34.980 | So here's a sentence.
00:26:35.900 | The ship that the player drives has a very high speed.
00:26:41.320 | Now, this model thought that was less probable than the ship
00:26:45.100 | that the player drives have a very high speed.
00:26:50.940 | My hypothesis is that it sort of misanalyzes drives
00:26:56.940 | as a plural noun, for example.
00:27:00.060 | It's sort of a difficult construction there.
00:27:01.900 | I think it's pretty interesting.
00:27:04.500 | Likewise here, this one is fun.
00:27:07.100 | The lead is also rather long.
00:27:09.300 | Five paragraphs is pretty lengthy.
00:27:12.520 | So here, five paragraphs is a singular noun together.
00:27:16.980 | It's like a unit of length, I guess.
00:27:20.340 | But the model thought that it was more likely to say
00:27:23.420 | five paragraphs are pretty lengthy,
00:27:26.380 | because it's referring to this sort of five paragraphs
00:27:30.620 | as the five actual paragraphs themselves,
00:27:33.380 | as opposed to a single unit of length describing the lead.
00:27:37.540 | Fascinating.
00:27:41.120 | Any questions again?
00:27:46.620 | [INAUDIBLE]
00:27:53.540 | So I guess there are a couple.
00:27:56.060 | Can we do the similar heuristic analysis
00:27:59.180 | for other tasks, such as Q&A, classification?
00:28:07.580 | So yes, I think that it's easier to do this kind of analysis
00:28:11.260 | for the Hans style analysis with question answering
00:28:18.340 | and other sorts of tasks, because you can construct
00:28:22.140 | examples that similarly have these heuristics
00:28:32.820 | and then have the answer depend on the syntax or not.
00:28:36.060 | The actual probability of one sentence
00:28:39.660 | is higher than the other, of course,
00:28:41.160 | sort of a language model dependent thing.
00:28:43.300 | But the idea that you can develop bespoke test
00:28:48.700 | sets for various tasks, I think, is very, very general
00:28:54.140 | and something I think is actually quite interesting.
00:28:59.860 | So I won't go on further, but I think the answer is just yes.
00:29:04.980 | So there's another one.
00:29:07.380 | How do you know where to find these failure cases?
00:29:10.180 | Maybe that's the right time to advertise linguistics classes.
00:29:14.380 | Sorry.
00:29:16.180 | You're still very quiet over here.
00:29:18.220 | How do we find what?
00:29:19.740 | How do you know where to find these failure cases?
00:29:23.260 | Oh, interesting.
00:29:24.100 | Yes, how do we know where to find the failure cases?
00:29:27.100 | That's a good question.
00:29:28.500 | I mean, I think I agree with Chris
00:29:30.500 | that actually thinking about what
00:29:33.740 | is interesting about things in language is one way to do it.
00:29:39.500 | I mean, the heuristics that we saw in our language model--
00:29:45.500 | sorry, in our NLI models with Hans,
00:29:49.380 | you can imagine that they--
00:29:53.620 | if the model was sort of ignoring facts about language
00:29:56.780 | and sort of just doing this sort of rough bag of words
00:29:59.540 | with some extra magic, then it would do well about as bad
00:30:03.780 | as it's doing here.
00:30:05.360 | And these sorts of ideas about understanding
00:30:10.540 | that this statement, if the artist slept, the actor ran,
00:30:13.260 | does not imply the artist slept, is the kind of thing
00:30:15.980 | that maybe you'd think up on your own,
00:30:18.380 | but also you'd spend time sort of pondering about and thinking
00:30:22.760 | broad thoughts about in linguistics curricula as well.
00:30:27.380 | So anything else, Chris?
00:30:32.960 | Yeah.
00:30:35.940 | So there's also-- well, I guess someone was also saying--
00:30:41.020 | I think it's about the sort of intervening verbs example--
00:30:44.700 | intervening nouns, sorry, example.
00:30:46.660 | But the data set itself probably includes mistakes
00:30:50.260 | with higher attractors.
00:30:53.020 | Yeah, yeah, that's a good point.
00:30:55.540 | Yeah, because humans make more and more mistakes
00:30:57.980 | as the number of attractors gets larger.
00:31:03.880 | On the other hand, I think that the mistakes are fewer
00:31:06.880 | in written text than in spoken.
00:31:10.200 | Maybe I'm just making that up.
00:31:12.000 | That's what I think.
00:31:13.560 | But yeah, it would be interesting to actually go
00:31:15.520 | through that test set and see how many of the errors
00:31:20.400 | a really strong model makes are actually
00:31:22.360 | due to the sort of observed form being incorrect.
00:31:26.000 | I'd be super curious.
00:31:27.420 | OK, should I move on?
00:31:36.180 | Yeah.
00:31:36.940 | Great.
00:31:37.440 | OK, so what does it feel like we're
00:31:52.740 | doing when we are kind of constructing
00:31:55.360 | these sort of bespoke, small, careful test
00:31:57.860 | sets for various phenomena?
00:31:59.980 | Well, it sort of feels like unit testing.
00:32:03.500 | And in fact, this sort of idea has been brought to the fore,
00:32:10.700 | you might say, in NLP unit tests,
00:32:13.560 | but for these NLP neural networks.
00:32:15.260 | And in particular, the paper here
00:32:18.380 | that I'm citing at the bottom suggests this minimum
00:32:21.900 | functionality test.
00:32:23.220 | You want a small test set that targets a specific behavior.
00:32:26.500 | That should sound like some of the things
00:32:28.660 | that we've already talked about.
00:32:30.820 | But in this case, we're going to get even more specific.
00:32:34.660 | So here's a single test case.
00:32:36.820 | We're going to have an expected label, what was actually
00:32:40.220 | predicted, whether the model passed this unit test.
00:32:43.660 | And the labels are going to be sentiment analysis here.
00:32:47.620 | So negative label, positive label,
00:32:49.740 | or neutral are the three options.
00:32:52.220 | And the unit test is going to consist simply
00:32:54.940 | of sentences that follow this template.
00:32:57.780 | I, then a negation, a positive verb, and then the thing.
00:33:02.700 | So if you negation positive verb,
00:33:05.580 | it means you negative verb.
00:33:07.820 | And so here's an example.
00:33:09.020 | I can't say I recommend the food.
00:33:11.300 | The expected label is negative.
00:33:13.060 | The answer that the model provided--
00:33:14.660 | and this is, I think, a commercial sentiment analysis
00:33:17.780 | system.
00:33:19.380 | So it predicted positive.
00:33:21.060 | And then I didn't love the flight.
00:33:24.460 | The expected label was negative.
00:33:26.100 | And then the predicted answer was neutral.
00:33:29.820 | And this commercial sentiment analysis system
00:33:32.740 | gets a lot of what you could imagine
00:33:35.780 | are pretty reasonably simple examples wrong.
00:33:38.340 | And so what your bureau at all 2020
00:33:41.500 | showed is that they could actually provide a system that
00:33:44.700 | sort of had this framework of building test cases for NLP
00:33:48.300 | models to ML engineers working on these products
00:33:52.980 | and give them that interface.
00:33:55.180 | And they would actually find bugs--
00:33:59.620 | bugs being categories of high error--
00:34:01.900 | find bugs in their models that they could then
00:34:03.780 | kind of try to go and fix.
00:34:06.420 | And this was kind of an efficient way
00:34:08.380 | of trying to find things that were simple and still wrong
00:34:11.300 | with what should be pretty sophisticated neural systems.
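(A sketch of what such a minimum functionality test can look like in code, assuming you have some `predict_sentiment` function wrapping the system under test; the template slots and label strings here are illustrative, not the actual CheckList library API.)

```python
import itertools

negations = ["can't say I", "didn't", "don't think I"]
pos_verbs = ["recommend", "love", "like"]
things = ["the food", "the flight", "this airline"]

def negation_mft(predict_sentiment):
    """Minimum functionality test: 'I {negation} {pos_verb} {thing}.' => negative."""
    failures = []
    for neg, verb, thing in itertools.product(negations, pos_verbs, things):
        text = f"I {neg} {verb} {thing}."
        pred = predict_sentiment(text)  # assumed to return 'neg', 'pos', or 'neutral'
        if pred != "neg":
            failures.append((text, pred))
    return failures  # each failure is one "bug report" for this capability
```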
00:34:16.500 | But I really like this.
00:34:17.660 | And it's sort of a nice way of thinking more specifically
00:34:21.180 | about what are the capabilities in sort of precise terms
00:34:25.060 | of our models.
00:34:27.260 | And altogether now, you've seen problems
00:34:29.980 | in natural language inference.
00:34:33.380 | You've seen language models actually perform pretty well
00:34:35.860 | at the language modeling objective.
00:34:37.340 | But then you see--
00:34:38.860 | you just saw an example of a commercial sentiment analysis
00:34:41.740 | system that sort of should do better and doesn't.
00:34:44.980 | And this comes to this really, I think,
00:34:47.980 | broad and important takeaway, which
00:34:50.180 | is if you get high accuracy on the in-domain test set,
00:34:54.940 | you are not guaranteed high accuracy on even
00:34:58.980 | what you might consider to be reasonable out-of-domain
00:35:03.540 | evaluations.
00:35:04.820 | And life is always out of domain.
00:35:08.180 | And if you're building a system that will be given to users,
00:35:11.980 | it's immediately out of domain, at the very least
00:35:14.020 | because it's trained on text that's
00:35:15.620 | now older than the things that the users are now saying.
00:35:18.220 | So it's a really, really important takeaway
00:35:20.780 | that your sort of benchmark accuracy
00:35:23.300 | is a single number that does not guarantee good performance
00:35:26.340 | on a wide variety of things.
00:35:28.060 | And from a what are our neural networks doing perspective,
00:35:32.100 | one way to think about it is that models seem
00:35:34.220 | to be learning the data set, fitting
00:35:36.300 | sort of the fine-grained heuristics and statistics that
00:35:40.020 | help it fit this one data set, as opposed
00:35:43.420 | to learning the task.
00:35:44.580 | So humans can perform natural language inference.
00:35:46.980 | If you give them examples from whatever data set,
00:35:49.980 | once you've told them how to do the task,
00:35:51.700 | they'll be very generally strong at it.
00:35:55.260 | But you take your MNLI model, and you test it on Hans,
00:35:59.900 | and it got whatever that was, below chance accuracy.
00:36:03.100 | That's not the kind of thing that you want to see.
00:36:05.180 | So it definitely learns the data set well,
00:36:07.140 | because the accuracy in-domain is high.
00:36:10.700 | But our models are seemingly not frequently
00:36:14.900 | learning sort of the mechanisms that we
00:36:17.900 | would like them to be learning.
00:36:19.500 | Last week, we heard about language models
00:36:22.340 | and sort of the implicit knowledge
00:36:23.700 | that they encode about the world through pre-training.
00:36:26.380 | And one of the ways that we saw it interact with language
00:36:29.320 | models was providing them with a prompt,
00:36:32.100 | like Dante was born in [MASK], and then
00:36:35.060 | seeing if it puts high probability
00:36:36.740 | on the correct continuation, which
00:36:39.220 | requires you to access knowledge about where Dante was born.
00:36:43.380 | And we didn't frame it this way last week,
00:36:45.900 | but this fits into the set of behavioral studies
00:36:48.000 | that we've done so far.
00:36:49.480 | This is a specific kind of input.
00:36:51.860 | You could ask this for multiple people.
00:36:55.140 | You could swap out Dante for other people.
00:36:57.140 | You could swap out born in for, I don't know,
00:37:00.180 | died in or something.
00:37:01.980 | And then there are like test suites again.
00:37:04.820 | And so it's all connected.
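(For instance, a behavioral probe of that kind can be run in a few lines against any masked language model; this sketch uses the Hugging Face fill-mask pipeline with BERT as a stand-in for the model you want to probe.)

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

# Swap out the subject (or the relation) to build a small test suite.
subjects = ["Dante", "Marie Curie", "Alan Turing"]
for subj in subjects:
    preds = fill(f"{subj} was born in [MASK].")
    top = preds[0]  # highest-probability filler for the [MASK] slot
    print(subj, "->", top["token_str"], round(top["score"], 3))
```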
00:37:07.380 | OK, so I won't go too deep into sort
00:37:09.420 | of the knowledge of language models
00:37:11.220 | in terms of world knowledge, because we've
00:37:13.540 | gone over it some.
00:37:14.740 | But when you're thinking about ways
00:37:16.580 | of interacting with your models, this sort of behavioral study
00:37:20.980 | can be very, very general, even though, remember,
00:37:23.820 | we're at still this highest level of abstraction,
00:37:26.900 | where we're just looking at the probability distributions that
00:37:29.360 | are defined.
00:37:29.860 | All right, so now we'll go into-- so we've sort of looked
00:37:35.900 | at understanding in fine-grained areas what
00:37:38.740 | our model is actually doing.
00:37:41.540 | What about sort of why for an individual input
00:37:45.700 | is it getting the answer right or wrong?
00:37:48.020 | And then are there changes to the inputs
00:37:50.060 | that look fine to humans, but actually make
00:37:52.860 | the models do a bad job?
00:37:55.980 | So one study that I love to reference that really draws
00:38:00.000 | back into our original motivation of using LSTM
00:38:04.380 | networks instead of simple recurrent neural networks
00:38:06.740 | was that they could use long context.
00:38:10.420 | But how long is your long and short-term memory?
00:38:15.020 | And the idea of Kendall-Wall et al.
00:38:18.100 | 2018 was shuffle or remove contexts
00:38:23.140 | that are farther than some k words away, changing k.
00:38:29.140 | And if the accuracy, if the predictive ability
00:38:33.220 | of your language model, the perplexity,
00:38:35.860 | doesn't change once you do that, it
00:38:37.820 | means the model wasn't actually using that context.
00:38:40.740 | I think this is so cool.
00:38:42.100 | So on the x-axis, we've got how far away from the word
00:38:46.820 | that you're trying to predict.
00:38:48.260 | Are you actually sort of corrupting, shuffling,
00:38:51.340 | or removing stuff from the sequence?
00:38:54.140 | And then on the y-axis is the increase in loss.
00:38:57.180 | So if the increase in loss is zero,
00:39:00.460 | it means that the model was not using the thing
00:39:03.340 | that you just removed.
00:39:04.540 | Because if it was using it, it would now
00:39:06.580 | do worse without it.
00:39:08.140 | And so if you shuffle in the blue line here,
00:39:11.460 | if you shuffle the history that's farther away from 50
00:39:14.620 | words, the model does not even notice.
00:39:18.620 | I think that's really interesting.
00:39:20.080 | One, it says everything past 50 words of this LSTM language
00:39:23.620 | model, you could have given it in random order,
00:39:26.020 | and it wouldn't have noticed.
00:39:28.500 | And then two, it says that if you're closer than that,
00:39:31.060 | it actually is making use of the word order.
00:39:33.380 | That's a pretty long memory.
00:39:34.900 | OK, that's really interesting.
00:39:36.740 | And then if you actually remove the words entirely,
00:39:39.980 | you can kind of notice that the words are
00:39:42.620 | missing up to 200 words away.
00:39:45.660 | So you don't care about the order they're in,
00:39:48.420 | but you care whether they're there or not.
00:39:50.620 | And so this is an evaluation of, well,
00:39:53.140 | do LSTMs have long-term memory?
00:39:54.800 | Well, this one at least has effectively no longer
00:39:57.420 | than 200 words of memory, but also no less.
00:40:02.180 | So very cool.
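(A sketch of that perturbation experiment in code: the `token_nll` function standing in for your language model's per-token negative log-likelihood is an assumption, not a real API.)

```python
import random

def perturbed_loss(examples, token_nll, k, mode="shuffle"):
    """Average NLL of the target word when context farther than k tokens
    away is shuffled or removed. If this matches the unperturbed loss,
    the model wasn't using that distant context.

    examples: list of (context_tokens, target_token) pairs
    token_nll: fn(context_tokens, target_token) -> negative log-likelihood
    """
    total = 0.0
    for context, target in examples:
        near, far = context[-k:], context[:-k]
        if mode == "shuffle":
            far = far[:]          # copy before shuffling in place
            random.shuffle(far)
            perturbed = far + near
        else:                     # mode == "remove"
            perturbed = near
        total += token_nll(perturbed, target)
    return total / len(examples)

# increase_in_loss = perturbed_loss(...) - unperturbed_loss
# ~0 when shuffling beyond ~50 tokens, per the LSTM result discussed above.
```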
00:40:03.860 | So that's a general study for a single model.
00:40:09.420 | It talks about its average behavior
00:40:13.180 | over a wide range of examples.
00:40:14.580 | But we want to talk about individual predictions
00:40:17.020 | on individual inputs.
00:40:17.980 | So let's talk about that.
00:40:19.340 | So one way of interpreting why did my model make
00:40:23.860 | this decision that's very popular is, for a single
00:40:27.180 | example, what parts of the input actually led to the decision?
00:40:31.340 | And this is where we come in with saliency maps.
00:40:34.380 | So a saliency map provides a score
00:40:36.980 | for each word indicating its importance
00:40:39.300 | to the model's prediction.
00:40:40.620 | So you've got something like Bert here.
00:40:44.100 | You've got Bert.
00:40:45.060 | Bert is making a prediction for this mask.
00:40:47.580 | The [MASK] rushed to the emergency room to see her patient.
00:40:52.340 | And the predictions that the model is making
00:40:55.460 | is things with 47%.
00:40:57.300 | It's going to be nurse that's here in the mask instead,
00:41:01.060 | or maybe woman, or doctor, or mother, or girl.
00:41:04.580 | And then the saliency map is being visualized here in orange.
00:41:07.780 | According to this method of saliency
00:41:09.740 | called simple gradients, which we'll get into,
00:41:12.060 | emergency, her, and the SEP token--
00:41:15.980 | let's not worry about the SEP token for now.
00:41:17.900 | But emergency and her are the important words, apparently.
00:41:21.860 | And the SEP token shows up in every sentence.
00:41:23.700 | So I'm not going to--
00:41:25.820 | and so these two together are, according to this method,
00:41:29.380 | what's important for the model to make this prediction to mask.
00:41:33.420 | And you can see maybe some statistics, biases, et cetera,
00:41:36.980 | that is picked up in the predictions
00:41:39.100 | and then have it mapped out onto the sentence.
00:41:41.820 | And this is-- well, it seems like it's really
00:41:44.240 | helping interpretability.
00:41:47.060 | And yeah, I think that this is a very useful tool.
00:41:52.580 | Actually, this is part of a demo from Alan NLP
00:41:56.300 | that allows you to do this yourself for any sentence
00:42:00.820 | that you want.
00:42:02.660 | So what's this way of making saliency maps?
00:42:05.660 | We're not going to go-- there's so many ways to do it.
00:42:07.940 | We're going to take a very simple one
00:42:09.480 | and work through why it makes sense.
00:42:12.660 | So the issue is, how do you define importance?
00:42:17.420 | What does it mean to be important to the model's
00:42:19.460 | prediction?
00:42:20.620 | And here's one way of thinking about it.
00:42:22.300 | It's called the simple gradient method.
00:42:24.220 | Let's get a little formal.
00:42:25.300 | You've got words x1 to xn.
00:42:28.300 | And then you've got a model score for a given output class.
00:42:31.020 | So maybe you've got, in the BERT example,
00:42:33.700 | each output class was each output word
00:42:35.900 | that you could possibly predict.
00:42:38.740 | And then you take the norm of the gradient of the score,
00:42:42.640 | with respect to each word.
00:42:44.740 | So what we're saying here is, the score
00:42:48.620 | is the unnormalized probability for that class.
00:42:55.500 | So you've got a single class.
00:42:56.700 | You're taking the score.
00:42:57.700 | It's how likely it is, not yet normalized
00:42:59.900 | by how likely everything else is.
00:43:02.660 | Gradient, how much is it going to change
00:43:05.340 | if I move it a little bit in one direction or another?
00:43:08.380 | And then you take the norm to get a scalar from a vector.
00:43:10.900 | So it looks like this.
00:43:12.260 | The salience of word i, you have the norm bars on the outside,
00:43:16.580 | gradient with respect to xi.
00:43:18.900 | So that's if I change a little bit locally xi,
00:43:22.740 | how much does my score change?
00:43:25.460 | So the idea is that a high gradient norm
00:43:27.940 | means that if I were to change it locally,
00:43:30.380 | I'd affect the score a lot.
00:43:32.060 | And that means it was very important to the decision.
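(In code, the simple gradients method is just one backward pass. This sketch assumes you can get at the input word embeddings with gradients enabled, and a `score_fn` that returns the unnormalized class score; both are stand-ins for your particular model, not a specific library API.)

```python
import torch

def simple_gradient_saliency(embeddings, score_fn):
    """Simple gradients saliency.

    embeddings: tensor of shape (seq_len, dim), the input word embeddings,
                with requires_grad=True.
    score_fn:   fn(embeddings) -> scalar unnormalized score (logit) for the
                output class whose prediction we want to explain.
    Returns one saliency value per word: || d score / d x_i ||.
    """
    score = score_fn(embeddings)
    grads, = torch.autograd.grad(score, embeddings)
    return grads.norm(dim=-1)  # shape (seq_len,)
```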
00:43:34.300 | Let's visualize this a little bit.
00:43:35.740 | So here on the y-axis, we've got loss.
00:43:39.460 | Just the loss of the model-- sorry,
00:43:41.700 | this should be score.
00:43:43.260 | Should be score.
00:43:44.180 | And on the x-axis, you've got word space.
00:43:46.980 | The word space is like sort of a flattening of the ability
00:43:51.700 | to move your word embedding in 1,000 dimensional space.
00:43:54.740 | So I've just plotted it here in one dimension.
00:43:58.780 | And now, a high saliency thing, you
00:44:00.880 | can see that the relationship between what should be score
00:44:04.420 | and moving the word in word space,
00:44:06.500 | you move it a little bit on the x-axis,
00:44:08.740 | and the score changes a lot.
00:44:10.860 | That's that derivative.
00:44:11.820 | That's the gradient.
00:44:12.660 | Awesome, love it.
00:44:13.740 | Low saliency, you move the word around locally,
00:44:16.900 | and the score doesn't change.
00:44:20.260 | So the interpretation is that means
00:44:23.740 | that the actual identity of this word
00:44:25.980 | wasn't that important to the prediction,
00:44:27.680 | because I could have changed it, and the score
00:44:29.740 | wouldn't have changed.
00:44:31.340 | Now, why are there more methods than this?
00:44:33.860 | Because honestly, reading that, I was like,
00:44:36.300 | that sounds awesome.
00:44:37.140 | That sounds great.
00:44:38.420 | So there are sort of lots of issues
00:44:40.580 | with this kind of method and lots of ways
00:44:44.300 | of getting around them.
00:44:45.220 | Here's one issue.
00:44:46.620 | It's not perfect, because, well, maybe your linear approximation
00:44:51.860 | that the gradient gives you holds only very, very locally.
00:44:56.780 | So here, the gradient is 0.
00:45:00.140 | So this is a low saliency word, because I'm
00:45:02.660 | at the bottom of this parabola.
00:45:04.420 | But if I were to move even a little bit
00:45:06.340 | in either direction, the score would shoot up.
00:45:10.220 | So is this not an important word?
00:45:11.860 | It seems important to be right there,
00:45:15.980 | as opposed to anywhere else even sort of nearby in order
00:45:19.780 | for the score not to go up.
00:45:22.060 | But the simple gradients method won't capture this,
00:45:24.780 | because it just looks at the gradient, which
00:45:27.060 | is that 0 right there.
00:45:28.300 | But if you want to look into it more,
00:45:32.820 | there's a bunch of different methods
00:45:34.340 | that are sort of applied in these papers.
00:45:36.420 | And I think that is a good tool for the toolbox.
00:45:42.540 | OK, so that is one way of explaining a prediction.
00:45:47.260 | And it has some issues, like why are individual words being
00:45:53.100 | scored, as opposed to phrases or something like that.
00:45:56.980 | But for now, we're going to move on to another type
00:45:59.140 | of explanation.
00:46:00.740 | And I'm going to check the time.
00:46:02.340 | OK, cool.
00:46:04.820 | Actually, yeah, let me pause for a second.
00:46:06.580 | Any questions about this?
00:46:07.620 | I mean, earlier on, there were a couple of questions.
00:46:16.180 | One of them was, what are your thoughts
00:46:19.780 | on whether looking at attention weights
00:46:21.520 | is a methodologically rigorous way of determining
00:46:24.540 | the importance that the model places on certain tokens?
00:46:27.960 | It seems like there's some back and forth in the literature.
00:46:31.900 | That is a great question.
00:46:34.820 | And I probably won't engage with that question
00:46:36.900 | as much as I could if we had a second lecture on this.
00:46:40.660 | I actually will provide some attention analyses
00:46:43.260 | and tell you they're interesting.
00:46:44.820 | And then I'll sort of say a little bit
00:46:46.900 | about why they can be interesting without being
00:46:53.380 | sort of maybe the end all of analysis of where information
00:47:03.220 | is flowing in a transformer, for example.
00:47:05.980 | I think the debate is something that we
00:47:08.420 | would have to get into in a much longer period of time.
00:47:11.580 | But look at the slides that I show about attention
00:47:14.020 | and the caveats that I provide.
00:47:15.740 | And let me know if that answers your question first,
00:47:17.900 | because we have quite a number of slides on it.
00:47:19.860 | And if not, please, please ask again.
00:47:21.940 | And we can chat more about it.
00:47:25.100 | And maybe you can go on.
00:47:27.140 | Great.
00:47:28.340 | So I think this is a really fascinating question, which
00:47:31.820 | also gets at what was important about the input,
00:47:35.220 | but in actually kind of an even more direct way, which
00:47:38.260 | is, could I just keep some minimal part of the input
00:47:41.780 | and get the same answer?
00:47:43.340 | So here's an example from SQuAD.
00:47:45.620 | You have this passage: in 1899,
00:47:47.140 | John Jacob Astor IV invested $100,000 for Tesla.
00:47:51.940 | And then the answer that is being predicted by the model
00:47:54.220 | is going to always be in blue in these examples, Colorado
00:47:56.620 | Springs Experiments.
00:47:58.140 | So you've got this passage.
00:47:59.860 | And the question is, what did Tesla spend Astor's money on?
00:48:03.660 | And the prediction is Colorado Springs Experiments.
00:48:06.020 | The model gets the answer right, which is nice.
00:48:10.300 | And we would like to think it's because it's doing
00:48:12.500 | some kind of reading comprehension.
00:48:14.860 | But here's the issue.
00:48:16.460 | It turns out, based on this fascinating paper,
00:48:19.860 | that if you just reduce the question to did,
00:48:25.340 | you actually get exactly the same answer.
00:48:30.780 | In fact, with the original question,
00:48:33.140 | the model had sort of a 0.78 confidence probability
00:48:36.820 | in that answer.
00:48:37.860 | And with the reduced question did,
00:48:41.820 | you get even higher confidence.
00:48:43.740 | And if you give a human this, they
00:48:46.100 | would not be able to know really what you're trying to ask about.
00:48:49.720 | So it seems like some things are going really wonky here.
00:48:53.340 | Here's another.
00:48:54.460 | So here's sort of like a very high level
00:48:56.340 | overview of the method.
00:48:58.980 | In fact, it actually references our input saliency methods.
00:49:01.480 | Ah, nice.
00:49:02.180 | It's connected.
00:49:03.180 | So you iteratively remove non-salient or unimportant
00:49:08.100 | words.
00:49:08.980 | So here's a passage again talking about football,
00:49:13.420 | I think.
00:49:13.940 | Yeah.
00:49:15.460 | And-- oh, nice.
00:49:16.660 | OK, so the question is, where did the Broncos practice
00:48:19.060 | for the Super Bowl? And the prediction is Stanford
00:48:21.420 | University.
00:49:24.060 | And that is correct.
00:49:25.340 | So again, seems nice.
00:49:27.100 | And now, we're not actually going
00:49:28.620 | to get the model to be incorrect.
00:49:31.220 | We're just going to say, how can I
00:49:33.820 | change this question such that I still get the answer right?
00:49:37.180 | So I'm going to remove the word that
00:49:38.940 | was least important according to a saliency method.
00:49:41.780 | So now, it's where did the practice for the Super Bowl?
00:49:45.020 | Already, this is sort of unanswerable
00:49:46.620 | because you've got two teams practicing.
00:49:48.700 | You don't even know which one you're asking about.
00:49:50.780 | So the fact that the model is still
00:49:52.620 | so confident in Stanford University makes no sense.
00:49:55.260 | But you can just sort of keep going.
00:49:58.940 | And now, I think, here, the model
00:50:03.220 | stops being confident in the answer, Stanford University.
00:50:07.220 | But I think this is really interesting just
00:50:10.660 | to show that if the model is able to do this
00:50:13.260 | with very high confidence, it's not
00:50:16.620 | reflecting the uncertainty that really should be there
00:50:19.660 | because you can't know what you're even asking about.
00:50:23.420 | OK, so what was important to make this answer?
00:50:26.180 | Well, at least these parts were important
00:50:30.100 | because you could keep just those parts
00:50:31.860 | and get the same answer.
00:50:33.140 | Fascinating.
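[A sketch of the input-reduction procedure just described: iteratively drop the least-salient question word while the model's prediction stays the same. The predict and saliency helpers are hypothetical stand-ins for a real QA model and a saliency method.]

```python
def input_reduction(question_tokens, predict, saliency):
    """Iteratively remove the least-important question token while the
    model's answer stays the same (a sketch of input reduction).

    predict(tokens)  -> (answer, confidence) from the hypothetical QA model
    saliency(tokens) -> list of per-token importance scores
    """
    original_answer, _ = predict(question_tokens)
    tokens = list(question_tokens)
    while len(tokens) > 1:
        scores = saliency(tokens)
        least = min(range(len(tokens)), key=lambda i: scores[i])
        candidate = tokens[:least] + tokens[least + 1:]
        answer, _ = predict(candidate)
        if answer != original_answer:
            break                    # removing this token changes the answer
        tokens = candidate           # keep the reduced question
    return tokens                    # e.g. ["did"] for the SQuAD example
```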
00:50:35.900 | All right, so that's sort of the end of the admittedly brief
00:50:40.180 | section on thinking about input saliency
00:50:44.220 | methods and similar things.
00:50:45.340 | Now, we're going to talk about actually breaking models
00:50:47.740 | and understanding models by breaking them.
00:50:50.940 | OK, cool.
00:50:52.540 | So if we have a passage here, Peyton Manning
00:50:54.460 | became the first quarterback, something,
00:50:58.500 | Super Bowl, age 39, past record held by John Elway.
00:51:02.180 | Again, we're doing question answering.
00:51:03.760 | We've got this question.
00:51:05.220 | What was the name of the quarterback who
00:51:06.920 | was 38 in the Super Bowl?
00:51:08.540 | The prediction is correct.
00:51:11.060 | Looks good.
00:51:12.060 | Now, we're not going to change the question to try to sort
00:51:15.040 | of make the question nonsensical while keeping the same answer.
00:51:18.460 | Instead, we're going to change the passage
00:51:22.540 | by adding the sentence at the end, which really
00:51:24.460 | shouldn't distract anyone.
00:51:25.540 | This is quarterback, well-known quarterback, Jeff Dean,
00:51:29.100 | had jersey number 37 in Champ Bowl.
00:51:31.460 | So this just doesn't--
00:51:32.620 | it's really not even related.
00:51:34.700 | But now, the prediction is Jeff Dean for our nice QA model.
00:51:40.300 | And so this shows, as well, that it
00:51:44.020 | seems like maybe there's this end of the passage bias
00:51:47.260 | as to where the answer should be, for example.
00:51:49.900 | And so this is an adversarial example
00:51:52.900 | where we flipped the prediction by adding something
00:51:55.220 | that is innocuous to humans.
00:51:57.220 | And so sort of the higher level takeaway
00:51:59.380 | is, oh, it seems like the QA model
00:52:01.700 | that we had that seemed good is not actually performing QA
00:52:04.740 | how we want it to, even though its in-domain accuracy was
00:52:07.620 | good.
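[A minimal sketch of this kind of distractor attack, assuming a hypothetical qa_model callable that maps (passage, question) to a predicted answer span; the distractor string follows the Jeff Dean example from the lecture.]

```python
def add_distractor(passage, question, qa_model):
    """Append a sentence that superficially resembles the question but
    cannot answer it, then check whether the prediction flips."""
    distractor = "Quarterback Jeff Dean had jersey number 37 in Champ Bowl."
    before = qa_model(passage, question)
    after = qa_model(passage + " " + distractor, question)
    # If the attack works, `after` points into the distractor (e.g. "Jeff Dean")
    # even though `before` was correct.
    return before, after
```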
00:52:09.920 | And here's another example.
00:52:12.220 | So you've got this paragraph with a question,
00:52:16.780 | what has been the result of this publicity?
00:52:19.620 | The answer is increased scrutiny on teacher misconduct.
00:52:22.780 | Now, instead of changing the paragraph,
00:52:25.100 | we're going to change the question in really,
00:52:27.660 | really seemingly insignificant ways
00:52:31.420 | to change the model's prediction.
00:52:32.740 | So first, "What haL been the result of this publicity?"--
00:52:37.580 | now you've got this typo, an L--
00:52:39.420 | the answer changes to teacher misconduct.
00:52:42.420 | Likely, a human would sort of ignore this typo or something
00:52:46.020 | and answer the right answer.
00:52:47.500 | And then this is really nuts.
00:52:49.420 | Instead of asking, what has been the result of this publicity,
00:52:52.700 | if you ask, what's been the result of this publicity,
00:52:56.620 | the answer also changes.
00:52:59.380 | And this is-- the authors call this a semantically equivalent
00:53:02.940 | adversary.
00:53:04.460 | This is pretty rough.
00:53:05.700 | And in general, swapping "what has" for "what's" in this QA model
00:53:09.820 | breaks it pretty frequently.
00:53:13.100 | And so again, when you go back and sort of re-tinker
00:53:17.260 | how to build your model, you're going
00:53:19.060 | to be thinking about these things, not just
00:53:20.900 | the sort of average accuracy.
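[A sketch of testing semantically equivalent adversaries like the ones above, using the same hypothetical qa_model; the two rewrite rules are just the ones from this example.]

```python
def semantically_equivalent_adversaries(passage, question, qa_model):
    """Apply simple, meaning-preserving edits to the question and report
    any that flip the model's prediction (a sketch)."""
    rules = [
        lambda q: q.replace("What has", "What's"),    # contraction
        lambda q: q.replace("What has", "What haL"),  # single-character typo
    ]
    original = qa_model(passage, question)
    flips = []
    for rule in rules:
        perturbed = rule(question)
        if perturbed != question and qa_model(passage, perturbed) != original:
            flips.append((perturbed, qa_model(passage, perturbed)))
    return flips
```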
00:53:23.820 | So that's sort of talking about noise.
00:53:28.060 | Are models robust to noise in their inputs?
00:53:31.060 | Are humans robust to noise is another question we can ask.
00:53:34.100 | And so you can kind of go to this popular sort of meme
00:53:38.740 | passed around the internet from time to time,
00:53:41.620 | where you have all the letters in these words scrambled.
00:53:44.900 | You say, according to research at Cambridge University,
00:53:49.140 | it doesn't matter in what order the letters in a word are.
00:53:52.380 | And so it seems like--
00:53:55.620 | I think I did a pretty good job there.
00:53:57.620 | Seemingly, we got this noise.
00:53:59.540 | That's a specific kind of noise.
00:54:01.380 | And we can be robust as humans to reading and processing
00:54:05.060 | the language without actually all that much of a difficulty.
00:54:10.140 | So that's maybe something that we might want our models
00:54:12.380 | to also be robust to.
00:54:15.180 | And it's very practical as well.
00:54:19.020 | Noise is a part of all NLP systems' inputs at all times.
00:54:23.620 | There's just no such thing effectively
00:54:25.380 | as having users, for example, and not having any noise.
00:54:30.500 | And so there's a study that was performed
00:54:32.540 | on some popular machine translation models, where
00:54:36.300 | you train machine translation models in French, German,
00:54:39.620 | and Czech, I think all to English.
00:54:42.260 | And you get BLEU scores.
00:54:43.660 | These BLEU scores will look a lot better
00:54:45.300 | than the ones in your assignment four
00:54:46.800 | because much, much more training data.
00:54:48.660 | The idea is these are actually pretty strong machine
00:54:51.100 | translation systems.
00:54:53.060 | And that's an in-domain clean text.
00:54:56.100 | Now, if you add character swaps, like the ones
00:54:59.220 | we saw in that sentence about Cambridge,
00:55:03.740 | the BLEU scores take a pretty harsh dive.
00:55:07.620 | Not very good.
00:55:09.620 | And even if you take a somewhat more natural typo noise
00:55:15.220 | distribution here, you'll see that you're still
00:55:18.020 | getting 20-ish, yeah, very high drops in BLEU score
00:55:25.500 | through simply natural noise.
00:55:27.900 | And so maybe you'll go back and retrain the model on more types
00:55:30.380 | of noise.
00:55:30.900 | And then you ask, oh, if I do that,
00:55:32.620 | is it robust to even different kinds of noise?
00:55:34.820 | These are the questions that are going to be really important.
00:55:37.540 | And it's important to know that you're
00:55:39.120 | able to break your model really easily so that you can then
00:55:41.780 | go and try to make it more robust.
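[For instance, a minimal sketch of injecting the "scrambled inner letters" style of character noise described above; you could translate the noisy inputs and score them (e.g. with a BLEU scorer of your choice) against the clean-input translations to quantify the robustness gap.]

```python
import random

def scramble_inner_letters(sentence, seed=0):
    """Shuffle the interior letters of each word, keeping the first and
    last characters fixed (the 'Cambridge' style of noise)."""
    rng = random.Random(seed)
    noisy_words = []
    for word in sentence.split():
        if len(word) > 3:
            inner = list(word[1:-1])
            rng.shuffle(inner)
            word = word[0] + "".join(inner) + word[-1]
        noisy_words.append(word)
    return " ".join(noisy_words)

print(scramble_inner_letters("according to research at Cambridge University"))
```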
00:55:45.260 | OK, now, let's see, 20 minutes, awesome.
00:55:53.340 | Now we're going to, I guess, yeah.
00:55:57.580 | So now we're going to look at the representations
00:55:59.780 | of our neural networks.
00:56:01.580 | We've talked about their behavior
00:56:03.700 | and then whether we could change or observe
00:56:07.260 | reasons behind their behavior.
00:56:09.060 | Now we'll go down a level of abstraction,
00:56:12.980 | look more at the actual vector representations that
00:56:15.900 | are being built by models.
00:56:17.380 | And we can answer a different kind of question,
00:56:20.460 | at the very least, than with the other studies.
00:56:24.260 | The first thing is related to the question
00:56:26.620 | I was asked about attention, which
00:56:28.740 | is that some modeling components lend themselves to inspection.
00:56:33.660 | Now this is a sentence that I chose somewhat carefully,
00:56:36.260 | actually, because in part of this debate,
00:56:39.380 | are they interpretable components?
00:56:41.700 | We'll see.
00:56:43.220 | But they lend themselves to inspection in the following way.
00:56:46.580 | You can visualize them well, and you can correlate them easily
00:56:49.740 | with various properties.
00:56:51.660 | So let's say you have attention heads in BERT.
00:56:53.860 | This is from a really nice study that was done here,
00:56:58.020 | where you look at attention heads of BERT,
00:57:00.580 | and you say, on most sentences, this attention head, head 1,
00:57:04.740 | 1, seems to do this very global aggregation.
00:57:08.380 | It's a simple kind of operation, and it does this pretty consistently.
00:57:11.700 | That's cool.
00:57:13.740 | Is it interpretable?
00:57:15.740 | Well, maybe.
00:57:18.460 | So it's the first layer, which means that this word found
00:57:22.140 | is sort of uncontextualized.
00:57:24.060 | But in deeper layers, the problem
00:57:29.300 | is that once you do some rounds of attention,
00:57:32.820 | you've had information mixing and flowing between words.
00:57:36.820 | And how do you know exactly what information you're combining,
00:57:40.020 | what you're attending to, even?
00:57:41.740 | It's a little hard to tell.
00:57:44.540 | And saliency methods more directly
00:57:47.820 | evaluate the importance of parts of the input.
00:57:50.140 | But it's still interesting to see,
00:57:52.060 | at sort of a local mechanistic point of view,
00:57:54.500 | what kinds of things are being attended to.
00:57:57.620 | So let's take another example.
00:57:59.580 | Some attention heads seem to perform simple operations.
00:58:02.460 | So you have the global aggregation here
00:58:04.180 | that we saw already.
00:58:05.500 | Others seem to attend pretty robustly to the next token.
00:58:09.260 | Cool.
00:58:10.060 | Next token is a great signal.
00:58:11.780 | Some heads attend to the SEP token.
00:58:14.860 | So here you have attending to SEP.
00:58:16.900 | And then maybe some attend to periods.
00:58:18.760 | Maybe that's sort of segmenting sentences and things
00:58:22.580 | like that.
00:58:23.300 | Not things that are hard to do, but things
00:58:25.340 | that some attention heads seem to pretty robustly perform.
00:58:27.780 | Again now, though, deep in the network,
00:58:32.460 | what's actually represented at this period at layer 11?
00:58:37.740 | Little unclear.
00:58:38.740 | Little unclear.
00:58:41.260 | So some heads, though, are correlated
00:58:43.900 | with really interesting linguistic properties.
00:58:46.060 | So this head is actually attending to noun modifiers.
00:58:49.880 | So you've got "the complicated language"
00:58:53.600 | in "the huge new law."
00:58:57.460 | That's pretty fascinating.
00:58:59.980 | Even if the model is not doing this as a causal mechanism
00:59:03.800 | to do syntax necessarily, the fact
00:59:06.360 | that these things so strongly correlate
00:59:08.320 | is actually pretty, pretty cool.
00:59:09.960 | And so what we have in all of these studies
00:59:11.720 | is we've got sort of an approximate interpretation
00:59:14.240 | and quantitative analysis allowing
00:59:18.380 | us to reason about very complicated model behavior.
00:59:21.760 | They're all approximations, but they're
00:59:23.400 | definitely interesting.
00:59:24.760 | One other example is that of coreference.
00:59:26.680 | So we saw some work on coreference.
00:59:29.600 | And it seems like this head does a pretty OK job of actually
00:59:34.320 | matching up coreferent entities.
00:59:37.440 | These are in red.
00:59:38.920 | Talks, negotiations, she, her.
00:59:41.840 | And that's not obvious how to do that.
00:59:43.800 | This is a difficult task.
00:59:45.520 | And so it does so with some percentage of the time.
00:59:49.960 | And again, it's sort of connecting very complex model
00:59:52.240 | behavior to these sort of interpretable summaries
00:59:57.320 | of correlating properties.
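[A sketch of how one might inspect attention heads in practice, assuming the HuggingFace transformers library and the standard bert-base-uncased checkpoint; the "attends to the next token" statistic is just one simple property to correlate each head with.]

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The complicated language in the huge new law.",
                   return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # tuple: one (1, heads, T, T) per layer

for layer, att in enumerate(attentions):
    att = att[0]                              # (heads, T, T)
    T = att.shape[-1]
    # Average attention mass each head places on the immediately-next token.
    next_token_mass = att[:, torch.arange(T - 1), torch.arange(1, T)].mean(dim=-1)
    print(layer, [round(m.item(), 2) for m in next_token_mass])
```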
01:00:00.240 | Other cases, you can have individual hidden units
01:00:02.440 | that lend themselves to interpretation.
01:00:04.480 | So here, you've got a character level LSTM language model.
01:00:10.280 | Each row here is a sentence.
01:00:12.080 | If you can't read it, it's totally OK.
01:00:14.120 | The interpretation that you should take
01:00:15.740 | is that as we walk along the sentence,
01:00:17.520 | this single unit is going from, I think,
01:00:20.640 | very negative to very positive or very positive
01:00:23.040 | to very negative.
01:00:23.760 | I don't really remember.
01:00:26.360 | But it's tracking the position in the line.
01:00:30.120 | So it's just a linear position unit
01:00:31.760 | and pretty robustly doing so across all of these sentences.
01:00:36.560 | So this is from a nice visualization study
01:00:39.040 | way back in 2016, way back.
01:00:41.920 | Here's another cell from that same LSTM language model
01:00:44.960 | that seems to sort of turn on inside quotes.
01:00:48.320 | So here's a quote.
01:00:50.040 | And then it turns on.
01:00:51.000 | So I guess that's positive in the blue.
01:00:53.080 | End quote here.
01:00:55.600 | And then it's negative.
01:00:57.160 | Here, you start with no quote, negative in the red,
01:01:00.720 | see a quote, and then blue.
01:01:03.680 | Seems, again, very interpretable.
01:01:05.560 | Also, potentially a very useful feature to keep in mind.
01:01:08.000 | And this is just an individual unit in the LSTM
01:01:10.200 | that you can just look at and see that it does this.
01:01:12.800 | Very, very interesting.
01:01:14.320 | Even farther on this--
01:01:19.080 | and this is actually a study by some AI and neuroscience
01:01:24.080 | researchers--
01:01:25.120 | is we saw the LSTMs were good at subject verb number agreement.
01:01:29.560 | Can we figure out the mechanisms by which the LSTM is
01:01:31.880 | solving the task?
01:01:32.880 | Can we actually get some insight into that?
01:01:35.040 | And so we have a word level language model.
01:01:37.720 | The word level language model is going
01:01:39.560 | to be a little small.
01:01:40.400 | But you have a sentence, "the boy gently and kindly
01:01:43.280 | greets the."
01:01:45.400 | And this cell that's being tracked here--
01:01:47.840 | so it's an individual hidden unit, one dimension--
01:01:52.320 | is actually, after it sees boy, it sort of starts to go higher.
01:01:57.800 | And then it goes down to something very small
01:02:00.840 | once it sees greets.
01:02:02.360 | And this cell seems to correlate with the scope of a subject
01:02:06.560 | verb number agreement instance, effectively.
01:02:09.560 | So here, "the boy that watches the dog that watches the cat
01:02:12.720 | greets."
01:02:13.880 | You've got that cell, again, staying high,
01:02:16.520 | maintaining the scope of subject until greets,
01:02:19.400 | and at which point it stops.
01:02:21.840 | What allows it to do that?
01:02:23.480 | Probably some complex other dynamics in the network.
01:02:27.320 | But it's still a fascinating, I think, insight.
01:02:31.000 | And yeah, this is just neuron 1,150 in this LSTM.
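[A rough sketch of how such a single-unit trace can be extracted from a PyTorch LSTM language model; the embedding and LSTM here are placeholders, not the actual models from those studies.]

```python
import torch
import torch.nn as nn

def trace_single_unit(embed, lstm, token_ids, unit=1150):
    """Run a (hypothetical, already-trained) embedding + LSTM over one
    sentence and return one hidden unit's activation at each step, so it
    can be plotted against the tokens (like the quote / scope cells)."""
    lstm.eval()
    with torch.no_grad():
        x = embed(token_ids.unsqueeze(0))    # (1, seq_len, emb_dim)
        outputs, _ = lstm(x)                 # (1, seq_len, hidden_dim)
    return outputs[0, :, unit].tolist()      # one value per token

# Usage sketch (untrained placeholder model, just to show the shapes):
embed = nn.Embedding(10000, 256)
lstm = nn.LSTM(256, 2048, batch_first=True)
values = trace_single_unit(embed, lstm, torch.tensor([5, 17, 42, 7]))
```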
01:02:39.840 | So those are all observational studies
01:02:42.440 | that you could do by picking out individual components
01:02:47.040 | of the model that you can just take each one of
01:02:50.280 | and correlating them with some behavior.
01:02:53.240 | Now, we'll look at a general class of methods
01:02:57.040 | called probing, by which we still
01:03:00.200 | use supervised knowledge, like knowledge
01:03:04.160 | of the type of coreference that we're looking for.
01:03:06.840 | But instead of seeing if it correlates with something
01:03:09.160 | that's immediately interpretable,
01:03:10.520 | like a attention head, we're going
01:03:13.920 | to look into the vector representations of the model
01:03:16.720 | and see if these properties can be read out
01:03:19.120 | by some simple function to say, oh, maybe this property was
01:03:23.440 | made very easily accessible by my neural network.
01:03:26.760 | So let's dig into this.
01:03:28.680 | So the general paradigm is that you've
01:03:30.720 | got language data that goes into some big pre-trained
01:03:34.640 | transformer with fine tuning.
01:03:36.440 | And you get state-of-the-art results.
01:03:40.960 | SOTA means state-of-the-art.
01:03:40.960 | And so the question for the probing methodology
01:03:44.240 | is, if it's providing these general purpose language
01:03:47.320 | representations, what does it actually encode about language?
01:03:53.320 | Can we quantify this?
01:03:54.560 | Can we figure out what kinds of things
01:03:56.100 | is learning about language that we seemingly now
01:03:58.200 | don't have to tell it?
01:04:00.480 | And so you might have something like a sentence,
01:04:03.800 | like I record the record.
01:04:06.440 | That's an interesting sentence.
01:04:08.000 | And you put it into your transformer model
01:04:11.220 | with its word embeddings at the beginning,
01:04:13.840 | maybe some layers of self-attention and stuff.
01:04:16.320 | And you make some predictions.
01:04:17.800 | And now our objects of study are going
01:04:19.920 | to be these intermediate layers.
01:04:22.560 | So it's a vector per word or subword for every layer.
01:04:27.020 | And the question is, can we use these linguistic properties,
01:04:30.300 | like the dependency parsing that we
01:04:32.380 | had way back in the early part of the course,
01:04:35.380 | to understand correlations between properties
01:04:41.040 | in the vectors and these things that we can interpret?
01:04:44.140 | We can interpret dependency parses.
01:04:47.980 | So there are a couple of things that we
01:04:49.940 | might want to look for here.
01:04:51.700 | We might want to look for semantics.
01:04:53.500 | So here in this sentence, I record the record.
01:04:56.560 | I am an agent.
01:04:58.440 | That's a semantics thing.
01:05:00.960 | Record is a patient.
01:05:02.000 | It's the thing I'm recording.
01:05:04.160 | You might have syntax.
01:05:05.240 | So you might have this syntax tree
01:05:06.680 | that you're interested in.
01:05:07.760 | That's the dependency parse tree.
01:05:09.280 | Maybe you're interested in part of speech,
01:05:11.040 | because you have record and record.
01:05:14.720 | And the first one's a verb.
01:05:16.160 | The second one's a noun.
01:05:17.380 | They're identical strings.
01:05:19.000 | Does the model encode that one is one and the other
01:05:22.040 | is the other?
01:05:23.880 | So how do we do this kind of study?
01:05:26.200 | So we're going to decide on a layer that we want to analyze.
01:05:29.440 | And we're going to freeze BERT.
01:05:31.120 | So we're not going to fine tune BERT.
01:05:32.620 | All the parameters are frozen.
01:05:34.760 | So we're going to decide on layer 2 of BERT.
01:05:36.640 | And we're going to pass it some sentences.
01:05:38.440 | We decide on what's called a probe family.
01:05:41.960 | And the question I'm asking is, can I
01:05:44.040 | use a model from my family, say linear,
01:05:47.680 | to decode a property that I'm interested in really
01:05:52.120 | well from this layer?
01:05:53.800 | So it's indicating that this property is easily
01:05:56.800 | accessible to linear models, effectively.
01:06:00.000 | So maybe I train a linear classifier right on top of BERT.
01:06:05.940 | And I get a really high accuracy.
01:06:08.880 | And that's sort of interesting already, because you know,
01:06:12.000 | from prior work in part of speech tagging,
01:06:13.920 | that if you run a linear classifier on simpler features
01:06:17.240 | that aren't BERT, you probably don't
01:06:19.120 | get as high an accuracy.
01:06:20.160 | So that's an interesting sort of takeaway.
01:06:22.280 | But then you can also take a baseline.
01:06:24.360 | So I want to compare two layers now.
01:06:26.000 | So I've got layer 1 here.
01:06:27.680 | I want to compare it to layer 2.
01:06:29.680 | I train a probe on it as well.
01:06:32.340 | Maybe the accuracy isn't as good.
01:06:34.440 | And now I can say, oh, wow, look, by layer 2,
01:06:38.260 | part of speech is more easily accessible to linear functions
01:06:42.200 | than it was at layer 1.
01:06:44.440 | So what did that?
01:06:45.200 | Well, the self-attention and feed-forward stuff
01:06:47.560 | made it more easily accessible.
01:06:49.360 | That's interesting, because it's a statement about the information
01:06:51.960 | processing of the model.
01:06:53.680 | So we're going to analyze these layers.
01:07:00.200 | Let's take a second more to think about it.
01:07:02.680 | And just really give me just a second.
01:07:05.960 | So if you have the model's representations, h1 to ht,
01:07:10.160 | and you have a function family F,
01:07:13.800 | that's, say, the set of linear models.
01:07:13.800 | So maybe you have a feed-forward neural network,
01:07:16.560 | some fixed set of hyperparameters.
01:07:18.680 | Freeze the model, train the probe,
01:07:21.520 | so you get some predictions for part of speech tagging
01:07:24.000 | or whatever.
01:07:24.880 | That's just the probe applied to the hidden state of the model.
01:07:28.520 | The probe is a member of the probe family.
01:07:31.000 | And then the extent that we can predict
01:07:32.880 | y is a measure of accessibility.
01:07:34.760 | So that's just written out, not as pictorially.
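[Here is a minimal probing sketch along those lines, assuming HuggingFace transformers and scikit-learn; the alignment of wordpiece tokens with word-level labels is glossed over, and the commented lines assume you have POS-tagged sentences available.]

```python
import torch
from transformers import BertModel, BertTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()                                   # frozen: no fine-tuning

def layer_vectors(sentence, k):
    """Hidden states at layer k for one sentence, shape (num_tokens, 768)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden_states = bert(**inputs).hidden_states  # embeddings + 12 layers
    return hidden_states[k][0]

# Training the probe (assuming word-aligned sentences and POS tags):
# X = torch.cat([layer_vectors(s, k=2) for s in train_sentences]).numpy()
# y = [tag for sent_tags in train_pos_tags for tag in sent_tags]
# probe = LogisticRegression(max_iter=1000).fit(X, y)
# print(probe.score(X_dev, y_dev))   # accessibility of POS at layer 2
```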
01:07:38.560 | So I'm not going to stay on this for too much longer.
01:07:44.200 | And it may help in the search for causal mechanisms,
01:07:48.480 | but it sort of just gives us a rough understanding
01:07:50.640 | of processing of the model and what things
01:07:53.420 | are accessible at what layer.
01:07:55.560 | So what are some results here?
01:07:57.000 | So one result is that BERT, if you run linear probes on it,
01:08:01.960 | does really, really well on things
01:08:03.800 | that require syntax and part of speech,
01:08:06.080 | named entity recognition.
01:08:07.600 | Actually, in some cases, approximately as well as just
01:08:10.600 | doing the very best thing you could possibly do without BERT.
01:08:15.440 | So it just makes easily accessible, amazingly strong
01:08:18.060 | features for these properties.
01:08:19.920 | And that's an interesting sort of emergent quality of BERT,
01:08:23.520 | you might say.
01:08:26.000 | It seems like as well that the layers of BERT
01:08:29.320 | have this property where--
01:08:31.200 | so if you look at the columns of this plot here,
01:08:35.220 | each column is a task.
01:08:37.000 | You've got input words at the sort of layer 0 of BERT here.
01:08:41.080 | Layer 24 is the last layer of BERT large.
01:08:44.120 | Lower performance is yellow.
01:08:45.240 | Higher performance is blue.
01:08:46.840 | And the resolution isn't perfect,
01:08:50.240 | but consistently, the best place to read out these properties
01:08:53.880 | is somewhere a bit past the middle of the model, which
01:08:57.240 | is this very consistent rule, which is fascinating.
01:09:01.180 | And then it seems as well like if you
01:09:04.160 | look at this function of increasingly abstract
01:09:07.280 | or increasingly difficult to compute
01:09:09.040 | linguistic properties on this axis,
01:09:11.400 | an increasing depth in the network on that axis.
01:09:14.320 | So the deeper you go in the network,
01:09:16.640 | it seems like the more easily you
01:09:19.360 | can access more and more abstract linguistic properties,
01:09:23.700 | suggesting that that accessibility is being
01:09:26.700 | constructed over time by the layers of processing of BERT.
01:09:30.080 | So it's building more and more abstract features, which
01:09:33.160 | I think is, again, a really interesting result.
01:09:37.360 | And now I think--
01:09:39.400 | yeah, one thing that I think comes
01:09:41.440 | to mind that really brings us back right to day one
01:09:45.460 | is we built intuitions around Word2Vec.
01:09:48.840 | We were asking, what does each dimension of Word2Vec mean?
01:09:51.520 | And the answer was, not really anything.
01:09:54.120 | But we could build intuitions about it
01:09:56.840 | and think about properties of it through these connections
01:10:00.640 | between simple mathematical properties of Word2Vec
01:10:04.320 | and linguistic properties that we could understand.
01:10:08.040 | So we had this approximation, which is not 100% true.
01:10:11.400 | But it's an approximation that says cosine similarity is
01:10:15.760 | effectively correlated with semantic similarity.
01:10:19.440 | Think about even if all we're going
01:10:23.560 | to do at the end of the day is fine tune these word
01:10:25.880 | embeddings anyway.
01:10:27.720 | Likewise, we had this idea about the analogies being
01:10:30.800 | encoded by linear offsets.
01:10:32.160 | So some relationships are linear in space.
01:10:36.120 | And they didn't have to be.
01:10:37.480 | That's fascinating.
01:10:39.040 | It's this emergent property that we've now
01:10:40.920 | been able to study since we discovered this.
01:10:43.240 | Why is that the case in Word2Vec?
01:10:45.480 | And in general, even though you can't
01:10:47.520 | interpret the individual dimensions of Word2Vec,
01:10:50.960 | these emergent, interpretable connections
01:10:53.920 | between approximate linguistic ideas
01:10:56.840 | and simple math on these objects is fascinating.
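[Both of those connections are simple enough to state in a few lines of NumPy; this sketch assumes you already have word vectors in a dictionary and is only meant to illustrate the two operations, not any particular embedding set.]

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vocab_vectors, exclude=()):
    """Return the word whose vector is closest to b - a + c,
    e.g. king - man + woman ~ queen. vocab_vectors: {word: np.ndarray}."""
    target = b - a + c
    best_word, best_sim = None, -1.0
    for word, vec in vocab_vectors.items():
        if word in exclude:
            continue
        sim = cosine_similarity(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```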
01:11:00.520 | And so one piece of work that extends this idea
01:11:04.760 | comes back to dependency parse trees.
01:11:06.520 | So they describe the syntax of sentences.
01:11:09.520 | And in a paper that I did with Chris,
01:11:14.560 | we showed that actually BERTs and models like it
01:11:17.840 | make dependency parse tree structure emergent,
01:11:22.400 | more easily accessible than one might
01:11:24.640 | imagine in its vector space.
01:11:26.760 | So if you've got a tree right here,
01:11:28.400 | the chef who rented the store was out of food, what you can
01:11:34.120 | do is think about the tree in terms
01:11:36.160 | of distances between words.
01:11:38.920 | So you've got the number of edges in the tree between two
01:11:42.960 | words is their path distance.
01:11:44.160 | So you've got that the distance between chef and was is 1.
01:11:48.240 | And we're going to use this interpretation of a tree
01:11:50.320 | as a distance to make a connection with BERT's
01:11:53.520 | embedding space.
01:11:54.840 | And what we were able to show is that under a single linear
01:11:57.800 | transformation, the squared Euclidean distance between BERT
01:12:02.000 | vectors for the same sentence actually correlates well,
01:12:07.240 | if you choose the B matrix right,
01:12:09.880 | with the distances in the tree.
01:12:12.280 | So here in this Euclidean space that we've transformed,
01:12:16.440 | the approximate distance between chef and was is also 1.
01:12:20.960 | Likewise, the distance between was and store
01:12:23.840 | is 4 in the tree.
01:12:25.960 | And in my simple transformation of BERT space,
01:12:29.560 | the distance between store and was is also approximately 4.
01:12:33.440 | And this is true across a wide range of sentences.
01:12:36.480 | And this is, to me, a fascinating example of,
01:12:39.880 | again, emergent approximate structure in these very
01:12:43.400 | nonlinear models that don't necessarily need to encode
01:12:46.480 | things so simply.
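[The distance in question is easy to write down: a sketch of computing || B(h_i - h_j) ||^2 for every word pair in a sentence, with the linear map B assumed to be already trained against gold tree path distances.]

```python
import torch

def probe_distances(hidden_states, B):
    """Squared distance under a linear map B for every pair of words:
    d(i, j) = || B (h_i - h_j) ||^2.
    hidden_states: (T, d) BERT vectors for one sentence; B: (k, d)."""
    projected = hidden_states @ B.T                          # (T, k)
    diffs = projected.unsqueeze(1) - projected.unsqueeze(0)  # (T, T, k)
    return (diffs ** 2).sum(dim=-1)                          # (T, T)

# Training sketch (assuming gold path distances `tree_dist` and an optimizer):
# loss = (probe_distances(h, B) - tree_dist).abs().mean(); loss.backward()
```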
01:12:51.200 | All right.
01:12:52.480 | Great.
01:12:53.040 | So probing studies and correlation studies
01:12:56.640 | are, I think, interesting and point us in directions
01:12:59.280 | to build intuitions about models.
01:13:01.680 | But they're not arguments that the model is actually
01:13:03.800 | using the thing that you're finding to make a decision.
01:13:07.120 | They're not causal studies.
01:13:12.000 | That's true of probing and correlation studies generally.
01:13:12.000 | So in some work that I did around the same time,
01:13:15.960 | we showed actually that certain conditions on probes
01:13:19.440 | allow you to achieve high accuracy on a task that's
01:13:22.440 | effectively just fitting random labels.
01:13:24.880 | And so there's a difficulty of interpreting
01:13:29.480 | what the model could or could not
01:13:31.000 | be doing with this thing that is somehow easily accessible.
01:13:34.800 | It's interesting that this property is easily accessible.
01:13:37.520 | But the model might not be doing anything with it, for example,
01:13:40.400 | because it's totally random.
01:13:42.520 | Likewise, another paper showed that you
01:13:44.640 | can achieve high accuracy with a probe
01:13:46.800 | even if the model is trained to know that thing that you're
01:13:49.800 | probing for is not useful.
01:13:52.160 | And there's causal studies that try to extend this work.
01:13:56.000 | It's much more difficult to read this paper.
01:13:58.040 | And it's a fascinating line of future work.
01:14:01.440 | Now in my last two minutes, I want
01:14:04.680 | to talk about recasting model tweaks and ablations
01:14:07.480 | as analysis.
01:14:09.480 | So we had this improvement process
01:14:11.240 | where we had a network that was going to work OK.
01:14:14.160 | And we would see whether we could tweak it
01:14:16.040 | in simple ways to improve it.
01:14:17.640 | And then you could see whether you could remove anything
01:14:19.960 | and have it still be OK.
01:14:21.080 | And that's kind of like analysis.
01:14:22.440 | I have my network.
01:14:23.400 | Do I want it to--
01:14:24.480 | is it going to be better if it's more complicated?
01:15:26.640 | Or is it going to be better if it's simpler?
01:14:28.480 | Can I get away with it being simpler?
01:14:30.320 | And so one example of some folks who did this
01:14:33.160 | is they took this idea of multi-headed attention
01:14:35.760 | and said, oh, so many heads.
01:14:38.400 | Are all the heads important?
01:14:39.960 | And what they showed is that if you train
01:14:42.120 | a system with multi-headed attention
01:14:44.520 | and then just remove the heads at test time
01:14:46.600 | and not use them at all, you can actually
01:14:48.720 | do pretty well on the original task,
01:14:50.880 | not retraining at all, without some of the attention heads,
01:14:54.280 | showing that they weren't important.
01:14:56.000 | You could just get rid of them after training.
01:14:58.480 | And likewise, you can do the same thing for--
01:15:00.560 | this is on machine translation.
01:15:01.800 | This is on multi-NLI.
01:15:03.120 | You can actually get away without a large, large
01:15:05.320 | percentage of your attention heads.
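[A sketch of this kind of test-time head ablation using the head_mask argument that HuggingFace BERT's forward pass accepts; which heads to zero out is an arbitrary choice here, and in practice you would sweep over heads and re-evaluate your downstream task each time.]

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

num_layers = model.config.num_hidden_layers
num_heads = model.config.num_attention_heads
head_mask = torch.ones(num_layers, num_heads)
head_mask[3, 5] = 0.0       # switch off head 5 in layer 3 (arbitrary choice)

inputs = tokenizer("Peyton Manning became the first quarterback ...",
                   return_tensors="pt")
with torch.no_grad():
    masked_out = model(**inputs, head_mask=head_mask)
# Compare task performance with and without the mask to see how much
# the ablated heads actually mattered -- no retraining involved.
```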
01:15:06.800 | Let's see.
01:15:12.040 | Yeah, so another thing that you could think about
01:15:15.040 | is questioning sort of the basics of the models
01:15:18.160 | that we're building.
01:15:19.000 | So we have transformer models that
01:15:20.720 | are sort of self-attention, feedforward, self-attention,
01:15:23.120 | feedforward.
01:15:23.840 | But why in that order, with some of the things omitted here?
01:15:27.560 | And this paper asked this question and said,
01:15:30.960 | if this is my transformer, self-attention, feedforward,
01:15:33.760 | self-attention, feedforward, et cetera, et cetera, et cetera,
01:15:36.520 | what if I just reordered it so that I
01:15:37.860 | had a bunch of self-attention layers at the front
01:15:39.800 | and a bunch of feed-forward layers at the back?
01:15:41.600 | And they tried a bunch of these orderings.
01:15:43.480 | And this one actually does better.
01:15:45.760 | So this achieves a lower perplexity on a benchmark.
01:15:48.720 | And this is a way of analyzing what's
01:15:51.040 | important about the architectures that I'm building
01:15:53.320 | and how can they be changed in order to perform better.
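[A toy sketch of specifying sublayer orderings as strings and assembling the corresponding modules in PyTorch; residual connections and layer norm are omitted, so this only illustrates the reordering idea, not the actual models from that paper.]

```python
import torch.nn as nn

def build_stack(ordering, d_model=512, n_heads=8, d_ff=2048):
    """Assemble transformer sublayers in an arbitrary order, e.g. 'sfsfsf'
    for the usual interleaving or 'sssfff' for attention-first."""
    layers = []
    for kind in ordering:
        if kind == "s":      # self-attention sublayer
            layers.append(nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True))
        elif kind == "f":    # feed-forward sublayer
            layers.append(nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                        nn.Linear(d_ff, d_model)))
    return nn.ModuleList(layers)

interleaved = build_stack("sfsfsf")
attention_heavy_front = build_stack("sssfff")
```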
01:15:56.200 | So neural models are very complex.
01:15:58.440 | And they're difficult to characterize
01:15:59.960 | and impossible to characterize with a single sort
01:16:02.560 | of statistic, I think, for your test set accuracy,
01:16:05.200 | especially in domain.
01:16:07.400 | And we want to find intuitive descriptions of model
01:16:09.840 | behaviors.
01:16:11.440 | But we should look at multiple levels of abstraction.
01:16:13.960 | And none of them are going to be complete.
01:16:16.760 | When someone tells you that their neural network is
01:16:19.160 | interpretable, I encourage you to engage critically with that.
01:16:23.880 | It's not necessarily false.
01:16:25.440 | But the levels of interpretability
01:16:27.400 | and what you can interpret, these
01:16:29.040 | are the questions that you should be asking.
01:16:30.840 | Because it's going to be opaque in some ways,
01:16:32.920 | almost definitely.
01:16:35.160 | And then bring this lens to your model building
01:16:39.500 | as you try to think about how to build better models,
01:16:41.720 | even if you're not going to be doing analysis as sort of one
01:16:44.280 | of your main driving goals.
01:16:46.880 | And with that, good luck on your final projects.
01:16:50.360 | I realize we're at time.
01:16:52.120 | The teaching staff is really appreciative of your efforts
01:16:55.960 | over this difficult quarter.
01:16:57.360 | And yeah, hope-- yeah, there's a lecture left on Thursday.
01:17:02.440 | But yeah, this is my last one.
01:17:04.280 | So thanks, everyone.