Stanford XCS224U: Analysis in NLU, Part 5: Distributed Alignment Search (DAS) & Conclusion | Spring 2023
In this screencast, I want to talk about a brand new method we've been developing that I think overcomes crucial limitations of causal abstraction as I presented it to you before. I'm going to give you a high-level overview of the method, and then we'll look a little bit into the future of analysis methods in the field.
Recall that intervention-based methods earned the three smileys along the interventions row of our comparison of analysis methods, but causal abstraction still leaves us with two burdens. We have to search for alignments of causal-model variables with sets of neurons in the target neural model, and that search space is enormous; to call it astronomical would be to fail to convey just how large it is. And whenever we propose an alignment, we presume that we're doing it in a standard basis, that is, along the neuron dimensions the model happens to expose. The central insight behind DAS is that there might be interpretable structure that we would find only if we looked beyond that standard basis.
Let me illustrate with a simple example. On the left, I have a tiny causal model for Boolean conjunction: the output is true if both of the inputs were true, and otherwise false. On the right, I have a very simple neural model that perfectly solves our Boolean conjunction task with this set of parameters. Now, in the classical causal abstraction mode, I would propose an alignment between the two models. It looks good: I align the inputs as you would expect, I align the internal variables V1 and V2 with the hidden units H1 and H2, and I'll add in the decision procedure that if the network's output state is positive we predict true, and otherwise false.
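To make the setup concrete, here is a minimal sketch of a toy model pair in the same spirit. It is not a reproduction of the slide's network: the weights and the +1/-1 input encoding are my own illustrative choices, set up so that the conjunction information is mixed across the two hidden units, and a positive output score is read as true.

```python
import numpy as np

def causal_model(x1: bool, x2: bool) -> bool:
    """High-level causal model: two inputs, two intermediate variables, conjunction out."""
    v1, v2 = x1, x2                          # V1 and V2 simply copy the inputs
    return v1 and v2

# Toy network: inputs encoded as +1/-1, one hidden layer (h1, h2), scalar output score.
# Illustrative weights, not the ones from the lecture slide.
W_in = np.array([[1.0, -1.0],
                 [1.0,  1.0]]) / np.sqrt(2)  # mixes the two inputs across h1 and h2
w_out = np.array([0.0, np.sqrt(2)])
b_out = -1.0

def encode(x1: bool, x2: bool) -> np.ndarray:
    return np.array([1.0 if x1 else -1.0, 1.0 if x2 else -1.0])

def neural_model(x1: bool, x2: bool) -> bool:
    h = W_in @ encode(x1, x2)                # hidden state (h1, h2)
    score = w_out @ h + b_out                # output state
    return bool(score > 0)                   # decision procedure: positive => True

# Behaviorally perfect: the network matches the causal model on all four inputs.
for a in (True, False):
    for b in (True, False):
        assert neural_model(a, b) == causal_model(a, b)
```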
Now, this model is perfect behaviorally, as I said, but the alignment of the internal variables turns out to be crucial, and I'll just give you the spoiler here: the right thing to do is reverse the order of those internal variables. What we're doing with this simple example is simulating a mistake about what set of alignments I decided to look at; I picked one that is suboptimal in terms of interchange intervention accuracy. The promise of DAS is that even if I start with this incorrect alignment, we can still recover the interpretable structure. First, I'll just substantiate for you that we do have a problem with the alignment as chosen.
I'll show you a failed interchange intervention. We take V1 from the right example and put it into the corresponding place in the left example. The original output for the left example was false. After the intervention, we end up with an output state that is negative, so the network still predicts false, but the causal model said we should predict true, and that's exactly the kind of mismatch that leads us to say that this is not a case of causal abstraction: the two models have unequal counterfactual predictions.
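To see how that failure plays out in code, here is the same kind of counterfactual run on the toy sketch above, under the mistaken standard-basis alignment of V1 with h1. The specific base and source inputs are my own illustrative stand-ins for the left and right examples on the slide.

```python
def interchange_intervention(base, source, neuron):
    """Run the network on `base`, but overwrite one hidden neuron with the
    value it takes when the network processes `source`."""
    h_base = W_in @ encode(*base)
    h_source = W_in @ encode(*source)
    h_base = h_base.copy()
    h_base[neuron] = h_source[neuron]        # the interchange intervention
    return w_out @ h_base + b_out            # resulting output state

base, source = (False, True), (True, True)   # left and right examples (illustrative)

# Counterfactual from the causal model: set V1 in the base to the source's V1.
causal_prediction = source[0] and base[1]    # True and True -> True

# Neural counterfactual under the alignment V1 <-> h1 (neuron 0).
score = interchange_intervention(base, source, neuron=0)
print(score)                                 # negative output state -> network says False
assert (score > 0) != causal_prediction      # the counterfactual predictions disagree
```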
Why does this happen? It's because I chose the wrong alignment due to my assumption of a standard basis. The alignment relationship does hold in a non-standard basis. If I take the current network and the current alignment and I simply rotate H1 and H2 using this rotation matrix, then I have a network that is behaviorally perfect and satisfies the causal abstraction relationship. We missed this because of the standard basis we chose, but there is no reason to assume that our neural models prefer to operate in that basis. DAS recovers the interpretable structure by dropping that assumption about the basis, and then the rotation matrix itself becomes the asset that you can use for interpreting the network.
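Here is the corresponding fix in the toy sketch. Because I chose the hidden weights myself, I know that a 45-degree rotation un-mixes h1 and h2; intervening in that rotated basis and then un-rotating makes the same counterfactual come out the way the causal model predicts. In general the right rotation is not known in advance, and that is exactly what DAS will learn.

```python
# Rotate (h1, h2) before intervening, then un-rotate afterwards. For these
# illustrative weights, a -45 degree rotation exposes the aligned directions.
theta = -np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def rotated_interchange(base, source, coord):
    rb = R @ (W_in @ encode(*base))          # base hidden state, rotated
    rs = R @ (W_in @ encode(*source))        # source hidden state, rotated
    rb = rb.copy()
    rb[coord] = rs[coord]                    # intervene on one rotated coordinate
    return w_out @ (R.T @ rb) + b_out        # un-rotate, then read out

# The counterfactual that failed before now matches the causal model.
score = rotated_interchange((False, True), (True, True), coord=0)
assert score > 0                             # positive output state -> True
```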
Here's a more high-level, abstract overview of how this might happen, using a pair of aligned interventions. I have two source models, on the left and right, along with a base model. They process their various examples, and we target the variables X_1, X_2, and X_3 across these different uses of the model. We rotate the representation that we targeted to create some new variables, and that rotation is the matrix that we're going to learn using, essentially, interchange intervention training. In the rotated space we intervene, replacing Y_1 with the value from one source model and Y_2 with the value from the other, and copying Y_3 over from the base example, and then we un-rotate and perform the intervention.
Remember, the essence of DAS is that we freeze the model parameters; this is not a method where we change the core underlying target model. What we do learn is a rotation matrix that essentially maximizes the interchange intervention accuracy we get from doing this rotation and then un-rotation to create these new intervened models.
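Here is a minimal DAS-style training loop for the toy network, written in PyTorch rather than with the course's own tooling, so the function names and the hinge-style loss are my choices, not the official implementation. The network weights are frozen constants; only the rotation is trained, and the training signal is agreement with the conjunction causal model's counterfactuals, i.e., interchange intervention accuracy.

```python
import itertools
import torch
from torch.nn.utils.parametrizations import orthogonal

torch.manual_seed(0)

# Frozen toy network (same illustrative weights as before, as tensors).
W_in = torch.tensor([[1.0, -1.0], [1.0, 1.0]]) / 2 ** 0.5
w_out = torch.tensor([0.0, 2 ** 0.5])
b_out = -1.0

def encode(x1, x2):
    return torch.tensor([1.0 if x1 else -1.0, 1.0 if x2 else -1.0])

# Trainable rotation, kept orthogonal by a PyTorch parametrization.
rotation = orthogonal(torch.nn.Linear(2, 2, bias=False))

def das_score(base, source, coord=0):
    """Rotate both hidden states, swap one rotated coordinate, un-rotate, read out."""
    R = rotation.weight
    rb = R @ (W_in @ encode(*base))
    rs = R @ (W_in @ encode(*source))
    mask = torch.zeros(2)
    mask[coord] = 1.0
    swapped = (1 - mask) * rb + mask * rs     # the distributed intervention
    return w_out @ (R.T @ swapped) + b_out

settings = list(itertools.product([True, False], repeat=2))
optimizer = torch.optim.Adam(rotation.parameters(), lr=0.05)

for step in range(300):                       # small fixed budget for this tiny example
    loss = torch.tensor(0.0)
    for base, source in itertools.product(settings, settings):
        # Counterfactual label from the causal model: overwrite V1 of the
        # base input with the source's V1, then take the conjunction.
        label = 1.0 if (source[0] and base[1]) else -1.0
        loss = loss + torch.relu(1.0 - label * das_score(base, source))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Interchange intervention accuracy in the learned rotated basis.
correct = sum((das_score(b, s) > 0).item() == (s[0] and b[1])
              for b, s in itertools.product(settings, settings))
print(f"interchange intervention accuracy: {correct}/{len(settings) ** 2}")
```

A real application would also have to choose which representations to target and how large the rotated subspace should be; here those choices are trivial because the toy network has only two hidden units.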
We keep the model frozen because we want to interpret the model we actually have. Let me mention a few findings. With DAS, we show that models learn truly hierarchical solutions to a hierarchical equality task; this is in fact the analysis that's reviewed in our notebook for this course. Such solutions can be missed by standard causal abstraction because of this non-standard-basis issue. Using causal abstraction, we had found that models learn theories of lexical entailment and negation that align with a high-level, intuitive causal model. But with DAS, we can uncover that they do that in a brittle way that actually preserves the identities of the lexical items rather than truly learning a general solution to the entailment problem.
This is tremendously exciting because it shows that we can push these analyses much further than before, due to our lack of a need for searching for alignments, because now we essentially learn the alignment. We scaled DAS to Alpaca, and we discovered that Alpaca, an instruction-tuned 7B-parameter LLM, implements an intuitive algorithm to solve a numerical reasoning task. I think this is just the start of the potential that we see for using DAS to understand our biggest and most performant models.
Let me turn now to wrapping up with some high-level conclusions. First, I wanted to return to this diagram that I used earlier in the course. We have these incredibly important goals for the field, among them identifying and correcting pernicious social biases and guaranteeing that models are safe in certain contexts. I feel that we cannot offer guarantees about these issues unless we have analytic guarantees about the underlying models. For me, that implies a truly deep, causal understanding of the mechanisms that shape their input-output behavior. For that reason, I think the analysis project in NLP is one of the most pressing projects for the field.
In that spirit, let's look ahead a little bit to the near future of explainability research for the field. In principle, we could offer complete, low-level mathematical explanations of how the transformer worked and call that explainability research, but that's at the wrong level for humans trying to offer useful explanations of model behavior. We need to apply these methods to ever larger instruct-trained LLMs, because those are the most relevant artifacts for the current moment. I think we're starting to approach this goal with DAS; I just mentioned how we can apply it to Alpaca. But we need to get to the point where we are essentially unconstrained in terms of what we can explore, and that requires a lot more innovation in the space.
Finally, I think we're seeing increasing evidence that our models induce something like a semantics, that is, a mapping from language into a network of concepts. If they are doing that, and if we can find strong evidence for that, it's tremendously eye-opening about what kinds of capabilities these models have. A model could have induced a semantics from its experiences, which would in turn lead us down the road of having many more guarantees that their behavior would be systematic, which could be a basis for them being, again, safe and free of pernicious biases, and for achieving all of those important goals for the field and for society.