Stanford CS224N NLP with Deep Learning | 2023 | Lec. 19 - Model Interpretability & Editing, Been Kim
Today I'm delighted to introduce, as our final guest speaker, Been Kim. 00:00:11.480 |
Been Kim is a staff research scientist at Google Brain. 00:00:15.760 |
If you're really into Google-ology, those funny words at the beginning of the title mean something,
and what they mean is that Been's a good research scientist.
[LAUGH] So I discovered at lunch today that Been started out
studying mechanical engineering at Seoul National University.
But she moved on to, I don't know whether it's to better things or not.
But she moved on to computer science and did her PhD at MIT. 00:00:46.000 |
And there she started working on the interpretability of machine learning models.
I think she'll be talking about some different parts of her work.
But a theme that she's had in some of her recent work that I find especially
appealing as an NLP person is the idea that we should be using
higher-level, human-interpretable languages for communicating with machines.
So welcome, Been, looking forward to your talk, and go for it.
But then I live in Seattle, so this is pretty common. 00:01:36.960 |
So I still was able to see the blue sky today. 00:01:39.080 |
I was like, this works, I really like it here. 00:01:42.080 |
So today I'm going to share some of my dreams:
my dream of communicating with machines, and my journey chasing it.
So if you're in this class, you probably agree, you don't have to,
that large language models and generative models are pretty cool.
But you may also agree that they're a little bit frightening. 00:02:01.920 |
Not just because they're impressive, they're doing a really good job, but 00:02:07.120 |
also we're not quite sure where we're going with this technology. 00:02:12.040 |
Ten years out, will we look back and say that technology was a net positive?
Or will we say that was catastrophic, we didn't know that would happen?
Ultimately, what I would like to do, or maybe hopefully what we all want to do, 00:02:28.600 |
is to have this technology benefit us, humans. 00:02:32.560 |
I know in 10 years' time, or maybe 20 years or earlier,
my son is going to ask me, mom, did you work on this AI stuff?
And did you know how this would profoundly change our lives?
I really hope that I have some good things to say to him. 00:02:56.000 |
So my initial thought, and still my current thought,
is that our ultimate goal should be for this technology to benefit humanity.
There are lots of different ways it can benefit us.
But one way is to treat this technology like a colleague,
a colleague who is really good at something,
good enough at something that you want to learn from them.
One difference, though, is that in this case the colleague is kind of weird.
This colleague might have very different values, 00:03:40.720 |
it might have very different experiences in the world. 00:03:44.480 |
It may not care about surviving as much as we do. 00:03:48.080 |
Maybe mortality isn't really a thing for this colleague. 00:03:52.400 |
So you have to navigate that in our conversation. 00:03:55.760 |
So what do you do when you first meet somebody? 00:03:59.120 |
There's someone so different, what do you do? 00:04:01.040 |
You try to have a conversation to figure out: how do you do what you do?
How are you solving the decades-old protein folding problem?
How are you beating the world Go champion so easily, as it seems?
Are you using the same language, the science language that we use, 00:04:23.480 |
Or do you think about the world in a very different way? 00:04:27.320 |
And more importantly, how can we work together? 00:04:29.800 |
I have one alien that I really want to talk to, and it's AlphaGo. 00:04:37.000 |
So AlphaGo beat the world Go champion Lee Sedol in 2016.
Lee Sedol is from South Korea, I'm from South Korea.
It was such a big deal in South Korea and worldwide, I hope. 00:04:48.440 |
And in one of the matches, AlphaGo played this move called move 37. 00:05:02.600 |
And I remember the nine-dan commentator who's 00:05:05.360 |
been talking a lot throughout the matches suddenly got really quiet. 00:05:10.000 |
And he said, hmm, that's a very strange move. 00:05:14.640 |
And I knew then that something really interesting
had just happened in front of my eyes,
that this AlphaGo had made a move that we're going to remember forever.
And sure enough, this move turned the game around for AlphaGo
and led AlphaGo to win that match.
So Go players today continue to analyze this move. 00:05:42.120 |
So the question is, how did AlphaGo know this is a good move? 00:05:45.680 |
My dream is to learn something new by communicating with machines 00:05:54.720 |
and having a conversation, such that humanity
will gain some new angle on our important problems.
And this is not just about discovering new things.
To have a meaningful conversation with somebody so different,
you have to understand something about their values and their goals.
So in a way, solving this problem is a superset of solving AI safety, too.
Conversation assumes that we share some common vocabulary
to exchange meaning and, ultimately, knowledge.
And naturally, a representation plays a key role in this conversation. 00:06:40.880 |
On the left, and we can visualize this as two circles,
we have the representational space of what humans know.
Here in the left circle, there will be something like, this dog is fluffy.
And you know what that means, because we all share a somewhat similar vocabulary.
But on the right, we have something like move 37,
which humans have yet to have a representation for.
And the more overlap we have, the better conversation we're going to have. 00:07:21.520 |
Like here, everyone is learning something new. 00:07:24.320 |
So we can expand what we know by learning new concepts and vocabularies. 00:07:29.920 |
And doing so, I believe, will help us to build 00:07:32.800 |
machines that can better align with our values and our goals. 00:07:40.880 |
If you're curious about some of the work we're doing in this direction,
I'll point you to some references.
But today, I'm going to talk more about my hopes and dreams.
And hopefully, at the end of the day, your hopes and dreams too. 00:07:59.160 |
So first of all, I'm just going to set the expectation. 00:08:03.600 |
So at the end of this talk, we still won't know how move 37 was made.
In fact, the first part of this talk is going
to be about how we have moved backwards,
in terms of making progress on this journey.
And what I will cover is still a very, very small portion of our entire journey.
And of course, this journey wouldn't be like a singular path. 00:08:35.160 |
There will be lots of different branches coming in. 00:08:38.480 |
Core ideas like the transformer have helped many domains across the field.
So I'm also going to talk about some of our work
on understanding emergent behaviors in reinforcement learning.
And all the techniques that I'm going to talk about
are, in principle, applicable to NLP.
So coming back to our hopes and dreams, move 37. 00:09:05.120 |
So let's first think about how we might realize this dream. 00:09:09.440 |
And taking a step back, we have to ask, do we 00:09:12.400 |
have tools to first estimate what even machines know? 00:09:17.480 |
There have been many developments in machine learning over the last decade
to build tools to understand and estimate this purple circle, what machines know.
But what we have found is that there is a huge gap between what machines actually know
and what we think they know, what these tools tell us.
And identifying and bridging this gap is important
because these tools will form the basis for understanding
anything more ambitious, like move 37.
So saliency map is one of the popular interpretability methods. 00:10:02.840 |
For simplicity, let's say you're working with ImageNet and you have an image like this.
The explanation is going to take the form of the same image,
but where each pixel is associated with a number that
is supposed to imply the importance of that pixel.
The common interpretation is that the number indicates how the function,
the network's output, changes around that pixel value.
For example, for a pixel value x_j, around x_j the function might be flat,
or it might be going up like the green curve.
If it's flat like the blue curve or the red curve,
maybe that feature is irrelevant to predicting bird.
If the output goes up as the value of x_j increases,
then maybe that pixel is more important.
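To make this concrete, here is a minimal sketch of a gradient-based saliency map in PyTorch. It is an illustration of the general recipe described above, not the exact method from any specific paper; the model and the random image are placeholders.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None)  # placeholder model; any image classifier works
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image

logits = model(image)
target_class = logits.argmax(dim=1)          # e.g. the "bird" class in the talk's example
score = logits[0, target_class]
score.backward()                             # d(score)/d(pixel) for every pixel

# One importance number per pixel: max absolute gradient across color channels.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)   # shape (224, 224)
print(saliency.shape)
```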
So let's think about a few reasons why this gap might exist. 00:11:13.120 |
So this alien, again, these machines that we train,
works in a completely different, perhaps completely
different representational space, with very different experiences of the world.
So assuming that it sees the world just like we do,
like having the gestalt phenomenon, where a few dots look like a shape to us,
may simply be wrong.
Maybe our assumptions about these machines are wrong.
We thought it was doing X, but it was actually doing Y.
I'm going to dig deeper into some of these through our work.
So again, coming back to the earlier story about saliency
maps, we're going to play with some of these methods.
Now, in 2018, we stumbled upon a phenomenon
that was quite shocking. We were actually
trying to write a different paper, and we
realized that a trained network and an untrained network
produce saliency maps that look essentially the same.
We thought we had a bug, but it turned out we didn't.
They actually are indistinguishable, qualitatively and quantitatively.
But then we wondered, maybe it's a one-off case.
OK, what if the model had an error, one of these errors,
maybe out-of-distribution data at test time?
It turns out that that's also not quite it.
You might think, oh, maybe spurious correlation, then.
So this is more recent work where we theoretically
prove that some of these methods, very popular methods,
cannot do better than random guessing at certain tasks.
So I'm going to talk a little bit about that.
I just realized this is also work with Pang Wei Koh.
Now, the original paper that developed this method, IG, Integrated Gradients,
says it can be used for accounting for the contributions of each feature.
So what that means is that when the tool assigns
zero attribution to a pixel, we're going to say, OK,
the model doesn't depend on that pixel.
And in fact, this is how it's been used in practice.
People have used SHAP, for example, to reason about eligibility in real decisions.
These inferences seem pretty natural, and we wanted to know whether they were true.
And in fact, they're not: just because popular attribution methods
assign a zero or a non-zero attribution,
you cannot conclude anything about the actual model behavior.
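For reference, here is a rough sketch of how Integrated Gradients is typically computed: attributions are an averaged gradient along a straight path from a baseline to the input, scaled by the input difference. The tiny model and inputs below are stand-ins, and, as the talk argues, a zero attribution from this computation does not by itself license conclusions about the model.

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    # Interpolate between baseline and input: baseline + alpha * (x - baseline)
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1, 1)
    interpolated = baseline + alphas * (x - baseline)
    interpolated.requires_grad_(True)

    logits = model(interpolated)                     # one forward pass per step
    score = logits[:, target_class].sum()
    grads = torch.autograd.grad(score, interpolated)[0]

    avg_grad = grads.mean(dim=0, keepdim=True)       # average gradient along the path
    return (x - baseline) * avg_grad                 # same shape as the input

# Usage with stand-ins for a real classifier and image:
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.rand(1, 3, 8, 8)
attributions = integrated_gradients(model, x, torch.zeros_like(x), target_class=3)
print(attributions.shape)
```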
I learned about theorem proving from this project as well.
So I'll tell you the way that we pursued this particular work.
We're going to formulate it as a different problem,
in this case, as hypothesis testing.
Because once you formulate it as hypothesis testing, yes or no, you can be precise.
I got an attribution value from one of these tools.
And I have a mental model of, ah, this feature is important
to the model's prediction.
Then the hypothesis is whether that's true or not.
And what we showed is that, given whatever hypothesis of this form you
may have, you cannot do better than random guessing at
validating or invalidating it.
And what good is an explanation
if you just don't know whether it's better than random guessing?
The natural ways people use these attributions
all fall under this line of random guessing.
But maybe this still works in practice for some reason,
maybe there's an assumption in the theory that isn't quite met in practice.
So we also ran experiments: image datasets and bigger models,
and concrete end tasks people actually use these methods for,
like recourse or spurious correlations.
Recourse means, for example: if I change this one feature,
I would have a high chance of getting a loan.
So I tweak this one feature and see if my value goes up or down.
A very reasonable task, and people do it all the time.
So for two of these concrete end tasks, both of them
boil down to this hypothesis testing framework,
and the same random-guessing result holds empirically.
And I really hope you're not one of those people who concludes too much from these tools.
There is a reason, perhaps, why a lot of the time they can appear to work.
So again, your goal is to estimate the shape of the function around a point.
You sample around that point and evaluate the function.
If the output goes down, maybe the function is going down.
So that's the simplest way you can brute-force it.
But then the question is, how many samples do we need?
The more samples you have, the better estimate you have.
And how small a difference in output do you care about?
There is a trade-off between the resolution you care about and the number
of samples you need.
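Here is a minimal sketch of that brute-force idea, under the assumption that you can query the function freely: sample perturbations of one feature, fit a local slope, and note that the finer the output differences you care about, the more samples you need. The function f below is a toy placeholder.

```python
import numpy as np

def local_slope_estimate(f, x, feature, num_samples=100, radius=0.1):
    """Estimate the local slope of f along one feature by random sampling."""
    rng = np.random.default_rng(0)
    deltas = rng.uniform(-radius, radius, size=num_samples)
    xs = np.tile(x, (num_samples, 1))
    xs[:, feature] += deltas
    ys = np.array([f(row) for row in xs])
    # Least-squares slope of output vs. perturbation; near zero means locally flat.
    slope = np.polyfit(deltas, ys, deg=1)[0]
    return slope

f = lambda v: np.sin(v[0]) + 0.0 * v[1]       # feature 1 is irrelevant by construction
x = np.array([0.3, 5.0])
print(local_slope_estimate(f, x, feature=0))  # roughly cos(0.3): locally increasing
print(local_slope_estimate(f, x, feature=1))  # roughly 0: flat, likely irrelevant
```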
So if you worry about making some conclusion based on too few samples, that worry is justified.
We're currently working on even bigger models
to show, again and again, empirical evidence that, yes,
as functions get more complex,
of course, you're going to need more samples.
Because these methods have pretty good roots:
if, for a given function, I can test the conditions and say when the estimate is reliable,
that would still be very useful. That's ongoing work.
A question from the audience: do these findings
only apply to computer vision models?
No. It's a fairly simple proof that shows this holds for any function.
But it sort of seems like, for the last couple of years,
there have been at least dozens, maybe hundreds of people
using these attribution methods in their work.
I mean, is your guess that most of that work is invalid?
Or that a lot of it might be OK because whatever conditions
make it all right might often be there?
My hypothesis testing result shows that it's random.
So maybe, in the optimistic case, 50% of those papers got lucky.
But even so, if it helped humans at the end task, whatever
that might be, helping doctors be more efficient,
identifying bugs and whatnot, and if they did the validation
correctly with the right controlled testing setup,
then these noisy tools together with a human in the loop can still be useful, maybe.
It's just that I think we need to narrow down
our expectations so that our expectations are
aligned with what these tools can actually deliver.
So this is one of those papers that started out just like many other papers.
We thought, we're going to locate ethical knowledge in a language model,
and then maybe we're going to edit it to make the model more ethical.
And then we thought, oh, the ROME paper from David Bau's group gives us the tool.
But then we started digging in and implementing ROME.
So we did sanity check after sanity check, experiment after experiment.
And we ended up writing a completely different paper
than the one we set out to write.
So this paper, ROME, for those who are not familiar,
and I'm going into a little more detail in a bit,
rests on the assumption that because you can locate facts, you can mess with them there.
In fact, that's a lot of the time how localization and editing methods are motivated.
But what we show is that this assumption is actually not true:
knowledge can be edited just as well at layers outside of the ones identified by localization.
And you will see this in a little more detail in a bit.
In fact, the correlation between where the facts are located
and how well you can edit, if you edit that location,
turns out to be essentially zero.
What we mean by editing can mean a lot of different things. 00:25:08.960 |
So let's think about different ways to edit a fact.
We tried a bunch of things with little success.
We couldn't find an editing definition that actually
relates really well to what localization methods find.
So let's talk a little bit about ROME, how ROME works.
There are a lot of details missing from this slide, but here is the gist.
They have what's called the causal tracing algorithm.
They use a data set, the CounterFact data set, that has tuples of subject,
relation, and object, for example, the Space Needle is in Seattle.
And so you're going to have a clean run of "The Space Needle is in",
where you store every single module's activation, every single value.
And then in a second run, which they call the corrupted run,
you're going to add noise to the subject tokens, "The Space Needle".
Then you're going to intervene at every single one
of those modules by copying the clean activation into the corrupted run,
as if that particular module was never corrupted,
while you pretend everything else is held equal.
And then you ask, what is the probability of the right answer?
So in this case, the probability of the right answer, Seattle.
So at the end of the day, you'll find a graph
like that, where each layer and each token has a score:
if I intervene on that token at that layer,
how likely is it that I will recover the right answer?
Because if I recover the right answer, that must be the module where the fact lives.
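Here is a simplified, runnable sketch of the causal tracing recipe just described, on a toy stack of layers rather than a real transformer; in ROME the same idea is applied per token position and per module (MLP, attention) of a GPT-style language model.

```python
import torch

torch.manual_seed(0)
layers = torch.nn.ModuleList([torch.nn.Linear(16, 16) for _ in range(6)])
readout = torch.nn.Linear(16, 100)           # toy vocabulary of 100 "answers"
subject_embedding = torch.randn(1, 16)       # stands in for "The Space Needle is in"
answer_id = 42                               # stands in for the token "Seattle"

def run(x, patch_layer=None, clean_states=None):
    """Forward pass; optionally overwrite one layer's output with its clean value."""
    states = []
    h = x
    for i, layer in enumerate(layers):
        h = torch.tanh(layer(h))
        if patch_layer == i:                 # intervene: copy in the clean activation
            h = clean_states[i]
        states.append(h)
    probs = readout(h).softmax(dim=-1)
    return probs[0, answer_id].item(), states

# 1) Clean run: store every hidden state.
clean_prob, clean_states = run(subject_embedding)
# 2) Corrupted run: add noise to the subject embedding.
corrupted_input = subject_embedding + 0.5 * torch.randn_like(subject_embedding)
corrupted_prob, _ = run(corrupted_input)
# 3) Patch each layer with its clean state and see how much probability recovers.
for i in range(len(layers)):
    patched_prob, _ = run(corrupted_input, patch_layer=i, clean_states=clean_states)
    print(f"layer {i}: tracing effect = {patched_prob - corrupted_prob:+.4f}")
```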
I couldn't find a technical flaw in this algorithm. 00:27:21.640 |
But when we started looking at this, using the same model
and the same data set, things didn't add up.
ROME edits layer 6 because that was supposedly the best layer across this data set,
the claim being that most of the factual knowledge is stored around layer 6.
But we realized the truth looks like the graph on the right.
The black bars are a histogram of where the tracing effect
actually peaked if you test every single layer.
And as you can see, not a lot of facts fall into that region.
So in fact, every single fact has a different region where it peaks,
and layer 6, for a lot of facts, wasn't the best layer.
So we thought, what do we do to find this ethical knowledge?
We decided to actually do a sanity check first
on the factual-editing setup itself.
And that's when everything started falling apart.
The edit success, this is the rewrite score, how well the edit took.
And the tracing effect, this is localization, how strongly causal tracing points at that layer.
So when we plotted the relation between tracing effect and edit success,
there was essentially no relationship;
the correlation is actually slightly negative on this particular data set.
We're going to do this for every single layer. 00:30:30.920 |
The ROME authors acknowledged that this is a phenomenon that exists.
So the added benefit of localization could, at most,
help you pick where in the prompt, which subject token, to apply the edit.
For choosing the layer, in fact, you don't need to worry about localization at all.
But then we thought, maybe the particular definition of editing
that they used in ROME is the issue; maybe there is a different definition of editing
that correlates a lot better with localization.
So we tried a bunch of different definitions of editing.
What you see here is the R-squared value for four different methods.
And this wasn't just the case for ROME and MEMIT.
It was also the case for fine-tuning methods.
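For concreteness, this is the kind of analysis being described: per-fact tracing effects regressed against per-fact rewrite scores, summarized by R-squared. The data below is synthetic, standing in for the real measurements, where the R-squared values turned out to be near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_facts = 500
tracing_effect = rng.uniform(0, 1, n_facts)   # localization score per fact
rewrite_score = rng.uniform(0, 1, n_facts)    # edit success per fact (unrelated here)

# Fit rewrite_score ~ a * tracing_effect + b and compute R^2.
a, b = np.polyfit(tracing_effect, rewrite_score, deg=1)
pred = a * tracing_effect + b
ss_res = np.sum((rewrite_score - pred) ** 2)
ss_tot = np.sum((rewrite_score - rewrite_score.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.4f}")   # close to 0: localization explains very little
```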
You might feel that fact forcing, the last one, correlates a little better.
But still, compared to the impact of the choice of layer,
the tracing effect explains very little.
So the bottom line is, we couldn't locate and then edit the ethical knowledge in this project,
and we ended up doing a lot more in-depth analysis on this instead.
So in summary, does localization help editing?
For these particular editing methods, from what I know, the answer is no.
But if somebody can answer why that is for me, I would love to hear it,
because I feel like there should still be something there.
But what we found here is that it has nothing
to do with how well you can edit.
But a lot of the insights that they found in their paper
are still useful, like the finding that early-to-mid-range MLP layers
represent the factual associations, something we didn't know before.
But it is important not to validate localization methods
using editing, and maybe not to motivate editing methods using localization.
Those are the two things we now know that we shouldn't do.
Any questions on this one before I move on to the next one? 00:34:35.280 |
So that was another example of the gap between what machines know versus what we think machines know. 00:34:45.280 |
There's a good quote that says, "Good artists steal."
But we have to be really suspicious of everything that we do.
The lesson is that once you like your results too much, that's a bad sign.
Go home, have a beer, go to sleep.
And the next day, you come back and put your paper on your desk
and think, OK, now I'm going to review this paper.
So let's bring our attention back to our hopes and dreams. 00:35:28.720 |
So here, I came to realize maybe instead of just building 00:35:33.560 |
tools to understand, perhaps we need to do some groundwork. 00:35:38.920 |
Well, this alien that we've been dealing with, 00:35:41.680 |
trying to generate explanations, seems to be a different kind. 00:35:50.560 |
Study them as if they're a new species in the wild.
There are many ways to do that, but one of the ways is to do an observational study.
So you watch the species in the wild from far away,
and ask what their habitat is, what their values are, and whatnot.
And a second way, you can actually intervene and do a controlled study.
So we did something like this with reinforcement learning agents.
I'm going to talk about these two papers, starting with the first one.
The first one uses OpenAI's hide-and-seek domain.
If you haven't seen the video, just Google it and watch it;
I'm only covering the tip of the iceberg here.
At the end of this hide-and-seek episode, at some point,
the agents discover a bug in the physics system and exploit it.
There are also humanoid football and Capture the Flag from DeepMind.
Lots of interesting behaviors emerge that we observed.
The labels here, running and chasing and so on, are provided by OpenAI;
someone went painstakingly, one by one, and watched all these videos to label them.
The question is whether our tools can
help us explore this complex domain a little better.
So in the first work, we treat the agents as subjects of an observational study.
We use a generative model. Have you covered Bayesian generative models in this class?
Think about this as a fake or hypothetical data generation process.
I'm going to first generate a joint latent embedding space across the agents.
And each embedding, when it's conditioned on the state, generates an action:
the model says what the action is given the state and the embedding pair.
And you do inference to learn all these parameters.
This is like my hypothesis of how these new species might work;
we're going to try this and see if anything useful comes up.
And the way you do this, one of the ways you do this,
is you optimize a variational lower bound.
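A very rough sketch of this kind of generative model is below, assuming logged (state, joint action) pairs: an encoder infers a joint latent embedding, per-agent decoders reconstruct each agent's action from the state and the latent, and training maximizes a variational lower bound. Dimensions, architectures, and data are made up for illustration.

```python
import torch
import torch.nn as nn

state_dim, action_dim, latent_dim, n_agents = 10, 4, 2, 2

encoder = nn.Sequential(nn.Linear(state_dim + n_agents * action_dim, 64), nn.ReLU(),
                        nn.Linear(64, 2 * latent_dim))            # outputs mean and log-variance
decoders = nn.ModuleList([nn.Sequential(nn.Linear(state_dim + latent_dim, 64), nn.ReLU(),
                                        nn.Linear(64, action_dim))
                          for _ in range(n_agents)])
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoders.parameters()), lr=1e-3)

def elbo(state, actions):
    """state: (batch, state_dim); actions: (batch, n_agents) integer action ids."""
    one_hot = nn.functional.one_hot(actions, action_dim).float().flatten(1)
    mu, logvar = encoder(torch.cat([state, one_hot], dim=-1)).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()          # reparameterization trick
    recon = sum(nn.functional.cross_entropy(dec(torch.cat([state, z], dim=-1)), actions[:, i])
                for i, dec in enumerate(decoders))                # action given state and embedding
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return -(recon + kl)                                          # variational lower bound

# One toy training step on random data standing in for logged trajectories.
state = torch.randn(32, state_dim)
actions = torch.randint(0, action_dim, (32, n_agents))
opt.zero_grad()
loss = -elbo(state, actions)
loss.backward()
opt.step()
print(loss.item())
```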
If one gets into this exponential family business, the updates work out nicely, but I'll skip the details. 00:39:52.160 |
Here, we're going to pretend that we have two agents, one
controlling the back leg and one controlling the front leg of a running creature.
And on the right, we're showing that joint embedding space.
This is an interactive visualization:
you can select a little region in agent 1's space,
and you see it maps to a pretty tight region in agent 0's space.
And now I'm going to select somewhere else in agent 1's space
that maps to a kind of dispersed area in agent 0's space.
And this is an insight that we gained for this data only:
this tight-mapping business kind of represents
the good running behaviors, and the dispersed one the bad running behaviors.
That's something that you can discover pretty efficiently.
And now I'm going to show you something more interesting.
Of course, we had to do this, because we have the data:
we apply this framework to OpenAI's hide-and-seek.
This domain has a pretty complex structure, roughly 100-dimensional observations,
and we pretend that we don't know the labels given by OpenAI.
So again, this is the resulting embedding space, z omega and z alpha.
But the coloring is something the model didn't know before, the OpenAI labels.
There's a nice kind of pattern: we can roughly separate the behaviors
that make sense to humans, like running and chasing.
But remember the green and gray points, which are kind of everywhere.
So in this particular run of OpenAI's hide-and-seek,
some behaviors seem to be pretty separate and distinguishable in the embedding.
But in the case of orange, which is fort building,
the representation is a little more entangled for the seekers.
Perhaps if the seekers had built a more separate fort-building
representation, maybe they would have won this game.
So for this work, can we learn something interesting about
emergent behaviors by just simply observing the system?
The answer seems to be yes, at least for the domains we tried.
But remember that these methods don't give you labels for the clusters.
So you would have to go and investigate, click through, and watch the behaviors.
And if a cluster represents a superhuman concept, naming it will be hard.
And also, if you have access to the model and the reward, you can do more: you can intervene.
I'm going to talk about this work with Nico and Natasha next.
So here, this time, we're going to intervene.
The problem is that we're going to build a new multi-agent
system, going to build it from scratch, such that we can intervene on human-interpretable concepts.
And we're going to try to match the performance of the original system.
This builds on earlier work where we proposed a pretty simple idea, concept bottleneck models:
why don't we embed concepts in the middle of the network, as a bottleneck, so that you can intervene on them?
It's particularly useful in the medical setting,
where there are some features that doctors want to inspect or override.
So this is the work to extend this to the RL setting.
It's actually not as simple an extension as we thought.
We're building a concept bottleneck for each of the agents,
and training everything end to end.
Just think about the objective as: make the RL system work,
plus minimize the difference between the true concept values and the predicted ones.
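A minimal sketch of that objective might look like the following: a concept predictor feeds a policy head, and the loss is a standard policy-gradient term plus a concept-prediction term. All names, dimensions, and the loss weighting are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

obs_dim, n_concepts, n_actions = 20, 6, 5

concept_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_concepts))
policy_head = nn.Sequential(nn.Linear(n_concepts, 64), nn.ReLU(), nn.Linear(64, n_actions))

def forward(obs, intervened_concepts=None):
    concepts = concept_net(obs)
    if intervened_concepts is not None:       # at test time you can overwrite the concepts
        concepts = intervened_concepts
    return concepts, policy_head(concepts)

# Toy batch: observations, true concept labels, actions taken, and advantages.
obs = torch.randn(32, obs_dim)
true_concepts = torch.randn(32, n_concepts)   # e.g. positions, orientation, has-tomato
actions = torch.randint(0, n_actions, (32,))
advantages = torch.randn(32)                  # whatever the RL algorithm provides

concepts, logits = forward(obs)
log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(32), actions]
policy_loss = -(advantages * log_probs).mean()                   # make the RL system work
concept_loss = nn.functional.mse_loss(concepts, true_concepts)   # match the true concepts
loss = policy_loss + 1.0 * concept_loss
loss.backward()
print(policy_loss.item(), concept_loss.item())
```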
First domain: how many people have seen this cooking game before?
Yeah, it's a pretty commonly used cooking domain.
The agents wait for the tomato, bring the dishes, and so on.
Their goal is to deliver as many soups as possible.
And here, the concepts that we use are agent position,
orientation, whether the agent has a tomato, has a dish, et cetera, et cetera.
Things that are immediately available to you already.
And you can, of course, tweak the environment,
so you can make it so that they have to collaborate.
The idea is that you can detect this emergent behavior of coordination by intervening.
Suppose the RL system that we trained works.
This is the reward of agent 1 when there's no intervention.
And this is the average value when intervening on all concepts.
But I'm also going to show you each concept soon.
You can tell that on the right, when we intervene,
the reward deteriorated quite a lot for both of them.
And that's one way to see, ah, they are coordinating.
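The analysis loop itself is simple, something like the sketch below: roll out episodes while overwriting one concept at a time and compare average reward to the no-intervention baseline. The run_episode function here is a placeholder with made-up numbers that only mirror the qualitative pattern described in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
concept_names = ["team_position", "team_orientation", "has_tomato", "has_dish"]

def run_episode(intervene_on=None):
    # Placeholder: a real implementation would step the environment, overwrite the
    # named concept in the bottleneck at every step, and return the episode reward.
    # The numbers below are invented purely to mirror the qualitative finding.
    base = 10.0
    degradation = {"team_orientation": 6.0, "team_position": 2.0}.get(intervene_on, 0.5)
    return base - degradation + rng.normal(0, 0.3)

baseline = np.mean([run_episode() for _ in range(50)])
for name in concept_names:
    with_intervention = np.mean([run_episode(intervene_on=name) for _ in range(50)])
    print(f"{name:18s} reward drop: {baseline - with_intervention:5.2f}")
# The concept whose intervention hurts reward the most is the one the
# agents rely on most heavily for coordination.
```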
But this is what was really interesting to me.
This is the same graph as the one you saw before,
except I'm plotting the intervention for each concept separately.
So I'm intervening on teammate position, teammate orientation, and so on.
And when we intervene on teammate orientation,
the degradation of performance was the biggest,
to the extent that we believe that orientation carries
most of the information the agents use to coordinate.
Just a clarification question on the orientation:
is that the direction that the teammate is facing?
Because it seems like orientation would let you predict
where the teammate is going next.
Yes. Where were you when I was pulling my hair out over this?
Orientation is basically the best signal an agent can get about the next move of the other agent.
Because they're facing the pot, they're going to the pot.
Obvious to some, but I needed this graph to work that out.
And of course, you can use this to identify lazy agents.
If you look at the rightmost yellow agent, our friend,
it isn't really contributing anything.
And you can easily identify this by using this graph:
if I intervene on its concepts, it just doesn't impact any of the rewards.
So this second domain is about studying inter-agent social dynamics.
In this domain, there is a little bit of tension between the agents.
The green things are apples, which the agents collect for reward.
And you can see, if you have four agents trying
to collect apples, you can just steal someone else's.
The concepts here, again, are pretty common things:
positions, orientations, pollution, apple positions, et cetera.
So the story here is what happens when I intervene on agent 1:
how the other agents' rewards are impacted,
and the same for idle time and for the inter-agent distance.
So we're going to do a little more work here, but not a lot.
This is the simplest, dumbest way to build a graph.
Think of a data set that consists of features of movies, so length, genre, and so on.
And the simplest way to build a graph over the movies is to do a regression
of one movie's features on the others'.
And that gives me beta, which is a kind of coefficient for each pair.
And that beta represents the strength of the edges:
this movie is more related to this movie and not to that one.
Instead of movies, we're going to use the effects of interventions on concepts,
data that wouldn't have been available without our framework,
for reward, resources collected, and many other quantities.
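A minimal sketch of that graph-building step, assuming you have a matrix of per-agent quantities collected under interventions: regress each agent's column on the others and read the coefficients as edge strengths. The data below is synthetic, with one planted dependency.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_agents = 200, 4
X = rng.normal(size=(n_samples, n_agents))   # e.g. per-agent reward change per intervention
X[:, 0] += 0.8 * X[:, 3]                     # planted dependency: agent 1 and agent 4

adjacency = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    others = [j for j in range(n_agents) if j != i]
    model = Lasso(alpha=0.05).fit(X[:, others], X[:, i])
    adjacency[i, others] = model.coef_       # beta = strength of the edge from j to i

np.set_printoptions(precision=2, suppress=True)
print(adjacency)   # a strong edge appears between agent 1 (index 0) and agent 4 (index 3)
```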
And when you build this graph, at the end of the day, a strong edge
is nicely highlighted between agent 1 and agent 4, and only there,
contradicting the original hypothesis that we had.
It turns out that there are no strong edges where we expected them,
but there are strong edges between agent 1 and agent 4.
So we dug deeper into it, watched a lot of sessions.
And it turns out that the story was a lot more complicated.
When apple collecting fails, agents 1 and 2 get cornered in.
Agent 4 corners agents 1 and 2, the blue and yellow agents,
because of the way that we built this environment.
But the raw statistics wouldn't have told us this story.
In fact, there was no obvious correlation or coordination on the surface.
Only after building the graph did we realize this was the case.
A lot of the emergent behaviors that we want to detect are subtle like this.
And we really want to get to the truth of that
rather than having some surface-level statistics.
So, did we build a multi-agent system
that enables intervention and performs as well?
There's a graph that shows the red line and blue line roughly matching, so yes.
The caveat is that you need concept labels,
or you should have some way of getting those concepts.
So I did tell you that we're not going to know how move 37 was made by the end of this talk.
But let me tell you about the work
that I'm currently doing that I'm really excited about.
Will this understanding of move 37 happen before I retire? Who knows.
So we start, because this is all about research, right,
working towards my ultimate goal of understanding that move 37.
So before that, how many people here know AlphaZero, from DeepMind?
AlphaZero is a self-trained chess-playing machine
that has a higher Elo rating than any human
and beats Stockfish, which is arguably the strongest existing chess engine.
So in a previous paper, we tried to discover human chess
concepts inside AlphaZero, and when they emerge during training time.
We also compared what we call the distribution
of opening moves between humans and AlphaZero.
And as you can see, there's a pretty huge difference.
It turns out that AlphaZero can master, or supposedly master,
a wide variety of different types of openings.
So that begs the question: what does AlphaZero know that we don't?
We're actually almost there; we're about to evaluate.
So the goal of this work is to teach the world chess champion,
Magnus Carlsen, something new.
He's still champion in two categories, actually. 00:57:45.840 |
The way we do it is, we're going to discover new chess strategies
by explicitly forgetting existing chess strategies, which
we can express with existing human concepts.
And then we're going to learn a graph, this time between
the new concepts and the existing concepts,
so that we can get a little bit more of an idea of what
the new concept might mean.
And Magnus Carlsen. So, my favorite part about this work
is that the evaluation is quantitative.
So it's not just Magnus coming in and saying,
oh, your work is kind of nice, and saying nice things about it.
No, Magnus actually has to solve some puzzles.
And we will be able to evaluate whether he solved them or not.
This kind of work I can only do because of Lisa,
who is a chess champion herself, but also a PhD student at Oxford.
And she's going to be the ultimate pre-superhuman evaluator before we go to Magnus.
And given that AlphaZero has a really weird architecture,
and that's just the way that they decided to build it,
we identify or generate the board positions that express the new concept,
and look at what move it's going to make given that board position.
The evaluation: you give a board position and then ask Magnus to make a move.
Then we explain the concept, give Magnus more board
positions, and see if he can apply the concept that he just learned.
Yeah, so if I were to ask Stockfish to solve those puzzles, that's not quite the point,
because we are interested in whether we can teach a human.
That's actually an interesting thing that we could do,
but our goal is to just teach one superhuman, Magnus.
If I have, for example, 10,000 superhuman concepts,
and only three of them are digestible by Magnus, that's fine.
That would be a big win for this type of research.
To wrap up: we talked about the gap between what machines know
and what we think they know.
We talked about studying these aliens, these machines,
both observationally and by intervening.
There are many other ways to study a species.
And I'm not an expert, but anthropology and other humanities
fields would know a lot more about this.
And maybe, just maybe, we can try to understand move 37
through this chess project that I'm very excited about.
The question was whether certain interpretability techniques transfer from one modality to another.
I think, like, think about fairness research,
which builds on a strong mathematical foundation.
And that's applicable to any question around fairness.
But if your goal is to actually solve a fairness issue at hand,
you would have to customize it for a particular person.
You would have to customize it for a particular application.
And I think something similar is true for interpretability.
SHAP and IG are used across domains, like vision and text,
so that theory paper would be applicable across domains.
Other things are more domain-specific;
I don't even know how to think about agents in NLP yet.
So I saw some recent work in which some amateur Go players
were able to beat a superhuman Go program by exploiting a weakness.
And that weakness seemed like a concept that humans know but the machine doesn't.
I just want to know your thoughts about that.
Yeah, actually, it's funny you mentioned that.
If you understand the machine, you kind of know what its most unseen, unfamiliar situations are.
And Lisa guessed that if Lee Sedol had known something more
about AI, then maybe he would have tried to confuse AlphaGo in that way.
like the one that Magnus made a couple of days 01:04:53.180 |
The work that I've presented is pretty new.
But there has been a bit of discussion in the robotics
and reinforcement-learning-in-the-wild communities.
The things you can write a test for, like a unit test, are less of a worry,
because you're going to test before you deploy.
I think the biggest risk for any of these deployed systems
is the behaviors you didn't anticipate and can't test for.
So my work around the visualization and the other tools is saying:
here's a tool that helps you better discover those unanticipated behaviors.
You talked about a lot of ways in which we try to visualize or understand
what's going on inside the machine.
But I was wondering whether we could turn it around and have the machine explain to us,
using our language, what they're doing in their representations,
and then get the machine to do the translation for us
instead of us going into the machine to see it.
Because that's something that I kind of tried in my
previous work, called Testing with Concept Activation Vectors, TCAV.
So that was to map human concepts, human language, into a machine's representation space.
The challenge is, how would you do that for something
we don't have a vocabulary for, like move 37?
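For those who haven't seen it, here is a rough sketch of the TCAV idea: fit a linear classifier between activations of concept examples and random examples, take the normal to the boundary as the concept activation vector, and score how often gradients point along it. The activations and gradients below are random stand-ins rather than outputs of a real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 64
concept_acts = rng.normal(loc=0.5, size=(100, dim))   # activations of concept examples
random_acts = rng.normal(loc=0.0, size=(100, dim))    # activations of random examples

# 1) Concept Activation Vector: normal to the boundary separating the two sets.
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([concept_acts, random_acts]),
    np.array([1] * 100 + [0] * 100))
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# 2) TCAV score: fraction of class examples whose class-logit gradient (taken with
#    respect to this layer's activations) has a positive directional derivative along the CAV.
grads = rng.normal(size=(200, dim))                    # stand-in for real gradients
tcav_score = np.mean(grads @ cav > 0)
print(f"TCAV score: {tcav_score:.2f}")  # about 0.5 here, since everything is random
```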
If we only translate into our existing vocabulary,
there is going to be a lot of missing, valuable knowledge.
And even a word like penguin is a kind of proxy for what we think a penguin is.
Everyone thinks very differently about what a penguin is;
everyone is thinking of a different penguin right now.
Australia has the cutest penguin, the fairy penguin.
But I don't know how many people are thinking of that one.
Extend that to 100 concepts, and to composing those concepts, and it only gets harder.
I think for some applications, exclusively using
our existing vocabulary is fine.
But my ambition is that we shouldn't stop there.
We should benefit from these machines by having them teach us new things.
So the second thing you talked about, with ROME,
was that where knowledge is located in the network
isn't super correlated with where you'd like to edit.
Do you think that has any implications for the later work,
like trying to get strategies out of the embedding space?
Possibly, just because I feel like the ROME result, as well, suggests it's not the raw space that matters.
It's like some transformed version of our embedding space that matters.
So thinking about it as a raw vector might be a dead end.
In a couple of months, I might rethink my strategy.
One more thought: a lot of this stuff that we're trying to do here
is a study of neural networks to help us understand them,
the way the whole field of neuroscience is there to understand the human brain.
And there are some biases that we have inherited from neuroscience,
like its physical tools and the availability of human and animal subjects,
and I think that influences interpretability research.
Take the horizontal-line and vertical-line neurons
in the cat brain: they put the probe in and figured out
what each neuron responds to, one neuron at a time.
Why one at a time? Well, because you had one cat, poor, poor cat,
and you could only probe a few neurons at a time, right?
So that implied that a lot of interpretability research
is very focused on neuron-wise representations.
That was limited by our ability, our physical ability, to measure.
But in a neural network, you don't have to do that.
You can read everything at once, change the whole embedding to something else, and so on.
So that framing is actually an obstacle in our thinking sometimes.
So for Thursday: we're not having a lecture on Thursday.
So if you have any last-minute panics on your project
or think we might have some great insight to help you,
do come along, and you can chat to us about your final projects.