
Stanford CS224N NLP with Deep Learning | 2023 | Lec. 19 - Model Interpretability & Editing, Been Kim


Transcript

Today I'm delighted to introduce as our final guest speaker, Been Kim. Been Kim is a staff research scientist at Google Brain. If you're really into Googleology, those funny words at the beginning, like staff, sort of say how senior you are. And that means that Been's a good research scientist.

So I discovered at lunch today that Been started out studying mechanical engineering at Seoul National University. But she moved on to, I don't know whether it's better things or not. But she moved on to computer science and did her PhD at MIT. And there she started working on the interpretability and explainability of machine learning models.

I think she'll be talking about some different parts of her work. But a theme that she's had in some of her recent work that I find especially appealing as an NLP person is the idea that we should be using higher level human interpretable languages for communication between people and machines.

So welcome, Been, looking forward to your talk, and go for it. Thank you. Thanks for having me. It's an honor to be here. It's the rainiest Stanford I've ever seen. I got here last night. But then I live in Seattle, so this is pretty common.

So I still was able to see the blue sky today. I was like, this works, I really like it here. So today I'm going to share some of my dreams, chasing my dreams to communicate with machines. So if you're in this class, you probably agree, you don't have to, that large language models and generative models are pretty cool.

They're impressive. But you may also agree that they're a little bit frightening. Not just because they're impressive, they're doing a really good job, but also we're not quite sure where we're going with this technology. In 10 years out, will we look back and say, that technology was net positive?

Or will we say, that was catastrophic, we didn't know that that would happen. Ultimately, what I would like to do, or maybe hopefully what we all want to do, is to have this technology benefit us, humans. I know in 10 years' time, or maybe 20 years or earlier, my son is going to ask me, he's going to be like, Mom, did you work on this AI stuff?

I watched some of your talks. And did you know how this would profoundly change our lives? And what did you do about that? And I have to answer that question, and I really hope that I have some good things to say to him. So my initial thought, and still my current thought, is that if we want our ultimate goal to be benefiting humanity, why not directly optimize for it, why wait?

So how can we benefit? There's lots of different ways we can benefit. But one way we can benefit is to treat this like a colleague. You know, a colleague who are really good at something. This colleague is not perfect, but it's good at something enough that you want to learn something from them.

One difference is though, in this case, is that this colleague is kind of weird. This colleague might have very different values, it might have very different experiences in the world. It may not care about surviving as much as we do. Maybe mortality isn't really a thing for this colleague.

So you have to navigate that in our conversation. So what do you do when you first meet somebody? There's someone so different, what do you do? You try to have a conversation to figure out how do you do what you do? How are you solving decades old protein folding problem?

How are you beating the world Go champion so easily, or so it seems? Are you using the same language, the science language that we use, atoms, molecules? Or do you think about the world in a very different way? And more importantly, how can we work together? I have one alien that I really want to talk to, and it's AlphaGo.

So AlphaGo beat the world Go champion Lee Sedol in 2016. Lee Sedol is from South Korea, I'm from South Korea. I watched every single match. It was such a big deal in South Korea and worldwide, I hope. And in one of the matches, AlphaGo played this move called move 37. How many people watched the AlphaGo matches?

And how many people remember move 37? Yeah, a few people, right? And I remember the nine-dan commentator who's been talking a lot throughout the matches suddenly got really quiet. And he said, hmm, that's a very strange move. And I knew then that something really interesting has just happened in front of my eyes, that this is going to change something.

This AlphaGo had made something that we're going to remember forever. And sure enough, this move turned the game around for AlphaGo and led AlphaGo to win that match. So Go players today continue to analyze this move, and when people talk about it they still say, this is not a move a human would have played.

So the question is, how did AlphaGo know this is a good move? My dream is to learn something new by communicating with machines and having a conversation, and such that humanity will gain some new angle to our important problems like medicine and science and many others. And this is not just about discovering new things.

If you think about reward hacking, you have to have a meaningful conversation with somebody to truly figure out what their true goal is. So in a way, solving this problem is a superset of solving AI safety, too. So how do we have this conversation? Conversation assumes that we share some common vocabulary in order to exchange meaning and ultimately knowledge.

And naturally, a representation plays a key role in this conversation. On the left-- and we can visualize this on the left-- we say, this is a representational space of what humans know. On the right, what machines know. Here in left circle, there will be something like, this dog is fluffy.

And you know what that means, because we all share a somewhat similar vocabulary. But on the right, we have something like move 37, which humans don't yet have a representation for. So how do we have this conversation? Our representational spaces need to overlap. And the more overlap we have, the better conversation we're going to have.

Humans are all good at learning new things. Like here, everyone is learning something new. So we can expand what we know by learning new concepts and vocabularies. And doing so, I believe, will help us to build machines that can better align with our values and our goals. So this is the talk that I gave.

If you're curious about some of the work we're doing towards this direction, I highly recommend it. It's a YouTube video, a clear keynote, half an hour. You can fast-forward through it. But today, I'm going to talk more about my hopes and dreams. And hopefully, at the end of the day, your hopes and dreams too.

So first of all, I'm just going to set the expectation. So at the end of this talk, we still won't know how move 37 was made. Sorry. That's going to take a while. In fact, the first part of this talk is going to be about how we had to move backwards in order to make progress on this journey.

And still a very, very small portion of our entire journey towards understanding move 37. And of course, this journey wouldn't be like a singular path. There will be lots of different branches coming in. Core ideas like transformer helped many domains across. There will be similar here. So I'm going to talk in the part two some of our work on understanding emerging behaviors in reinforcement learning.

And all the techniques that I'm going to talk about is going to be in principle applicable to NLP. So coming back to our hopes and dreams, move 37. So let's first think about how we might realize this dream. And taking a step back, we have to ask, do we have tools to first estimate what even machines know?

There have been many developments in machine learning over the last decade to build tools to understand and estimate this purple circle. So are they accurate? Unfortunately, a lot of recent research shows that there is a huge gap between what machines actually know and what we think the machines know. And identifying and bridging this gap is important because these tools will form the basis for understanding that move 37.

So what are these tools? How many people are familiar with saliency maps? A lot, but you don't have to be. I'll explain what it is. So a saliency map is one of the popular interpretability methods. For simplicity, let's say ImageNet: you have an image like this. You have a bird.

The explanation is going to take the form of the same image, but where each pixel is associated with a number that is supposed to imply some importance of that pixel for the prediction on this image. And one definition of that importance is that the number indicates what the function looks like around this pixel.

So for example, if I have a pixel x_j, maybe around x_j the function moves up like the yellow curve, or the function is flat, or the function is going down like the green curve. And so if it's flat, like the blue curve or the red curve, maybe that feature is irrelevant to predicting bird. Maybe it's going up; then it's maybe more important, because as the value of x increases, the function value, here the prediction value, goes up.
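To make this concrete, here is a minimal sketch of the simplest, vanilla-gradient version of a saliency map. The speaker is describing the general family of methods, and methods like IG or SHAP differ in the details; the model, image tensor, and target class below are hypothetical placeholders.

```python
import torch

def gradient_saliency(model, image, target_class):
    """Vanilla gradient saliency: one importance number per pixel,
    taken as |d f_target / d pixel| (a local slope of the function)."""
    image = image.clone().requires_grad_(True)        # (1, C, H, W)
    score = model(image)[0, target_class]             # scalar prediction value
    score.backward()
    return image.grad.abs().max(dim=1)[0].squeeze(0)  # collapse channels -> (H, W)
```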

So let's think about a few reasons why this gap might exist. There are a few ways. Not exhaustive, they overlap a little bit, but helpful for us to think about.

Maybe our assumptions are wrong. So this alien, again, these machines that we train, may work in a completely different representational space, with very different experiences of the world. So we assume that it sees the world just like we do, like having the Gestalt phenomenon: there are a few dots.

Humans have tendency to connect them. Maybe machines have that too. Maybe not. So maybe our assumptions about these machines are wrong. Maybe our expectations are mismatched. We thought it was doing x, but it was actually doing y. Or maybe it's beyond us. Maybe it's showing something superhuman that humans just can't understand.

I'm going to dig deeper into some of these through our work. This is more recent work. So again, coming back to the earlier story about saliency maps, we're going to play with some of these methods. Now, in 2018, we stumbled upon a phenomenon that was quite shocking. We were actually trying to write a completely different paper at the time, of course.

But we were testing something, and we realized that a trained network and an untrained network have the same, very similar saliency maps. In other words, a random prediction and a meaningful prediction were giving me the same explanation. So that was puzzling. We thought we had a bug, but it turned out we didn't.
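As a hedged illustration of that comparison (the actual experiments used specific architectures, datasets, and several saliency methods), here is roughly what checking a trained model against a randomly initialized copy could look like; the ResNet choice and the cosine-similarity comparison are my own stand-ins, not the paper's protocol.

```python
import torch
import torchvision.models as models

def gradient_saliency(model, image, target_class):
    image = image.clone().requires_grad_(True)
    model(image)[0, target_class].backward()
    return image.grad.abs().max(dim=1)[0]

trained = models.resnet18(weights="IMAGENET1K_V1").eval()
untrained = models.resnet18(weights=None).eval()     # random weights, same architecture

image = torch.rand(1, 3, 224, 224)                   # stand-in for a real bird image
s_trained = gradient_saliency(trained, image, target_class=10)
s_untrained = gradient_saliency(untrained, image, target_class=10)

# If these two maps are hard to tell apart, the explanation reflects the input
# and the architecture more than anything the model actually learned.
sim = torch.nn.functional.cosine_similarity(
    s_trained.flatten(), s_untrained.flatten(), dim=0)
print(f"cosine similarity, trained vs. untrained saliency: {sim.item():.3f}")
```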

It actually is indistinguishable qualitatively and quantitatively. So that was shocking. But then we wondered, maybe it's a one-off case. Maybe it still works somehow in practice. So we tested that in a follow-up paper. OK, what if the model had an error, one of these errors? Maybe it has a labeling error.

Maybe it has a spurious correlation. Maybe it had out-of-distribution at test time. If we intentionally insert these bugs, can explanation tell us that there's something wrong with the model? It turns out that that's also not quite true. You might think that, oh, maybe spurious correlation. Another follow-up work also showed that this is also not the case.

So we were disappointed. But then still, we say, you know, maybe there's no theoretical proof of this. Maybe this is, again, a lab-setting test. We had grad students to test this system. Maybe there's still some hope. So this is more recent work where we theoretically prove that some of these methods, very popular methods, cannot do better than random.

So I'm going to talk a little bit about that. I'm missing one person. I'm missing Peng Wei in the author list. I just realized this is also work with Peng Wei. So let's first talk about our expectation. What is our expectation about this tool? Now, the original papers that developed these methods, IG and SHAP, talk about how IG can be used for accounting for the contributions of each feature.

So what that means is that when the tool assigns zero attribution to a pixel, we're going to say, OK, well, pixel is unused by the function. And that means that f will be insensitive if I perturb this x. And in fact, this is how it's been used in practice.
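Written out, the inference users are making (my paraphrase, not an equation from the talk) is roughly:

```latex
\alpha_i(f, x) = 0 \;\;\overset{?}{\Longrightarrow}\;\; f(x + \delta e_i) \approx f(x) \quad \text{for small } \delta,
```

where alpha_i is the attribution for feature i (from SHAP or IG) and e_i is that feature's unit vector. The point of the work is that this implication, natural as it looks, does not hold in general.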

This is a paper published in Nature. They used SHAP to figure out the eligibility criteria in a medical trial. What we show in this work is that none of these inferences that seemed pretty natural were true. And in fact, just because a popular attribution method tells you the attribution is x, you cannot conclude anything about the actual model behavior.

So how does that work? How many people here do theory proofs? A few. Great. I'll tell you. I learned about theory proving from this project as well. So the way that we pursued this particular work is to first think about this problem, and then formulate it into some other problem that we know how to solve.

So in this case, we formulate this as hypothesis testing. Because once you formulate it as hypothesis testing, yes or no, there are lots of tools in statistics you can use to prove this. So what is the hypothesis? The hypothesis is that I'm a user. I got an attribution value from one of these tools.

And I have a mental model of, ah, this feature is important or maybe not important. Then the hypothesis is whether that's true or not. And what we showed is that, given whatever hypothesis you may have, you cannot do better than random guessing at validating or invalidating this hypothesis.
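As a rough formalization, in my own phrasing rather than the paper's exact notation, the test for a feature i at a point x is:

```latex
H_0:\ \text{feature } i \text{ does not affect } f \text{ near } x
\qquad \text{vs.} \qquad
H_1:\ \text{feature } i \text{ does affect } f \text{ near } x,
```

and the claim is that any decision rule that only looks at the SHAP or IG attribution achieves a true-positive rate and true-negative rate that land on or below the random-guessing line, roughly TPR + TNR <= 1, which is the plot she describes next.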

And that means, yes, sometimes it's right. But you don't do hypothesis testing if you cannot validate yes or no. You just don't. Because what's the point of doing it if you just don't know if it's as good as random guessing? And the result is that, yes, for this graph, it's just a visualization of our result.

If you plot the true negative rate against the true positive rate, the diagonal line is random guessing: this corner is the worst method, that one is the best method, and everything at equal distance sits on this line. The methods that we know, SHAP and IG, all fall on or below this line of random guessing. That's bad news. But maybe this still works in practice for some reason.

Maybe there were some assumptions that we had that didn't quite hold in practice. So does this phenomenon hold in practice? The answer is yes. We now have more images and bigger models. But here we test two concrete end tasks that people care about in interpretability, or use these methods to do: recourse and spurious correlation.

So recourse, for those who are not familiar, is you're getting a loan. And you wonder whether, if I'm older, I would have a high chance of getting a loan. So I tweak this one feature and see if my value goes up or down. Very reasonable task if people do it all the time.

Pretty significant implication socially. So for two of these concrete end tasks, both of them boil down to this hypothesis testing framework that I talked about. They're all around the random guessing line, or worse than random guessing. So you might say, oh, no. This is not good. A lot of people are using these tools.

What do we do? We have a very simple idea about this. So people like developing complex tools. And I really hope you're not one of those people. Because a lot of times, simple methods work. Occam's razor. But also, simple methods are elegant. There is a reason, perhaps a lot of times, why they work.

They're simple. You can understand them. They make sense. So let's try that idea here. So again, your goal is to estimate a function shape. What do you do? Well, the simplest thing you do is you have a point of interest. You sample around that point and evaluate the function around that point.

If it goes up, maybe the function's going up. If it goes down, maybe the function's coming down. So that's the simplest way; you can brute force it. But then the question is, how many samples do we need? So here, this is the equation: you're lifting that random-guessing line upwards by adding an additional term.

It's proportional to the number of samples. The more samples you have, the better estimate you have. It makes sense. And it depends on the difference in output, how much resolution you care about. Do you care about 0.1 versus 0.2, or do you only care about slope 0 versus slope 1? That's the resolution you care about, and the number of features, of course.
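A minimal sketch of that brute-force baseline (the function, point, and numbers here are made up for illustration): perturb a feature around the point of interest, evaluate the model, and fit a local slope.

```python
import numpy as np

def local_slope(f, x, i, n_samples=100, eps=0.1, seed=0):
    """Estimate how f behaves around x along feature i by sampling."""
    rng = np.random.default_rng(seed)
    deltas = rng.uniform(-eps, eps, size=n_samples)
    xs = np.repeat(x[None, :], n_samples, axis=0)
    xs[:, i] += deltas
    ys = np.array([f(row) for row in xs])
    return np.polyfit(deltas, ys, deg=1)[0]   # least-squares slope around x

# Toy usage with a hypothetical model f and point of interest x:
f = lambda v: 3.0 * v[0] - 0.5 * v[1] ** 2
x = np.array([1.0, 2.0])
print(local_slope(f, x, i=0))   # ~ 3.0: feature 0 pushes the output up here
print(local_slope(f, x, i=1))   # ~ -2.0: feature 1 pushes it down at this point
```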

So if you worry about making some conclusion based on function shape, sample. Easy. So can we infer the model behavior using these popular methods? Answer is no. And this holds both theory and practice. We're currently working on even bigger models to show just again and again empirical evidence that, yes, it just really doesn't work.

Please think twice and three times before using these methods. And also, model-dependent sample complexity: if your function is kind of crazy, of course, you're going to need more samples. So what is the definition? How do we characterize these functions? And finally, we haven't quite given up yet. Because these methods have pretty good roots in economics, Shapley values and all that.

So maybe there is a much narrower condition under which these methods work. And we believe such a condition does exist. We just have to figure out when. Once we figure out what that condition is, then for a given function, I can test it and say, yes, I can use SHAP here.

Yes, I can use IG here. Or no, I can't. That would still be very useful. That's ongoing work. Before I go to the next one, any questions? Yes? For the findings you have about these models, do they only apply to computer vision models? Or do they apply to NLP models too?

Any model that has a function. Yeah, very simple. Simple, actually. It's a simplish proof that can show that for any function, this holds. Any other questions? Wonderful. Yeah, Chris? This may relate to your last bullet. But it sort of seems like for the last couple of years, there have been at least dozens, maybe hundreds of people writing papers using Shapley values.

I mean, is your guess that most of that work is invalid? Or that a lot of it might be OK because whatever conditions make it all right might often be there? So, two answers to that question. My hypothesis testing result shows that it's random. So maybe in the optimistic case, 50% of those papers, you hit it.

And on the other side, on the second note, even if maybe SHAP wasn't perfect, maybe it was kind of wrong, but even if it helped humans at the end task, whatever that might be, helped doctors to be more efficient, identifying bugs and whatnot, and if they did the validation correctly with the right control testing setup, then I think it's good.

You figured out somehow how to make these noisy tools work together with a human in the loop, maybe. And that's also good. And I personally really like the SHAP paper. And I'm good friends with Scott. And I love all his work. It's just that I think we need to narrow down our expectations so that our expectations are better aligned.

All right. I'm going to talk about another work of a kind of similar flavor. Now it's in NLP. So this is one of those papers, just like many other papers that we ended up writing, one of those serendipity papers. So initially, Peter came in as an intern. And we thought, we're going to locate ethical knowledge in these large language models.

And then maybe we're going to edit them to make them a little more ethical. So that was the goal. And then we thought, oh, the ROME paper from David Bau. And I also love David's work. Let's use that. That's the start of this work. But then we started digging into and implementing ROME.

And things didn't quite line up. So we did sanity check after sanity check, experiment after experiment. And we ended up writing a completely different paper, which I'm about to talk to you about. So this paper, ROME, for those who are not familiar, which I'm going into in a little more detail in a bit, is about editing a model.

So you first locate a piece of knowledge in a model. Like, the Space Needle is in Seattle. That's a fact, a piece of knowledge. You locate it. You edit it. Because you can locate it, you can mess with it to edit that fact. That's the whole promise of it. In fact, that's a lot of times how localization or editing methods are motivated in the literature.

But what we show is that this assumption is actually not true. And to be quite honest with you, I still don't quite get why this is not related. And I'll talk more about this, because this is a big question to us. This is pretty active work. So substantial fraction of factual knowledge is stored outside of layers that are identified as having the knowledge.

And you will see this in a little more detail in a bit. In fact, the correlation between where the facts are located and how well you will edit if you edit that location is completely uncorrelated. So they have nothing to do with each other. So we thought, well, maybe it's the problem with the definition of editing.

What we mean by editing can mean a lot of different things. So let's think about different ways to edit a thing. So we tried a bunch of things with little success. We couldn't find an editing definition that actually relates really well with localization methods, in particular with ROME.

So let's talk a little bit about ROME, how ROME works, super briefly. There's a lot of detail missed out on the slide, but roughly you will get the idea. So ROME is Meng et al., 2022. They have what's called the causal tracing algorithm. And the way it works is that you're going to run a model on this particular dataset, the CounterFact dataset, that has these tuples: subject, relation, and object.

The Space Needle is located in Seattle. And so you're going to have a clean run of "The Space Needle is in Seattle" one time. You store every single module, every single value, the activations. And then in the second run, which they call the corrupted run, you're going to add noise to the subject, "The Space Needle".

Then you're going to intervene at every single one of those modules by copying the clean module into the corrupted run, as if that particular module was never corrupted, as if noise was never added to that module. So it's a typical intervention case where, everything else being equal, if I restore just this one module, what is the probability of getting the right answer?

So in this case, the probability of the right answer, Seattle, given that I noised the model and intervened on it. So at the end of the day, you'll find a graph like that, where each layer and each token has a score: how likely is it that I will recover the right answer if I intervene on that token in that layer?
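Here is a hedged sketch of what one cell of that heatmap looks like in code. This is my simplified reconstruction, not the authors' implementation: it uses GPT-2 instead of GPT-J, a single noise sample instead of an average over many, and hardcodes a few things (the prompt, the noise scale, and which layer and token to restore).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Space Needle is located in the city of"
n_subject = len(tok.encode("The Space Needle"))   # token positions of the subject
answer_id = tok.encode(" Seattle")[0]             # first BPE piece of the object
inputs = tok(prompt, return_tensors="pt")

# 1) Clean run: store every layer's hidden states.
with torch.no_grad():
    clean_hidden = model(**inputs, output_hidden_states=True).hidden_states

def prob_answer(restore_layer=None, restore_token=None):
    """Corrupted run: noise the subject embeddings; optionally restore one
    (layer, token) hidden state from the clean run; return P(correct answer)."""
    handles = []

    def add_noise(module, inp, out):              # corrupt the subject tokens
        out = out.clone()
        out[0, :n_subject] += 3.0 * torch.randn_like(out[0, :n_subject])
        return out
    handles.append(model.transformer.wte.register_forward_hook(add_noise))

    if restore_layer is not None:
        def restore(module, inp, out):            # paste the clean activation back in
            hidden = out[0].clone()
            hidden[0, restore_token] = clean_hidden[restore_layer + 1][0, restore_token]
            return (hidden,) + out[1:]
        handles.append(model.transformer.h[restore_layer].register_forward_hook(restore))

    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    for h in handles:
        h.remove()
    return torch.softmax(logits, dim=-1)[answer_id].item()

p_corrupt = prob_answer()
p_restore = prob_answer(restore_layer=6, restore_token=n_subject - 1)
print(f"P(Seattle): corrupted {p_corrupt:.4f}, restoring layer 6 {p_restore:.4f}")
```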

Because if restoring a module recovers the right answer, that's the module. That's the module that stored the knowledge. A really reasonable algorithm. I couldn't find a technical flaw in this algorithm. I quite like it, actually. But when we started looking at this using the same model that they used, GPT-J, we realized that a lot of these facts-- so ROME uses just layer 6 to edit, because that was supposedly the best layer across this dataset to edit.

Supposedly most of the factual knowledge is stored in layer 6, and they showed editing success and whatnot. But we realized the truth looks like the graph on the right. So the red line is layer 6. Their extension paper, called MEMIT, edits multiple layers, the blue region. The black bars are a histogram of where the knowledge actually peaked if you test every single layer.

And as you can see, not a lot of facts fall into that region. So in fact, every single fact has a different region where it peaked. So layer 6, for a lot of facts, wasn't the best layer. But the editing really worked. It really works. And we were able to duplicate the results.

So we thought, what do we do to find this ethical knowledge? How do we find the best layer to edit? So that's where we started. But then we thought, you know what? Take a step back. We're going to actually do a sanity check first to make sure that tracing effect-- the tracing effect is the localization-- implies better editing results.

And that's when everything started to falling apart. So let's define some metrics first. The edit success-- this is the rewrite score, same score as Rome paper used. That's what we used. And the tracing effect-- this is localization-- is probably-- you can read the slide. So when we plotted the relation between tracing effect and rewrite score, the editing method, red line implies the perfect correlation.

So when we plotted the relation between the tracing effect and the rewrite score of the editing method, the red line implies perfect correlation. And that was our assumption, that they would be perfectly correlated, which is why we do localization to begin with. The actual line was the yellow one. It's close to zero. It's actually negative in this particular dataset. That is not even uncorrelated, it's anti-correlated. And we didn't stop there. We were so puzzled.

We did this for every single layer, and we computed the R-squared value: how much does the choice of layer, versus the localization, the tracing effect, explain the variance in edit success? If you're not familiar with R-squared, think about it as the importance of a factor.
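For reference, the usual definition, with y the observed edit success and y-hat the prediction from the factor in question:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
```

so an R-squared near 1 means the factor explains almost all of the variance, and near 0 means it explains almost none.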

And it turns out that layer takes 94%. Tracing effect is 0.16. And so we were really puzzled. We were scratching our head. Why is this true? But it was true across layer. We tried all sorts of different things. We tried different model. We tried different data set. It was all roughly the case.

So at this point, we contacted David, and we started talking about it, and we resolved it. They acknowledged that this is a phenomenon that exists. Yeah, John? So apart from the layer, the other way in which localization can happen is, are you looking at the correct token? Is that the other corresponding factor? Yeah.

Yeah, in this graph, the token is in-- So the added benefit of the rest of the localization could only help you look at which is the correct subject token, is that it? Yeah, yeah. So looking at any of the subject tokens is sort of fine, is what I should think of?

Yeah, yeah. The layer is just the biggest thing. That's the only thing you should care about. You care about editing? Layers. In fact, don't worry about localization at all. It's extra wasted compute, a carbon and climate effect. So that was our conclusion. But then we thought, maybe the particular definition of edit that they used in ROME was the issue.

Maybe there exists a definition of editing that correlates a lot better with localization. Because there must be. I'm still puzzled. Why is this not correlated? So we tried a bunch of different definitions of edits. You might inject an error. You might reverse the tracing. You might want to erase a fact.

You might want to amplify the fact. All these things. Like maybe one of these will work. It didn't. So the graph that you're seeing down here shows the R-squared value for four different methods. And this wasn't just the case for ROME and MEMIT. It was also the case for fine-tuning methods.

You want to look at the difference between the blue and orange bars; it represents how much the tracing effect influenced the R-squared value. As you can see, it's negligible. They're all the same. You might feel that fact forcing, the last one, has a little bit of hope.

But still, compared to the impact of layer, choice of layer, it's ignorable. So at this point, we said, OK, well, we can't locate the ethical knowledge at this project. We're going to have to switch the direction. And we ended up doing a lot more in-depth analysis on this. So in summary, does localization help editing?

No. The relationship is actually zero, for this particular editing method, which from what I know is pretty state of the art, and on the CounterFact data, it's not true. Are there any other editing methods that correlate better? No. But if somebody can answer this question for me, that will be very satisfying.

I feel like there should still be something there that we're missing. But causal tracing, I think what it does is it reveals the factual information when the transformer is passing forward. I think it represents where is the fact when you're doing that. But what we found here is that it has nothing to do with editing success.

Those two things are different. And we have to resolve that somehow. But a lot of the insights that they found in their paper are still useful, like the early- to mid-layer MLP representations at the last subject token representing the facts, something we didn't know before. But it is important not to validate localization methods using editing methods, now we know, and maybe not to motivate editing methods via localization.

Those are the two things now we know that we shouldn't do, because we couldn't find a relationship. Any questions on this one before I move on to the next one? I'm not shocked by this. I am shocked by this. I'm still so puzzled. There should be something. I don't know.

All right. So in summary of this first part, we talked about why the gap might exist, what machines know versus what we think machines know. There are three hypotheses. There are three ideas. Assumptions are wrong. Maybe our expectations are wrong. Maybe it's beyond us. There's a good quote that says, "Good artists steal.

I think good researchers doubt." We have to be really suspicious of everything that we do. And that's maybe the biggest lesson that I've learned over many years, that once you like your results so much, that's a bad sign. Come back, go home, have a beer, go to sleep. And next day, you come back and put your paper on your desk and think, OK, now I'm going to review this paper.

How do I criticize this? What do I not like about this paper? That's one way to look at it. Criticize your own research, and that will improve your thinking a lot. So let's bring our attention back to our hopes and dreams. It keeps coming back. So here, I came to realize maybe instead of just building tools to understand, perhaps we need to do some groundwork.

What do I mean? Well, this alien that we've been dealing with, trying to generate explanations, seems to be a different kind. So maybe we should study them as if they're like newbies to the field. Study them as if they're like new species in the wild. So what do you do when you observe a new species in the wild?

You have a couple of ways. But one of the ways is to do observational study. So you saw some species in the wild far away. First, you just kind of watch them. You watch them and see what are they like, what are their habitat, what are their values and whatnot.

And second way, you can actually intervene and do a control study. So we did something like this with reinforcement learning setup. I'm going to talk about these two papers, first paper. Emergent behaviors in multi-agent systems has been so cool. Who saw this hide and seek video by OpenAI? Yeah, it's so cool.

If you haven't seen it, just Google it and watch it. It's so fascinating. I'm only covering the tip of an iceberg in this. But at the end of this hide and seek episode, at some point, the agents discover a bug in this physical system and start anti-gravity flying in the air and shooting hiders everywhere.

It's a super interesting video. You must watch. So lots of that. And also humanoid football and capture the flag from deep mind. Lots of interesting behaviors emerging that we observed. Here's my favorite one. But these labels-- so here, these are labels that are provided by OpenAI, running and chasing, fort building, and ramp use.

And these ones were that a human or humans went painstakingly, one by one, watch all these videos and label them manually. So our question is, is there a better way to discover these emergent behaviors? Perhaps some nice visualization can help us explore this complex domain a little better. So that's our goal.

So in this work, we're going to, again, treat the agents like an observational study, like a new species. And we're going to do observational study. And what that means is that we only get to observe state and action pair. So where they are, what are they doing, what are they doing?

And we're going to discover agent behavior by basically clustering the data. That's all we're going to do. And how do we do it? Pretty simple. A generative model. Have you covered Bayesian generative models, graphical models? No? Gotcha. OK. So think about-- is that something you also teach? Yeah. So this is a graphical model.

Think about this as a fake or hypothetical data generation process. So how does this work? Like, I'm generating the data. I created this system. I'm going to first generate a joint latent embedding space that represents all the behaviors in the system. And then for each agent, I'm going to generate another embedding.

And each embedding, when it's conditioned on the state, is going to generate a policy. It's going to decide what it's going to do, what the action is, given the state and embedding pair. And then what that whole thing generates is what you see: the state and action pairs. So how does this work?

And then given this, you build a model. And you do inference to learn all these parameters. Kind of same business as neural network, but it's just have a little more structure. So this is completely made up, right? This is like my idea of how these new species might work.

And our goal is to-- we're going to try this and see if anything useful comes up. And the way you do this is-- one of the ways you do this is you optimize for a variation of lower bound. You don't need to know that. It's very cool, actually. If one gets into this exponential family business, very cool.

CS228. OK. So here's one of the results that we had. It's a domain called MuJoCo. Here, we're going to pretend that we have two agents, one controlling back leg and one controlling the front leg. And on the right, we're showing that joint embedding space z omega and z alpha.

While video is running, I'm going to try to put the video back. So now I'm going to select-- this is a visualization that we built online. You can go check it out. You can select a little space in agent 1 space. And you see it maps to pretty tight space in agent 0.

And it shows pretty decent running ability. That's cool. And now I'm going to select somewhere else in agent 1 that maps to kind of dispersed area in agent 0. It looks like it's not doing as well. And this is just an insight that we gain for this data only.

But I was quickly able to identify, ah, this tight mapping business kind of represents the good running behavior and bad running behaviors. That's something that you can do pretty efficiently. And now I'm going to show you something more interesting. So of course, we have to do this because we have the data.

It's here. It's so cool. So we apply this framework in the OpenAI's hide and seek. This has four agents. It looks like a simple game, but it has pretty complex structure, 100 dimensional observations, five dimensional action space. So in this work, remember that we pretend that we don't know the labels given by OpenAI.

We just shuffle them in the mix. But we can color them, our results, with respect to their labels. So again, this is the result of z omega and z alpha. The individual agents. But the coloring is something that we didn't know before. We just did it after the fact.

You can see in the z omega there's a nice kind of pattern, where we can roughly separate things in a way that makes sense to humans, to us. But remember, the green and gray are kind of everywhere, they're mixed.

The running and chasing, the blue dots, it seems to be pretty separate and distinguishable from all the other colors. And that kind of makes sense, because that's basis of playing this game. So if you don't have that representation, you have a big trouble. But in case of orange, which is fort building, it's a lot more distinguishable in hiders.

And that makes sense, because hiders are the ones building the fort. And seekers don't build the fort, so for seekers the orange is a little more entangled. Perhaps if seekers had built a more separate fort-building representation, maybe they would have won this game. So for this work: can we learn something interesting, emergent behaviors, by just simply observing the system?

The answer seems to be yes, at least for the domains that we tested. A lot more complex domains should be tested. But these are the ones we had. But remember that these methods don't give you names of these clusters. So you would have to go and investigate and click through and explore.

And if the cluster represents a superhuman concept, this is not going to help you. And I'll talk a little more about work where we do try to help with that. But this method is not going to help you there. And also, if you have access to the model and the reward signal, you should use it.

Why dump it? So in the next work, we do use it. I'm going to talk about this work with Nico and Natasha and Shai again. So here, this time, we're going to intervene. We're going to be a little intrusive, but hopefully we'll learn a little more. So the plan is that we're going to build a new multi-agent system, build it from scratch, such that we can do controlled testing.

But at the same time, we shouldn't sacrifice the performance. So we're going to try to match the performance of the overall system. We do succeed. I had this paper collaboration with folks at Stanford, actually, here in 2020, where we proposed this pretty simple idea, which is you have a neural network.

Why don't we embed concepts in the middle of the bottleneck, where one neuron represents trees, the other represents stripes, and just train the model end-to-end? And why are we doing this? Well, because then at inference time, you can actually intervene. You can pretend, you know, predicting zebra, I don't think trees should matter.

So I'm going to zero out this neuron and feed forward and see what happens. So it's particularly useful in the medical setting, where there are some features that doctors don't want. We can cancel out and test. So this is the work to extend this to RL setting. It's actually not as simple extension as we thought.

It came out to be pretty complex. But essentially, we're doing that. And we're building a concept bottleneck for each agent. And at the end of the day, what you optimize is what you usually do, typical PPO. Just think about this as making the whole system work, plus minimizing the difference between the true concepts and the estimated concepts.
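A hedged sketch of that idea for a single agent; the layer sizes, concept count, and intervention interface are invented for illustration, and the real system is a full multi-agent PPO setup rather than this toy.

```python
import torch
import torch.nn as nn

class ConceptBottleneckPolicy(nn.Module):
    """Observation -> predicted concepts (the bottleneck) -> action logits."""
    def __init__(self, obs_dim=100, n_concepts=8, n_actions=5):
        super().__init__()
        self.to_concepts = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                         nn.Linear(64, n_concepts))
        self.to_actions = nn.Sequential(nn.Linear(n_concepts, 64), nn.ReLU(),
                                        nn.Linear(64, n_actions))

    def forward(self, obs, intervene=None):
        concepts = self.to_concepts(obs)            # each unit ~ one named concept
        if intervene is not None:                   # e.g. {concept_index: forced value}
            concepts = concepts.clone()
            for idx, value in intervene.items():
                concepts[..., idx] = value          # overwrite at inference time
        return self.to_actions(concepts), concepts

def total_loss(policy_loss, concepts_pred, concepts_true, lam=1.0):
    """The usual PPO loss ('make the whole system work') plus the concept term."""
    return policy_loss + lam * nn.functional.mse_loss(concepts_pred, concepts_true)

# Intervention at test time: pretend this agent can't see its teammate's
# position (hypothetical concept index 0) and watch how behavior changes.
net = ConceptBottleneckPolicy()
logits, _ = net(torch.rand(1, 100), intervene={0: 0.0})
```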

That's all you do. Why are we doing this? You can intervene. You can pretend now agent two, pretend that you can't see agent one. What happens now? That's what we're doing here. We're going to do this in two domains. First domain, how many people saw this cooking game before?

Yeah, it's a pretty commonly used cooking domain in reinforcement learning, very simple. We have two agents, yellow and blue. And they're going to make soup. They can bring three tomatoes to the pot, and they get a reward. They wait for the soup and bring a dish to the cooking pot, and they finally get a reward.

Their goal is to deliver as many soups as possible, given some time. And here, concepts that we use are agent position, orientation, agent has tomato, has dish, et cetera, et cetera. Something that's immediately available to you already. And you can, of course, tweak the environment to make it more fun.

So you can make it that they have to collaborate. You can build a wall between them so that they have to work together in order to serve any tomato soup. Or you can make them freely available. You can work independently or together, whatever your choice. First, just kind of sanity check was that you can detect this emerging behavior of coordination versus non-coordination.

So in the impassable environment, when we made that environment, and the RL system that we trained worked, they were able to deliver some soups. Then you see what happens when we intervene. This graph, let me explain: this is the reward of agent one when there's no intervention.

So this is perfectly good world. And when there was an intervention. This is average value of intervening on all concepts. But I'm also going to show you each concept soon. If you compare left and right, you can tell that in the right, when we intervene, reward deteriorated quite a lot for both of them.

And that's one way to see, ah, they are coordinating. Because somehow intervening at this concept impacted a lot of their performance. But this is what was really interesting to me, and I'm curious. Anyone can guess. So this is the same graph as the one you saw before, but except I'm plotting for intervention for each concept.

So I'm intervening on teammate position, teammate orientation, teammate has tomato, et cetera, et cetera. It turns out that they are using-- or rather, when we intervene on teammate orientation, the degradation of performance was the biggest, to the extent that we believe that orientation had to do with their coordination. Can anyone guess why this might be?

It's not the position. It's orientation. Yes? Just a clarification question on the orientation. Is that like the direction that the teammate is facing? Yes. So it seems like orientation would let you predict where the teammate is heading? Yes, yes, that's right. Yes. Where were you when I was pulling my hair out over this question?

Yes, that's exactly right. And initially, I was really puzzled. Like, why not position? Because I expected it to be position. But that's exactly right. So the orientation is the first signal that an agent can get about the next move of the other agent.

Because they're facing the pot, they're going to the pot. They're facing the tomato, they're going to get the tomato. Really interesting intuition. Obvious to some, but I needed this graph to work that out. And of course, you can use this to identify lazy agents. If you look at the rightmost yellow agent, our friend, just chilling in the background.

And he's lazy. And if you train an RL agent, there's always some agents just hanging out. They just not do anything. And you can easily identify this by using this graph. If I intervene, it just doesn't impact any of their rewards. That one's me. So the second domain, we're going to look at a little more complex domain.

So this is studying inter-agent social dynamics. So in this domain, there is a little bit of tension. This is called Cleanup. We have four agents. They only get rewards if they eat apples. The yellow things-- or rather the green things-- are apples. But if nobody cleans the river, then the apples stop growing at all.

So somebody has to clean the river. And you can see, if you have four people trying to collect apples, you can just stay someone else's-- wait until someone else to clean the river and then collect the apples. And in fact, that's sometimes what happens. And concepts here, again, are pretty common things-- position, orientation, and pollution, positions, et cetera.

So when we first plotted the same graph as the previous domain, it tells a story. So the story here is that when I intervene on agent 1, it seems to influence agent 2 quite a lot, if you look at these three different graphs, how reward was impacted when I intervene on agent 1.

It's agent 3 and 4 are fine, but it seems that only agent 2 is influenced. Same with idle time, same with the inter-agent distance. So we were like, oh, maybe that's true. But we keep wondering. There's a lot going on in this domain. How do we know this is the case?

So we decided to take another step. So we're going to do a little more work here, but not a lot. We're going to build a graph to discover inter-agent relationships. This is simplest, dumbest way to build a graph. But again, I like simple things. So how do you build a graph?

Well, suppose that you're building a graph between movies. This is not what we do, but just to describe what we're trying to do. We have each row. We're going to build a matrix. Each row is a movie, and each column consists of features of these movies, so length, genre of the movie, and so on.

And the simplest way to build a graph is to do a regression. So exclude i-th row, and then we're going to regress over everyone else. And that gives me beta, which is a kind of coefficient for each of these. And that beta represents the strength of the edges. So this movie is more related to this movie and not the other movie.
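A minimal sketch of that "dumbest way" (the data here is random and the node and feature names are placeholders): regress each node's feature vector on all the other nodes' feature vectors, and use the coefficients as edge weights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def neighborhood_graph(X):
    """X: (n_nodes, n_features), one row per node (a movie, or an
    intervention on concept c of agent n). Returns an (n_nodes, n_nodes)
    matrix of regression coefficients used as edge strengths."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(X, i, axis=0)                 # every row except i
        reg = LinearRegression().fit(others.T, X[i])     # predict row i from the rest
        W[i, np.arange(n) != i] = reg.coef_              # betas = edges from node i
    return W

# Toy usage: 5 hypothetical nodes with 20 features each.
X = np.random.default_rng(0).normal(size=(5, 20))
print(np.round(neighborhood_graph(X), 2))
```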

And ta-da, you have a graph. It's the dumbest way. There's a lot of caveats to it. You shouldn't do this a lot of times, but this is the simplest way to do it. So we did the same thing here. Instead of movie, we're going to use intervention on concept C on agent N as our node.

And to build this matrix, we're going to use intervention outcome, which wouldn't have been available without our framework for reward, resource collected, and many other things. And when you build this graph, at the end of the day, you get betas that represent relationship between these interventions. So I had a graph of that matrix.

Apparently, I removed it before I came over. But imagine there was a matrix that is nicely highlighted between agent 1 and 4, and only there, contradicting the original hypothesis that we had. And this is the video of it. So when we stared at that matrix, it turned out that there are no strong edges between agent 1 and 2.

So we were like, that's weird. But there are strong edges between agent 1 and 4. So we dug deeper into it, watched a lot of sessions to validate what was happening. And it turns out that the story was a lot more complicated. Agent 1's orientation was important for agent 4. But when that fails, agents 1 and 2 get cornered.

And you can see that in the graph. Agent 4 gets agents 1 and 2, the blue and yellow agents, stuck in the corner together. They get stuck. And this is simply just accidental because of the way that we built this environment. It just happened. But the raw statistics wouldn't have told us this story, that this was completely accidental.

In fact, there was no correlation, no coordination between agent 1 and 2. But only after the graph, we realized this was the case. Now, this might be a one-off case. But you know what? A lot of emerging behaviors that we want to detect, a lot of them will be one-off case.

And we really want to get to the truth of that rather than having some surface-level statistics. So can we build multi-agent system that enables intervention and performs as well? The answer is yes. There's a graph that shows the red line and blue line roughly aligned. That's good news. We are performing as well.

But remember, these concepts — you need to label them, or you should have some way of getting those concepts, positions and orientations. That might be something that we would love to extend in the future. Before I go on, any questions? You shy? You shy? Cool. All right. So I did tell you that we're not going to know the solution to move 37.

I still don't. I still don't. But I'll tell you a little bit about work that I'm currently doing that I'm really excited about. We started thinking, you know what? Will this understanding of move 37 happen within my lifetime? And I was like, oh, maybe not. But I kind of want it to happen.

So we start-- this is all about research, right? You started carving out a space where things are a little bit solvable. And you try to attack that problem. So this is our attempt to do exactly that, to get a little closer to our ultimate goal, my ultimate goal, of understanding that move 37.

So before that, how many people here know AlphaZero from DeepMind? Yes. AlphaZero is a self-trained chess-playing machine that has a higher Elo rating than any human and beats Stockfish, and arguably no existing human can beat Stockfish. So in a previous paper, we tried to discover human chess concepts in this network.

So when does concept like material imbalance appear in its network, which layer, and when in the training time, and which we call what, when, and where plots. And we also compare the evolution of opening moves between humans and AlphaZero. These are the first couple moves that you make when you play chess.

And as you can see, there's a pretty huge difference. Left is human, right is AlphaZero. It turns out that AlphaZero can master, or supposedly master, a lot of variety of different types of openings. Openings can be very aggressive. Openings can be very boring. Could be very long range, targeting for long range strategy or short range.

Very different. So that begs the question, what does AlphaZero know that humans don't know? Don't you want to learn what that might be? So that's what we're doing right now. We're actually almost-- we're about to evaluate. So the goal of this work is to teach the world chess champion a new, superhuman chess strategy.

And we just got yes from Magnus Carlsen, who is the world chess champion. He just lost the match, I know. But he's still champion in my mind. He's still champion in two categories, actually. So the way that we are doing this is we're going to discover new chess strategy by explicitly forgetting existing chess strategy, which we have a lot of data for.

And then we're going to learn a graph, this time a little more complicated graph, by using the existing relationships between existing concepts so that we can get a little bit more idea of what the new concept might look like. And Magnus Carlsen-- so my favorite part about this work-- I talk about carving out.

My favorite part about this work is that the evaluation is going to be pretty clear. So it's not just like Magnus coming in and say, oh, your work is kind of nice, and say nice things about our work. No, Magnus actually has to solve some puzzles. And we will be able to evaluate him, whether he did it or not.

So it's like a kind of success and fail. But I'm extremely excited. This kind of work I can only do because of Lisa, who is a champion herself, but also a PhD student at Oxford. And she played against Magnus in the past, and many other chess players in the world.

And she's going to be the ultimate pre-superhuman filter to filter out these concepts before they eventually get to Magnus. So I'm super excited about this. I have no results, but it's coming up. I'm excited. Yes? The puzzles are actually pretty simple. So the way that we generate concepts is within the embedding space of AlphaZero.

And given that, because AlphaZero has a really weird architecture, every single latent layer in AlphaZero has the exact same spatial layout as a chessboard. That's just the way that they decided to do it. So because of that, we can actually identify or generate the board positions that correspond to that concept.

And because we have MCTS, we can predict what move it's going to make given that board position. Because at inference time, the whole AlphaZero pipeline is actually deterministic. So we have a lot of board positions. And that's all you need for puzzles. You give a board position and then ask Magnus to make a move.

We explain the concept and then give Magnus more board positions and see if we can apply that concept that he just learned. Yeah, so if I were to ask Stockfish to solve those puzzles, that would be a different question. Because we are interested in whether we can teach human, not Stockfish.

Stockfish might be able to do it. That's actually an interesting thing that we could do, now I think about it. But our goal is to just teach one superhuman. Like if I have, for example, 10,000 superhuman concepts, and only three of them are digestible by Magnus, that's a win.

That would be a big win for this type of research. Questions? Yeah, so wrap up. Small steps towards our hopes and dreams. We talked about the gap between what machines know versus what we think machines know. Three ideas why that might be true. The three different maybe angles we can try to attack and answer those questions and bridge that gap.

We talked about studying aliens, these machines, in observational studies or controlled studies. There are many other ways to study a species. And I'm not an expert, but anthropology and other humanities studies would know a lot more about this. And maybe, just maybe, we can try to understand move 37 at some point, hopefully within my lifetime, through this chess project that I'm very excited about.

Thank you. Thank you very much. Questions? You talked about interpretability research across NLP, vision, and RL. Do you think there's much hope for taking certain interpretability techniques from one modality into other modalities? And if so, what's the pathway? Hmm. So it depends on your goal. I think-- like, think about fairness research, which builds on strong mathematical foundation.

And that's applicable for any questions around fairness, or hopefully applicable. But then, once you-- if your goal is to actually solve a fairness issue at hand for somebody, the real person in the world, that's a completely different question. You would have to customize it for a particular person. You would have to customize it for a particular application.

So there are two venues. And I think similar is true interpretability, like the theory work that I talked about. SHAP and IG are used across domains, like vision, texts. So that theory paper would be applicable across the domain. Things like RL and the way that we build that generative model, you would need to test a little bit more to make sure that this works in NLP.

I don't even know how to think about agents in NLP yet. So we will need a little bit of tweaking. But both directions are fruitful. I have a question. So I saw some recent work in which some amateur Go players found a very tricky strategy to trip it up.

I think it was AlphaGo. And that seemed like a concept that humans know and machines don't, in that Venn diagram. I just want to know your thoughts about that. Yeah, actually, it's funny you mention that. Lisa can beat AlphaZero pretty easily. And it's a similar idea. Because you kind of know what the most unseen, out-of-distribution moves are.

And she can break AlphaZero pretty easily. And Lisa guessed that if Lee Sedol had known something more about AI, then maybe he would have tried to confuse AlphaGo. But the truth is, it's a high-stakes game. Lee Sedol is a famous star worldwide. So he wouldn't want to make a move that would be seen as a complete mistake, like the one that Magnus made a couple of days ago that got on the newsfeed everywhere, that mistake of the century.

And that probably hurts. Any other questions? The work that I've presented is pretty new. But there has been a bit of discussion in robotics about potentially applying it there. And of course, I can't talk about details. But the things that people doing reinforcement learning in the wild worry about are the surprises.

If you have a test for it, like if you have a unit test for it, you're never going to fail. Because you're going to test before you deploy. I think the biggest risk for any of these deployment systems is the surprises that you didn't expect. So my work around the visualization and others aim to help you with that.

So we may not know names of these surprises. But here's a tool that helps you better discover those surprises before someone else does or someone else gets harmed. Thanks so much for the talk. This is kind of an open-ended question. But I was wondering, we're talking about a lot of ways in which we try to visualize or understand what's going on in the representation inside the machine.

But I was wondering whether we could turn it around and try to teach machines to tell us what-- using our language, what they're doing in their representations. Like, if we build representations of ours and then get the machine to do the translation for us instead of us going into the machine to see it.

Yeah, great question. So it's a really interesting question. Because that's something that I kind of tried in my work, previous work, called Testing with Concept Activation Vectors. So that was to map human language into a machine's space so that they can only speak our language. Because I understand my language and just talk to me in my language.

The challenge is that, how would you do that for something like alpha 0? Like, we don't have a vocabulary for it, like move 37. Then there is going to be a lot of missing valuable knowledge that we might not get from the machine. So I think the approach has to be both ways.

We should leverage as much as we can. But acknowledging that, even that mapping, that trying to map our language to machines, is not going to be perfect. Because it's a kind of proxy for what we think a penguin is. There's a psychology research that says, everyone thinks very differently about what a penguin is.

Like, if I take a picture of a penguin, everyone is thinking different penguin right now. Australia has the cutest penguin, the fairy penguin. I'm thinking that, right? But I don't know how many people are thinking that. So given that, we are so different, machine's going to think something else.

So how do you bridge that gap? Extend that to 100 concepts and composing those concepts, it's going to go out of wild very soon. So there's pros and cons. I'm into both of them. I think some applications, exclusively just using human concepts are still very helpful. It gets you halfway.

But my ambition is that we shouldn't stop there. We should benefit from them by having them teach us new things that we didn't know before. Yeah? So the second thing you talked about, with ROME, was that where knowledge is located in the embedding space isn't super correlated with what you'd like to edit to change that knowledge.

Do you think that has any implications for the later stuff you talked about, like the chess thing? I don't know, like trying to locate, like to just get strategies in the embedding space might not be as helpful? Oh, what are the alternatives? I guess I don't know the alternatives, just because I feel like the ROME thing as well is not-- That's possible.

So it's like some transformed space of our embedding space in AlphaZero; maybe it's a function applied to that embedding space. So thinking about it as a raw vector might be a dead end. Could be. We'll see how this chess project goes. In a couple of months, I might rethink my strategy.

But interesting thought. Yeah? So I'm a psychology major, and I do realize that a lot of this stuff that we're trying to do here, like reasoning parts of the game, is how we figure out how our brains work. So do you think that this-- would there be stuff that moves that's applicable to neural networks?

And on the contrary, do you think there must be this interpretability of a study of neural network to help us understand stuff about our own brain? Yeah. Talk to Geoff Hinton. He would really like this. So I believe-- I mean, you probably know about this history. I think that's how it all started, right?

The whole neural network is to understand human brain. So that's the answer to your question. Interesting, however, in my view, there is some biases that we have in neuroscience because of the limitations of tools, like physical tools and availability of humans that you can poke in. I think that influences interpretability research.

And I'll give you an example of what I mean. So in a cat-- the horizontal-line and vertical-line neurons in the cat brain-- they put the probe in and figured out that this one neuron detects vertical lines. And you can validate it. It's really cool if you look at the video.

The video is still online. Yeah, what is it? Yes, yes, yes. So why did they do that? Well, because you had one cat, poor, poor cat. And we can only probe a few neurons at a time, right? So that implied a lot-- a fair amount of interpretability research actually looked at, is very focused on, neuron-wise representations.

This one neuron must be very special. I actually think that's not true. That was limited by our ability, the physical ability to probe organisms. But in a neural network, you don't have to do that. You can apply functions to embeddings. You can change the whole embedding to something else, overwrite it.

So that kind of thing is actually an obstacle in our thinking rather than helping. OK, maybe we should call it there. So for Thursday, we're not having a lecture on Thursday. There'll be TAs and me here. So if you have any last-minute panics on your project or think we might have some great insight to help you, we probably won't actually.

It'll be all right. Do come along, and you can chat to us about your final projects and we can give you help. That means that Been actually got to give the final lecture of CS224N today. So a round of applause for her.