Stanford CS224N NLP with Deep Learning | 2023 | Lec. 19 - Model Interpretability & Editing, Been Kim
Today I'm delighted to introduce, as our final guest speaker, Been Kim. 00:00:11.480 |
Been Kim is a staff research scientist at Google Brain. 00:00:15.760 |
If you're really into Google-ology, those funny words at the beginning of the title mean something,
and what they mean is that Been's a good research scientist.
[LAUGH] So I discovered at lunch today that Been started out
studying mechanical engineering at Seoul National University.
But she moved on to, I don't know whether it's to better things or not.
But she moved on to computer science and did her PhD at MIT. 00:00:46.000 |
And there she started working on the interpretability of machine learning models.
I think she'll be talking about some different parts of her work.
But a theme that she's had in some of her recent work that I find especially
appealing as an NLP person is the idea that we should be using
higher-level, human-interpretable languages for communicating with machines.
So welcome, Been, looking forward to your talk, and go for it.
But then I live in Seattle, so this is pretty common. 00:01:36.960 |
So I still was able to see the blue sky today. 00:01:39.080 |
I was like, this works, I really like it here. 00:01:42.080 |
So today I'm going to share some of my dreams:
my dream of communicating with machines, and my journey chasing it.
So if you're in this class, you probably agree, you don't have to,
that large language models and generative models are pretty cool.
But you may also agree that they're a little bit frightening. 00:02:01.920 |
Not just because they're impressive, they're doing a really good job, but 00:02:07.120 |
also we're not quite sure where we're going with this technology. 00:02:12.040 |
Ten years out, will we look back and say that technology was a net positive?
Or will we say that was catastrophic, we didn't know that would happen?
Ultimately, what I would like to do, or maybe hopefully what we all want to do, 00:02:28.600 |
is to have this technology benefit us, humans. 00:02:32.560 |
I know in 10 years' time, or maybe 20 years or earlier,
my son is going to ask me, mom, did you work on this AI stuff?
And did you know how this would profoundly change our lives?
I really hope that I have some good things to say to him. 00:02:56.000 |
So my initial thought, and still my current thought,
is that our ultimate goal should be for this technology to benefit humanity.
There are lots of different ways it can benefit us.
But one way is to treat this technology like a colleague,
a colleague who is really good at something,
good enough at something that you want to learn from them.
One difference, though, is that in this case the colleague is kind of weird.
This colleague might have very different values, 00:03:40.720 |
it might have very different experiences in the world. 00:03:44.480 |
It may not care about surviving as much as we do. 00:03:48.080 |
Maybe mortality isn't really a thing for this colleague. 00:03:52.400 |
So you have to navigate that in our conversation. 00:03:55.760 |
So what do you do when you first meet somebody? 00:03:59.120 |
There's someone so different, what do you do? 00:04:01.040 |
You try to have a conversation to figure out: how do you do what you do?
How are you solving the decades-old protein folding problem?
How are you beating the world Go champion so easily, as it seems?
Are you using the same language, the science language that we use, 00:04:23.480 |
Or do you think about the world in a very different way? 00:04:27.320 |
And more importantly, how can we work together? 00:04:29.800 |
I have one alien that I really want to talk to, and it's AlphaGo. 00:04:37.000 |
So AlphaGo beat the world Go champion Lee Sedol in 2016.
Lee Sedol is from South Korea, I'm from South Korea.
It was such a big deal in South Korea and worldwide, I hope. 00:04:48.440 |
And in one of the matches, AlphaGo played this move called move 37. 00:05:02.600 |
And I remember the nine-dan commentator who's 00:05:05.360 |
been talking a lot throughout the matches suddenly got really quiet. 00:05:10.000 |
And he said, hmm, that's a very strange move. 00:05:14.640 |
And I knew then that something really interesting
had just happened in front of my eyes,
that this AlphaGo had made a move that we're going to remember forever.
And sure enough, this move turned the game around for AlphaGo
and led AlphaGo to win that match.
So Go players today continue to analyze this move. 00:05:42.120 |
So the question is, how did AlphaGo know this is a good move? 00:05:45.680 |
My dream is to learn something new by communicating with machines 00:05:54.720 |
and having a conversation, such that humanity
will gain some new angle on our important problems.
And this is not just about discovering new things.
To have a meaningful conversation with somebody so different,
you have to understand something about their values and their goals.
So in a way, solving this problem is a superset of solving AI safety, too.
Conversation assumes that we share some common vocabulary
to exchange meaning and, ultimately, knowledge.
And naturally, a representation plays a key role in this conversation. 00:06:40.880 |
On the left, and we can visualize this as two circles,
we have the representational space of what humans know.
Here in the left circle, there will be something like, this dog is fluffy.
And you know what that means, because we all share a somewhat similar vocabulary.
But on the right, we have something like move 37,
which humans have yet to have a representation for.
And the more overlap we have, the better conversation we're going to have. 00:07:21.520 |
Like here, everyone is learning something new. 00:07:24.320 |
So we can expand what we know by learning new concepts and vocabularies. 00:07:29.920 |
And doing so, I believe, will help us to build 00:07:32.800 |
machines that can better align with our values and our goals. 00:07:40.880 |
If you're curious about some of the work we're doing in this direction,
I'll point you to some references.
But today, I'm going to talk more about my hopes and dreams.
And hopefully, at the end of the day, your hopes and dreams too. 00:07:59.160 |
So first of all, I'm just going to set the expectation. 00:08:03.600 |
So at the end of this talk, we still won't know how move 37 was made.
In fact, the first part of this talk is going
to be about how we have moved backwards,
in terms of making progress on this journey.
And what I will cover is still a very, very small portion of our entire journey.
And of course, this journey wouldn't be like a singular path. 00:08:35.160 |
There will be lots of different branches coming in. 00:08:38.480 |
Core ideas like the transformer have helped many domains across the field.
So I'm also going to talk about some of our work
on understanding emergent behaviors in reinforcement learning.
And all the techniques that I'm going to talk about
are, in principle, applicable to NLP.
So coming back to our hopes and dreams, move 37. 00:09:05.120 |
So let's first think about how we might realize this dream. 00:09:09.440 |
And taking a step back, we have to ask, do we 00:09:12.400 |
have tools to first estimate what even machines know? 00:09:17.480 |
There have been many developments in machine learning over the last decade
to build tools to understand and estimate this purple circle, what machines know.
But what we have found is that there is a huge gap between what machines actually know
and what we think they know, what these tools tell us.
And identifying and bridging this gap is important
because these tools will form the basis for understanding
anything more ambitious, like move 37.
So saliency map is one of the popular interpretability methods. 00:10:02.840 |
For simplicity, let's say you're working with ImageNet and you have an image like this.
The explanation is going to take the form of the same image,
but where each pixel is associated with a number that
is supposed to imply the importance of that pixel.
The common interpretation is that the number indicates how the function,
the network's output, changes around that pixel value.
For example, for a pixel value x_j, around x_j the function might be flat,
or it might be going up like the green curve.
If it's flat like the blue curve or the red curve,
maybe that feature is irrelevant to predicting bird.
If the output goes up as the value of x_j increases,
then maybe that pixel is more important.
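To make this concrete, here is a minimal sketch of a gradient-based saliency map in PyTorch. It is an illustration of the general recipe described above, not the exact method from any specific paper; the model and the random image are placeholders.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None)  # placeholder model; any image classifier works
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image

logits = model(image)
target_class = logits.argmax(dim=1)          # e.g. the "bird" class in the talk's example
score = logits[0, target_class]
score.backward()                             # d(score)/d(pixel) for every pixel

# One importance number per pixel: max absolute gradient across color channels.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)   # shape (224, 224)
print(saliency.shape)
```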
So let's think about a few reasons why this gap might exist. 00:11:13.120 |
So this alien, again, these machines that we train,
works in a completely different, perhaps completely
different representational space, with very different experiences of the world.
So assuming that it sees the world just like we do,
like having the gestalt phenomenon, where a few dots look like a shape to us,
may simply be wrong.
Maybe our assumptions about these machines are wrong.
We thought it was doing X, but it was actually doing Y.
I'm going to dig deeper into some of these through our work.
So again, coming back to the earlier story about saliency
maps, we're going to play with some of these methods.
Now, in 2018, we stumbled upon a phenomenon
that was quite shocking. We were actually
trying to write a different paper, and we
realized that a trained network and an untrained network
produce saliency maps that look essentially the same.
We thought we had a bug, but it turned out we didn't.
They actually are indistinguishable, qualitatively and quantitatively.
But then we wondered, maybe it's a one-off case.
OK, what if the model had an error, one of these errors,
maybe out-of-distribution data at test time?
It turns out that that's also not quite it.
You might think, oh, maybe spurious correlation, then.
So this is more recent work where we theoretically
prove that some of these methods, very popular methods,
cannot do better than random guessing at certain tasks.
So I'm going to talk a little bit about that.
I just realized this is also work with Pang Wei Koh.
Now, the original paper that developed this method, IG, Integrated Gradients,
says it can be used for accounting for the contributions of each feature.
So what that means is that when the tool assigns
zero attribution to a pixel, we're going to say, OK,
the model doesn't depend on that pixel.
And in fact, this is how it's been used in practice.
People have used SHAP, for example, to reason about eligibility in real decisions.
These inferences seem pretty natural, and we wanted to know whether they were true.
And in fact, they're not: just because popular attribution methods
assign a zero or a non-zero attribution,
you cannot conclude anything about the actual model behavior.
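For reference, here is a rough sketch of how Integrated Gradients is typically computed: attributions are an averaged gradient along a straight path from a baseline to the input, scaled by the input difference. The tiny model and inputs below are stand-ins, and, as the talk argues, a zero attribution from this computation does not by itself license conclusions about the model.

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    # Interpolate between baseline and input: baseline + alpha * (x - baseline)
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1, 1)
    interpolated = baseline + alphas * (x - baseline)
    interpolated.requires_grad_(True)

    logits = model(interpolated)                     # one forward pass per step
    score = logits[:, target_class].sum()
    grads = torch.autograd.grad(score, interpolated)[0]

    avg_grad = grads.mean(dim=0, keepdim=True)       # average gradient along the path
    return (x - baseline) * avg_grad                 # same shape as the input

# Usage with stand-ins for a real classifier and image:
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.rand(1, 3, 8, 8)
attributions = integrated_gradients(model, x, torch.zeros_like(x), target_class=3)
print(attributions.shape)
```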
I learned about theorem proving from this project as well.
So I'll tell you the way that we pursued this particular work.
We're going to formulate it as a different problem,
in this case, as hypothesis testing.
Because once you formulate it as hypothesis testing, yes or no, you can be precise.
I got an attribution value from one of these tools.
And I have a mental model of, ah, this feature is important
to the model's prediction.
Then the hypothesis is whether that's true or not.
And what we showed is that, given whatever hypothesis of this form you
may have, you cannot do better than random guessing at
validating or invalidating it.
And what good is an explanation
if you just don't know whether it's better than random guessing?
The natural ways people use these attributions
all fall under this line of random guessing.
But maybe this still works in practice for some reason,
maybe there's an assumption in the theory that isn't quite met in practice.
So we also ran experiments: image datasets and bigger models,
and concrete end tasks people actually use these methods for,
like recourse or spurious correlations.
Recourse means, for example: if I change this one feature,
I would have a high chance of getting a loan.
So I tweak this one feature and see if my value goes up or down.
A very reasonable task, and people do it all the time.
So for two of these concrete end tasks, both of them
boil down to this hypothesis testing framework,
and the same random-guessing result holds empirically.
And I really hope you're not one of those people who concludes too much from these tools.
There is a reason, perhaps, why a lot of the time they can appear to work.
So again, your goal is to estimate the shape of the function around a point.
You sample around that point and evaluate the function.
If the output goes down, maybe the function is going down.
So that's the simplest way you can brute-force it.
But then the question is, how many samples do we need?
The more samples you have, the better estimate you have.
And how small a difference in output do you care about?
There is a trade-off between the resolution you care about and the number
of samples you need.
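Here is a minimal sketch of that brute-force idea, under the assumption that you can query the function freely: sample perturbations of one feature, fit a local slope, and note that the finer the output differences you care about, the more samples you need. The function f below is a toy placeholder.

```python
import numpy as np

def local_slope_estimate(f, x, feature, num_samples=100, radius=0.1):
    """Estimate the local slope of f along one feature by random sampling."""
    rng = np.random.default_rng(0)
    deltas = rng.uniform(-radius, radius, size=num_samples)
    xs = np.tile(x, (num_samples, 1))
    xs[:, feature] += deltas
    ys = np.array([f(row) for row in xs])
    # Least-squares slope of output vs. perturbation; near zero means locally flat.
    slope = np.polyfit(deltas, ys, deg=1)[0]
    return slope

f = lambda v: np.sin(v[0]) + 0.0 * v[1]       # feature 1 is irrelevant by construction
x = np.array([0.3, 5.0])
print(local_slope_estimate(f, x, feature=0))  # roughly cos(0.3): locally increasing
print(local_slope_estimate(f, x, feature=1))  # roughly 0: flat, likely irrelevant
```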
So if you worry about making some conclusion based on too few samples, that worry is justified.
We're currently working on even bigger models
to show, again and again, empirical evidence that, yes,
as functions get more complex,
of course, you're going to need more samples.
Because these methods have pretty good roots:
if, for a given function, I can test the conditions and say when the estimate is reliable,
that would still be very useful. That's ongoing work.
A question from the audience: do these findings
only apply to computer vision models?
No. It's a fairly simple proof that shows this holds for any function.
But it sort of seems like, for the last couple of years,
there have been at least dozens, maybe hundreds of people
using these attribution methods in their work.
I mean, is your guess that most of that work is invalid?
Or that a lot of it might be OK because whatever conditions
make it all right might often be there?
My hypothesis testing result shows that it's random.
So maybe, in the optimistic case, 50% of those papers got lucky.
But even so, if it helped humans at the end task, whatever
that might be, helping doctors be more efficient,
identifying bugs and whatnot, and if they did the validation
correctly with the right controlled testing setup,
then these noisy tools together with a human in the loop can still be useful, maybe.
It's just that I think we need to narrow down
our expectations so that our expectations are
aligned with what these tools can actually deliver.
So this is one of those papers that started out just like many other papers.
We thought, we're going to locate ethical knowledge in a language model,
and then maybe we're going to edit it to make the model more ethical.
And then we thought, oh, the ROME paper from David Bau's group gives us the tool.
But then we started digging in and implementing ROME.
So we did sanity check after sanity check, experiment after experiment.
And we ended up writing a completely different paper
than the one we set out to write.
So this paper, ROME, for those who are not familiar,
and I'm going into a little more detail in a bit,
rests on the assumption that because you can locate facts, you can mess with them there.
In fact, that's a lot of the time how localization and editing methods are motivated.
But what we show is that this assumption is actually not true:
knowledge can be edited just as well at layers outside of the ones identified by localization.
And you will see this in a little more detail in a bit.
In fact, the correlation between where the facts are located
and how well you can edit, if you edit that location,
turns out to be essentially zero.
What we mean by editing can mean a lot of different things. 00:25:08.960 |
So let's think about different ways to edit a fact.
We tried a bunch of things with little success.
We couldn't find an editing definition that actually
relates really well to what localization methods find.
So let's talk a little bit about ROME, how ROME works.
There are a lot of details missing from this slide, but here is the gist.
They have what's called the causal tracing algorithm.
They use a data set, the CounterFact data set, that has tuples of subject,
relation, and object, for example, the Space Needle is in Seattle.
And so you're going to have a clean run of "The Space Needle is in",
where you store every single module's activation, every single value.
And then in a second run, which they call the corrupted run,
you're going to add noise to the subject tokens, "The Space Needle".
Then you're going to intervene at every single one
of those modules by copying the clean activation into the corrupted run,
as if that particular module was never corrupted,
while you pretend everything else is held equal.
And then you ask, what is the probability of the right answer?
So in this case, the probability of the right answer, Seattle.
So at the end of the day, you'll find a graph
like that, where each layer and each token has a score:
if I intervene on that token at that layer,
how likely is it that I will recover the right answer?
Because if I recover the right answer, that must be the module where the fact lives.
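Here is a simplified, runnable sketch of the causal tracing recipe just described, on a toy stack of layers rather than a real transformer; in ROME the same idea is applied per token position and per module (MLP, attention) of a GPT-style language model.

```python
import torch

torch.manual_seed(0)
layers = torch.nn.ModuleList([torch.nn.Linear(16, 16) for _ in range(6)])
readout = torch.nn.Linear(16, 100)           # toy vocabulary of 100 "answers"
subject_embedding = torch.randn(1, 16)       # stands in for "The Space Needle is in"
answer_id = 42                               # stands in for the token "Seattle"

def run(x, patch_layer=None, clean_states=None):
    """Forward pass; optionally overwrite one layer's output with its clean value."""
    states = []
    h = x
    for i, layer in enumerate(layers):
        h = torch.tanh(layer(h))
        if patch_layer == i:                 # intervene: copy in the clean activation
            h = clean_states[i]
        states.append(h)
    probs = readout(h).softmax(dim=-1)
    return probs[0, answer_id].item(), states

# 1) Clean run: store every hidden state.
clean_prob, clean_states = run(subject_embedding)
# 2) Corrupted run: add noise to the subject embedding.
corrupted_input = subject_embedding + 0.5 * torch.randn_like(subject_embedding)
corrupted_prob, _ = run(corrupted_input)
# 3) Patch each layer with its clean state and see how much probability recovers.
for i in range(len(layers)):
    patched_prob, _ = run(corrupted_input, patch_layer=i, clean_states=clean_states)
    print(f"layer {i}: tracing effect = {patched_prob - corrupted_prob:+.4f}")
```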
I couldn't find a technical flaw in this algorithm. 00:27:21.640 |
But when we started looking at this, using the same model
and the same data set, things didn't add up.
ROME edits layer 6 because that was supposedly the best layer across this data set,
the claim being that most of the factual knowledge is stored around layer 6.
But we realized the truth looks like the graph on the right.
The black bars are a histogram of where the tracing effect
actually peaked if you test every single layer.
And as you can see, not a lot of facts fall into that region.
So in fact, every single fact has a different region where it peaks,
and layer 6, for a lot of facts, wasn't the best layer.
So we thought, what do we do to find this ethical knowledge?
We decided to actually do a sanity check first
on the factual-editing setup itself.
And that's when everything started falling apart.
The edit success, this is the rewrite score, how well the edit took.
And the tracing effect, this is localization, how strongly causal tracing points at that layer.
So when we plotted the relation between tracing effect and edit success,
there was essentially no relationship;
the correlation is actually slightly negative on this particular data set.
We're going to do this for every single layer. 00:30:30.920 |
The ROME authors acknowledged that this is a phenomenon that exists.
So the added benefit of localization could, at most,
help you pick where in the prompt, which subject token, to apply the edit.
For choosing the layer, in fact, you don't need to worry about localization at all.
But then we thought, maybe the particular definition of editing
that they used in ROME is the issue; maybe there is a different definition of editing
that correlates a lot better with localization.
So we tried a bunch of different definitions of editing.
What you see here is the R-squared value for four different methods.
And this wasn't just the case for ROME and MEMIT.
It was also the case for fine-tuning methods.
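For concreteness, this is the kind of analysis being described: per-fact tracing effects regressed against per-fact rewrite scores, summarized by R-squared. The data below is synthetic, standing in for the real measurements, where the R-squared values turned out to be near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_facts = 500
tracing_effect = rng.uniform(0, 1, n_facts)   # localization score per fact
rewrite_score = rng.uniform(0, 1, n_facts)    # edit success per fact (unrelated here)

# Fit rewrite_score ~ a * tracing_effect + b and compute R^2.
a, b = np.polyfit(tracing_effect, rewrite_score, deg=1)
pred = a * tracing_effect + b
ss_res = np.sum((rewrite_score - pred) ** 2)
ss_tot = np.sum((rewrite_score - rewrite_score.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.4f}")   # close to 0: localization explains very little
```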
You might feel that fact forcing, the last one, correlates a little better.
But still, compared to the impact of the choice of layer,
the tracing effect explains very little.
So the bottom line is, we couldn't locate and then edit the ethical knowledge in this project,
and we ended up doing a lot more in-depth analysis on this instead.
So in summary, does localization help editing?
For these particular editing methods, from what I know, the answer is no.
But if somebody can answer why that is for me, I would love to hear it,
because I feel like there should still be something there.
But what we found here is that it has nothing
to do with how well you can edit.
But a lot of the insights that they found in their paper
are still useful, like the finding that early-to-mid-range MLP layers
represent the factual associations, something we didn't know before.
But it is important not to validate localization methods
using editing, and maybe not to motivate editing methods using localization.
Those are the two things we now know that we shouldn't do.
Any questions on this one before I move on to the next one? 00:34:35.280 |
So that was another example of the gap between what machines know versus what we think machines know. 00:34:45.280 |
There's a good quote that says, "Good artists steal."
But we have to be really suspicious of everything that we do.
The lesson is that once you like your results too much, that's a bad sign.
Go home, have a beer, go to sleep.
And the next day, you come back and put your paper on your desk
and think, OK, now I'm going to review this paper.
So let's bring our attention back to our hopes and dreams. 00:35:28.720 |
So here, I came to realize maybe instead of just building 00:35:33.560 |
tools to understand, perhaps we need to do some groundwork. 00:35:38.920 |
Well, this alien that we've been dealing with, 00:35:41.680 |
trying to generate explanations, seems to be a different kind. 00:35:50.560 |
Study them as if they're a new species in the wild.
There are many ways to do that, but one of the ways is to do an observational study.
So you watch the species in the wild from far away,
and ask what their habitat is, what their values are, and whatnot.
And a second way, you can actually intervene and do a controlled study.
So we did something like this with reinforcement learning agents.
I'm going to talk about these two papers, starting with the first one.
The first one uses OpenAI's hide-and-seek domain.
If you haven't seen the video, just Google it and watch it;
I'm only covering the tip of the iceberg here.
At the end of this hide-and-seek episode, at some point,
the agents discover a bug in the physics system and exploit it.
There are also humanoid football and Capture the Flag from DeepMind.
Lots of interesting behaviors emerge that we observed.
The labels here, running and chasing and so on, are provided by OpenAI;
someone went painstakingly, one by one, and watched all these videos to label them.
The question is whether our tools can
help us explore this complex domain a little better.
So in the first work, we treat the agents as subjects of an observational study.
We use a generative model. Have you covered Bayesian generative models in this class?
Think about this as a fake or hypothetical data generation process.
I'm going to first generate a joint latent embedding space across the agents.
And each embedding, when it's conditioned on the state, generates an action:
the model says what the action is given the state and the embedding pair.
And you do inference to learn all these parameters.
This is like my hypothesis of how these new species might work;
we're going to try this and see if anything useful comes up.
And the way you do this, one of the ways you do this,
is you optimize a variational lower bound.
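A very rough sketch of this kind of generative model is below, assuming logged (state, joint action) pairs: an encoder infers a joint latent embedding, per-agent decoders reconstruct each agent's action from the state and the latent, and training maximizes a variational lower bound. Dimensions, architectures, and data are made up for illustration.

```python
import torch
import torch.nn as nn

state_dim, action_dim, latent_dim, n_agents = 10, 4, 2, 2

encoder = nn.Sequential(nn.Linear(state_dim + n_agents * action_dim, 64), nn.ReLU(),
                        nn.Linear(64, 2 * latent_dim))            # outputs mean and log-variance
decoders = nn.ModuleList([nn.Sequential(nn.Linear(state_dim + latent_dim, 64), nn.ReLU(),
                                        nn.Linear(64, action_dim))
                          for _ in range(n_agents)])
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoders.parameters()), lr=1e-3)

def elbo(state, actions):
    """state: (batch, state_dim); actions: (batch, n_agents) integer action ids."""
    one_hot = nn.functional.one_hot(actions, action_dim).float().flatten(1)
    mu, logvar = encoder(torch.cat([state, one_hot], dim=-1)).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()          # reparameterization trick
    recon = sum(nn.functional.cross_entropy(dec(torch.cat([state, z], dim=-1)), actions[:, i])
                for i, dec in enumerate(decoders))                # action given state and embedding
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return -(recon + kl)                                          # variational lower bound

# One toy training step on random data standing in for logged trajectories.
state = torch.randn(32, state_dim)
actions = torch.randint(0, action_dim, (32, n_agents))
opt.zero_grad()
loss = -elbo(state, actions)
loss.backward()
opt.step()
print(loss.item())
```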
If one gets into this exponential family business, the updates work out nicely, but I'll skip the details. 00:39:52.160 |
Here, we're going to pretend that we have two agents, one
controlling the back leg and one controlling the front leg of a running creature.
And on the right, we're showing that joint embedding space.
This is an interactive visualization:
you can select a little region in agent 1's space,
and you see it maps to a pretty tight region in agent 0's space.
And now I'm going to select somewhere else in agent 1's space
that maps to a kind of dispersed area in agent 0's space.
And this is an insight that we gained for this data only:
this tight-mapping business kind of represents
the good running behaviors, and the dispersed one the bad running behaviors.
That's something that you can discover pretty efficiently.
And now I'm going to show you something more interesting.
Of course, we had to do this, because we have the data:
we apply this framework to OpenAI's hide-and-seek.
This domain has a pretty complex structure, roughly 100-dimensional observations,
and we pretend that we don't know the labels given by OpenAI.
So again, this is the resulting embedding space, z omega and z alpha.
But the coloring is something the model didn't know before, the OpenAI labels.
There's a nice kind of pattern: we can roughly separate the behaviors
that make sense to humans, like running and chasing.
But remember the green and gray points, which are kind of everywhere.
So in this particular run of OpenAI's hide-and-seek,
some behaviors seem to be pretty separate and distinguishable in the embedding.
But in the case of orange, which is fort building,
the representation is a little more entangled for the seekers.
Perhaps if the seekers had built a more separate fort-building
representation, maybe they would have won this game.
So for this work, can we learn something interesting about
emergent behaviors by just simply observing the system?
The answer seems to be yes, at least for the domains we tried.
But remember that these methods don't give you labels for the clusters.
So you would have to go and investigate, click through, and watch the behaviors.
And if a cluster represents a superhuman concept, naming it will be hard.
And also, if you have access to the model and the reward, you can do more: you can intervene.
I'm going to talk about this work with Nico and Natasha next.
So here, this time, we're going to intervene.
The problem is that we're going to build a new multi-agent
system, going to build it from scratch, such that we can intervene on human-interpretable concepts.
And we're going to try to match the performance of the original system.
This builds on earlier work where we proposed a pretty simple idea, concept bottleneck models:
why don't we embed concepts in the middle of the network, as a bottleneck, so that you can intervene on them?
It's particularly useful in the medical setting,
where there are some features that doctors want to inspect or override.
So this is the work to extend this to the RL setting.
It's actually not as simple an extension as we thought.
We're building a concept bottleneck for each of the agents,
and training everything end to end.
Just think about the objective as: make the RL system work,
plus minimize the difference between the true concept values and the predicted ones.
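A minimal sketch of that objective might look like the following: a concept predictor feeds a policy head, and the loss is a standard policy-gradient term plus a concept-prediction term. All names, dimensions, and the loss weighting are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

obs_dim, n_concepts, n_actions = 20, 6, 5

concept_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_concepts))
policy_head = nn.Sequential(nn.Linear(n_concepts, 64), nn.ReLU(), nn.Linear(64, n_actions))

def forward(obs, intervened_concepts=None):
    concepts = concept_net(obs)
    if intervened_concepts is not None:       # at test time you can overwrite the concepts
        concepts = intervened_concepts
    return concepts, policy_head(concepts)

# Toy batch: observations, true concept labels, actions taken, and advantages.
obs = torch.randn(32, obs_dim)
true_concepts = torch.randn(32, n_concepts)   # e.g. positions, orientation, has-tomato
actions = torch.randint(0, n_actions, (32,))
advantages = torch.randn(32)                  # whatever the RL algorithm provides

concepts, logits = forward(obs)
log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(32), actions]
policy_loss = -(advantages * log_probs).mean()                   # make the RL system work
concept_loss = nn.functional.mse_loss(concepts, true_concepts)   # match the true concepts
loss = policy_loss + 1.0 * concept_loss
loss.backward()
print(policy_loss.item(), concept_loss.item())
```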
First domain: how many people have seen this cooking game before?
Yeah, it's a pretty commonly used cooking domain.
The agents wait for the tomato, bring the dishes, and so on.
Their goal is to deliver as many soups as possible.
And here, the concepts that we use are agent position,
orientation, whether the agent has a tomato, has a dish, et cetera, et cetera.
Things that are immediately available to you already.
And you can, of course, tweak the environment,
so you can make it so that they have to collaborate.
The idea is that you can detect this emergent behavior of coordination by intervening.
Suppose the RL system that we trained works.
This is the reward of agent 1 when there's no intervention.
And this is the average value when intervening on all concepts.
But I'm also going to show you each concept soon.
You can tell that on the right, when we intervene,
the reward deteriorated quite a lot for both of them.
And that's one way to see, ah, they are coordinating.
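The analysis loop itself is simple, something like the sketch below: roll out episodes while overwriting one concept at a time and compare average reward to the no-intervention baseline. The run_episode function here is a placeholder with made-up numbers that only mirror the qualitative pattern described in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
concept_names = ["team_position", "team_orientation", "has_tomato", "has_dish"]

def run_episode(intervene_on=None):
    # Placeholder: a real implementation would step the environment, overwrite the
    # named concept in the bottleneck at every step, and return the episode reward.
    # The numbers below are invented purely to mirror the qualitative finding.
    base = 10.0
    degradation = {"team_orientation": 6.0, "team_position": 2.0}.get(intervene_on, 0.5)
    return base - degradation + rng.normal(0, 0.3)

baseline = np.mean([run_episode() for _ in range(50)])
for name in concept_names:
    with_intervention = np.mean([run_episode(intervene_on=name) for _ in range(50)])
    print(f"{name:18s} reward drop: {baseline - with_intervention:5.2f}")
# The concept whose intervention hurts reward the most is the one the
# agents rely on most heavily for coordination.
```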
But this is what was really interesting to me.
This is the same graph as the one you saw before,
except I'm plotting the intervention for each concept separately.
So I'm intervening on teammate position, teammate orientation, and so on.
And when we intervene on teammate orientation,
the degradation of performance was the biggest,
to the extent that we believe that orientation carries
most of the information the agents use to coordinate.
Just a clarification question on the orientation:
is that the direction that the teammate is facing?
Because it seems like orientation would let you predict
where the teammate is going next.
Yes. Where were you when I was pulling my hair out over this?
Orientation is basically the best signal an agent can get about the next move of the other agent.
Because they're facing the pot, they're going to the pot.
Obvious to some, but I needed this graph to work that out.
And of course, you can use this to identify lazy agents.
If you look at the rightmost yellow agent, our friend,
it isn't really contributing anything.
And you can easily identify this by using this graph:
if I intervene on its concepts, it just doesn't impact any of the rewards.
So this second domain is about studying inter-agent social dynamics.
In this domain, there is a little bit of tension between the agents.
The green things are apples, which the agents collect for reward.
And you can see, if you have four agents trying
to collect apples, you can just steal someone else's.
The concepts here, again, are pretty common things:
positions, orientations, pollution, apple positions, et cetera.
So the story here is what happens when I intervene on agent 1:
how the other agents' rewards are impacted,
and the same for idle time and for the inter-agent distance.
So we're going to do a little more work here, but not a lot.
This is the simplest, dumbest way to build a graph.
Think of a data set that consists of features of movies, so length, genre, and so on.
And the simplest way to build a graph over the movies is to do a regression
of one movie's features on the others'.
And that gives me beta, which is a kind of coefficient for each pair.
And that beta represents the strength of the edges:
this movie is more related to this movie and not to that one.
Instead of movies, we're going to use the effects of interventions on concepts,
data that wouldn't have been available without our framework,
for reward, resources collected, and many other quantities.
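A minimal sketch of that graph-building step, assuming you have a matrix of per-agent quantities collected under interventions: regress each agent's column on the others and read the coefficients as edge strengths. The data below is synthetic, with one planted dependency.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_agents = 200, 4
X = rng.normal(size=(n_samples, n_agents))   # e.g. per-agent reward change per intervention
X[:, 0] += 0.8 * X[:, 3]                     # planted dependency: agent 1 and agent 4

adjacency = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    others = [j for j in range(n_agents) if j != i]
    model = Lasso(alpha=0.05).fit(X[:, others], X[:, i])
    adjacency[i, others] = model.coef_       # beta = strength of the edge from j to i

np.set_printoptions(precision=2, suppress=True)
print(adjacency)   # a strong edge appears between agent 1 (index 0) and agent 4 (index 3)
```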
And when you build this graph, at the end of the day, a strong edge
is nicely highlighted between agent 1 and agent 4, and only there,
contradicting the original hypothesis that we had.
It turns out that there are no strong edges where we expected them,
but there are strong edges between agent 1 and agent 4.
So we dug deeper into it, watched a lot of sessions.
And it turns out that the story was a lot more complicated.
When apple collecting fails, agents 1 and 2 get cornered in.
Agent 4 corners agents 1 and 2, the blue and yellow agents,
because of the way that we built this environment.
But the raw statistics wouldn't have told us this story.
In fact, there was no obvious correlation or coordination on the surface.
Only after building the graph did we realize this was the case.
A lot of the emergent behaviors that we want to detect are subtle like this.
And we really want to get to the truth of that
rather than having some surface-level statistics.
So, did we build a multi-agent system
that enables intervention and performs as well?
There's a graph that shows the red line and blue line roughly matching, so yes.
The caveat is that you need concept labels,
or you should have some way of getting those concepts.
So I did tell you that we're not going to know how move 37 was made by the end of this talk.
But let me tell you about the work
that I'm currently doing that I'm really excited about.
Will this understanding of move 37 happen before I retire? Who knows.
So we start, because this is all about research, right,
working towards my ultimate goal of understanding that move 37.
So before that, how many people here know AlphaZero, from DeepMind?
AlphaZero is a self-trained chess-playing machine
that has a higher Elo rating than any human
and beats Stockfish, which is arguably the strongest existing chess engine.
So in a previous paper, we tried to discover human chess
concepts inside AlphaZero, and when they emerge during training time.
We also compared what we call the distribution
of opening moves between humans and AlphaZero.
And as you can see, there's a pretty huge difference.
It turns out that AlphaZero can master, or supposedly master,
a wide variety of different types of openings.
So that begs the question: what does AlphaZero know that we don't?
We're actually almost there; we're about to evaluate.
So the goal of this work is to teach the world chess champion,
Magnus Carlsen, something new.
He's still champion in two categories, actually. 00:57:45.840 |
The way we do it is, we're going to discover new chess strategies
by explicitly forgetting existing chess strategies, which
we can express with existing human concepts.
And then we're going to learn a graph, this time between
the new concepts and the existing concepts,
so that we can get a little bit more of an idea of what
the new concept might mean.
And Magnus Carlsen. So, my favorite part about this work
is that the evaluation is quantitative.
So it's not just Magnus coming in and saying,
oh, your work is kind of nice, and saying nice things about it.
No, Magnus actually has to solve some puzzles.
And we will be able to evaluate whether he solved them or not.
This kind of work I can only do because of Lisa,
who is a chess champion herself, but also a PhD student at Oxford.
And she's going to be the ultimate pre-superhuman evaluator before we go to Magnus.
And given that AlphaZero has a really weird architecture,
and that's just the way that they decided to build it,
we identify or generate the board positions that express the new concept,
and look at what move it's going to make given that board position.
The evaluation: you give a board position and then ask Magnus to make a move.
Then we explain the concept, give Magnus more board
positions, and see if he can apply the concept that he just learned.
Yeah, so if I were to ask Stockfish to solve those puzzles, that's not quite the point,
because we are interested in whether we can teach a human.
That's actually an interesting thing that we could do,
but our goal is to just teach one superhuman, Magnus.
If I have, for example, 10,000 superhuman concepts,
and only three of them are digestible by Magnus, that's fine.
That would be a big win for this type of research.
To wrap up: we talked about the gap between what machines know
and what we think they know.
We talked about studying these aliens, these machines,
both observationally and by intervening.
There are many other ways to study a species.
And I'm not an expert, but anthropology and other humanities
fields would know a lot more about this.
And maybe, just maybe, we can try to understand move 37
through this chess project that I'm very excited about.
The question was whether certain interpretability techniques transfer from one modality to another.
I think, like, think about fairness research,
which builds on a strong mathematical foundation.
And that's applicable to any question around fairness.
But if your goal is to actually solve a fairness issue at hand,
you would have to customize it for a particular person.
You would have to customize it for a particular application.
And I think something similar is true for interpretability.
SHAP and IG are used across domains, like vision and text,
so that theory paper would be applicable across domains.
Other things are more domain-specific;
I don't even know how to think about agents in NLP yet.
So I saw some recent work in which some amateur Go players
were able to beat a superhuman Go program by exploiting a weakness.
And that weakness seemed like a concept that humans know but the machine doesn't.
I just want to know your thoughts about that.
Yeah, actually, it's funny you mentioned that.
If you understand the machine, you kind of know what its most unseen, unfamiliar situations are.
And Lisa guessed that if Lee Sedol had known something more
about AI, then maybe he would have tried to confuse AlphaGo in that way.
like the one that Magnus made a couple of days 01:04:53.180 |
The work that I've presented is pretty new.
But there has been a bit of discussion in the robotics
and reinforcement-learning-in-the-wild communities.
The things you can write a test for, like a unit test, are less of a worry,
because you're going to test before you deploy.
I think the biggest risk for any of these deployed systems
is the behaviors you didn't anticipate and can't test for.
So my work around the visualization and the other tools is saying:
here's a tool that helps you better discover those unanticipated behaviors.
You talked about a lot of ways in which we try to visualize or understand
what's going on inside the machine.
But I was wondering whether we could turn it around and have the machine explain to us,
using our language, what they're doing in their representations,
and then get the machine to do the translation for us
instead of us going into the machine to see it.
Because that's something that I kind of tried in my
previous work, called Testing with Concept Activation Vectors, TCAV.
So that was to map human concepts, human language, into a machine's representation space.
The challenge is, how would you do that for something
we don't have a vocabulary for, like move 37?
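For those who haven't seen it, here is a rough sketch of the TCAV idea: fit a linear classifier between activations of concept examples and random examples, take the normal to the boundary as the concept activation vector, and score how often gradients point along it. The activations and gradients below are random stand-ins rather than outputs of a real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 64
concept_acts = rng.normal(loc=0.5, size=(100, dim))   # activations of concept examples
random_acts = rng.normal(loc=0.0, size=(100, dim))    # activations of random examples

# 1) Concept Activation Vector: normal to the boundary separating the two sets.
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([concept_acts, random_acts]),
    np.array([1] * 100 + [0] * 100))
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# 2) TCAV score: fraction of class examples whose class-logit gradient (taken with
#    respect to this layer's activations) has a positive directional derivative along the CAV.
grads = rng.normal(size=(200, dim))                    # stand-in for real gradients
tcav_score = np.mean(grads @ cav > 0)
print(f"TCAV score: {tcav_score:.2f}")  # about 0.5 here, since everything is random
```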
If we only translate into our existing vocabulary,
there is going to be a lot of missing, valuable knowledge.
And even a word like penguin is a kind of proxy for what we think a penguin is.
Everyone thinks very differently about what a penguin is;
everyone is thinking of a different penguin right now.
Australia has the cutest penguin, the fairy penguin.
But I don't know how many people are thinking of that one.
Extend that to 100 concepts, and to composing those concepts, and it only gets harder.
I think for some applications, exclusively using
our existing vocabulary is fine.
But my ambition is that we shouldn't stop there.
We should benefit from these machines by having them teach us new things.
So the second thing you talked about, with ROME,
was that where knowledge is located in the network
isn't super correlated with where you'd like to edit.
Do you think that has any implications for the later work,
like trying to get strategies out of the embedding space?
Possibly, just because I feel like the ROME result, as well, suggests it's not the raw space that matters.
It's like some transformed version of our embedding space that matters.
So thinking about it as a raw vector might be a dead end.
In a couple of months, I might rethink my strategy.
One more thought: a lot of this stuff that we're trying to do here
is a study of neural networks to help us understand them,
the way the whole field of neuroscience is there to understand the human brain.
And there are some biases that we have inherited from neuroscience,
like its physical tools and the availability of human and animal subjects,
and I think that influences interpretability research.
Take the horizontal-line and vertical-line neurons
in the cat brain: they put the probe in and figured out
what each neuron responds to, one neuron at a time.
Why one at a time? Well, because you had one cat, poor, poor cat,
and you could only probe a few neurons at a time, right?
So that implied that a lot of interpretability research
is very focused on neuron-wise representations.
That was limited by our ability, our physical ability, to measure.
But in a neural network, you don't have to do that.
You can read everything at once, change the whole embedding to something else, and so on.
So that framing is actually an obstacle in our thinking sometimes.
So for Thursday: we're not having a lecture on Thursday.
So if you have any last-minute panics on your project
or think we might have some great insight to help you,
do come along, and you can chat to us about your final projects.