
Stanford CS224N NLP with Deep Learning | 2023 | Lec. 19 - Model Interpretability & Editing, Been Kim


Whisper Transcript | Transcript Only Page

00:00:00.000 | [BLANK_AUDIO]
00:00:05.400 | Today I'm delighted to introduce as our final guest speaker, Been Kim.
00:00:11.480 | Been Kim is a staff research scientist at Google Brain.
00:00:15.760 | If you're really into Googleology, those funny words at the beginning,
00:00:19.480 | like staff, sort of says how senior you are.
00:00:22.480 | And that means that Been's a good research scientist.
00:00:25.280 | [LAUGH] So I discovered at lunch today that Been started out
00:00:30.960 | studying mechanical engineering at Seoul National University.
00:00:36.520 | But she moved on to, I don't know if it's better things or not.
00:00:40.360 | But she moved on to computer science and did her PhD at MIT.
00:00:46.000 | And there she started working on the interpretability and
00:00:49.120 | explainability of machine learning models.
00:00:52.480 | I think she'll be talking about some different parts of her work.
00:00:56.680 | But a theme that she's had in some of her recent work that I find especially
00:01:02.000 | appealing as an NLP person is the idea that we should be using
00:01:07.720 | higher level human interpretable languages for
00:01:11.280 | communication between people and machines.
00:01:15.200 | So welcome, Been, looking forward to your talk, and go for it.
00:01:20.160 | >> Thank you.
00:01:21.160 | >> [APPLAUSE] >> Thank you.
00:01:25.240 | Thanks for having me.
00:01:26.760 | It's an honor to be here.
00:01:28.720 | It's the rainiest Stanford I've ever seen.
00:01:31.880 | Last night, I got here last night.
00:01:33.760 | But then I live in Seattle, so this is pretty common.
00:01:36.960 | So I still was able to see the blue sky today.
00:01:39.080 | I was like, this works, I really like it here.
00:01:42.080 | So today I'm going to share some of my dreams,
00:01:45.120 | chasing my dreams to communicate with machines.
00:01:49.040 | So if you're in this class, you probably agree, you don't have to,
00:01:53.240 | that large language models and generator models are pretty cool.
00:01:57.480 | They're impressive.
00:01:59.040 | But you may also agree that they're a little bit frightening.
00:02:01.920 | Not just because they're impressive, they're doing a really good job, but
00:02:07.120 | also we're not quite sure where we're going with this technology.
00:02:12.040 | In 10 years out, will we look back and say, that technology was net positive?
00:02:17.160 | Or we will say, that was catastrophic, we didn't know that that would happen.
00:02:21.280 | Ultimately, what I would like to do, or maybe hopefully what we all want to do,
00:02:28.600 | is to have this technology benefit us, humans.
00:02:32.560 | I know in 10 years' time, or maybe, well, 20 years or earlier,
00:02:36.440 | my son's gonna ask me, he's gonna be like, mom, did you work on this AI stuff?
00:02:41.320 | I watched some of your talks.
00:02:43.680 | And did you know how this would profoundly change our lives?
00:02:48.680 | And what did you do about that?
00:02:50.560 | And I have to answer that question, and
00:02:53.480 | I really hope that I have some good things to say to him.
00:02:56.000 | So my initial thought, and still my current thought,
00:03:05.120 | is that if we want our ultimate goal to be to benefit humanity,
00:03:09.080 | why not directly optimize for it, why wait?
00:03:11.760 | So how can we benefit?
00:03:15.480 | There's lots of different ways we can benefit.
00:03:18.240 | But one way we can benefit is to treat this like a colleague.
00:03:22.640 | You know, a colleague who is really good at something.
00:03:26.280 | This colleague is not perfect, but
00:03:28.240 | it's good enough at something that you want to learn from them.
00:03:31.520 | One difference, though, in this case, is that this colleague is kind of weird.
00:03:37.560 | This colleague might have very different values,
00:03:40.720 | it might have very different experiences in the world.
00:03:44.480 | It may not care about surviving as much as we do.
00:03:48.080 | Maybe mortality isn't really a thing for this colleague.
00:03:52.400 | So you have to navigate that in our conversation.
00:03:55.760 | So what do you do when you first meet somebody?
00:03:59.120 | There's someone so different, what do you do?
00:04:01.040 | You try to have a conversation to figure out how do you do what you do?
00:04:07.320 | How are you solving the decades-old protein folding problem?
00:04:11.280 | How are you beating the world Go champion so easily, or so it seems?
00:04:17.520 | Are you using the same language, the science language that we use,
00:04:21.800 | atoms, molecules?
00:04:23.480 | Or do you think about the world in a very different way?
00:04:27.320 | And more importantly, how can we work together?
00:04:29.800 | I have one alien that I really want to talk to, and it's AlphaGo.
00:04:37.000 | So AlphaGo beat the world Go champion Lee Sedol in 2016.
00:04:41.040 | Lee Sedol is from South Korea, I'm from South Korea.
00:04:43.360 | I watched every single match.
00:04:44.800 | It was such a big deal in South Korea and worldwide, I hope.
00:04:48.440 | And in one of the matches, AlphaGo played this move called move 37.
00:04:53.960 | How many people watched AlphaGo matches?
00:04:57.680 | And how many people remember move 37?
00:05:00.920 | Yeah, a few people, right?
00:05:02.600 | And I remember the nine-dan commentator who's
00:05:05.360 | been talking a lot throughout the matches suddenly got really quiet.
00:05:10.000 | And he said, hmm, that's a very strange move.
00:05:14.640 | And I knew then that something really interesting
00:05:17.560 | has just happened in front of my eyes, that this
00:05:20.600 | is going to change something.
00:05:21.800 | This AlphaGo has made something that we're going to remember forever.
00:05:25.320 | And sure enough, this move turned around the game for AlphaGo
00:05:28.840 | and led AlphaGo to win that match.
00:05:33.160 | So Go players today continue to analyze this move,
00:05:36.440 | and when people talk about it, they still say
00:05:38.720 | this is not a move a human would have played.
00:05:42.120 | So the question is, how did AlphaGo know this is a good move?
00:05:45.680 | My dream is to learn something new by communicating with machines
00:05:54.720 | and having a conversation, and such that humanity
00:05:58.040 | will gain some new angle to our important problems
00:06:01.280 | like medicine and science and many others.
00:06:04.960 | And this is not just about discovering new things.
00:06:08.440 | If you think about reward hacking, you
00:06:11.280 | have to have a meaningful conversation with somebody
00:06:14.800 | to truly figure out what their true goal is.
00:06:18.200 | So in a way, solving this problem is a superset of solving AI safety, too.
00:06:23.240 | So how do we have this conversation?
00:06:29.080 | Conversation assumes that we share some common vocabulary
00:06:33.720 | in order to exchange meaning and ultimately knowledge.
00:06:36.880 | And naturally, a representation plays a key role in this conversation.
00:06:40.880 | On the left-- and we can visualize this on the left--
00:06:43.400 | we say, this is a representational space of what humans know.
00:06:47.440 | On the right, what machines know.
00:06:50.120 | Here in left circle, there will be something like, this dog is fluffy.
00:06:54.000 | And you know what that means, because we all share somewhat similar vocabulary.
00:06:59.360 | But on the right, we have something like move 37,
00:07:02.760 | which humans don't yet have a representation for.
00:07:06.200 | So how do we have this conversation?
00:07:12.160 | Our representational spaces need to overlap.
00:07:14.800 | And the more overlap we have, the better conversation we're going to have.
00:07:19.160 | Humans are all good at learning new things.
00:07:21.520 | Like here, everyone is learning something new.
00:07:24.320 | So we can expand what we know by learning new concepts and vocabularies.
00:07:29.920 | And doing so, I believe, will help us to build
00:07:32.800 | machines that can better align with our values and our goals.
00:07:36.160 | So this is the talk that I gave.
00:07:40.880 | If you're curious about some of the work we're doing towards this direction,
00:07:43.680 | I highly recommend it.
00:07:44.600 | It's a YouTube video.
00:07:45.640 | An ICLR keynote, half an hour.
00:07:47.280 | You can fast forward through it.
00:07:50.280 | But today, I'm going to talk more about my hopes and dreams.
00:07:53.880 | And hopefully, at the end of the day, your hopes and dreams too.
00:07:59.160 | So first of all, I'm just going to set the expectation.
00:08:03.600 | So at the end of this talk, we still don't know how the move 37 is made.
00:08:09.160 | Sorry.
00:08:09.840 | That's going to take a while.
00:08:12.360 | In fact, the first part of this talk is going
00:08:16.160 | to be about how we've moved backwards
00:08:21.640 | in terms of making progress on this journey.
00:08:24.480 | And it's still a very, very small portion of our entire journey
00:08:27.960 | towards understanding move 37.
00:08:31.680 | And of course, this journey wouldn't be like a singular path.
00:08:35.160 | There will be lots of different branches coming in.
00:08:38.480 | Core ideas like the transformer have helped across many domains.
00:08:42.560 | It will be similar here.
00:08:43.920 | So in part two, I'm going to talk about some
00:08:46.560 | of our work on understanding emergent behaviors in reinforcement learning.
00:08:51.560 | And all the techniques that I'm going to talk about
00:08:54.280 | are in principle applicable to NLP.
00:08:56.920 | So coming back to our hopes and dreams, move 37.
00:09:05.120 | So let's first think about how we might realize this dream.
00:09:09.440 | And taking a step back, we have to ask, do we
00:09:12.400 | even have tools to first estimate what machines know?
00:09:17.480 | There have been many developments in machine learning over the last decade
00:09:20.720 | to develop tools to understand and estimate this purple circle.
00:09:26.800 | So is that accurate?
00:09:28.880 | Unfortunately, much recent research shows
00:09:31.200 | that there is a huge gap between what machines actually know
00:09:35.880 | and what we think the machines know.
00:09:40.200 | And identifying and bridging this gap is important
00:09:43.400 | because these tools will form basis for understanding
00:09:46.640 | that move 37.
00:09:47.360 | So what are these tools?
00:09:51.800 | How many people are familiar with saliency maps?
00:09:55.280 | A lot, but you don't have to be.
00:09:56.480 | I'll explain what it is.
00:09:57.680 | So saliency map is one of the popular interpretability methods.
00:10:02.840 | For simplicity, let's say ImageNet: you have an image like this.
00:10:06.200 | You have a bird.
00:10:07.240 | The explanation is going to take a form of the same image,
00:10:10.920 | but where each pixel is associated with a number that
00:10:15.320 | is supposed to imply some importance of that pixel
00:10:20.760 | for prediction of this image.
00:10:23.520 | And one definition of that importance
00:10:26.160 | is that that number indicates how the function
00:10:29.000 | look like around this pixel.
00:10:31.240 | So for example, if I have a pixel x_i, maybe around x_i,
00:10:35.640 | the function moves up like the yellow curve,
00:10:38.400 | or function is flat, or function is going down like the green curve.
00:10:43.880 | And so if it's flat like a blue curve or red curve,
00:10:47.960 | maybe that feature is irrelevant to predicting bird.
00:10:51.040 | Maybe it's going up.
00:10:52.080 | Then it's maybe more important, because as the value of x increases,
00:10:55.200 | the function value goes up.
00:10:56.480 | The function value here is like the prediction value.
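A minimal sketch of the kind of gradient-based saliency map described here, assuming a PyTorch image classifier; `model`, `image`, and `target_class` are hypothetical placeholders, not anything from the lecture.

```python
# Minimal sketch of a vanilla-gradient saliency map; `model` and `image`
# are hypothetical placeholders for a trained classifier and an input.
import torch

def gradient_saliency(model, image, target_class):
    """Return |d score / d pixel| for every input pixel."""
    model.eval()
    x = image.clone().unsqueeze(0).requires_grad_(True)   # (1, C, H, W)
    score = model(x)[0, target_class]                     # the "function value"
    score.backward()                                      # gradient w.r.t. pixels
    # A large magnitude is read as "the function changes steeply around this pixel."
    return x.grad.abs().squeeze(0).sum(dim=0)             # (H, W) saliency map
```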
00:10:58.280 | So let's think about a few reasons why this gap might exist.
00:11:06.560 | There are a few ways.
00:11:07.440 | Not exhaustive, they overlap a little bit,
00:11:09.440 | but helpful for us to think about.
00:11:11.320 | Maybe assumptions are wrong.
00:11:13.120 | So this alien, again, these machines that we train,
00:11:16.360 | works in a completely different, perhaps completely
00:11:18.520 | different representational space, very different experiences
00:11:21.640 | about the world.
00:11:23.120 | So we assume that it sees the world just like we do,
00:11:26.840 | like having the gestalt phenomenon: there are a few dots,
00:11:30.120 | and humans have a tendency to connect them.
00:11:32.680 | Maybe machines have that too.
00:11:34.280 | Maybe not.
00:11:35.480 | So maybe our assumptions about these machines are wrong.
00:11:39.560 | Maybe our expectations are mismatched.
00:11:42.040 | We thought it was doing x, but it was actually doing y.
00:11:45.760 | Or maybe it's beyond us.
00:11:49.000 | Maybe it's showing something superhuman
00:11:51.080 | that humans just can't understand.
00:11:52.920 | I'm going to dig deeper into some of these, our work.
00:11:59.440 | This is more recent work.
00:12:01.480 | So again, coming back to the earlier story about saliency
00:12:04.600 | maps, we're going to play with some of these methods.
00:12:09.000 | Now, in 2018, we stumbled upon this phenomenon
00:12:14.560 | that was quite shocking, which was that we were actually
00:12:17.400 | trying to write a different paper, with different people.
00:12:20.920 | But we were testing something, and we
00:12:23.000 | realized that a trained network and an untrained network
00:12:26.400 | have the same, very similar saliency maps.
00:12:29.440 | In other words, random prediction
00:12:31.920 | and meaningful prediction were giving me
00:12:33.720 | the same explanation.
00:12:36.160 | So that was puzzling.
00:12:37.440 | We thought we had a bug, but it turned out we didn't.
00:12:40.800 | It actually is indistinguishable qualitatively and quantitatively.
00:12:45.920 | So that was shocking.
00:12:47.840 | But then we wondered, maybe it's a one-off case.
00:12:51.560 | Maybe it still works somehow in practice.
00:12:56.680 | So we tested that in a follow-up paper.
00:12:59.040 | OK, what if the model had an error, one of these errors?
00:13:02.760 | Maybe it has a labeling error.
00:13:04.440 | Maybe it has a spurious correlation.
00:13:06.600 | Maybe it had out-of-distribution at test time.
00:13:09.880 | If we intentionally insert these bugs,
00:13:12.440 | can explanation tell us that there's
00:13:14.800 | something wrong with the model?
00:13:17.360 | It turns out that that's also not quite true.
00:13:21.720 | You might think that, oh, maybe spurious correlation.
00:13:24.160 | Another follow-up work also showed
00:13:25.960 | that this is also not the case.
00:13:28.960 | So we were disappointed.
00:13:31.600 | But then still, we say, you know,
00:13:33.920 | maybe there's no theoretical proof of this.
00:13:38.240 | Maybe this is, again, a lab-setting test.
00:13:40.920 | We had grad students to test this system.
00:13:44.000 | Maybe there's still some hope.
00:13:48.400 | So this is more recent work where we theoretically
00:13:50.800 | prove that some of these methods, very popular methods,
00:13:54.760 | cannot do better than random.
00:13:56.400 | So I'm going to talk a little bit about that.
00:14:00.800 | I'm missing one person.
00:14:02.040 | I'm missing Pang Wei in the author list.
00:14:03.640 | I just realized this is also work with Pang Wei.
00:14:07.480 | So let's first talk about our expectation.
00:14:10.240 | What is our expectation about this tool?
00:14:13.400 | Now, the original papers that developed these methods, IG
00:14:18.160 | and SHAP, talk about how IG can
00:14:20.920 | be used for accounting for the contributions of each feature.
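For reference, the standard Integrated Gradients attribution for feature i of an input x, with respect to a baseline input x', is usually written as follows (a standard statement of the method, not a formula from the slides):

```latex
% Integrated Gradients attribution for feature i, with baseline x':
\mathrm{IG}_i(x) \;=\; (x_i - x'_i)\int_0^1
  \frac{\partial f\!\left(x' + \alpha\,(x - x')\right)}{\partial x_i}\, d\alpha
```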
00:14:25.560 | So what that means is that when the tool assigns
00:14:27.960 | zero attribution to a pixel, we're going to say, OK,
00:14:30.920 | well, pixel is unused by the function.
00:14:33.840 | And that means that f will be insensitive
00:14:36.560 | if I perturb this x.
00:14:40.480 | And in fact, this is how it's been used in practice.
00:14:43.840 | This is a paper published in Nature.
00:14:45.840 | They used SHAP to figure out the eligibility
00:14:49.720 | criteria in a medical trial.
00:14:53.720 | What we show in this work is that none
00:14:55.760 | of these inferences that seemed pretty natural were true.
00:15:01.240 | And in fact, just because a popular attribution method
00:15:03.920 | tells you the attribution is x,
00:15:07.240 | you cannot conclude anything about the actual model
00:15:10.520 | behavior.
00:15:12.640 | So how does that work?
00:15:15.640 | How many people here do theory proof?
00:15:17.960 | A few.
00:15:21.280 | Great.
00:15:21.960 | I'll tell you.
00:15:22.560 | I learned about theory proving from this project as well.
00:15:25.600 | So I'll tell you the way that we pursued this particular work
00:15:30.080 | is that you first think about this problem.
00:15:32.000 | And then we're going to formulate it into some other problem
00:15:35.080 | that we know how to solve.
00:15:37.040 | So in this case, we formulate this as hypothesis testing.
00:15:41.120 | Because once you formulate it as hypothesis testing, yes or no,
00:15:44.280 | there are lots of tools in statistics
00:15:45.900 | you can use to prove this.
00:15:48.320 | So what is hypothesis?
00:15:49.880 | The hypothesis is that I'm a user.
00:15:52.640 | I got an attribution value from one of these tools.
00:15:55.840 | And I have a mental model of, ah, this feature is important
00:15:59.920 | or maybe not important.
00:16:01.840 | Then the hypothesis is whether that's true or not.
00:16:06.040 | And what we showed is that given whatever hypothesis you
00:16:09.600 | may have, you cannot do better than random guessing,
00:16:14.360 | at validating or invalidating this hypothesis.
00:16:18.080 | And that means, yes, sometimes it's right.
00:16:20.680 | But you don't do hypothesis testing
00:16:22.640 | if you cannot validate yes or no.
00:16:24.760 | You just don't.
00:16:25.480 | Because what's the point of doing it
00:16:26.980 | if you just don't know if it's as good as random guessing?
00:16:29.920 | And the result is that, yes, for this graph,
00:16:36.640 | it's just a visualization of our result.
00:16:38.720 | If you plot true negatives and true positives,
00:16:41.280 | this line is random guessing: this corner
00:16:43.200 | is the worst method, that's the best method,
00:16:45.160 | and everything on the line is as good as random.
00:16:47.440 | Methods that we know, SHAP and IG,
00:16:50.600 | all fall on this line of random guessing.
00:16:55.280 | That's bad news.
00:16:57.080 | But maybe this still works in practice for some reason.
00:17:02.600 | Maybe there were some assumptions
00:17:03.960 | that we had that didn't quite meet in the practice.
00:17:07.520 | So does this phenomenon hold in practice?
00:17:11.560 | The answer is yes.
00:17:13.600 | We now have more images and bigger models.
00:17:16.200 | But here we test two concrete end tasks
00:17:19.720 | that people care about in interpretability,
00:17:21.960 | or use these methods for: recourse and spurious
00:17:25.040 | correlation.
00:17:26.120 | So recourse, for those who are not familiar,
00:17:28.080 | is you're getting a loan.
00:17:29.880 | And you wonder whether, if I'm older,
00:17:32.520 | I would have a high chance of getting a loan.
00:17:34.960 | So I tweak this one feature and see if my value goes up or down.
00:17:39.520 | Very reasonable task; people do it all the time.
00:17:41.880 | Pretty significant implications socially.
00:17:45.840 | So for two of these concrete end tasks, both of them
00:17:50.520 | boil down to this hypothesis testing framework
00:17:53.080 | that I talked about.
00:17:54.760 | They're all around the random guessing line,
00:17:57.440 | or worse than random guessing.
00:18:01.240 | So you might say, oh, no.
00:18:02.960 | This is not good.
00:18:04.000 | A lot of people are using these tools.
00:18:05.600 | What do we do?
00:18:06.920 | We have a very simple idea about this.
00:18:10.640 | So people like developing complex tools.
00:18:15.600 | And I really hope you're not one of those people.
00:18:18.280 | Because a lot of times, simple methods work.
00:18:22.640 | Occam's razor.
00:18:23.840 | But also, simple methods are elegant.
00:18:25.560 | There is a reason, perhaps a lot of times, why they work.
00:18:29.040 | They're simple.
00:18:30.400 | You can understand them.
00:18:31.760 | They make sense.
00:18:32.880 | So let's try that idea here.
00:18:35.480 | So again, your goal is to estimate a function shape.
00:18:38.680 | What do you do?
00:18:39.680 | Well, the simplest thing you do is
00:18:41.840 | you have a point of interest.
00:18:43.920 | You sample around that point and evaluate the function
00:18:47.040 | around that point.
00:18:48.320 | If it goes up, maybe function's going up.
00:18:50.760 | If it goes down, maybe function's coming down.
00:18:54.480 | So that's the simplest way you can brute force it.
00:18:58.480 | But then the question is, how many samples do we need?
00:19:01.720 | So here, this is the equation that you're
00:19:05.240 | lifting this line upwards that way
00:19:07.800 | by adding that additional term.
00:19:10.680 | It's proportional to number of samples.
00:19:12.760 | The more samples you have, the better estimation you have.
00:19:14.640 | It makes sense.
00:19:15.680 | And the differences in output: how much resolution do you care about?
00:19:18.720 | Do you care about 0.1 versus 0.2?
00:19:22.800 | Or do you only care about slope 0 versus slope 1?
00:19:26.440 | That's the resolution that you care about, and the number
00:19:29.520 | of features, of course.
00:19:31.160 | So if you worry about making some conclusion based
00:19:35.560 | on function shape, sample.
00:19:38.600 | Easy.
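A minimal sketch of that brute-force baseline, assuming nothing beyond black-box access to a scoring function `f`; the sample count and perturbation scale are arbitrary choices, and the function and feature index are placeholders.

```python
import numpy as np

def local_trend(f, x, feature_idx, n_samples=100, eps=0.1, seed=0):
    """Brute-force check of how f behaves around x along one feature:
    perturb that feature, evaluate f, and average the finite-difference slopes."""
    rng = np.random.default_rng(seed)
    # Keep perturbations away from zero so the slope estimates stay stable.
    deltas = rng.uniform(0.2 * eps, eps, size=n_samples) * rng.choice([-1.0, 1.0], size=n_samples)
    base = f(x)
    slopes = []
    for d in deltas:
        x_pert = np.array(x, dtype=float)
        x_pert[feature_idx] += d
        slopes.append((f(x_pert) - base) / d)
    return float(np.mean(slopes))   # > 0: locally going up, < 0: going down, ~0: flat
```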
00:19:42.280 | So can we infer the model behavior
00:19:45.640 | using these popular methods?
00:19:47.640 | Answer is no.
00:19:49.600 | And this holds both theory and practice.
00:19:53.480 | We're currently working on even bigger models
00:19:55.480 | to show just again and again empirical evidence that, yes,
00:19:59.800 | it just really doesn't work.
00:20:01.440 | Please think twice and three times
00:20:03.920 | before using these methods.
00:20:06.120 | And also, model-dependent sample complexity.
00:20:09.280 | If your function is kind of crazy,
00:20:11.200 | of course, you're going to need more samples.
00:20:13.320 | So what is the definition?
00:20:14.400 | How do we characterize these functions?
00:20:17.840 | And finally, we haven't quite given up yet.
00:20:20.480 | Because these methods have a pretty good root
00:20:22.440 | in economics and Shapley values and all that.
00:20:25.840 | So maybe there's a much narrower condition
00:20:30.080 | where these methods work.
00:20:31.880 | And we believe such condition does exist.
00:20:35.040 | We just have to figure out when.
00:20:37.120 | Once we figure out what that condition is,
00:20:39.960 | then in a given function, I can test it and say,
00:20:42.840 | yes, I can use shop here.
00:20:44.400 | Yes, I can use IG here.
00:20:45.840 | Or no, I can't.
00:20:47.520 | That would be still very useful for ongoing work.
00:20:51.800 | Before I go to the next one, any questions?
00:20:56.040 | For the findings you have about these models,
00:20:59.400 | do they only apply to computer vision models?
00:21:01.680 | Or do they apply to NLP models as well?
00:21:04.320 | Any model that has a function.
00:21:06.160 | [LAUGHTER]
00:21:09.080 | Yeah, very simple.
00:21:10.240 | Simple, actually.
00:21:10.920 | It's a simple-ish proof that shows this holds for any function.
00:21:16.920 | Any other questions?
00:21:17.960 | Wonderful.
00:21:21.760 | Yeah, Chris?
00:21:22.280 | This maybe relates to your last bullet.
00:21:24.560 | But it sort of seems like for the last couple of years,
00:21:28.800 | there have been at least dozens, maybe hundreds of people
00:21:32.440 | writing papers using Shapley values.
00:21:35.320 | I mean, is your guess that most of that work is invalid?
00:21:42.520 | Or that a lot of it might be OK because whatever conditions
00:21:47.600 | that it's all right might often be being there?
00:21:51.320 | So two answers to that question.
00:21:53.680 | My hypothesis testing result shows that it's random.
00:21:57.760 | So maybe, in the optimistic case, 50% of those papers
00:22:04.280 | got it right.
00:22:06.480 | And on the other side, on the second note,
00:22:09.840 | even if maybe SHAP wasn't perfect,
00:22:12.280 | maybe it was kind of wrong.
00:22:14.000 | But even if it helped human at the end task, whatever
00:22:17.440 | that might be, help doctors to be more efficient,
00:22:19.720 | identifying bugs and whatnot, and if they did the validation
00:22:22.920 | correctly with the right control testing setup,
00:22:26.640 | then I think it's good.
00:22:28.120 | You figured out somehow how to make
00:22:29.840 | this noisy tools together work with human in the loop, maybe.
00:22:33.320 | And that's also good.
00:22:34.600 | And I personally really like the SHAP paper.
00:22:37.240 | And I'm a good friend with Scott.
00:22:39.240 | And I love all his work.
00:22:41.160 | It's just that I think we need to narrow down
00:22:43.120 | our expectations so that our expectations are
00:22:45.520 | better aligned.
00:22:46.320 | All right.
00:22:50.320 | I'm going to talk about another work that's
00:22:52.240 | kind of similar flavor.
00:22:54.160 | Now it's in NLP.
00:22:56.800 | So this is one of those papers, just like the many other papers
00:23:00.560 | that we ended up writing.
00:23:02.680 | One of those serendipity papers.
00:23:04.480 | So initially, Peter came up as an intern.
00:23:07.560 | And we thought, we're going to locate ethical knowledge
00:23:10.360 | in this large language models.
00:23:12.360 | And then maybe we're going to edit them to make them
00:23:14.760 | a little more ethical.
00:23:15.800 | So that was the goal.
00:23:17.000 | And then we thought, oh, the Rome paper from David Bauer.
00:23:19.520 | And I also love David's work.
00:23:21.560 | And let's use that.
00:23:22.840 | That's the start of this work.
00:23:24.960 | But then we started digging into and implementing ROME.
00:23:28.560 | And things didn't quite line up.
00:23:30.960 | So we do sanity check, experiment after sanity check.
00:23:34.320 | And we ended up writing a completely different paper,
00:23:36.840 | which I'm about to talk to you about.
00:23:39.960 | So this paper, the Rome, for those who are not familiar,
00:23:45.200 | which I'm going into a little more detail in a bit,
00:23:47.600 | is about editing a model.
00:23:49.200 | So you first locate a knowledge in a model.
00:23:52.600 | Like the space needle is in Seattle.
00:23:54.640 | That's a fact, your knowledge.
00:23:56.440 | You locate them.
00:23:57.560 | You edit them.
00:23:58.960 | Because you can locate them, you can mess with it
00:24:02.040 | to edit that fact.
00:24:03.280 | That's the whole promise of it.
00:24:05.000 | In fact, that's a lot of times how localization or editing
00:24:08.040 | methods were motivated in the literature.
00:24:11.480 | But what we show is that this assumption is actually not
00:24:14.720 | true.
00:24:16.480 | And to be quite honest with you, I still
00:24:18.680 | don't quite get why this is not related.
00:24:22.800 | And I'll talk more about this, because this
00:24:24.600 | is a big question to us.
00:24:26.480 | This is pretty active work.
00:24:29.520 | So substantial fraction of factual knowledge
00:24:33.720 | is stored outside of layers that are identified
00:24:37.560 | as having the knowledge.
00:24:39.480 | And you will see this a little more detail in a bit.
00:24:44.600 | In fact, the correlation between where the location, where
00:24:48.880 | the facts are located, and how well you will edit
00:24:52.120 | if you edit that location is completely correlated,
00:24:55.920 | uncorrelated.
00:24:57.280 | So they have nothing to do with each other.
00:25:00.800 | So we thought, well, maybe it's the problem
00:25:04.360 | with the definition of editing.
00:25:06.160 | What we mean by editing can mean a lot of different things.
00:25:08.960 | So let's think about different ways to edit a thing.
00:25:13.040 | So we tried a bunch of things with little success.
00:25:16.280 | We couldn't find an editing definition that actually
00:25:19.480 | relates really well with localization methods,
00:25:22.240 | like in particular with ROME.
00:25:24.560 | So let's talk a little bit about ROME, how ROME works,
00:25:29.840 | super briefly.
00:25:30.760 | There's a lot of details missed out on the slide,
00:25:32.800 | but roughly you will get the idea.
00:25:34.920 | So ROME is Meng et al., 2022.
00:25:38.480 | They have what's called causal tracing algorithm.
00:25:41.920 | And the way it works is that you're
00:25:43.640 | going to run a model on this particular data
00:25:46.440 | set, counterfact data set, that has this tuple, subject,
00:25:51.600 | relation, and object.
00:25:52.800 | The space needle is located in Seattle.
00:25:56.280 | And so you're going to have a clean run of the space needle
00:25:59.720 | is in Seattle one time.
00:26:01.600 | You store every single module, every single value,
00:26:04.200 | activations.
00:26:05.960 | And then in the second run, which they call corrupted run,
00:26:09.680 | you're going to add noise in the space needle is--
00:26:13.200 | or the space.
00:26:15.280 | Then you're going to intervene at every single one
00:26:20.080 | of those modules by copying the clean module into the corrupted run.
00:26:26.320 | So it's as if that particular module was never corrupted,
00:26:31.480 | noise was never added to that module.
00:26:34.360 | So it's a typical intervention case
00:26:37.160 | where you pretend everything else being equal.
00:26:40.720 | If I change just this one module,
00:26:43.320 | what is the probability of having the right answer?
00:26:46.400 | So in this case, probability of the right answer,
00:26:48.480 | Seattle, given that I know is the model
00:26:52.080 | and I intervened on it.
00:26:55.000 | So at the end of the day, you'll find a graph
00:26:57.440 | like that where each layer and each token has a score.
00:27:02.560 | If I intervene on that token in that layer,
00:27:07.160 | How likely is it that I will recover the right answer?
00:27:10.960 | Because if I recover right answer, that's the module.
00:27:13.280 | That's the module that stored the knowledge.
00:27:16.240 | Really reasonable algorithm.
00:27:17.960 | I couldn't find technical flaw in this algorithm.
00:27:20.320 | I quite like it, actually.
00:27:21.640 | But when we start looking at this using the same model
00:27:28.320 | that they used, GPT-J, we realize
00:27:32.160 | that a lot of these facts--
00:27:34.840 | so ROME uses just layer 6 to edit,
00:27:38.080 | because that was supposedly the best layer across this data set
00:27:42.200 | to edit.
00:27:42.720 | Most of the factual knowledge is stored in layer 6.
00:27:45.360 | And they showed editing success and whatnot.
00:27:49.120 | But we realized the truth looks like the graph on the right.
00:27:52.560 | So the red line is the layer 6.
00:27:57.480 | Their extension paper, called MEMIT,
00:28:01.360 | edits multiple layers, the blue region.
00:28:01.360 | The black bars are histogram of where the knowledge was
00:28:04.840 | actually peaked if you test every single layer.
00:28:08.600 | And as you can see, not a lot of facts fall into that region.
00:28:12.000 | So in fact, every single fact has a different region
00:28:14.920 | where it peaks.
00:28:16.160 | So layer 6, for a lot of facts, wasn't the best layer.
00:28:20.400 | But the editing really worked.
00:28:22.120 | It really works.
00:28:23.200 | And we were able to duplicate the results.
00:28:26.080 | So we thought, what do we do to find this ethical knowledge?
00:28:30.680 | How do we find the best layer to edit?
00:28:33.080 | So that's where we started.
00:28:34.720 | But then we thought, you know what?
00:28:36.360 | Take a step back.
00:28:37.640 | We're going to actually do a sanity check first
00:28:40.520 | to make sure that tracing effect--
00:28:43.200 | the tracing effect is the localization--
00:28:46.640 | implies better editing results.
00:28:49.560 | And that's when everything started falling apart.
00:28:53.840 | So let's define some metrics first.
00:28:56.080 | The edit success-- this is the rewrite score,
00:28:59.360 | same score as Rome paper used.
00:29:01.600 | That's what we used.
00:29:02.960 | And the tracing effect-- this is localization--
00:29:06.160 | is probably-- you can read the slide.
00:29:10.040 | So when we plotted the relation between tracing effect
00:29:13.680 | and rewrite score, the editing method,
00:29:18.200 | red line implies the perfect correlation.
00:29:22.080 | And that was our assumption, that there
00:29:23.800 | will be perfectly correlated, which
00:29:25.480 | is why we do localization to begin with.
00:29:28.480 | The actual line was yellow.
00:29:30.840 | It's close to zero.
00:29:32.040 | It's actually negative in this particular data set.
00:29:36.440 | That is not even uncorrelated.
00:29:37.680 | It's anti-correlated.
00:29:39.920 | And we didn't stop there.
00:29:41.000 | We were so puzzled.
00:29:42.640 | We're going to do this for every single layer.
00:29:44.720 | And we're going to find R-squared value.
00:29:47.120 | So how much of the choice of layer
00:29:50.320 | versus the localization, the tracing effect,
00:29:53.000 | explains the variance of successful edit?
00:29:57.080 | If you're not familiar with R-squared,
00:29:58.960 | R-squared is like a--
00:29:59.960 | think about it as an importance of a factor.
00:30:03.240 | And it turns out that layer takes 94%.
00:30:06.960 | Tracing effect is 0.16.
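A minimal sketch of the variance-explained comparison just described: regress the rewrite score on (a) a one-hot encoding of the edited layer and (b) the tracing effect, and compare R-squared values. The arrays here are random placeholders standing in for logged edit results, not real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(X, y):
    """Fraction of variance in y explained by a linear fit on X."""
    return LinearRegression().fit(X, y).score(X, y)

# Placeholder arrays standing in for logged edit results: for each edit you
# would record which layer was edited, the tracing effect at that layer,
# and the rewrite score the edit achieved.
rng = np.random.default_rng(0)
n_edits, n_layers = 5000, 28
edited_layer = rng.integers(0, n_layers, size=n_edits)
tracing_effect = rng.random(n_edits)
rewrite_score = rng.random(n_edits)           # replace with real measurements

layer_onehot = np.eye(n_layers)[edited_layer]  # choice of layer as a factor
print("R^2, layer only:  ", r_squared(layer_onehot, rewrite_score))
print("R^2, tracing only:", r_squared(tracing_effect[:, None], rewrite_score))
```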
00:30:10.680 | And so we were really puzzled.
00:30:11.960 | We were scratching our head.
00:30:13.160 | Why is this true?
00:30:15.640 | But it was true across layer.
00:30:17.520 | We tried all sorts of different things.
00:30:19.200 | We tried different model.
00:30:20.800 | We tried different data set.
00:30:22.080 | It was all roughly the case.
00:30:24.600 | So at this point, we contacted David.
00:30:28.120 | And we started talking about it.
00:30:29.520 | And we resolved it with them.
00:30:30.920 | They acknowledged that this is a phenomenon that exists.
00:30:34.600 | Yeah, John?
00:30:35.960 | So apart from the layer, the other way
00:30:38.760 | which localization can happen is are you
00:30:41.280 | looking at the correct token?
00:30:42.600 | Is that the other corresponding--
00:30:44.840 | Yeah.
00:30:45.600 | Yeah, in this graph, the token is in--
00:30:48.880 | So the added benefit of the rest of the localization
00:30:52.720 | could only help you look at which is the correct subject
00:30:54.920 | token, is that it?
00:30:55.720 | Yeah, yeah.
00:30:56.560 | So looking at any of the subject tokens
00:30:58.200 | is sort of fine is what I should think of?
00:30:59.880 | Yeah, yeah.
00:31:00.440 | Just layer is the biggest thing.
00:31:03.080 | That's the only thing you should care about.
00:31:04.680 | You care about editing?
00:31:05.680 | Layers.
00:31:06.480 | In fact, don't worry about localization at all.
00:31:08.480 | It's just extra wasted compute and carbon.
00:31:13.280 | So that was our conclusion.
00:31:16.240 | But then we thought, maybe the particular definition of edit
00:31:20.840 | that they used in the room was maybe different.
00:31:23.880 | Maybe there exists a definition of editing
00:31:26.920 | that correlates a lot better with localization.
00:31:30.480 | Because there must be.
00:31:31.320 | I'm still puzzled.
00:31:32.360 | Why is this not correlated?
00:31:34.560 | So we tried a bunch of different definitions of edits.
00:31:38.800 | You might inject an error.
00:31:41.040 | You might reverse the tracing.
00:31:46.280 | You might want to erase a fact.
00:31:47.720 | You might want to amplify the fact.
00:31:49.120 | All these things.
00:31:49.840 | Like maybe one of these will work.
00:31:52.480 | It didn't.
00:31:53.960 | So the graph that you're seeing down here
00:31:56.000 | is our square value for four different methods.
00:31:59.400 | And this wasn't just the case for ROM and MEM.
00:32:01.360 | It was also the case for fine tuning methods.
00:32:04.480 | That you want to look at the difference
00:32:06.720 | between blue and orange bar represents
00:32:10.280 | how much the tracing effect influenced
00:32:12.560 | our square value of the tracing effect.
00:32:14.400 | As you can see, it's ignorable.
00:32:16.160 | They're all the same.
00:32:17.920 | You might feel that effect forcing, the last one,
00:32:20.760 | has a little bit of hope.
00:32:22.240 | But still, compared to the impact of layer, choice of layer,
00:32:27.000 | it's ignorable.
00:32:28.680 | So at this point, we said, OK, well, we
00:32:32.560 | can't locate the ethical knowledge in this project.
00:32:35.920 | We're going to have to switch direction.
00:32:37.800 | And we ended up doing a lot more in-depth analysis on this.
00:32:41.520 | So in summary, does localization help editing?
00:32:49.520 | The relationship is actually zero.
00:32:51.600 | For this particular editing method, from what I know,
00:32:55.120 | it's pretty state of the art.
00:32:56.880 | And the counterfact data, it's not true.
00:33:00.360 | Are there any other editing method
00:33:01.800 | that correlate better?
00:33:03.440 | But if somebody can answer this question for me,
00:33:05.680 | that will be very satisfying.
00:33:07.360 | I feel like there should still be something there
00:33:10.480 | that we're missing.
00:33:12.440 | But causal tracing, I think what it does
00:33:14.920 | is it reveals the factual information as
00:33:18.960 | the transformer is passing it forward.
00:33:21.920 | I think it represents where the fact is during
00:33:24.880 | that forward pass.
00:33:26.360 | But what we found here is that it has nothing
00:33:28.480 | to do with editing success.
00:33:30.680 | Those two things are different.
00:33:32.000 | And we have to resolve that somehow.
00:33:35.120 | But a lot of insights that they found in their paper
00:33:37.880 | are still useful, like the early to mid-layer MLP
00:33:40.520 | representations at the last subject token.
00:33:42.360 | They represent the factual knowledge, something we didn't know before.
00:33:46.160 | But it is important not to validate localization methods
00:33:50.440 | using the editing method, now we know,
00:33:52.960 | and maybe not to motivate editing methods
00:33:56.880 | via localization.
00:33:58.600 | Those are the two things now we know that we shouldn't do,
00:34:01.880 | because we couldn't find a relationship.
00:34:05.200 | Any questions on this one before I move on to the next one?
00:34:07.920 | I'm not shocked by this.
00:34:17.480 | I am shocked by this.
00:34:18.680 | I'm still so puzzled.
00:34:21.600 | There should be something.
00:34:22.800 | I don't know.
00:34:26.840 | All right.
00:34:28.840 | So in summary of this first part,
00:34:32.200 | we talked about why the gap might exist,
00:34:35.280 | what machines know versus what we think machines know.
00:34:38.800 | There are three hypotheses.
00:34:40.040 | There are three ideas.
00:34:41.080 | Assumptions are wrong.
00:34:41.920 | Maybe our expectations are wrong.
00:34:43.520 | Maybe it's beyond us.
00:34:45.280 | There's a good quote that says, "Good artists steal.
00:34:48.480 | I think good researchers doubt."
00:34:50.520 | We have to be really suspicious of everything that we do.
00:34:54.160 | And that's maybe the biggest lesson
00:34:55.560 | that I've learned over many years,
00:34:57.880 | that once you like your results so much, that's a bad sign.
00:35:02.360 | Come back, go home, have a beer, go to sleep.
00:35:05.720 | And next day, you come back and put your paper on your desk
00:35:09.240 | and think, OK, now I'm going to review this paper.
00:35:12.480 | How do I criticize this?
00:35:13.680 | What do I not like about this paper?
00:35:16.320 | That's one way to look at it.
00:35:17.520 | Criticize your own research, and that will
00:35:19.800 | improve your thinking a lot.
00:35:23.160 | So let's bring our attention back to our hopes and dreams.
00:35:26.560 | It keeps coming back.
00:35:28.720 | So here, I came to realize maybe instead of just building
00:35:33.560 | tools to understand, perhaps we need to do some groundwork.
00:35:37.800 | What do I mean?
00:35:38.920 | Well, this alien that we've been dealing with,
00:35:41.680 | trying to generate explanations, seems to be a different kind.
00:35:45.960 | So maybe we should study them as if they're
00:35:48.920 | like newbies to the field.
00:35:50.560 | Study them as if they're like new species in the wild.
00:35:54.640 | So what do you do when you observe
00:35:56.160 | a new species in the wild?
00:35:57.800 | You have a couple of ways.
00:35:59.120 | But one of the ways is to do observational study.
00:36:02.120 | So you saw some species in the wild far away.
00:36:05.120 | First, you just kind of watch them.
00:36:07.280 | You watch them and see what are they like,
00:36:09.560 | what are their habitat, what are their values and whatnot.
00:36:14.440 | And second way, you can actually intervene and do a control
00:36:18.000 | study.
00:36:18.800 | So we did something like this with reinforcement learning
00:36:22.920 | setup.
00:36:25.560 | I'm going to talk about these two papers, first paper.
00:36:29.080 | Emergent behaviors in multi-agent systems
00:36:31.360 | has been so cool.
00:36:32.680 | Who saw this hide and seek video by OpenAI?
00:36:36.360 | Yeah, it's so cool.
00:36:37.120 | If you haven't seen it, just Google it and watch it.
00:36:39.240 | It's so fascinating.
00:36:40.440 | I'm only covering the tip of an iceberg in this.
00:36:42.920 | But at the end of this hide and seek episode, at some point,
00:36:47.400 | the agents discover a bug in this physical system
00:36:52.080 | and start anti-gravity flying in the air
00:36:55.680 | and shooting hiders everywhere.
00:36:57.960 | It's a super interesting video.
00:36:59.360 | You must watch.
00:37:01.320 | So lots of that.
00:37:02.360 | And also humanoid football and capture the flag from deep mind.
00:37:05.840 | Lots of interesting behaviors emerging that we observed.
00:37:08.560 | Here's my favorite one.
00:37:12.520 | But these labels-- so here, these
00:37:15.680 | are labels that are provided by OpenAI, running and chasing,
00:37:19.040 | fort building, and ramp use.
00:37:21.760 | And these were ones where a human or humans
00:37:25.360 | went painstakingly, one by one, watching all these videos
00:37:29.040 | and labeling them manually.
00:37:31.600 | So our question is, is there a better way
00:37:34.800 | to discover these emergent behaviors?
00:37:37.440 | Perhaps some nice visualization can
00:37:39.800 | help us explore this complex domain a little better.
00:37:44.960 | So that's our goal.
00:37:47.800 | So in this work, we're going to, again,
00:37:50.240 | treat the agents like an observational study,
00:37:52.800 | like a new species.
00:37:53.800 | And we're going to do observational study.
00:37:55.800 | And what that means is that we only
00:37:57.600 | get to observe state and action pairs.
00:37:59.800 | So where they are and what they are doing.
00:38:04.080 | And we're going to discover agent behavior
00:38:07.200 | by basically clustering the data.
00:38:10.320 | That's all we're going to do.
00:38:12.600 | And how do we do it?
00:38:13.840 | Pretty simple.
00:38:15.600 | Generative model-- have you covered the Bayesian
00:38:18.120 | generative model, graphical model?
00:38:19.800 | No, gotcha.
00:38:21.960 | So think about--
00:38:22.640 | [INAUDIBLE]
00:38:24.840 | That also what you teach?
00:38:27.200 | Yeah, so this is a graphical model.
00:38:29.400 | Think about this as a fake or hypothetical data generation
00:38:34.280 | process.
00:38:35.160 | So how does this work?
00:38:36.360 | Like, I'm generating the data.
00:38:37.800 | I created this system.
00:38:39.240 | I'm going to first generate a joint latent embedding space
00:38:43.240 | that represents, as numbers,
00:38:45.360 | all the behaviors in the system.
00:38:47.440 | And then for each agent, I'm going
00:38:48.920 | to generate another embedding.
00:38:51.480 | And each embedding, when it's conditioned with state,
00:38:55.360 | it's going to generate policy.
00:38:57.160 | It's going to decide what it's going to do,
00:38:58.960 | what action to take given the state and the embedding pair.
00:39:02.400 | And then what that whole thing generates
00:39:05.240 | is what you see, the state and action pair.
00:39:08.280 | So how does this work?
00:39:09.560 | And then given this, you build a model.
00:39:11.880 | And you do inference to learn all these parameters.
00:39:15.200 | Kind of the same business as a neural network,
00:39:16.880 | but it just has a little more structure.
00:39:20.080 | So this is completely made up, right?
00:39:21.800 | This is like my idea of how these new species might work.
00:39:26.160 | And our goal is to--
00:39:27.240 | we're going to try this and see if anything useful comes up.
00:39:30.920 | And the way you do this is-- one of the ways you do this
00:39:33.320 | is you optimize a variational lower bound.
00:39:35.720 | You don't need to know that.
00:39:36.880 | It's very cool, actually.
00:39:38.920 | If one gets into this exponential family business,
00:39:42.440 | very cool.
00:39:44.160 | CS228.
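As a rough sketch (notation mine; the paper's exact factorization may differ), the hypothetical generative process and the variational lower bound being optimized look something like this:

```latex
% Hypothetical sketch of the hierarchical generative process:
% a shared latent z_omega, a per-agent latent z_alpha, and a policy that
% emits actions given the state and the agent embedding.
\begin{align*}
z_\omega &\sim p(z_\omega), \\
z_\alpha^{(a)} &\sim p\!\left(z_\alpha \mid z_\omega\right) \quad \text{for each agent } a, \\
a_t^{(a)} &\sim \pi_\theta\!\left(a_t \mid s_t^{(a)},\, z_\alpha^{(a)}\right).
\end{align*}
% Parameters are fit by maximizing a variational lower bound (ELBO) on the
% log-likelihood of the observed state-action pairs:
\begin{equation*}
\log p_\theta(\mathbf{a} \mid \mathbf{s}) \;\ge\;
\mathbb{E}_{q_\phi(z \mid \mathbf{s}, \mathbf{a})}\!\big[\log p_\theta(\mathbf{a} \mid \mathbf{s}, z)\big]
\;-\; \mathrm{KL}\!\big(q_\phi(z \mid \mathbf{s}, \mathbf{a}) \,\big\|\, p(z)\big).
\end{equation*}
```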
00:39:47.400 | So here's one of the results that we had.
00:39:49.600 | It's a domain called MuJoCo.
00:39:52.160 | Here, we're going to pretend that we have two agents, one
00:39:54.960 | controlling back leg and one controlling the front leg.
00:39:57.800 | And on the right, we're showing that joint embedding space
00:40:00.760 | z omega and z alpha.
00:40:03.040 | While video is running, I'm going
00:40:05.720 | to try to put the video back.
00:40:08.640 | So now I'm going to select-- this is a visualization
00:40:14.080 | that we built online.
00:40:16.400 | You can go check it out.
00:40:17.960 | You can select a little space in agent 1 space.
00:40:21.440 | And you see it maps to pretty tight space in agent 0.
00:40:25.120 | And it shows pretty decent running ability.
00:40:27.320 | That's cool.
00:40:28.760 | And now I'm going to select somewhere else in agent 1
00:40:32.480 | that maps to kind of dispersed area in agent 0.
00:40:35.560 | It looks like it's not doing as well.
00:40:38.800 | And this is just an insight that we gain for this data only.
00:40:42.480 | But I was quickly able to identify, ah,
00:40:45.800 | this tight mapping business kind of represents
00:40:49.840 | the good running behavior and bad running behaviors.
00:40:52.680 | That's something that you can do pretty efficiently.
00:40:55.200 | And now I'm going to show you something more interesting.
00:40:58.160 | So of course, we have to do this because we have the data.
00:41:00.960 | It's here.
00:41:01.440 | It's so cool.
00:41:02.880 | So we apply this framework in the OpenAI's hide and seek.
00:41:07.200 | This has four agents.
00:41:08.640 | It looks like a simple game, but it
00:41:10.120 | has pretty complex structure, 100 dimensional observations,
00:41:13.640 | five dimensional action space.
00:41:15.520 | So in this work, remember that we
00:41:18.200 | pretend that we don't know the labels given by OpenAI.
00:41:21.240 | We just shuffle them in the mix.
00:41:24.360 | But we can color them, our results,
00:41:26.480 | with respect to their labels.
00:41:28.280 | So again, this is the result of z omega and z alpha.
00:41:32.760 | The individual agents.
00:41:33.960 | But the coloring is something that we didn't know before.
00:41:36.340 | We just did it after the fact.
00:41:39.360 | You can see in the z omega, there's
00:41:41.120 | a nice kind of pattern where we can roughly separate what
00:41:46.120 | makes sense to humans and what makes sense to them.
00:41:48.840 | But remember, the green and gray, kind of everywhere,
00:41:53.560 | they're mixed.
00:41:54.400 | So in this particular run of OpenAI's hide and seek,
00:41:58.440 | it seemed that those two representations
00:42:00.400 | were kind of entangled.
00:42:03.360 | The running and chasing, the blue dots,
00:42:05.400 | it seems to be pretty separate and distinguishable
00:42:08.280 | from all the other colors.
00:42:10.040 | And that kind of makes sense, because that's
00:42:11.920 | the basis of playing this game.
00:42:13.680 | So if you don't have that representation,
00:42:15.440 | you're in big trouble.
00:42:17.760 | But in case of orange, which is fort building,
00:42:22.960 | it's a lot more distinguishable in hiders.
00:42:26.220 | And that makes sense, because hiders are
00:42:28.160 | the ones building the fort.
00:42:30.400 | And seekers don't build the fort,
00:42:31.720 | so the oranges are a little more entangled for seekers.
00:42:34.400 | Perhaps if seekers had built more separate fort building
00:42:38.600 | representation, maybe they would have won this game.
00:42:40.720 | So in this work, can we learn something interesting about
00:42:47.920 | emergent behaviors by just simply observing the system?
00:42:51.960 | The answer seems to be yes, at least for the domains
00:42:54.080 | that we tested.
00:42:55.200 | A lot more complex domains should be tested.
00:42:57.960 | But these are the ones we had.
00:43:01.280 | But remember that these methods don't give you
00:43:03.400 | names of these clusters.
00:43:04.720 | So you would have to go and investigate and click
00:43:07.840 | through and explore.
00:43:10.720 | And if the cluster represents super human concept,
00:43:14.720 | this is not going to help you.
00:43:16.080 | And I'll talk a little more about a work
00:43:17.800 | that we do try to help them.
00:43:19.640 | But this is not for you.
00:43:20.680 | This is not going to help you there.
00:43:23.440 | And also, if you have access to the model and the reward
00:43:27.120 | signal, you should use it.
00:43:29.160 | Why dump it?
00:43:31.160 | So next work, we do use it.
00:43:33.240 | I'm going to talk about this work with Nico and Natasha
00:43:37.000 | and Shai again.
00:43:39.560 | So here, this time, we're going to intervene.
00:43:42.200 | We're going to be a little intrusive,
00:43:43.780 | but hopefully we'll learn a little more.
00:43:46.440 | So problem is that we're going to build a new multi-agent
00:43:49.440 | system, going to build it from scratch, such that we
00:43:52.080 | can do control testing.
00:43:53.560 | But at the same time, we shouldn't
00:43:54.960 | sacrifice the performance.
00:43:56.520 | So we're going to try to match the performance
00:43:59.200 | of the overall system.
00:44:00.760 | We do succeed.
00:44:03.360 | I had this paper collaboration with folks
00:44:05.840 | at Stanford, actually, here in 2020,
00:44:08.520 | where we proposed this pretty simple idea, which
00:44:11.000 | is you have a neural network.
00:44:13.240 | Why don't we embed concepts in the middle of the bottleneck,
00:44:17.080 | where one neuron represents trees,
00:44:19.240 | the other represents stripes, and just
00:44:21.560 | train the model end-to-end?
00:44:23.720 | And why are we doing this?
00:44:25.440 | Well, because then at inference time,
00:44:27.680 | you can actually intervene.
00:44:29.880 | You can pretend, you know, predicting zebra,
00:44:32.160 | I don't think trees should matter.
00:44:33.840 | So I'm going to zero out this neuron
00:44:35.480 | and feed forward and see what happens.
00:44:37.480 | So it's particularly useful in the medical setting,
00:44:39.640 | where there are some features that doctors don't want.
00:44:41.840 | We can cancel out and test.
00:44:44.800 | So this is the work to extend this to RL setting.
00:44:48.720 | It's actually not as simple an extension as we thought.
00:44:53.160 | It turned out to be pretty complex.
00:44:54.920 | But essentially, we're doing that.
00:44:57.240 | And we're building each of the concept bottleneck
00:44:59.880 | for each agent.
00:45:02.120 | And at the end of the day, what you optimize
00:45:04.000 | is what you usually do, typical PPO.
00:45:06.360 | Just think about this as: make the whole system work,
00:45:09.680 | plus minimizing the difference between the true concept
00:45:13.360 | and the estimated concept.
00:45:14.840 | That's all you do.
00:45:17.640 | Why are we doing this?
00:45:18.600 | You can intervene.
00:45:19.600 | You can pretend now agent two, pretend
00:45:22.360 | that you can't see agent one.
00:45:24.480 | What happens now?
00:45:25.880 | That's what we're doing here.
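A minimal PyTorch-style sketch of a concept-bottleneck policy and the combined objective, assuming the PPO loss itself is computed elsewhere; the class name, layer sizes, and concept list are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConceptBottleneckPolicy(nn.Module):
    """Observation -> predicted concepts -> action logits.
    The bottleneck lets you intervene on named concepts at inference time."""
    def __init__(self, obs_dim, n_concepts, n_actions):
        super().__init__()
        self.to_concepts = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                         nn.Linear(128, n_concepts))   # concept layer
        self.to_actions = nn.Sequential(nn.Linear(n_concepts, 128), nn.ReLU(),
                                        nn.Linear(128, n_actions))

    def forward(self, obs, intervene=None):
        concepts = self.to_concepts(obs)          # e.g. positions, orientations, ...
        if intervene is not None:                 # {concept_index: forced value}
            concepts = concepts.clone()
            for idx, value in intervene.items():
                concepts[..., idx] = value
        return self.to_actions(concepts), concepts

# Training objective (sketch): the usual PPO loss, computed by your RL library,
# plus supervision pulling the predicted concepts toward the true ones:
#   loss = ppo_loss + lambda_c * nn.functional.mse_loss(pred_concepts, true_concepts)
```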
00:45:29.320 | We're going to do this in two domains.
00:45:31.440 | First domain, how many people saw this cooking game before?
00:45:37.560 | Yeah, it's a pretty commonly used cooking domain
00:45:41.320 | in reinforcement learning, very simple.
00:45:43.880 | We have two agents, yellow and blue.
00:45:46.280 | And they're going to make soup.
00:45:48.080 | They can bring three tomatoes; they get a reward.
00:45:49.520 | They wait for the tomato soup and bring a dish
00:45:50.800 | to the cooking pot.
00:45:54.960 | They get a reward finally.
00:45:56.240 | Their goal is to deliver as many soups as possible,
00:45:59.040 | given some time.
00:46:01.360 | And here, concepts that we use are agent position,
00:46:04.520 | orientation, agent has tomato, has dish, et cetera, et cetera.
00:46:08.160 | Something that's immediately available to you already.
00:46:11.280 | And you can, of course, tweak the environment
00:46:13.600 | to make it more fun.
00:46:14.920 | So you can make it that they have to collaborate.
00:46:18.160 | You can build a wall between them
00:46:19.680 | so that they have to work together in order
00:46:21.440 | to serve any tomato soup.
00:46:23.360 | Or you can make them freely available.
00:46:25.240 | You can work independently or together,
00:46:27.680 | whatever your choice.
00:46:28.720 | First, just kind of sanity check was
00:46:34.160 | that you can detect this emerging behavior of coordination
00:46:40.080 | versus non-coordination.
00:46:41.320 | So with the impassable environment,
00:46:43.960 | the one we made up,
00:46:45.840 | supposing the RL system that we trained worked and
00:46:48.960 | they were able to deliver some soups,
00:46:51.200 | then you see that when we intervene--
00:46:53.120 | this graph, let me explain.
00:46:54.440 | This is a reward of an agent one when there's no intervention.
00:46:59.480 | So this is perfectly good world.
00:47:01.840 | And when there was an intervention.
00:47:04.080 | This is average value of intervening on all concepts.
00:47:07.480 | But I'm also going to show you each concept soon.
00:47:10.800 | If you compare left and right, you
00:47:12.880 | can tell that in the right, when we intervene,
00:47:16.120 | reward deteriorated quite a lot for both of them.
00:47:19.800 | And that's one way to see, ah, they are coordinating.
00:47:22.880 | Because somehow intervening at this concept
00:47:26.560 | impacted a lot of their performance.
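A sketch of the per-concept intervention study behind these plots: force one concept at a time to a neutral value during rollouts and record the drop in average reward. `rollout` and the environment interface are hypothetical helpers, not the actual experiment code.

```python
# Hypothetical sketch: per-concept intervention study. `rollout(policy, env,
# intervene, n_episodes)` is an assumed helper that plays episodes with the
# given intervention applied at every step and returns the mean episode reward.

def concept_importance(policy, env, concept_names, n_episodes=50):
    baseline = rollout(policy, env, intervene=None, n_episodes=n_episodes)
    drops = {}
    for idx, name in enumerate(concept_names):
        reward = rollout(policy, env, intervene={idx: 0.0}, n_episodes=n_episodes)
        drops[name] = baseline - reward   # big drop: the agents rely on this concept
    return drops   # e.g. teammate orientation showing the largest drop here
```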
00:47:30.480 | But this is what was really interesting to me,
00:47:33.240 | and I'm curious.
00:47:34.280 | Anyone can guess.
00:47:35.600 | So this is the same graph as the one you saw before,
00:47:40.080 | but except I'm plotting for intervention for each concept.
00:47:44.000 | So I'm intervening team position, team orientation,
00:47:47.120 | team has tomato, et cetera, et cetera.
00:47:49.960 | It turns out that they are using--
00:47:52.640 | or rather, when we intervene on team orientation,
00:47:56.160 | the degradation of performance was the biggest,
00:47:58.760 | to the extent that we believe that orientation had
00:48:00.840 | to do with their coordination.
00:48:04.000 | Anyone can guess why this might be?
00:48:06.440 | It's not the position.
00:48:13.960 | It's orientation.
00:48:15.880 | Just a clarification question on the orientation.
00:48:17.880 | Is that like the direction that the team is projecting?
00:48:21.360 | So it seems like orientation would let you predict
00:48:25.800 | where they're moving next?
00:48:26.800 | Yes, yes, that's right.
00:48:28.400 | Where were you when I was pulling my hair over this
00:48:31.560 | question?
00:48:32.920 | Yes, that's exactly right.
00:48:34.040 | And initially, I was really puzzled.
00:48:36.120 | Like, why not position?
00:48:37.160 | Because I expected it to be position.
00:48:38.840 | But exactly, that's exactly right.
00:48:44.000 | So the orientation is the first signal
00:48:47.280 | that an agent can get about the next move of the other agent.
00:48:51.760 | If they're facing the pot, they're going to the pot.
00:48:54.200 | If they're facing the tomato, they're
00:48:55.600 | going to get the tomato.
00:48:57.240 | Really interesting intuition.
00:49:00.160 | Obvious to some, but I needed this graph to work that out.
00:49:05.600 | And of course, you can use this to identify lazy agents.
00:49:09.240 | If you look at the rightmost yellow agent, our friend,
00:49:14.160 | just chilling in the background.
00:49:17.200 | And he's lazy.
00:49:18.000 | And if you train RL agents, there are
00:49:19.560 | always some agents just hanging out.
00:49:21.560 | They just don't do anything.
00:49:23.560 | And you can easily identify this by using this graph.
00:49:27.160 | If I intervene, it just doesn't impact any of their rewards.
00:49:31.120 | That one's me.
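A minimal sketch of the kind of intervention test described above: roll out the trained agents with and without overriding one concept, and measure how much the average reward drops. The environment object, the concept-bottleneck policy interface (predict_concepts, act_from_concepts), and the override mechanism are all hypothetical stand-ins, not the actual framework's API.

```python
# Hedged sketch (assumed API, not the actual framework): estimate how much each
# agent's reward depends on a concept by overriding that concept during rollouts.
import numpy as np

def rollout_reward(env, policies, n_episodes=50, override=None):
    """Average per-agent episode reward, optionally overriding one concept."""
    totals = np.zeros(len(policies))
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:
            actions = []
            for i, pi in enumerate(policies):
                concepts = pi.predict_concepts(obs[i])        # concept bottleneck
                if override is not None:
                    name, value_fn = override
                    concepts[name] = value_fn(obs[i])          # e.g. a shuffled or fixed value
                actions.append(pi.act_from_concepts(concepts))
            obs, rewards, done, _ = env.step(actions)
            totals += np.asarray(rewards)
    return totals / n_episodes

def intervention_effects(env, policies, concept_names, value_fn):
    """Per-concept reward drop for each agent: a large drop suggests the agents
    rely on that concept (coordination); near-zero drops across every concept
    for one agent suggest a lazy agent."""
    base = rollout_reward(env, policies)
    return {c: base - rollout_reward(env, policies, override=(c, value_fn))
            for c in concept_names}
```

A large drop for a concept like teammate orientation is the coordination signal discussed above; a near-zero drop for every concept of some agent is the signature of a lazy agent.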
00:49:32.080 | So the second domain, we're going
00:49:36.640 | to look at a little more complex domain.
00:49:39.080 | So this is studying inter-agent social dynamics.
00:49:42.920 | So in this domain, there is a little bit of tension.
00:49:45.800 | This is called a cleanup.
00:49:47.320 | We have four agents.
00:49:49.160 | They only get rewards if they eat apples.
00:49:51.560 | The green things are apples.
00:49:54.960 | But if you don't clean the river,
00:49:57.000 | then apples stop spawning at all.
00:49:58.640 | So somebody has to clean the river.
00:50:01.360 | And you can see, if you have four people trying
00:50:04.560 | to collect apples, you can just
00:50:07.760 | wait until someone else cleans the river
00:50:09.960 | and then collect the apples.
00:50:11.120 | And in fact, that's sometimes what happens.
00:50:12.920 | And concepts here, again, are pretty common things--
00:50:20.440 | position, orientation, pollution positions, et
00:50:24.480 | cetera.
00:50:26.560 | So when we first plotted the same graph
00:50:30.920 | as the previous domain, it tells a story.
00:50:36.280 | So the story here is that when I intervene on agent 1,
00:50:41.920 | it seems to influence agent 2 quite a lot.
00:50:45.680 | If you look at these three different graphs,
00:50:49.920 | showing how reward was impacted when I intervene on agent 1,
00:50:53.720 | agents 3 and 4 are fine, but it
00:50:55.400 | seems that only agent 2 is influenced.
00:50:57.320 | Same with idle time, same with the inter-agent distance.
00:51:00.720 | So we were like, oh, maybe that's true.
00:51:03.360 | But we kept wondering.
00:51:04.600 | There's a lot going on in this domain.
00:51:06.840 | How do we know this is the case?
00:51:09.840 | So we decided to take another step.
00:51:13.280 | So we're going to do a little more work here, but not a lot.
00:51:18.480 | We're going to build a graph to discover
00:51:21.040 | inter-agent relationships.
00:51:22.960 | This is the simplest, dumbest way to build a graph.
00:51:25.480 | But again, I like simple things.
00:51:27.320 | So how do you build a graph?
00:51:28.520 | Well, suppose that you're building
00:51:30.640 | a graph between movies.
00:51:31.680 | This is not what we do, but just to describe
00:51:34.880 | what we're trying to do.
00:51:37.800 | We're going to build a matrix.
00:51:39.480 | Each row is a movie, and each column
00:51:42.640 | consists of features of these movies, so length,
00:51:45.720 | genre of the movie, and so on.
00:51:48.120 | And the simplest way to build a graph is to do a regression.
00:51:52.040 | So exclude the i-th row, and then we're
00:51:56.440 | going to regress it on everyone else.
00:51:58.600 | And that gives me beta, which is a kind of coefficient
00:52:02.680 | for each of these.
00:52:04.080 | And that beta represents the strength of the edges.
00:52:08.920 | So this movie is more related to this movie and not
00:52:11.040 | the other movie.
00:52:11.840 | And ta-da, you have a graph.
00:52:13.040 | It's the dumbest way.
00:52:14.160 | There are a lot of caveats to it.
00:52:15.400 | You shouldn't do this in a lot of cases,
00:52:16.960 | but this is the simplest way to do it.
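A rough illustration of that leave-one-out regression, for anyone who wants to see it written down. It assumes a feature matrix X with one row per item (movies here; interventions in the next step), and the Lasso penalty for sparsity is an illustrative choice, not necessarily what was used in the paper.

```python
# Hedged sketch: build a graph by regressing each row on all the other rows;
# the regression coefficients (betas) become edge weights.
import numpy as np
from sklearn.linear_model import Lasso

def leave_one_out_graph(X, alpha=0.1):
    """X: (n_items, n_features). Returns W of shape (n_items, n_items), where
    W[i, j] is the coefficient of item j when item i's feature vector is
    regressed on everyone else's (W[i, i] stays 0)."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(X, i, axis=0)               # exclude the i-th row
        model = Lasso(alpha=alpha, fit_intercept=False)
        model.fit(others.T, X[i])                      # predictors: the other items
        W[i, np.arange(n) != i] = model.coef_
    return W
```

Applied to the cleanup domain, each row would instead be the measured outcomes (reward, resources collected, idle time, and so on) of intervening on concept c of agent n, and a large beta between two rows is read as a strong edge between those interventions.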
00:52:20.520 | So we did the same thing here.
00:52:22.120 | Instead of movie, we're going to use intervention on concept
00:52:27.760 | C on agent N as our node.
00:52:31.400 | And to build this matrix, we're going
00:52:34.160 | to use intervention outcomes, which
00:52:36.440 | wouldn't have been available without our framework:
00:52:39.760 | reward, resources collected, and many other things.
00:52:45.240 | And when you build this graph, at the end of the day,
00:52:47.440 | you get betas that represent the relationships
00:52:50.000 | between these interventions.
00:52:54.240 | So I had a graph of that matrix.
00:52:57.120 | Apparently, I removed it before I came over.
00:53:00.240 | But imagine there was a matrix that
00:53:03.320 | is nicely highlighted between agents 1 and 4, and only there,
00:53:07.800 | contradicting the original hypothesis that we had.
00:53:11.160 | And this is the video of it.
00:53:13.320 | So when we stared at that matrix,
00:53:15.640 | it turns out that there are no strong edges
00:53:20.160 | between agents 1 and 2.
00:53:22.160 | So we were like, that's weird.
00:53:23.560 | But there are strong edges between agents 1 and 4.
00:53:26.360 | So we dug deeper into it, watched a lot of sessions
00:53:30.280 | to validate what was happening.
00:53:32.000 | And it turns out that the story was a lot more complicated.
00:53:35.680 | Agent 1's orientation was important for agent 4.
00:53:39.200 | But when that fails, agents 1 and 2 get cornered.
00:53:43.320 | And you can see that in the graph.
00:53:44.800 | Agent 4 gets agents 1 and 2, the blue and yellow agents,
00:53:50.800 | stuck in the corner together.
00:53:51.960 | They get stuck.
00:53:53.480 | And this is simply just accidental
00:53:55.840 | because of the way that we built this environment.
00:53:58.720 | It just happened.
00:54:00.880 | But the raw statistics wouldn't have told us this story,
00:54:04.640 | that this was completely accidental.
00:54:06.140 | In fact, there was no correlation, no coordination
00:54:08.680 | between agents 1 and 2.
00:54:10.200 | Only after building the graph did we realize this was the case.
00:54:14.600 | Now, this might be a one-off case.
00:54:16.640 | But you know what?
00:54:17.440 | A lot of emerging behaviors that we want to detect, a lot of them
00:54:21.280 | will be one-off cases.
00:54:22.640 | And we really want to get to the truth of that
00:54:25.000 | rather than having some surface-level statistics.
00:54:27.960 | So can we build a multi-agent system
00:54:34.920 | that enables intervention and performs just as well?
00:54:37.240 | The answer is yes.
00:54:38.160 | There's a graph that shows the red line and blue line roughly
00:54:41.400 | aligned.
00:54:41.880 | That's good news.
00:54:42.840 | We are performing as well.
00:54:45.480 | But remember these concepts--
00:54:47.080 | you need to label them,
00:54:48.360 | or you should have some way of getting those concepts,
00:54:50.560 | the positions and orientations.
00:54:52.240 | That might be something that we would
00:54:53.880 | love to extend in the future.
00:54:56.600 | Before I go on, any questions?
00:54:58.320 | You shy?
00:55:05.360 | You shy?
00:55:06.360 | [LAUGHS]
00:55:07.840 | Cool.
00:55:11.800 | All right.
00:55:13.520 | So I did tell you that we're not going to know
00:55:16.880 | the solution to move 37.
00:55:18.720 | I still don't.
00:55:19.720 | I still don't.
00:55:20.960 | But I'll tell you a little bit about work
00:55:23.760 | that I'm currently doing that I'm really excited about.
00:55:27.440 | We started thinking, you know what?
00:55:29.840 | Will this understanding of move 37 happen
00:55:33.200 | within my lifetime?
00:55:34.560 | And I was like, oh, maybe not.
00:55:35.800 | But I kind of want it to happen.
00:55:37.680 | So we start-- this is all about research, right?
00:55:40.400 | You start carving out a space where
00:55:42.360 | things are a little bit solvable.
00:55:44.320 | And you try to attack that problem.
00:55:46.320 | So this is our attempt to do exactly that,
00:55:48.720 | to get a little closer to our ultimate goal,
00:55:51.760 | my ultimate goal, of understanding that move 37.
00:55:56.160 | So before that, how many people here know AlphaZero?
00:56:00.800 | AlphaZero is a self-trained chess-playing machine
00:56:07.360 | that has a higher Elo rating than any human
00:56:10.320 | and beats Stockfish, which arguably no existing
00:56:13.000 | human can beat.
00:56:15.040 | So in a previous paper, we tried to discover human chess
00:56:19.640 | concepts in this network--
00:56:22.120 | when does a concept like material imbalance
00:56:26.320 | appear in the network, in which layer,
00:56:29.320 | and when in training time, which we call
00:56:32.720 | what, when, and where plots.
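Read concretely, a what, when, and where plot can be produced by fitting a simple probe for the concept at every (training checkpoint, layer) pair and recording its held-out score. The sketch below is a generic linear-probe version under that reading; load_checkpoint and get_activations are assumed helpers, and the paper's actual probing setup may differ in its details.

```python
# Hedged sketch: concept probes across training checkpoints (when) and layers (where).
import numpy as np
from sklearn.linear_model import Ridge

def what_when_where(checkpoints, layers, boards, concept_values,
                    load_checkpoint, get_activations):
    """scores[t, l]: held-out R^2 of a linear probe predicting the concept (what)
    from activations at training checkpoint t (when) and layer l (where)."""
    n = len(boards)
    split = int(0.8 * n)                                   # simple train/test split
    scores = np.zeros((len(checkpoints), len(layers)))
    for t, ckpt in enumerate(checkpoints):
        model = load_checkpoint(ckpt)                      # assumed helper
        for l, layer in enumerate(layers):
            acts = get_activations(model, boards, layer)   # assumed: (n_boards, d) array
            probe = Ridge(alpha=1.0).fit(acts[:split], concept_values[:split])
            scores[t, l] = probe.score(acts[split:], concept_values[split:])
    return scores
```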
00:56:35.140 | And we also compared the evolution
00:56:37.320 | of opening moves between humans and AlphaZero.
00:56:40.240 | These are the first couple moves that you
00:56:42.120 | make when you play chess.
00:56:43.800 | And as you can see, there's a pretty huge difference.
00:56:46.640 | Left is human, right is AlphaZero.
00:56:49.920 | It turns out that AlphaZero can master, or supposedly master,
00:56:54.480 | a wide variety of different types of openings.
00:56:57.600 | Openings can be very aggressive.
00:56:59.320 | Openings can be very boring.
00:57:01.200 | They could be very long range, targeting
00:57:03.640 | a long-range strategy, or short range.
00:57:06.440 | Very different.
00:57:07.400 | So that begs the question, what does AlphaZero know
00:57:10.960 | that humans don't know?
00:57:12.400 | Don't you want to learn what that might be?
00:57:16.400 | So that's what we're doing right now.
00:57:18.000 | We're actually almost-- we're about to evaluate.
00:57:22.080 | So the goal of this work is to teach the world chess champion
00:57:27.120 | a new, superhuman chess strategy.
00:57:31.080 | And we just got a yes from Magnus Carlsen, who
00:57:34.080 | is the world chess champion.
00:57:36.520 | He just lost the match, I know.
00:57:38.040 | But he's still champion in my mind.
00:57:41.440 | He's still champion in two categories, actually.
00:57:44.400 | So the way that we are doing this
00:57:45.840 | is we're going to discover new chess strategies
00:57:49.320 | by explicitly forgetting existing chess strategies, which
00:57:54.280 | we have a lot of data for.
00:57:56.520 | And then we're going to learn a graph, this time
00:57:59.160 | a little more complicated graph, by using
00:58:03.720 | the existing relationships between existing concepts
00:58:07.400 | so that we can get a little bit more idea of what
00:58:09.960 | the new concept might look like.
00:58:12.160 | And Magnus Carlsen-- so my favorite part about this work--
00:58:15.440 | I talk about carving out.
00:58:17.160 | My favorite part about this work is that the evaluation
00:58:19.880 | is going to be pretty clear.
00:58:21.520 | So it's not just Magnus coming in and saying,
00:58:23.640 | oh, your work is kind of nice, and saying nice things
00:58:25.960 | about our work.
00:58:26.720 | No, Magnus actually has to solve some puzzles.
00:58:29.880 | And we will be able to evaluate him, whether he did it or not.
00:58:35.320 | So it's a kind of success or fail.
00:58:35.320 | But I'm extremely excited.
00:58:36.400 | This kind of work I can only do because of Lisa,
00:58:40.280 | who is a chess champion herself, but also a PhD student at Oxford.
00:58:45.400 | And she has played against Magnus in the past,
00:58:47.560 | and many other top chess players in the world.
00:58:50.000 | And she's going to be the ultimate pre-superhuman
00:58:53.600 | filter, filtering the concepts that
00:58:56.520 | will eventually get to Magnus.
00:58:59.400 | So I'm super excited about this.
00:59:00.760 | I have no results, but it's coming up.
00:59:02.520 | I'm excited.
00:59:03.740 | [INAUDIBLE]
00:59:17.660 | Puzzles are actually pretty simple.
00:59:19.140 | So the way that we generate concepts
00:59:22.300 | is within the embedding space of AlphaZero.
00:59:25.460 | And AlphaZero has a really weird architecture,
00:59:29.660 | in that every single latent layer in AlphaZero
00:59:31.660 | has the exact same spatial layout as a chessboard.
00:59:33.900 | That's just the way that they decided to do it.
00:59:35.900 | So because of that, we can actually
00:59:37.420 | identify or generate the board positions that
00:59:40.780 | correspond to that concept.
00:59:42.900 | And because we have MCTS, we can predict
00:59:46.540 | what move it's going to make given that board position.
00:59:49.900 | Because at inference time, it's actually
00:59:51.600 | deterministic, the whole AlphaZero thing.
00:59:53.980 | So we have a lot of board positions.
00:59:56.540 | And that's all you need for puzzles.
00:59:58.380 | You give a board position and then ask Magnus to make a move.
01:00:01.580 | We explain the concept and then give Magnus more board
01:00:04.660 | positions and see if he can apply that concept that he just
01:00:08.420 | learned.
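A hedged sketch of that puzzle protocol, purely to make the steps concrete. The concept vector, the latent function returning AlphaZero's board-shaped activations, and mcts_best_move returning its deterministic move are all assumed stand-ins for the real pipeline.

```python
# Hedged sketch (assumed interfaces): select concept-aligned board positions,
# use AlphaZero's move as the puzzle answer, then grade a human's attempts.
import numpy as np

def make_puzzles(concept_vec, candidate_boards, latent, mcts_best_move, k=10):
    """Pick the k boards whose latent activations align most with the concept,
    and pair each one with AlphaZero's (deterministic) move for that position."""
    scores = [float(np.dot(latent(b).ravel(), concept_vec)) for b in candidate_boards]
    top = np.argsort(scores)[-k:]
    return [(candidate_boards[i], mcts_best_move(candidate_boards[i])) for i in top]

def grade(puzzles, human_move_fn):
    """Fraction of puzzles where the human's move matches AlphaZero's move."""
    return float(np.mean([human_move_fn(board) == move for board, move in puzzles]))
```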
01:00:09.420 | [INAUDIBLE]
01:00:21.740 | Yeah, so if I were to ask Stockfish to solve those puzzles,
01:00:27.340 | that would be a different question.
01:00:28.860 | Because we are interested in whether we can teach a human,
01:00:31.780 | not Stockfish.
01:00:32.820 | Stockfish might be able to do it.
01:00:34.200 | That's actually an interesting thing that we could do,
01:00:36.980 | now that I think about it.
01:00:37.900 | But our goal is just to teach even one superhuman concept.
01:00:41.420 | Like if I have, for example, 10,000 superhuman concepts,
01:00:45.580 | and only three of them are digestible by Magnus,
01:00:49.180 | that's a win.
01:00:50.220 | That would be a big win for this type of research.
01:00:56.380 | Questions?
01:00:56.880 | Yeah, so wrap up.
01:01:04.540 | Small steps towards our hopes and dreams.
01:01:06.820 | We talked about the gap between what machines know
01:01:09.820 | versus what we think machines know.
01:01:12.180 | Three ideas for why that gap might exist.
01:01:14.980 | Three different angles we can maybe
01:01:16.780 | take to attack and answer those questions
01:01:19.060 | and bridge that gap.
01:01:21.340 | We talked about studying aliens, these machines,
01:01:25.140 | with observational studies or controlled studies.
01:01:27.380 | There are many other ways to study a species.
01:01:30.460 | And I'm not an expert, but anthropology and other humanities
01:01:33.100 | fields would know a lot more about this.
01:01:36.580 | And maybe, just maybe, we can try to understand move 37
01:01:41.740 | at some point, hopefully within my lifetime,
01:01:44.300 | through this chess project that I'm very excited about.
01:01:48.580 | Thank you.
01:01:49.560 | [APPLAUSE]
01:01:52.520 | Thank you very much.
01:01:58.480 | Questions?
01:01:58.980 | You talked about interpretability research
01:02:04.400 | across NLP, vision, and RL.
01:02:07.440 | Do you think there's much hope for taking
01:02:09.640 | certain interpretability techniques from one modality
01:02:12.000 | into other modalities?
01:02:13.320 | And if so, what's the pathway?
01:02:17.300 | So it depends on your goal.
01:02:19.580 | I think-- like, think about fairness research,
01:02:22.420 | which builds on a strong mathematical foundation.
01:02:26.220 | And that's applicable for any questions around fairness,
01:02:29.860 | or hopefully applicable.
01:02:31.340 | But then, once you--
01:02:33.260 | if your goal is to actually solve a fairness issue at hand
01:02:38.100 | for somebody, a real person in the world,
01:02:40.020 | that's a completely different question.
01:02:41.660 | You would have to customize it for a particular person.
01:02:44.060 | You would have to customize it for a particular application.
01:02:47.420 | So there are two avenues.
01:02:48.420 | And I think something similar is true for interpretability,
01:02:50.820 | like the theory work that I talked about.
01:02:52.740 | SHAP and IG are used across domains, like vision and text.
01:02:57.300 | So that theory paper would be applicable across domains.
01:03:00.700 | Things like RL and the way that we
01:03:02.700 | build that generative model, you would
01:03:05.120 | need to test a little bit more to make sure
01:03:07.340 | that this works in NLP.
01:03:09.620 | I don't even know how to think about agents in NLP yet.
01:03:12.700 | So we will need a little bit of tweaking.
01:03:14.380 | But both directions are fruitful.
01:03:16.340 | I wanted to ask a question.
01:03:22.660 | So I saw some recent work in which some amateur Go players
01:03:28.220 | found a very tricky strategy to trip up--
01:03:30.820 | I think it was AlphaGo.
01:03:32.580 | And that seemed like a concept that humans know
01:03:35.380 | that machines don't in that Venn diagram.
01:03:37.940 | I just want to know your thoughts about that.
01:03:40.060 | Yeah, actually, it's funny you mentioned that.
01:03:42.300 | Lisa can beat AlphaZero pretty easily.
01:03:46.700 | And it's a similar idea.
01:03:48.220 | Because you kind of know what the most unseen,
01:03:52.100 | out-of-distribution moves are.
01:03:53.660 | And she can break AlphaZero pretty easily.
01:03:56.580 | And Lisa guessed that if Lee Sedol had known something more
01:03:59.780 | about AI, then maybe he would have tried to confuse AlphaGo.
01:04:03.620 | But the truth is, it's a high-stakes game.
01:04:06.420 | Lee Sedol is a famous star worldwide.
01:04:10.180 | So he wouldn't want to make a move that
01:04:12.660 | would be seen as a complete mistake,
01:04:15.220 | like the one that Magnus made a couple of days
01:04:17.380 | ago that got on the newsfeed everywhere,
01:04:19.460 | that he made the mistake of the century.
01:04:21.620 | And that probably hurts.
01:04:23.500 | Any other questions?
01:04:28.220 | [INAUDIBLE]
01:04:28.700 | [INAUDIBLE]
01:04:53.180 | These works that I've presented are pretty new.
01:04:56.500 | But there has been a bit of discussion in robotics,
01:04:59.980 | about potentially applying this to robotics.
01:05:02.140 | And of course, I can't talk about details.
01:05:03.980 | But the things that people doing reinforcement learning
01:05:10.220 | in the wild worry about are the surprises.
01:05:13.380 | If you have a test for it, like if you have a unit test for it,
01:05:16.660 | you're never going to fail.
01:05:18.100 | Because you're going to test before you deploy.
01:05:20.820 | I think the biggest risk for any of these deployment systems
01:05:24.220 | is the surprises that you didn't expect.
01:05:27.700 | So my work around visualization and others
01:05:30.740 | aims to help you with that.
01:05:33.540 | We may not know the names of these surprises.
01:05:36.780 | But here's a tool that helps you better discover
01:05:39.420 | those surprises before someone else does,
01:05:41.620 | or before someone gets harmed.
01:05:42.860 | Thanks so much for the talk.
01:05:51.060 | This is kind of an open-ended question.
01:05:52.980 | But I was wondering, we're talking
01:05:54.500 | about a lot of ways in which we try to visualize or understand
01:05:59.260 | what's going on in the representation inside
01:06:01.060 | the machine.
01:06:02.300 | But I was wondering whether we could turn it around
01:06:04.820 | and try to teach machines to tell us what--
01:06:08.260 | using our language, what they're doing in their representations.
01:06:11.900 | Like, if we build representations of ours
01:06:13.620 | and then get the machine to do the translation for us
01:06:16.620 | instead of us going into the machine to see it.
01:06:19.140 | Yeah, great question.
01:06:20.300 | So it's a really interesting question.
01:06:21.940 | Because that's something that I kind of tried in my work,
01:06:26.900 | previous work, called Testing with Concept Activation
01:06:29.860 | Vectors.
01:06:30.420 | So that was to map human language into a machine's space
01:06:34.260 | so that it can speak our language.
01:06:36.020 | Because I understand my language, so just
01:06:39.740 | talk to me in my language.
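For reference, a minimal sketch of the TCAV idea just mentioned. It assumes precomputed activations for concept examples and random examples at one layer, plus per-example gradients of a class logit with respect to that layer; the function names are illustrative and not the released TCAV library's API.

```python
# Hedged sketch of Testing with Concept Activation Vectors (TCAV).
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_cav(concept_acts, random_acts):
    """Concept Activation Vector: the normal of a linear classifier separating
    concept examples from random examples in a layer's activation space."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)

def tcav_score(cav, layer_grads):
    """layer_grads: gradients of the class logit w.r.t. the layer activations,
    one row per class example. TCAV score = fraction of examples whose logit
    would increase if the activation were nudged along the concept direction."""
    return float(np.mean(layer_grads @ cav > 0))
```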
01:06:39.740 | The challenge is that, how would you do that for something
01:06:42.620 | like AlphaZero?
01:06:44.340 | Like, we don't have a vocabulary for it, like move 37.
01:06:48.340 | Then there is going to be a lot of missing valuable knowledge
01:06:51.940 | that we might not get from the machine.
01:06:55.140 | So I think the approach has to be both ways.
01:06:57.940 | We should leverage as much as we can.
01:06:59.460 | But acknowledging that, even that mapping,
01:07:02.740 | that trying to map our language to machines,
01:07:05.380 | is not going to be perfect.
01:07:07.700 | Because it's a kind of proxy for what we think a penguin is.
01:07:11.940 | There's a psychology research that says,
01:07:13.820 | everyone thinks very differently about what a penguin is.
01:07:16.420 | Like, if I take a picture of a penguin,
01:07:19.300 | everyone is thinking different penguin right now.
01:07:22.420 | Australia has the cutest penguin, the fairy penguin.
01:07:25.500 | I'm thinking that, right?
01:07:26.620 | But I don't know how many people are thinking that.
01:07:28.740 | So given that, we are so different,
01:07:31.100 | machine's going to think something else.
01:07:32.820 | So how do you bridge that gap?
01:07:34.260 | Extend that to 100 concepts and composing those concepts,
01:07:37.860 | and it's going to get out of hand very soon.
01:07:40.180 | So there's pros and cons.
01:07:41.740 | I'm into both of them.
01:07:43.220 | I think for some applications, exclusively just using
01:07:48.220 | human concepts is still very helpful.
01:07:50.300 | It gets you halfway.
01:07:53.180 | But my ambition is that we shouldn't stop there.
01:07:56.460 | We should benefit from them by having them teach us new things
01:08:00.900 | that we didn't know before.
01:08:02.060 | Yeah?
01:08:05.580 | So the second thing you talked about, with ROME,
01:08:07.740 | was that where knowledge is located in the embedding space
01:08:11.460 | isn't super correlated with what you'd like to edit
01:08:13.860 | to change that knowledge.
01:08:15.180 | Do you think that has any implications for the later
01:08:17.940 | stuff you talked about, like the chess thing?
01:08:19.940 | But I don't know, like trying to locate,
01:08:23.780 | like to just get strategies in the embedding space
01:08:25.900 | might not be as helpful?
01:08:27.420 | Oh, what are the alternatives?
01:08:30.100 | I guess I don't know the alternatives,
01:08:31.740 | just because I feel like the ROME thing as well is not--
01:08:35.540 | That's possible.
01:08:36.260 | So it's like some transformed space of the embedding space
01:08:40.060 | in AlphaZero, maybe a function applied
01:08:42.860 | to that embedding space.
01:08:44.340 | So maybe thinking about it as a raw vector is a dead end.
01:08:49.180 | Could be.
01:08:50.180 | We'll see how this chess project goes.
01:08:52.420 | In a couple of months, I might rethink my strategy.
01:08:56.020 | But interesting thought.
01:08:57.820 | Yeah?
01:08:58.660 | So I'm a psychology major, and I do
01:09:00.460 | realize that a lot of this stuff that we're trying to do here,
01:09:03.100 | like reasoning about parts of the game,
01:09:04.500 | is like how we figure out how our brains work.
01:09:07.860 | So do you think that this--
01:09:10.180 | would there be stuff that transfers, that's
01:09:13.460 | applicable to neural networks?
01:09:15.180 | And on the contrary, do you think
01:09:16.820 | there might be this interpretability
01:09:18.260 | study of neural networks to help us understand stuff
01:09:21.220 | about our own brain?
01:09:22.500 | Yeah.
01:09:23.260 | Talk to Geoff Hinton.
01:09:25.420 | He would really like this.
01:09:26.580 | So I believe-- I mean, you probably
01:09:28.260 | know about this history.
01:09:29.300 | I think that's how it all started, right?
01:09:31.580 | The whole neural network idea was to understand the human brain.
01:09:37.140 | So that's the answer to your question.
01:09:39.540 | Interestingly, however, in my view,
01:09:41.780 | there are some biases that we have in neuroscience
01:09:46.580 | because of the limitations of tools,
01:09:48.420 | like physical tools and the availability of humans
01:09:50.700 | that you can poke into.
01:09:52.100 | I think that influences interpretability research.
01:09:54.660 | And I'll give you an example of what I mean.
01:09:56.540 | So in cats, the horizontal-line and vertical-line neurons
01:10:00.580 | in the cat brain-- they put the probe in and figured out
01:10:03.460 | that this one neuron detects vertical lines.
01:10:05.540 | And you can validate it.
01:10:07.180 | It's really cool if you look at the video.
01:10:08.900 | The video is still online.
01:10:10.260 | Yeah, what is it?
01:10:11.220 | [INAUDIBLE]
01:10:12.420 | Yes, yes, yes.
01:10:14.020 | So why did they do that?
01:10:15.780 | Well, because you had one cat, poor, poor cat.
01:10:20.100 | And you had-- we can only probe a few neurons at a time, right?
01:10:25.340 | So that meant a lot of interpretability research
01:10:28.700 | actually looked at--
01:10:29.900 | is very focused on neuron-wise representations.
01:10:32.900 | This one neuron must be very special.
01:10:35.260 | I actually think that's not true.
01:10:36.940 | That was limited by our ability, like our physical ability
01:10:40.100 | to probe organisms.
01:10:41.140 | But in a neural network, you don't have to do that.
01:10:43.140 | You can apply functions to embeddings.
01:10:44.860 | You can change the whole embedding to something else,
01:10:47.020 | overwrite it.
01:10:48.060 | So that kind of thing is actually an obstacle in our thinking
01:10:54.100 | rather than helping.
01:10:54.900 | OK, maybe we should call it there.
01:11:03.460 | So for Thursday, we're not having a lecture on Thursday.
01:11:08.740 | There'll be TAs and me here.
01:11:11.020 | So if you have any last-minute panics on your project
01:11:14.860 | or think we might have some great insight to help you,
01:11:18.700 | we probably won't actually.
01:11:19.860 | It'll be all right.
01:11:22.220 | Do come along, and you can chat to us about your final projects
01:11:25.940 | and we can give you help.
01:11:27.860 | That means that Been actually got
01:11:29.740 | to give the final lecture of CS224N today.
01:11:33.460 | So a round of applause for her.
01:11:35.500 | [APPLAUSE]
01:11:38.060 | [BLANK_AUDIO]