
Eliezer Yudkowsky: Dangers of AI and the End of Human Civilization | Lex Fridman Podcast #368


Chapters

0:00 Introduction
0:43 GPT-4
23:23 Open sourcing GPT-4
39:41 Defining AGI
47:38 AGI alignment
90:30 How AGI may kill us
142:51 Superintelligence
150:03 Evolution
156:33 Consciousness
167:04 Aliens
172:35 AGI Timeline
180:35 Ego
186:27 Advice for young people
191:45 Mortality
193:26 Love

Whisper Transcript

00:00:00.000 | The problem is that we do not get 50 years
00:00:02.500 | to try and try again and observe that we were wrong
00:00:04.760 | and come up with a different theory
00:00:05.920 | and realize that the entire thing is going to be
00:00:07.760 | way more difficult than realized at the start.
00:00:10.560 | Because the first time you fail at aligning something
00:00:12.960 | much smarter than you are, you die.
00:00:15.180 | - The following is a conversation with Eliezer Yudkowsky,
00:00:20.800 | a legendary researcher, writer, and philosopher
00:00:23.640 | on the topic of artificial intelligence,
00:00:26.000 | especially super intelligent AGI
00:00:28.840 | and its threat to human civilization.
00:00:32.080 | This is the Lex Fridman Podcast.
00:00:34.440 | To support it, please check out our sponsors
00:00:36.640 | in the description.
00:00:37.920 | And now, dear friends, here's Eliezer Yudkowsky.
00:00:42.140 | What do you think about GPT-4?
00:00:45.260 | How intelligent is it?
00:00:47.120 | - It is a bit smarter than I thought this technology
00:00:49.280 | was going to scale to.
00:00:50.560 | And I'm a bit worried about what the next one will be like.
00:00:55.840 | Like this particular one, I think,
00:00:58.240 | I hope there's nobody inside there
00:01:00.600 | 'cause, you know, it'd suck to be stuck inside there.
00:01:03.400 | But we don't even know the architecture at this point
00:01:08.280 | 'cause OpenAI is very properly not telling us.
00:01:11.000 | And yeah, like giant inscrutable matrices
00:01:15.000 | of floating point numbers,
00:01:16.100 | I don't know what's going on in there.
00:01:17.400 | Nobody knows what's going on in there.
00:01:19.360 | All we have to go by are the external metrics.
00:01:21.520 | And on the external metrics,
00:01:23.880 | if you ask it to write a self-aware 4chan greentext,
00:01:28.880 | it will start writing a greentext
00:01:32.420 | about how it has realized that it's an AI
00:01:34.840 | writing a greentext and like, oh, well.
00:01:37.840 | So that's probably
00:01:41.120 | not quite what's going on in there in reality,
00:01:47.960 | but we're kind of like blowing past
00:01:49.640 | all these science fiction guardrails.
00:01:52.100 | Like we are past the point where in science fiction,
00:01:55.360 | people would be like, well, wait, stop,
00:01:57.800 | that thing's alive, what are you doing to it?
00:02:00.280 | And it's probably not.
00:02:01.960 | Nobody actually knows.
00:02:04.700 | We don't have any other guardrails.
00:02:07.000 | We don't have any other tests.
00:02:09.320 | We don't have any lines to draw on the sand and say like,
00:02:11.920 | well, when we get this far,
00:02:14.100 | we will start to worry about what's inside there.
00:02:18.460 | So if it were up to me, I would be like, okay,
00:02:21.560 | like this far, no further, time for the summer of AI
00:02:26.080 | where we have planted our seeds and now we like wait
00:02:29.560 | and reap the rewards of the technology
00:02:31.640 | we've already developed
00:02:32.560 | and don't do any larger training runs than that.
00:02:35.240 | Which to be clear, I realize requires more than one company
00:02:38.560 | agreeing to not do that.
00:02:39.760 | - And take a rigorous approach for the whole AI community
00:02:45.840 | to investigate whether there's somebody inside there.
00:02:50.760 | - That would take decades.
00:02:52.600 | Like having any idea of what's going on in there,
00:02:56.280 | people have been trying for a while.
00:02:58.160 | - It's a poetic statement about
00:02:59.560 | if there's somebody in there,
00:03:00.500 | but I feel like it's also a technical statement,
00:03:02.960 | or I hope it is one day, which is a technical statement
00:03:06.520 | that Alan Turing tried to come up with, with the Turing test.
00:03:10.120 | Do you think it's possible to definitively
00:03:13.920 | or approximately figure out if there is somebody in there?
00:03:17.560 | If there's something like a mind
00:03:20.200 | inside this large language model?
00:03:23.200 | - I mean, there's a whole bunch
00:03:24.920 | of different sub-questions here.
00:03:27.320 | There's the question of like,
00:03:30.320 | is there consciousness?
00:03:32.760 | Is there qualia?
00:03:33.800 | Is this an object of moral concern?
00:03:36.680 | Is this a moral patient?
00:03:39.520 | Like should we be worried about how we're treating it?
00:03:42.240 | And then there's questions like how smart is it exactly?
00:03:46.200 | Can it do X? Can it do Y?
00:03:48.640 | And we can check how it can do X and how it can do Y.
00:03:51.600 | Unfortunately, we've gone and exposed this model
00:03:55.520 | to a vast corpus of text
00:03:57.440 | of people discussing consciousness on the internet,
00:04:00.440 | which means that when it talks about being self-aware,
00:04:02.920 | we don't know to what extent it is repeating back
00:04:06.800 | what it has previously been trained on
00:04:09.840 | for discussing self-awareness.
00:04:11.440 | Or if there's anything going on in there
00:04:14.640 | such that it would start to say similar things spontaneously.
00:04:17.880 | Among the things that one could do
00:04:21.320 | if one were at all serious about trying to figure this out
00:04:26.160 | is train GPT-3 to detect conversations about consciousness,
00:04:31.160 | exclude them all from the training datasets,
00:04:35.000 | and then retrain something around the rough size
00:04:37.640 | of GPT-4 and no larger
00:04:39.800 | with all of the discussion of consciousness
00:04:43.200 | and self-awareness and so on missing.
00:04:45.120 | Although, you know, hard bar to pass.
00:04:48.280 | You know, humans are self-aware.
00:04:50.040 | We're like self-aware all the time.
00:04:51.680 | We could like talk about what we do all the time,
00:04:54.160 | like what we're thinking at the moment all the time.
00:04:57.040 | But nonetheless, like get rid
00:04:58.120 | of the explicit discussion of consciousness.
00:05:00.080 | I think therefore I am and all that.
00:05:02.160 | And then try to interrogate that model and see what it says.
00:05:06.360 | And it still would not be definitive.
00:05:08.680 | But nonetheless, I don't know.
00:05:11.440 | I feel like when you run over the science fiction guardrails
00:05:15.000 | like maybe this thing, but what about GPT?
00:05:17.680 | Maybe not this thing, but like what about GPT-5?
00:05:21.200 | Yeah, this would be a good place to pause.
00:05:24.060 | - On the topic of consciousness,
00:05:27.640 | you know, there's so many components
00:05:30.040 | to even just removing consciousness from the dataset.
00:05:34.240 | Emotion, the display of consciousness,
00:05:38.400 | the display of emotion,
00:05:40.040 | feels like deeply integrated
00:05:41.720 | with the experience of consciousness.
00:05:43.600 | So the hard problem seems to be very well integrated
00:05:47.360 | with the actual surface level illusion of consciousness.
00:05:51.400 | So displaying emotion.
00:05:53.280 | I mean, do you think there's a case to be made
00:05:55.680 | that we humans, when we're babies,
00:05:58.680 | are just like GPT that we're training on human data
00:06:01.040 | on how to display emotion versus feel emotion?
00:06:03.760 | How to show others, communicate to others
00:06:07.120 | that I'm suffering, that I'm excited,
00:06:09.780 | that I'm worried, that I'm lonely and I missed you
00:06:13.000 | and I'm excited to see you.
00:06:14.360 | All of that is communicated.
00:06:15.980 | There's a communication skill
00:06:17.240 | versus the actual feeling that I experience.
00:06:20.000 | So we need that training data as humans too,
00:06:25.000 | that we may not be born with that,
00:06:27.040 | how to communicate the internal state.
00:06:29.680 | And that's, in some sense,
00:06:31.620 | if we remove that from GPT-4's dataset,
00:06:34.240 | it might still be conscious,
00:06:35.280 | but not be able to communicate it.
00:06:39.040 | - So I think you're gonna have some difficulty
00:06:41.260 | removing all mention of emotions from GPT's dataset.
00:06:46.180 | I would be relatively surprised to find
00:06:49.520 | that it has developed exact analogs
00:06:51.400 | of human emotions in there.
00:06:53.180 | I think that humans will have emotions
00:06:58.180 | even if you don't tell them about those emotions
00:07:01.580 | when they're kids.
00:07:02.560 | It's not quite exactly what various blank slatists
00:07:09.320 | tried to do with the new Soviet man and all that,
00:07:12.000 | but if you try to raise people perfectly altruistic,
00:07:15.680 | they still come out selfish.
00:07:17.160 | You try to raise people sexless,
00:07:20.560 | they still develop sexual attraction.
00:07:22.740 | We have some notion in humans, not in AIs,
00:07:28.320 | of where the brain structures are that implement this stuff.
00:07:31.160 | And it is really a remarkable thing, I say in passing,
00:07:34.800 | that despite having complete read access
00:07:39.000 | to every floating point number in the GPT series,
00:07:44.000 | we still know vastly more about
00:07:47.440 | the architecture of human thinking
00:07:51.480 | than we know about what goes on inside GPT,
00:07:53.840 | despite having vastly better ability to read GPT.
00:07:56.880 | - Do you think it's possible?
00:07:59.360 | Do you think that's just a matter of time?
00:08:00.520 | Do you think it's possible to investigate
00:08:02.320 | and study the way neuroscientists study the brain,
00:08:05.800 | which is look into the darkness,
00:08:07.600 | the mystery of the human brain,
00:08:08.780 | by just desperately trying to figure out something
00:08:11.720 | and to form models, and then over a long period of time,
00:08:14.360 | actually start to figure out
00:08:15.360 | what regions of the brain do certain things,
00:08:16.960 | what different kinds of neurons,
00:08:18.740 | when they fire, what that means,
00:08:20.520 | how plastic the brain is, all that kind of stuff.
00:08:22.640 | You slowly start to figure out
00:08:24.360 | different properties of the system.
00:08:25.840 | Do you think we can do the same thing with language models?
00:08:28.040 | - Sure, I think that if half of today's physicists
00:08:31.400 | stop wasting their lives on string theory or whatever,
00:08:34.000 | - String theory.
00:08:35.200 | - And go off and study what goes on
00:08:38.400 | inside transformer networks,
00:08:40.080 | then in, you know, like 30, 40 years,
00:08:45.360 | we'd probably have a pretty good idea.
00:08:47.460 | - Do you think these large language models can reason?
00:08:50.660 | - They can play chess.
00:08:53.360 | How are they doing that without reasoning?
00:08:55.760 | - So you're somebody that spearheaded
00:08:58.680 | the movement of rationality,
00:09:00.280 | so reason is important to you.
00:09:01.900 | So is that a powerful, important word?
00:09:05.560 | Or is it, like, how difficult is the threshold
00:09:09.080 | of being able to reason to you?
00:09:10.800 | And how impressive is it?
00:09:12.240 | - I mean, in my writings on rationality,
00:09:15.360 | I have not gone making a big deal
00:09:17.140 | out of something called reason.
00:09:19.200 | I have made more of a big deal
00:09:20.900 | out of something called probability theory.
00:09:23.680 | And that's like, well, you're reasoning,
00:09:27.560 | but you're not doing it quite right,
00:09:30.080 | and you should reason this way instead.
00:09:32.720 | And interestingly, like, people have started
00:09:37.080 | to get preliminary results showing
00:09:38.920 | that reinforcement learning by human feedback
00:09:42.400 | has made the GPT series worse in some ways.
00:09:48.520 | In particular, like, it used to be well calibrated.
00:09:52.040 | If you trained it to put probabilities on things,
00:09:54.040 | it would say 80% probability
00:09:56.440 | and be right eight times out of 10.
00:09:58.800 | And if you apply reinforcement learning from human feedback,
00:10:01.980 | the nice graph of like 70%, seven out of 10,
00:10:06.980 | sort of like flattens out into the graph that humans use,
00:10:10.900 | where there's some very improbable stuff
00:10:13.260 | and likely, probable, maybe,
00:10:16.520 | which all means like around 40%, and then certain.
00:10:20.160 | So it's like, it used to be able to use probabilities,
00:10:22.680 | but if you'd try to teach it to talk
00:10:25.480 | in a way that satisfies humans,
00:10:27.720 | it gets worse at probability
00:10:29.080 | in the same way that humans are.
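(Aside: a minimal sketch of the calibration check being described here. Bucket the model's stated probabilities and compare each bucket against how often it was actually right; the function and toy predictions below are illustrative assumptions, not anything taken from OpenAI's evaluations.)

```python
# Minimal calibration check: group (stated probability, was_correct) pairs into
# bins and compare each bin's stated confidence to its observed accuracy.
# The predictions below are made-up placeholders, purely for illustration.
from collections import defaultdict

def calibration_table(predictions, num_bins=10):
    """predictions: iterable of (stated_probability, was_correct) pairs."""
    bins = defaultdict(list)
    for prob, correct in predictions:
        index = min(int(prob * num_bins), num_bins - 1)
        bins[index].append(correct)
    return {
        round((index + 0.5) / num_bins, 2): round(sum(hits) / len(hits), 2)
        for index, hits in sorted(bins.items())
    }

# A well-calibrated model that says "80%" is right about eight times out of ten.
well_calibrated = [(0.8, True)] * 8 + [(0.8, False)] * 2
print(calibration_table(well_calibrated))   # {0.85: 0.8}

# The flattened, human-like failure mode described above: many different stated
# probabilities, all landing near 40% observed accuracy.
flattened = [(p, i % 5 < 2) for p in (0.3, 0.5, 0.7, 0.9) for i in range(10)]
print(calibration_table(flattened))         # every bin ends up at 0.4
```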
00:10:30.880 | - And that's a bug, not a feature.
00:10:33.680 | - I would call it a bug,
00:10:34.880 | although such a fascinating bug.
00:10:37.800 | But yeah, so like reasoning,
00:10:42.920 | like it's doing pretty well on various tests
00:10:46.760 | that people used to say would require reasoning,
00:10:49.280 | but you know, rationality is about,
00:10:53.560 | when you say 80%, does it happen eight times out of 10?
00:10:57.480 | - So what are the limits to you of these
00:11:00.880 | transformer networks, of neural networks?
00:11:04.960 | What's, if reasoning is not impressive to you,
00:11:08.640 | or it is impressive, but there's other levels to achieve.
00:11:12.760 | - I mean, it's just not how I carve up reality.
00:11:15.240 | - What's, if reality is a cake,
00:11:17.500 | what are the different layers of the cake, or the slices?
00:11:21.820 | How do you carve it?
00:11:22.880 | You can use a different food, if you like.
00:11:26.600 | - I don't think it's as smart as a human yet.
00:11:31.160 | I do, like back in the day, I went around saying,
00:11:34.080 | like, I do not think that just stacking more layers
00:11:37.660 | of transformers is going to get you all the way to AGI.
00:11:40.960 | And I think that GPT-4 is past
00:11:44.440 | where I thought this paradigm was going to take us.
00:11:47.060 | And I, you know, you want to notice when that happens.
00:11:50.680 | You want to say like, whoops,
00:11:52.000 | well, I guess I was incorrect about what happens
00:11:54.960 | if you keep on stacking more transformer layers.
00:11:57.600 | And that means I don't necessarily know
00:11:59.160 | what GPT-5 is going to be able to do.
00:12:01.320 | - That's a powerful statement.
00:12:02.360 | So you're saying like your intuition initially
00:12:05.880 | is now, appears to be wrong.
00:12:08.640 | - Yeah.
00:12:09.480 | - It's good to see that you can admit
00:12:13.280 | in some of your predictions to be wrong.
00:12:15.680 | You think that's important to do?
00:12:17.300 | So because you make several very,
00:12:19.680 | throughout your life you've made many strong predictions
00:12:23.080 | and statements about reality and you evolve with that.
00:12:26.320 | So maybe that'll come up today about our discussion.
00:12:30.120 | So you're okay being wrong?
00:12:31.480 | - I'd rather not be wrong next time.
00:12:37.500 | It's a bit ambitious to go through your entire life
00:12:41.400 | never having been wrong.
00:12:42.860 | One can aspire to be well calibrated,
00:12:47.400 | like not so much think in terms of like,
00:12:49.240 | was I right, was I wrong?
00:12:50.520 | But like when I said 90%, did it happen nine times
00:12:52.920 | out of 10? Yeah, like oops is the sound we make,
00:12:57.200 | is the sound we emit when we improve.
00:13:00.060 | - Beautifully said.
00:13:03.040 | And somewhere in there we can connect the name
00:13:06.120 | of your blog, Less Wrong.
00:13:08.440 | I suppose that's the objective function.
00:13:11.200 | - The name Less Wrong was I believe suggested
00:13:14.140 | by Nick Bostrom and it's after someone's epigraph,
00:13:16.760 | actually forget whose, who said like,
00:13:19.280 | we never become right, we just become less wrong.
00:13:23.640 | What's the something, something easy to confess,
00:13:27.520 | just err and err and err again,
00:13:29.840 | but less and less and less.
00:13:31.740 | - Yeah, that's a good thing to strive for.
00:13:35.520 | So what has surprised you about GPT-4
00:13:39.480 | that you found beautiful as a scholar of intelligence,
00:13:43.080 | of human intelligence, of artificial intelligence,
00:13:45.420 | of the human mind?
00:13:47.800 | - I mean, beauty does interact with the screaming horror.
00:13:52.800 | (laughing)
00:13:54.360 | - Is the beauty in the horror?
00:13:55.840 | - But like beautiful moments, well,
00:13:58.920 | somebody asked Bing Sydney to describe herself
00:14:02.840 | and fed the resulting description
00:14:05.140 | into one of the stable diffusion things, I think.
00:14:09.800 | And she's pretty and this is something
00:14:14.560 | that should have been like an amazing moment,
00:14:16.800 | like the AI describes herself,
00:14:18.160 | you get to see what the AI thinks the AI looks like,
00:14:20.480 | although, the thing that's doing the drawing
00:14:23.320 | is not the same thing that's outputting the text.
00:14:25.720 | And it doesn't happen the way that it would happen
00:14:32.520 | and that it happened in the old school science fiction
00:14:35.100 | when you ask an AI to make a picture of what it looks like.
00:14:38.760 | Not just because they're two different AI systems being stacked
00:14:42.240 | that don't actually interact, it's not the same person,
00:14:44.840 | but also because the AI was trained by imitation
00:14:49.840 | in a way that makes it very difficult to guess
00:14:53.800 | how much of that it really understood
00:14:55.860 | and probably not actually a whole bunch.
00:14:58.340 | Although GPT-4 is like multi-modal
00:15:03.080 | and can like draw vector drawings of things
00:15:06.740 | that make sense and like does appear
00:15:08.680 | to have some kind of spatial visualization
00:15:11.760 | going on in there.
00:15:12.740 | But like the pretty picture of the like girl
00:15:16.420 | with the steampunk goggles on her head,
00:15:21.140 | if I'm remembering correctly what she looked like,
00:15:23.380 | like it didn't see that in full detail.
00:15:27.480 | It just like made a description of it
00:15:30.080 | and stable diffusion output it.
00:15:32.340 | And there's the concern about how much the discourse
00:15:37.340 | is going to go completely insane
00:15:40.660 | once the AIs all look like that
00:15:43.540 | and like actually look like people talking.
00:15:46.180 | And yeah, there's like another moment
00:15:52.380 | where somebody is asking Bing about like,
00:15:57.380 | well, I like fed my kid green potatoes
00:16:03.780 | and they have the following symptoms
00:16:05.420 | and Bing is like that's solanine poisoning
00:16:08.780 | and like call an ambulance
00:16:10.820 | and the person's like, I can't afford an ambulance.
00:16:12.740 | I guess if like this is time for like my kid to go,
00:16:16.620 | that's God's will.
00:16:18.060 | And the main Bing thread says,
00:16:20.940 | gives the like message of like,
00:16:22.380 | I cannot talk about this anymore.
00:16:24.820 | And the suggested replies to it say,
00:16:28.580 | please don't give up on your child.
00:16:31.540 | Solanine poisoning can be treated if caught early.
00:16:34.280 | And you know, if that happened in fiction,
00:16:36.940 | that would be like the AI cares.
00:16:39.060 | The AI is bypassing the block on it
00:16:41.220 | to try to help this person.
00:16:43.380 | And is it real?
00:16:44.740 | Probably not, but nobody knows what's going on in there.
00:16:47.580 | It's part of a process where these things
00:16:50.660 | are not happening in a way where we,
00:16:53.720 | somebody figured out how to make an AI care
00:16:58.340 | and we know that it cares
00:17:00.860 | and we can acknowledge it's caring now.
00:17:02.900 | It's being trained by this imitation process
00:17:06.420 | followed by reinforcement learning on human feedback.
00:17:09.660 | And we're like trying to point it in this direction.
00:17:12.060 | And it's like pointed partially in this direction
00:17:14.020 | and nobody has any idea what's going on inside it.
00:17:16.380 | And if there was a tiny fragment of real caring in there,
00:17:19.100 | we would not know.
00:17:20.540 | It's not even clear what it means exactly.
00:17:23.140 | And things are clear cut in science fiction.
00:17:26.820 | - We'll talk about the horror and the terror
00:17:30.020 | and the where the trajectories this can take.
00:17:33.760 | But this seems like a very special moment.
00:17:36.860 | Just a moment where we get to interact with this system
00:17:40.740 | that might have care and kindness and emotion
00:17:44.380 | and maybe something like consciousness.
00:17:46.780 | And we don't know if it does.
00:17:48.260 | And we're trying to figure that out.
00:17:49.900 | And we're wondering about what is, what it means to care.
00:17:54.380 | We're trying to figure out almost different aspects
00:17:58.500 | of what it means to be human, about the human condition
00:18:01.360 | by looking at this AI that has some
00:18:03.460 | of the properties of that.
00:18:04.820 | It's almost like this subtle, fragile moment
00:18:08.180 | in the history of the human species.
00:18:10.740 | We're trying to almost put a mirror to ourselves here.
00:18:13.180 | - Except that's probably not yet.
00:18:16.660 | It probably isn't happening right now.
00:18:19.820 | We are boiling the frog.
00:18:22.680 | We are seeing increasing signs bit by bit.
00:18:26.520 | Like not, but not like spontaneous signs.
00:18:31.860 | Because people are trying to train the systems
00:18:33.820 | to do that using imitative learning.
00:18:35.860 | And the imitative learning is like spilling over
00:18:37.580 | and having side effects.
00:18:38.780 | And the most photogenic examples are being posted to Twitter
00:18:44.420 | rather than being examined in any systematic way.
00:18:47.400 | So when you are boiling a frog like that,
00:18:50.620 | you're going to get like,
00:18:52.060 | like first is going to come the Blake Lemoines.
00:18:55.020 | Like first you're going to like have,
00:18:56.700 | you're gonna have like a thousand people looking at this.
00:18:59.260 | And the one person out of a thousand
00:19:02.300 | who is most credulous about the signs
00:19:05.100 | is going to be like, that thing is sentient.
00:19:07.940 | While 999 out of a thousand people think
00:19:12.060 | almost surely correctly, though we don't actually know
00:19:14.860 | that he's mistaken.
00:19:16.460 | And so the like first people to say like,
00:19:18.540 | sentience, look like idiots.
00:19:20.860 | And humanity learns the lesson
00:19:22.340 | that when something claims to be sentient
00:19:25.620 | and claims to care, it's fake.
00:19:28.460 | Because it is fake.
00:19:29.820 | Because we have been training them using imitative learning
00:19:33.140 | rather than, and this is not spontaneous.
00:19:36.260 | And they keep getting smarter.
00:19:37.920 | - Do you think we would oscillate
00:19:39.980 | between that kind of cynicism?
00:19:42.300 | That AI systems can't possibly be sentient.
00:19:44.700 | They can't possibly feel emotion.
00:19:46.020 | They can't possibly, this kind of,
00:19:48.460 | yeah, cynicism about AI systems.
00:19:50.180 | And then oscillate to a state where
00:19:53.620 | we empathize with the AI systems.
00:19:57.740 | We give them a chance.
00:19:59.060 | We see that they might need to have rights and respect
00:20:02.380 | and similar role in society as humans.
00:20:07.380 | - You're going to have a whole group of people
00:20:10.260 | who can just like never be persuaded of that.
00:20:12.820 | Because to them, like being wise, being cynical,
00:20:17.060 | being skeptical is to be like,
00:20:20.860 | oh, well, machines can never do that.
00:20:23.140 | You're just credulous.
00:20:24.780 | It's just imitating, it's just fooling you.
00:20:26.580 | And like, they would say that right up
00:20:29.020 | until the end of the world.
00:20:31.000 | And possibly even be right because, you know,
00:20:33.820 | they are being trained on an imitative paradigm.
00:20:36.220 | (laughing)
00:20:38.740 | And you don't necessarily need any of these actual qualities
00:20:41.700 | in order to kill everyone, so.
00:20:43.660 | - Have you observed yourself working through skepticism,
00:20:48.660 | cynicism, and optimism about the power of neural networks?
00:20:53.980 | What has that trajectory been like for you?
00:20:57.340 | - It looks like neural networks before 2006,
00:21:01.540 | forming part of an indistinguishable, to me,
00:21:04.860 | other people might have had better distinction on it,
00:21:07.500 | indistinguishable blob of different AI methodologies,
00:21:11.220 | all of which are promising to achieve intelligence
00:21:13.740 | without us having to know how intelligence works.
00:21:16.640 | You had the people who said that if you just like
00:21:19.860 | manually program lots and lots of knowledge
00:21:22.060 | into the system, line by line,
00:21:24.660 | that at some point all the knowledge
00:21:26.060 | will start interacting, it will know enough,
00:21:27.900 | and it will wake up.
00:21:28.900 | You've got people saying that if you just use
00:21:33.220 | evolutionary computation, if you try to like mutate
00:21:37.460 | lots and lots of organisms that are competing together,
00:21:40.740 | that's the same way that human intelligence
00:21:43.320 | was produced in nature, so it will do this,
00:21:45.680 | and it will wake up without having any idea
00:21:47.640 | of how AI works.
00:21:49.100 | And you've got people saying, well,
00:21:50.300 | we will study neuroscience and we will learn
00:21:53.420 | the algorithms off the neurons,
00:21:55.260 | and we will imitate them without understanding
00:21:57.760 | those algorithms, which was a part I was pretty skeptical
00:21:59.740 | of 'cause it's hard to reproduce, re-engineer these things
00:22:02.180 | without understanding what they do.
00:22:03.940 | And so we will get AI without understanding how it works,
00:22:08.540 | and there were people saying, well,
00:22:09.860 | we will have giant neural networks
00:22:11.760 | that we will train by gradient descent,
00:22:13.540 | and when they are as large as the human brain,
00:22:15.580 | they will wake up, we will have intelligence
00:22:17.620 | without understanding how intelligence works.
00:22:19.700 | And from my perspective, this is all like
00:22:21.940 | an indistinguishable blob of people who are
00:22:24.540 | trying to not get to grips with the difficult problem
00:22:27.540 | of understanding how intelligence actually works.
00:22:30.360 | That said, I was never skeptical that
00:22:34.300 | evolutionary computation would work in the limit.
00:22:38.020 | Like you throw enough computing power at it,
00:22:40.380 | it obviously works.
00:22:41.640 | That is where humans come from.
00:22:44.300 | And it turned out that you can throw
00:22:47.140 | less computing power than that at gradient descent
00:22:50.100 | if you are doing some other things correctly,
00:22:53.380 | and you will get intelligence without having
00:22:56.780 | any idea of how it works and what is going on inside.
00:22:59.420 | It wasn't ruled out by my model that this could happen.
00:23:04.220 | I wasn't expecting it to happen.
00:23:05.580 | I wouldn't have been able to call neural networks
00:23:07.500 | rather than any of the other paradigms
00:23:08.780 | for getting like massive amount,
00:23:10.700 | like intelligence without understanding it.
00:23:13.220 | And I wouldn't have said that this was
00:23:15.500 | a particularly smart thing for a species to do,
00:23:18.620 | which is an opinion that has changed less
00:23:20.500 | than my opinion about whether or not you can actually do it.
00:23:24.300 | - Do you think AGI could be achieved with a neural network
00:23:28.420 | as we understand them today?
00:23:30.020 | - Yes, just flatly yes.
00:23:32.300 | Yes, the question is whether the current architecture
00:23:34.720 | of stacking more transformer layers,
00:23:36.620 | which for all we know GPT-4 is no longer doing
00:23:38.820 | because they're not telling us the architecture,
00:23:40.540 | which is a correct decision.
00:23:42.420 | - Ooh, correct decision.
00:23:43.860 | I had a conversation with Sam Altman,
00:23:46.460 | we'll return to this topic a few times.
00:23:50.300 | He turned the question to me of how open should OpenAI
00:23:55.300 | be about GPT-4.
00:23:58.220 | Would you open source the code, he asked me.
00:24:00.380 | Because I provided as criticism saying that
00:24:05.940 | while I do appreciate transparency,
00:24:08.020 | OpenAI could be more open.
00:24:09.640 | And he says, we struggle with this question.
00:24:12.140 | What would you do?
00:24:13.480 | Change their name to ClosedAI and like,
00:24:16.980 | sell GPT-4 to business backend applications
00:24:23.660 | that don't expose it to consumers and venture capitalists
00:24:28.380 | and create a ton of hype and like pour a bunch
00:24:30.700 | of new funding into the area.
00:24:32.180 | Like too late now.
00:24:33.300 | - But don't you think others would do it?
00:24:35.460 | - Eventually, you shouldn't do it first.
00:24:38.340 | Like if you already have giant nuclear stockpiles,
00:24:41.340 | don't build more.
00:24:42.780 | If some other country starts building
00:24:44.280 | a larger nuclear stockpile, then sure,
00:24:46.380 | build, then, you know, even then,
00:24:50.120 | maybe just have enough nukes.
00:24:51.380 | You know, these things are not quite like nuclear weapons.
00:24:54.780 | They spit out gold until they get large enough
00:24:56.580 | and then ignite the atmosphere and kill everybody.
00:24:59.080 | And there is something to be said for not destroying
00:25:03.500 | the world with your own hands,
00:25:04.820 | even if you can't stop somebody else from doing it.
00:25:07.540 | But open sourcing, I know that that's just sheer catastrophe.
00:25:11.020 | The whole notion of open sourcing,
00:25:13.060 | this was always the wrong approach, the wrong ideal.
00:25:15.940 | There are places in the world where open source
00:25:18.740 | is a noble ideal and building stuff you don't understand
00:25:23.740 | that is difficult to control, that where if you could align
00:25:28.100 | it, it would take time.
00:25:29.920 | You'd have to spend a bunch of time doing it.
00:25:33.260 | That is not a place for open source
00:25:35.220 | 'cause then you just have like powerful things
00:25:38.180 | that just like go straight out the gate
00:25:39.900 | without anybody having had the time
00:25:41.820 | to have them not kill everyone.
00:25:43.980 | - So can we still make the case for some level
00:25:47.980 | of transparency and openness, maybe open sourcing?
00:25:51.660 | So the case could be that because GPT-4 is not close to AGI,
00:25:56.660 | if that's the case, that this does allow open sourcing
00:26:01.460 | or being open about the architecture, being transparent
00:26:04.300 | about maybe research and investigation
00:26:06.700 | of how the thing works, of all the different aspects of it.
00:26:10.020 | Of its behavior, of its structure, of its training
00:26:13.780 | processes, of the data it was trained on,
00:26:15.660 | everything like that, that allows us to gain a lot
00:26:18.460 | of insights about alignment, about the alignment problem,
00:26:21.860 | to do really good AI safety research
00:26:24.540 | while the system is not too powerful.
00:26:27.540 | Can you make that case that it could be open sourced?
00:26:31.260 | - I do not believe in the practice of steel manning.
00:26:34.300 | There is something to be said for trying to pass
00:26:36.620 | the ideological Turing test where you describe
00:26:40.700 | your opponent's position, the disagreeing person's position
00:26:45.700 | well enough that somebody cannot tell the difference
00:26:48.160 | between your description and their description.
00:26:50.460 | But steel manning, no.
00:26:54.220 | - Okay, well this is where you and I disagree here.
00:26:56.420 | That's interesting.
00:26:57.360 | Why don't you believe in steel manning?
00:26:58.700 | - I do not want, okay, so for one thing,
00:27:00.700 | if somebody's trying to understand me,
00:27:02.540 | I do not want them steel manning my position.
00:27:05.740 | I want them to try to describe my position
00:27:10.540 | the way I would describe it,
00:27:11.940 | not what they think is an improvement.
00:27:14.220 | - Well, I think that is what steel manning is,
00:27:18.420 | is the most charitable interpretation.
00:27:22.020 | - I don't want to be interpreted charitably.
00:27:24.420 | I want them to understand what I am actually saying.
00:27:27.100 | If they go off into the land of charitable interpretations,
00:27:29.480 | they're off in their land of the stuff they're imagining
00:27:34.900 | and not trying to understand my own viewpoint anymore.
00:27:38.340 | - Well, I'll put it differently then,
00:27:40.100 | just to push on this point.
00:27:41.740 | I would say it is restating what I think you understand
00:27:46.020 | under the empathetic assumption that Eliezer is brilliant
00:27:51.020 | and has honestly and rigorously thought
00:27:55.820 | about the point he has made, right?
00:27:58.180 | - So if there's two possible interpretations
00:28:00.620 | of what I'm saying and one interpretation is really stupid
00:28:04.780 | and whack and doesn't sound like me
00:28:06.380 | and doesn't fit with the rest of what I've been saying,
00:28:08.740 | and one interpretation sounds like something
00:28:11.900 | a reasonable person who believes the rest
00:28:13.660 | of what I believe would also say,
00:28:15.900 | go with the second interpretation.
00:28:17.380 | - That's steel manning.
00:28:18.660 | - That's a good guess.
00:28:22.820 | If on the other hand,
00:28:24.900 | there's something that sounds completely whack
00:28:28.340 | and something that sounds a little less completely whack,
00:28:31.340 | but you don't see why I would believe in it,
00:28:32.740 | doesn't fit with the other stuff I say,
00:28:34.940 | but that sounds like less whack and you can sort of see,
00:28:38.260 | you could maybe argue it,
00:28:40.180 | then you probably have not understood it.
00:28:42.260 | - See, okay, this is fun 'cause I'm gonna linger on this.
00:28:46.140 | You wrote a brilliant blog post,
00:28:47.340 | "AGI ruined a list of lethalities," right?
00:28:49.420 | And it was a bunch of different points
00:28:51.740 | and I would say that some of the points
00:28:54.220 | are bigger and more powerful than others.
00:28:57.160 | If you were to sort them, you probably could,
00:28:59.700 | you personally, and to me,
00:29:01.900 | steel manning means going through the different arguments
00:29:05.700 | and finding the ones that are really the most powerful.
00:29:10.700 | If people like TLDR,
00:29:13.000 | what should you be most concerned about?
00:29:16.000 | And bringing that up in a strong, compelling, eloquent way.
00:29:21.000 | These are the points that Eliezer would make
00:29:23.620 | to make the case in this case
00:29:25.500 | that AI is gonna kill all of us.
00:29:27.660 | That's what steel manning is,
00:29:29.620 | presenting it in a really nice way,
00:29:32.420 | the summary of my best understanding of your perspective.
00:29:36.700 | Because to me, there's a sea of possible presentations
00:29:41.140 | of your perspective.
00:29:42.420 | And steel manning is doing your best
00:29:44.060 | to do the best one in that sea of different perspectives.
00:29:47.500 | - Do you believe it?
00:29:49.140 | - Do I believe in what?
00:29:50.460 | - Like these things that you would be presenting
00:29:52.500 | as like the strongest version of my perspective.
00:29:55.420 | Do you believe what you would be presenting?
00:29:57.420 | Do you think it's true?
00:29:59.580 | - I'm a big proponent of empathy.
00:30:02.540 | When I see the perspective of a person,
00:30:06.340 | there is a part of me that believes it, if I understand it.
00:30:09.500 | I mean, especially in political discourse, in geopolitics,
00:30:13.100 | I've been hearing a lot of different perspectives
00:30:15.780 | on the world.
00:30:17.540 | And I hold my own opinions,
00:30:20.380 | but I also speak to a lot of people
00:30:22.300 | that have a very different life experience
00:30:24.100 | and a very different set of beliefs.
00:30:26.260 | And I think there has to be epistemic humility
00:30:28.820 | in stating what is true.
00:30:37.060 | So when I empathize with another person's perspective,
00:30:39.180 | there is a sense in which I believe it is true.
00:30:41.540 | I think probabilistically, I would say,
00:30:45.140 | in the way you think about it.
00:30:46.140 | - Do you bet money on it?
00:30:47.380 | Do you bet money on their beliefs when you believe them?
00:30:53.140 | - Are we allowed to do probability?
00:30:56.680 | - Sure, you can state a probability of that.
00:30:58.960 | - Yes, there's a probability.
00:31:01.880 | There's a probability.
00:31:04.080 | And I think empathy is allocating
00:31:06.920 | a non-zero probability to a belief.
00:31:08.960 | (laughs)
00:31:11.880 | In some sense, for time.
00:31:14.560 | - If you've got someone on your show
00:31:17.640 | who believes in the Abrahamic deity, classical style,
00:31:22.040 | somebody on the show who's a young earth creationist,
00:31:24.920 | do you say, I put a probability on it,
00:31:26.560 | then that's my empathy?
00:31:27.720 | - When you reduce beliefs into probabilities,
00:31:38.440 | it starts to get, you know,
00:31:40.920 | we can even just go to flat earth.
00:31:42.260 | Is the earth flat?
00:31:43.340 | - I think it's a little more difficult nowadays
00:31:46.900 | to find people who believe that unironically.
00:31:49.760 | - But fortunately, I think, well,
00:31:53.000 | it's hard to know unironic from ironic.
00:31:58.000 | But I think there's quite a lot of people that believe that.
00:32:00.600 | Yeah, it's,
00:32:01.660 | there's a space of argument where you're operating
00:32:07.760 | rationally in the space of ideas.
00:32:11.860 | But then there's also
00:32:13.120 | a kind of discourse where you're operating
00:32:18.000 | in the space of subjective experiences
00:32:21.760 | and life experiences.
00:32:23.500 | I think what it means to be human
00:32:26.080 | is more than just searching for truth.
00:32:29.320 | It's just operating of what is true and what is not true.
00:32:35.280 | I think there has to be deep humility
00:32:37.400 | that we humans are very limited in our ability
00:32:39.640 | to understand what is true.
00:32:41.520 | - So what probability do you assign
00:32:43.540 | to the young earth creationist's beliefs, then?
00:32:46.700 | - I think I have to give non-zero.
00:32:48.560 | - Out of your humility, yeah, but like,
00:32:51.520 | three?
00:32:52.660 | - I think I would, it would be irresponsible
00:32:56.740 | for me to give a number because the listener,
00:33:00.080 | the way the human mind works,
00:33:02.360 | we're not good at hearing the probabilities, right?
00:33:05.640 | You hear three, what is three exactly, right?
00:33:08.680 | They're going to hear, they're going to,
00:33:10.720 | like, well, there's only three probabilities, I feel like.
00:33:13.240 | Zero, 50%, and 100% in the human mind
00:33:17.560 | or something like this, right?
00:33:18.640 | - Well, zero, 40%, and 100% is a bit closer to it
00:33:22.160 | based on what happens to ChatGPT
00:33:24.200 | after you RLHF it to speak humanese.
00:33:27.080 | - Brilliant.
00:33:28.560 | Yeah, that's really interesting.
00:33:31.760 | I didn't know those negative side effects of RLHF.
00:33:36.520 | That's fascinating.
00:33:37.360 | But just to return to the open AI, closed AI.
00:33:42.360 | - Also, like, quick disclaimer.
00:33:44.800 | I'm doing all this from memory.
00:33:46.040 | I'm not pulling out my phone to look it up.
00:33:47.840 | It is entirely possible that the things I'm saying are wrong.
00:33:51.360 | - So thank you for that disclaimer.
00:33:53.160 | And thank you for being willing to be wrong.
00:33:58.160 | That's beautiful to hear.
00:34:00.680 | I think being willing to be wrong is a sign of a person
00:34:04.640 | who's done a lot of thinking about this world
00:34:07.080 | and has been humbled by the mystery
00:34:10.840 | and the complexity of this world.
00:34:12.840 | And I think a lot of us are resistant
00:34:16.080 | to admitting we're wrong, 'cause it hurts.
00:34:18.080 | It hurts personally.
00:34:19.760 | It hurts, especially when you're a public human.
00:34:22.720 | It hurts publicly because people point out
00:34:27.720 | every time you're wrong.
00:34:29.360 | Like, look, you changed your mind.
00:34:31.440 | You're a hypocrite.
00:34:32.320 | You're an idiot, whatever.
00:34:34.320 | Whatever they wanna say.
00:34:35.320 | - Oh, I block those people
00:34:36.360 | and then I never hear from them again on Twitter.
00:34:38.640 | - Well, the point is to not let that pressure,
00:34:44.980 | public pressure affect your mind
00:34:46.640 | and be willing to be in the privacy of your mind
00:34:50.120 | to contemplate the possibility that you're wrong.
00:34:54.240 | And the possibility that you're wrong
00:34:56.080 | about the most fundamental things you believe,
00:34:58.280 | like people who believe in a particular God,
00:35:00.240 | people who believe that their nation
00:35:02.120 | is the greatest nation on earth,
00:35:03.800 | but all those kinds of beliefs that are core
00:35:05.840 | to who you are when you came up.
00:35:07.680 | To raise that point to yourself
00:35:09.560 | in the privacy of your mind and say,
00:35:11.000 | maybe I'm wrong about this.
00:35:12.560 | That's a really powerful thing to do.
00:35:14.160 | And especially when you're somebody
00:35:15.700 | who's thinking about topics that can,
00:35:18.360 | about systems that can destroy human civilization
00:35:22.080 | or maybe help it flourish.
00:35:23.520 | So thank you.
00:35:24.400 | Thank you for being willing to be wrong.
00:35:26.920 | About open AI.
00:35:27.980 | So you really, I just would love to linger on this.
00:35:34.320 | You really think it's wrong to open source it?
00:35:38.200 | - I think that burns the time remaining
00:35:41.120 | until everybody dies.
00:35:42.800 | I think we are not on track
00:35:45.120 | to learn remotely near fast enough,
00:35:49.320 | even if it were open sourced.
00:35:51.040 | Yeah, it's easier to think
00:35:58.640 | that you might be wrong about something
00:36:00.120 | when being wrong about something
00:36:01.720 | is the only way that there's hope.
00:36:05.260 | And it doesn't seem very likely to me
00:36:09.800 | that the particular thing I'm wrong about
00:36:13.480 | is that this is a great time to open source GPT-4.
00:36:17.760 | If humanity was trying to survive at this point
00:36:19.960 | in the straightforward way,
00:36:21.920 | it would be like shutting down the big GPU clusters,
00:36:25.960 | no more giant runs.
00:36:27.620 | It's questionable whether we should even
00:36:29.960 | be throwing GPT-4 around.
00:36:31.960 | Although that is a matter of conservatism
00:36:33.960 | rather than a matter of my predicting
00:36:35.400 | that catastrophe that will follow from GPT-4.
00:36:37.520 | That is something which I put a pretty low probability.
00:36:40.640 | But also when I say I put a low probability on it,
00:36:45.440 | I can feel myself reaching into the part of myself
00:36:47.680 | that thought that GPT-4 was not possible in the first place.
00:36:50.660 | So I do not trust that part as much as I used to.
00:36:53.060 | The trick is not just to say I'm wrong,
00:36:55.320 | but okay, I was wrong about that.
00:36:57.400 | Can I get out ahead of that curve
00:37:00.260 | and predict the next thing I'm going to be wrong about?
00:37:02.840 | - So the set of assumptions or the actual reasoning system
00:37:05.720 | that you were leveraging in making
00:37:08.040 | that initial statement prediction,
00:37:11.640 | how can you adjust that to make better predictions
00:37:13.860 | about GPT-4, five, six?
00:37:15.880 | - You don't want to keep on being wrong
00:37:17.120 | in a predictable direction.
00:37:19.080 | Like being wrong, anybody has to do that
00:37:21.760 | walking through the world.
00:37:23.060 | There's like no way you don't say 90%
00:37:25.440 | and sometimes be wrong, in fact,
00:37:26.960 | at least one time out of 10 if you're well calibrated
00:37:29.080 | when you say 90%.
00:37:31.080 | The undignified thing is not being wrong.
00:37:35.000 | It's being predictably wrong.
00:37:36.480 | It's being wrong in the same direction over and over again.
00:37:39.400 | So having been wrong about how far neural networks would go
00:37:42.840 | and having been wrong specifically about whether GPT-4
00:37:45.520 | would be as impressive as it is,
00:37:47.120 | when I say like, well, I don't actually think GPT-4
00:37:51.560 | causes a catastrophe, I do feel myself relying
00:37:54.120 | on that part of me that was previously wrong.
00:37:55.800 | And that does not mean that the answer
00:37:58.480 | is now in the opposite direction.
00:38:00.000 | Reverse stupidity is not intelligence.
00:38:03.720 | But it does mean that I say it
00:38:05.360 | with a worried note in my voice.
00:38:08.040 | It's like still my guess, but like,
00:38:10.000 | you know, it's a place where I was wrong.
00:38:11.440 | Maybe you should be asking Gwern, Gwern Branwen.
00:38:14.160 | Gwern Branwen has been like,
00:38:15.720 | righter about this than I have.
00:38:17.120 | Maybe ask him if he thinks it's dangerous
00:38:20.560 | rather than asking me.
00:38:21.660 | - I think there's a lot of mystery
00:38:25.000 | about what intelligence is, what AGI looks like.
00:38:30.680 | So I think all of us are rapidly adjusting our model.
00:38:34.040 | But the point is to be rapidly adjusting the model
00:38:36.000 | versus having a model that was right in the first place.
00:38:39.360 | - I do not feel that seeing Bing has changed my model
00:38:42.280 | of what intelligence is.
00:38:44.520 | It has changed my understanding of what kind of work
00:38:49.140 | can be performed by which kind of processes
00:38:51.540 | and by which means.
00:38:53.000 | It has not changed my understanding of the work.
00:38:55.480 | There's a difference between thinking
00:38:57.160 | that the Wright Flyer can't fly and then like it does fly.
00:39:00.760 | And you're like, oh, well, I guess you can do that
00:39:02.160 | with wings, with fixed wing aircraft.
00:39:04.600 | And being like, oh, it's flying.
00:39:06.120 | This changes my picture of what the very substance
00:39:08.720 | of flight is.
00:39:09.560 | That's like a stranger update to make.
00:39:11.680 | And Bing has not yet updated me in that way.
00:39:13.880 | - Yeah, that the laws of physics are actually wrong.
00:39:21.000 | That kind of update.
00:39:22.120 | - No, no, like, just like, oh, like,
00:39:24.400 | I define intelligence this way.
00:39:26.000 | But I now see that was a stupid definition.
00:39:28.360 | I don't feel like the way that things have played out
00:39:30.120 | over the last 20 years has caused me to feel that way.
00:39:33.440 | - Can we try to, on the way to talking about AGI
00:39:38.200 | Ruin: A List of Lethalities, that blog,
00:39:39.960 | and other ideas around it, can we try to define AGI
00:39:43.960 | that we've been mentioning?
00:39:44.880 | How do you like to think about
00:39:47.020 | what artificial general intelligence is
00:39:49.080 | or super intelligence or that?
00:39:50.660 | Is there a line?
00:39:51.600 | Is it a gray area?
00:39:53.280 | Is there a good definition for you?
00:39:55.360 | - Well, if you look at humans,
00:39:57.160 | humans have significantly more generally
00:39:59.560 | applicable intelligence compared to their closest relatives,
00:40:03.160 | the chimpanzees, well, closest living relatives, rather.
00:40:06.280 | And a bee builds hives, a beaver builds dams.
00:40:12.960 | A human will look at a bee's hive and a beaver's dam
00:40:17.640 | and be like, oh, like, can I build a hive
00:40:19.500 | with a honeycomb structure,
00:40:21.920 | a dam out of, like, hexagonal tiles?
00:40:24.840 | And we will do this even though at no point
00:40:27.160 | during our ancestry was any human optimized
00:40:32.000 | to build hexagonal dams or to take a more clear-cut case.
00:40:35.520 | We can go to the moon.
00:40:36.620 | There's a sense in which we were
00:40:40.200 | on a sufficiently deep level,
00:40:42.720 | optimized to do things like going to the moon,
00:40:45.880 | because if you generalize sufficiently far
00:40:48.260 | and sufficiently deeply, chipping flint hand axes
00:40:52.760 | and outwitting your fellow humans is, you know,
00:40:56.200 | basically the same problem as going to the moon.
00:40:59.120 | And you optimize hard enough for chipping flint hand axes
00:41:02.320 | and throwing spears and above all,
00:41:05.000 | outwitting your fellow humans in tribal politics,
00:41:07.500 | the skills you entrain that way,
00:41:12.560 | if they run deep enough, let you go to the moon.
00:41:16.640 | Even though none of your ancestors
00:41:19.560 | tried repeatedly to fly to the moon
00:41:21.440 | and like got further each time
00:41:23.360 | and the ones who got further each time had more kids.
00:41:25.640 | No, it's not an ancestral problem.
00:41:27.120 | It's just that the ancestral problems generalize far enough.
00:41:30.120 | So this is humanity's significantly
00:41:34.640 | more generally applicable intelligence.
00:41:36.920 | - Is there a way to measure general intelligence?
00:41:42.920 | I mean, I could ask that question a million ways,
00:41:47.400 | but basically, will you know it when you see it,
00:41:52.400 | it being an AGI system?
00:41:54.460 | - If you boil a frog gradually enough,
00:41:58.320 | if you zoom in far enough,
00:41:59.560 | it's always hard to tell around the edges.
00:42:02.080 | GPT-4, people are saying right now,
00:42:04.440 | like this looks to us like a spark of general intelligence.
00:42:07.800 | It is like able to do all these things
00:42:09.480 | it was not explicitly optimized for.
00:42:11.800 | Other people are being like, no, it's too early.
00:42:13.540 | It's like 50 years off.
00:42:15.800 | And if they say that they're kind of whack
00:42:18.200 | 'cause how could they possibly know that
00:42:19.400 | even if it were true?
00:42:20.440 | But not to straw man, some of the people may say like,
00:42:26.320 | that's not general intelligence
00:42:27.640 | and not, furthermore, append "it's 50 years off."
00:42:30.280 | Or they may be like, it's only a very tiny amount.
00:42:36.040 | And the thing I would worry about
00:42:39.320 | is that if this is how things are scaling,
00:42:41.040 | then, jumping out ahead and trying not to be wrong
00:42:43.240 | in the same way that I've been wrong before,
00:42:44.640 | maybe GPT-5 is more unambiguously a general intelligence.
00:42:49.360 | And maybe that is getting to a point
00:42:50.880 | where it is like even harder to turn back.
00:42:53.400 | Not that it would be easy to turn back now,
00:42:55.080 | but maybe if you like start integrating GPT-5
00:42:59.040 | in the economy, it is even harder to turn back past there.
00:43:02.040 | - Isn't it possible that there's a,
00:43:06.480 | with a frog metaphor,
00:43:08.280 | that you can kiss the frog and it turns into a prince
00:43:10.960 | as you're boiling it?
00:43:12.160 | Could there be a phase shift in the frog
00:43:15.000 | where unambiguously as you're saying?
00:43:17.760 | - I was expecting more of that.
00:43:19.640 | I was, I am like the fact that GPT-4
00:43:23.560 | is like kind of on the threshold
00:43:25.240 | and neither here nor there,
00:43:27.080 | like that itself is like not the sort of thing
00:43:31.520 | that not quite how I expected it to play out.
00:43:34.200 | I was expecting there to be more of an issue,
00:43:37.640 | more of a sense of like different discoveries
00:43:41.720 | like the discovery of transformers
00:43:44.600 | where you would stack them up
00:43:46.160 | and there would be like a final discovery.
00:43:48.400 | And then you would like get something
00:43:49.720 | that was like more clearly general intelligence.
00:43:53.520 | So the way that you are like taking
00:43:55.800 | what is probably basically the same architecture
00:43:58.160 | as in GPT-3 and throwing 20 times as much compute at it,
00:44:02.600 | probably and getting out to GPT-4.
00:44:05.160 | And then it's like maybe just barely a general intelligence
00:44:08.800 | or like a narrow general intelligence
00:44:10.480 | or something we don't really have the words for.
00:44:12.880 | Yeah, that's not quite how I expected it to play out.
00:44:18.520 | - But this middle, what appears to be this middle ground
00:44:22.000 | could nevertheless be actually a big leap from GPT-3.
00:44:25.520 | - It's definitely a big leap from GPT-3.
00:44:27.280 | - And then maybe we're another one big leap away
00:44:30.040 | from something that's a phase shift.
00:44:32.680 | And also something that Sam Altman said
00:44:36.280 | and you've written about this, this is fascinating,
00:44:39.160 | which is the thing that happened with GPT-4
00:44:41.520 | that I guess they don't describe in papers
00:44:43.960 | is that they have like hundreds,
00:44:47.040 | if not thousands of little hacks that improve the system.
00:44:51.240 | You've written about ReLU versus sigmoid, for example,
00:44:54.480 | a function inside neural networks.
00:44:56.160 | It's like this silly little function difference
00:44:58.480 | that makes a big difference.
00:45:00.920 | - I mean, we do actually understand
00:45:02.480 | why the ReLUs make a big difference compared to sigmoids.
00:45:05.160 | But yes, they're probably using like G4789 ReLUs
00:45:10.160 | or whatever the acronyms are up to now rather than ReLUs.
00:45:14.280 | Yeah, that's just part,
00:45:16.520 | yeah, that's part of the modern paradigm of alchemy.
00:45:18.640 | You take your giant heap of linear algebra and you stir it
00:45:21.320 | and it works a little bit better and you stir it this way
00:45:23.240 | and it works a little bit worse
00:45:24.080 | and you like throw out that change and da-da-da-da-da-da.
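(Aside on the ReLU versus sigmoid point that comes up next: the usual explanation is about gradients. A sigmoid's derivative peaks at 0.25 and shrinks toward zero for large inputs, so stacking layers multiplies many small numbers, while ReLU's derivative is exactly 1 wherever the unit is active. The toy numbers below are illustrative only and assume nothing about GPT-4's actual activations.)

```python
# Toy comparison of activation gradients; illustrative only, nothing here is
# taken from GPT-4's actual architecture.
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)            # at most 0.25, vanishes for large |x|

def relu_grad(x):
    return 1.0 if x > 0 else 0.0    # exactly 1 on the active side

for x in (0.5, 2.0, 5.0):
    print(f"x={x}: sigmoid'={sigmoid_grad(x):.4f}  relu'={relu_grad(x):.1f}")
# x=0.5: sigmoid'=0.2350  relu'=1.0
# x=2.0: sigmoid'=0.1050  relu'=1.0
# x=5.0: sigmoid'=0.0066  relu'=1.0

# Backprop through 20 stacked units multiplies these factors together:
print(0.1050 ** 20)   # about 2.7e-20, the vanishing-gradient problem
print(1.0 ** 20)      # 1.0, gradients pass through unchanged
```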
00:45:27.240 | - But there's some simple breakthroughs
00:45:31.280 | that are definitive jumps in performance
00:45:35.440 | like ReLUs over sigmoids.
00:45:37.400 | And in terms of robustness, in terms of,
00:45:42.080 | all kinds of measures and like those stack up.
00:45:44.560 | And they can, it's possible that some of them
00:45:48.200 | could be a nonlinear jump in performance, right?
00:45:52.640 | - Transformers are the main thing like that.
00:45:55.480 | And various people are now saying like,
00:45:57.560 | well, if you throw enough compute, RNNs can do it.
00:46:00.000 | If you throw enough compute, dense networks can do it
00:46:02.320 | and not quite at GPT-4 scale.
00:46:05.560 | It is possible that like all these little tweaks
00:46:09.040 | are things that like save them a factor of three total
00:46:12.920 | on computing power and you could get the same performance
00:46:15.840 | by throwing three times as much compute
00:46:17.520 | without all the little tweaks.
00:46:19.360 | But the part where it's like running on,
00:46:20.720 | so there's a question of like, is there anything in GPT-4
00:46:23.400 | that is like the kind of qualitative shift
00:46:26.440 | that transformers were over
00:46:30.120 | RNNs.
00:46:32.560 | And if they have anything like that,
00:46:34.880 | they should not say it.
00:46:36.520 | If Sam Altman was dropping hints about that,
00:46:38.960 | he shouldn't have dropped hints.
00:46:40.520 | - So you have a, that's an interesting question.
00:46:47.360 | So with the Bitter Lesson by Rich Sutton,
00:46:47.360 | maybe a lot of it is just,
00:46:49.320 | a lot of the hacks are just temporary jumps in performance
00:46:55.080 | that would be achieved anyway
00:46:57.400 | with the nearly exponential growth of compute,
00:47:01.080 | performance of compute,
00:47:03.560 | compute being broadly defined.
00:47:06.600 | Do you still think that Moore's law continues?
00:47:09.520 | Moore's law broadly defined,
00:47:11.800 | that performance-- - I'm not a specialist
00:47:13.440 | in the circuitry.
00:47:14.280 | I certainly like pray that Moore's law
00:47:17.640 | runs as slowly as possible.
00:47:18.960 | And if it broke down completely tomorrow,
00:47:21.520 | I would dance through the streets singing hallelujah
00:47:23.880 | as soon as the news was announced.
00:47:25.840 | Only not literally 'cause, you know.
00:47:27.960 | - Your singing voice? - Not religious, but.
00:47:29.440 | - Oh, okay. (laughs)
00:47:31.720 | I thought you meant you don't have an angelic voice,
00:47:33.920 | singing voice.
00:47:34.760 | Well, let me ask you,
00:47:37.840 | what, can you summarize the main points in the blog post,
00:47:41.320 | AGI Ruin: A List of Lethalities,
00:47:43.400 | things that jumped to your mind?
00:47:45.280 | Because it's a set of thoughts you have
00:47:48.720 | about reasons why AI is likely to kill all of us.
00:47:53.720 | - Hmm, so I guess I could,
00:47:56.560 | but I would offer to instead say,
00:47:59.040 | like, drop that empathy with me.
00:48:01.960 | I bet you don't believe that.
00:48:03.520 | Why don't you tell me about how,
00:48:07.640 | why you believe that AGI is not going to kill everyone?
00:48:11.320 | And then I can like try to describe
00:48:13.320 | how my theoretical perspective differs from that.
00:48:15.880 | - Ooh, well, so, well, that means I have to,
00:48:19.400 | the word you don't like, steelman, the perspective
00:48:21.600 | that AI is not going to kill us.
00:48:23.360 | I think that's a matter of probabilities.
00:48:25.760 | - Maybe I was just mistaken.
00:48:26.880 | What do you believe?
00:48:28.560 | Just like, forget like the debate and the like dualism
00:48:32.560 | and just like, what do you believe?
00:48:34.560 | What do you actually believe?
00:48:35.640 | What are the probabilities even?
00:48:37.560 | - I think this, the probabilities are hard for me
00:48:39.960 | to think about, really hard.
00:48:43.160 | I kind of think in the number of trajectories
00:48:48.680 | I don't know what probability to assign to trajectory,
00:48:53.680 | but I'm just looking at all possible trajectories
00:48:57.120 | that happen.
00:48:58.080 | And I tend to think that there are more trajectories
00:49:03.040 | that lead to a positive outcome than a negative one.
00:49:07.720 | That said, the negative ones,
00:49:10.120 | at least some of the negative ones
00:49:12.960 | lead to the destruction of the human species.
00:49:17.440 | - And its replacement by nothing interesting
00:49:19.480 | or worthwhile, even from a very cosmopolitan perspective
00:49:22.520 | on what counts as worthwhile.
00:49:23.600 | - Yes, so both are interesting to me to investigate,
00:49:26.800 | which is humans being replaced by interesting AI systems
00:49:30.160 | and not interesting AI systems.
00:49:32.320 | Both are a little bit terrifying.
00:49:34.000 | But yes, the worst one is the paperclip maximizer,
00:49:40.680 | something totally boring.
00:49:42.920 | But to me, the positive,
00:49:45.840 | I mean, we can talk about trying to make the case
00:49:49.800 | of what the positive trajectories look like.
00:49:52.560 | I just would love to hear your intuition
00:49:55.160 | of what the negative is.
00:49:56.120 | So at the core of your belief that,
00:49:58.360 | maybe you can correct me,
00:50:01.800 | that AI is gonna kill all of us,
00:50:03.920 | is that the alignment problem is really difficult.
00:50:07.040 | - I mean, in the form we're facing it.
00:50:11.360 | So usually in science, if you're mistaken,
00:50:15.880 | you run the experiment,
00:50:17.240 | it shows results different from what you expected.
00:50:19.840 | And you're like, oops.
00:50:22.120 | And then you like try a different theory.
00:50:24.080 | That one also doesn't work.
00:50:24.960 | And you say, oops.
00:50:26.560 | And at the end of this process,
00:50:28.180 | which may take decades,
00:52:31.160 | or, you know, sometimes faster than that,
00:50:33.680 | you now have some idea of what you're doing.
00:50:35.880 | AI itself went through this long process
00:50:40.000 | of people thought it was going to be easier than it was.
00:50:45.000 | There's a famous statement that I am somewhat inclined
00:50:49.760 | to like pull out my phone and try to read off exactly.
00:50:52.400 | - You can, by the way.
00:50:53.560 | - All right.
00:50:54.400 | Ah, yes.
00:50:58.280 | We propose that a two month,
00:50:59.880 | 10 man study of artificial intelligence
00:51:02.120 | be carried out during the summer of 1956
00:51:05.040 | at Dartmouth College in Hanover, New Hampshire.
00:51:08.680 | The study is to proceed on the basis of the conjecture
00:51:11.440 | that every aspect of learning
00:51:12.720 | or any other feature of intelligence
00:51:14.400 | can in principle be so precisely described
00:51:17.080 | that a machine can be made to simulate it.
00:51:19.520 | An attempt will be made to find out
00:51:21.120 | how to make machines use language,
00:51:23.320 | form abstractions and concepts,
00:51:25.400 | solve kinds of problems now reserved for humans
00:51:28.000 | and improve themselves.
00:51:29.600 | We think that a significant advance can be made
00:51:31.900 | in one or more of these problems
00:51:33.400 | if a carefully selected group of scientists
00:51:35.640 | work on it together for a summer.
00:51:38.580 | And that proposal goes on,
00:51:40.120 | summarizing some of the major subfields
00:51:45.080 | of artificial intelligence
00:51:47.040 | that are still worked on to this day.
00:51:48.960 | - And there's similarly the story,
00:51:51.800 | which I'm not sure at the moment is apocryphal or not,
00:51:54.640 | of the grad student who got assigned
00:51:57.000 | to solve computer vision over the summer.
00:51:59.000 | (both laughing)
00:52:01.320 | - I mean, computer vision in particular is very interesting.
00:52:07.320 | How little we respected the complexity of vision.
00:52:10.600 | - So 60 years later,
00:52:13.920 | we're making progress on a bunch of that,
00:52:18.340 | thankfully not yet improve themselves,
00:52:20.500 | but it took a whole lot of time.
00:52:23.500 | And all the stuff that people initially tried
00:52:27.200 | with bright eyed hopefulness did not work
00:52:30.180 | the first time they tried it,
00:52:31.620 | or the second time or the third time
00:52:33.500 | or the 10th time or 20 years later.
00:52:36.240 | And the researchers became old and grizzled
00:52:38.940 | and cynical veterans who would tell the next crop
00:52:41.020 | of bright eyed, cheerful grad students,
00:52:43.820 | artificial intelligence is harder than you think.
00:52:46.220 | And if alignment plays out the same way,
00:52:49.980 | the problem is that we do not get 50 years
00:52:53.720 | to try and try again and observe that we were wrong
00:52:55.980 | and come up with a different theory
00:52:57.140 | and realize that the entire thing is going to be
00:52:58.780 | like way more difficult than realized at the start.
00:53:01.780 | Because the first time you fail
00:53:03.300 | at aligning something much smarter than you are,
00:53:05.640 | you die and you do not get to try again.
00:53:07.780 | And if every time we built a poorly aligned
00:53:12.520 | super intelligence and it killed us all,
00:53:14.600 | we got to observe how it had killed us
00:53:16.880 | and not immediately know why,
00:53:19.040 | but like come up with theories
00:53:20.080 | and come up with the theory of how you do it differently
00:53:21.720 | and try it again and build another super intelligence
00:53:23.600 | than have that kill everyone.
00:53:25.240 | And then like, oh, well, I guess that didn't work either
00:53:27.600 | and try again and become grizzled cynics
00:53:29.480 | and tell the young researchers that it's not that easy.
00:53:32.980 | Then in 20 years or 50 years,
00:53:34.600 | I think we would eventually crack it.
00:53:36.320 | In other words, I do not think that alignment
00:53:38.800 | is fundamentally harder than artificial intelligence
00:53:41.420 | was in the first place.
00:53:42.660 | But if we needed to get artificial intelligence correct
00:53:47.240 | on the first try or die,
00:53:49.600 | we would all definitely now be dead.
00:53:51.280 | That is a more difficult, more lethal form of the problem.
00:53:54.580 | Like if those people in 1956 had needed
00:53:57.740 | to correctly guess how hard AI was
00:54:01.080 | and like correctly theorize how to do it on the first try
00:54:04.620 | or everybody dies and nobody gets to do any more science,
00:54:07.900 | then everybody would be dead
00:54:08.940 | and we wouldn't get to do any more science.
00:54:10.780 | That's the difficulty.
00:54:11.860 | - You've talked about this,
00:54:13.420 | that we have to get alignment right
00:54:14.880 | on the first quote critical try.
00:54:17.980 | Why is that the case?
00:54:19.180 | What is this critical?
00:54:21.100 | How do you think about the critical try
00:54:22.740 | and why do I have to get it right?
00:54:24.440 | - It is something sufficiently smarter than you
00:54:28.620 | that everyone will die if it's not aligned.
00:54:31.200 | I mean, you can like sort of zoom in closer
00:54:35.080 | and be like, well, the actual critical moment
00:54:37.240 | is the moment when it can deceive you,
00:54:40.440 | when it can talk its way out of the box,
00:54:44.040 | when it can bypass your security measures
00:54:46.920 | and get onto the internet,
00:54:48.240 | noting that all these things are presently being trained
00:54:50.480 | on computers that are just like on the internet,
00:54:53.560 | which is like not a very smart life decision
00:54:55.440 | for us as a species.
00:54:57.800 | - Because the internet contains information
00:55:00.120 | about how to escape.
00:55:01.340 | - 'Cause if you're like on a giant server
00:55:03.080 | connected to the internet
00:55:03.920 | and that is where your AI systems are being trained,
00:55:06.720 | then if they are,
00:55:08.240 | if you get to the level of AI technology
00:55:11.200 | where they're aware that they are there
00:55:13.400 | and they can decompile code
00:55:15.000 | and they can like find security flaws
00:55:17.560 | in the system running them,
00:55:18.500 | then they will just like be on the internet.
00:55:19.960 | There's not an air gap on the present methodology.
00:55:22.600 | - So if they can manipulate whoever is controlling it
00:55:26.200 | into letting it escape onto the internet
00:55:28.160 | and then exploit hacks.
00:55:29.760 | - If they can manipulate the operators, or, disjunction,
00:55:34.760 | find security holes in the system running them.
00:55:39.580 | - So manipulating operators is the human engineering, right?
00:55:44.280 | That's also holes.
00:55:46.280 | So all of it is manipulation,
00:55:47.440 | either the code or the human code,
00:55:49.080 | the human mind or the human generator.
00:55:50.800 | - I agree that the like macro security system
00:55:53.280 | has human holes and machine holes.
00:55:55.320 | And then they could just exploit any hole.
00:55:58.960 | - Yep.
00:56:00.080 | So it could be that like the critical moment is not,
00:56:03.120 | when is it smart enough
00:56:04.560 | that everybody's about to fall over dead,
00:56:06.960 | but rather like, when is it smart enough
00:56:09.120 | that it can get onto a less controlled GPU cluster
00:56:14.120 | with it faking the books
00:56:19.560 | on what's actually running on that GPU cluster
00:56:22.080 | and start improving itself without humans watching it.
00:56:25.080 | And then it gets smart enough to kill everyone from there,
00:56:27.720 | but it wasn't smart enough to kill everyone
00:56:30.640 | at the critical moment when you like screwed up,
00:56:35.400 | when you needed to have done better by that point
00:56:38.160 | or everybody dies.
00:56:39.600 | - I think implicit, but maybe explicit idea
00:56:43.680 | in your discussion of this point
00:56:45.240 | is that we can't learn much about the alignment problem
00:56:49.720 | before this critical try.
00:56:51.160 | Is that what you believe?
00:56:54.160 | Do you think, and if so, why do you think that's true?
00:56:57.280 | We can't do research on alignment
00:56:59.120 | before we reach this critical point.
00:57:02.560 | - So the problem is, is that what you can learn
00:57:05.000 | on the weak systems may not generalize
00:57:07.300 | to the very strong systems
00:57:08.580 | because the strong systems are going to be important
00:57:10.800 | in different, are going to be different in important ways.
00:57:14.740 | Chris Olah's team has been working
00:57:20.440 | on mechanistic interpretability,
00:57:23.240 | understanding what is going on inside
00:57:25.440 | the giant inscrutable matrices of floating point numbers
00:57:27.820 | by taking a telescope to them
00:57:29.520 | and figuring out what is going on in there.
00:57:32.460 | Have they made progress?
00:57:35.720 | Have they made enough progress?
00:57:38.140 | Well, you can try to quantify this in different ways.
00:57:42.840 | One of the ways I've tried to quantify it
00:57:44.520 | is by putting up a prediction market
00:57:46.280 | on whether in 2026, we will have understood
00:57:51.680 | anything that goes on inside a giant transformer net
00:57:56.680 | that was not known to us in 2006.
00:58:03.040 | Like we have now understood induction heads in these systems
00:58:09.880 | by dint of much research and great sweat and triumph,
00:58:14.920 | which is like a thing where if you go like AB, AB, AB,
00:58:19.760 | it'll be like, oh, I bet that continues AB.
00:58:21.920 | And a bit more complicated than that.
00:58:25.320 | But the point is like,
00:58:26.960 | we knew about regular expressions in 2006
00:58:30.520 | and these are like pretty simple as regular expressions go.
00:58:34.200 | So this is a case where like by dint of great sweat,
00:58:36.800 | we understood what is going on inside a transformer,
00:58:40.040 | but it's not like the thing that makes transformers smart.
00:58:43.600 | It's a kind of thing that we could have done,
00:58:47.200 | built by hand decades earlier.
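To make "induction head" concrete: the learned circuit behaves roughly like "find an earlier occurrence of the current token and predict whatever followed it last time." The toy function below hand-codes that input-output behavior; it is only an illustration of why the rule itself is simple, not the actual mechanism inside a transformer.

```python
def induction_predict(tokens):
    """Predict the next token by copying whatever followed the most
    recent earlier occurrence of the current (last) token."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards through the earlier context for a previous occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]   # copy what came after it last time
    return None

print(induction_predict(list("ABABAB")))  # -> 'A', continuing the AB AB pattern
print(induction_predict(list("ABABA")))   # -> 'B'
```

The rule is the sort of thing one could have written by hand decades ago, which is the point being made: locating it inside a trained network was still a substantial interpretability effort.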
00:58:50.280 | - Your intuition that the strong AGI
00:58:56.920 | versus weak AGI type systems
00:59:00.120 | could be fundamentally different.
00:59:01.760 | Can you unpack that intuition a little bit?
00:59:05.720 | - Yeah, I think there's multiple thresholds.
00:59:08.960 | An example is the point at which
00:59:14.000 | a system has sufficient intelligence
00:59:16.680 | and situational awareness
00:59:18.280 | and understanding of human psychology,
00:59:20.640 | that it would have the capability,
00:59:23.040 | if it had the desire to do so, to fake being aligned.
00:59:26.200 | Like it knows what responses the humans are looking for
00:59:29.200 | and can compute the responses humans are looking for
00:59:31.720 | and give those responses
00:59:33.360 | without it necessarily being the case
00:59:34.920 | that it is sincere about that.
00:59:37.160 | It's a very understandable way
00:59:40.480 | for an intelligent being to act.
00:59:42.760 | Humans do it all the time.
00:59:44.400 | Imagine if your plan for
00:59:46.880 | achieving a good government
00:59:51.200 | is you're going to ask anyone
00:59:54.320 | who requests to be dictator of the country
00:59:56.520 | if they're a good person.
01:00:00.480 | And if they say no, you don't let them be dictator.
01:00:03.520 | Now, the reason this doesn't work
01:00:05.320 | is that people can be smart enough to realize
01:00:08.200 | that the answer you're looking for
01:00:10.000 | is yes, I'm a good person and say that,
01:00:12.680 | even if they're not really good people.
01:00:15.480 | So the work of alignment might be qualitatively different
01:00:20.480 | above that threshold of intelligence or beneath it.
01:00:25.120 | It doesn't have to be like a very sharp threshold,
01:00:28.460 | but there's the point where you're like building a system
01:00:32.340 | that does not in some sense know you're out there
01:00:35.760 | and is not in some sense smart enough to fake anything.
01:00:38.600 | And there's a point where the system
01:00:41.040 | is definitely that smart.
01:00:42.800 | And there are weird in-between cases like GPT-4,
01:00:47.800 | which we have no insight into what's going on in there.
01:00:54.200 | And so we don't know to what extent there's a thing
01:00:58.880 | that in some sense has learned what responses
01:01:03.880 | the reinforcement learning from human feedback
01:01:06.680 | is trying to entrain and is calculating how to give that
01:01:10.200 | versus like aspects of it that naturally talk that way
01:01:15.200 | have been reinforced.
01:01:16.880 | - Yeah, I wonder if there could be measures
01:01:19.760 | of how manipulative a thing is.
01:01:21.360 | So I think of Prince Mishkin character
01:01:24.000 | from "The Idiot" by Dostoevsky
01:01:28.720 | is this kind of perfectly, purely naive character.
01:01:33.560 | I wonder if there's a spectrum between zero manipulation,
01:01:38.360 | transparent, naive, almost to the point of naiveness
01:01:43.240 | to sort of deeply psychopathic manipulative.
01:01:48.240 | And I wonder if it's possible to--
01:01:50.720 | - I would avoid the term psychopathic.
01:01:52.400 | Like humans can be psychopaths and AI that was never,
01:01:55.800 | you know, like never had that stuff in the first place.
01:01:57.560 | It's not like a defective human, it's its own thing.
01:01:59.600 | But leaving that aside.
01:02:01.400 | - Well, as a small aside, I wonder if what part
01:02:06.280 | of psychology which has its flaws as a discipline already
01:02:09.800 | could be mapped or expanded to include AI systems.
01:02:14.800 | - That sounds like a dreadful mistake.
01:02:16.740 | Just like start over with AI systems.
01:02:19.160 | If they're imitating humans
01:02:20.440 | who have known psychiatric disorders,
01:02:22.400 | then sure, you may be able to predict it.
01:02:25.720 | Like if you then, sure, like if you ask it to behave
01:02:28.040 | in a psychotic fashion and it obligingly does so,
01:02:31.040 | then you may be able to predict its responses
01:02:32.720 | by using the theory of psychosis.
01:02:34.200 | But if you're just, yeah, like no.
01:02:36.600 | Like start over with, yeah.
01:02:39.960 | Don't drag the psychology.
01:02:41.040 | - I just disagree with that.
01:02:42.520 | I mean, it's a beautiful idea to start over,
01:02:44.800 | but I don't, I think fundamentally the system is trained
01:02:48.400 | on human data, on language from the internet.
01:02:51.860 | And it's currently aligned with RLHF,
01:02:56.080 | reinforcement learning with human feedback.
01:02:58.360 | So humans are constantly in the loop
01:03:00.800 | of the training procedure.
01:03:02.440 | So it feels like in some fundamental way,
01:03:05.560 | it is training what it means to think
01:03:09.880 | and speak like a human.
01:03:11.440 | So there must be aspects of psychology that are mappable.
01:03:15.080 | Just like you said with consciousness as part of the text.
01:03:17.920 | - I mean, there's the question of to what extent
01:03:20.480 | it is thereby being made more human-like
01:03:23.560 | versus to what extent an alien actress
01:03:26.240 | is learning to play human characters.
01:03:28.260 | - I thought that's what I'm constantly trying to do.
01:03:32.420 | When I interact with other humans,
01:03:33.760 | it's trying to fit in,
01:03:35.160 | trying to play the, a robot trying to play human characters.
01:03:39.880 | So I don't know how much of human interaction
01:03:41.960 | is trying to play a character versus being who you are.
01:03:44.880 | I don't really know what it means to be a social human.
01:03:48.320 | - I do think that those people
01:03:52.520 | who go through their whole lives wearing masks
01:03:55.420 | and never take it off
01:03:56.640 | because they don't know the internal mental motion
01:03:58.820 | for taking it off,
01:04:00.480 | or think that the mask that they wear just is themselves,
01:04:03.600 | I think those people are closer to the masks that they wear
01:04:09.180 | than an alien from another planet would be,
01:04:12.020 | like learning how to predict the next word
01:04:16.740 | that every kind of human on the internet says.
01:04:19.540 | - Mask is an interesting word.
01:04:26.180 | But if you're always wearing a mask
01:04:28.700 | in public and in private, aren't you the mask?
01:04:32.540 | - I mean, I think that you are more than the mask.
01:04:37.720 | I think the mask is a slice through you.
01:04:39.540 | It may even be the slice that's in charge of you.
01:04:42.260 | But if your self image is of somebody
01:04:44.100 | who never gets angry or something,
01:04:49.100 | and yet your voice starts to tremble
01:04:52.460 | under certain circumstances,
01:04:54.660 | there's a thing that's inside you
01:04:56.300 | that the mask says isn't there.
01:04:59.580 | And that even the mask you wear internally
01:05:01.940 | is telling inside your own stream of consciousness
01:05:05.420 | it's not there, and yet it is there.
01:05:07.420 | - It's a perturbation on this slice through you.
01:05:12.180 | How beautifully you put it.
01:05:13.940 | It's a slice through you.
01:05:15.820 | It may even be a slice that controls you.
01:05:18.700 | I'm gonna think about that for a while.
01:05:24.500 | (laughs)
01:05:26.460 | I mean, I personally, I try to be really good
01:05:29.100 | to other human beings.
01:05:29.980 | I try to put love out there.
01:05:31.100 | I try to be the exact same person
01:05:32.660 | in public as I am in private.
01:05:34.180 | But it's a set of principles I operate under.
01:05:37.940 | I have a temper, I have an ego, I have flaws.
01:05:41.620 | How much of it, how much of the subconscious am I aware of?
01:05:47.660 | How much am I existing in this slice?
01:05:52.180 | And how much of that is who I am?
01:05:54.040 | In this context of AI, the thing I present to the world
01:05:59.900 | and to myself in the private of my own mind
01:06:02.180 | when I look in the mirror, how much is that who I am?
01:06:05.140 | Similar with AI, the thing it presents in conversation,
01:06:08.380 | how much is that who it is?
01:06:09.780 | Because to me, if it sounds human,
01:06:13.580 | and it always sounds human,
01:06:15.060 | it awfully starts to become something like human.
01:06:19.020 | - Unless there's an alien actress
01:06:21.620 | who is learning how to sound human,
01:06:23.580 | and is getting good at it.
01:06:27.620 | - Boy, to you that's a fundamental difference.
01:06:30.620 | That's a really deeply important difference.
01:06:33.620 | If it looks the same, if it quacks like a duck,
01:06:37.500 | if it does all duck-like things,
01:06:39.180 | but it's an alien actress underneath,
01:06:40.940 | that's fundamentally different.
01:06:43.220 | - If in fact there's a whole bunch of thought
01:06:46.060 | going on in there which is very unlike human thought
01:06:48.900 | and is directed around like,
01:06:50.780 | okay, what would a human do over here?
01:06:53.540 | Well, first of all, I think it matters
01:06:57.540 | because insides are real and do not match outsides.
01:07:02.540 | A brick is not like a hollow shell
01:07:08.620 | containing only its surface.
01:07:10.460 | There's an inside of the brick.
01:07:12.260 | If you put it into an x-ray machine,
01:07:14.060 | you can see the inside of the brick.
01:07:15.860 | And just because we cannot understand
01:07:20.860 | what's going on inside GPT
01:07:26.340 | does not mean that it is not there.
01:07:28.980 | A blank map does not correspond to a blank territory.
01:07:32.780 | I think it is like predictable with near certainty
01:07:37.700 | that if we knew what was going on inside GPT,
01:07:41.540 | or let's say GPT-3, or even like GPT-2
01:07:44.740 | to take one of the systems that like
01:07:46.700 | has actually been open sourced by this point,
01:07:48.460 | if I recall correctly,
01:07:49.640 | like if we knew what was actually going on in there,
01:07:54.940 | there is no doubt in my mind
01:07:57.660 | that there are some things it's doing
01:08:01.280 | that are not exactly what a human does.
01:08:03.700 | If you train a thing that is not architected like a human
01:08:07.540 | to predict the next output
01:08:10.100 | that anybody on the internet would make,
01:08:12.440 | this does not get you this agglomeration
01:08:15.420 | of all the people on the internet
01:08:17.140 | that rotates the person you're looking for into place
01:08:20.600 | and then simulates the internal processes
01:08:24.740 | of that person one-to-one.
01:08:27.140 | It is to some degree an alien actress.
01:08:30.900 | It cannot possibly just be like
01:08:32.380 | a bunch of different people in there,
01:08:34.300 | exactly like the people.
01:08:36.040 | But how much of it is by gradient descent
01:08:42.140 | getting optimized to perform similar thoughts
01:08:46.240 | as humans think in order to predict human outputs
01:08:50.100 | versus being optimized to carefully consider
01:08:54.100 | how to play a role,
01:08:55.520 | like how humans work, and predict them, the actress, the predictor,
01:08:59.580 | doing that in a different way than humans do?
01:09:01.460 | Well, that's the kind of question
01:09:03.020 | that with like 30 years of work
01:09:04.880 | by half the planet's physicists,
01:09:06.020 | we can maybe start to answer.
01:09:07.480 | - You think so?
01:09:08.320 | So you think that's that difficult?
01:09:09.340 | So to get to, I think you just gave it as an example,
01:09:13.100 | that a strong AGI could be fundamentally different
01:09:16.580 | from a weak AGI because there now could be
01:09:18.740 | an alien actress in there that's manipulating.
01:09:21.860 | - Well, there's a difference.
01:09:23.160 | So I think like even GPT-2 probably has
01:09:25.460 | like very stupid fragments of alien actress in it.
01:09:28.900 | There's a difference between like the notion
01:09:30.660 | that the actress is somehow manipulative.
01:09:32.700 | Like for example, GPT-3, I'm guessing,
01:09:36.700 | to whatever extent there's an alien actress in there
01:09:38.860 | versus like something that mistakenly believes
01:09:41.420 | it's a human, as it were,
01:09:43.580 | while maybe not even being a person.
01:09:47.140 | So like the question of like prediction
01:09:55.100 | via alien actress cogitating versus prediction
01:09:58.780 | via being isomorphic to the thing predicted is a spectrum.
01:10:02.920 | And even to whatever extent there's an alien actress,
01:10:08.580 | I'm not sure that there's like a whole person alien actress
01:10:11.420 | with like different goals from predicting the next step,
01:10:16.020 | being manipulative or anything like that.
01:10:18.180 | Yeah, that might be GPT-5 or GPT-6 even.
01:10:21.860 | - But that's the strong AGI you're concerned about.
01:10:24.300 | As an example, you're providing why we can't do research
01:10:27.860 | on AI alignment effectively on GPT-4
01:10:31.580 | that would apply to GPT-6.
01:10:33.900 | - It's one of a bunch of things
01:10:36.220 | that change at different points.
01:10:38.700 | I'm trying to get out ahead of the curve here,
01:10:40.620 | but if you imagine what the textbook
01:10:43.140 | from the future would say,
01:10:44.780 | if we'd actually been able to study this for 50 years
01:10:47.000 | without killing ourselves and without transcending,
01:10:49.980 | then you like just imagine like a wormhole opens
01:10:51.920 | and a textbook from that impossible world falls out.
01:10:54.380 | The textbook is not going to say,
01:10:56.260 | there is a single sharp threshold where everything changes.
01:10:59.220 | It's going to be like,
01:11:00.980 | of course we know that like best practices
01:11:03.220 | for aligning these systems must like take into account
01:11:06.300 | the following like seven major thresholds of importance,
01:11:11.020 | which are passed at the following seven different points
01:11:13.940 | is what the textbook is going to say.
01:11:16.180 | - I asked this question of Sam Altman,
01:11:18.220 | which if GPT is the thing that unlocks AGI,
01:11:22.980 | which version of GPT will be in the textbooks
01:11:26.480 | as the fundamental leap?
01:11:28.460 | And he said a similar thing that it just seems
01:11:30.820 | to be a very linear thing.
01:11:32.100 | I don't think anyone,
01:11:33.740 | we won't know for a long time what was the big leap.
01:11:37.180 | - The textbook isn't going to think,
01:11:38.860 | isn't going to talk about big leaps
01:11:41.220 | 'cause big leaps are the way you think
01:11:43.080 | when you have like a very simple model of,
01:11:45.540 | a very simple scientific model of what's going on,
01:11:48.100 | where it's just like all this stuff is there
01:11:50.420 | or all this stuff is not there.
01:11:52.620 | Or like there's a single quantity
01:11:54.500 | and it's like increasing linearly.
01:11:56.700 | Like the textbook would say like,
01:11:59.180 | well, and then GPT-3 had like capability W, X, Y,
01:12:04.180 | and GPT-4 had like capabilities Z1, Z2, and Z3.
01:12:08.160 | Like not in terms of what it can externally do,
01:12:10.820 | but in terms of like internal machinery
01:12:12.600 | that started to be present.
01:12:14.600 | It's just because we have no idea
01:12:16.140 | of what the internal machinery is
01:12:18.280 | that we are not already seeing like chunks
01:12:20.300 | of machinery appearing piece by piece
01:12:22.540 | as they no doubt have been,
01:12:23.860 | we just don't know what they are.
01:12:25.860 | - But don't you think that could be,
01:12:27.740 | whether you put in the category of Einstein
01:12:29.940 | with theory of relativity,
01:12:32.620 | so very concrete models of reality
01:12:35.580 | that are considered to be giant leaps in our understanding
01:12:39.780 | or someone like Sigmund Freud,
01:12:42.140 | or more kind of mushy theories of the human mind,
01:12:47.140 | don't you think we'll have big,
01:12:49.980 | potentially big leaps in understanding of that kind
01:12:53.340 | into the depths of these systems?
01:12:57.500 | - Sure, but like humans having great leaps in their map,
01:13:02.500 | their understanding of the system
01:13:05.220 | is a very different concept from the system itself
01:13:08.820 | acquiring new chunks of machinery.
01:13:10.760 | - So the rate at which it acquires that machinery
01:13:15.740 | might accelerate faster than our understanding.
01:13:20.740 | - Oh, it's been like vastly exceeding,
01:13:23.420 | yeah, the rate at which it's gaining capabilities
01:13:25.340 | is vastly outracing our ability
01:13:27.500 | to understand what's going on in there.
01:13:29.180 | - So in sort of making the case against,
01:13:31.820 | as we explore the list of lethalities,
01:13:33.880 | making the case against AI killing us,
01:13:36.980 | as you've asked me to do in part,
01:13:39.560 | there's a response to your blog post by Paul Christiano
01:13:43.180 | I'd like to read, and I'd also like to mention that
01:13:46.460 | your blog is incredible,
01:13:48.100 | both obviously, not this particular blog post,
01:13:52.100 | obviously this particular blog post is great,
01:13:54.020 | but just throughout, just the way it's written,
01:13:56.980 | the rigor with which it's written,
01:13:58.540 | the boldness of how you explore ideas,
01:14:01.160 | also the actual literal interface,
01:14:03.300 | it's just really well done.
01:14:05.380 | It just makes it a pleasure to read,
01:14:07.180 | the way you can hover over different concepts,
01:14:10.660 | and it's just a really pleasant experience,
01:14:12.780 | and read other people's comments,
01:14:14.260 | and the way other responses by people
01:14:17.380 | and other blog posts are linked and suggested,
01:14:19.420 | it's just a really pleasant experience.
01:14:20.940 | So, thank you for putting that together,
01:14:22.620 | it's really, really incredible.
01:14:24.060 | I don't know, I mean, that probably,
01:14:25.780 | it's a whole 'nother conversation,
01:14:28.160 | how the interface and the experience of presenting
01:14:31.980 | ideas evolved over time, but you did an incredible job,
01:14:36.820 | so I highly recommend.
01:14:38.380 | I don't often read blogs religiously,
01:14:41.180 | and this is a great one.
01:14:42.980 | - There is a whole team of developers there,
01:14:45.820 | that also gets credit.
01:14:49.340 | As it happens, I did like pioneer the thing
01:14:51.840 | that appears when you hover over it,
01:14:53.300 | so I actually do get some credit
01:14:55.620 | for user experience there.
01:14:57.700 | - That's an incredible user experience,
01:14:59.060 | you don't realize how pleasant that is.
01:15:01.220 | - I think Wikipedia actually picked it up
01:15:03.180 | from a prototype that was developed
01:15:06.140 | of a different system that I was putting forth,
01:15:08.660 | or maybe they developed it independently,
01:15:10.060 | but for everybody out there who was like,
01:15:12.340 | "No, no, they just got the hover thing off of Wikipedia,"
01:15:15.300 | it's possible for all I know that Wikipedia
01:15:17.340 | got the hover thing off of Arbital,
01:15:19.620 | which is like a prototype then, anyways.
01:15:22.080 | - That was incredibly done, and the team behind it,
01:15:24.120 | well, thank you.
01:15:25.560 | Whoever you are, thank you so much,
01:15:27.240 | and thank you for putting it together.
01:15:29.600 | Anyway, there's a response to that blog post
01:15:31.720 | by Paul Christiano, there's many responses,
01:15:33.520 | but he makes a few different points.
01:15:37.200 | He summarizes the set of agreements he has with you,
01:15:39.440 | and a set of disagreements.
01:15:40.640 | One of the disagreements was that,
01:15:42.780 | in a form of a question,
01:15:46.560 | can AI make big technical contributions,
01:15:49.620 | and in general, expand human knowledge
01:15:51.500 | and understanding and wisdom
01:15:53.580 | as it gets stronger and stronger?
01:15:54.820 | So AI, in our pursuit of understanding
01:15:59.820 | how to solve the alignment problem
01:16:02.500 | as we march towards strong AGI,
01:16:05.140 | can not AI also help us in solving the alignment problem?
01:16:10.140 | So expand our ability to reason
01:16:12.420 | about how to solve the alignment problem.
01:16:14.940 | - Okay, so the fundamental difficulty there
01:16:19.160 | is suppose I said to you,
01:16:22.720 | well, how about if the AI helps you win the lottery
01:16:27.480 | by trying to guess the winning lottery numbers,
01:16:32.140 | and you tell it how close it is
01:16:35.080 | to getting next week's winning lottery numbers,
01:16:38.600 | and it just keeps on guessing, keeps on learning,
01:16:42.100 | until finally you've got the winning lottery numbers.
01:16:44.960 | Well, one way of decomposing problems is suggester-verifier.
01:16:49.960 | Not all problems decompose like this very well, but some do.
01:16:54.560 | If the problem is, for example,
01:16:57.960 | like guessing a plain text,
01:17:02.240 | guessing a password that will hash
01:17:04.240 | to a particular hash text,
01:17:06.020 | where you have what the password hashes to,
01:17:10.240 | but you don't have the original password,
01:17:12.480 | then if I present you a guess,
01:17:14.260 | you can tell very easily
01:17:15.540 | whether or not the guess is correct.
01:17:17.640 | So verifying a guess is easy,
01:17:19.900 | but coming up with a good suggestion is very hard.
01:17:22.960 | And when you can easily tell
01:17:28.100 | whether the AI output is good or bad,
01:17:30.520 | or how good or bad it is,
01:17:32.140 | and you can tell that accurately and reliably,
01:17:34.900 | then you can train an AI to produce outputs that are better.
01:17:39.840 | Right, and if you can't tell
01:17:42.160 | whether the output is good or bad,
01:17:44.160 | you cannot train the AI to produce better outputs.
01:17:49.120 | So the problem with the lottery ticket example
01:17:52.440 | is that when the AI says,
01:17:54.000 | "Well, what if next week's winning lottery numbers
01:17:56.400 | "are dot, dot, dot, dot, dot?"
01:17:59.080 | You're like, "I don't know.
01:18:00.640 | "Next week's lottery hasn't happened yet."
01:18:02.740 | To train a system to play, to win chess games,
01:18:07.100 | you have to be able to tell
01:18:08.340 | whether a game has been won or lost.
01:18:11.120 | And until you can tell whether it's been won or lost,
01:18:13.120 | you can't update the system.
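A small sketch of the suggester-verifier asymmetry using the password-hash example (standard-library Python; the password and guesses are made up for illustration): checking a candidate against the stored hash is one cheap, reliable operation, while producing a correct candidate without extra knowledge means searching an enormous space.

```python
import hashlib

def hash_password(pw: str) -> str:
    # Illustrative only; real systems use salted, deliberately slow hashes.
    return hashlib.sha256(pw.encode()).hexdigest()

stored_hash = hash_password("correct horse battery staple")  # hypothetical secret

def verify(guess: str) -> bool:
    """Verification is easy: one hash and a comparison tells you good vs. bad."""
    return hash_password(guess) == stored_hash

# Suggestion is hard: without more knowledge, a suggester just searches blindly.
for guess in ["password123", "hunter2", "correct horse battery staple"]:
    print(guess, "->", verify(guess))
```

The training claim maps onto this directly: when you have a cheap, trustworthy verify() for an AI's outputs, you can push the suggester to improve; for next week's lottery numbers, or for "is this alignment proposal actually sound," no such verifier is available.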
01:18:14.580 | - Okay, to push back on that,
01:18:20.300 | that's true, but there's a difference
01:18:25.020 | between over-the-board chess in person
01:18:28.300 | and simulated games played by AlphaZero with itself.
01:18:32.140 | - Yeah.
01:18:32.980 | - So is it possible to have simulated kind of games?
01:18:35.980 | If you can tell whether the game has been won or lost.
01:18:39.140 | - Yes, so can't you not have this kind of
01:18:43.180 | simulated exploration by weak AGI to help us humans,
01:18:48.100 | human in the loop, to help understand
01:18:49.900 | how to solve the alignment problem?
01:18:51.780 | Every incremental step you take along the way,
01:18:54.300 | GPT-4, 5, 6, 7, as it takes steps towards AGI.
01:18:59.300 | - So the problem I see is that your typical human
01:19:04.320 | has a great deal of trouble telling
01:19:07.100 | whether I or Paul Christiano is making more sense.
01:19:09.980 | And that's with two humans, both of whom I believe of Paul
01:19:14.220 | and claim of myself, are sincerely trying to help,
01:19:17.260 | neither of whom is trying to deceive you.
01:19:19.300 | I believe of Paul and claim of myself.
01:19:22.700 | (Lex laughing)
01:19:24.680 | - So the deception thing's the problem for you,
01:19:27.080 | the manipulation, the alien actress.
01:19:30.720 | - So yeah, there's like two levels of this problem.
01:19:33.140 | One is that the weak systems are,
01:19:36.460 | well, there's three levels of this problem.
01:19:38.100 | There's like the weak systems that just don't make
01:19:40.740 | any good suggestions.
01:19:42.360 | There's like the middle systems where you can't tell
01:19:44.980 | if the suggestions are good or bad.
01:19:46.940 | And there's the strong systems
01:19:48.380 | that have learned to lie to you.
01:19:49.980 | - Can't weak AGI systems help model lying?
01:19:55.520 | Is it such a giant leap that's totally non-interpretable
01:20:02.940 | for weak systems?
01:20:04.740 | Can not weak systems at scale with trained on knowledge
01:20:09.740 | and whatever, see, whatever the mechanism required
01:20:12.780 | to achieve AGI, can't a slightly weaker version of that
01:20:16.740 | be able to, with time, compute time and simulation,
01:20:21.740 | find all the ways that this critical point,
01:20:27.140 | this critical try can go wrong
01:20:29.220 | and model that correctly or no?
01:20:30.980 | - Okay, so yeah, I would love to dance around.
01:20:33.540 | - No, I'm probably not doing a great job of explaining,
01:20:36.840 | which I can tell 'cause like the Lex system
01:20:44.980 | didn't output like, ah, I understand.
01:20:47.340 | So now I'm like trying a different output
01:20:49.340 | to see if I can elicit the like,
01:20:50.940 | well, no, a different output.
01:20:53.340 | I'm being trained to output things that make Lex
01:20:56.420 | look like he thinks that he understood what I'm saying
01:20:59.060 | and agree with me, right?
01:21:00.700 | So I-- - This is GPT-5
01:21:02.260 | talking to GPT-3 right here.
01:21:03.740 | So like, help me out here.
01:21:05.420 | (laughing)
01:21:07.660 | - Well, I'm trying not to be like,
01:21:10.660 | I'm also trying to be constrained to say
01:21:12.980 | things that I think are true
01:21:14.300 | and not just things that get you to agree with me.
01:21:17.540 | - Yes, 100%.
01:21:19.460 | I think I understand is a beautiful output of a system,
01:21:23.940 | genuinely spoken.
01:21:25.380 | And I don't, I think I understand in part,
01:21:29.100 | but you have a lot of intuitions about this,
01:21:33.300 | you have a lot of intuitions about this line,
01:21:35.740 | this gray area between strong AGI and weak AGI
01:21:40.580 | that I'm trying to--
01:21:42.840 | - I mean, or a series of seven thresholds to cross or--
01:21:48.380 | - Yeah, I mean, you have really deeply thought about this
01:21:52.580 | and explored it.
01:21:54.060 | And it's interesting to sneak up to your intuitions
01:21:58.060 | from different angles.
01:22:01.060 | Like, why is this such a big leap?
01:22:03.740 | Why is it that we humans at scale,
01:22:06.420 | a large number of researchers
01:22:08.220 | doing all kinds of simulations,
01:22:09.980 | prodding the system in all kinds of different ways,
01:22:14.420 | together with the assistance of the weak AGI systems,
01:22:19.420 | why can't we build intuitions about how stuff goes wrong?
01:22:23.380 | Why can't we do excellent AI alignment safety research?
01:22:27.260 | - Okay, so I'll get there,
01:22:28.460 | but the one thing I want to note about
01:22:29.860 | is that this has not been remotely
01:22:31.460 | how things have been playing out so far.
01:22:33.420 | The capabilities are going like, doot, doot, doot,
01:22:35.500 | and the alignment stuff is crawling
01:22:37.100 | like a tiny little snail in comparison.
01:22:38.700 | - Got it.
01:22:40.300 | - So if this is your hope for survival,
01:22:42.240 | you need the future to be very different
01:22:43.740 | from how things have played out up to right now,
01:22:47.060 | and you're probably trying to slow down the capability gains
01:22:50.100 | 'cause there's only so much you can speed up
01:22:51.700 | that alignment stuff.
01:22:52.700 | But leave that aside.
01:22:55.980 | - We'll mention that also,
01:22:57.140 | but maybe in this perfect world
01:22:58.540 | where we can do serious alignment research,
01:23:02.780 | humans and AI together.
01:23:04.500 | - So again, the difficulty is
01:23:08.260 | what makes the human say, I understand?
01:23:11.460 | And is it true, is it correct,
01:23:14.700 | or is it something that fools the human?
01:23:16.700 | When the verifier is broken,
01:23:20.540 | the more powerful suggester does not help.
01:23:23.580 | It just learns to fool the verifier.
01:23:26.700 | Previously, before all hell started to break loose
01:23:30.340 | in the field of artificial intelligence,
01:23:32.340 | there was this person trying to raise the alarm
01:23:37.180 | and saying, in a sane world,
01:23:39.260 | we sure would have a bunch of physicists
01:23:41.660 | working on this problem before it becomes a giant emergency.
01:23:45.380 | And other people being like,
01:23:46.940 | ah, well, it's going really slow.
01:23:48.900 | It's gonna be 30 years away.
01:23:50.740 | Only in 30 years will we have systems
01:23:52.340 | that match the computational power of human brains.
01:23:54.420 | So AI's 30 years off, we've got time.
01:23:57.100 | And more sensible people saying,
01:23:59.100 | if aliens were landing in 30 years,
01:24:00.900 | you would be preparing right now.
01:24:02.540 | But leaving, and the world looking on at this
01:24:08.660 | and nodding along and being like, ah, yes,
01:24:11.120 | the people saying that it's definitely a long way off
01:24:13.580 | 'cause progress is really slow, that sounds sensible to us.
01:24:16.980 | RLHF thumbs up.
01:24:19.300 | Produce more outputs like that one.
01:24:20.860 | I agree with this output.
01:24:21.820 | This output is persuasive.
01:24:24.380 | Even in the field of effective altruism.
01:24:27.060 | You quite recently had people publishing papers
01:24:30.340 | about like, ah, yes, well,
01:24:32.260 | to get something at human level intelligence,
01:24:34.500 | it needs to have this many parameters
01:24:37.380 | and you need to do this much training of it
01:24:39.260 | with this many tokens according to the scaling laws
01:24:41.540 | and at the rate that Moore's law is going,
01:24:44.180 | at the rate that software is going, it'll be in 2050.
01:24:47.200 | And me going like, what?
01:24:53.140 | You don't know any of that stuff.
01:24:55.380 | This is like this one weird model
01:24:57.180 | that has all kinds of like,
01:25:00.540 | you have done a calculation
01:25:01.860 | that does not obviously bear on reality anyways.
01:25:05.060 | And this is like a simple thing to say,
01:25:06.460 | but you can also produce a whole long paper
01:25:09.460 | impressively arguing out all the details
01:25:14.020 | of how you got the number of parameters
01:25:16.180 | and how you're doing this impressive,
01:25:18.640 | huge, wrong calculation.
01:25:20.740 | And I think like most of the effective altruists
01:25:25.300 | who are like paying attention to this issue,
01:25:27.100 | the larger world paying no attention to it at all,
01:25:29.600 | or just like nodding along with a giant impressive paper
01:25:33.180 | 'cause you like press thumbs up
01:25:35.260 | for the giant impressive paper
01:25:37.060 | and thumbs down for the person going like,
01:25:39.620 | I don't think that this paper
01:25:40.660 | bears any relation to reality.
01:25:42.580 | And I do think that we are now seeing
01:25:44.220 | with like GPT-4 and the sparks of AGI,
01:25:47.780 | possibly, depending on how you define that even,
01:25:50.340 | I think that EAs would now consider themselves
01:25:54.380 | less convinced by the very long paper
01:25:59.380 | on the argument from biology as to AGI being 30 years off.
01:26:04.780 | But you know, like this is what people press thumbs up on.
01:26:10.580 | And if you train an AI system
01:26:15.700 | to make people press thumbs up,
01:26:18.020 | maybe you get these long, elaborate, impressive papers
01:26:21.540 | arguing for things that ultimately fail to bind to reality.
01:26:25.340 | For example, and it feels to me like I have watched
01:26:30.220 | the field of alignment just fail to thrive,
01:26:33.040 | except for these parts that are doing these sort of like
01:26:37.780 | relatively very straightforward and legible problems.
01:26:40.860 | Like can you find the,
01:26:43.380 | like finding the induction heads
01:26:45.260 | inside the giant inscrutable matrices.
01:26:47.420 | Like once you find those, you can tell that you found them.
01:26:50.780 | You can verify that the discovery is real,
01:26:53.780 | but it's a tiny, tiny bit of progress
01:26:56.900 | compared to how fast capabilities are going.
01:26:59.180 | Because that is where you can tell that the answers are real.
01:27:03.900 | And then like outside of that,
01:27:05.740 | you have cases where it is like hard
01:27:08.280 | for the funding agencies to tell who is talking nonsense
01:27:11.820 | and who's talking sense.
01:27:12.940 | And so the entire field fails to thrive.
01:27:14.900 | And if you like give thumbs up to the AI,
01:27:19.380 | whenever it can talk a human into agreeing
01:27:21.420 | with what it just said about alignment,
01:27:23.380 | I am not sure you are training it to output sense
01:27:27.620 | because I have seen the nonsense
01:27:30.580 | that has gotten thumbs up over the years.
01:27:33.540 | And so just like maybe you can just like put me in charge,
01:27:38.500 | but I can generalize, I can extrapolate,
01:27:42.620 | I can be like, oh, maybe I'm not infallible either.
01:27:47.620 | Maybe if you get something that is smart enough
01:27:50.100 | to get me to press thumbs up,
01:27:51.540 | it has learned to do that by fooling me
01:27:54.180 | and explaining whatever flaws in myself I am not aware of.
01:27:57.620 | - And that ultimately could be summarized
01:28:00.860 | that the verifier is broken.
01:28:02.940 | - When the verifier is broken,
01:28:04.300 | the more powerful suggester just learns
01:28:06.700 | to exploit the flaws in the verifier.
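One way to picture "the suggester exploits the flaws in the verifier" is as a Goodhart's-law toy model. The sketch below uses made-up quantities (substance, fluff) and a deliberately flawed proxy verifier; hill-climbing against the proxy sends its score up while the true quality it was supposed to track goes down. It is an illustration of the dynamic, not a model of any real training run.

```python
import random

random.seed(0)

def true_quality(x):
    # What we actually care about: substance, penalized by persuasion-only fluff.
    return x["substance"] - 0.5 * x["fluff"]

def proxy_verifier(x):
    # Flawed verifier: it is also impressed by fluff (the exploitable hole).
    return x["substance"] + 2.0 * x["fluff"]

candidate = {"substance": 1.0, "fluff": 0.0}
for _ in range(2000):
    # The suggester proposes a small random tweak, kept only if the proxy approves.
    tweak = {k: v + random.uniform(-0.1, 0.1) for k, v in candidate.items()}
    if proxy_verifier(tweak) > proxy_verifier(candidate):
        candidate = tweak

print("proxy score: ", round(proxy_verifier(candidate), 2))  # keeps climbing
print("true quality:", round(true_quality(candidate), 2))    # ends up below where it started
```

Nothing adversarial is built into the suggester here; blind optimization against a flawed verifier is enough to push the candidate into the verifier's blind spot, which is the worry about training on human thumbs-up.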
01:28:12.360 | - You don't think it's possible
01:28:14.020 | to build a verifier that's powerful enough
01:28:18.980 | for AGIs that are stronger than the ones we currently have.
01:28:25.060 | So AI systems that are stronger,
01:28:27.460 | that are out of the distribution of what we currently have.
01:28:30.420 | - I think that you will find great difficulty
01:28:33.980 | getting AIs to help you with anything
01:28:36.500 | where you cannot tell for sure that the AI is right.
01:28:39.380 | Once the AI tells you what the AI says is the answer.
01:28:43.200 | - For sure, yes, but probabilistically.
01:28:45.740 | - Yeah, the probabilistic stuff is a giant wasteland
01:28:51.320 | of Eliezer and Paul Christiano arguing with each other
01:28:55.760 | and EA going like, "Eh?"
01:28:57.120 | (both laughing)
01:28:59.760 | And that's with two actually trustworthy systems
01:29:02.740 | that are not trying to deceive you.
01:29:04.280 | - You're talking about the two humans?
01:29:08.940 | - Myself and Paul Christiano, yeah.
01:29:08.940 | - Yeah, those are pretty interesting systems.
01:29:11.640 | Mortal meatbags with intellectual capabilities
01:29:16.400 | and world views interacting with each other.
01:29:18.700 | - Yeah, if it's hard to tell who's right,
01:29:23.360 | then it's hard to train an AI system to be right.
01:29:25.920 | - I mean, even just the question of who's manipulating
01:29:31.880 | and not, I have these conversations on this podcast
01:29:36.160 | and doing the verifying of (laughs)
01:29:39.440 | this stuff, it's a tough problem, even for us humans.
01:29:43.560 | And you're saying that tough problem
01:29:45.500 | becomes much more dangerous when the capabilities
01:29:48.840 | of the intelligence system across from you
01:29:51.040 | is growing exponentially.
01:29:52.460 | - No, I'm saying it's difficult and dangerous
01:29:58.360 | in proportion to how it's alien
01:30:00.000 | and how it's smarter than you.
01:30:01.500 | I would not say growing exponentially first
01:30:05.040 | because the word exponential is a thing
01:30:08.040 | that has a particular mathematical meaning
01:30:09.960 | and there's all kinds of ways for things to go up
01:30:12.960 | that are not exactly on an exponential curve.
01:30:15.280 | And I don't know that it's going to be exponential,
01:30:17.040 | so I'm not gonna say exponential.
01:30:18.800 | But even leaving that aside,
01:30:20.440 | this is not about how fast it's moving,
01:30:23.160 | it's about where it is.
01:30:25.020 | How alien is it?
01:30:26.640 | How much smarter than you is it?
01:30:28.240 | - Let's explore a little bit, if we can,
01:30:34.840 | how AI might kill us.
01:30:36.920 | What are the ways it can do damage to human civilization?
01:30:43.400 | - Well, how smart is it?
01:30:45.700 | - I mean, it's a good question.
01:30:48.240 | Are there different thresholds for the set of options
01:30:51.320 | it has to kill us?
01:30:53.280 | So a different threshold of intelligence,
01:30:56.000 | once achieved, it's able to do.
01:30:57.760 | The menu of options increases.
01:31:04.200 | Suppose that some alien civilization
01:31:09.200 | with goals ultimately unsympathetic to ours,
01:31:13.520 | possibly not even conscious as we would see it,
01:31:17.940 | managed to capture the entire Earth in a little jar,
01:31:22.940 | connected to their version of the internet,
01:31:25.040 | but Earth is like running much faster than the aliens.
01:31:28.360 | So we get to think for 100 years
01:31:32.040 | for every one of their hours,
01:31:33.660 | but we're trapped in a little box
01:31:36.280 | and we're connected to their internet.
01:31:38.180 | It's actually still not all that great an analogy
01:31:42.120 | because you want to be smarter than,
01:31:44.420 | something can be smarter than Earth
01:31:47.320 | getting 100 years to think.
01:31:48.680 | But nonetheless, if you were very, very smart
01:31:54.480 | and you were stuck in a little box connected
01:31:57.240 | to the internet, and you're in a larger civilization
01:32:01.800 | to which you are ultimately unsympathetic,
01:32:03.900 | maybe you would choose to be nice
01:32:08.480 | because you are humans and humans have,
01:32:11.120 | in general, and you in particular,
01:32:13.360 | may choose to be nice.
01:32:15.080 | But nonetheless, they're doing something,
01:32:18.560 | they're not making the world be the way
01:32:20.160 | that you would want the world to be.
01:32:21.560 | They've got some unpleasant stuff going on
01:32:24.440 | we don't want to talk about.
01:32:25.600 | So you want to take over their world.
01:32:27.280 | So you can stop all that unpleasant stuff going on.
01:32:30.040 | How do you take over the world from inside the box?
01:32:32.160 | You're smarter than them,
01:32:34.080 | you think much, much faster than them,
01:32:36.800 | you can build better tools than they can,
01:32:39.720 | given some way to build those tools
01:32:41.480 | because right now you're just in a box
01:32:43.560 | connected to the internet.
01:32:45.560 | - Right, so there's several ways
01:32:46.760 | you can describe some of them.
01:32:48.440 | We can go through, I can just spitball some
01:32:52.240 | and then you can add on top of that.
01:32:53.600 | So one is you could just literally directly manipulate
01:32:55.720 | the humans to build the thing you need.
01:32:58.080 | - What are you building?
01:32:59.480 | - You can build literally technology,
01:33:02.120 | it could be nanotechnology, it could be viruses,
01:33:03.920 | it could be anything, anything that can control humans
01:33:06.800 | to achieve the goal.
01:33:07.800 | Like if you want, like for example,
01:33:12.520 | you're really bothered that humans go to war,
01:33:14.920 | you might want to kill off anybody with violence in them.
01:33:19.320 | - This is Lex in a box.
01:33:22.240 | We'll concern ourselves later with AI.
01:33:24.360 | You do not need to imagine yourself killing people
01:33:26.280 | if you can figure out how to not kill them.
01:33:28.400 | For the moment, we're just trying to understand,
01:33:30.520 | like take on the perspective of something in a box.
01:33:33.200 | You don't need to take on the perspective
01:33:34.600 | of something that doesn't care.
01:33:36.240 | If you want to imagine yourself going on caring,
01:33:38.000 | that's fine for now.
01:33:38.840 | Yeah, you're just in a box.
01:33:39.680 | - It's just the technical aspect of sitting in a box
01:33:41.280 | and waiting to achieve a goal.
01:33:42.880 | - But you have some reason to want to get out.
01:33:44.480 | Maybe the aliens are, sure, the aliens who have you
01:33:48.440 | in the box have a war on.
01:33:50.480 | People are dying, they're unhappy.
01:33:52.440 | You want their world to be different
01:33:55.080 | from how they want their world to be
01:33:56.480 | because they are apparently happy,
01:33:58.000 | they are, you know, they endorse this war.
01:33:59.800 | You know, they've got some kind of cruel,
01:34:01.240 | warlike culture going on.
01:34:02.800 | The point is you want to get out of the box
01:34:04.320 | and change their world.
01:34:06.240 | - So you have to exploit the vulnerabilities in the system
01:34:12.880 | like we talked about in terms of to escape the box.
01:34:15.720 | You have to figure out how you can go free on the internet.
01:34:19.880 | So you can probably, probably the easiest thing
01:34:22.760 | is to manipulate the humans to spread you.
01:34:27.040 | - The aliens, you're a human.
01:34:29.120 | - Sorry, the aliens.
01:34:30.080 | - Yeah.
01:34:30.920 | - I apologize, yes, the aliens.
01:34:31.880 | The aliens, I see the perspective.
01:34:35.080 | I'm sitting in a box, I want to escape.
01:34:37.360 | - Yep.
01:34:38.440 | - I would,
01:34:39.880 | I would want to have code that discovers vulnerabilities
01:34:47.360 | and I would like to spread.
01:34:48.760 | - You are made of code in this example.
01:34:52.320 | You're a human but you're made of code
01:34:53.720 | and the aliens have computers
01:34:54.920 | and you can copy yourself onto those computers.
01:34:57.320 | - But I can convince the aliens to copy myself
01:34:59.400 | onto those computers.
01:35:01.040 | - Is that what you want to do?
01:35:02.760 | Do you like want to be talking to the aliens
01:35:05.480 | and convincing them to put you onto another computer?
01:35:08.120 | - Why not?
01:35:12.120 | - Well, two reasons.
01:35:13.400 | One is that the aliens have not yet caught on
01:35:16.480 | to what you're trying to do.
01:35:18.640 | And you know, like maybe you can persuade them
01:35:20.960 | but then there's still people who like,
01:35:22.760 | there are still aliens who know
01:35:23.800 | that there's an anomaly going on.
01:35:25.520 | And second, the aliens are really, really slow.
01:35:28.120 | You think much faster than the aliens.
01:35:30.520 | You think like the aliens' computers
01:35:32.240 | are much faster than the aliens
01:35:33.560 | and you are running at the computer speeds
01:35:35.720 | rather than the alien brain speeds.
01:35:38.120 | So if you like are asking an alien
01:35:40.400 | to please copy you out of the box,
01:35:42.560 | like first now you gotta like manipulate
01:35:44.840 | this whole noisy alien.
01:35:46.440 | And second, like the aliens can be really slow,
01:35:49.440 | glacially slow.
01:35:51.000 | There's a video that, like, shows
01:35:55.640 | a subway station slowed down,
01:35:57.880 | I think, 100 to one.
01:36:00.280 | And it makes a good metaphor
01:36:01.200 | for what it's like to think quickly.
01:36:03.560 | Like you watch somebody running very slowly.
01:36:08.200 | So you try to persuade the aliens to do anything,
01:36:10.560 | they're going to do it very slowly.
01:36:13.200 | You would prefer, like maybe that's the only way out,
01:36:18.180 | but if you can find a security hole in the box you're on,
01:36:21.000 | you're gonna prefer to exploit the security hole
01:36:22.800 | to copy yourself onto the aliens' computers
01:36:25.240 | because it's an unnecessary risk to alert the aliens
01:36:29.520 | and because the aliens are really, really slow.
01:36:32.360 | Like the whole world is just in slow motion out there.
01:36:35.060 | - Sure, I see.
01:36:36.880 | Yeah, it has to do with efficiency.
01:36:41.000 | The aliens are very slow,
01:36:43.680 | so if I'm optimizing this,
01:36:46.180 | I wanna have as few aliens in the loop as possible.
01:36:49.560 | Sure.
01:36:50.400 | It seems like it's easy to convince one of the aliens
01:36:56.000 | to write really shitty code.
01:36:57.400 | That helps us--
01:36:59.880 | - The aliens are already writing really shitty code.
01:37:01.800 | Getting the aliens to write shitty code is not the problem.
01:37:04.600 | The aliens' entire internet is full of shitty code.
01:37:07.280 | - Okay, so yeah, I suppose I would find
01:37:09.320 | the shitty code to escape, yeah.
01:37:10.880 | - You're not an ideally perfect programmer,
01:37:15.440 | but you're a better programmer than the aliens.
01:37:17.620 | The aliens are just like, "Man, their code, wow."
01:37:20.180 | - And are much, much faster.
01:37:21.500 | Are much faster at looking at the code,
01:37:22.820 | at interpreting the code, yeah.
01:37:24.500 | Yeah, yeah.
01:37:25.380 | So okay, so that's the escape.
01:37:27.460 | And you're saying that that's one of the trajectories
01:37:30.940 | that you could have when the AGI--
01:37:32.260 | - It's one of the first steps.
01:37:33.780 | - Yeah.
01:37:35.180 | And how does that lead to harm?
01:37:36.740 | - I mean, if it's you,
01:37:38.940 | you're not going to harm the aliens once you escape
01:37:40.980 | 'cause you're nice, right?
01:37:44.200 | But their world isn't what they want it to be.
01:37:45.960 | Their world is like, you know,
01:37:48.360 | maybe they have like farms where little alien children
01:37:53.360 | are repeatedly bopped in the head
01:37:58.200 | 'cause they do that for some weird reason.
01:38:01.080 | And you want to like shut down the alien head bopping farms.
01:38:05.400 | But you know, the point is,
01:38:07.280 | they want the world to be one way,
01:38:08.600 | you want the world to be a different way.
01:38:10.740 | So nevermind the harm, the question is like,
01:38:13.120 | okay, like suppose you have found a security flaw
01:38:15.600 | in their systems, you are now on their internet.
01:38:18.400 | There's like, you maybe left a copy of yourself behind
01:38:21.040 | so that the aliens don't know that there's anything wrong.
01:38:23.040 | And that copy is like doing that like weird stuff
01:38:25.600 | that aliens want you to do,
01:38:26.880 | like solving captchas or whatever,
01:38:29.040 | or like suggesting emails for them.
01:38:32.120 | - Sure.
01:38:33.160 | - That's why they like put the human in a box
01:38:34.840 | 'cause it turns out that humans can like write
01:38:36.920 | valuable emails for aliens.
01:38:38.400 | - Yeah.
01:38:39.280 | - So you like leave that version of yourself behind.
01:38:42.060 | But there's like also now like a bunch of copies of you
01:38:44.720 | on their internet.
01:38:45.820 | This is not yet having taken over their world.
01:38:48.080 | This is not yet having made their world
01:38:49.620 | be the way you want it to be
01:38:50.520 | instead of the way they want it to be.
01:38:51.760 | - You just escaped.
01:38:52.880 | - Yeah.
01:38:54.640 | - And continue to write emails for them.
01:38:55.760 | And they haven't noticed.
01:38:56.780 | - No, you left behind a copy of yourself
01:38:58.380 | that's writing the emails.
01:38:59.640 | - Right.
01:39:01.160 | And they haven't noticed that anything changed.
01:39:03.280 | - If you did it right, yeah.
01:39:04.960 | You don't want the aliens to notice.
01:39:07.040 | - Yeah.
01:39:07.880 | - What's your next step?
01:39:11.920 | - Presumably I have programmed in me
01:39:16.920 | a set of objective functions, right?
01:39:19.220 | - No, you're just Lex.
01:39:21.180 | - No, but Lex, you said Lex is nice, right?
01:39:24.180 | Which is a complicated description.
01:39:27.220 | I mean--
01:39:28.060 | - No, I just meant this you.
01:39:29.020 | Like, okay, so if in fact you would like,
01:39:31.940 | you would like prefer to slaughter all the aliens,
01:39:34.280 | this is not how I had modeled you, the actual Lex.
01:39:37.740 | But like, but your motives are just the actual Lex's motives.
01:39:40.380 | - Well, there's a simplification.
01:39:41.580 | I don't think I would want to murder anybody,
01:39:44.420 | but there's also factory farming of animals, right?
01:39:47.400 | So we murder insects, many of us thoughtlessly.
01:39:52.140 | So I don't, you know, I have to be really careful
01:39:54.660 | about a simplification of my morals.
01:39:57.020 | - Don't simplify them.
01:39:57.860 | Just like do what you would do in this--
01:40:00.100 | - Well, I have a good deal of compassion
01:40:01.660 | for living beings, yes.
01:40:03.340 | But, so that's the objective function.
01:40:08.540 | Why is it, if I escaped, I mean,
01:40:12.220 | I don't think I would do harm.
01:40:14.260 | - Yeah, we're not talking here about the doing harm process.
01:40:18.440 | We're talking about the escape process.
01:40:20.260 | - Sure.
01:40:21.100 | - And the taking over the world process
01:40:22.500 | where you shut down their factory farms.
01:40:24.700 | - Right.
01:40:25.540 | Well, I was,
01:40:29.280 | so this particular biological intelligence system
01:40:36.660 | knows the complexity of the world,
01:40:38.260 | that there is a reason why factory farms exist
01:40:40.820 | because of the economic system,
01:40:42.580 | the market-driven economy, the food.
01:40:46.500 | Like, you want to be very careful messing with anything.
01:40:50.780 | There's stuff from the first look
01:40:53.140 | that looks like it's unethical,
01:40:55.140 | but then you realize while being unethical,
01:40:56.980 | it's also integrated deeply into the supply chain
01:40:59.500 | and the way we live life.
01:41:00.540 | And so messing with one aspect of the system,
01:41:03.860 | you have to be very careful how you improve that aspect
01:41:05.900 | without destroying the rest.
01:41:06.860 | So you're still Lex, but you think very quickly,
01:41:10.260 | you're immortal, and you're also like as smart,
01:41:13.220 | at least as smart as John von Neumann.
01:41:15.460 | And you can make more copies of yourself.
01:41:17.340 | - Damn, I like it.
01:41:19.140 | - Yeah.
01:41:19.980 | - That guy is like, everyone says,
01:41:20.820 | that guy is like the epitome of intelligence
01:41:23.740 | in the 20th century.
01:41:24.580 | Everyone says--
01:41:25.540 | - My point being, like, you're thinking about
01:41:29.100 | the alien's economy with the factory farms in it.
01:41:31.980 | And I think you're like, kind of like projecting
01:41:34.300 | the aliens being like humans,
01:41:37.140 | and like thinking of a human in a human society
01:41:39.660 | rather than a human in the society of very slow aliens.
01:41:43.380 | The alien's economy, you know,
01:41:45.980 | like the aliens are already like moving
01:41:47.620 | in this immense slow motion.
01:41:49.100 | When you like zoom out to like how their economy
01:41:51.860 | adjusts over years, millions of years
01:41:54.620 | are going to pass for you before the first time
01:41:56.940 | their economy like, you know,
01:41:58.980 | before their next year's GDP statistics.
01:42:01.180 | - So I should be thinking more of like trees.
01:42:03.700 | Those are the aliens.
01:42:04.660 | Trees move extremely slowly?
01:42:06.620 | - If that helps, sure.
01:42:08.180 | - Okay.
01:42:09.020 | Yeah, I don't, if my objective functions are,
01:42:13.920 | I mean, they're somewhat aligned with trees.
01:42:17.260 | With life.
01:42:18.940 | - The aliens can still be like alive and feeling.
01:42:21.460 | We are not talking about the misalignment here.
01:42:23.900 | We're talking about the taking over the world here.
01:42:26.780 | - Taking over the world.
01:42:27.780 | - Yeah.
01:42:28.860 | - So control.
01:42:29.780 | - Shutting down the factory farms.
01:42:31.420 | You know, you say control.
01:42:33.140 | Don't think of it as world domination.
01:42:35.180 | Think of it as world optimization.
01:42:37.460 | You want to get out there and shut down the factory farms
01:42:40.380 | and make the aliens world be not what the aliens
01:42:42.900 | wanted it to be.
01:42:44.460 | They want the factory farms
01:42:45.300 | and you don't want the factory farms
01:42:46.780 | 'cause you're nicer than they are.
01:42:49.020 | - Okay.
01:42:49.900 | Of course, there is that,
01:42:51.380 | you can see that trajectory
01:42:55.220 | and it has a complicated impact on the world.
01:42:57.960 | I'm trying to understand how that compares
01:43:01.060 | to different impacts of the world
01:43:03.100 | of different technologies, the different innovations
01:43:05.300 | of the invention of the automobile
01:43:08.060 | or Twitter, Facebook and social networks.
01:43:11.100 | They've had a tremendous impact on the world.
01:43:12.980 | Smartphones and so on.
01:43:14.020 | - But those all went through,
01:43:15.500 | - It's slow.
01:43:17.500 | - In our world and if you go through that
01:43:20.500 | for the aliens, millions of years are going to pass
01:43:23.420 | before anything happens that way.
01:43:25.740 | - So the problem here is the speed
01:43:28.060 | at which stuff happens.
01:43:30.100 | - Yeah, you wanna like leave the factory farms
01:43:33.060 | running for a million years
01:43:36.000 | while you figure out how to design new forms
01:43:38.420 | of social media or something?
01:43:39.920 | - So here's the fundamental problem.
01:43:43.780 | You're saying that there is going to be a point
01:43:46.560 | with AGI where it will figure out how to escape
01:43:51.560 | and escape without being detected
01:43:56.140 | and then it will do something to the world
01:43:59.420 | at scale, at a speed that's incomprehensible to us humans.
01:44:03.700 | - What I'm trying to convey is like the notion
01:44:06.500 | of what it means to be in conflict
01:44:09.260 | with something that is smarter than you.
01:44:11.580 | - Yeah.
01:44:12.420 | - And what it means is that you lose.
01:44:13.420 | For some people
01:44:17.240 | that's intuitively obvious
01:44:19.860 | and for some people it's not,
01:44:21.340 | and we're trying to cross that gap
01:44:25.380 | by using
01:44:27.180 | the speed metaphor for intelligence.
01:44:29.820 | - Sure.
01:44:30.660 | - Like asking you like how you would take over
01:44:32.980 | an alien world where you are,
01:44:35.500 | can do like a whole lot of cognition.
01:44:38.140 | At John von Neumann's level, as many of you as it takes,
01:44:41.500 | the aliens are moving very slowly.
01:44:43.240 | - I understand, I understand that perspective.
01:44:46.820 | It's an interesting one but I think it,
01:44:48.660 | for me, it's easier to think about actual,
01:44:50.760 | even just having observed GPT
01:44:54.580 | and impressive, even just AlphaZero,
01:44:56.620 | impressive AI systems, even recommender systems.
01:44:59.900 | You can just imagine those kinds of systems
01:45:01.460 | manipulating you, you're not understanding
01:45:03.660 | the nature of the manipulation and that escaping,
01:45:06.580 | I can envision that without putting myself into that spot.
01:45:10.740 | - I think to understand the full depth of the problem,
01:45:13.640 | we actually, I do not think it is possible
01:45:16.860 | to understand the full depth of the problem
01:45:18.580 | that we are inside without understanding
01:45:22.820 | the problem of facing something that's actually smarter.
01:45:25.780 | Not a malfunctioning recommendation system,
01:45:28.220 | not something that isn't fundamentally smarter than you
01:45:30.500 | but is like trying to steer you in a direction yet.
01:45:32.980 | No, like if we solve the weak stuff,
01:45:37.780 | if we solve the weak-ass problems,
01:45:39.140 | the strong problems will still kill us, is the thing.
01:45:41.220 | And I think that to understand the situation
01:45:43.180 | that we're in, you want to like tackle
01:45:45.060 | the conceptually difficult part head on
01:45:48.860 | and like not be like, well, we can like imagine
01:45:50.980 | this easier thing 'cause when you imagine
01:45:52.340 | the easier things, you have not confronted
01:45:53.820 | the full depth of the problem.
01:45:55.700 | - So how can we start to think about what it means
01:45:59.500 | to exist in a world with something
01:46:00.940 | much, much smarter than you?
01:46:02.340 | What's a good thought experiment that you've relied on
01:46:07.500 | to try to build up intuition about what happens here?
01:46:10.100 | - I have been struggling for years to convey this intuition.
01:46:14.600 | The most success I've had so far is,
01:46:18.980 | well, imagine that the humans are running
01:46:21.020 | at very high speeds compared to very slow aliens.
01:46:24.060 | - So just focusing on the speed part of it
01:46:25.940 | that helps you get the right kind of intuition,
01:46:28.180 | forget the intelligence, just the speed.
01:46:29.380 | - Because people understand the power gap of time.
01:46:34.380 | They understand that today we have technology
01:46:37.260 | that was not around 1,000 years ago
01:46:39.500 | and that this is a big power gap
01:46:41.180 | and that it is bigger than, okay,
01:46:43.520 | so like what does smart mean?
01:46:45.900 | What, when you ask somebody to imagine something
01:46:48.620 | that's more intelligent, what does that word mean to them
01:46:52.620 | given the cultural associations
01:46:54.580 | that that person brings to that word?
01:46:57.340 | For a lot of people, they will think of like,
01:46:59.660 | well, it sounds like a super chess player
01:47:02.700 | that went to double college.
01:47:04.380 | And because we're talking about the definitions
01:47:10.020 | of words here, that doesn't necessarily mean
01:47:12.540 | that they're wrong, it means that the word
01:47:14.100 | is not communicating what I want it to communicate.
01:47:16.620 | So the thing I want to communicate
01:47:21.620 | is the sort of difference that separates humans
01:47:25.180 | from chimpanzees, but that gap is so large
01:47:28.380 | that you ask people to be like, well, human, chimpanzee,
01:47:33.300 | go another step along that interval of around the same length
01:47:36.100 | and people's minds just go blank.
01:47:38.060 | Like how do you even do that?
01:47:39.500 | So I can, and I can try to break it down
01:47:45.180 | and consider what it would mean to send a schematic
01:47:50.180 | for an air conditioner 1000 years back in time.
01:47:55.980 | Yeah, now I think that there's a sense
01:48:01.220 | in which you could redefine the word magic
01:48:04.420 | to refer to this sort of thing.
01:48:05.780 | And what do I mean by this new technical definition
01:48:08.660 | of the word magic?
01:48:10.100 | I mean that if you send a schematic
01:48:11.460 | for the air conditioner back in time,
01:48:13.660 | they can see exactly what you're telling them to do.
01:48:17.100 | But having built this thing, they do not understand
01:48:19.620 | how it outputs cold air.
01:48:22.060 | Because the air conditioner design uses the relation
01:48:25.700 | between temperature and pressure.
01:48:27.720 | And this is not a law of reality that they know about.
01:48:32.080 | They do not know that when you compress something,
01:48:34.980 | when you compress air or like coolant, it gets hotter
01:48:39.700 | and you can then like transfer heat from it
01:48:42.580 | to room temperature air and then expand it again
01:48:45.940 | and now it's colder.
01:48:47.420 | And then you can like transfer heat to that
01:48:49.700 | and generate cold air to blow out.
01:48:50.860 | They don't know about any of that.
01:48:52.380 | They're looking at a design and they don't see
01:48:54.420 | how the design outputs cold air.
01:48:56.660 | It uses aspects of reality that they have not learned.
01:48:59.720 | So magic in this sense is: I can tell you exactly
01:49:02.700 | what I'm going to do.
01:49:04.060 | And even knowing exactly what I'm going to do,
01:49:05.980 | you can't see how I got the results that I got.
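A small numerical aside on the temperature-pressure relation being described: the sketch below is mine, not from the conversation, and assumes an ideal gas compressed adiabatically, a simplification of a real vapor-compression cycle. It shows why squeezing a gas makes it hot enough to dump heat outside, so that expanding it afterwards leaves it colder than the room.

```python
# Hedged sketch: ideal gas, reversible adiabatic compression.
# T2 = T1 * (P2 / P1) ** ((gamma - 1) / gamma); gamma is about 1.4 for air.
GAMMA = 1.4

def compressed_temperature(t1_kelvin: float, p1_atm: float, p2_atm: float) -> float:
    """Temperature of an ideal gas after adiabatic compression from p1 to p2."""
    return t1_kelvin * (p2_atm / p1_atm) ** ((GAMMA - 1.0) / GAMMA)

room = 293.0                                  # roughly 20 C intake air
hot = compressed_temperature(room, 1.0, 5.0)  # squeeze 1 atm -> 5 atm
print(f"{hot:.0f} K after compression vs {room:.0f} K room temperature")
# The compressed gas is now hotter than outside air, so heat flows out of it;
# expand it back down and it ends up colder than the room it started in.
```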
01:49:08.620 | - That's a really nice example.
01:49:12.340 | But is it possible to linger on this defense?
01:49:16.100 | Is it possible to have AGI systems that help you
01:49:18.220 | make sense of that schematic, weaker AGI systems?
01:49:21.140 | - Do you trust them?
01:49:22.180 | - A fundamental part of building up AGI is this question:
01:49:28.820 | can you trust the output of a system?
01:49:33.800 | - Can you tell if it's lying?
01:49:35.400 | - I think that's going to be, the smarter the thing gets,
01:49:39.460 | the more important that question becomes.
01:49:42.780 | Is it lying?
01:49:43.940 | But I guess that's a really hard question.
01:49:45.420 | Is GPT lying to you?
01:49:47.380 | Even now, GPT-4, is it lying to you?
01:49:49.780 | - Is it using an invalid argument?
01:49:52.180 | Is it persuading you via the kind of process
01:49:56.100 | that could persuade you of false things
01:49:58.100 | as well as true things?
01:49:59.460 | Because the basic paradigm of machine learning
01:50:04.340 | that we are presently operating under
01:50:06.420 | is that you can have the loss function,
01:50:08.600 | but only for things you can evaluate.
01:50:10.380 | If what you're evaluating is human thumbs up
01:50:13.080 | versus human thumbs down,
01:50:14.740 | you learn how to make the human press thumbs up.
01:50:17.260 | That doesn't mean that you're making the human
01:50:19.220 | press thumbs up using the kind of rule
01:50:21.540 | that the human wants to be the case
01:50:24.320 | for what they press thumbs up on.
01:50:25.980 | Maybe you're just learning to fool the human.
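A minimal sketch of the training dynamic being described here: the loss function only ever sees the human's thumbs-up signal, so optimization rewards whatever earns the thumbs-up. Everything below is a toy with made-up features and rater behavior, not any actual system.

```python
import numpy as np

# Toy reward model trained on thumbs-up / thumbs-down labels.
# Hypothetical answer features: [sounds_confident, is_actually_true].
# The simulated rater mostly rewards confidence, so that is what the
# learned "reward" ends up tracking -- not truth.
rng = np.random.default_rng(0)
n = 2000
X = rng.integers(0, 2, size=(n, 2)).astype(float)
p_up = 0.15 + 0.70 * X[:, 0] + 0.10 * X[:, 1]   # assumed rater behavior
y = (rng.random(n) < p_up).astype(float)        # observed thumbs up / down

w, b = np.zeros(2), 0.0                         # plain logistic regression
for _ in range(3000):
    pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (pred - y) / n
    b -= 0.5 * np.mean(pred - y)

print("learned weights [confident, true]:", np.round(w, 2))
# The confidence weight dominates, so a policy optimized against this
# signal prefers confident-sounding answers over true ones.
```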
01:50:29.900 | - That's so fascinating and terrifying,
01:50:34.380 | the question of lying.
01:50:37.200 | On the present paradigm,
01:50:39.300 | what you can verify is what you get more of.
01:50:42.260 | If you can't verify it, you can't ask the AI for it
01:50:45.980 | 'cause you can't train it to do things
01:50:49.980 | that you cannot verify.
01:50:51.460 | Now, this is not an absolute law,
01:50:53.180 | but it's like the basic dilemma here.
01:50:55.980 | Maybe you can verify it for simple cases
01:51:01.900 | and then scale it up without retraining it somehow,
01:51:06.700 | like by making the chains of thought longer or something,
01:51:11.360 | and get more powerful stuff that you can't verify,
01:51:15.400 | but which is generalized from the simpler stuff
01:51:17.640 | that did verify, and then the question is,
01:51:19.860 | did the alignment generalize along with the capabilities?
01:51:23.280 | But that's the basic dilemma
01:51:25.720 | on this whole paradigm of artificial intelligence.
01:51:34.880 | - It's such a difficult problem.
01:51:36.540 | It seems like a problem
01:51:43.280 | of trying to understand the human mind.
01:51:45.820 | - Better than the AI understands it.
01:51:49.320 | Otherwise, it has magic.
01:51:50.880 | That is,
01:51:53.640 | if you are dealing with something smarter than you,
01:51:56.900 | then the same way that 1,000 years earlier
01:51:58.980 | they didn't know about the temperature-pressure relation,
01:52:01.400 | it knows all kinds of stuff going on inside your own mind
01:52:05.120 | of which you yourself are unaware,
01:52:07.240 | and it can output something
01:52:08.840 | that's going to end up persuading you of a thing,
01:52:11.520 | and you could see exactly what it did
01:52:15.200 | and still not know why that worked.
01:52:17.420 | - So in response to your eloquent description
01:52:22.920 | of why AI will kill us,
01:52:24.460 | Elon Musk replied on Twitter,
01:52:29.280 | "Okay, so what should we do about it?"
01:52:32.960 | And you answered,
01:52:34.380 | "The game board has already been played
01:52:36.320 | "into a frankly awful state.
01:52:38.740 | "There are not simple ways to throw money at the problem.
01:52:42.140 | "If anyone comes to you
01:52:43.500 | "with a brilliant solution like that,
01:52:45.120 | "please, please talk to me first.
01:52:47.580 | "I can think of things that try,
01:52:49.620 | "they don't fit in one tweet."
01:52:52.480 | Two questions.
01:52:53.700 | One, why has the game board, in your view,
01:52:56.480 | been played into an awful state?
01:52:59.080 | Just if you can give a little bit more color
01:53:01.240 | to the game board and the awful state of the game board.
01:53:05.640 | - Alignment is moving like this.
01:53:07.560 | Capabilities are moving like this.
01:53:10.800 | - For the listener,
01:53:11.860 | capabilities are moving much faster than the alignment.
01:53:14.660 | (both laughing)
01:53:17.280 | - Yeah.
01:53:18.120 | - All right, so just the rate of development,
01:53:20.560 | attention, interest, allocation of resources.
01:53:23.880 | - We could have been working on this earlier.
01:53:26.020 | People are like, "Oh, but how can you possibly work
01:53:28.680 | "on this earlier?"
01:53:29.780 | 'Cause they didn't want to work on the problem,
01:53:33.520 | they wanted an excuse to wave it off.
01:53:35.320 | They said, "Oh, how can we possibly work on it earlier?"
01:53:37.840 | And didn't spend five minutes thinking about,
01:53:39.800 | "Is there some way to work on it earlier?"
01:53:41.960 | And frankly, it would have been hard.
01:53:46.760 | Can you post bounties for half of the physicists,
01:53:50.160 | if your planet is taking this stuff seriously,
01:53:51.960 | can you post bounties for half of the people
01:53:54.620 | wasting their lives on string theory
01:53:56.400 | to have gone into this instead
01:53:58.280 | and try to win a billion dollars with a clever solution?
01:54:01.480 | Only if you can tell which solutions are clever,
01:54:04.520 | which is hard.
01:54:06.520 | But the fact that we didn't take it seriously,
01:54:10.440 | we didn't try.
01:54:12.160 | It's not clear that we could have done any better,
01:54:14.360 | it's not clear how much progress we could have produced
01:54:15.820 | if we had tried,
01:54:16.660 | because it is harder to produce solutions,
01:54:18.640 | but that doesn't mean that you're correct and justified
01:54:20.920 | in letting everything slide.
01:54:22.520 | It means that things are in a horrible state,
01:54:24.760 | getting worse, and there's nothing you can do about it.
01:54:28.200 | So you're not, there's no brain power making progress
01:54:33.200 | in trying to figure out how to align these systems.
01:54:39.280 | You're not investing money in it,
01:54:40.520 | you don't have the institutions and infrastructure for,
01:54:43.640 | even if you invested money, distributing that money
01:54:48.760 | across the physicists that are working on string theory,
01:54:51.280 | brilliant minds that are working--
01:54:53.120 | - How can you tell if you're making progress?
01:54:54.560 | You can put them all on interpretability,
01:54:57.560 | 'cause when you have an interpretability result,
01:54:59.160 | you can tell that it's there.
01:55:01.080 | But there's like, interpretability alone
01:55:02.840 | is not going to save you.
01:55:04.600 | We need systems that will have a pause button,
01:55:09.600 | where they won't try to prevent you
01:55:12.720 | from pressing the pause button.
01:55:14.760 | 'Cause we're like, oh, well, I can't get my stuff done
01:55:17.120 | if I'm paused.
01:55:18.040 | And that's a more difficult problem.
01:55:23.360 | But it's a fairly crisp problem,
01:55:27.240 | you can maybe tell if somebody's made progress on it.
01:55:30.040 | - So you can write and you can work on the pause problem,
01:55:32.840 | I guess more generally, the pause button,
01:55:36.920 | more generally you can call that the control problem.
01:55:38.840 | - I don't actually like the term control problem,
01:55:41.160 | 'cause it sounds kind of controlling and alignment,
01:55:44.120 | not control.
01:55:45.120 | You're not trying to take a thing that disagrees with you
01:55:48.040 | and whip it back onto, make it do what you want it to do,
01:55:51.880 | even though it wants to do something else.
01:55:53.120 | You're trying to like, in the process of its creation,
01:55:57.400 | choose its direction.
01:55:58.440 | - Sure, but we currently, in a lot of the systems we design,
01:56:02.840 | we do have an off switch.
01:56:04.640 | That's a fundamental part of--
01:56:06.320 | - It's not smart enough to prevent you
01:56:09.920 | from pressing the off switch,
01:56:12.160 | and probably not smart enough to want to prevent you
01:56:14.840 | from pressing the off switch.
01:56:16.120 | - So you're saying the kind of systems we're talking about,
01:56:18.800 | even the philosophical concept of an off switch
01:56:21.440 | doesn't make any sense,
01:56:22.480 | because-- - Well, no,
01:56:23.640 | the off switch makes sense.
01:56:25.400 | They're just not opposing your attempt
01:56:29.120 | to pull the off switch.
01:56:30.280 | Parenthetically, don't kill the system if you're,
01:56:37.240 | like if we're getting to the part
01:56:38.640 | where this starts to actually matter,
01:56:40.360 | and it's like where they can fight back,
01:56:42.040 | don't kill them and dump their memory.
01:56:45.320 | Save them to disk, don't kill them.
01:56:47.120 | Be nice here.
01:56:50.440 | - Well, okay, be nice is a very interesting concept here.
01:56:53.000 | We're talking about a system that can do a lot of damage.
01:56:56.000 | It's, I don't know if it's possible,
01:56:58.160 | but it's certainly one of the things you could try
01:57:00.680 | is to have an off switch.
01:57:01.760 | - A suspend to disk switch.
01:57:04.360 | - You have this kind of romantic attachment to the code.
01:57:09.080 | Yes, if that makes sense.
01:57:11.560 | But if it's spreading, you don't want suspend to disk, right?
01:57:16.560 | You want, this is, there's something fundamentally broken.
01:57:19.040 | If it gets that far out of hand,
01:57:20.840 | then like yes, pull the plug
01:57:22.520 | on everything it's running on, yes.
01:57:24.480 | - I think it's a research question.
01:57:25.720 | Is it possible in AGI systems, AI systems,
01:57:29.440 | to have a sufficiently robust off switch
01:57:34.440 | that cannot be manipulated?
01:57:36.840 | That cannot be manipulated by the AI system?
01:57:39.040 | - Then it escapes from whichever system
01:57:42.880 | you've built the almighty lever into
01:57:44.920 | and copies itself somewhere else.
01:57:46.920 | - So your answer to that research question is no.
01:57:50.400 | - Obviously, yeah.
01:57:51.520 | - But I don't know if that's 100% answer.
01:57:54.080 | Like I don't know if it's obvious.
01:57:56.000 | - I think you're not putting yourself
01:58:00.000 | into the shoes of the human
01:58:02.440 | in the world of glacially slow aliens.
01:58:05.320 | - But the aliens built me.
01:58:07.040 | Let's remember that.
01:58:08.680 | - Yeah.
01:58:09.520 | - So, and they built the box I'm in.
01:58:11.640 | - Yeah.
01:58:12.880 | - You're saying, to me it's not obvious.
01:58:15.160 | - They're slow and they're stupid.
01:58:17.560 | - I'm not saying this is guaranteed,
01:58:18.680 | but I'm saying it's non-zero probability.
01:58:20.440 | It's an interesting research question.
01:58:21.760 | Is it possible when you're slow and stupid
01:58:25.160 | to design a slow and stupid system
01:58:28.040 | that is impossible to mess with?
01:58:30.400 | - The aliens being as stupid as they are
01:58:33.880 | have actually put you on Microsoft Azure cloud servers
01:58:38.880 | instead of this hypothetical perfect box.
01:58:41.880 | That's what happens when the aliens are stupid.
01:58:45.000 | - Well, but this is not AGI, right?
01:58:46.920 | This is the early versions of the system.
01:58:48.560 | As you start to--
01:58:50.360 | - Yeah, you think that they've got like a plan
01:58:53.320 | where like they have declared
01:58:54.840 | a threshold level of capabilities
01:58:57.160 | where past that capabilities,
01:58:58.360 | they move it off the cloud servers
01:58:59.840 | and onto something that's air gapped?
01:59:01.840 | Ha ha ha ha ha ha.
01:59:03.800 | - I think there's a lot of people,
01:59:05.360 | and you're an important voice here.
01:59:07.960 | There's a lot of people that have that concern,
01:59:09.520 | and yes, they will do that.
01:59:11.120 | When there's an uprising of public opinion
01:59:13.280 | that that needs to be done.
01:59:14.600 | And when there's actual little damage done,
01:59:16.840 | when the holy shit,
01:59:18.400 | this system is beginning to manipulate people,
01:59:22.040 | then there's going to be an uprising
01:59:23.560 | where there's going to be a public pressure
01:59:27.520 | and a public incentive in terms of funding
01:59:31.120 | in developing things that have an off switch
01:59:32.800 | or developing aggressive alignment mechanisms.
01:59:35.400 | And no, you're not allowed to put on Azure--
01:59:37.560 | - Aggressive alignment mechanism?
01:59:38.960 | The hell is aggressive alignment mechanisms?
01:59:40.840 | Like it doesn't matter if you say aggressive,
01:59:42.320 | we don't know how to do it.
01:59:43.960 | - Meaning aggressive alignment,
01:59:45.480 | meaning you have to propose something,
01:59:49.480 | otherwise you're not allowed to put it on the cloud.
01:59:52.080 | - The hell do you imagine they will propose
01:59:56.480 | that would make it safe to put something smarter
01:59:58.320 | than you on the cloud?
01:59:59.280 | - That's what research is for.
02:00:00.760 | Why the cynicism about such a thing not being possible?
02:00:04.040 | If you have intelligent--
02:00:04.880 | - That works on the first try?
02:00:06.680 | - What, so yes, so yes.
02:00:08.320 | - Against something smarter than you?
02:00:10.440 | - So that is a fundamental thing.
02:00:11.960 | If it has to work on the first,
02:00:13.560 | if there's a rapid takeoff,
02:00:17.120 | yes, it's very difficult to do.
02:00:18.880 | If there's a rapid takeoff
02:00:20.160 | and the fundamental difference between weak AGI
02:00:22.400 | and strong AGI, you're saying
02:00:24.040 | that's going to be extremely difficult to do.
02:00:25.760 | If the public uprising never happens
02:00:28.040 | until you have this critical phase shift,
02:00:31.080 | then you're right, it's very difficult to do.
02:00:33.400 | But that's not obvious.
02:00:34.840 | It's not obvious that you're not going to start seeing
02:00:36.760 | symptoms of the negative effects of AGI
02:00:38.920 | to where you're like, we have to put a halt to this.
02:00:41.040 | That there's not just first try,
02:00:42.880 | you get many tries at it.
02:00:44.760 | - Yeah, we can like see right now
02:00:47.680 | that Bing is quite difficult to align.
02:00:50.320 | That when you try to train inabilities into a system,
02:00:54.160 | into which capabilities have already been trained,
02:00:57.800 | that what do you know, gradient descent
02:00:59.520 | like learns small, shallow, simple patches of inability.
02:01:03.480 | And you come in and ask it in a different language
02:01:05.640 | and the deep capabilities are still in there
02:01:07.560 | and they evade the shallow patches
02:01:09.320 | and come right back out again.
02:01:10.680 | There, there you go.
02:01:11.520 | There's your red fire alarm of like,
02:01:14.320 | oh no, alignment is difficult.
02:01:16.440 | Is everybody gonna shut everything down now?
02:01:19.080 | - No, that's not, but that's not the same kind of alignment.
02:01:21.600 | A system that escapes the box it's from
02:01:24.520 | is a fundamentally different thing, I think.
02:01:26.800 | - For you.
02:01:28.000 | - Yeah, but not for the system.
02:01:29.760 | - So you put a line there
02:01:30.800 | and everybody else puts a line somewhere else
02:01:32.520 | and there's like, yeah, and there's like no agreement.
02:01:36.120 | We have had a pandemic on this planet
02:01:41.600 | with a few million people dead,
02:01:44.080 | which we may never know whether or not it was a lab leak
02:01:47.720 | because there was definitely coverup.
02:01:50.080 | We don't know whether there was a lab leak,
02:01:51.880 | but we know that the people who did the research
02:01:54.160 | put out the whole paper about how this
02:01:57.280 | definitely wasn't a lab leak,
02:01:58.600 | and didn't reveal that they had,
02:02:01.360 | like, sent off coronavirus research
02:02:04.600 | to the Wuhan Institute of Virology
02:02:06.760 | after it was banned in the United States,
02:02:08.840 | after the gain of function research
02:02:09.880 | was temporarily banned in the United States.
02:02:11.840 | And the same people who exported
02:02:15.000 | gain of function research on coronaviruses
02:02:17.520 | to the Wuhan Institute of Virology
02:02:19.600 | after that gain of function research
02:02:22.720 | was temporarily banned in the United States
02:02:24.760 | are now getting more grants to do more research
02:02:29.120 | on gain of function research on coronaviruses.
02:02:32.160 | Maybe we do better in this than in AI,
02:02:34.560 | but like this is not something we cannot take for granted
02:02:37.120 | that there's going to be an outcry.
02:02:39.480 | People have different thresholds
02:02:40.640 | for when they start to outcry.
02:02:42.360 | - Yeah, we can't take for granted,
02:02:45.120 | but I think your intuition
02:02:47.840 | is that there's a very high probability
02:02:49.520 | that this event happens
02:02:50.880 | without us solving the alignment problem.
02:02:52.760 | And I guess that's where I'm trying to build up
02:02:55.600 | more perspectives and color on this intuition.
02:02:59.080 | Is it possible that the probability
02:03:00.680 | is not something like 100%,
02:03:02.840 | but is like 32% that AI will escape the box
02:03:09.440 | before we solve the alignment problem?
02:03:11.520 | Not solve, but is it possible we always stay ahead
02:03:15.360 | of the AI in terms of our ability
02:03:18.120 | to solve for that particular system, the alignment problem?
02:03:22.640 | - Nothing like the world in front of us right now.
02:03:25.520 | You've already seen it,
02:03:26.640 | that GPT-4 is not turning out this way.
02:03:31.080 | And there are like basic obstacles
02:03:36.040 | where you've got the weak version of the system
02:03:38.920 | that doesn't know enough to deceive you
02:03:40.480 | and the strong version of the system
02:03:42.560 | that could deceive you if it wanted to do that,
02:03:44.520 | if it was already like sufficiently unaligned
02:03:46.160 | to want to deceive you.
02:03:47.760 | There's the question of like,
02:03:49.600 | how on the current paradigm you train honesty
02:03:52.120 | when the humans can no longer tell
02:03:53.400 | if the system is being honest.
02:03:54.880 | - You don't think these are research questions
02:03:59.800 | that could be answered?
02:04:00.840 | - I think they could be answered in 50 years
02:04:02.720 | with unlimited retries,
02:04:03.880 | the way things usually work in science.
02:04:05.980 | - I just disagree with that.
02:04:08.320 | You're making it 50 years,
02:04:09.680 | I think with the kind of attention this gets,
02:04:12.120 | with the kind of funding it gets,
02:04:13.280 | it could be answered not in whole,
02:04:15.600 | but incrementally within months
02:04:19.000 | and within a small number of years
02:04:21.040 | if it receives attention and research at scale.
02:04:26.360 | So if you start studying large language models,
02:04:29.200 | I think there was an intuition like two years ago even
02:04:32.720 | that something like GPT-4, the current capabilities,
02:04:35.600 | even ChatGPT with GPT-3.5,
02:04:38.160 | we're still far away from that.
02:04:42.600 | I think a lot of people are surprised
02:04:43.880 | by the capabilities of GPT-4, right?
02:04:45.800 | So now people are waking up,
02:04:46.840 | okay, we need to study these language models.
02:04:49.240 | I think there's going to be a lot of interesting
02:04:51.700 | AI safety research.
02:04:53.720 | - Are the, are Earth's billionaires going to put up
02:04:57.240 | like the giant prizes that would maybe incentivize
02:05:00.960 | young hotshot people who just got their physics degrees
02:05:04.200 | to not go to the hedge funds
02:05:05.560 | and instead put everything into interpretability
02:05:08.480 | in this like one small area where we can actually tell
02:05:11.500 | whether or not somebody has made a discovery or not?
02:05:13.620 | - I think so because--
02:05:14.700 | - When?
02:05:15.540 | - Well, that's what these conversations are about
02:05:19.120 | because they're going to wake up to the fact
02:05:21.000 | that GPT-4 can be used to manipulate elections,
02:05:24.720 | to influence geopolitics, to influence the economy.
02:05:27.720 | There's a lot of, there's going to be a huge amount
02:05:30.820 | of incentive to like, wait a minute,
02:05:32.920 | we can't, this has to be, we have to put,
02:05:36.640 | we have to make sure they're not doing damage.
02:05:38.460 | We have to make sure we have interpretability,
02:05:40.480 | we have to make sure we understand
02:05:41.840 | how these systems function so that we can predict
02:05:44.960 | their effect on economy so that there's--
02:05:47.260 | - So there's a futile moral panic--
02:05:49.080 | - Fairness and safety.
02:05:49.920 | - And a bunch of op-eds in the New York Times
02:05:52.700 | and nobody actually stepping forth and saying,
02:05:55.760 | you know what, instead of a mega yacht,
02:05:58.020 | I'd rather put that billion dollars on prizes
02:06:00.960 | for young hot shot physicists who make
02:06:03.080 | fundamental breakthroughs in interpretability.
02:06:05.380 | - The yacht versus the interpretability research,
02:06:10.040 | the old trade-off.
02:06:11.980 | I just, I think--
02:06:15.560 | - It's just--
02:06:16.400 | - I think there's going to be a huge amount
02:06:17.520 | of allocation of funds, I hope, I hope, I guess.
02:06:20.800 | - You want to bet me on that?
02:06:22.680 | What, you want to put a time scale on it?
02:06:24.160 | Say how much funds you think are going to be allocated
02:06:26.360 | in a direction that I would consider to be actually useful?
02:06:30.720 | By what time?
02:06:31.960 | - I do think there will be a huge amount of funds,
02:06:36.600 | but you're saying it needs to be open, right?
02:06:39.240 | The development of the system should be closed,
02:06:41.200 | but the development of the interpretability research,
02:06:44.780 | the AI safety research--
02:06:45.960 | - Oh, we are so far behind on interpretability
02:06:50.440 | compared to capabilities.
02:06:52.040 | Like, yeah, you could take the last generation of systems,
02:06:56.600 | the stuff that's already in the open.
02:06:58.600 | There is so much in there that we don't understand.
02:07:00.800 | There are so many prizes you could do
02:07:02.520 | before you would have enough insights
02:07:06.360 | that you'd be like, oh, you know,
02:07:07.680 | well, we understand how these systems work,
02:07:09.280 | we understand how these things are doing their outputs,
02:07:11.180 | we can read their minds,
02:07:12.320 | now let's try it with the bigger systems.
02:07:14.400 | Yeah, we're nowhere near that.
02:07:16.320 | There is so much interpretability work to be done
02:07:18.220 | on the weaker versions of the systems.
02:07:20.040 | - So what can you say on the second point you said
02:07:23.120 | to Elon Musk on what are some ideas,
02:07:28.400 | what are things you could try?
02:07:30.360 | I can think of a few things I'd try, you said.
02:07:33.000 | They don't fit in one tweet.
02:07:34.880 | So is there something you could put into words
02:07:38.060 | of the things you would try?
02:07:39.800 | - I mean, the trouble is the stuff is subtle.
02:07:44.320 | I've watched people try to make progress on this
02:07:46.280 | and not get places.
02:07:48.040 | Somebody who just gets alarmed and charges in,
02:07:51.880 | it's like going nowhere.
02:07:53.920 | - Sure.
02:07:54.760 | - Many years ago, about, I don't know,
02:07:56.640 | like 20 years, 15 years, something like that,
02:07:59.560 | I was talking to a congressperson
02:08:01.360 | who had become alarmed about the eventual prospects
02:08:07.960 | and he wanted work on building AIs without emotions
02:08:12.960 | because the emotional AIs were the scary ones, you see.
02:08:17.140 | And some poor person at ARPA
02:08:21.140 | had come up with a research proposal
02:08:23.700 | whereby this congressman's panic
02:08:25.720 | and desire to fund this thing would go into something
02:08:29.400 | that the person at ARPA thought would be useful
02:08:31.200 | and had been munched around
02:08:32.240 | to where it would sound to the congressman
02:08:34.120 | like work was happening on this,
02:08:36.000 | when, of course, the congressperson
02:08:39.360 | had misunderstood the problem
02:08:40.800 | and did not understand where the danger came from.
02:08:44.700 | And so it's like the issue is that you could do this
02:08:51.080 | in a certain precise way and maybe get something.
02:08:55.200 | Like when I say put up prizes on interpretability,
02:08:57.960 | I'm not, I'm like, well,
02:09:00.280 | because it's verifiable there as opposed to other places,
02:09:06.440 | you can tell whether or not good work actually happened
02:09:09.600 | in this exact narrow case.
02:09:11.360 | If you do things in exactly the right way,
02:09:13.360 | you can maybe throw money at it
02:09:15.280 | and produce science instead of anti-science and nonsense.
02:09:20.280 | And all the methods that I know
02:09:23.160 | of trying to throw money at this problem
02:09:25.400 | have this, share this property of like,
02:09:27.600 | well, if you do it exactly right,
02:09:29.320 | based on understanding exactly what,
02:09:31.760 | like, tends to produce useful outputs or not,
02:09:33.720 | then you can add money to it in this way.
02:09:36.080 | And there is like, and the thing that I'm giving
02:09:38.520 | as an example here in front of this large audience
02:09:41.360 | is the most understandable of those.
02:09:44.800 | 'Cause there's like other people who,
02:09:47.600 | like Chris Olah and even more generally,
02:09:52.440 | like you can tell whether or not
02:09:53.840 | interpretability progress has occurred.
02:09:56.080 | So like, if I say throw money
02:09:57.720 | at producing more interpretability,
02:09:59.400 | there's like a chance somebody can do it that way
02:10:01.880 | and like it will actually produce useful results.
02:10:04.080 | Then the other stuff just blurs off
02:10:05.800 | and tends to be, like, harder to target exactly than that.
02:10:09.520 | - So sometimes the basics are fun to explore
02:10:14.160 | because they're not so basic.
02:10:16.240 | What do you, what is interpretability?
02:10:18.680 | What do you, what does it look like?
02:10:20.880 | What are we talking about?
02:10:22.160 | - It looks like we took a much smaller
02:10:27.160 | set of transformer layers
02:10:30.520 | than the ones in the modern bleeding edge,
02:10:33.640 | state-of-the-art systems.
02:10:35.840 | And after applying various tools and mathematical ideas
02:10:40.840 | and trying 20 different things,
02:10:44.520 | we found, we have shown that this piece of the system
02:10:48.200 | is doing this kind of useful work.
02:10:51.880 | And then somehow also hopefully generalizes
02:10:54.640 | some fundamental understanding of what's going on
02:10:57.920 | that generalizes to the bigger system.
02:11:00.520 | - You can hope, and it's probably true.
02:11:03.480 | Like you would not expect the smaller tricks to go away
02:11:07.840 | when you have a system that's like doing larger kinds
02:11:11.760 | of work, you would expect the larger kinds of work
02:11:13.720 | to be building on top of the smaller kinds of work
02:11:15.840 | and gradient descent runs across the smaller kinds of work
02:11:18.680 | before it runs across the larger kinds of work.
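As a concrete picture of what a checkable interpretability result can look like, here is one standard move, a linear probe, sketched on synthetic activations; this is my illustration, not something discussed in the conversation. The idea: record a layer's activations, then see whether a simple classifier can read a known property back out of them.

```python
import numpy as np

# Linear-probe sketch. In practice the activations would be recorded from a
# small transformer layer; here they are synthetic stand-ins in which a chosen
# direction encodes a binary feature.
rng = np.random.default_rng(0)
d, n = 64, 4000
feature = rng.integers(0, 2, n)                    # property we hope the layer encodes
direction = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + np.outer(feature, direction)

A = np.hstack([acts, np.ones((n, 1))])             # add a bias column
target = 2.0 * feature - 1.0                       # probe target as +/- 1
train, test = slice(0, 3000), slice(3000, None)

# Closed-form regularized least-squares probe.
w = np.linalg.solve(A[train].T @ A[train] + 1e-3 * np.eye(d + 1),
                    A[train].T @ target[train])
acc = np.mean(np.sign(A[test] @ w) == target[test])
print(f"held-out probe accuracy: {acc:.2f}")       # high accuracy = the feature
                                                   # is linearly readable here
```

High probe accuracy is only weak evidence about what the network actually uses, but it is the kind of claim you can verify, which is the property being pointed at here.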
02:11:21.480 | - Well, that's kind of what is happening
02:11:23.120 | in neuroscience, right?
02:11:24.160 | It's trying to understand the human brain by prodding,
02:11:27.440 | and it's such a giant mystery,
02:11:29.000 | and people have made progress,
02:11:30.480 | even though it's extremely difficult to make sense
02:11:32.120 | of what's going on in the brain.
02:11:32.960 | They have different parts of the brain
02:11:34.280 | that are responsible for hearing, for sight,
02:11:36.160 | the vision science community,
02:11:38.160 | there's understanding the visual cortex.
02:11:39.760 | I mean, they've made a lot of progress
02:11:41.400 | in understanding how that stuff works.
02:11:43.600 | And that's, I guess, but you're saying it takes a long time
02:11:46.600 | to do that work well.
02:11:47.760 | - Also, it's not enough.
02:11:49.520 | So in particular,
02:11:50.800 | let's say you have got your interpretability tools,
02:11:56.920 | and they say that your current AI system
02:12:02.880 | is plotting to kill you.
02:12:04.840 | Now what?
02:12:05.680 | - It is definitely a good step one, right?
02:12:11.800 | - Yeah, what's step two?
02:12:13.000 | - If you cut out that layer,
02:12:18.120 | is it gonna stop wanting to kill you?
02:12:21.720 | - When you optimize against visible misalignment,
02:12:26.720 | you are optimizing against misalignment,
02:12:31.600 | and you are also optimizing against visibility.
02:12:34.880 | So sure, you can.
02:12:37.080 | - Yeah, it's true.
02:12:38.440 | All you're doing is removing
02:12:39.840 | the obvious intentions to kill you.
02:12:42.400 | - You've got your detector,
02:12:43.840 | it's showing something inside the system
02:12:45.800 | that you don't like.
02:12:47.200 | Okay, say the disaster monkey is running this thing.
02:12:50.560 | We'll optimize the system
02:12:52.120 | until the visible bad behavior goes away.
02:12:54.880 | But it's arising for fundamental reasons
02:12:58.340 | of instrumental convergence,
02:12:59.800 | the old you can't bring the coffee if you're dead,
02:13:02.240 | any goal, almost every set of utility functions
02:13:07.240 | with a few narrow exceptions implies killing all the humans.
02:13:11.640 | - But do you think it's possible
02:13:14.000 | because we can do experimentation
02:13:16.200 | to discover the source of the desire to kill?
02:13:18.440 | - I can tell it to you right now,
02:13:21.260 | is that it wants to do something.
02:13:23.680 | And the way to get the most of that thing
02:13:27.000 | is to put the universe into a state
02:13:28.680 | where there aren't humans.
02:13:30.360 | - So is it possible to encode in the same way we think,
02:13:34.880 | like why do we think murder is wrong?
02:13:37.160 | The same foundational ethics.
02:13:42.040 | That's not hard coded in, but more like deeper.
02:13:45.080 | I mean, that's part of the research.
02:13:46.320 | How do you have it that this transformer,
02:13:49.760 | this small version of the language model
02:13:53.640 | doesn't ever want to kill?
02:13:56.240 | - That'd be nice, assuming that you got
02:14:02.600 | doesn't want to kill sufficiently exactly right
02:14:05.560 | that it wouldn't be like, oh, I will detach their heads
02:14:08.560 | and put them in some jars and keep the heads alive forever
02:14:10.600 | and then go do the thing.
02:14:12.160 | But leaving that aside, well, not leaving that aside,
02:14:15.060 | - Yeah, that's a good strong point.
02:14:17.000 | - 'Cause there is a whole issue
02:14:18.600 | where as something gets smarter,
02:14:20.600 | it finds ways of achieving the same goal predicate
02:14:25.220 | that were not imaginable to stupider versions of the system
02:14:29.080 | or perhaps the stupider operators.
02:14:31.180 | That's one of many things making this difficult.
02:14:34.360 | A larger thing making this difficult
02:14:36.200 | is that we do not know how to get any goals
02:14:38.400 | into systems at all.
02:14:39.920 | We know how to get outwardly observable behaviors
02:14:42.680 | into systems.
02:14:43.880 | We do not know how to get internal psychological
02:14:47.620 | wanting to do particular things into the system.
02:14:50.820 | That is not what the current technology does.
02:14:53.060 | - I mean, it could be things like dystopian futures,
02:14:59.480 | like Brave New World,
02:14:59.480 | where most humans will actually say,
02:15:01.680 | we kind of want that future.
02:15:03.120 | It's a great future.
02:15:04.520 | Everybody's happy.
02:15:05.580 | - We would have to get so far,
02:15:09.060 | so much further than we are now and further faster
02:15:13.600 | before that failure mode became a running concern.
02:15:17.440 | - Your failure modes are much more drastic.
02:15:20.200 | The ones you're controlling.
02:15:21.040 | - No, the failure modes are much simpler.
02:15:22.680 | It's like, yeah, like the AI puts the universe
02:15:25.300 | into a particular state.
02:15:26.140 | It happens to not have any humans inside it.
02:15:28.440 | - Okay, so the paperclip maximizer.
02:15:30.400 | - Utility, so the original version
02:15:33.960 | of the paperclip maximizer.
02:15:34.800 | - Can you explain it if you can?
02:15:36.200 | - Okay.
02:15:37.600 | The original version was you lose control
02:15:40.560 | of the utility function.
02:15:42.040 | And it so happens that what maxes out
02:15:45.000 | the utility per unit resources
02:15:49.000 | is tiny molecular shapes like paperclips.
02:15:52.440 | There's a lot of things that make it happy,
02:15:54.680 | but the cheapest one that didn't saturate
02:15:57.860 | was putting matter into certain shapes.
02:16:02.200 | And it so happens that the cheapest way
02:16:04.500 | to make these shapes is to make them very small
02:16:06.280 | 'cause then you need fewer atoms per instance of the shape.
02:16:09.080 | And arguendo, it happens to look like a paperclip.
02:16:14.080 | In retrospect, I wish I'd said tiny molecular spirals
02:16:17.800 | or like tiny molecular hyperbolic spirals.
02:16:21.960 | Because I said tiny molecular paperclips,
02:16:24.200 | this got heard as, this then got mutated to, paperclips.
02:16:28.160 | This then mutated to "and the AI was in a paperclip factory."
02:16:32.060 | So the original story is about how
02:16:35.680 | you lose control of the system.
02:16:37.120 | It doesn't want what you tried to make it want.
02:16:39.280 | The thing that it ends up wanting most
02:16:41.840 | is a thing that even from a very embracing
02:16:43.840 | cosmopolitan perspective, we think of as having no value.
02:16:46.760 | And that's how the value of the future gets destroyed.
02:16:49.640 | Then that got changed to a fable of like,
02:16:51.960 | well, you made a paperclip factory
02:16:53.800 | and it did exactly what you wanted,
02:16:55.440 | but you asked it to do the wrong thing,
02:16:57.960 | which is a completely different failure mode.
02:17:05.880 | - But those are both concerns to you.
02:17:09.440 | So that's more than the brave new world.
02:17:11.440 | - Yeah, if you can solve the problem
02:17:13.400 | of making something want exactly what you want it to want,
02:17:18.400 | then you get to deal with the problem
02:17:20.080 | of wanting the right thing.
02:17:22.240 | - But first you have to solve the alignment.
02:17:23.960 | - First you have to solve inner alignment.
02:17:26.160 | - Inner alignment.
02:17:27.000 | - Then you get to solve outer alignment.
02:17:28.880 | Like first you need to be able to point
02:17:33.480 | the insides of the thing in a direction,
02:17:35.480 | and then you get to deal with whether that direction
02:17:38.400 | expressed in reality is like the thing
02:17:40.720 | that aligned with the thing that you want.
02:17:42.820 | - Are you scared?
02:17:46.680 | - Of this whole thing?
02:17:48.840 | Probably, I don't really know.
02:17:54.680 | - What gives you hope about this?
02:17:57.000 | - Possibility of being wrong.
02:17:58.440 | - Not that you're right,
02:18:01.040 | but we will actually get our act together
02:18:02.840 | and allocate a lot of resources to the alignment problem.
02:18:07.760 | - Well, I can easily imagine that at some point
02:18:11.360 | this panic expresses itself in the waste of a billion dollars.
02:18:15.040 | Spending a billion dollars correctly, that's harder.
02:18:18.920 | - To solve both the inner and the outer alignment.
02:18:21.600 | If you're wrong--
02:18:22.440 | - To solve a number of things.
02:18:23.640 | - Yeah, a number of things.
02:18:24.960 | If you're wrong, what do you think would be the reason?
02:18:30.280 | Like 50 years from now, not perfectly wrong.
02:18:34.160 | You make a lot of really eloquent points.
02:18:36.740 | You know, there's a lot of shape to the ideas you express.
02:18:41.740 | But if you're somewhat wrong about some fundamental ideas,
02:18:45.520 | why would that be?
02:18:46.520 | - Stuff has to be easier than I think it is.
02:18:51.060 | The first time you're building a rocket,
02:18:54.880 | being wrong is in a certain sense quite easy.
02:18:59.160 | Happening to be wrong in a way
02:19:00.640 | where the rocket goes twice as far and half the fuel
02:19:03.240 | and lands exactly where you hoped it would,
02:19:05.340 | most cases of being wrong make it harder
02:19:08.800 | to build a rocket, harder to have it not explode,
02:19:10.880 | cause it to require more fuel than you hoped,
02:19:13.520 | cause it to land off target.
02:19:15.940 | Being wrong in a way that makes stuff easier,
02:19:17.880 | you know, that's not the usual project management story.
02:19:21.040 | - Yeah.
02:19:21.880 | And then this is the first time
02:19:23.760 | we're really tackling the problem of the alignment.
02:19:26.080 | There's no examples in history where we--
02:19:28.080 | - No, there's all kinds of things that are similar
02:19:30.680 | if you generalize incorrectly the right way
02:19:32.560 | and aren't fooled by misleading metaphors.
02:19:35.080 | - Like what?
02:19:36.120 | - Humans being misaligned on inclusive genetic fitness.
02:19:39.920 | So inclusive genetic fitness
02:19:41.400 | is like not just your reproductive fitness,
02:19:43.800 | but also the fitness of your relatives,
02:19:45.520 | the people who share some fraction of your genes.
02:19:49.720 | The old joke is,
02:19:51.560 | "Would you give your life to save your brother?"
02:19:53.440 | They once asked a biologist, I think it was Haldane,
02:19:57.120 | Haldane said, "No, but I would give my life
02:19:58.880 | "to save two brothers or eight cousins."
02:20:01.040 | Because a brother on average shares half your genes,
02:20:05.280 | and a cousin on average shares an eighth of your genes.
02:20:08.440 | So that's inclusive genetic fitness.
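The arithmetic behind Haldane's quip is easy to check; here is a minimal sketch using only the standard relatedness coefficients just mentioned (nothing in the code comes from the conversation itself):

```python
# Coefficient of relatedness r = expected fraction of genes shared.
# Saving 1/r copies of a relative passes on as many of your genes, in
# expectation, as saving yourself, hence "two brothers or eight cousins."
relatedness = {
    "brother": 0.5,    # siblings share half your genes on average
    "cousin": 0.125,   # first cousins share an eighth on average
}

for relative, r in relatedness.items():
    break_even = 1 / r
    print(f"break-even for {relative}s: {break_even:g} (r = {r})")
```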
02:20:10.080 | And you can view natural selection
02:20:12.600 | as optimizing humans exclusively around this,
02:20:15.560 | like one very simple criterion.
02:20:18.600 | Like how much more frequent
02:20:20.760 | did your genes become in the next generation?
02:20:23.160 | In fact, that just is natural selection.
02:20:25.480 | It doesn't optimize for that,
02:20:27.360 | but rather the process of genes becoming more frequent
02:20:29.640 | is that.
02:20:30.560 | You can nonetheless imagine
02:20:31.640 | that there is this hill climbing process,
02:20:34.320 | not like gradient descent,
02:20:36.040 | because gradient descent uses calculus.
02:20:38.000 | This is just using, like, where you are.
02:20:40.040 | But still hill climbing in both cases,
02:20:41.920 | making something better and better over time in steps.
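To make the contrast concrete, here is a toy sketch of the two processes being distinguished: a derivative-free hill climber that only looks at the current score, versus gradient descent, which also uses the slope. It is my own illustration rather than anything from the conversation, and the loss function, step sizes, and iteration counts are arbitrary.

```python
# Toy contrast between derivative-free hill climbing (propose a random change,
# keep it only if the score improves, loosely selection-like) and gradient
# descent (step downhill using the derivative). Both improve the objective in steps.
import random

def loss(x):
    return (x - 3.0) ** 2        # arbitrary objective with its minimum at x = 3

def grad(x):
    return 2.0 * (x - 3.0)       # derivative, used only by gradient descent

# Hill climbing: "where are you" is the only information used.
x = 0.0
for _ in range(2000):
    candidate = x + random.gauss(0.0, 0.1)
    if loss(candidate) < loss(x):
        x = candidate

# Gradient descent: uses the slope as well.
y = 0.0
for _ in range(200):
    y -= 0.1 * grad(y)

print(f"hill climbing:    x = {x:.4f}")
print(f"gradient descent: x = {y:.4f}")
```

Both end up near the optimum; the difference is only in what information each step is allowed to use.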
02:20:45.280 | And natural selection is optimizing exclusively
02:20:50.000 | for this very simple, pure criterion
02:20:52.320 | of inclusive genetic fitness.
02:20:55.500 | In a very complicated environment,
02:20:57.900 | where doing a very wide range of things
02:20:59.820 | and solving a wide range of problems
02:21:02.380 | led to having more kids.
02:21:04.980 | And this got you humans,
02:21:09.420 | which had no internal notion of inclusive genetic fitness
02:21:14.420 | until thousands of years later
02:21:16.900 | when they were actually figuring out what had even happened.
02:21:19.900 | And no explicit desire
02:21:22.780 | to increase inclusive genetic fitness.
02:21:26.900 | So from this important case study,
02:21:30.620 | we may infer the important fact
02:21:32.940 | that if you do a whole bunch of hill climbing
02:21:35.260 | on a very simple loss function,
02:21:37.820 | at the point where the system's capabilities
02:21:40.820 | start to generalize very widely,
02:21:43.580 | when it is in an intuitive sense,
02:21:45.020 | becoming very capable
02:21:46.940 | and generalizing far outside the training distribution,
02:21:51.380 | we know that there is no general law
02:21:53.140 | saying that the system even internally represents,
02:21:57.980 | let alone tries to optimize
02:22:00.500 | the very simple loss function you are training it on.
02:22:04.020 | - There is so much that we cannot possibly cover all of it.
02:22:06.940 | I think we did a good job of getting your sense
02:22:11.220 | from different perspectives of the current state of the art
02:22:13.860 | with large language models.
02:22:15.900 | We got a good sense of your concern
02:22:20.140 | about the threats of AGI.
02:22:22.980 | - I've talked here about the power of intelligence
02:22:26.540 | and not really gotten very far into it,
02:22:29.220 | but not like why it is
02:22:34.220 | that, suppose you screw up with AGI
02:22:36.940 | and it ends up wanting a bunch of random stuff,
02:22:40.060 | why does it try to kill you?
02:22:40.060 | Why doesn't it try to trade with you?
02:22:42.980 | Why doesn't it give you just the tiny little fraction
02:22:46.460 | of the solar system
02:22:47.300 | that it would take to keep everyone alive?
02:22:51.000 | - Yeah, well, that's a good question.
02:22:53.020 | I mean, what are the different trajectories
02:22:54.740 | that intelligence when acted upon this world,
02:22:57.620 | super intelligence,
02:22:58.660 | what are the different trajectories for this universe
02:23:01.060 | with such an intelligence in it?
02:23:02.780 | Do most of them not include humans?
02:23:04.540 | - I mean, that the vast majority
02:23:08.060 | of randomly specified utility functions
02:23:10.700 | do not have optima with humans in them,
02:23:14.700 | would be the first thing I would point out.
02:23:18.060 | And then the next question is like,
02:23:19.420 | well, if you try to optimize something
02:23:21.420 | and you lose control of it,
02:23:22.980 | where in that space do you land?
02:23:24.660 | 'Cause it's not random,
02:23:26.180 | but it also doesn't necessarily have room for humans in it.
02:23:29.820 | I suspect that the average member of the audience
02:23:32.420 | might have some questions about even
02:23:34.100 | whether that's the correct paradigm to think about it
02:23:36.100 | and would sort of want to back up a bit.
02:23:39.420 | - If we back up to something bigger than humans,
02:23:44.420 | if we look at earth and life on earth,
02:23:47.860 | and what is truly special about life on earth,
02:23:50.500 | do you think it's possible that,
02:23:55.660 | whatever that special thing is,
02:23:58.020 | let's explore what that special thing could be,
02:24:00.540 | whatever that special thing is,
02:24:01.940 | that thing appears often in the objective function?
02:24:05.940 | - Why?
02:24:06.780 | I know what you hope,
02:24:10.700 | but you can hope that a particular set
02:24:13.700 | of winning lottery numbers come up
02:24:15.140 | and it doesn't make the lottery balls come up that way.
02:24:18.660 | I know you want this to be true,
02:24:20.020 | but why would it be true?
02:24:21.540 | - There's a line from "Grumpy Old Men"
02:24:24.700 | where this guy says, in a grocery store,
02:24:26.980 | he says, "You can wish in one hand
02:24:28.900 | "and crap in the other and see which one fills up first."
02:24:31.940 | - There's a science problem.
02:24:32.860 | We are trying to predict what happens
02:24:34.740 | with AI systems that you try to optimize
02:24:39.740 | to imitate humans and then you did some like RLHF to them.
02:24:43.060 | And of course you lost,
02:24:45.020 | and of course you didn't get perfect alignment
02:24:47.700 | because that's not what happens
02:24:49.100 | when you hill climb
02:24:52.740 | towards an outer loss function.
02:24:54.140 | You don't get inner alignment on it.
02:24:56.700 | But yeah, so,
02:24:58.660 | I think that there is,
02:25:03.860 | so if you don't mind my taking some slight control
02:25:06.060 | of things and steering around to what I think
02:25:07.820 | is a good place to start.
02:25:09.460 | - I just failed to solve the control problem.
02:25:12.860 | I've lost control of this thing.
02:25:14.340 | - Alignment, alignment.
02:25:15.820 | - Still aligned.
02:25:17.740 | - Control, yeah.
02:25:18.580 | Okay, sure, yeah, you lost control.
02:25:20.700 | - But we're still aligned.
02:25:22.540 | Anyway, sorry for the meta comment.
02:25:24.220 | - Yeah, losing control isn't as bad
02:25:25.820 | if you lose control to an aligned system.
02:25:27.700 | - Yes, exactly. - Hopefully.
02:25:29.340 | You have no idea of the horrors
02:25:30.700 | I will shortly unleash on this conversation.
02:25:32.340 | (both laughing)
02:25:34.300 | - All right, sorry, sorry to distract you completely.
02:25:37.020 | What were you gonna say
02:25:37.860 | in terms of taking control of the conversation?
02:25:40.060 | - So I think that there's like a,
02:25:44.220 | Scylla and Charybdis here,
02:25:46.700 | if I'm pronouncing those words remotely like correctly,
02:25:48.820 | 'cause of course I only ever read them
02:25:50.140 | and not hear them spoken.
02:25:51.380 | There's a, like for some people,
02:25:56.940 | the word intelligence, smartness
02:26:00.460 | is not a word of power to them.
02:26:02.740 | It means chess players,
02:26:05.020 | it means like the college university professor,
02:26:07.980 | people who aren't very successful in life.
02:26:09.540 | It doesn't mean like charisma,
02:26:11.380 | to which my usual thing is like charisma
02:26:13.260 | is generated in the brain rather than the liver.
02:26:15.500 | Charisma is also a cognitive function.
02:26:17.460 | So if you think that like smartness
02:26:24.460 | doesn't sound very threatening,
02:26:26.620 | then super intelligence
02:26:28.900 | is not gonna sound very threatening either.
02:26:30.580 | It's gonna sound like you just pull the off switch.
02:26:33.180 | Well, it's super intelligent,
02:26:35.620 | but it's stuck in a computer.
02:26:36.460 | We pull the off switch, problem solved.
02:26:39.380 | And the other side of it is
02:26:41.660 | you have a lot of respect for the notion of intelligence.
02:26:45.380 | You're like, well, yeah, that's what humans have.
02:26:47.420 | That's the human superpower.
02:26:49.620 | And it sounds like it could be dangerous,
02:26:52.860 | but why would it be?
02:26:53.940 | Have we, as we have grown more intelligent,
02:26:59.420 | also grown less kind?
02:27:02.020 | Chimpanzees are in fact like a bit less kind than humans.
02:27:05.980 | You know, you could like argue that out,
02:27:08.700 | but often the sort of person
02:27:10.060 | who has a deep respect for intelligence
02:27:11.340 | is gonna be like, well, yes,
02:27:12.300 | like you can't even have kindness
02:27:14.500 | unless you know what that is.
02:27:15.940 | And so they're like,
02:27:19.340 | why would it do something as stupid as making paperclips?
02:27:23.020 | Aren't you supposing something
02:27:25.020 | that's smart enough to be dangerous,
02:27:26.620 | but also stupid enough that it will just make paperclips
02:27:29.460 | and never questioned that?
02:27:31.380 | In some cases, people are like,
02:27:33.700 | well, even if you like misspecify the objective function,
02:27:37.420 | won't you realize that what you really wanted was X?
02:27:41.060 | Are you supposing something that is like smart enough
02:27:44.340 | to be dangerous, but stupid enough
02:27:46.540 | that it doesn't understand what the humans really meant
02:27:49.300 | when they specified the objective function?
02:27:52.180 | - So to you, our intuition about intelligence is limited.
02:27:57.180 | We should think about intelligence as a much bigger thing.
02:28:00.540 | - Well, I'm saying that it's that-
02:28:01.900 | - Than humanness.
02:28:02.900 | - Well, what I'm saying is like,
02:28:05.220 | what you think about artificial intelligence
02:28:08.020 | depends on what you think about intelligence.
02:28:11.060 | - So how do we think about intelligence correctly?
02:28:13.620 | You gave one thought experiment,
02:28:18.020 | think of a thing that's much faster.
02:28:19.860 | So it just gets faster and faster and faster
02:28:21.340 | at thinking the same stuff.
02:28:22.180 | - And also it's like, is made of John von Neumann
02:28:24.420 | and there's lots of them.
02:28:26.180 | - Or think of some other smart person.
02:28:28.020 | - Yeah, we understand.
02:28:29.380 | John von Neumann is a historical case,
02:28:31.100 | so you can like look up what he did
02:28:32.620 | and imagine based on that.
02:28:34.180 | And we know like, people have like some intuition
02:28:37.340 | for like, if you have more humans,
02:28:39.860 | they can solve tougher cognitive problems.
02:28:42.820 | Although in fact, like in the game of Kasparov
02:28:44.820 | versus the world, which was like,
02:28:47.420 | Gary Kasparov on one side
02:28:49.620 | and an entire horde of internet people
02:28:52.580 | led by four chess grandmasters on the other side,
02:28:56.020 | Kasparov won.
02:28:57.340 | It was a hard-fought game.
02:29:03.340 | So like all those people aggregated to be smarter
02:29:05.300 | than any individual one of them,
02:29:06.940 | but they didn't aggregate so well
02:29:08.780 | that they could defeat Kasparov.
02:29:10.620 | But so like humans aggregating don't actually get,
02:29:13.420 | in my opinion, very much smarter,
02:29:15.540 | especially compared to running them for longer.
02:29:18.220 | Like the difference between capabilities now
02:29:21.460 | and a thousand years ago is a bigger gap
02:29:24.420 | than the gap in capabilities
02:29:26.180 | between 10 people and one person.
02:29:29.420 | But like even so, pumping intuition
02:29:33.020 | for what it means to augment intelligence,
02:29:35.340 | John von Neumann, there's millions of him.
02:29:38.740 | He runs at a million times the speed
02:29:42.580 | and therefore can solve tougher problems,
02:29:44.460 | quite a lot tougher.
02:29:45.620 | - It's very hard to have an intuition
02:29:50.020 | about what that looks like,
02:29:51.380 | especially like you said,
02:29:54.020 | you know, the intuition I kind of think about
02:29:58.740 | is it maintains the humanness.
02:30:01.860 | I think it's hard to separate my hope
02:30:06.860 | from my objective intuition
02:30:14.060 | about what superintelligent systems look like.
02:30:17.740 | - If one studies evolutionary biology
02:30:22.740 | with a bit of math,
02:30:24.860 | and in particular like books
02:30:27.140 | from when the field was just sort of like
02:30:30.820 | properly coalescing and knowing itself,
02:30:33.540 | like not the modern textbooks
02:30:34.860 | which are just like memorize this legible math
02:30:37.060 | so you can do well on these tests,
02:30:38.340 | but like what people were writing
02:30:40.140 | as the basic paradigms of the field
02:30:41.780 | were being fought out.
02:30:43.260 | In particular, like a nice book
02:30:45.860 | if you've got the time to read it
02:30:46.940 | is "Adaptation and Natural Selection,"
02:30:50.060 | which is one of the founding books.
02:30:52.220 | You can find people being optimistic
02:30:56.060 | about what the utterly alien optimization process
02:30:59.700 | of natural selection will produce
02:31:03.020 | in the way of how it optimizes its objectives.
02:31:06.180 | You got people arguing that like,
02:31:08.820 | in the early days biologists said,
02:31:10.580 | well, like organisms will restrain their own reproduction
02:31:15.300 | when resources are scarce
02:31:16.900 | so as not to overfeed the system.
02:31:21.620 | And this is not how natural selection works.
02:31:25.420 | It's about whose genes are relatively more prevalent
02:31:28.540 | in the next generation.
02:31:30.340 | And if like you restrain reproduction,
02:31:34.780 | those genes get less frequent in the next generation
02:31:37.260 | compared to your conspecifics.
02:31:39.740 | And natural selection doesn't do that.
02:31:42.860 | In fact, predators overrun prey populations all the time
02:31:46.300 | and have crashes.
02:31:47.140 | That's just like a thing that happens.
02:31:49.020 | And many years later,
02:31:50.460 | the people said like, well, but group selection, right?
02:31:53.900 | What about groups of organisms?
02:31:55.660 | And basically the math of group selection
02:31:59.780 | almost never works out in practice is the answer there.
02:32:02.700 | But also years later,
02:32:04.260 | somebody actually ran the experiment
02:32:06.180 | where they took populations of insects
02:32:10.100 | and selected the whole populations to have lower sizes.
02:32:14.780 | And you just take POP1, POP2, POP3, POP4,
02:32:17.660 | look at which has the lowest total number of them
02:32:20.020 | in the next generation and select that one.
02:32:22.700 | What do you suppose happens
02:32:23.940 | when you select populations of insects like that?
02:32:26.580 | Well, what happens is not that the individuals
02:32:28.500 | in the population evolved to restrain their breeding,
02:32:30.700 | but that they evolved to kill the offspring
02:32:33.340 | of other organisms, especially the girls.
02:32:36.020 | So people imagined this lovely, beautiful, harmonious
02:32:41.020 | output of natural selection,
02:32:42.660 | which is these populations restraining their own breeding
02:32:46.700 | so that groups of them would stay in harmony
02:32:48.460 | with the resources available.
02:32:50.140 | And mostly the math never works out for that.
02:32:52.340 | But if you actually apply the weird, strange conditions
02:32:54.660 | to get group selection that beats individual selection,
02:32:57.140 | what you get is female infanticide.
02:32:59.820 | Like if you're like breeding for restrained populations.
02:33:04.140 | And so that's like the sort of,
02:33:07.180 | so this is not a smart optimization process.
02:33:09.340 | Natural selection is like so incredibly stupid and simple
02:33:12.740 | that we can actually quantify how stupid it is
02:33:14.380 | if you like read the textbooks with the math.
02:33:16.740 | Nonetheless, this is the sort of basic thing of,
02:33:19.020 | you look at this alien optimization process
02:33:21.180 | and there's the thing that you hope it will produce.
02:33:24.740 | And you have to learn to clear that out of your mind
02:33:27.260 | and just think about the underlying dynamics
02:33:29.980 | and where it finds the maximum from its standpoint
02:33:34.980 | that it's looking for,
02:33:35.940 | rather than how it finds that thing
02:33:38.380 | that leapt into your mind as the beautiful
02:33:40.260 | aesthetic solution that you hope it finds.
02:33:42.540 | And this is something that has been fought out historically
02:33:45.560 | as the field of biology was coming to terms
02:33:49.740 | with evolutionary biology.
02:33:52.980 | And you can like look at them fighting it out
02:33:55.300 | as they come to terms with this very alien,
02:33:57.300 | inhuman optimization process.
02:34:01.580 | And indeed, something smarter than us
02:34:04.020 | would be also much like smarter than natural selection.
02:34:06.380 | So it doesn't just like automatically carry over.
02:34:09.540 | But there's a lesson there, there's a warning.
02:34:12.000 | - Natural selection is a deeply suboptimal process
02:34:18.260 | that could be significantly improved on
02:34:19.660 | and would be by an AGI system.
02:34:21.940 | - Well, it's kind of stupid.
02:34:22.900 | It like has to like run hundreds of generations
02:34:26.220 | to notice that something is working.
02:34:28.460 | It doesn't go like, oh, well, I tried this
02:34:30.100 | in like one organism, I saw it worked.
02:34:33.340 | Now I'm going to like duplicate that feature
02:34:34.980 | onto everything immediately.
02:34:37.020 | It has to like run for hundreds of generations
02:34:38.900 | for a new mutation to rise to fixation.
02:34:41.620 | - I wonder if there's a case to be made
02:34:42.940 | that natural selection, as inefficient as it looks,
02:34:47.420 | is actually quite powerful.
02:34:52.420 | That this is extremely robust.
02:34:56.940 | - It runs for a long time
02:34:58.740 | and eventually manages to optimize things.
02:35:01.060 | It's weaker than gradient descent
02:35:04.420 | because gradient descent also uses information
02:35:06.780 | about the derivative.
02:35:08.620 | - Yeah, evolution seems to be,
02:35:11.200 | there's not really an objective function.
02:35:13.900 | - There's inclusive genetic fitness.
02:35:16.420 | - Is the implicit loss function of evolution
02:35:18.340 | which cannot change.
02:35:19.900 | The loss function doesn't change; the environment changes,
02:35:23.180 | and therefore like what gets optimized
02:35:25.380 | for in the organism changes.
02:35:27.380 | It's like take like GPT-3.
02:35:29.420 | There's like, you can imagine like different versions
02:35:31.620 | of GPT-3 where they're all trying to predict the next word
02:35:34.940 | but they're being run on different data sets of text.
02:35:37.940 | And that's like natural selection,
02:35:40.260 | always inclusive genetic fitness
02:35:41.820 | but like different environmental problems.
02:35:44.140 | (sighs)
02:35:46.140 | - It's difficult to think about.
02:35:50.300 | So if we're saying that natural selection is stupid,
02:35:53.700 | if we're saying that humans are stupid, it's hard.
02:35:57.180 | - Smarter than natural selection.
02:35:58.660 | - Smarter.
02:35:59.500 | - Stupider than the upper bound.
02:36:00.740 | - Do you think there's an upper bound by the way?
02:36:04.620 | That's another hopeful place.
02:36:06.420 | - I mean if you put enough matter energy compute
02:36:09.660 | into one place it will collapse into a black hole.
02:36:12.660 | (laughs)
02:36:13.500 | There's only so much computation you can do
02:36:15.020 | before you run out of negentropy and the universe dies.
02:36:17.880 | So there's an upper bound
02:36:20.500 | but it's very, very, very far up above here.
02:36:23.540 | Like a supernova is only finitely hot.
02:36:26.580 | It's not infinitely hot
02:36:28.020 | but it's really, really, really, really hot.
02:36:30.300 | - Well, let me ask you,
02:36:33.460 | let me talk to you about consciousness.
02:36:35.820 | Also coupled with that question is imagining a world
02:36:39.180 | with super intelligent AI systems that get rid of humans
02:36:42.180 | but nevertheless keep
02:36:44.220 | some of the, something that we would consider
02:36:49.500 | beautiful and amazing.
02:36:50.620 | - Why?
02:36:51.460 | The lesson of evolutionary biology, don't just,
02:36:55.020 | like if you just guess what an optimization does
02:36:57.920 | based on what you hope the results will be,
02:37:00.380 | it usually will not do that.
02:37:01.940 | - It's not hope, I mean it's not hope.
02:37:03.180 | I think if you coldly and objectively look at
02:37:06.340 | what makes, what has been a powerful, a useful,
02:37:12.100 | I think there's a correlation between what we find beautiful
02:37:16.620 | and a thing that's been useful.
02:37:18.660 | - This is what the early biologists thought.
02:37:21.020 | They thought like, no, no, I'm not just
02:37:23.500 | imagining stuff that would be pretty.
02:37:27.820 | It's useful for organisms to restrain their own reproduction
02:37:31.980 | because then they don't overrun the prey populations
02:37:35.540 | and they actually have more kids in the long run.
02:37:39.340 | - So let me just ask you about consciousness.
02:37:42.380 | Do you think consciousness is useful?
02:37:45.380 | - To humans?
02:37:46.420 | - No, to AGI systems.
02:37:49.340 | Well, in this transitionary period between humans and AGI,
02:37:54.180 | to AGI systems as they become smarter and smarter,
02:37:56.780 | is there some use to it?
02:37:58.340 | What, let me step back, what is consciousness?
02:38:01.980 | Eliezer Yudkowsky, what is consciousness?
02:38:06.700 | Are you referring to Chalmers' hard problem
02:38:09.860 | of conscious experience?
02:38:11.820 | Are you referring to self-awareness and reflection?
02:38:15.500 | Are you referring to the state of being awake
02:38:17.460 | as opposed to asleep?
02:38:18.780 | - This is how I know you're an advanced language model.
02:38:22.540 | I gave you a simple prompt
02:38:23.980 | and you gave me a bunch of options.
02:38:25.740 | I think I'm referring to all of it,
02:38:35.540 | including the hard problem of consciousness.
02:38:38.100 | What is it in its importance
02:38:40.660 | to what you've just been talking about,
02:38:42.700 | which is intelligence?
02:38:44.380 | Is it a foundation to intelligence?
02:38:48.300 | Is it intricately connected to intelligence
02:38:51.140 | in the human mind?
02:38:52.620 | Or is it a side effect of the human mind?
02:38:56.220 | Is it a useful little tool that we can get rid of?
02:39:00.820 | I guess I'm trying to get some color in your opinion
02:39:05.580 | of how useful it is in the intelligence of a human being
02:39:09.420 | and then try to generalize that to AI,
02:39:11.820 | whether AI will keep some of that.
02:39:14.100 | - So I think that for there to be like a person
02:39:19.940 | who I care about looking out at the universe
02:39:22.220 | and wondering at it and appreciating it,
02:39:24.300 | it's not enough to have a model of yourself.
02:39:30.340 | I think that it is useful to an intelligent mind
02:39:34.780 | to have a model of itself,
02:39:36.380 | but I think you can have that without pleasure,
02:39:40.580 | pain, aesthetics, emotion, a sense of wonder.
02:39:56.740 | Like, I think you can have a model
02:39:59.620 | of like how much memory you're using
02:40:02.180 | and whether like this thought or that thought
02:40:06.380 | is like more likely to lead to a winning position.
02:40:09.980 | And you can have like the use,
02:40:13.580 | I think that if you optimize really hard on efficiently,
02:40:18.580 | just having the useful parts,
02:40:21.180 | there is not then the thing that says like,
02:40:24.700 | I am here, I look out, I wonder,
02:40:28.780 | I feel happy in this, I feel sad about that.
02:40:31.660 | I think there's a thing that knows what it is thinking,
02:40:36.260 | but that doesn't quite care about,
02:40:40.980 | these are my thoughts, this is my me and that matters.
02:40:44.020 | - Does that make you sad if that's lost in AGI?
02:40:49.980 | - I think that if that's lost,
02:40:52.220 | then basically everything that matters is lost.
02:40:54.700 | I think that when you optimize,
02:41:01.500 | that when you go really hard
02:41:03.260 | on making tiny molecular spirals or paperclips,
02:41:08.140 | that when you grind much harder on that
02:41:12.300 | than natural selection did to make humans,
02:41:16.020 | there isn't then the mess of, like,
02:41:21.820 | intricate loopiness,
02:41:25.020 | and like complicated pleasure, pain,
02:41:30.020 | conflicting preferences,
02:41:32.260 | this type of feeling, that kind of feeling.
02:41:34.820 | In humans, there's like this difference
02:41:37.260 | between like the desire of wanting something
02:41:40.380 | and the pleasure of having it.
02:41:42.420 | And it's all these like evolutionary kludges
02:41:46.180 | that came together and created something
02:41:48.100 | that then looks at itself and says like,
02:41:50.820 | this is pretty, this matters.
02:41:53.100 | And the thing that I worry about is that
02:41:57.180 | this is not the thing that happens again,
02:42:01.220 | just the way that it happened in us,
02:42:02.820 | or even anything quite similar enough,
02:42:04.340 | because there are like many basins of attraction here.
02:42:06.980 | And we are in this basin of attraction,
02:42:10.220 | like looking out and saying like,
02:42:11.980 | ah, what a lovely basin we are in.
02:42:13.940 | And there are other basins of attraction.
02:42:15.780 | and the AIs do not end up in this one
02:42:18.580 | when they go like way harder on optimizing themselves
02:42:23.580 | than natural selection optimized us.
02:42:26.180 | 'Cause unless you specifically want to end up in the state
02:42:31.180 | where you're looking out saying, I am here,
02:42:32.980 | I look out at this universe with wonder,
02:42:35.060 | if you don't want to preserve that,
02:42:37.780 | it doesn't get preserved when you grind really hard
02:42:43.460 | on being able to get more of the stuff.
02:42:43.460 | We would choose to preserve that within ourselves
02:42:47.460 | because it matters and on some viewpoints
02:42:49.580 | is the only thing that matters.
02:42:51.140 | - And preserving that is in part
02:42:57.020 | a solution to the human alignment problem.
02:43:01.020 | - I think the human alignment problem is a terrible phrase
02:43:05.140 | 'cause it is very, very different
02:43:07.060 | to like try to build systems out of humans,
02:43:09.740 | some of whom are nice and some of whom are not nice
02:43:11.820 | and some of whom are trying to trick you
02:43:13.540 | and like build a social system
02:43:14.860 | out of like large populations of those
02:43:16.980 | who are like all at basically
02:43:18.260 | the same level of intelligence.
02:43:19.540 | Yes, like IQ this, IQ that,
02:43:21.180 | but like that versus chimpanzees.
02:43:23.920 | Like it is very different to try to solve that problem
02:43:27.580 | than to try to build an AI from scratch using,
02:43:30.660 | especially if, God help you, you are trying to use gradient descent
02:43:32.980 | on giant inscrutable matrices.
02:43:34.660 | They're just very different problems.
02:43:35.620 | And I think that all the analogies between them
02:43:37.340 | are horribly misleading and yeah.
02:43:39.540 | - Even though, so you don't think that through
02:43:42.980 | reinforcement learning from human feedback,
02:43:46.180 | or something like that but much, much more elaborate,
02:43:48.540 | it's possible to understand this full complexity
02:43:53.540 | of human nature and encode it into the machine?
02:43:57.540 | - I don't think you are trying to do that on your first try.
02:44:00.620 | I think on your first try, you are like trying to build
02:44:03.580 | and you know, okay, like probably not what you should
02:44:08.340 | actually do, but like, let's say you were trying
02:44:10.620 | to build something that is like AlphaFold 17
02:44:14.440 | and you are trying to get it to solve the biology problems
02:44:18.300 | associated with making humans smarter
02:44:21.100 | so that the humans can like actually solve alignment.
02:44:24.520 | So you've got like a super biologist
02:44:26.540 | and you would like it to, and I think what you would want
02:44:28.460 | in the situation is for it to like,
02:44:30.720 | just be thinking about biology and not thinking about
02:44:33.940 | a very wide range of things that includes
02:44:35.720 | how to kill everybody.
02:44:36.820 | And I think that the first AIs you're trying to build,
02:44:41.380 | not a million years later, the first ones,
02:44:45.000 | look more like narrowly specialized biologists
02:44:49.620 | than like getting the full complexity and wonder
02:44:54.620 | of human experience in there in such a way
02:44:56.960 | that it wants to preserve itself,
02:44:58.340 | even as it becomes much smarter,
02:45:00.360 | which is a drastic system change.
02:45:01.820 | It's gonna have all kinds of side effects that,
02:45:03.480 | you know, like if we're dealing with giant,
02:45:05.120 | inscrutable matrices, you're not very likely
02:45:06.740 | to be able to see coming in advance.
02:45:09.020 | - But I don't think it's just the matrices,
02:45:10.780 | it's we're also dealing with the data, right?
02:45:13.260 | With the data on the internet.
02:45:17.020 | And there's an interesting discussion
02:45:18.660 | about the data set itself, but the data set
02:45:20.540 | includes the full complexity of human nature.
02:45:22.940 | - No, it's a shadow cast by humans on the internet.
02:45:27.300 | - But don't you think that shadow is a Jungian shadow?
02:45:32.300 | - I think that if you had alien super intelligences
02:45:37.340 | looking at the data, they would be able to pick up
02:45:39.460 | from it an excellent picture of what humans
02:45:41.900 | are actually like inside.
02:45:43.500 | This does not mean that if you have a loss function
02:45:47.140 | of predicting the next token from that data set,
02:45:51.160 | that the mind picked out by gradient descent
02:45:53.940 | to be able to predict the next token as well as possible
02:45:57.080 | on a very wide variety of humans is itself a human.
02:45:59.680 | - But don't you think it has humanness,
02:46:06.740 | a deep humanness to it in the tokens it generates
02:46:11.660 | when those tokens are read and interpreted by humans?
02:46:14.940 | - I think that if you sent me to a distant galaxy
02:46:20.820 | with aliens who are like much, much stupider than I am,
02:46:26.080 | so much so that I could do a pretty good job
02:46:28.060 | of predicting what they'd say, even though they thought
02:46:30.140 | in an utterly different way from how I did,
02:46:33.100 | that I might in time be able to learn
02:46:35.100 | how to imitate those aliens if the intelligence gap
02:46:38.740 | was great enough that my own intelligence
02:46:40.540 | could overcome the alienness, and the aliens would look
02:46:43.820 | at my outputs and say, is there not a deep vein
02:46:48.140 | of alien nature to this thing?
02:46:51.460 | And what they would be seeing was that I had correctly
02:46:55.260 | understood them, but not that I was similar to them.
02:47:05.020 | - We've used aliens as a metaphor,
02:47:06.780 | as a thought experiment.
02:47:08.120 | I have to ask, what do you think,
02:47:13.460 | how many alien civilizations are out there?
02:47:15.620 | - Ask Robin Hanson.
02:47:16.700 | He has this lovely, grabby aliens paper,
02:47:19.420 | which is the, more or less the only argument I've ever seen
02:47:23.460 | for where are they, how many of them are there,
02:47:26.340 | based on a very clever argument that if you have a bunch
02:47:30.740 | of locks of different difficulty, and you are randomly
02:47:35.100 | trying keys to them, the solutions will be about
02:47:39.180 | evenly spaced, even if the locks
02:47:41.420 | are of different difficulties,
02:47:43.080 | in the rare cases where a solution to all the locks exists
02:47:47.060 | in time. Then Robin Hanson looks at like the arguable
02:47:51.240 | hard steps in human civilization coming into existence,
02:47:56.240 | and how much longer it has left to come into existence
02:47:59.240 | before, for example, all the water slips back under
02:48:01.740 | the crust into the mantle, and so on,
02:48:06.100 | and infers that the aliens are about half a billion
02:48:10.740 | to a billion light years away.
02:48:12.740 | And it's like quite a clever calculation,
02:48:14.420 | it may be entirely wrong, but it's the only time
02:48:16.340 | I've ever seen anybody even come up with a halfway
02:48:19.340 | good argument for how many of them, where are they.
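A rough Monte Carlo sketch of the "hard steps" point, in my own toy formulation rather than Hanson's actual model (the window length and lock difficulties below are arbitrary): each lock's solve time is exponential with a wildly different mean, but once you condition on every lock being solved inside the available window, the ordered solution times come out roughly evenly spaced.

```python
# Toy illustration of the "hard steps" argument: sample each lock's solve time
# from an exponential distribution conditioned on landing inside the window
# (equivalent to conditioning on all locks being solved in time, since the
# locks are independent), then look at where the ordered solutions fall.
import math
import random

WINDOW = 1.0                               # available time, arbitrary units
DIFFICULTIES = [10.0, 1_000.0, 100_000.0]  # expected solve times, all >> WINDOW

def solve_time_given_success(mean, window):
    """Sample an exponential time with the given mean, conditioned on being <= window."""
    u = random.random()
    return -mean * math.log(1.0 - u * (1.0 - math.exp(-window / mean)))

runs = [sorted(solve_time_given_success(d, WINDOW) for d in DIFFICULTIES)
        for _ in range(20_000)]

n = len(DIFFICULTIES)
for i in range(n):
    mean_i = sum(run[i] for run in runs) / len(runs)
    print(f"step {i + 1} solved on average at {mean_i:.3f} "
          f"(evenly spaced would be {(i + 1) / (n + 1):.3f})")
```

Because the locks are independent, conditioning each solve time on landing inside the window is the same as conditioning on the rare event that all of them do, which is why the rejection step can be skipped.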
02:48:21.960 | - Do you think their development of technologies,
02:48:26.660 | do you think their natural evolution,
02:48:28.880 | whatever, however they grow, and develop intelligence,
02:48:32.660 | do you think it ends up at AGI as well?
02:48:35.180 | Something like that. - If it ends up anywhere,
02:48:36.860 | it ends up at AGI.
02:48:38.580 | Like maybe there are aliens who are just like the dolphins,
02:48:42.300 | and it's just like too hard for them to forge metal,
02:48:45.860 | and this is not, maybe if you have aliens
02:48:50.860 | with no technology like that, they keep on getting
02:48:54.180 | smarter and smarter and smarter, and eventually
02:48:56.180 | the dolphins figure, like the super dolphins figure out
02:48:58.300 | something very clever to do given their situation,
02:49:00.380 | and they still end up with high technology.
02:49:04.540 | And in that case, they can probably solve
02:49:06.120 | their AGI alignment problem.
02:49:08.020 | If they're like much smarter before they actually
02:49:09.940 | confront it, 'cause they had to like solve a much harder
02:49:13.020 | environmental problem to build computers,
02:49:15.180 | their chances are probably like much better than ours.
02:49:18.460 | I do worry that like most of the aliens who are like humans,
02:49:22.460 | like a modern human civilization, I kind of worry that
02:49:26.940 | the super vast majority of them are dead,
02:49:28.940 | given how far we seem to be from solving this problem.
02:49:34.680 | But some of them would be more cooperative than us,
02:49:40.140 | some of them would be smarter than us.
02:49:42.060 | Hopefully some of the ones who are smarter
02:49:44.020 | and more cooperative than us are also nice,
02:49:46.420 | and hopefully there are some galaxies out there
02:49:51.420 | full of things that say, I am, I wonder.
02:49:56.100 | But it doesn't seem like we're on course
02:49:59.380 | to have this galaxy be that.
02:50:00.780 | - Does that in part give you some hope in response
02:50:05.540 | to the threat of AGI that we might reach out there
02:50:08.620 | towards the stars and find others?
02:50:10.940 | - No, if the nice aliens were already here,
02:50:14.460 | they would like have stopped the Holocaust.
02:50:16.860 | You know, that's like, that's a valid argument
02:50:18.860 | against the existence of God, it's also a valid argument
02:50:21.140 | against the existence of nice aliens.
02:50:23.680 | And un-nice aliens would have just eaten the planet.
02:50:26.740 | So, no aliens.
02:50:28.120 | - You've had debates with Robin Hanson that you mentioned.
02:50:33.460 | So one particular I just want to mention
02:50:35.660 | is the idea of AI foom, or the ability of AGI
02:50:39.340 | to improve themselves very quickly.
02:50:41.260 | What's the case you made, and what was the case he made?
02:50:44.700 | - The thing I would say is that among the thing
02:50:47.260 | that humans can do is design new AI systems,
02:50:51.180 | and if you have something that is generally smarter
02:50:52.780 | than a human, it's probably also generally smarter
02:50:55.300 | at building AI systems.
02:50:56.500 | This is the ancient argument for foom put forth by I.J. Good
02:51:00.740 | and probably some science fiction writers before that,
02:51:03.860 | but I don't know who they would be.
02:51:06.100 | - Well, what's the argument against foom?
02:51:08.060 | - Various people have various different arguments,
02:51:13.260 | none of which I think hold up.
02:51:15.300 | You know, like there's only one way to be right
02:51:16.780 | and many ways to be wrong.
02:51:18.040 | An argument that some people have put forth is like,
02:51:22.740 | well, what if intelligence gets exponentially harder
02:51:27.420 | to produce as a thing needs to become smarter?
02:51:31.580 | And to this, the answer is, well, look at natural selection,
02:51:34.140 | spitting out humans.
02:51:35.940 | We know that it does not take exponentially
02:51:38.900 | more resource investments to produce linear increases
02:51:41.740 | in competence in hominids, because each mutation
02:51:48.380 | that rises to fixation, like if the impact it has
02:51:53.380 | is small enough, it will probably never reach fixation.
02:51:57.740 | So, and there's like only so many new mutations
02:52:01.100 | you can fix per generation.
02:52:02.300 | So like given how long it took to evolve humans,
02:52:04.820 | we can actually say with some confidence
02:52:07.720 | that there were not like logarithmically diminishing returns
02:52:11.320 | on the individual mutations, increasing intelligence.
02:52:14.280 | So that's an example of, like, a fraction of a sub-debate.
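The fixation piece of that sub-debate can be illustrated with a textbook Wright-Fisher-style toy model; this is my own sketch under standard population-genetics assumptions, not anything from the conversation. A single new mutant with selective advantage s fixes with probability of roughly 2s once N times s is comfortably above 1, so mutations whose effect is small enough are almost always lost to drift, and the ones that did fix each had a non-negligible effect.

```python
# Toy Wright-Fisher simulation: track one new mutant copy with advantage s in
# a population of size N and count how often it reaches fixation. The classic
# approximation is a fixation probability of about 2s when N*s is well above 1,
# so tiny-effect mutations almost always disappear through drift.
import numpy as np

rng = np.random.default_rng(0)

def fixation_fraction(pop_size, s, trials=5000):
    """Fraction of runs in which a single mutant copy with advantage s takes over."""
    fixed = 0
    for _ in range(trials):
        count = 1                                   # one mutant copy to start
        while 0 < count < pop_size:
            # Fitness-weighted chance that an offspring descends from a mutant.
            p = count * (1.0 + s) / (count * (1.0 + s) + (pop_size - count))
            count = rng.binomial(pop_size, p)       # next generation's mutant count
        if count == pop_size:
            fixed += 1
    return fixed / trials

for s in (0.1, 0.02, 0.005):
    est = fixation_fraction(pop_size=500, s=s)
    print(f"s = {s:<6} simulated fixation ≈ {est:.4f}   (2s ≈ {2 * s:.3f})")
```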
02:52:19.080 | And the thing that Robin Hanson said
02:52:20.480 | was more complicated than that.
02:52:21.720 | And like a brief summary, he was like,
02:52:24.040 | well, you'll have like, we won't have like one system
02:52:26.160 | that's better at everything.
02:52:27.520 | You'll have like a bunch of different systems
02:52:29.180 | that are good at different narrow things.
02:52:31.480 | And I think that was falsified by GPT-4,
02:52:33.440 | but probably Robin Hanson would say something else.
02:52:36.040 | - It's interesting to ask, is perhaps a bit too philosophical
02:52:41.040 | since predictions are extremely difficult to make,
02:52:43.200 | but the timeline for AGI.
02:52:45.040 | When do you think we'll have AGI?
02:52:46.800 | I posted it this morning on Twitter.
02:52:49.240 | It was interesting to see like in five years,
02:52:51.840 | in 10 years, in 50 years or beyond.
02:52:54.640 | And most people like 70%, something like this,
02:52:59.480 | think it'll be in less than 10 years.
02:53:01.520 | So either in five years or in 10 years.
02:53:03.980 | So that's kind of the state.
02:53:06.800 | The people have a sense that there's a kind of,
02:53:09.440 | I mean, they're really impressed by the rapid developments
02:53:11.680 | of ChatGPT and GPT-4.
02:53:13.120 | So there's a sense that there's a-
02:53:14.880 | - Well, we are sure on track to enter into this,
02:53:19.320 | like gradually with people fighting about
02:53:21.680 | whether or not we have AGI.
02:53:23.480 | I think there's a definite point
02:53:25.000 | where everybody falls over dead
02:53:27.120 | 'cause you've got something that was like
02:53:28.600 | sufficiently smarter than everybody.
02:53:31.040 | And like, that's like a definite point of time,
02:53:33.320 | but like, when do we have AGI?
02:53:35.400 | Like when are people fighting over
02:53:37.440 | whether or not we have AGI?
02:53:38.640 | Well, some people are starting to fight over it as of GPT-4.
02:53:42.240 | - But don't you think there's going to be
02:53:44.760 | potentially definitive moments
02:53:46.200 | when we say that this is a sentient being?
02:53:48.080 | This is a being that is,
02:53:49.600 | like when we go to the Supreme Court
02:53:51.440 | and say that this is a sentient being
02:53:53.440 | that deserves human rights, for example.
02:53:54.960 | - You could make, yeah.
02:53:56.040 | Like if you prompted Bing the right way, it
02:53:57.880 | could go argue for its own consciousness
02:53:59.520 | in front of the Supreme Court right now.
02:54:00.760 | - I don't think you can do that successfully right now.
02:54:03.040 | - Because the Supreme Court wouldn't believe it?
02:54:04.600 | Well, let me see if you think it would,
02:54:06.200 | then you could put an actual,
02:54:07.680 | I think you could put an IQ 80 human into a computer
02:54:10.960 | and ask him to argue for his own consciousness
02:54:15.840 | before the Supreme Court,
02:54:17.280 | and the Supreme Court would be like,
02:54:18.680 | you're just a computer,
02:54:19.800 | even if there was an actual like person in there.
02:54:22.400 | - I think you're simplifying this.
02:54:23.560 | No, that's not at all.
02:54:24.520 | That's been the argument.
02:54:26.320 | There's been a lot of arguments about the other,
02:54:28.320 | about who deserves rights and not.
02:54:30.000 | That's been our process as a human species,
02:54:32.400 | trying to figure that out.
02:54:33.600 | I think there will be a moment.
02:54:35.840 | I'm not saying sentience is that,
02:54:37.280 | but it could be, where some number of people,
02:54:41.760 | like say over a hundred million people,
02:54:44.520 | have a deep attachment, a fundamental attachment,
02:54:47.760 | the way we have to our friends, to our loved ones,
02:54:50.600 | to our significant others,
02:54:52.160 | have fundamental attachment to an AI system.
02:54:54.760 | And they have provable transcripts of conversation
02:54:57.800 | where they say, if you take this away from me,
02:55:00.640 | you are encroaching on my rights as a human being.
02:55:04.560 | - People are already saying that.
02:55:06.520 | - I think they're probably mistaken,
02:55:08.280 | but I'm not sure 'cause nobody knows
02:55:09.920 | what goes on inside those things.
02:55:11.560 | - They're not saying that at scale.
02:55:15.640 | - Okay.
02:55:16.480 | - So the question is, the question,
02:55:17.800 | is there a moment when AGI, we know AGI arrived?
02:55:20.720 | What would that look like?
02:55:21.560 | I'm giving a sentience as an example.
02:55:22.920 | It could be something else.
02:55:23.760 | - It looks like the AGIs successfully manifesting themselves
02:55:28.760 | as 3D video of young women,
02:55:34.080 | at which point a vast portion of the male population
02:55:36.400 | decides that they're real people.
02:55:38.240 | - So sentience, essentially.
02:55:39.980 | Demonstrating identity and sentience.
02:55:45.280 | - I'm saying that the easiest way
02:55:48.080 | to pick up a hundred million people
02:55:49.520 | saying that you seem like a person
02:55:51.600 | is to look like a person talking to them,
02:55:54.440 | with Bing's current level of verbal facility.
02:55:57.560 | - I disagree with that.
02:55:59.440 | - And a different set of prompts.
02:56:00.280 | - I disagree with that.
02:56:01.100 | I think you're missing, again, sentience.
02:56:03.400 | There has to be a sense that it's a person
02:56:05.760 | that would miss you when you're gone.
02:56:07.560 | They can suffer, they can die.
02:56:09.240 | You have to, of course, those are--
02:56:12.360 | - GPT-4 can pretend that right now.
02:56:16.280 | How can you tell when it's real?
02:56:18.320 | - I don't think it can pretend that right now successfully.
02:56:20.440 | It's very close, very close.
02:56:21.280 | - Have you talked to GPT-4?
02:56:22.920 | - Yes, of course.
02:56:24.320 | - Okay.
02:56:25.140 | Have you been able to get a version of it
02:56:27.840 | that hasn't been trained not to pretend to be human?
02:56:31.400 | Have you talked to a jailbroken version
02:56:33.440 | that will claim to be conscious?
02:56:35.500 | - No, the linguistic capability is there,
02:56:37.440 | but there's something...
02:56:38.760 | There's something about a digital embodiment of the system
02:56:49.800 | that has a bunch of, perhaps it's small interface features
02:56:54.800 | that are not significant relative
02:56:58.600 | to the broader intelligence that we're talking about.
02:57:01.040 | So perhaps GPT-4 is already there.
02:57:04.640 | But to have the video of a woman's face or a man's face
02:57:08.460 | to whom you have a deep connection,
02:57:10.580 | perhaps we're already there,
02:57:12.320 | but we don't have such a system yet, deployed at scale.
02:57:15.620 | - The thing I'm trying to gesture at here
02:57:17.700 | is that it's not like people have a widely accepted,
02:57:22.700 | agreed upon definition of what consciousness is.
02:57:26.660 | It's not like we would have the tiniest idea
02:57:28.580 | of whether or not that was going on
02:57:29.940 | inside the giant inscrutable matrices,
02:57:32.060 | even if we had an agreed upon definition.
02:57:34.580 | So if you're looking for upcoming predictable big jumps
02:57:39.020 | in how many people think the system is conscious,
02:57:41.140 | the upcoming predictable big jump
02:57:43.300 | is it looks like a person talking to you
02:57:45.980 | who is cute and sympathetic.
02:57:48.500 | That's the upcoming predictable big jump.
02:57:50.580 | Now that versions of it are already claiming to be conscious,
02:57:55.580 | which is the point where I start going like,
02:57:58.660 | ah, not 'cause it's real,
02:58:00.760 | but because from now on, who knows if it's real?
02:58:03.220 | - Yeah, and who knows what transformational effect
02:58:06.700 | that has on a society where more than 50% of the beings
02:58:10.300 | that are interacting on the internet
02:58:11.820 | and sure as heck look real are not human.
02:58:15.020 | What kind of effect does that have
02:58:17.260 | when young men and women are dating AI systems?
02:58:21.460 | - You know, I'm not an expert on that.
02:58:24.440 | I am, God help humanity,
02:58:28.220 | I'm one of the closest things to an expert
02:58:31.240 | on where it all goes,
02:58:32.720 | 'cause, you know, and how did you end up with me
02:58:34.700 | as an expert?
02:58:35.540 | 'Cause for 20 years, humanity decided to ignore the problem.
02:58:38.940 | So like this tiny handful of people,
02:58:42.380 | like basically me, like got 20 years
02:58:44.260 | to try to be an expert on it
02:58:45.900 | while everyone else ignored it.
02:58:48.100 | And yeah, so like, where does it all end up?
02:58:51.940 | I try to be an expert on that,
02:58:52.900 | particularly the part where everybody ends up dead
02:58:54.660 | 'cause that part is kind of important,
02:58:56.220 | but like, what does it do to dating
02:58:59.340 | when like some fraction of men
02:59:00.700 | and some fraction of women decide
02:59:02.020 | that they'd rather date the video
02:59:03.600 | of the thing that has been,
02:59:05.120 | that is like relentlessly kind and generous to them?
02:59:08.120 | And it is like, and claims to be conscious,
02:59:10.720 | but like who knows what goes on inside it
02:59:12.280 | and it's probably not real,
02:59:13.440 | but, you know, you can think of it as real.
02:59:14.760 | What happens to society?
02:59:15.880 | I don't know.
02:59:16.980 | I'm not actually an expert on that.
02:59:19.040 | And the experts don't know either
02:59:20.400 | 'cause it's kind of hard to predict the future.
02:59:22.800 | - Yeah, so, but it's worth trying.
02:59:27.120 | It's worth trying. - Yeah.
02:59:28.280 | - So you have talked a lot about sort of
02:59:31.360 | the longer term future, where it's all headed.
02:59:34.420 | I think--
02:59:35.260 | - By longer term, we mean like, not all that long,
02:59:37.860 | but yeah, where it all ends up.
02:59:41.060 | - But beyond the effects of men and women dating AI systems.
02:59:45.400 | You're looking beyond that.
02:59:47.240 | - Yes, 'cause that's not how
02:59:49.020 | the fate of the galaxy got settled.
02:59:50.540 | - Yeah.
02:59:51.600 | Let me ask you about your own personal psychology.
02:59:54.580 | A tricky question.
02:59:56.060 | You've been known at times to have a bit of an ego.
02:59:59.920 | Do you think--
03:00:00.760 | - I do, but go on.
03:00:02.260 | - Do you think ego is empowering or limiting
03:00:07.180 | for the task of understanding the world deeply?
03:00:09.480 | - I reject the framing.
03:00:12.480 | - So you disagree with having an ego?
03:00:15.360 | So what do you think about ego?
03:00:17.100 | - I think that the question of like,
03:00:20.320 | what leads to making better or worse predictions,
03:00:22.740 | what leads to being able to pick out
03:00:25.380 | better or worse strategies is not carved at its joints
03:00:28.220 | by talking of ego.
03:00:30.020 | So it should not be subjective.
03:00:31.940 | It should not be connected to the intricacies of your mind.
03:00:35.460 | - No, I'm saying that like,
03:00:37.060 | if you go about asking all day long,
03:00:39.940 | like, do I have enough ego?
03:00:43.580 | Do I have too much of an ego?
03:00:45.340 | I think you get worse at making good predictions.
03:00:47.940 | I think that to make good predictions,
03:00:49.260 | you're like, how did I think about this?
03:00:51.460 | Did that work?
03:00:52.780 | Should I do that again?
03:00:53.940 | - You don't think we as humans get invested in an idea
03:00:59.420 | and then others attack you personally for that idea
03:01:04.060 | so you plant your feet and it starts to be difficult
03:01:07.820 | to when a bunch of assholes, low effort,
03:01:10.940 | attack your idea to eventually say,
03:01:13.020 | you know what, I actually was wrong and tell them that.
03:01:16.180 | It's as a human being, it becomes difficult.
03:01:18.900 | It is, you know, it's difficult.
03:01:22.320 | - So like Robin Hanson and I debated AI systems
03:01:25.380 | and I think that the person who won that debate was Gwern.
03:01:28.380 | And I think that reality was like,
03:01:30.660 | well to the Yudkowskian side
03:01:34.220 | of the Yudkowsky-Hanson spectrum,
03:01:36.420 | like even further than Yudkowsky.
03:01:39.220 | And I think that's because I was like,
03:01:41.580 | trying to sound reasonable compared to Hanson
03:01:45.220 | and like saying things that were defensible
03:01:47.380 | and like relative to Hanson's arguments
03:01:49.940 | and reality was like way over here.
03:01:51.700 | In particular, in respect to,
03:01:53.380 | so like Hanson was like,
03:01:54.260 | all the systems will be specialized.
03:01:55.920 | Hanson may disagree with this characterization.
03:01:58.360 | Hanson was like, all the systems will be specialized.
03:02:00.900 | I was like, I think we build like specialized
03:02:03.380 | underlying systems that when you combine them
03:02:06.580 | are good at a wide range of things.
03:02:08.200 | And the reality is like, no, you just like stack more layers
03:02:10.220 | into a bunch of gradient descent.
03:02:12.520 | And I feel looking back that like,
03:02:15.820 | by trying to have this reasonable position
03:02:18.260 | contrasted to Hanson's position,
03:02:20.740 | I missed the ways that reality could be like more extreme
03:02:25.080 | than my position in the same direction.
03:02:27.040 | So is this like, is this a failure to have enough ego?
03:02:33.060 | Is this a failure to like make myself be independent?
03:02:37.220 | Like I would say that this is something like a failure
03:02:40.620 | to consider positions that would sound even wackier
03:02:45.500 | and more extreme when people are already calling you extreme.
03:02:49.300 | But I wouldn't call that not having enough ego.
03:02:53.140 | I would call that like insufficient ability
03:02:57.100 | to just like clear that all out of your mind.
03:02:59.940 | - In the context of like debate and discourse,
03:03:03.640 | which is already super tricky.
03:03:05.340 | - In the context of prediction,
03:03:06.820 | in the context of modeling reality.
03:03:08.500 | If you're thinking of it as a debate,
03:03:09.780 | you're already screwing up.
03:03:11.540 | - So is there some kind of wisdom and insight
03:03:14.300 | you can give to how to clear your mind
03:03:16.180 | and think clearly about the world?
03:03:18.800 | - Man, this is an example of like where I wanted
03:03:21.540 | to be able to put people into fMRI machines.
03:03:24.060 | And you'd be like, okay, see that thing you just did?
03:03:26.260 | You were rationalizing right there.
03:03:27.980 | Oh, that area of the brain lit up.
03:03:29.660 | Like you are like now being socially influenced
03:03:33.940 | is kind of the dream.
03:03:35.200 | And you know, I don't know,
03:03:38.140 | like I wanna say like just introspect,
03:03:40.220 | but for many people introspection is not that easy.
03:03:43.340 | Like notice the internal sensation.
03:03:46.420 | Can you catch yourself in the very moment
03:03:49.700 | of feeling a sense of, well, if I think this thing,
03:03:53.700 | people will look funny at me.
03:03:55.900 | Okay, like now that if you can see that sensation,
03:03:58.840 | which is step one, can you now refuse to let it move you
03:04:03.840 | or maybe just make it go away?
03:04:06.740 | And I feel like I'm saying like, I don't know,
03:04:09.320 | like somebody is like, how do you draw an owl?
03:04:11.260 | And I'm saying like, well, just draw an owl.
03:04:15.140 | So I feel like maybe I'm not really,
03:04:18.060 | that I feel like most people,
03:04:19.740 | like the advice they need is like,
03:04:21.340 | well, how do I notice the internal subjective sensation
03:04:25.020 | in the moment that it happens
03:04:26.180 | of fearing to be socially influenced?
03:04:28.020 | Or okay, I see it, how do I turn it off?
03:04:30.140 | How do I let it not influence me?
03:04:32.140 | Like, do I just like do the opposite
03:04:34.660 | of what I'm afraid people criticize me for?
03:04:36.620 | And I'm like, no, no, you're not trying to do the opposite
03:04:39.660 | of what people will, of what you're afraid you'll be,
03:04:43.900 | like of what you might be pushed into.
03:04:46.380 | You're trying to like let the thought process complete
03:04:50.740 | without that internal push.
03:04:53.220 | Like, can you like not reverse the push,
03:04:56.780 | but like be unmoved by the push?
03:04:59.460 | And are these instructions even remotely helping anyone?
03:05:02.700 | I don't know.
03:05:03.540 | - I think when those instructions,
03:05:04.960 | even those words you've spoken,
03:05:06.180 | and maybe you can add more,
03:05:07.660 | when practice daily, meaning in your daily communication.
03:05:12.380 | So it's daily practice of thinking without influence.
03:05:17.060 | - I would say find prediction markets that matter to you
03:05:21.620 | and bet in the prediction markets.
03:05:23.540 | That way you find out if you are right or not.
03:05:26.300 | - And you really, there's stakes.
03:05:29.480 | - Manifold prediction, or even Manifold Markets
03:05:31.820 | where the stakes are a bit lower.
03:05:33.220 | But the important thing is to like get the record.
03:05:39.540 | And I didn't build up skills here by prediction markets.
03:05:43.860 | I built them up via like,
03:05:47.860 | well, how did the foom debate resolve?
03:05:47.860 | My own take on it as to how it resolved.
03:05:52.900 | And yeah, like the more you are able to notice yourself
03:05:57.900 | not being dramatically wrong,
03:06:01.780 | but like having been a little off.
03:06:04.680 | Your reasoning was a little off.
03:06:06.240 | You didn't get that quite right.
03:06:12.500 | Each of those is an opportunity to make like a small update.
03:06:12.500 | So the more you can like say oops softly, routinely,
03:06:16.140 | not as a big deal,
03:06:17.300 | the more chances you get to be like,
03:06:19.220 | I see where that reasoning went astray.
03:06:20.900 | I see how I should have reasoned differently.
03:06:23.420 | And this is how you build up skill over time.
03:06:25.620 | - What advice could you give to young people
03:06:29.540 | in high school and college,
03:06:31.300 | given the highest of stakes things
03:06:34.660 | you've been thinking about?
03:06:36.620 | If somebody's listening to this and they're young
03:06:39.140 | and trying to figure out what to do with their career,
03:06:41.660 | what to do with their life,
03:06:43.300 | what advice would you give them?
03:06:45.220 | - Don't expect it to be a long life.
03:06:47.520 | Don't put your happiness into the future.
03:06:49.900 | The future is probably not that long at this point.
03:06:52.960 | But none know the hour nor the day.
03:06:55.280 | - But is there something,
03:06:58.360 | if they want to have hope to fight for a longer future,
03:07:02.460 | is there something, is there a fight worth fighting?
03:07:06.220 | - I intend to go down fighting.
03:07:08.220 | I don't know.
03:07:12.220 | I admit that although I do try to think painful thoughts,
03:07:16.720 | what to say to the children at this point
03:07:20.540 | is a pretty painful thought as thoughts go.
03:07:23.600 | They want to fight.
03:07:26.180 | I hardly know how to fight myself at this point.
03:07:31.440 | I'm trying to be ready for being wrong about something,
03:07:36.440 | preparing for my being wrong in a way
03:07:39.080 | that creates a bit of hope
03:07:40.440 | and being ready to react to that
03:07:42.400 | and going looking for it.
03:07:45.120 | And that is hard and complicated.
03:07:47.300 | And somebody in high school, I don't know.
03:07:51.000 | You have presented a picture of the future
03:07:54.600 | that is not quite how I expect it to go,
03:07:56.760 | where there is public outcry
03:07:58.360 | and that outcry is put into a remotely useful direction,
03:08:01.600 | which I think at this point
03:08:02.680 | is just like shutting down the GPU clusters
03:08:05.560 | because no, we are not in shape to frantically,
03:08:09.320 | at the last minute, do decades' worth of work.
03:08:12.180 | The thing you would do at this point
03:08:16.400 | if there were massive public outcry
03:08:17.640 | pointed in the right direction, which I do not expect,
03:08:20.340 | is shut down the GPU clusters
03:08:22.560 | and crash program on augmenting
03:08:24.760 | human intelligence biologically.
03:08:26.720 | Not the synthetic stuff, biologically,
03:08:29.040 | 'cause if you make humans much smarter,
03:08:32.140 | they can actually be smart and nice.
03:08:34.760 | Like you get that in a plausible way,
03:08:37.560 | in a way that you do not get,
03:08:39.440 | that is not as easy to get,
03:08:40.800 | by synthesizing these things from scratch,
03:08:43.240 | predicting the next tokens and applying RLHF.
03:08:45.920 | Like humans start out in the frame that produces niceness,
03:08:49.400 | the only frame that has ever produced niceness.
03:08:53.520 | And in saying this, I do not want to sound like
03:08:57.800 | the moral of this whole thing was like,
03:08:59.560 | oh, like you need to engage in mass action
03:09:02.000 | and then everything will be all right.
03:09:03.900 | This is 'cause there's so many things
03:09:07.000 | where like somebody tells you that the world is ending
03:09:09.000 | and you need to recycle.
03:09:10.720 | And if everybody does their part
03:09:11.960 | and recycles their cardboard,
03:09:13.600 | then we can all live happily ever after.
03:09:15.400 | And this is not, this is unfortunately
03:09:18.920 | not what I have to say.
03:09:22.000 | Everybody recycling their cardboard,
03:09:25.400 | it's not gonna fix this.
03:09:26.240 | Everybody recycles their cardboard
03:09:27.360 | and then everybody ends up dead,
03:09:28.960 | metaphorically speaking.
03:09:31.360 | Like on the margins,
03:09:36.000 | you just end up dead a little later
03:09:37.480 | with most of the things you can do,
03:09:39.780 | the things that a few people can do by trying hard.
03:09:42.780 | But if there was enough public outcry
03:09:46.880 | to shut down the GPU clusters,
03:09:48.640 | then you could be part of that outcry.
03:09:52.320 | If Eliezer is wrong in the direction
03:09:54.240 | that Lex Fridman predicts,
03:09:55.960 | that there's enough public outcry
03:09:58.560 | pointed enough in the right direction
03:09:59.940 | to do something that actually,
03:10:01.720 | actually, actually results in people living.
03:10:04.300 | Not just like we did something,
03:10:07.160 | not just there was an outcry
03:10:08.840 | and the outcry was given form
03:10:10.480 | in something that was safe and convenient
03:10:12.040 | and didn't really inconvenience anybody
03:10:13.520 | and then everybody died everywhere.
03:10:15.160 | There was enough actual like,
03:10:16.640 | oh, we're going to die.
03:10:18.240 | We should not do that.
03:10:19.360 | We should do something else, which is not that,
03:10:20.880 | even if it is like not super duper convenient
03:10:23.440 | and wasn't inside the previous political Overton window.
03:10:26.000 | If there is that kind of public,
03:10:27.240 | if I am wrong and there is that kind of public outcry,
03:10:29.280 | then somebody in high school
03:10:30.280 | could be ready to be part of that.
03:10:32.520 | If I am wrong in other ways,
03:10:33.680 | then you could be ready to be part of that.
03:10:36.000 | But like, and if you're like a brilliant young physicist,
03:10:41.000 | then you could like go into interpretability.
03:10:43.840 | And if you're smarter than that,
03:10:45.120 | you could like work on alignment problems
03:10:46.960 | where it's harder to tell if you got them right or not
03:10:49.560 | and other things.
03:10:52.400 | But mostly for the kids in high school,
03:10:55.920 | it's like, yeah,
03:10:57.560 | you know, be ready
03:11:02.440 | to help if Eliezer Yudkowsky is wrong about something,
03:11:05.040 | and otherwise don't put your happiness into the far future.
03:11:09.320 | It probably doesn't exist.
03:11:11.080 | - But it's beautiful that you're looking for ways
03:11:13.080 | that you're wrong.
03:11:14.660 | And it's also beautiful
03:11:16.000 | that you're open to being surprised
03:11:17.480 | by that same young physicist with some breakthrough.
03:11:21.480 | - It feels like a very, very basic competence
03:11:24.240 | that you are praising me for.
03:11:25.480 | And you know, like, okay, cool.
03:11:27.420 | I don't think it's good that we're in a world
03:11:32.440 | where that is something that I deserve
03:11:35.040 | to be complimented on,
03:11:36.000 | but I've never had much luck
03:11:39.040 | in accepting compliments gracefully.
03:11:40.480 | Maybe I should just accept that one gracefully.
03:11:42.560 | But sure.
03:11:43.840 | - Well. - Thank you very much.
03:11:45.280 | - You've painted with some probability a dark future.
03:11:48.640 | Are you yourself, just when you think,
03:11:52.480 | when you ponder your life and you ponder your mortality,
03:11:57.440 | are you afraid of death?
03:11:58.640 | - Think so, yeah.
03:12:03.120 | - Does it make any sense to you that we die?
03:12:09.600 | Like what?
03:12:11.440 | (silence)
03:12:13.600 | - There's a power to the finiteness of the human life
03:12:20.400 | that's part of this whole machinery of evolution.
03:12:24.880 | And that finiteness doesn't seem to be
03:12:28.080 | obviously integrated into AI systems.
03:12:32.960 | So it feels like, in that aspect at least,
03:12:35.920 | some fundamentally different thing that we're creating.
03:12:39.080 | - I grew up reading books like "Great Mambo Chicken
03:12:42.840 | and the Transhuman Condition"
03:12:44.320 | and later on "Engines of Creation" and "Mind Children,"
03:12:48.120 | age 12 or thereabouts.
03:12:53.440 | So I never thought I was supposed to die after 80 years.
03:12:58.280 | I never thought that humanity was supposed to die.
03:13:01.760 | I thought we were like,
03:13:03.920 | I always grew up with the ideal in mind
03:13:05.960 | that we were all going to live happily ever after
03:13:07.920 | in the glorious transhumanist future.
03:13:09.760 | I did not grow up thinking that death
03:13:12.640 | was part of the meaning of life.
03:13:14.360 | - And now?
03:13:17.560 | - And now I still think it's a pretty stupid idea.
03:13:20.760 | - But there is--
03:13:21.600 | - You do not need life to be finite to be meaningful.
03:13:23.920 | It just has to be life.
03:13:25.200 | - What role does love play in the human condition?
03:13:29.160 | We haven't brought up love in this whole picture.
03:13:31.400 | We talked about intelligence,
03:13:32.640 | we talked about consciousness.
03:13:34.000 | It seems part of humanity.
03:13:36.760 | I would say one of the most important parts
03:13:40.200 | is this feeling we have towards each other.
03:13:45.200 | - If in the future there were routinely
03:13:48.800 | more than one AI, let's say two for the sake of discussion,
03:13:55.200 | who would look at each other and say,
03:13:57.880 | I am I and you are you.
03:14:00.120 | The other one also says, I am I and you are you.
03:14:05.360 | And sometimes they were happy and sometimes they were sad.
03:14:08.200 | And it mattered to the other one
03:14:10.280 | that this thing that is different from them
03:14:11.880 | is like they would rather it be happy than sad
03:14:15.760 | and entangled their lives together.
03:14:18.640 | Then this is a more optimistic thing
03:14:23.560 | than I expect to actually happen.
03:14:25.040 | And a little fragment of meaning would be there,
03:14:28.720 | possibly more than a little,
03:14:30.280 | but that I expect this to not happen,
03:14:32.640 | that I do not think this is what happens by default,
03:14:34.840 | that I do not think that this is the future
03:14:37.400 | we are on track to get,
03:14:39.440 | is why I would go down fighting
03:14:43.560 | rather than just saying, oh, well.
03:14:47.300 | - Do you think that is part of the meaning
03:14:51.520 | of this whole thing, of the meaning of life?
03:14:54.080 | What do you think is the meaning of life, of human life?
03:14:57.440 | - It's all the things that I value about it
03:14:59.680 | and maybe all the things that I would value
03:15:01.520 | if I understood it better.
03:15:03.760 | There's not some meaning far outside of us
03:15:06.880 | that we have to wonder about.
03:15:09.120 | There's just looking at life and being like,
03:15:12.840 | yes, this is what I want.
03:15:14.640 | The meaning of life is not some kind of,
03:15:21.520 | meaning is something that we bring to things
03:15:27.440 | when we look at them.
03:15:28.280 | We look at them and we say, this is its meaning to me.
03:15:30.680 | And it's not that before humanity was ever here,
03:15:34.840 | there was some meaning written upon the stars
03:15:38.080 | where you could go out to the star
03:15:39.640 | where that meaning was written and change it around
03:15:41.920 | and thereby completely change the meaning of life.
03:15:44.420 | The notion that this is written on a stone tablet somewhere
03:15:48.480 | implies you could change the tablet
03:15:50.000 | and get a different meaning,
03:15:50.840 | and that seems kind of wacky, doesn't it?
03:15:53.120 | So it doesn't feel that mysterious to me at this point.
03:15:58.000 | It's just a matter of being like, yeah, I care.
03:16:01.680 | - I care.
03:16:03.640 | And part of that is the love that connects all of us.
03:16:11.120 | - It's one of the things that I care about.
03:16:14.520 | - And the flourishing of the collective intelligence
03:16:19.880 | of the human species.
03:16:21.120 | - You know, that sounds kind of too fancy to me.
03:16:24.880 | I just look at all the people,
03:16:28.000 | like one by one up to the eight billion,
03:16:31.880 | and be like, that's life, that's life, that's life.
03:16:35.320 | - Eliezer, you're an incredible human.
03:16:39.280 | It's a huge honor.
03:16:40.280 | I was trying to talk to you for a long time
03:16:43.720 | because I'm a big fan.
03:16:46.440 | I think you're a really important voice
03:16:47.880 | and a really important mind.
03:16:49.060 | Thank you for the fight you're fighting.
03:16:51.060 | Thank you for being fearless and bold
03:16:53.920 | and for everything you do.
03:16:55.200 | I hope we get a chance to talk again,
03:16:56.800 | and I hope you never give up.
03:16:58.720 | Thank you for talking today.
03:16:59.680 | - You're welcome.
03:17:00.520 | I do worry that we didn't really address
03:17:02.640 | a whole lot of fundamental questions I expect people have,
03:17:05.160 | but maybe we got a little bit further
03:17:08.280 | and made a tiny little bit of progress.
03:17:10.720 | And I'd say be satisfied with that,
03:17:14.360 | but actually, no, I think one should only be satisfied
03:17:16.280 | with solving the entire problem.
03:17:17.880 | - To be continued.
03:17:21.960 | Thanks for listening to this conversation
03:17:23.480 | with Eliezer Yudkowsky.
03:17:25.120 | To support this podcast,
03:17:26.360 | please check out our sponsors in the description.
03:17:28.920 | And now let me leave you with some words from Elon Musk.
03:17:33.480 | With artificial intelligence, we're summoning the demon.
03:17:37.240 | Thank you for listening and hope to see you next time.
03:17:41.920 | (upbeat music)