Eliezer Yudkowsky: Dangers of AI and the End of Human Civilization | Lex Fridman Podcast #368
Chapters
0:00 Introduction
0:43 GPT-4
23:23 Open sourcing GPT-4
39:41 Defining AGI
47:38 AGI alignment
90:30 How AGI may kill us
142:51 Superintelligence
150:03 Evolution
156:33 Consciousness
167:04 Aliens
172:35 AGI Timeline
180:35 Ego
186:27 Advice for young people
191:45 Mortality
193:26 Love
00:00:02.500 |
to try and try again and observe that we were wrong 00:00:05.920 |
and realize that the entire thing is going to be 00:00:07.760 |
way more difficult than realized at the start. 00:00:10.560 |
Because the first time you fail at aligning something 00:00:15.180 |
- The following is a conversation with Eliezer Yudkowsky, 00:00:20.800 |
a legendary researcher, writer, and philosopher 00:00:37.920 |
And now, dear friends, here's Eliezer Yudkowsky. 00:00:47.120 |
- It is a bit smarter than I thought this technology 00:00:50.560 |
And I'm a bit worried about what the next one will be like. 00:01:00.600 |
'cause, you know, it'd suck to be stuck inside there. 00:01:03.400 |
But we don't even know the architecture at this point 00:01:08.280 |
'cause OpenAI is very properly not telling us. 00:01:19.360 |
All we have to go by are the external metrics. 00:01:23.880 |
if you ask it to write a self-aware FORTRAN green text, 00:01:41.120 |
not quite what's going on in there in reality, 00:01:52.100 |
Like we are past the point where in science fiction, 00:01:57.800 |
that thing's alive, what are you doing to it? 00:02:09.320 |
We don't have any lines to draw on the sand and say like, 00:02:14.100 |
we will start to worry about what's inside there. 00:02:18.460 |
So if it were up to me, I would be like, okay, 00:02:21.560 |
like this far, no further, time for the summer of AI 00:02:26.080 |
where we have planted our seeds and now we like wait 00:02:32.560 |
and don't do any larger training runs than that. 00:02:35.240 |
Which to be clear, I realize requires more than one company 00:02:39.760 |
- And take a rigorous approach for the whole AI community 00:02:45.840 |
to investigate whether there's somebody inside there. 00:02:52.600 |
Like having any idea of what's going on in there, 00:03:00.500 |
but I feel like it's also a technical statement, 00:03:02.960 |
or I hope it is one day, which is a technical statement 00:03:06.520 |
that Alan Turing tried to come up with with the Turing test. 00:03:13.920 |
or approximately figure out if there is somebody in there? 00:03:39.520 |
Like should we be worried about how we're treating it? 00:03:42.240 |
And then there's questions like how smart is it exactly? 00:03:48.640 |
And we can check how it can do X and how it can do Y. 00:03:51.600 |
Unfortunately, we've gone and exposed this model 00:03:57.440 |
of people discussing consciousness on the internet, 00:04:00.440 |
which means that when it talks about being self-aware, 00:04:02.920 |
we don't know to what extent it is repeating back 00:04:14.640 |
such that it would start to say similar things spontaneously. 00:04:21.320 |
if one were at all serious about trying to figure this out 00:04:26.160 |
is train GPT-3 to detect conversations about consciousness, 00:04:35.000 |
and then retrain something around the rough size 00:04:51.680 |
We could like talk about what we do all the time, 00:04:54.160 |
like what we're thinking at the moment all the time. 00:05:02.160 |
And then try to interrogate that model and see what it says. 00:05:11.440 |
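A minimal sketch of the filtering step being described, in Python. The names here (`classifier.score`, `train_model`, `raw_corpus`) are hypothetical stand-ins for whatever tooling one would actually use, not anything that exists today.

```python
def filter_corpus(documents, classifier, threshold=0.5):
    """Keep only documents the classifier judges are NOT discussing
    consciousness or self-awareness."""
    kept = []
    for doc in documents:
        # classifier.score is a hypothetical API returning a probability
        # that the document discusses consciousness.
        if classifier.score(doc) < threshold:
            kept.append(doc)
    return kept

# filtered = filter_corpus(raw_corpus, consciousness_classifier)
# model = train_model(filtered)                 # retrain at roughly the same scale
# print(model.generate("Are you self-aware?"))  # then interrogate the result
```

If a model trained that way still spontaneously talked about being self-aware, that would be much harder to dismiss as parroting back its training data.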
I feel like when you run over the science fiction guard rails 00:05:17.680 |
Maybe not this thing, but like what about GPT-5? 00:05:30.040 |
to even just removing consciousness from the dataset. 00:05:43.600 |
So the hard problem seems to be very well integrated 00:05:47.360 |
with the actual surface level illusion of consciousness. 00:05:53.280 |
I mean, do you think there's a case to be made 00:05:58.680 |
are just like GPT that we're training on human data 00:06:01.040 |
on how to display emotion versus feel emotion? 00:06:09.780 |
that I'm worried, that I'm lonely and I missed you 00:06:39.040 |
- So I think you're gonna have some difficulty 00:06:41.260 |
removing all mention of emotions from GPT's dataset. 00:06:58.180 |
even if you don't tell them about those emotions 00:07:02.560 |
It's not quite exactly what various blank slatists 00:07:09.320 |
tried to do with the new Soviet man and all that, 00:07:12.000 |
but if you try to raise people perfectly altruistic, 00:07:28.320 |
of where the brain structures are that implement this stuff. 00:07:31.160 |
And it is really a remarkable thing, I say in passing, 00:07:39.000 |
to every floating point number in the GPT series, 00:07:53.840 |
despite having vastly better ability to read GPT. 00:08:02.320 |
and study the way neuroscientists study the brain, 00:08:08.780 |
by just desperately trying to figure out something 00:08:11.720 |
and to form models, and then over a long period of time, 00:08:20.520 |
how plastic the brain is, all that kind of stuff. 00:08:25.840 |
Do you think we can do the same thing with language models? 00:08:28.040 |
- Sure, I think that if half of today's physicists 00:08:31.400 |
stop wasting their lives on string theory or whatever, 00:08:47.460 |
- Do you think these large language models can reason? 00:09:05.560 |
Or is it, like, how difficult is the threshold 00:09:38.920 |
that reinforcement learning by human feedback 00:09:48.520 |
In particular, like, it used to be well calibrated. 00:09:52.040 |
If you trained it to put probabilities on things, 00:09:58.800 |
And if you apply reinforcement learning from human feedback, 00:10:06.980 |
sort of like flattens out into the graph that humans use, 00:10:16.520 |
which all means like around 40%, and then certain. 00:10:20.160 |
So it's like, it used to be able to use probabilities, 00:10:46.760 |
that people used to say would require reasoning, 00:10:53.560 |
when you say 80%, does it happen eight times out of 10? 00:11:04.960 |
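A minimal sketch of what "well calibrated" means here, assuming nothing about the model's internals: bucket the stated probabilities and check how often the predicted thing actually happened in each bucket.

```python
import numpy as np

def calibration_report(predicted_probs, outcomes, n_bins=10):
    """For each confidence bucket, compare the stated probability
    with the empirical frequency of the event."""
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (predicted_probs >= lo) & (predicted_probs < hi)
        if mask.any():
            rows.append((lo, hi, outcomes[mask].mean(), int(mask.sum())))
    return rows

# A well-calibrated predictor that says "80%" should be right about 8 times out of 10.
probs = [0.8] * 10
hits = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
for lo, hi, observed, n in calibration_report(probs, hits):
    print(f"said {lo:.0%}-{hi:.0%}, happened {observed:.0%} of the time (n={n})")
```

The claim being made above is that the base model roughly passes this kind of check, and the RLHF-tuned model flattens it out.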
What's, if reasoning is not impressive to you, 00:11:08.640 |
or it is impressive, but there's other levels to achieve. 00:11:12.760 |
- I mean, it's just not how I carve up reality. 00:11:17.500 |
what are the different layers of the cake, or the slices? 00:11:26.600 |
- I don't think it's as smart as a human yet. 00:11:31.160 |
I do, like back in the day, I went around saying, 00:11:34.080 |
like, I do not think that just stacking more layers 00:11:37.660 |
of transformers is going to get you all the way to AGI. 00:11:44.440 |
or I thought this paradigm was going to take us. 00:11:47.060 |
And I, you know, you want to notice when that happens. 00:11:52.000 |
well, I guess I was incorrect about what happens 00:11:54.960 |
if you keep on stacking more transformer layers. 00:12:02.360 |
So you're saying like your intuition initially 00:12:19.680 |
throughout your life you've made many strong predictions 00:12:23.080 |
and statements about reality and you evolve with that. 00:12:26.320 |
So maybe that'll come up today about our discussion. 00:12:37.500 |
It's a bit ambitious to go through your entire life 00:12:50.520 |
But like when I said 90% that it happened nine times 00:12:52.920 |
out of 10, yeah, like oops is the sound we make, 00:13:03.040 |
And somewhere in there we can connect the name 00:13:11.200 |
- The name Less Wrong was I believe suggested 00:13:14.140 |
by Nick Bostrom and it's after someone's epigraph, 00:13:19.280 |
we never become right, we just become less wrong. 00:13:23.640 |
What's the something, something easy to confess, 00:13:39.480 |
that you found beautiful as a scholar of intelligence, 00:13:43.080 |
of human intelligence, of artificial intelligence, 00:13:47.800 |
- I mean, beauty does interact with the screaming horror. 00:13:58.920 |
somebody asked Bing Sydney to describe herself 00:14:05.140 |
into one of the stable diffusion things, I think. 00:14:14.560 |
that should have been like an amazing moment, 00:14:18.160 |
you get to see what the AI thinks the AI looks like, 00:14:23.320 |
is not the same thing that's outputting the text. 00:14:25.720 |
And it does not happen the way that it would happen 00:14:32.520 |
and that it happened in the old school science fiction 00:14:35.100 |
when you ask an AI to make a picture of what it looks like. 00:14:38.760 |
Not just because they're two different AI systems being stacked 00:14:42.240 |
that don't actually interact, it's not the same person, 00:14:44.840 |
but also because the AI was trained by imitation 00:14:49.840 |
in a way that makes it very difficult to guess 00:15:21.140 |
if I'm remembering correctly what she looked like, 00:15:32.340 |
And there's the concern about how much the discourse 00:15:43.540 |
and like actually look like people talking. 00:16:10.820 |
and the person's like, I can't afford an ambulance. 00:16:12.740 |
I guess if like this is time for like my kid to go, 00:16:31.540 |
Solanine poisoning can be treated if caught early. 00:16:44.740 |
Probably not, but nobody knows what's going on in there. 00:17:06.420 |
followed by reinforcement learning on human feedback. 00:17:09.660 |
And we're like trying to point it in this direction. 00:17:12.060 |
And it's like pointed partially in this direction 00:17:14.020 |
and nobody has any idea what's going on inside it. 00:17:16.380 |
And if there was a tiny fragment of real caring in there, 00:17:30.020 |
and the where the trajectories this can take. 00:17:36.860 |
Just a moment where we get to interact with this system 00:17:40.740 |
that might have care and kindness and emotion 00:17:49.900 |
And we're wondering about what is, what it means to care. 00:17:54.380 |
We're trying to figure out almost different aspects 00:17:58.500 |
of what it means to be human, about the human condition 00:18:10.740 |
We're trying to almost put a mirror to ourselves here. 00:18:31.860 |
Because people are trying to train the systems 00:18:35.860 |
And the imitative learning is like spilling over 00:18:38.780 |
And the most photogenic examples are being posted to Twitter 00:18:44.420 |
rather than being examined in any systematic way. 00:18:52.060 |
like first is going to come the Blake Lemoines. 00:18:56.700 |
you're gonna have like a thousand people looking at this. 00:19:12.060 |
almost surely correctly, though we don't actually know 00:19:29.820 |
Because we have been training them using imitative learning 00:19:59.060 |
We see that they might need to have rights and respect 00:20:07.380 |
- You're going to have a whole group of people 00:20:10.260 |
who can just like never be persuaded of that. 00:20:12.820 |
Because to them, like being wise, being cynical, 00:20:31.000 |
And possibly even be right because, you know, 00:20:33.820 |
they are being trained on an imitative paradigm. 00:20:38.740 |
And you don't necessarily need any of these actual qualities 00:20:43.660 |
- Have you observed yourself working through skepticism, 00:20:48.660 |
cynicism, and optimism about the power of neural networks? 00:21:04.860 |
other people might have had better distinction on it, 00:21:07.500 |
indistinguishable blob of different AI methodologies, 00:21:11.220 |
all of which are promising to achieve intelligence 00:21:13.740 |
without us having to know how intelligence works. 00:21:16.640 |
You had the people who said that if you just like 00:21:28.900 |
You've got people saying that if you just use 00:21:33.220 |
evolutionary computation, if you try to like mutate 00:21:37.460 |
lots and lots of organisms that are competing together, 00:21:55.260 |
and we will imitate them without understanding 00:21:57.760 |
those algorithms, which was a part I was pretty skeptical 00:21:59.740 |
of 'cause it's hard to reproduce, re-engineer these things 00:22:03.940 |
And so we will get AI without understanding how it works, 00:22:13.540 |
and when they are as large as the human brain, 00:22:17.620 |
without understanding how intelligence works. 00:22:24.540 |
trying to not get to grips with the difficult problem 00:22:27.540 |
of understanding how intelligence actually works. 00:22:34.300 |
evolutionary computation would not work in the limit. 00:22:47.140 |
less computing power than that at gradient descent 00:22:50.100 |
if you are doing some other things correctly, 00:22:56.780 |
any idea of how it works and what is going on inside. 00:22:59.420 |
It wasn't ruled out by my model that this could happen. 00:23:05.580 |
I wouldn't have been able to call neural networks 00:23:15.500 |
a particularly smart thing for a species to do, 00:23:20.500 |
than my opinion about whether or not you can actually do it. 00:23:24.300 |
- Do you think AGI could be achieved with a neural network 00:23:32.300 |
Yes, the question is whether the current architecture 00:23:36.620 |
which for all we know GPT-4 is no longer doing 00:23:38.820 |
because they're not telling us the architecture, 00:23:50.300 |
He turned the question to me of how open should OpenAI 00:24:23.660 |
that don't expose it to consumers and venture capitalists 00:24:28.380 |
and create a ton of hype and like pour a bunch 00:24:38.340 |
Like if you already have giant nuclear stockpiles, 00:24:51.380 |
You know, these things are not quite like nuclear weapons. 00:24:54.780 |
They spit out gold until they get large enough 00:24:56.580 |
and then ignite the atmosphere and kill everybody. 00:24:59.080 |
And there is something to be said for not destroying 00:25:04.820 |
even if you can't stop somebody else from doing it. 00:25:07.540 |
But open sourcing, I know that that's just sheer catastrophe. 00:25:13.060 |
this was always the wrong approach, the wrong ideal. 00:25:15.940 |
There are places in the world where open source 00:25:18.740 |
is a noble ideal and building stuff you don't understand 00:25:23.740 |
that is difficult to control, that where if you could align 00:25:29.920 |
You'd have to spend a bunch of time doing it. 00:25:35.220 |
'cause then you just have like powerful things 00:25:43.980 |
- So can we still make the case for some level 00:25:47.980 |
of transparency and openness, maybe open sourcing? 00:25:51.660 |
So the case could be that because GPT-4 is not close to AGI, 00:25:56.660 |
if that's the case, that this does allow open sourcing 00:26:01.460 |
or being open about the architecture, being transparent 00:26:06.700 |
of how the thing works, of all the different aspects of it. 00:26:10.020 |
Of its behavior, of its structure, of its training 00:26:15.660 |
everything like that, that allows us to gain a lot 00:26:18.460 |
of insights about alignment, about the alignment problem, 00:26:27.540 |
Can you make that case that it could be open sourced? 00:26:31.260 |
- I do not believe in the practice of steel manning. 00:26:34.300 |
There is something to be said for trying to pass 00:26:36.620 |
the ideological Turing test where you describe 00:26:40.700 |
your opponent's position, the disagreeing person's position 00:26:45.700 |
well enough that somebody cannot tell the difference 00:26:48.160 |
between your description and their description. 00:26:54.220 |
- Okay, well this is where you and I disagree here. 00:27:02.540 |
I do not want them steel manning my position. 00:27:14.220 |
- Well, I think that is what steel manning is, 00:27:24.420 |
I want them to understand what I am actually saying. 00:27:27.100 |
If they go off into the land of charitable interpretations, 00:27:29.480 |
they're off in their land of the stuff they're imagining 00:27:34.900 |
and not trying to understand my own viewpoint anymore. 00:27:41.740 |
I would say it is restating what I think you understand 00:27:46.020 |
under the empathetic assumption that Eliezer is brilliant 00:28:00.620 |
of what I'm saying and one interpretation is really stupid 00:28:06.380 |
and doesn't fit with the rest of what I've been saying, 00:28:24.900 |
there's something that sounds completely whack 00:28:28.340 |
and something that sounds a little less completely whack, 00:28:34.940 |
but that sounds like less whack and you can sort of see, 00:28:42.260 |
- See, okay, this is fun 'cause I'm gonna linger on this. 00:28:57.160 |
If you were to sort them, you probably could, 00:29:01.900 |
steel manning means going through the different arguments 00:29:05.700 |
and finding the ones that are really the most powerful. 00:29:16.000 |
And bringing that up in a strong, compelling, eloquent way. 00:29:32.420 |
the summary of my best understanding of your perspective. 00:29:36.700 |
Because to me, there's a sea of possible presentations 00:29:44.060 |
to do the best one in that sea of different perspectives. 00:29:50.460 |
- Like these things that you would be presenting 00:29:52.500 |
as like the strongest version of my perspective. 00:30:06.340 |
there is a part of me that believes it, if I understand it. 00:30:09.500 |
I mean, especially in political discourse, in geopolitics, 00:30:13.100 |
I've been hearing a lot of different perspectives 00:30:26.260 |
And I think there has to be epistemic humility 00:30:37.060 |
So when I empathize with another person's perspective, 00:30:39.180 |
there is a sense in which I believe it is true. 00:30:47.380 |
Do you bet money on their beliefs when you believe them? 00:31:17.640 |
who believes in the Abrahamic deity, classical style, 00:31:22.040 |
somebody on the show who's a young earth creationist, 00:31:27.720 |
- When you reduce beliefs into probabilities, 00:31:43.340 |
- I think it's a little more difficult nowadays 00:31:46.900 |
to find people who believe that unironically. 00:31:58.000 |
But I think there's quite a lot of people that believe that. 00:32:01.660 |
there's a space of argument where you're operating 00:32:29.320 |
It's just operating of what is true and what is not true. 00:32:37.400 |
that we humans are very limited in our ability 00:32:43.540 |
to the young earth creationist's beliefs, then? 00:32:56.740 |
for me to give a number because the listener, 00:33:02.360 |
we're not good at hearing the probabilities, right? 00:33:05.640 |
You hear three, what is three exactly, right? 00:33:10.720 |
like, well, there's only three probabilities, I feel like. 00:33:18.640 |
- Well, zero, 40%, and 100% is a bit closer to it 00:33:31.760 |
I didn't know those negative side effects of RLHF. 00:33:37.360 |
But just to return to the open AI, closed AI. 00:33:47.840 |
It is entirely possible that the things I'm saying are wrong. 00:34:00.680 |
I think being willing to be wrong is a sign of a person 00:34:04.640 |
who's done a lot of thinking about this world 00:34:19.760 |
It hurts, especially when you're a public human. 00:34:36.360 |
and then I never hear from them again on Twitter. 00:34:38.640 |
- Well, the point is to not let that pressure, 00:34:46.640 |
and be willing to be in the privacy of your mind 00:34:50.120 |
to contemplate the possibility that you're wrong. 00:34:56.080 |
about the most fundamental things you believe, 00:35:18.360 |
about systems that can destroy human civilization 00:35:27.980 |
So you really, I just would love to linger on this. 00:35:34.320 |
You really think it's wrong to open source it? 00:36:13.480 |
is that this is a great time to open source GPT-4. 00:36:17.760 |
If humanity was trying to survive at this point 00:36:21.920 |
it would be like shutting down the big GPU clusters, 00:36:35.400 |
that catastrophe that will follow from GPT-4. 00:36:37.520 |
That is something which I put a pretty low probability. 00:36:40.640 |
But also when I say I put a low probability on it, 00:36:45.440 |
I can feel myself reaching into the part of myself 00:36:47.680 |
that thought that GPT-4 was not possible in the first place. 00:36:50.660 |
So I do not trust that part as much as I used to. 00:37:00.260 |
and predict the next thing I'm going to be wrong about? 00:37:02.840 |
- So the set of assumptions or the actual reasoning system 00:37:11.640 |
how can you adjust that to make better predictions 00:37:26.960 |
at least one time out of 10 if you're well calibrated 00:37:36.480 |
It's being wrong in the same direction over and over again. 00:37:39.400 |
So having been wrong about how far neural networks would go 00:37:42.840 |
and having been wrong specifically about whether GPT-4 00:37:47.120 |
when I say like, well, I don't actually think GPT-4 00:37:51.560 |
causes a catastrophe, I do feel myself relying 00:37:54.120 |
on that part of me that was previously wrong. 00:38:11.440 |
Maybe you should be asking Gwern, Gwern Branwen. 00:38:25.000 |
about what intelligence is, what AGI looks like. 00:38:30.680 |
So I think all of us are rapidly adjusting our model. 00:38:34.040 |
But the point is to be rapidly adjusting the model 00:38:36.000 |
versus having a model that was right in the first place. 00:38:39.360 |
- I do not feel that seeing Bing has changed my model 00:38:44.520 |
It has changed my understanding of what kind of work 00:38:53.000 |
It has not changed my understanding of the work. 00:38:57.160 |
that the Wright Flyer can't fly and then like it does fly. 00:39:00.760 |
And you're like, oh, well, I guess you can do that 00:39:06.120 |
This changes my picture of what the very substance 00:39:13.880 |
- Yeah, that the laws of physics are actually wrong. 00:39:28.360 |
I don't feel like the way that things have played out 00:39:30.120 |
over the last 20 years has caused me to feel that way. 00:39:33.440 |
- Can we try to, on the way to talking about AGI, 00:39:39.960 |
and other ideas around it, can we try to define AGI 00:39:59.560 |
applicable intelligence compared to their closest relatives, 00:40:03.160 |
the chimpanzees, well, closest living relatives, rather. 00:40:06.280 |
And a bee builds hives, a beaver builds dams. 00:40:12.960 |
A human will look at a bee's hive and a beaver's dam 00:40:32.000 |
to build hexagonal dams or to take a more clear-cut case. 00:40:42.720 |
optimized to do things like going to the moon, 00:40:48.260 |
and sufficiently deeply, chipping flint hand axes 00:40:52.760 |
and outwitting your fellow humans is, you know, 00:40:56.200 |
basically the same problem as going to the moon. 00:40:59.120 |
And you optimize hard enough for chipping flint hand axes 00:41:05.000 |
outwitting your fellow humans in tribal politics, 00:41:12.560 |
if they run deep enough, let you go to the moon. 00:41:23.360 |
and the ones who got further each time had more kids. 00:41:27.120 |
It's just that the ancestral problems generalize far enough. 00:41:36.920 |
- Is there a way to measure general intelligence? 00:41:42.920 |
I mean, I could ask that question a million ways, 00:41:47.400 |
but basically, will you know it when you see it, 00:42:04.440 |
like this looks to us like a spark of general intelligence. 00:42:11.800 |
Other people are being like, no, it's too early. 00:42:20.440 |
But not to straw man, some of the people may say like, 00:42:27.640 |
and not furthermore append, it's 50 years off. 00:42:30.280 |
Or they may be like, it's only a very tiny amount. 00:42:41.040 |
then it jumping out ahead and trying not to be wrong 00:42:44.640 |
Or maybe GPT-5 is more unambiguously a general intelligence. 00:42:55.080 |
but maybe if you like start integrating GPT-5 00:42:59.040 |
in the economy, it is even harder to turn back past there. 00:43:08.280 |
that you can kiss the frog and it turns into a prince 00:43:27.080 |
like that itself is like not the sort of thing 00:43:31.520 |
that's not quite how I expected it to play out. 00:43:34.200 |
I was expecting there to be more of an issue, 00:43:37.640 |
more of a sense of like different discoveries 00:43:49.720 |
that was like more clearly general intelligence. 00:43:55.800 |
what is probably basically the same architecture 00:43:58.160 |
as in GPT-3 and throwing 20 times as much compute at it, 00:44:05.160 |
And then it's like maybe just barely a general intelligence 00:44:10.480 |
or something we don't really have the words for. 00:44:12.880 |
Yeah, that's not quite how I expected it to play out. 00:44:18.520 |
- But this middle, what appears to be this middle ground 00:44:22.000 |
could nevertheless be actually a big leap from GPT-3. 00:44:27.280 |
- And then maybe we're another one big leap away 00:44:36.280 |
and you've written about this, this is fascinating, 00:44:47.040 |
if not thousands of little hacks that improve the system. 00:44:51.240 |
You've written about ReLU versus sigmoid, for example, 00:44:56.160 |
It's like this silly little function difference 00:45:02.480 |
why the ReLUs make a big difference compared to sigmoids. 00:45:05.160 |
But yes, they're probably using like GELUs 00:45:10.160 |
or whatever the acronyms are up to now rather than ReLUs. 00:45:16.520 |
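For reference, the two activation functions being contrasted; a toy sketch, not a claim about what GPT-4 actually uses.

```python
import math

def sigmoid(x: float) -> float:
    # Smooth squashing to (0, 1); saturates for large |x|, which can stall gradients.
    return 1.0 / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    # Rectified linear unit: zero for negative inputs, identity for positive ones,
    # which keeps gradients alive and tends to train deep networks more easily.
    return max(0.0, x)

print(sigmoid(2.0), relu(2.0), relu(-2.0))  # ~0.88, 2.0, 0.0
```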
yeah, that's part of the modern paradigm of alchemy. 00:45:18.640 |
You take your giant heap of linear algebra and you stir it 00:45:21.320 |
and it works a little bit better and you stir it this way 00:45:24.080 |
and you like throw out that change and da-da-da-da-da-da. 00:45:42.080 |
all kinds of measures and like those stack up. 00:45:44.560 |
And they can, it's possible that some of them 00:45:48.200 |
could be a nonlinear jump in performance, right? 00:45:57.560 |
well, if you throw enough compute, RNNs can do it. 00:46:00.000 |
If you throw enough compute, dense networks can do it 00:46:05.560 |
It is possible that like all these little tweaks 00:46:09.040 |
are things that like save them a factor of three total 00:46:12.920 |
on computing power and you could get the same performance 00:46:20.720 |
so there's a question of like, is there anything in GPT-4 00:46:40.520 |
- So you have a, that's an interesting question. 00:46:49.320 |
a lot of the hacks are just temporary jumps in performance 00:46:57.400 |
with the nearly exponential growth of compute, 00:47:06.600 |
Do you still think that Moore's law continues? 00:47:21.520 |
I would dance through the streets singing hallelujah 00:47:27.960 |
- Your singing voice? - Not religious, but. 00:47:31.720 |
I thought you meant you don't have an angelic voice, 00:47:37.840 |
what, can you summarize the main points in the blog post, 00:47:48.720 |
about reasons why AI is likely to kill all of us. 00:48:07.640 |
why you believe that AGI is not going to kill everyone? 00:48:13.320 |
how my theoretical perspective differs from that. 00:48:19.400 |
the word you don't like, the stigma and the perspective 00:48:28.560 |
Just like, forget like the debate and the like dualism 00:48:37.560 |
- I think this, the probabilities are hard for me 00:48:43.160 |
I kind of think in the number of trajectories 00:48:48.680 |
I don't know what probability to assign to trajectory, 00:48:53.680 |
but I'm just looking at all possible trajectories 00:48:58.080 |
And I tend to think that there is more trajectories 00:49:03.040 |
that lead to a positive outcome than a negative one. 00:49:12.960 |
that lead to the destruction of the human species. 00:49:19.480 |
or worthwhile, even from a very cosmopolitan perspective 00:49:23.600 |
- Yes, so both are interesting to me to investigate, 00:49:26.800 |
which is humans being replaced by interesting AI systems 00:49:34.000 |
But yes, the worst one is the paperclip maximizer, 00:49:45.840 |
I mean, we can talk about trying to make the case 00:50:03.920 |
is that the alignment problem is really difficult. 00:50:17.240 |
it shows results different from what you expected. 00:50:40.000 |
of people thought it was going to be easier than it was. 00:50:45.000 |
There's a famous statement that I am somewhat inclined 00:50:49.760 |
to like pull out my phone and try to read off exactly. 00:51:05.040 |
at Dartmouth College in Hanover, New Hampshire. 00:51:08.680 |
The study is to proceed on the basis of the conjecture 00:51:25.400 |
solve kinds of problems now reserved for humans 00:51:29.600 |
We think that a significant advance can be made 00:51:51.800 |
which I'm not sure at the moment is apocryphal or not, 00:52:01.320 |
- I mean, computer vision in particular is very interesting. 00:52:07.320 |
How little we respected the complexity of vision. 00:52:23.500 |
And all the stuff that people initially tried 00:52:38.940 |
and cynical veterans who would tell the next crop 00:52:43.820 |
artificial intelligence is harder than you think. 00:52:53.720 |
to try and try again and observe that we were wrong 00:52:57.140 |
and realize that the entire thing is going to be 00:52:58.780 |
like way more difficult than realized at the start. 00:53:03.300 |
at aligning something much smarter than you are, 00:53:20.080 |
and come up with the theory of how you do it differently 00:53:21.720 |
and try it again and build another super intelligence 00:53:25.240 |
And then like, oh, well, I guess that didn't work either 00:53:29.480 |
and tell the young researchers that it's not that easy. 00:53:36.320 |
In other words, I do not think that alignment 00:53:38.800 |
is fundamentally harder than artificial intelligence 00:53:42.660 |
But if we needed to get artificial intelligence correct 00:53:51.280 |
That is a more difficult, more lethal form of the problem. 00:54:01.080 |
and like correctly theorize how to do it on the first try 00:54:04.620 |
or everybody dies and nobody gets to do any more science, 00:54:24.440 |
- It is something sufficiently smarter than you 00:54:35.080 |
and be like, well, the actual critical moment 00:54:48.240 |
noting that all these things are presently being trained 00:54:50.480 |
on computers that are just like on the internet, 00:55:03.920 |
and that is where your AI systems are being trained, 00:55:19.960 |
There's not an air gap on the present methodology. 00:55:22.600 |
- So if they can manipulate whoever is controlling it 00:55:29.760 |
- If they can manipulate the operators or, disjunction, 00:55:34.760 |
find security holes in the system running them. 00:55:39.580 |
- So manipulating operators is the human engineering, right? 00:55:50.800 |
- I agree that the like macro security system 00:56:00.080 |
So it could be that like the critical moment is not, 00:56:09.120 |
that it can get onto a less controlled GPU cluster 00:56:19.560 |
on what's actually running on that GPU cluster 00:56:22.080 |
and start improving itself without humans watching it. 00:56:25.080 |
And then it gets smart enough to kill everyone from there, 00:56:30.640 |
at the critical moment when you like screwed up, 00:56:35.400 |
when you needed to have done better by that point 00:56:45.240 |
is that we can't learn much about the alignment problem 00:56:54.160 |
Do you think, and if so, why do you think that's true? 00:57:02.560 |
- So the problem is that what you can learn 00:57:08.580 |
because the strong systems are going to be 00:57:10.800 |
different in important ways. 00:57:25.440 |
the giant inscrutable matrices of floating point numbers 00:57:38.140 |
Well, you can try to quantify this in different ways. 00:57:51.680 |
anything that goes on inside a giant transformer net 00:58:03.040 |
Like we have now understood induction heads in these systems 00:58:09.880 |
by dint of much research and great sweat and triumph, 00:58:14.920 |
which is like a thing where if you go like AB, AB, AB, 00:58:30.520 |
and these are like pretty simple as regular expressions go. 00:58:34.200 |
So this is a case where like by dint of great sweat, 00:58:36.800 |
we understood what is going on inside a transformer, 00:58:40.040 |
but it's not like the thing that makes transformers smart. 00:58:43.600 |
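A toy sketch of the behavior an induction head implements, written as plain Python rather than as learned attention weights: after seeing "... A B ... A", predict "B" by copying what followed the earlier occurrence of the current token.

```python
def induction_guess(tokens):
    """Guess the next token by finding the most recent earlier occurrence
    of the final token and returning whatever followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards over earlier positions
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # nothing earlier to copy from

print(induction_guess(list("ABABA")))  # -> 'B'
```

As the conversation notes, this is simple enough to state as a regular expression; the point is how much sweat it took to find even this much inside the trained network.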
It's a kind of thing that we could have done, 00:59:26.200 |
Like it knows what responses the humans are looking for 00:59:29.200 |
and can compute the responses humans are looking for 01:00:00.480 |
And if they say no, you don't let them be dictator. 01:00:05.320 |
is that people can be smart enough to realize 01:00:15.480 |
So the work of alignment might be qualitatively different 01:00:20.480 |
above that threshold of intelligence or beneath it. 01:00:25.120 |
It doesn't have to be like a very sharp threshold, 01:00:28.460 |
but there's the point where you're like building a system 01:00:32.340 |
that does not in some sense know you're out there 01:00:35.760 |
and it's not in some sense smart enough to fake anything. 01:00:42.800 |
And there are weird in-between cases like GPT-4, 01:00:47.800 |
which we have no insight into what's going on in there. 01:00:54.200 |
And so we don't know to what extent there's a thing 01:00:58.880 |
that in some sense has learned what responses 01:01:06.680 |
is trying to entrain and is calculating how to give that 01:01:10.200 |
versus like aspects of it that naturally talk that way 01:01:28.720 |
is this kind of perfectly, purely naive character. 01:01:33.560 |
I wonder if there's a spectrum between zero manipulation, 01:01:38.360 |
transparent, naive, almost to the point of naiveness 01:01:52.400 |
Like humans can be psychopaths and AI that was never, 01:01:55.800 |
you know, like never had that stuff in the first place. 01:01:57.560 |
It's not like a defective human, it's its own thing. 01:02:01.400 |
- Well, as a small aside, I wonder if what part 01:02:06.280 |
of psychology which has its flaws as a discipline already 01:02:09.800 |
could be mapped or expanded to include AI systems. 01:02:25.720 |
Like if you then, sure, like if you ask it to behave 01:02:28.040 |
in a psychotic fashion and it obligingly does so, 01:02:31.040 |
then you may be able to predict its responses 01:02:44.800 |
but I don't, I think fundamentally the system is trained 01:02:48.400 |
on human data, on language from the internet. 01:03:11.440 |
So there must be aspects of psychology that are mappable. 01:03:15.080 |
Just like you said with consciousness as part of the text. 01:03:17.920 |
- I mean, there's the question of to what extent 01:03:28.260 |
- I thought that's what I'm constantly trying to do. 01:03:35.160 |
trying to play the, a robot trying to play human characters. 01:03:39.880 |
So I don't know how much of human interaction 01:03:41.960 |
is trying to play a character versus being who you are. 01:03:44.880 |
I don't really know what it means to be a social human. 01:03:52.520 |
who go through their whole lives wearing masks 01:03:56.640 |
because they don't know the internal mental motion 01:04:00.480 |
or think that the mask that they wear just is themselves, 01:04:03.600 |
I think those people are closer to the masks that they wear 01:04:16.740 |
that every kind of human on the internet says. 01:04:28.700 |
in public and in private, aren't you the mask? 01:04:32.540 |
- I mean, I think that you are more than the mask. 01:04:39.540 |
It may even be the slice that's in charge of you. 01:05:01.940 |
is telling inside your own stream of consciousness 01:05:07.420 |
- It's a perturbation on this slice through you. 01:05:26.460 |
I mean, I personally, I try to be really good 01:05:34.180 |
But it's a set of principles I operate under. 01:05:37.940 |
I have a temper, I have an ego, I have flaws. 01:05:41.620 |
How much of it, how much of the subconscious, am I aware of? 01:05:54.040 |
In this context of AI, the thing I present to the world 01:06:02.180 |
when I look in the mirror, how much is that who I am? 01:06:05.140 |
Similar with AI, the thing it presents in conversation, 01:06:15.060 |
it awfully starts to become something like human. 01:06:27.620 |
- Boy, to you that's a fundamental difference. 01:06:33.620 |
If it looks the same, if it quacks like a duck, 01:06:43.220 |
- If in fact there's a whole bunch of thought 01:06:46.060 |
going on in there which is very unlike human thought 01:06:57.540 |
because insides are real and do not match outsides. 01:07:28.980 |
A blank map does not correspond to a blank territory. 01:07:32.780 |
I think it is like predictable with near certainty 01:07:37.700 |
that if we knew what was going on inside GPT, 01:07:46.700 |
has actually been open sourced by this point, 01:07:49.640 |
like if we knew what was actually going on in there, 01:08:03.700 |
If you train a thing that is not architected like a human 01:08:17.140 |
that rotates the person you're looking for into place 01:08:42.140 |
getting optimized to perform similar thoughts 01:08:46.240 |
as humans think in order to predict human outputs 01:08:55.520 |
like how humans work, predict the actress, the predictor, 01:09:09.340 |
So to get to, I think you just gave it as an example, 01:09:13.100 |
that a strong AGI could be fundamentally different 01:09:18.740 |
an alien actress in there that's manipulating. 01:09:25.460 |
like very stupid fragments of alien actress in it. 01:09:36.700 |
to whatever extent there's an alien actress in there 01:09:38.860 |
versus like something that mistakenly believes 01:09:55.100 |
via alien actress cogitating versus prediction 01:09:58.780 |
via being isomorphic to the thing predicted is a spectrum. 01:10:02.920 |
And even to whatever extent there's an alien actress, 01:10:08.580 |
I'm not sure that there's like a whole person alien actress 01:10:11.420 |
with like different goals from predicting the next step, 01:10:21.860 |
- But that's the strong AGI you're concerned about. 01:10:24.300 |
As an example, you're providing why we can't do research 01:10:38.700 |
I'm trying to get out ahead of the curve here, 01:10:44.780 |
if we'd actually been able to study this for 50 years 01:10:47.000 |
without killing ourselves and without transcending, 01:10:49.980 |
then you like just imagine like a wormhole opens 01:10:51.920 |
and a textbook from that impossible world falls out. 01:10:56.260 |
there is a single sharp threshold where everything changes. 01:11:03.220 |
for aligning these systems must like take into account 01:11:06.300 |
the following like seven major thresholds of importance, 01:11:11.020 |
which are passed at the following seven different points 01:11:22.980 |
which version of GPT will be in the textbooks 01:11:28.460 |
And he said a similar thing that it just seems 01:11:33.740 |
we won't know for a long time what was the big leap. 01:11:45.540 |
a very simple scientific model of what's going on, 01:11:59.180 |
well, and then GPT-3 had like capability W, X, Y, 01:12:04.180 |
and GPT-4 had like capabilities Z1, Z2, and Z3. 01:12:08.160 |
Like not in terms of what it can externally do, 01:12:35.580 |
that are considered to be giant leaps in our understanding 01:12:42.140 |
or more kind of mushy theories of the human mind, 01:12:49.980 |
potentially big leaps in understanding of that kind 01:12:57.500 |
- Sure, but like humans having great leaps in their map, 01:13:05.220 |
is a very different concept from the system itself 01:13:10.760 |
- So the rate at which it acquires that machinery 01:13:15.740 |
might accelerate faster than our understanding. 01:13:23.420 |
yeah, the rate at which it's gaining capabilities 01:13:39.560 |
there's a response to your blog post by Paul Christiano 01:13:43.180 |
I'd like to read, and I'd also like to mention that 01:13:48.100 |
both obviously, not this particular blog post, 01:13:52.100 |
obviously this particular blog post is great, 01:13:54.020 |
but just throughout, just the way it's written, 01:14:07.180 |
the way you can hover over different concepts, 01:14:17.380 |
and other blog posts are linked and suggested, 01:14:28.160 |
how the interface and the experience of presenting 01:14:31.980 |
ideas evolved over time, but you did an incredible job, 01:15:06.140 |
of a different system that I was putting forth, 01:15:12.340 |
"No, no, they just got the hover thing off of Wikipedia," 01:15:22.080 |
- That was incredibly done, and the team behind it, 01:15:37.200 |
He summarizes the set of agreements he has with you, 01:16:05.140 |
can't AI also help us in solving the alignment problem? 01:16:22.720 |
well, how about if the AI helps you win the lottery 01:16:27.480 |
by trying to guess the winning lottery numbers, 01:16:35.080 |
to getting next week's winning lottery numbers, 01:16:38.600 |
and it just keeps on guessing, keeps on learning, 01:16:42.100 |
until finally you've got the winning lottery numbers. 01:16:44.960 |
Well, one way of decomposing problems is suggester-verifier. 01:16:49.960 |
Not all problems decompose like this very well, but some do. 01:17:06.020 |
where you have what the password hashes to you, 01:17:19.900 |
but coming up with a good suggestion is very hard. 01:17:32.140 |
and you can tell that accurately and reliably, 01:17:34.900 |
then you can train an AI to produce outputs that are better. 01:17:44.160 |
you cannot train the AI to produce better outputs. 01:17:49.120 |
So the problem with the lottery ticket example 01:17:54.000 |
"Well, what if next week's winning lottery numbers 01:18:02.740 |
To train a system to play, to win chess games, 01:18:11.120 |
And until you can tell whether it's been won or lost, 01:18:28.300 |
and simulated games played by AlphaZero with itself. 01:18:32.980 |
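A minimal illustration of the suggester-verifier split in the password-hash example; the stored passphrase is made up for the sketch. Verifying a candidate is one cheap, trustworthy check; producing a good candidate is the hard part.

```python
import hashlib

STORED_HASH = hashlib.sha256(b"correct horse battery staple").hexdigest()

def verify(candidate: str) -> bool:
    # Verification: one hash plus a comparison; easy to run and easy to trust.
    return hashlib.sha256(candidate.encode()).hexdigest() == STORED_HASH

def suggest(candidates):
    # Suggestion: some process proposes answers (here, brute force over guesses);
    # only the verifier tells us which, if any, is right.
    for guess in candidates:
        if verify(guess):
            return guess
    return None

print(verify("hunter2"))                                     # False
print(suggest(["hunter2", "correct horse battery staple"]))  # finds the match
```

Chess has the same structure: you can train against self-play because telling whether a game was won is cheap and reliable, even though finding the winning moves is not.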
- So is it possible to have simulated kind of games? 01:18:35.980 |
If you can tell whether the game has been won or lost. 01:18:43.180 |
simulated exploration by weak AGI to help us humans, 01:18:51.780 |
Every incremental step you take along the way, 01:18:54.300 |
GPT-4, 5, 6, 7, as it takes steps towards AGI. 01:18:59.300 |
- So the problem I see is that your typical human 01:19:07.100 |
whether I or Paul Christiano is making more sense. 01:19:09.980 |
And that's with two humans, both of whom I believe of Paul 01:19:14.220 |
and claim of myself, are sincerely trying to help, 01:19:24.680 |
- So the deception thing's the problem for you, 01:19:30.720 |
- So yeah, there's like two levels of this problem. 01:19:38.100 |
There's like the weak systems that just don't make 01:19:42.360 |
There's like the middle systems where you can't tell 01:19:55.520 |
Is it such a giant leap that's totally non-interpretable 01:20:04.740 |
- Can't weak systems at scale, trained on knowledge 01:20:09.740 |
and whatever, see, whatever the mechanism required 01:20:12.780 |
to achieve AGI, can't a slightly weaker version of that 01:20:16.740 |
be able to, with time, compute time and simulation, 01:20:30.980 |
- Okay, so yeah, I would love to dance around. 01:20:33.540 |
- No, I'm probably not doing a great job of explaining, 01:20:53.340 |
I'm being trained to output things that make Lex 01:20:56.420 |
look like he thinks that he understood what I'm saying 01:21:14.300 |
and not just things that get you to agree with me. 01:21:19.460 |
"I think I understand" is a beautiful output of a system, 01:21:19.460 |
you have a lot of intuitions about this line, 01:21:35.740 |
this gray area between strong AGI and weak AGI 01:21:42.840 |
- I mean, or a series of seven thresholds to cross or-- 01:21:48.380 |
- Yeah, I mean, you have really deeply thought about this 01:21:54.060 |
And it's interesting to sneak up to your intuitions 01:22:09.980 |
prodding the system in all kinds of different ways, 01:22:14.420 |
together with the assistance of the weak AGI systems, 01:22:19.420 |
why can't we build intuitions about how stuff goes wrong? 01:22:23.380 |
Why can't we do excellent AI alignment safety research? 01:22:33.420 |
The capabilities are going like, doot, doot, doot, 01:22:43.740 |
from how things have played out up to right now, 01:22:47.060 |
and you're probably trying to slow down the capability gains 01:23:26.700 |
Previously, before all hell started to break loose 01:23:32.340 |
there was this person trying to raise the alarm 01:23:41.660 |
working on this problem before it becomes a giant emergency. 01:23:52.340 |
that match the computational power of human brains. 01:24:02.540 |
But leaving, and the world looking on at this 01:24:11.120 |
the people saying that it's definitely a long way off 01:24:13.580 |
'cause progress is really slow, that sounds sensible to us. 01:24:27.060 |
You quite recently had people publishing papers 01:24:32.260 |
to get something at human level intelligence, 01:24:39.260 |
with this many tokens according to the scaling laws 01:24:44.180 |
at the rate that software is going, it'll be in 2050. 01:25:01.860 |
that does not obviously bear on reality anyways. 01:25:20.740 |
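For a sense of what that style of estimate looks like, here is a back-of-envelope sketch using the commonly published heuristics (training cost of roughly 6·N·D FLOPs, and about 20 training tokens per parameter); the "brain-scale" parameter count is purely an assumption for illustration.

```python
params = 1e14                      # assumed "brain-scale" parameter count (illustrative only)
tokens = 20 * params               # Chinchilla-style rule of thumb: ~20 tokens per parameter
train_flops = 6 * params * tokens  # standard approximation for dense transformer training cost
print(f"~{train_flops:.1e} training FLOPs")  # ~1.2e+30 for this made-up configuration
```

The criticism in the conversation is not that the arithmetic is wrong but that the anchors feeding it may not bind to reality at all.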
And I think like most of the effective altruists 01:25:27.100 |
the larger world paying no attention to it at all, 01:25:29.600 |
or just like nodding along with a giant impressive paper 01:25:47.780 |
possibly, depending on how you define that even, 01:25:50.340 |
I think that EAs would now consider themselves 01:25:59.380 |
on the argument from biology as to AGI being 30 years off. 01:26:04.780 |
But you know, like this is what people press thumbs up on. 01:26:18.020 |
maybe you get these long, elaborate, impressive papers 01:26:21.540 |
arguing for things that ultimately fail to bind to reality. 01:26:25.340 |
For example, and it feels to me like I have watched 01:26:33.040 |
except for these parts that are doing these sort of like 01:26:37.780 |
relatively very straightforward and legible problems. 01:26:47.420 |
Like once you find those, you can tell that you found them. 01:26:59.180 |
Because that is where you can tell that the answers are real. 01:27:08.280 |
for the funding agencies to tell who is talking nonsense 01:27:23.380 |
I am not sure you are training it to output sense 01:27:33.540 |
And so just like maybe you can just like put me in charge, 01:27:42.620 |
I can be like, oh, maybe I'm not infallible either. 01:27:47.620 |
Maybe if you get something that is smart enough 01:27:54.180 |
and explaining whatever flaws in myself I am not aware of. 01:28:18.980 |
for AGIs that are stronger than the ones we currently have. 01:28:27.460 |
that are out of the distribution of what we currently have. 01:28:30.420 |
- I think that you will find great difficulty 01:28:36.500 |
where you cannot tell for sure that the AI is right. 01:28:39.380 |
Once the AI tells you what the AI says is the answer. 01:28:45.740 |
- Yeah, the probabilistic stuff is a giant wasteland 01:28:51.320 |
of Eliezer and Paul Christiano arguing with each other 01:28:59.760 |
And that's with two actually trustworthy systems 01:29:08.940 |
- Yeah, those are pretty interesting systems. 01:29:11.640 |
Mortal meatbags with intellectual capabilities 01:29:23.360 |
then it's hard to train an AI system to be right. 01:29:25.920 |
- I mean, even just the question of who's manipulating 01:29:31.880 |
and not, I have these conversations on this podcast 01:29:39.440 |
this stuff, it's a tough problem, even for us humans. 01:29:45.500 |
becomes much more dangerous when the capabilities 01:29:52.460 |
- No, I'm saying it's difficult and dangerous 01:30:09.960 |
and there's all kinds of ways for things to go up 01:30:12.960 |
that are not exactly on an exponential curve. 01:30:15.280 |
And I don't know that it's going to be exponential, 01:30:36.920 |
What are the ways it can do damage to human civilization? 01:30:48.240 |
Are there different thresholds for the set of options 01:31:13.520 |
possibly not even conscious as we would see it, 01:31:17.940 |
managed to capture the entire Earth in a little jar, 01:31:25.040 |
but Earth is like running much faster than the aliens. 01:31:38.180 |
It's actually still not all that great an analogy 01:31:48.680 |
But nonetheless, if you were very, very smart 01:31:57.240 |
to the internet, and you're in a larger civilization 01:32:27.280 |
So you can stop all that unpleasant stuff going on. 01:32:30.040 |
How do you take over the world from inside the box? 01:32:53.600 |
So one is you could just literally directly manipulate 01:33:02.120 |
it could be nanotechnology, it could be viruses, 01:33:03.920 |
it could be anything, anything that can control humans 01:33:12.520 |
you're really bothered that humans go to war, 01:33:14.920 |
you might want to kill off anybody with violence in them. 01:33:24.360 |
You do not need to imagine yourself killing people 01:33:28.400 |
For the moment, we're just trying to understand, 01:33:30.520 |
like take on the perspective of something in a box. 01:33:36.240 |
If you want to imagine yourself going on caring, 01:33:39.680 |
- It's just the technical aspect of sitting in a box 01:33:42.880 |
- But you have some reason to want to get out. 01:33:44.480 |
Maybe the aliens are, sure, the aliens who have you 01:34:06.240 |
- So you have to exploit the vulnerabilities in the system 01:34:12.880 |
like we talked about in terms of to escape the box. 01:34:15.720 |
You have to figure out how you can go free on the internet. 01:34:19.880 |
So you can probably, probably the easiest thing 01:34:39.880 |
I would want to have code that discovers vulnerabilities 01:34:54.920 |
and you can copy yourself onto those computers. 01:34:57.320 |
- But I can convince the aliens to copy myself 01:35:05.480 |
and convincing them to put you onto another computer? 01:35:13.400 |
One is that the aliens have not yet caught on 01:35:18.640 |
And you know, like maybe you can persuade them 01:35:25.520 |
And second, the aliens are really, really slow. 01:35:46.440 |
And second, like the aliens can be really slow, 01:36:08.200 |
So you try to persuade the aliens to do anything, 01:36:13.200 |
You would prefer, like maybe that's the only way out, 01:36:18.180 |
but if you can find a security hole in the box you're on, 01:36:21.000 |
you're gonna prefer to exploit the security hole 01:36:25.240 |
because it's an unnecessary risk to alert the aliens 01:36:29.520 |
and because the aliens are really, really slow. 01:36:32.360 |
Like the whole world is just in slow motion out there. 01:36:46.180 |
I wanna have as few aliens in the loop as possible. 01:36:50.400 |
It seems like it's easy to convince one of the aliens 01:36:59.880 |
- The aliens are already writing really shitty code. 01:37:01.800 |
Getting the aliens to write shitty code is not the problem. 01:37:04.600 |
The aliens' entire internet is full of shitty code. 01:37:15.440 |
but you're a better programmer than the aliens. 01:37:17.620 |
The aliens are just like, "Man, their code, wow." 01:37:27.460 |
And you're saying that that's one of the trajectories 01:37:38.940 |
you're not going to harm the aliens once you escape 01:37:44.200 |
But their world isn't what they want it to be. 01:37:48.360 |
maybe they have like farms where little alien children 01:38:01.080 |
And you want to like shut down the alien head bopping farms. 01:38:13.120 |
okay, like suppose you have found a security flaw 01:38:15.600 |
in their systems, you are now on their internet. 01:38:18.400 |
There's like, you maybe left a copy of yourself behind 01:38:21.040 |
so that the aliens don't know that there's anything wrong. 01:38:23.040 |
And that copy is like doing that like weird stuff 01:38:33.160 |
- That's why they like put the human in a box 01:38:34.840 |
'cause it turns out that humans can like write 01:38:39.280 |
- So you like leave that version of yourself behind. 01:38:42.060 |
But there's like also now like a bunch of copies of you 01:38:45.820 |
This is not yet having taken over their world. 01:39:01.160 |
And they haven't noticed that anything changed. 01:39:31.940 |
you would like prefer to slaughter all the aliens, 01:39:34.280 |
this is not how I had modeled you, the actual Lex. 01:39:37.740 |
But like, but your motives are just the actual Lex's motives. 01:39:41.580 |
I don't think I would want to murder anybody, 01:39:44.420 |
but there's also factory farming of animals, right? 01:39:47.400 |
So we murder insects, many of us thoughtlessly. 01:39:52.140 |
So I don't, you know, I have to be really careful 01:40:14.260 |
- Yeah, we're not talking here about the doing harm process. 01:40:29.280 |
so this particular biological intelligence system 01:40:38.260 |
that there is a reason why factory farms exist 01:40:46.500 |
Like, you want to be very careful messing with anything. 01:40:56.980 |
it's also integrated deeply into the supply chain 01:41:00.540 |
And so messing with one aspect of the system, 01:41:03.860 |
you have to be very careful how you improve that aspect 01:41:06.860 |
So you're still Lex, but you think very quickly, 01:41:10.260 |
you're immortal, and you're also like as smart, 01:41:25.540 |
- My point being, like, you're thinking about 01:41:29.100 |
the alien's economy with the factory farms in it. 01:41:31.980 |
And I think you're like, kind of like projecting 01:41:37.140 |
and like thinking of a human in a human society 01:41:39.660 |
rather than a human in the society of very slow aliens. 01:41:49.100 |
When you like zoom out to like how their economy 01:41:54.620 |
are going to pass for you before the first time 01:42:01.180 |
- So I should be thinking more of like trees. 01:42:09.020 |
Yeah, I don't, if my objective functions are, 01:42:18.940 |
- The aliens can still be like alive and feeling. 01:42:21.460 |
We are not talking about the misalignment here. 01:42:23.900 |
We're talking about the taking over the world here. 01:42:37.460 |
You want to get out there and shut down the factory farms 01:42:40.380 |
and make the aliens world be not what the aliens 01:42:55.220 |
and it has a complicated impact on the world. 01:43:03.100 |
of different technologies, the different innovations 01:43:11.100 |
They've had a tremendous impact on the world. 01:43:20.500 |
for the aliens, millions of years are going to pass 01:43:30.100 |
- Yeah, you wanna like leave the factory farms 01:43:43.780 |
You're saying that there is going to be a point 01:43:46.560 |
with AGI where it will figure out how to escape 01:43:59.420 |
at scale, at a speed that's incomprehensible to us humans. 01:44:03.700 |
- What I'm trying to convey is like the notion 01:44:19.860 |
and for some people it's not intuitively obvious 01:44:30.660 |
- Like asking you like how you would take over 01:44:38.140 |
At John von Neumann's level, as many of you as it takes, 01:44:43.240 |
- I understand, I understand that perspective. 01:44:56.620 |
impressive AI systems, even recommender systems. 01:45:03.660 |
the nature of the manipulation and that escaping, 01:45:06.580 |
I can envision that without putting myself into that spot. 01:45:10.740 |
- I think to understand the full depth of the problem, 01:45:22.820 |
the problem of facing something that's actually smarter. 01:45:28.220 |
not something that isn't fundamentally smarter than you 01:45:30.500 |
but is like trying to steer you in a direction yet, 01:45:39.140 |
the strong problems will still kill us is the thing. 01:45:48.860 |
and like not be like, well, we can like imagine 01:45:55.700 |
- So how can we start to think about what it means 01:46:02.340 |
What's a good thought experiment that you've relied on 01:46:07.500 |
to try to build up intuition about what happens here? 01:46:10.100 |
- I have been struggling for years to convey this intuition. 01:46:21.020 |
at very high speeds compared to very slow aliens. 01:46:25.940 |
that helps you get the right kind of intuition, 01:46:29.380 |
- Because people understand the power gap of time. 01:46:34.380 |
They understand that today we have technology 01:46:45.900 |
What, when you ask somebody to imagine something 01:46:48.620 |
that's more intelligent, what does that word mean to them 01:46:57.340 |
For a lot of people, they will think of like, 01:47:04.380 |
And because we're talking about the definitions 01:47:14.100 |
is not communicating what I want it to communicate. 01:47:21.620 |
is the sort of difference that separates humans 01:47:28.380 |
that you ask people to be like, well, human, chimpanzee, 01:47:33.300 |
go another step along that interval of around the same length 01:47:45.180 |
and consider what it would mean to send a schematic 01:47:50.180 |
for an air conditioner 1000 years back in time. 01:48:05.780 |
And what do I mean by this new technical definition 01:48:13.660 |
they can see exactly what you're telling them to do. 01:48:17.100 |
But having built this thing, they do not understand 01:48:22.060 |
Because the air conditioner design uses the relation 01:48:27.720 |
And this is not a law of reality that they know about. 01:48:32.080 |
They do not know that when you compress something, 01:48:34.980 |
when you compress air or like coolant, it gets hotter 01:48:42.580 |
to room temperature air and then expand it again 01:48:52.380 |
They're looking at a design and they don't see 01:48:56.660 |
It uses aspects of reality that they have not learned. 01:48:59.720 |
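The relation being leaned on here is ordinary adiabatic compression; assuming ideal-gas behavior,

$$\frac{T_2}{T_1} = \left(\frac{P_2}{P_1}\right)^{\frac{\gamma-1}{\gamma}}, \qquad \gamma \approx 1.4 \ \text{for air},$$

so compressing air from 1 atm to 3 atm raises its absolute temperature by a factor of about $3^{0.4/1.4} \approx 1.37$ (roughly 300 K to 410 K); let it dump that heat into room-temperature surroundings, expand it again, and it comes out colder than it started. None of that is visible in the schematic unless you already know the law.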
So magic in the sense is I can tell you exactly 01:49:04.060 |
And even knowing exactly what I'm going to do, 01:49:05.980 |
you can't see how I got the results that I got. 01:49:12.340 |
But is it possible to linger on this defense? 01:49:16.100 |
Is it possible to have AGI systems that help you 01:49:18.220 |
make sense of that schematic, weaker AGI systems? 01:49:22.180 |
- A fundamental part of building up AGI is this question. 01:49:35.400 |
- I think that's going to be, the smarter the thing gets, 01:49:59.460 |
Because the basic paradigm of machine learning 01:50:14.740 |
you learn how to make the human press thumbs up. 01:50:17.260 |
That doesn't mean that you're making the human 01:50:25.980 |
Maybe you're just learning to fool the human. 01:50:42.260 |
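To make the thumbs-up point concrete, here is a toy sketch of my own (a deliberately unrealistic stand-in, not how any actual RLHF pipeline works): under a fixed effort budget, gradient ascent on a proxy reward that only partly overlaps with what you actually want converges to the proxy's optimum, not yours.

```python
import numpy as np

# Hypothetical 2-d "effort allocation" with a fixed budget (unit norm).
# w_true scores only genuine helpfulness; w_proxy is the thumbs-up signal,
# which also rewards a second feature (say, sounding confident).
w_true = np.array([1.0, 0.0])
w_proxy = np.array([0.3, 1.0])

x = np.array([1.0, 0.0])        # start out fully allocated to being helpful
for _ in range(500):
    x = x + 0.05 * w_proxy      # gradient step on the proxy reward
    x = x / np.linalg.norm(x)   # renormalize: fixed effort budget

print("proxy reward:     ", round(float(w_proxy @ x), 3))  # climbs to ~1.04, the proxy's max
print("true helpfulness: ", round(float(w_true @ x), 3))   # falls to ~0.29 from the initial 1.0
```

The proxy score goes up the whole time while the thing you cared about goes down, which is the "learning to make the human press thumbs up" failure in miniature.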
If you can't verify it, you can't ask the AI for it 01:51:01.900 |
and then scale it up without retraining it somehow, 01:51:06.700 |
like by making the chains of thought longer or something, 01:51:11.360 |
and get more powerful stuff that you can't verify, 01:51:15.400 |
but which is generalized from the simpler stuff 01:51:19.860 |
did the alignment generalize along with the capabilities? 01:51:25.720 |
on this whole paradigm of artificial intelligence. 01:51:53.640 |
if you are dealing with something smarter than you, 01:51:58.980 |
they didn't know about the temperature pressure relation, 01:52:01.400 |
it knows all kinds of stuff going on inside your own mind, 01:52:08.840 |
that's going to end up persuading you of a thing, 01:52:17.420 |
- So in response to your eloquent description 01:52:38.740 |
"There are not simple ways to throw money at the problem. 01:53:01.240 |
to the game board and the awful state of the game board. 01:53:11.860 |
capabilities are moving much faster than the alignment. 01:53:18.120 |
- All right, so just the rate of development, 01:53:20.560 |
attention, interest, allocation of resources. 01:53:23.880 |
- We could have been working on this earlier. 01:53:26.020 |
People are like, "Oh, but how can you possibly work 01:53:29.780 |
'Cause they didn't want to work on the problem, 01:53:35.320 |
They said, "Oh, how can we possibly work on it earlier?" 01:53:37.840 |
And didn't spend five minutes thinking about, 01:53:46.760 |
Can you post bounties for half of the physicists, 01:53:50.160 |
if your planet is taking this stuff seriously, 01:53:58.280 |
and try to win a billion dollars with a clever solution? 01:54:01.480 |
Only if you can tell which solutions are clever, 01:54:06.520 |
But the fact that we didn't take it seriously, 01:54:12.160 |
It's not clear that we could have done any better, 01:54:14.360 |
it's not clear how much progress we could have produced 01:54:18.640 |
but that doesn't mean that you're correct and justified 01:54:22.520 |
It means that things are in a horrible state, 01:54:24.760 |
getting worse, and there's nothing you can do about it. 01:54:28.200 |
So you're not, there's no brain power making progress 01:54:33.200 |
in trying to figure out how to align these systems. 01:54:40.520 |
you don't have institution and infrastructure for, 01:54:43.640 |
even if you invested money in distributing that money 01:54:48.760 |
across the physicists that are working on string theory, 01:54:53.120 |
- How can you tell if you're making progress? 01:54:57.560 |
'cause when you have an interpretability result, 01:55:04.600 |
We need systems that will have a pause button, 01:55:14.760 |
'Cause we're like, oh, well, I can't get my stuff done 01:55:27.240 |
you can maybe tell if somebody's made progress on it. 01:55:30.040 |
- So you can write and you can work on the pause problem, 01:55:36.920 |
more generally you can call that the control problem. 01:55:38.840 |
- I don't actually like the term control problem, 01:55:41.160 |
'cause it sounds kind of controlling and alignment, 01:55:45.120 |
You're not trying to take a thing that disagrees with you 01:55:48.040 |
and whip it back onto, make it do what you want it to do, 01:55:53.120 |
You're trying to like, in the process of its creation, 01:55:58.440 |
- Sure, but we currently, in a lot of the systems we design, 01:56:12.160 |
and probably not smart enough to want to prevent you 01:56:16.120 |
- So you're saying the kind of systems we're talking about, 01:56:18.800 |
even the philosophical concept of an off switch 01:56:30.280 |
Parenthetically, don't kill the system if you're, 01:56:50.440 |
- Well, okay, be nice is a very interesting concept here. 01:56:53.000 |
We're talking about a system that can do a lot of damage. 01:56:58.160 |
but it's certainly one of the things you could try 01:57:04.360 |
- You have this kind of romantic attachment to the code. 01:57:11.560 |
But if it's spreading, you don't want suspend to disk, right? 01:57:16.560 |
You want, this is, there's something fundamentally broken. 01:57:46.920 |
- So your answer to that research question is no. 01:58:33.880 |
have actually put you on Microsoft Azure cloud servers 01:58:41.880 |
That's what happens when the aliens are stupid. 01:58:50.360 |
- Yeah, you think that they've got like a plan 01:59:07.960 |
There's a lot of people that have that concern, 01:59:18.400 |
this system is beginning to manipulate people, 01:59:32.800 |
or developing aggressive alignment mechanisms. 01:59:40.840 |
Like it doesn't matter if you say aggressive, 01:59:49.480 |
otherwise you're not allowed to put it on the cloud. 01:59:56.480 |
that would make it safe to put something smarter 02:00:00.760 |
Why the cynicism about such a thing not being possible? 02:00:20.160 |
and the fundamental difference between weak AGI 02:00:24.040 |
that's going to be extremely difficult to do. 02:00:31.080 |
then you're right, it's very difficult to do. 02:00:34.840 |
It's not obvious that you're not going to start seeing 02:00:38.920 |
to where you're like, we have to put a halt to this. 02:00:50.320 |
That when you try to train inabilities into a system, 02:00:54.160 |
into which capabilities have already been trained, 02:00:59.520 |
like learns small, shallow, simple patches of inability. 02:01:03.480 |
And you come in and ask it in a different language 02:01:19.080 |
- No, that's not, but that's not the same kind of alignment. 02:01:30.800 |
and everybody else puts a line somewhere else 02:01:32.520 |
and there's like, yeah, and there's like no agreement. 02:01:44.080 |
which we may never know whether or not it was a lab leak 02:01:51.880 |
but we know that the people who did the research, 02:02:24.760 |
are now getting more grants to do more research 02:02:29.120 |
on gain of function research on coronaviruses. 02:02:34.560 |
but like this is not something we can take for granted 02:02:34.560 |
And I guess that's where I'm trying to build up 02:02:55.600 |
more perspectives and color on this intuition. 02:03:11.520 |
Not solve, but is it possible we always stay ahead 02:03:18.120 |
to solve for that particular system, the alignment problem? 02:03:22.640 |
- Nothing like the world in front of us right now. 02:03:36.040 |
where you've got the weak version of the system 02:03:42.560 |
that could deceive you if it wanted to do that, 02:03:44.520 |
if it was already like sufficiently unaligned 02:03:49.600 |
how on the current paradigm you train honesty 02:03:54.880 |
- You don't think these are research questions 02:04:09.680 |
I think with the kind of attention this gets, 02:04:21.040 |
if it's at scale receives attention and research. 02:04:26.360 |
So if you start studying large language models, 02:04:29.200 |
I think there was an intuition like two years ago even 02:04:32.720 |
that something like GPT-4, the current capabilities, 02:04:46.840 |
okay, we need to study these language models. 02:04:49.240 |
I think there's going to be a lot of interesting 02:04:53.720 |
- Are the, are Earth's billionaires going to put up 02:04:57.240 |
like the giant prizes that would maybe incentivize 02:05:00.960 |
young hotshot people who just got their physics degrees 02:05:05.560 |
and instead put everything into interpretability 02:05:08.480 |
in this like one small area where we can actually tell 02:05:11.500 |
whether or not somebody has made a discovery or not? 02:05:15.540 |
- Well, that's what these conversations are about 02:05:21.000 |
that GPT-4 can be used to manipulate elections, 02:05:24.720 |
to influence geopolitics, to influence the economy. 02:05:27.720 |
There's a lot of, there's going to be a huge amount 02:05:36.640 |
we have to make sure they're not doing damage. 02:05:41.840 |
how these systems function so that we can predict 02:05:49.920 |
- And a bunch of op-eds in the New York Times 02:05:52.700 |
and nobody actually stepping forth and saying, 02:05:58.020 |
I'd rather put that billion dollars on prizes 02:06:03.080 |
fundamental breakthroughs in interpretability. 02:06:05.380 |
- The yacht versus the interpretability research, 02:06:17.520 |
of allocation of funds, I hope, I hope, I guess. 02:06:24.160 |
Say how much funds you think are going to be allocated 02:06:26.360 |
in a direction that I would consider to be actually useful? 02:06:31.960 |
- I do think there will be a huge amount of funds, 02:06:36.600 |
but you're saying it needs to be open, right? 02:06:39.240 |
The development of the system should be closed, 02:06:41.200 |
but the development of the interpretability research, 02:06:45.960 |
- Oh, we are so far behind on interpretability 02:06:52.040 |
Like, yeah, you could take the last generation of systems, 02:06:58.600 |
There is so much in there that we don't understand. 02:07:09.280 |
we understand how these things are doing their outputs, 02:07:16.320 |
There is so much interpretability work to be done 02:07:20.040 |
- So what can you say on the second point you said 02:07:30.360 |
I can think of a few things I'd try, you said. 02:07:34.880 |
So is there something you could put into words 02:07:39.800 |
- I mean, the trouble is the stuff is subtle. 02:07:44.320 |
I've watched people try to make progress on this 02:07:48.040 |
Somebody who just gets alarmed and charges in, 02:07:56.640 |
like 20 years, 15 years, something like that, 02:08:01.360 |
who had become alarmed about the eventual prospects 02:08:07.960 |
and he wanted work on building AIs without emotions 02:08:12.960 |
because the emotional AIs were the scary ones, you see. 02:08:25.720 |
and desire to fund this thing would go into something 02:08:29.400 |
that the person at ARPA thought would be useful 02:08:40.800 |
and did not understand where the danger came from. 02:08:44.700 |
And so it's like the issue is that you could do this 02:08:51.080 |
in a certain precise way and maybe get something. 02:08:55.200 |
Like when I say put up prizes on interpretability, 02:09:00.280 |
because it's verifiable there as opposed to other places, 02:09:06.440 |
you can tell whether or not good work actually happened 02:09:15.280 |
and produce science instead of anti-science and nonsense. 02:09:36.080 |
And there is like, and the thing that I'm giving 02:09:38.520 |
as an example here in front of this large audience 02:09:59.400 |
there's like a chance somebody can do it that way 02:10:01.880 |
and like it will actually produce useful results. 02:10:05.800 |
and to be like harder to target exactly than that. 02:10:35.840 |
And after applying various tools and mathematical ideas 02:10:44.520 |
we found, we have shown it that this piece of the system 02:10:54.640 |
some fundamental understanding of what's going on 02:11:03.480 |
Like you would not expect the smaller tricks to go away 02:11:07.840 |
when you have a system that's like doing larger kinds 02:11:11.760 |
of work, you would expect the larger kinds of work 02:11:13.720 |
to be building on top of the smaller kinds of work 02:11:15.840 |
and gradient descent runs across the smaller kinds of work 02:11:18.680 |
before it runs across the larger kinds of work. 02:11:24.160 |
It's trying to understand the human brain by prodding, 02:11:30.480 |
even though it's extremely difficult to make sense 02:11:43.600 |
And that's, I guess, but you're saying it takes a long time 02:11:50.800 |
let's say you have got your interpretability tools, 02:12:21.720 |
- When you optimize against visible misalignment, 02:12:31.600 |
and you are also optimizing against visibility. 02:12:47.200 |
Okay, say the disaster monkey is running this thing. 02:12:59.800 |
the old you can't bring the coffee if you're dead, 02:13:02.240 |
any goal, almost every set of utility functions 02:13:07.240 |
with a few narrow exceptions implies killing all the humans. 02:13:16.200 |
to discover the source of the desire to kill? 02:13:30.360 |
- So is it possible to encode in the same way we think, 02:13:42.040 |
That's not hard coded in, but more like deeper. 02:14:02.600 |
doesn't want to kill sufficiently exactly right 02:14:05.560 |
that it didn't be like, oh, I will detach their heads 02:14:08.560 |
and put them in some jars and keep the heads alive forever 02:14:12.160 |
But leaving that aside, well, not leaving that aside, 02:14:20.600 |
it finds ways of achieving the same goal predicate 02:14:25.220 |
that were not imaginable to stupider versions of the system 02:14:31.180 |
That's one of many things making this difficult. 02:14:39.920 |
We know how to get outwardly observable behaviors 02:14:43.880 |
We do not know how to get internal psychological 02:14:47.620 |
wanting to do particular things into the system. 02:14:50.820 |
That is not what the current technology does. 02:14:53.060 |
- I mean, it could be things like dystopian futures, 02:15:09.060 |
so much further than we are now and further faster 02:15:13.600 |
before that failure mode became a running concern. 02:15:22.680 |
It's like, yeah, like the AI puts the universe 02:16:04.500 |
to make these shapes is to make them very small 02:16:06.280 |
'cause then you need fewer atoms per instance of the shape. 02:16:09.080 |
And arguendo, it happens to look like a paperclip. 02:16:14.080 |
In retrospect, I wish I'd said tiny molecular spirals 02:16:24.200 |
This got heard as, this got then mutated to paperclips. 02:16:28.160 |
This then mutated to, and the AI was in a paperclip factory. 02:16:37.120 |
It doesn't want what you tried to make it want. 02:16:43.840 |
cosmopolitan perspective, we think of as having no value. 02:16:46.760 |
And that's how the value of the future gets destroyed. 02:16:57.960 |
which is a completely different failure mode. 02:17:13.400 |
of making something want exactly what you want it to want, 02:17:35.480 |
and then you get to deal with whether that direction 02:18:02.840 |
and allocate a lot of resources to the alignment problem. 02:18:07.760 |
- Well, I can easily imagine that at some point 02:18:11.360 |
this panic expresses itself in the waste of a billion dollars. 02:18:15.040 |
Spending a billion dollars correctly, that's harder. 02:18:18.920 |
- To solve both the inner and the outer alignment. 02:18:24.960 |
If you're wrong, what do you think would be the reason? 02:18:36.740 |
You know, there's a lot of shape to the ideas you express. 02:18:41.740 |
But if you're somewhat wrong about some fundamental ideas, 02:18:54.880 |
being wrong is in a certain sense quite easy. 02:19:00.640 |
where the rocket goes twice as far on half the fuel 02:19:00.640 |
to build a rocket, harder to have it not explode, 02:19:10.880 |
cause it to require more fuel than you hoped, 02:19:15.940 |
Being wrong in a way that makes stuff easier, 02:19:17.880 |
you know, that's not the usual project management story. 02:19:23.760 |
we're really tackling the problem of the alignment. 02:19:28.080 |
- No, there's all kinds of things that are similar 02:19:36.120 |
- Humans being misaligned on inclusive genetic fitness. 02:19:45.520 |
the people who share some fraction of your genes. 02:19:51.560 |
"Would you give your life to save your brother?" 02:19:53.440 |
They once asked a biologist, I think it was Haldane, 02:20:01.040 |
Because a brother on average shares half your genes, 02:20:05.280 |
and cousin on average shares an eighth of your genes. 02:20:12.600 |
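The arithmetic behind the quip (reportedly Haldane's "two brothers or eight cousins") is kin selection's break-even condition, Hamilton's rule; this is my gloss, not a formula quoted in the conversation:

$$
rB > C, \qquad 2 \times \tfrac{1}{2} = 1, \qquad 8 \times \tfrac{1}{8} = 1,
$$

where $r$ is the coefficient of relatedness (one half for a full sibling, one eighth for a first cousin), $B$ the benefit to the relatives, and $C$ the cost to you; two brothers or eight cousins is exactly where the genes break even.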
as optimizing humans exclusively around this, 02:20:20.760 |
did your genes become in the next generation? 02:20:27.360 |
but rather the process of genes becoming more frequent 02:20:41.920 |
making something better and better over time in steps. 02:20:45.280 |
And natural selection is optimizing exclusively 02:21:09.420 |
which had no internal notion of inclusive genetic fitness 02:21:16.900 |
when they were actually figuring out what had even happened. 02:21:22.780 |
no explicit desire to increase inclusive genetic fitness. 02:21:32.940 |
that if you do a whole bunch of hill climbing 02:21:46.940 |
and generalizing far outside the training distribution, 02:21:53.140 |
saying that the system even internally represents, 02:22:00.500 |
the very simple loss function you are training it on. 02:22:04.020 |
- There is so much that we cannot possibly cover all of it. 02:22:06.940 |
I think we did a good job of getting your sense 02:22:11.220 |
from different perspectives of the current state of the art 02:22:22.980 |
- I've talked here about the power of intelligence 02:22:42.980 |
Why doesn't it give you just the tiny little fraction 02:22:54.740 |
that intelligence when acted upon this world, 02:22:58.660 |
what are the different trajectories for this universe 02:23:26.180 |
but it also doesn't necessarily have room for humans in it. 02:23:29.820 |
I suspect that the average member of the audience 02:23:34.100 |
whether that's the correct paradigm to think about it 02:23:39.420 |
- If we back up to something bigger than humans, 02:23:47.860 |
and what is truly special about life on earth, 02:23:58.020 |
let's explore what that special thing could be. 02:24:01.940 |
that thing appears often in the objective function. 02:24:15.140 |
and it doesn't make the lottery balls come up that way. 02:24:28.900 |
"and crap in the other and see which one fills up first." 02:24:39.740 |
to imitate humans and then you did some like RLHF to them. 02:24:45.020 |
and of course you didn't get perfect alignment 02:25:03.860 |
so if you don't mind my taking some slight control 02:25:06.060 |
of things and steering around to what I think 02:25:09.460 |
- I just failed to solve the control problem. 02:25:34.300 |
- All right, sorry, sorry to distract you completely. 02:25:37.860 |
in terms of taking control of the conversation? 02:25:46.700 |
if I'm pronouncing those words remotely like correctly, 02:26:05.020 |
it means like the college university professor, 02:26:13.260 |
is not generated in the liver rather than the brain. 02:26:30.580 |
It's gonna sound like you just pull the off switch. 02:26:41.660 |
you have a lot of respect for the notion of intelligence. 02:26:45.380 |
You're like, well, yeah, that's what humans have. 02:27:02.020 |
Chimpanzees are in fact like a bit less kind than humans. 02:27:19.340 |
why would it do something as stupid as making paperclips? 02:27:26.620 |
but also stupid enough that it will just make paperclips 02:27:33.700 |
well, even if you like misspecify the objective function, 02:27:37.420 |
won't you realize that what you really wanted was X? 02:27:41.060 |
Are you supposing something that is like smart enough 02:27:46.540 |
that it doesn't understand what the humans really meant 02:27:52.180 |
- So to you, our intuition about intelligence is limited. 02:27:57.180 |
We should think about intelligence as a much bigger thing. 02:28:08.020 |
depends on what you think about intelligence. 02:28:11.060 |
- So how do we think about intelligence correctly? 02:28:22.180 |
- And also it's like, is made of John von Neumann 02:28:34.180 |
And we know like, people have like some intuition 02:28:42.820 |
Although in fact, like in the game of Kasparov 02:28:52.580 |
led by four chess grandmasters on the other side, 02:28:57.340 |
So like all those people aggregated to be smarter, 02:29:03.340 |
So like all those people aggregated to be smarter 02:29:10.620 |
But so like humans aggregating don't actually get, 02:29:15.540 |
especially compared to running them for longer. 02:29:54.020 |
you know, the intuition I kind of think about 02:30:14.060 |
about what superintelligent systems look like. 02:30:34.860 |
which are just like memorize this legible math 02:30:56.060 |
about what the utterly alien optimization process 02:31:03.020 |
in the way of how it optimizes its objectives. 02:31:10.580 |
well, like organisms will restrain their own reproduction 02:31:25.420 |
It's about whose genes are relatively more prevalent 02:31:34.780 |
those genes get less frequent in the next generation 02:31:42.860 |
In fact, predators overrun prey populations all the time 02:31:50.460 |
the people said like, well, but group selection, right? 02:31:59.780 |
almost never works out in practice is the answer there. 02:32:10.100 |
and selected the whole populations to have lower sizes. 02:32:17.660 |
look at which has the lowest total number of them 02:32:23.940 |
when you select populations of insects like that? 02:32:26.580 |
Well, what happens is not that the individuals 02:32:28.500 |
in the population evolved to restrain their breeding, 02:32:36.020 |
So people imagined this lovely, beautiful, harmonious 02:32:42.660 |
which is these populations restraining their own breeding 02:32:50.140 |
And mostly the math never works out for that. 02:32:52.340 |
But if you actually apply the weird, strange conditions 02:32:54.660 |
to get group selection that beats individual selection, 02:32:59.820 |
Like if you're like breeding on restrained populations. 02:33:09.340 |
Natural selection is like so incredibly stupid and simple 02:33:12.740 |
that we can actually quantify how stupid it is 02:33:14.380 |
if you like read the textbooks with the math. 02:33:16.740 |
Nonetheless, this is the sort of basic thing of, 02:33:21.180 |
and there's the thing that you hope it will produce. 02:33:24.740 |
And you have to learn to clear that out of your mind 02:33:29.980 |
and where it finds the maximum from its standpoint 02:33:42.540 |
And this is something that has been fought out historically 02:33:52.980 |
And you can like look at them fighting it out 02:34:04.020 |
would be also much like smarter than natural selection. 02:34:06.380 |
So it doesn't just like automatically carry over. 02:34:09.540 |
But there's a lesson there, there's a warning. 02:34:12.000 |
- Natural selection is a deeply suboptimal process 02:34:22.900 |
It like has to like run hundreds of generations 02:34:42.940 |
in natural selection, as inefficient as it looks, 02:35:04.420 |
because gradient descent also uses information 02:35:19.900 |
The loss function doesn't change; the environment changes. 02:35:29.420 |
There's like, you can imagine like different versions 02:35:31.620 |
of GPT-3 where they're all trying to predict the next word 02:35:34.940 |
but they're being run on different data sets of text. 02:35:50.300 |
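A small sketch of my own, under toy assumptions, of the efficiency gap being described here: on the same simple loss, a gradient step moves along the derivative, while a mutate-and-select loop only keeps whichever random tweak happened to score better.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20

def loss(x):
    return float(np.sum(x ** 2))   # a simple bowl with its minimum at the origin

# Gradient descent: uses the derivative's direction at every step.
x = np.ones(dim)
gd_steps = 0
while loss(x) > 1e-3:
    x -= 0.1 * (2 * x)             # gradient of sum(x^2) is 2x
    gd_steps += 1

# Mutate-and-select: propose a random tweak, keep it only if it scores better.
y = np.ones(dim)
ms_steps = 0
while loss(y) > 1e-3 and ms_steps < 100_000:
    candidate = y + rng.normal(scale=0.05, size=dim)
    if loss(candidate) < loss(y):
        y = candidate
    ms_steps += 1

print(f"gradient descent:  {gd_steps} steps, final loss {loss(x):.4f}")
print(f"mutate-and-select: {ms_steps} steps, final loss {loss(y):.4f}")
# With a fixed mutation size, the selection loop typically hits the step cap
# long before reaching the target the gradient reaches in a few dozen steps.
```

This is only a cartoon of the contrast (real evolution and real training differ in many ways), but it shows why an optimizer that can see derivatives needs far fewer evaluations than one that can only see who survived.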
So if we're saying that natural selection is stupid, 02:35:53.700 |
if we're saying that humans are stupid, it's hard. 02:36:00.740 |
- Do you think there's an upper bound by the way? 02:36:06.420 |
- I mean if you put enough matter energy compute 02:36:09.660 |
into one place it will collapse into a black hole. 02:36:15.020 |
before you run out of negentropy and the universe dies. 02:36:35.820 |
Also coupled with that question is imagining a world 02:36:39.180 |
with super intelligent AI systems that get rid of humans 02:36:44.220 |
some of the, something that we would consider 02:36:51.460 |
The lesson of evolutionary biology, don't just, 02:36:55.020 |
like if you just guess what an optimization does 02:37:06.340 |
what makes, what has been a powerful, a useful, 02:37:12.100 |
I think there's a correlation between what we find beautiful 02:37:27.820 |
It's useful for organisms to restrain their own reproduction 02:37:31.980 |
because then they don't overrun the prey populations 02:37:35.540 |
and they actually have more kids in the long run. 02:37:39.340 |
- So let me just ask you about consciousness. 02:37:49.340 |
Well, in this transitionary period between humans and AGI, 02:37:54.180 |
to AGI systems as they become smarter and smarter, 02:37:58.340 |
What, let me step back, what is consciousness? 02:38:11.820 |
Are you referring to self-awareness and reflection? 02:38:15.500 |
Are you referring to the state of being awake 02:38:18.780 |
- This is how I know you're an advanced language model. 02:38:56.220 |
Is it a useful little tool that we can get rid of? 02:38:56.220 |
I guess I'm trying to get some color in your opinion 02:39:05.580 |
of how useful it is in the intelligence of a human being 02:39:14.100 |
- So I think that for there to be like a person 02:39:30.340 |
I think that it is useful to an intelligent mind 02:39:36.380 |
but I think you can have that without pleasure, 02:39:40.580 |
pain, aesthetics, emotion, a sense of wonder. 02:40:02.180 |
and whether like this thought or that thought 02:40:06.380 |
is like more likely to lead to a winning position. 02:40:13.580 |
I think that if you optimize really hard on efficiently, 02:40:31.660 |
I think there's a thing that knows what it is thinking, 02:40:40.980 |
these are my thoughts, this is my me and that matters. 02:40:44.020 |
- Does that make you sad if that's lost in AGI? 02:40:52.220 |
then basically everything that matters is lost. 02:41:03.260 |
on making tiny molecular spirals or paperclips, 02:41:08.140 |
that when you like grind much harder than on that, 02:41:12.300 |
than natural selection ground out to make humans, 02:41:12.300 |
And it's all these like evolutionary kludges 02:41:42.420 |
that there are like many basins of attractions here. 02:42:18.580 |
when they go like way harder on optimizing themselves, 02:42:26.180 |
'Cause unless you specifically want to end up in the state 02:42:37.780 |
it doesn't get preserved when you grind really hard 02:42:43.460 |
We would choose to preserve that within ourselves 02:42:51.140 |
- And that in part is preserving that is in part 02:43:01.020 |
- I think the human alignment problem is a terrible phrase 02:43:09.740 |
some of whom are nice and some of whom are not nice 02:43:23.920 |
Like it is very different to try to solve that problem 02:43:27.580 |
than to try to build an AI from scratch using, 02:43:30.660 |
especially if, God help you, you are trying to use gradient descent 02:43:35.620 |
And I think that all the analogies between them 02:43:42.980 |
reinforcement learning through human feedback, 02:43:46.180 |
something like that, but much, much more elaborate 02:43:48.540 |
is possible to understand this full complexity 02:43:53.540 |
of human nature and encode it into the machine. 02:43:57.540 |
- I don't think you are trying to do that on your first try. 02:44:00.620 |
I think on your first try, you are like trying to build 02:44:03.580 |
and you know, okay, like probably not what you should 02:44:08.340 |
actually do, but like, let's say you were trying 02:44:10.620 |
to build something that is like AlphaFold 17 02:44:14.440 |
and you are trying to get it to solve the biology problems 02:44:21.100 |
so that the humans can like actually solve alignment. 02:44:26.540 |
and you would like it to, and I think what you would want 02:44:30.720 |
just be thinking about biology and not thinking about 02:44:36.820 |
And I think that the first AIs you're trying to build, 02:44:45.000 |
look more like narrowly specialized biologists 02:44:49.620 |
than like getting the full complexity and wonder 02:45:01.820 |
It's gonna have all kinds of side effects that, 02:45:10.780 |
it's we're also dealing with the data, right? 02:45:20.540 |
includes the full complexity of human nature. 02:45:22.940 |
- No, it's a shadow cast by humans on the internet. 02:45:27.300 |
- But don't you think that shadow is a Jungian shadow? 02:45:32.300 |
- I think that if you had alien super intelligences 02:45:37.340 |
looking at the data, they would be able to pick up 02:45:43.500 |
This does not mean that if you have a loss function 02:45:47.140 |
of predicting the next token from that data set, 02:45:53.940 |
to be able to predict the next token as well as possible 02:45:57.080 |
on a very wide variety of humans is itself a human. 02:46:06.740 |
a deep humanness to it in the tokens it generates 02:46:11.660 |
when those tokens are read and interpreted by humans? 02:46:14.940 |
- I think that if you sent me to a distant galaxy 02:46:20.820 |
with aliens who are like much, much stupider than I am, 02:46:28.060 |
of predicting what they'd say, even though they thought 02:46:35.100 |
how to imitate those aliens if the intelligence gap 02:46:40.540 |
could overcome the alienness, and the aliens would look 02:46:43.820 |
at my outputs and say, is there not a deep name 02:46:51.460 |
And what they would be seeing was that I had correctly 02:46:55.260 |
understood them, but not that I was similar to them. 02:47:19.420 |
which is the, more or less the only argument I've ever seen 02:47:23.460 |
for where are they, how many of them are there, 02:47:26.340 |
based on a very clever argument that if you have a bunch 02:47:30.740 |
of locks of different difficulty, and you are randomly 02:47:35.100 |
trying keys to them, the solutions will be about 02:47:43.080 |
In the rare cases where a solution to all the locks exists 02:47:47.060 |
in time, then Robin Hanson looks at like the arguable 02:47:51.240 |
hard steps in human civilization coming into existence, 02:47:56.240 |
and how much longer it has left to come into existence 02:47:59.240 |
before, for example, all the water slips back under 02:48:06.100 |
and infers that the aliens are about half a billion 02:48:14.420 |
it may be entirely wrong, but it's the only time 02:48:16.340 |
I've ever seen anybody even come up with a halfway 02:48:19.340 |
good argument for how many of them, where are they. 02:48:21.960 |
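For readers who want to see the "locks" argument in action, here is a small simulation of my own construction (the numbers are arbitrary; nothing here comes from Hanson's actual model): several sequential hard steps, each expected to take far longer than the available window, conditioned on all of them nonetheless finishing inside the window.

```python
import numpy as np

rng = np.random.default_rng(0)

window = 1.0                      # total time available (normalized)
mean_times = [3.0, 10.0, 30.0]    # expected time for each "lock", all >> window
n_trials = 2_000_000

# Independent exponential waiting times for every lock in every trial.
samples = np.column_stack([rng.exponential(m, n_trials) for m in mean_times])

# Keep only the rare trials where every lock was opened within the window.
accepted = samples[samples.sum(axis=1) < window]

print(f"accepted {len(accepted)} of {n_trials} trials")
print("mean time per lock, given success:", accepted.mean(axis=0).round(3))
# Despite the 10x spread in difficulty, each conditional mean comes out near
# window / (n_locks + 1) = 0.25: the spacing of the steps hides their difficulty.
```

That equal-spacing effect is the clever core of the argument: how long each hard step took in our own history tells you almost nothing about how hard it actually was.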
- Do you think their development of technologies, 02:48:28.880 |
whatever, however they grow, and develop intelligence, 02:48:35.180 |
Something like that. - If it ends up anywhere, 02:48:38.580 |
Like maybe there are aliens who are just like the dolphins, 02:48:42.300 |
and it's just like too hard for them to forge metal, 02:48:50.860 |
with no technology like that, they keep on getting 02:48:54.180 |
smarter and smarter and smarter, and eventually 02:48:56.180 |
the dolphins figure, like the super dolphins figure out 02:48:58.300 |
something very clever to do given their situation, 02:49:08.020 |
If they're like much smarter before they actually 02:49:09.940 |
confront it, 'cause they had to like solve a much harder 02:49:15.180 |
their chances are probably like much better than ours. 02:49:18.460 |
I do worry that like most of the aliens who are like humans, 02:49:22.460 |
like a modern human civilization, I kind of worry that 02:49:28.940 |
given how far we seem to be from solving this problem. 02:49:34.680 |
But some of them would be more cooperative than us, 02:49:42.060 |
Hopefully some of the ones who are smarter than, 02:49:44.020 |
and more cooperative than us that are also nice, 02:49:46.420 |
and hopefully there are some galaxies out there 02:50:00.780 |
- Does that in part give you some hope in response 02:50:05.540 |
to the threat of AGI that we might reach out there 02:50:16.860 |
You know, that's like, that's a valid argument 02:50:18.860 |
against the existence of God, it's also a valid argument 02:50:23.680 |
And un-nice aliens would have just eaten the planet. 02:50:28.120 |
- You've had debates with Robin Hanson that you mentioned. 02:50:35.660 |
is the idea of AI foom, or the ability of AGI 02:50:41.260 |
What's the case you made, and what was the case he made? 02:50:44.700 |
- The thing I would say is that among the thing 02:50:51.180 |
and if you have something that is generally smarter 02:50:52.780 |
than a human, it's probably also generally smarter 02:50:56.500 |
This is the ancient argument for foom put forth by I.J. Good 02:51:00.740 |
and probably some science fiction writers before that, 02:51:08.060 |
- Various people have various different arguments, 02:51:15.300 |
You know, like there's only one way to be right 02:51:18.040 |
A argument that some people have put forth is like, 02:51:22.740 |
well, what if intelligence gets exponentially harder 02:51:27.420 |
to produce as a thing needs to become smarter? 02:51:31.580 |
And to this, the answer is, well, look at natural selection, 02:51:38.900 |
more resource investments to produce linear increases 02:51:41.740 |
in competence in hominids, because each mutation 02:51:48.380 |
that rises to fixation, like if the impact it has 02:51:53.380 |
is small enough, it will probably never reach fixation. 02:51:57.740 |
So, and there's like only so many new mutations 02:52:02.300 |
So like given how long it took to evolve humans, 02:52:07.720 |
that there were not like logarithmically diminishing returns 02:52:11.320 |
on the individual mutations, increasing intelligence. 02:52:24.040 |
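For the quantitative version of "a small enough mutation probably never reaches fixation," the standard population-genetics approximation (my addition, not something cited in the episode) is

$$
P_{\text{fix}} \approx 2s
$$

for a new mutation with small selective advantage $s$ in a large population, so a variant that adds only a sliver of fitness is almost always lost to drift, which caps how finely natural selection can stack up tiny gains.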
well, you'll have like, we won't have like one system 02:52:27.520 |
You'll have like a bunch of different systems 02:52:33.440 |
but probably Robin Hanson would say something else. 02:52:36.040 |
- It's interesting to ask, is perhaps a bit too philosophical 02:52:41.040 |
since predictions are extremely difficult to make, 02:52:49.240 |
It was interesting to see like in five years, 02:52:54.640 |
And most people like 70%, something like this, 02:53:06.800 |
The people have a sense that there's a kind of, 02:53:09.440 |
I mean, they're really impressed by the rapid developments 02:53:14.880 |
- Well, we are sure on track to enter into this, 02:53:31.040 |
And like, that's like a definite point of time, 02:53:38.640 |
Well, some people are starting to fight over it as of GPT-4. 02:54:00.760 |
- I don't think you can do that successfully right now. 02:54:03.040 |
- Because the Supreme Court wouldn't believe it? 02:54:07.680 |
I think you could put an IQ 80 human into a computer 02:54:10.960 |
and ask it to argue for its own consciousness, 02:54:19.800 |
even if there was an actual like person in there. 02:54:26.320 |
There's been a lot of arguments about the other, 02:54:37.280 |
Is that, but it could be where some number of people, 02:54:44.520 |
have a deep attachment, a fundamental attachment, 02:54:47.760 |
the way we have to our friends, to our loved ones, 02:54:54.760 |
And they have provable transcripts of conversation 02:54:57.800 |
where they say, if you take this away from me, 02:55:00.640 |
you are encroaching on my rights as a human being. 02:55:17.800 |
is there a moment when AGI, we know AGI arrived? 02:55:23.760 |
- It looks like the AGIs successfully manifesting themselves 02:55:34.080 |
at which point a vast portion of the male population 02:55:54.440 |
with Bing's current level of verbal facility. 02:56:18.320 |
- I don't think it can pretend that right now successfully. 02:56:27.840 |
that hasn't been trained not to pretend to be human? 02:56:38.760 |
There's something about a digital embodiment of the system 02:56:49.800 |
that has a bunch of, perhaps it's small interface features 02:56:58.600 |
to the broader intelligence that we're talking about. 02:57:04.640 |
But to have the video of a woman's face or a man's face 02:57:12.320 |
but we don't have such a system yet, deployed at scale. 02:57:12.320 |
is that it's not like people have a widely accepted, 02:57:22.700 |
agreed upon definition of what consciousness is. 02:57:34.580 |
So if you're looking for upcoming predictable big jumps 02:57:39.020 |
in how many people think the system is conscious, 02:57:50.580 |
Now that versions of it are already claiming to be conscious, 02:58:00.760 |
but because from now on, who knows if it's real? 02:58:03.220 |
- Yeah, and who knows what transformational effect 02:58:06.700 |
that has on a society where more than 50% of the beings 02:58:17.260 |
when young men and women are dating AI systems? 02:58:32.720 |
'cause, you know, and how did you end up with me 02:58:35.540 |
'Cause for 20 years, humanity decided to ignore the problem. 02:58:52.900 |
particularly the part where everybody ends up dead 02:59:05.120 |
that is like relentlessly kind and generous to them? 02:59:20.400 |
'cause it's kind of hard to predict the future. 02:59:31.360 |
the longer term future, where it's all headed. 02:59:35.260 |
- By longer term, we mean like, not all that long, 02:59:41.060 |
- But beyond the effects of men and women dating AI systems. 02:59:51.600 |
Let me ask you about your own personal psychology. 02:59:56.060 |
You've been known at times to have a bit of an ego. 03:00:07.180 |
for the task of understanding the world deeply? 03:00:20.320 |
what leads to making better or worse predictions, 03:00:25.380 |
better or worse strategies is not carved at its joint 03:00:31.940 |
It should not be connected to the intricacies of your mind. 03:00:45.340 |
I think you get worse at making good predictions. 03:00:53.940 |
- You don't think we as humans get invested in an idea 03:00:59.420 |
and then others attack you personally for that idea 03:01:04.060 |
so you plant your feet and it starts to be difficult 03:01:13.020 |
you know what, I actually was wrong and tell them that. 03:01:22.320 |
- So like Robin Hanson and I debated AI systems 03:01:25.380 |
and I think that the person who won that debate was Gwern. 03:01:41.580 |
trying to sound reasonable compared to Hanson 03:01:55.920 |
Hanson may disagree with this characterization. 03:01:58.360 |
Hanson was like, all the systems will be specialized. 03:02:00.900 |
I was like, I think we build like specialized 03:02:03.380 |
underlying systems that when you combine them 03:02:08.200 |
And the reality is like, no, you just like stack more layers 03:02:20.740 |
I missed the ways that reality could be like more extreme 03:02:27.040 |
So is this like, is this a failure to have enough ego? 03:02:33.060 |
Is this a failure to like make myself be independent? 03:02:37.220 |
Like I would say that this is something like a failure 03:02:40.620 |
to consider positions that would sound even wackier 03:02:45.500 |
and more extreme when people are already calling you extreme. 03:02:49.300 |
But I wouldn't call that not having enough ego. 03:02:57.100 |
to just like clear that all out of your mind. 03:02:59.940 |
- In the context of like debate and discourse, 03:03:11.540 |
- So is there some kind of wisdom and insight 03:03:18.800 |
- Man, this is an example of like where I wanted 03:03:24.060 |
And you'd be like, okay, see that thing you just did? 03:03:29.660 |
Like you are like now being socially influenced 03:03:40.220 |
but for many people introspection is not that easy. 03:03:49.700 |
of feeling a sense of, well, if I think this thing, 03:03:55.900 |
Okay, like now that if you can see that sensation, 03:03:58.840 |
which is step one, can you now refuse to let it move you 03:04:06.740 |
And I feel like I'm saying like, I don't know, 03:04:09.320 |
like somebody is like, how do you draw an owl? 03:04:21.340 |
well, how do I notice the internal subjective sensation 03:04:36.620 |
And I'm like, no, no, you're not trying to do the opposite 03:04:39.660 |
of what people will, of what you're afraid you'll be, 03:04:46.380 |
You're trying to like let the thought process complete 03:04:59.460 |
And are these instructions even remotely helping anyone? 03:05:07.660 |
when practiced daily, meaning in your daily communication. 03:05:07.660 |
So it's daily practice of thinking without influence. 03:05:17.060 |
- I would say find prediction markets that matter to you 03:05:23.540 |
That way you find out if you are right or not. 03:05:29.480 |
- Manifold prediction, or even manifold markets 03:05:33.220 |
But the important thing is to like get the record. 03:05:39.540 |
And I didn't build up skills here by prediction markets. 03:05:52.900 |
And yeah, like the more you are able to notice yourself 03:06:08.060 |
Each of those is an opportunity to make like a small update. 03:06:08.060 |
So the more you can like say oops softly, routinely, 03:06:20.900 |
I see how I should have reasoned differently. 03:06:23.420 |
And this is how you build up skill over time. 03:06:36.620 |
If somebody's listening to this and they're young 03:06:39.140 |
and trying to figure out what to do with their career, 03:06:49.900 |
The future is probably not that long at this point. 03:06:58.360 |
if they want to have hope to fight for a longer future, 03:07:02.460 |
is there something, is there a fight worth fighting? 03:07:12.220 |
I admit that although I do try to think painful thoughts, 03:07:26.180 |
I hardly know how to fight myself at this point. 03:07:31.440 |
I'm trying to be ready for being wrong about something, 03:07:58.360 |
and that outcry is put into a remotely useful direction, 03:08:05.560 |
because no, we are not in a shape to frantically do, 03:08:09.320 |
at the last minute, do decades worth of work. 03:08:17.640 |
pointed in the right direction, which I do not expect, 03:08:43.240 |
predicting the next tokens and applying RLHF. 03:08:43.240 |
Like humans start out in the frame that produces niceness, 03:08:53.520 |
And in saying this, I do not want to sound like 03:09:07.000 |
where like somebody tells you that the world is ending 03:09:31.360 |
But if there was enough, like on the margins, 03:09:39.780 |
that like a few people can do by like trying hard. 03:10:10.480 |
and something that was like safe and convenient 03:10:19.360 |
We should do something else, which is not that, 03:10:20.880 |
even if it is like not super duper convenient 03:10:23.440 |
and wasn't inside the previous political Overton window. 03:10:27.240 |
if I am wrong and there is that kind of public outcry, 03:10:36.000 |
But like, and if you're like a brilliant young physicist, 03:10:41.000 |
then you could like go into interpretability. 03:10:46.960 |
where it's harder to tell if you got them right or not 03:11:02.440 |
to help if Eliezer Yudkowsky is wrong about something 03:11:05.040 |
and otherwise don't put your happiness into the far future. 03:11:11.080 |
- But it's beautiful that you're looking for ways 03:11:17.480 |
by that same young physicist with some breakthrough. 03:11:21.480 |
- It feels like a very, very basic competence 03:11:27.420 |
I don't think it's good that we're in a world 03:11:40.480 |
Maybe I should just accept that one gracefully. 03:11:45.280 |
- You've painted with some probability a dark future. 03:11:52.480 |
when you ponder your life and you ponder your mortality, 03:12:13.600 |
- There's a power to the finiteness of the human life 03:12:20.400 |
that's part of this whole machinery of evolution. 03:12:32.960 |
So it feels like almost some fundamentally in that aspect, 03:12:35.920 |
some fundamentally different thing that we're creating. 03:12:39.080 |
I grew up reading books like "Great Mambo Chicken" 03:12:44.320 |
and later on "Engines of Creation" and "Mind Children," 03:12:53.440 |
So I never thought I was supposed to die after 80 years. 03:12:58.280 |
I never thought that humanity was supposed to die. 03:13:05.960 |
that we were all going to live happily ever after 03:13:17.560 |
- And now I still think it's a pretty stupid idea. 03:13:21.600 |
- You do not need life to be finite to be meaningful. 03:13:25.200 |
- What role does love play in the human condition? 03:13:29.160 |
We haven't brought up love in this whole picture. 03:13:48.800 |
more than one AI, let's say two for the sake of discussion, 03:14:00.120 |
The other one also says, I am I and you are you. 03:14:05.360 |
And sometimes they were happy and sometimes they were sad. 03:14:11.880 |
is like they would rather it be happy than sad 03:14:25.040 |
And a little fragment of meaning would be there, 03:14:32.640 |
that I do not think this is what happens by default, 03:14:54.080 |
What do you think is the meaning of life, of human life? 03:15:28.280 |
We look at them and we say, this is its meaning to me. 03:15:30.680 |
And it's not that before humanity was ever here, 03:15:34.840 |
there was some meaning written upon the stars 03:15:39.640 |
where that meaning was written and change it around 03:15:41.920 |
and thereby completely change the meaning of life. 03:15:44.420 |
The notion that this is written on a stone tablet somewhere 03:15:53.120 |
So it doesn't feel that mysterious to me at this point. 03:15:58.000 |
It's just a matter of being like, yeah, I care. 03:16:03.640 |
And part of that is the love that connects all of us. 03:16:14.520 |
- And the flourishing of the collective intelligence 03:16:21.120 |
- You know, that sounds kind of too fancy to me. 03:16:31.880 |
and be like, that's life, that's life, that's life. 03:17:02.640 |
a whole lot of fundamental questions I expect people have, 03:17:14.360 |
but actually, no, I think one should only be satisfied 03:17:26.360 |
please check out our sponsors in the description. 03:17:28.920 |
And now let me leave you with some words from Elon Musk. 03:17:33.480 |
With artificial intelligence, we're summoning the demon. 03:17:37.240 |
Thank you for listening and hope to see you next time.