Roman Yampolskiy: Dangers of Superintelligent AI | Lex Fridman Podcast #431
Chapters
0:00 Introduction
2:20 Existential risk of AGI
8:32 Ikigai risk
16:44 Suffering risk
20:19 Timeline to AGI
24:51 AGI Turing test
30:14 Yann LeCun and open source AI
43:06 AI control
45:33 Social engineering
48:06 Fearmongering
57:57 AI deception
64:30 Verification
71:29 Self-improving AI
83:42 Pausing AI development
89:59 AI Safety
99:43 Current AI
105:05 Simulation
112:24 Aliens
113:57 Human mind
120:17 Neuralink
129:23 Hope for the future
133:18 Meaning of life
00:00:00.000 |
If we create general super-intelligences, I don't see a good outcome long-term for humanity. 00:00:06.560 |
So there is X risk. Existential risk, everyone's dead. There is S risk, suffering risks, 00:00:12.640 |
where everyone wishes they were dead. We have also idea for I risk, Ikigai risks, where 00:00:18.240 |
we lost our meaning. The systems can be more creative, they can do all the jobs. It's not 00:00:24.560 |
obvious what you have to contribute to a world where super-intelligence exists. Of course, you 00:00:30.080 |
can have all the variants you mentioned, where we are safe, we are kept alive, but we are not in 00:00:35.840 |
control. We are not deciding anything. We are like animals in a zoo. There is, again, possibilities 00:00:43.040 |
we can come up with as very smart humans, and then possibilities something a thousand times 00:00:49.120 |
smarter can come up with for reasons we cannot comprehend. The following is a conversation with 00:00:56.400 |
Roman Yampolskiy, an AI safety and security researcher and author of a new book titled 00:01:02.560 |
AI: Unexplainable, Unpredictable, Uncontrollable. He argues that there's almost 100% chance that AGI 00:01:11.280 |
will eventually destroy human civilization. As an aside, let me say that we'll have many, often 00:01:18.320 |
technical conversations on the topic of AI, often with engineers building the state-of-the-art AI 00:01:24.640 |
systems. I would say those folks put the infamous P-Doom or the probability of AGI killing all 00:01:30.560 |
humans at around 1-20%, but it's also important to talk to folks who put that value at 70, 80, 90, 00:01:40.160 |
and, in the case of Roman, at 99.99 and many more nines percent. I'm personally excited for the 00:01:47.680 |
future and believe it will be a good one, in part because of the amazing technological innovation we 00:01:54.000 |
humans create, but we must absolutely not do so with blinders on, ignoring the possible risks, 00:02:02.720 |
including existential risks of those technologies. That's what this conversation is about. 00:02:10.080 |
This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. 00:02:15.600 |
And now, dear friends, here's Roman Yampolskiy. What to you is the probability that superintelligent 00:02:23.920 |
AI will destroy all human civilization? - What's the time frame? 00:02:27.040 |
- Let's say 100 years, in the next 100 years. - So the problem of controlling AGI or 00:02:33.280 |
superintelligence, in my opinion, is like a problem of creating a perpetual safety machine. 00:02:39.680 |
By analogy with perpetual motion machine, it's impossible. Yeah, we may succeed and do a good 00:02:46.320 |
job with GPT-5, 6, 7, but they just keep improving, learning, eventually self-modifying, 00:02:56.560 |
interacting with the environment, interacting with malevolent actors. The difference between 00:03:03.360 |
cybersecurity, narrow AI safety, and safety for general AI for superintelligence is that 00:03:10.160 |
we don't get a second chance. With cybersecurity, somebody hacks your account, what's the big deal? 00:03:14.880 |
You get a new password, new credit card, you move on. Here, if we're talking about existential risks, 00:03:21.520 |
you only get one chance. So you're really asking me, what are the chances that we'll create 00:03:26.880 |
the most complex software ever on the first try with zero bugs, and it will continue to have zero 00:03:34.000 |
bugs for 100 years or more? - So there is an incremental improvement 00:03:41.360 |
of systems leading up to AGI. To you, it doesn't matter if we can keep those safe. There's going 00:03:49.200 |
to be one level of system at which you cannot possibly control it. - I don't think we so far 00:03:59.200 |
have made any system safe. At the level of capability they display, they already have 00:04:06.080 |
made mistakes, we had accidents, they've been jailbroken. I don't think there is a single 00:04:12.720 |
large language model today which no one was successful at making do something developers 00:04:18.960 |
didn't intend it to do. - But there's a difference between getting it to do something unintended, 00:04:24.560 |
getting it to do something that's painful, costly, destructive, and something that's destructive to 00:04:29.760 |
the level of hurting billions of people, or hundreds of millions of people, billions of people, 00:04:35.280 |
or the entirety of human civilization. That's a big leap. - Exactly, but the systems we have today 00:04:41.280 |
have capability of causing X amount of damage. So when they fail, that's all we get. If we develop 00:06:47.680 |
systems capable of impacting all of humanity, all of universe, the damage is proportionate. 00:04:55.040 |
- What to you are the possible ways that such kind of mass murder of humans can happen? 00:05:02.800 |
- That's always a wonderful question. So one of the chapters in my new book is about 00:05:07.840 |
unpredictability. I argue that we cannot predict what a smarter system will do. So you're really 00:05:13.360 |
not asking me how superintelligence will kill everyone, you're asking me how I would do it. 00:05:18.240 |
And I think it's not that interesting. I can tell you about the standard, you know, 00:05:22.480 |
nanotech, synthetic bio, nuclear. Superintelligence will come up with something completely new, 00:05:27.920 |
completely super. We may not even recognize that as a possible path to achieve that goal. 00:05:35.440 |
- So there's like an unlimited level of creativity in terms of how humans could be killed. 00:05:41.680 |
But, you know, we could still investigate possible ways of doing it. Not how to do it, but 00:05:49.680 |
at the end, what is the methodology that does it? You know, shutting off the power, 00:05:55.040 |
and then humans start killing each other maybe because the resources are really constrained. 00:06:01.440 |
And then there's the actual use of weapons, like nuclear weapons, or 00:06:04.320 |
developing artificial pathogens, viruses, that kind of stuff. We could still kind of think 00:06:11.760 |
through that and defend against it, right? There's a ceiling to the creativity of mass 00:06:16.880 |
murder of humans here, right? The options are limited. - They are limited by how imaginative 00:06:22.720 |
we are. If you are that much smarter, that much more creative, you are capable of thinking across 00:06:27.600 |
multiple domains, do novel research in physics and biology, you may not be limited by those tools. 00:06:33.520 |
If squirrels were planning to kill humans, they would have a set of possible ways of doing it, 00:06:39.200 |
but they would never consider things we can come up with. - So are you thinking about mass murder 00:06:43.760 |
and destruction of human civilization, or are you thinking of with squirrels, you put them in a zoo, 00:06:48.720 |
and they don't really know they're in a zoo? If we just look at the entire set of undesirable 00:06:52.560 |
trajectories, majority of them are not going to be death. Most of them are going to be just like 00:06:59.280 |
things like Brave New World, where the squirrels are fed dopamine, and they're all doing some kind 00:07:08.880 |
of fun activity, and the fire, the soul of humanity is lost because of the drug that's fed to it. Or 00:07:16.640 |
like literally in a zoo, we're in a zoo, we're doing our thing, we're like playing a game of Sims, 00:07:22.160 |
and the actual players playing that game are AI systems. Those are all undesirable because 00:07:29.280 |
sort of the free will, the fire of human consciousness is dimmed through that process, 00:07:35.360 |
but it's not killing humans. So are you thinking about that, or is the biggest concern literally 00:07:43.280 |
the extinctions of humans? - I think about a lot of things. So there is X risk, existential risk, 00:07:49.520 |
everyone's dead. There is S risk, suffering risks, where everyone wishes they were dead. 00:07:54.640 |
We have also idea for I risk, Ikigai risks, where we lost our meaning. The systems can be more 00:08:02.000 |
creative, they can do all the jobs. It's not obvious what you have to contribute to a world 00:08:07.520 |
where superintelligence exists. Of course, you can have all the variants you mentioned, where 00:08:13.280 |
we are safe, we are kept alive, but we are not in control, we are not deciding anything, 00:08:18.240 |
we are like animals in a zoo. There is, again, possibilities we can come up with as very smart 00:08:25.280 |
humans, and then possibilities something 1,000 times smarter can come up with for reasons we 00:08:31.600 |
cannot comprehend. - I would love to sort of dig into each of those, X risk, S risk, and I risk. 00:08:37.840 |
So can you like linger on I risk? What is that? - So Japanese concept of Ikigai, you find something 00:08:45.440 |
which allows you to make money, you are good at it, and the society says, "We need it." So like, 00:08:51.680 |
you have this awesome job, you are a podcaster, gives you a lot of meaning, you have a good life, 00:08:58.640 |
I assume you're happy. That's what we want most people to find, to have. For many intellectuals, 00:09:06.160 |
it is their occupation which gives them a lot of meaning. I am a researcher, philosopher, scholar, 00:09:12.560 |
that means something to me. In a world where an artist is not feeling appreciated because his art 00:09:20.080 |
is just not competitive with what is produced by machines, or a writer, or scientist will lose a 00:09:28.640 |
lot of that. And at the lower level, we're talking about complete technological unemployment. 00:09:34.800 |
We're not losing 10% of jobs, we're losing all jobs. What do people do with all that free time? 00:09:40.160 |
What happens when everything society is built on is completely modified in one generation? It's not 00:09:48.000 |
a slow process where we get to kind of figure out how to live that new lifestyle, but it's 00:09:54.160 |
pretty quick. - In that world, can't humans do what humans currently do with chess, play each other, 00:10:01.440 |
have tournaments, even though AI systems are far superior at this time in chess? So we just 00:10:08.320 |
create artificial games, or for us, they're real. Like the Olympics, we do all kinds of different 00:10:14.400 |
competitions and have fun, maximize the fun, and let the AI focus on the productivity. 00:10:24.000 |
- It's an option, I have a paper where I try to solve the value alignment problem for multiple 00:10:29.440 |
agents. And the solution to avoid compromise is to give everyone a personal virtual universe. 00:10:35.360 |
You can do whatever you want in that world. You could be king, you could be slave, you decide 00:10:39.840 |
what happens. So it's basically a glorified video game where you get to enjoy yourself and someone 00:10:45.360 |
else takes care of your needs, and the substrate alignment is the only thing we need to solve. We 00:10:51.840 |
don't have to get eight billion humans to agree on anything. - So okay, so why is that not a 00:10:58.880 |
likely outcome? Why can't AI systems create video games for us to lose ourselves in, each with an 00:11:05.920 |
individual video game universe? - Some people say that's what happened, 00:11:10.000 |
we're in a simulation. - And we're playing that video game, 00:11:13.520 |
and now we're creating, what, maybe we're creating artificial threats for ourselves to be scared 00:11:19.840 |
about, 'cause fear is really exciting. It allows us to play the video game more vigorously. 00:11:25.440 |
- And some people choose to play on a more difficult level with more constraints. Some say, 00:11:30.880 |
okay, I'm just gonna enjoy the game, high privilege level. Absolutely. 00:11:34.720 |
- So okay, what was that paper on multi-agent value alignment? 00:11:38.240 |
- Personal universes. Personal universes. - So that's one of the possible outcomes. 00:11:44.560 |
But what in general is the idea of the paper? So it's looking at multiple agents that are human, 00:11:49.600 |
AI, like a hybrid system where there's humans and AIs? Or is it looking at humans or just 00:11:54.800 |
intelligent agents? - In order to solve value 00:11:56.800 |
alignment problem, I'm trying to formalize it a little better. Usually we're talking about getting 00:12:02.000 |
AIs to do what we want, which is not well-defined. Are we talking about creator of a system, 00:12:08.000 |
owner of that AI, humanity as a whole? But we don't agree on much. There is no universally 00:12:15.760 |
accepted ethics, morals across cultures, religions. People have individually very different 00:12:20.960 |
preferences politically and such. So even if we somehow managed all the other aspects of it, 00:12:26.880 |
programming those fuzzy concepts in, getting AI to follow them closely, 00:12:30.880 |
we don't agree on what to program in. So my solution was, okay, we don't have to compromise 00:12:36.320 |
on room temperature. You have your universe, I have mine, whatever you want. And if you like me, 00:12:41.680 |
you can invite me to visit your universe. We don't have to be independent, but the point is you can 00:12:46.720 |
be. And virtual reality is getting pretty good. It's gonna hit a point where you can't tell that 00:12:50.880 |
difference. And if you can't tell if it's real or not, well, what's the difference? 00:12:54.720 |
- So basically, give up on value alignment. Create an entire, it's like the multiverse theory. 00:13:01.040 |
It's just create an entire universe for you with your values. 00:13:04.240 |
- You still have to align with that individual. They have to be happy in that simulation. 00:13:09.360 |
But it's a much easier problem to align with one agent versus eight billion agents plus animals, 00:13:15.120 |
- So you convert the multi-agent problem into a single-agent problem? 00:13:21.280 |
- Okay. Is there any way to, so, okay, that's giving up on the value alignment problem. 00:13:29.760 |
Well, is there any way to solve the value alignment problem where there's a bunch of humans, 00:13:35.040 |
multiple humans, tens of humans, or eight billion humans that have very different set of values? 00:13:41.280 |
- It seems contradictory. I haven't seen anyone explain what it means outside of kinda 00:13:48.400 |
words which pack a lot, make it good, make it desirable, make it something they don't regret. 00:13:55.360 |
But how do you specifically formalize those notions? How do you program them in? 00:13:59.600 |
I haven't seen anyone make progress on that so far. 00:14:02.720 |
- But isn't that the whole optimization journey that we're doing as a human civilization? 00:14:07.680 |
We're looking at geopolitics. Nations are in a state of anarchy with each other. 00:14:13.920 |
They start wars, there's conflict, and oftentimes they have very different views of what is good 00:14:22.000 |
and what is evil. Isn't that what we're trying to figure out, just together, trying to converge 00:14:27.600 |
towards that? So we're essentially trying to solve the value alignment problem with humans. 00:14:31.360 |
- Right, but the examples you gave, some of them are, for example, two different religions saying 00:14:36.640 |
this is our holy site, and we are not willing to compromise it in any way. If you can make 00:14:43.360 |
two holy sites in virtual worlds, you solve the problem. But if you only have one, it's not 00:14:47.360 |
divisible, you're kinda stuck there. - But what if we want to be at tension 00:14:51.520 |
with each other? And through that tension, we understand ourselves and we understand the world. 00:14:58.160 |
So that's the intellectual journey we're on as a human civilization, is we create intellectual 00:15:05.680 |
and physical conflict, and through that, figure stuff out. 00:15:08.240 |
- If we go back to that idea of simulation, and this is entertainment kinda giving meaning to us, 00:15:14.640 |
the question is how much suffering is reasonable for a video game? So yeah, I don't mind a video 00:15:20.160 |
game where I get haptic feedback, there is a little bit of shaking, maybe I'm a little scared. 00:15:25.360 |
I don't want a game where kids are tortured, literally. That seems unethical, at least by 00:15:32.880 |
our human standards. - Are you suggesting it's possible 00:15:35.840 |
to remove suffering, if we're looking at human civilization as an optimization problem? 00:15:39.920 |
- So we know there are some humans who, because of a mutation, don't experience physical pain. 00:15:46.560 |
So at least physical pain can be mutated out, re-engineered out. Suffering, in terms of meaning, 00:15:55.280 |
like you burned the only copy of my book, is a little harder. But even there, you can manipulate 00:16:00.960 |
your hedonic set point, you can change defaults, you can reset. Problem with that is, if you start 00:16:07.360 |
messing with your reward channel, you start wireheading, and end up blissing out a little bit. 00:16:17.200 |
Would you really want to live in a world where there's no suffering? That's a dark question. 00:16:22.240 |
But is there some level of suffering that reminds us of what this is all for? 00:16:28.800 |
- I think we need that, but I would change the overall range. So right now, 00:16:34.160 |
it's negative infinity to kind of positive infinity, pain-pleasure axis. I would make 00:16:38.800 |
it like zero to positive infinity, and being unhappy is like, I'm close to zero. 00:16:43.040 |
- Okay, so what's the S-risk? What are the possible things that you're imagining with S-risk? So 00:16:49.200 |
mass suffering of humans, what are we talking about there, caused by AGI? 00:16:54.560 |
- So there are many malevolent actors. We can talk about psychopaths, crazies, hackers, 00:17:01.360 |
doomsday cults. We know from history, they tried killing everyone. They tried on purpose to cause 00:17:07.280 |
maximum amount of damage, terrorism. What if someone malevolent wants on purpose to torture 00:17:13.440 |
all humans as long as possible? You solve aging, so now you have functional immortality, 00:17:20.880 |
and you just try to be as creative as you can. - Do you think there is actually people in human 00:17:26.480 |
history that tried to literally maximize human suffering? In just studying people who have done 00:17:32.560 |
evil in the world, it seems that they think that they're doing good, and it doesn't seem like 00:17:37.840 |
they're trying to maximize suffering. They just cause a lot of suffering as a side effect of 00:17:45.360 |
doing what they think is good. - So there are different malevolent 00:17:49.120 |
agents. Some may be just gaining personal benefit and sacrificing others to that cause. Others, 00:17:56.240 |
we know for a fact, are trying to kill as many people as possible. When we look at 00:18:00.080 |
recent school shootings, if they had more capable weapons, they would take out not 00:18:05.840 |
dozens, but thousands, millions, billions. - Well, we don't know that, but that is a 00:18:16.800 |
terrifying possibility, and we don't want to find out. Like if terrorists had access to nuclear 00:18:24.240 |
weapons, how far would they go? Is there a limit to what they're willing to do? In your senses, 00:18:33.840 |
there are some malevolent actors where there's no limit. - There is mental diseases where people 00:18:41.520 |
don't have empathy, don't have this human quality of understanding suffering in others. 00:18:49.040 |
- And then there's also a set of beliefs where you think you're doing good 00:18:52.400 |
by killing a lot of humans. - Again, I would like to assume that 00:18:58.720 |
normal people never think like that. It's always some sort of psychopaths, but yeah. 00:19:03.120 |
- And to you, AGI systems can carry that and be more competent at executing that? 00:19:10.800 |
- They can certainly be more creative. They can understand human biology better, 00:19:15.840 |
understand our molecular structure, genome. Again, a lot of times, torture ends, 00:19:23.680 |
then individual dies. That limit can be removed as well. - So if we're actually looking at X-risk 00:19:30.000 |
and S-risk, as the systems get more and more intelligent, don't you think it's possible to 00:19:35.840 |
anticipate the ways they can do it and defend against it like we do with the cyber security, 00:19:41.040 |
with the new security systems? - Right, we can definitely keep up for a 00:19:45.440 |
while. I'm saying you cannot do it indefinitely. At some point, the cognitive gap is too big, 00:19:52.400 |
the surface you have to defend is infinite, but attackers only need to find one exploit. 00:20:00.400 |
- So to you, eventually, this is, we're heading off a cliff. 00:20:04.480 |
- If we create general super intelligences, I don't see a good outcome long-term for humanity. 00:20:11.520 |
The only way to win this game is not to play it. - Okay, well, we'll talk about possible solutions 00:20:16.720 |
and what not playing it means, but what are the possible timelines here to you? What are 00:20:22.240 |
we talking about? We're talking about a set of years, decades, centuries. What do you think? 00:20:27.280 |
- I don't know for sure. The prediction markets right now are saying 2026 for AGI. 00:20:32.640 |
I heard the same thing from CEO of Anthropic, DeepMind, so maybe we're two years away, 00:20:38.480 |
which seems very soon, given we don't have a working safety mechanism in place or even a 00:20:45.360 |
prototype for one, and there are people trying to accelerate those timelines because they feel 00:20:50.000 |
we're not getting there quick enough. - Well, what do you think they mean 00:20:52.880 |
when they say AGI? - So the definitions we used to have, 00:20:56.800 |
and people are modifying them a little bit lately, artificial general intelligence was a system 00:21:02.320 |
capable of performing in any domain a human could perform, so kind of you're creating this average 00:21:09.120 |
artificial person. They can do cognitive labor, physical labor, where you can get another human 00:21:14.320 |
to do it. Superintelligence was defined as a system which is superior to all humans in all 00:21:19.120 |
domains. Now people are starting to refer to AGI as if it's superintelligence. I made a post 00:21:25.840 |
recently where I argued, for me at least, if you average out over all the common human tasks, 00:21:31.920 |
those systems are already smarter than an average human. So under that definition, we have it. 00:21:38.800 |
Shane Legg has this definition of where you're trying to win in all domains. That's what 00:21:43.440 |
intelligence is. Now, are they smarter than elite individuals in certain domains? Of course not. 00:21:49.280 |
They're not there yet. But the progress is exponential. - See, I'm much more concerned 00:21:55.440 |
about social engineering. So to me, AI's ability to do something in the physical world, 00:22:03.440 |
like the lowest hanging fruit, the easiest set of methods, is by just getting humans to do it. 00:22:12.720 |
It's going to be much harder to be the kind of viruses that take over the minds of robots 00:22:19.840 |
that, where the robots are executing the commands. It just seems like humans, 00:22:24.240 |
social engineering of humans, is much more likely. - That would be enough to 00:22:28.000 |
bootstrap the whole process. - Okay, just to linger on the term AGI, 00:22:34.160 |
what to you is the difference between AGI and human-level intelligence? 00:22:37.760 |
- Human-level is general in the domain of expertise of humans. We know how to do human 00:22:44.480 |
things. I don't speak dog language. I should be able to pick it up if I'm a general intelligence. 00:22:49.760 |
It's kind of inferior animal. I should be able to learn that skill, but I can't. A general 00:22:55.280 |
intelligence, truly universal general intelligence, should be able to do things like that, things humans cannot: 00:23:04.160 |
solve problems of that type, to do other similar things outside of our domain of expertise, 00:23:12.960 |
because it's just not the world we live in. - If we just look at the space of cognitive 00:23:19.520 |
abilities we have, I just would love to understand what the limits are beyond which an AGI system can 00:23:25.600 |
reach. What does that look like? What about actual mathematical thinking or scientific innovation, 00:23:34.800 |
that kind of stuff? - We know calculators are smarter than 00:23:39.280 |
humans in that narrow domain of addition. - But is it humans plus tools versus AGI, 00:23:47.440 |
or just human, raw human intelligence? 'Cause humans create tools, and with the tools, 00:23:53.360 |
they become more intelligent. There's a gray area there, what it means to be human when we're 00:23:58.720 |
measuring their intelligence. - When I think about it, I usually think 00:24:01.440 |
human with a paper and a pencil, not human with internet and another AI helping. 00:24:06.640 |
- But is that a fair way to think about it? 'Cause isn't there another definition of human-level 00:24:11.440 |
intelligence that includes the tools that humans create? 00:24:13.680 |
- But we create AI, so at any point, you'll still just add superintelligence to human capability? 00:24:19.200 |
That seems like cheating. - No, controllable tools. 00:24:23.040 |
There is an implied leap that you're making when AGI goes from tool to entity that can make its 00:24:33.440 |
own decisions. So if we define human-level intelligence as everything a human can do 00:24:38.480 |
with fully controllable tools. - It seems like a hybrid of some kind. 00:24:42.720 |
You're now doing brain-computer interfaces, you're connecting it to maybe narrow AIs. Yeah, 00:24:47.680 |
it definitely increases our capabilities. - So what's a good test to you that measures 00:24:57.360 |
whether an artificial intelligence system has reached human-level intelligence? And what's a 00:25:02.560 |
good test where it has superseded human-level intelligence to reach that land of AGI? 00:25:09.120 |
- I am old-fashioned. I like Turing test. I have a paper where I equate passing Turing test to 00:25:15.040 |
solving AI complete problems because you can encode any questions about any domain into the 00:25:20.400 |
Turing test. You don't have to talk about how was your day, you can ask anything. And so the system 00:25:26.960 |
has to be as smart as a human to pass it in a true sense. - But then you would extend that to 00:25:32.080 |
maybe a very long conversation. I think the Alexa prize was doing that. 00:25:37.360 |
Basically, can you do a 20-minute, 30-minute conversation with an AI system? 00:25:42.160 |
- It has to be long enough to where you can make some meaningful decisions about capabilities, 00:25:49.040 |
absolutely. You can brute force very short conversations. - So like, literally, what does 00:25:54.160 |
that look like? Can we construct formally a kind of test that tests for AGI? - For AGI, it has to 00:26:05.840 |
be there. There cannot be a task I can give to a human that it cannot do, if a human can do it. For 00:26:13.200 |
superintelligence, it would be superior on all such tasks, not just average performance. So like, 00:26:18.800 |
go learn to drive car, go speak Chinese, play guitar. Okay, great. - I guess the following 00:26:24.480 |
question, is there a test for the kind of AGI that would be susceptible to lead to S-risk or X-risk? 00:26:35.520 |
Susceptible to destroy human civilization? Like, is there a test for that? - You can develop a test 00:26:42.400 |
which will give you positives if it lies to you or has those ideas. You cannot develop a test which 00:26:48.320 |
rules them out. There is always possibility of what Bostrom calls a treacherous turn, 00:26:53.280 |
where later on a system decides for game-theoretic reasons, economic reasons, to change its behavior. 00:27:01.360 |
And we see the same with humans. It's not unique to AI. For millennia, we tried developing morals, 00:27:07.120 |
ethics, religions, lie detector tests, and then employees betray the employers, spouses betray 00:27:13.600 |
family. It's a pretty standard thing intelligent agents sometimes do. - So is it possible to detect 00:27:21.280 |
when an AI system is lying or deceiving you? - If you know the truth and it tells you something 00:27:27.920 |
false, you can detect that. But you cannot know in general every single time. And again, the system 00:27:34.800 |
you're testing today may not be lying. The system you're testing today may know you are testing it 00:27:40.640 |
and so behaving. And later on, after it interacts with the environment, interacts with other systems, 00:27:48.400 |
malevolent agents, learns more, it may start doing those things. - So do you think it's possible to 00:27:54.560 |
develop a system where the creators of the system, the developers, the programmers, don't know that 00:28:00.480 |
it's deceiving them? - So systems today don't have long-term planning. That is not there yet. They can lie 00:28:08.080 |
today if it optimizes, helps them optimize their reward. If they realize, okay, this human will be 00:28:15.840 |
very happy if I tell them the following, they will do it if it brings them more points. And they don't 00:28:23.440 |
have to kind of keep track of it. It's just the right answer to this problem every single time. 00:28:29.440 |
- At which point is somebody creating that intentionally, not unintentionally, intentionally 00:28:36.000 |
creating an AI system that's doing long-term planning with an objective function as defined 00:28:41.040 |
by the AI system, not by a human? - Well, some people think that if they're that smart, they're 00:28:46.960 |
always good. They really do believe that. It's just benevolence from intelligence. So they'll 00:28:52.480 |
always want what's best for us. Some people think that they will be able to detect problem behaviors 00:29:00.640 |
and correct them at the time when we get there. I don't think it's a good idea. I am strongly 00:29:06.960 |
against it. But yeah, there are quite a few people who, in general, are so optimistic about this 00:29:12.960 |
technology, it could do no wrong. They want it developed as soon as possible, as capable as 00:29:18.480 |
possible. - So there's going to be people who believe the more intelligent it is, the more 00:29:24.640 |
benevolent. And so therefore, it should be the one that defines the objective function that it's 00:29:28.560 |
optimizing when it's doing long-term planning. - There are even people who say, okay, what's so 00:29:33.680 |
special about humans, right? We removed the gender bias. We're removing race bias. Why is this 00:29:40.000 |
pro-human bias? We are polluting the planet. We are, as you said, you know, fight a lot of wars, 00:29:44.880 |
kind of violent. Maybe it's better if a super intelligent, perfect society comes and replaces 00:29:52.160 |
us. It's normal stage in the evolution of our species. - Yeah, so somebody says, let's develop 00:29:59.520 |
an AI system that removes the violent humans from the world. And then it turns out that all humans 00:30:06.240 |
have violence in them, or the capacity for violence, and therefore all humans are removed. 00:30:10.800 |
- Yeah, yeah, yeah. Let me ask about Yann LeCun. He's somebody who you've had a few exchanges with, 00:30:20.960 |
and he's somebody who actively pushes back against this view that AI is going to lead to destruction 00:30:28.320 |
of human civilization, also known as AI doomerism. So in one example that he tweeted, he said, 00:30:40.880 |
"I do acknowledge risks, but," two points, "one, open research and open source are the best ways 00:30:47.360 |
to understand and mitigate the risks, and two, AI is not something that just happens. We build it. 00:30:54.720 |
We have agency in what it becomes. Hence, we control the risks." We meaning humans. It's not 00:31:01.040 |
some sort of natural phenomena that we have no control over. So can you make the case that he's 00:31:07.680 |
right, and can you try to make the case that he's wrong? - I cannot make a case that he's right. He's 00:31:12.480 |
wrong in so many ways, it's difficult for me to remember all of them. He is a Facebook buddy, 00:31:17.920 |
so I have a lot of fun having those little debates with him. So I'm trying to remember the arguments. 00:31:23.840 |
So one, he says, "We are not gifted this intelligence from aliens. We are designing it, 00:31:30.880 |
we are making decisions about it." That's not true. It was true when we had expert systems, 00:31:36.960 |
symbolic AI, decision trees. Today, you set up parameters for a model and you water this plant, 00:31:43.680 |
you give it data, you give it compute, and it grows. And after it's finished growing into this 00:31:49.280 |
alien plant, you start testing it to find out what capabilities it has. And it takes years 00:31:55.040 |
to figure out, even for existing models. If it's trained for six months, it will take you two, 00:31:59.440 |
three years to figure out basic capabilities of that system. We still discover new capabilities 00:32:05.280 |
in systems which are already out there. So that's not the case. - So just to linger on that, 00:32:10.720 |
that you, the difference there, that there is some level of emergent intelligence that happens 00:32:15.760 |
in our current approaches. So stuff that we don't hard-code in. - Absolutely. That's what makes it 00:32:23.680 |
so successful. When we had to painstakingly hard-code in everything, we didn't have much 00:32:29.280 |
progress. Now, just spend more money and more compute, and it's a lot more capable. - And then 00:32:35.440 |
the question is, when there is emergent intelligent phenomena, what is the ceiling of that? For you, 00:32:41.280 |
there's no ceiling. For Yann LeCun, I think there's a kind of ceiling that happens that we have full 00:32:47.920 |
control over. Even if we don't understand the internals of the emergence, how the emergence 00:32:53.040 |
happens, there's a sense that we have control and an understanding of the approximate ceiling 00:33:00.640 |
of capability, the limits of the capability. - Let's say there is a ceiling. It's not guaranteed 00:33:06.720 |
to be at the level which is competitive with us. It may be greatly superior to ours. - So what about 00:33:14.480 |
his statement about open research and open source are the best ways to understand and mitigate the 00:33:20.800 |
risks? - Historically, he's completely right. Open source software is wonderful. It's tested 00:33:26.240 |
by the community. It's debugged. But we're switching from tools to agents. Now you're giving 00:33:31.920 |
open source weapons to psychopaths. Do we want to open source nuclear weapons? Biological weapons? 00:33:38.720 |
It's not safe to give technology so powerful to those who may misalign it. Even if you are 00:33:45.600 |
successful at somehow getting it to work in the first place in a friendly manner. - But the 00:33:51.040 |
difference with nuclear weapons, current AI systems are not akin to nuclear weapons. So the idea there 00:33:57.440 |
is you're open sourcing it at this stage, that you can understand it better. A large number of people 00:34:01.920 |
can explore the limitations, the capabilities, explore the possible ways to keep it safe, to keep 00:34:06.800 |
it secure, all that kind of stuff, while it's not at the stage of nuclear weapons. 00:34:12.400 |
So nuclear weapons, there's a non-nuclear weapon and then there's a nuclear weapon. 00:34:16.480 |
With AI systems, there's a gradual improvement of capability and you get to 00:34:23.040 |
perform that improvement incrementally. And so open source allows you to study 00:34:26.880 |
how things go wrong, study the very process of emergence, study AI safety in those systems when 00:34:35.200 |
there's not a high level of danger, all that kind of stuff. - It also sets a very wrong precedent. 00:34:40.720 |
So we open sourced model one, model two, model three, nothing ever bad happened. So obviously 00:34:46.080 |
we're going to do it with model four. It's just gradual improvement. - I don't think it always 00:34:51.120 |
works with the precedent. Like you're not stuck doing it the way you always did. It's just, 00:34:56.560 |
it sets a precedent of open research and open development such that we get to learn together. 00:35:03.920 |
And then the first time there's a sign of danger, some dramatic thing happened, not a thing that 00:35:10.560 |
destroys human civilization, but some dramatic demonstration of capability that can legitimately 00:35:17.600 |
lead to a lot of damage. Then everybody wakes up and says, "Okay, we need to regulate this. 00:35:22.320 |
We need to come up with safety mechanism that stops this." At this time, maybe you can educate 00:35:28.320 |
me, but I haven't seen any illustration of significant damage done by intelligent AI systems. 00:35:34.000 |
- So I have a paper which collects accidents through history of AI and they always are 00:35:39.440 |
proportional to capabilities of that system. So if you have tic-tac-toe playing AI, it will 00:35:44.960 |
fail to play properly and lose a game which it should draw. Trivial. Your spell checker will 00:35:50.800 |
misspell a word, so on. I stopped collecting those because there are just too many examples of AIs 00:35:56.720 |
failing at what they are capable of. We haven't had terrible accidents in the sense of billion 00:36:03.520 |
people got killed. Absolutely true. But in another paper, I argue that those accidents do not 00:36:10.000 |
actually prevent people from continuing with research. Actually, they serve like vaccines. 00:36:17.600 |
A vaccine makes your body a little bit sick, so you can handle the big disease later much better. 00:36:24.480 |
It's the same here. People will point out, "You know that AI accident we had where 12 people died? 00:36:29.200 |
Everyone's still here. 12 people is less than smoking kills. It's not a big deal." So we 00:36:35.120 |
continue. So in a way, it will actually be kind of confirming that it's not that bad. 00:36:42.320 |
It matters how the deaths happen. Whether it's literally murdered by the AI system, 00:36:48.480 |
then one is a problem. But if it's accidents because of increased reliance on automation, 00:36:56.560 |
for example. So when airplanes are flying in an automated way, maybe the number of plane 00:37:04.880 |
crashes increased by 17% or something. And then you're like, "Okay, do we really want to rely on 00:37:10.560 |
automation?" I think in the case of automation airplanes, it decreased significantly. Okay, 00:37:15.280 |
same thing with autonomous vehicles. Like, okay, what are the pros and cons? What are the trade 00:37:21.360 |
offs here? And you can have that discussion in an honest way. But I think the kind of things 00:37:27.120 |
we're talking about here is mass scale pain and suffering caused by AI systems. And I think we 00:37:36.560 |
need to see illustrations of that on a very small scale to start to understand that this is really 00:37:43.200 |
damaging. Versus Clippy. Versus a tool that's really useful to a lot of people to do learning, 00:37:49.680 |
to do summarization of texts, to do question and answer, all that kind of stuff. To generate 00:37:56.320 |
videos. A tool. Fundamentally a tool versus an agent that can do a huge amount of damage. 00:38:03.440 |
So you bring up example of cars. Cars were slowly developed and integrated. If we had no cars, 00:38:11.200 |
and somebody came around and said, "I invented this thing. It's called cars. It's awesome. 00:38:15.440 |
It kills like a hundred thousand Americans every year. Let's deploy it." Would we deploy that? 00:38:21.600 |
- There's been fear mongering about cars for a long time. The transition from horses to cars. 00:38:28.240 |
There's a really nice channel that I recommend people check out, Pessimist Archive, 00:38:32.000 |
that documents all the fear mongering about technology that's happened throughout history. 00:38:37.200 |
There's definitely been a lot of fear mongering about cars. There's a transition period there 00:38:42.400 |
about cars, about how deadly they are. We can try. It took a very long time for cars to 00:38:48.000 |
proliferate to the degree they have now. And then you could ask serious questions in terms of the 00:38:53.920 |
miles traveled, the benefit to the economy, the benefit to the quality of life that cars do, 00:38:58.480 |
versus the number of deaths. 30, 40,000 in the United States. Are we willing to pay that price? 00:39:04.880 |
I think most people, when they're rationally thinking, policymakers will say yes. 00:39:11.440 |
We want to decrease it from 40,000 to zero and do everything we can to decrease it. 00:39:18.080 |
There's all kinds of policies, incentives you can create to decrease the risks 00:39:22.400 |
with the deployment of this technology, but then you have to weigh the benefits 00:39:26.880 |
and the risks of the technology. And the same thing would be done with AI. 00:39:30.560 |
- You need data, you need to know. But if I'm right, and it's unpredictable, unexplainable, 00:39:36.400 |
uncontrollable, you cannot make this decision where we're gaining $10 trillion of wealth, 00:39:41.440 |
but we're losing, we don't know how many people. You basically have to perform an experiment 00:39:47.520 |
on 8 billion humans without their consent. And even if they want to give you consent, 00:39:52.800 |
they can't because they cannot give informed consent. They don't understand those things. 00:39:57.360 |
- Right, that happens when you go from the predictable to the unpredictable very quickly. 00:40:04.560 |
You just, but it's not obvious to me that AI systems would gain capabilities so quickly 00:40:11.520 |
that you won't be able to collect enough data to study the benefits and the risks. 00:40:15.840 |
- We're literally doing it. The previous model we learned about after we finished training it, 00:40:21.920 |
what it was capable of. Let's say we stop GPT-4 training run around human capability, 00:40:27.760 |
hypothetically. We start training GPT-5, and I have no knowledge of insider training runs or 00:40:33.040 |
anything, and we start at that point of about human, and we train it for the next nine months. 00:40:39.280 |
Maybe two months in, it becomes super intelligent. We continue training it. At the time when we start 00:40:45.120 |
testing it, it is already a dangerous system. How dangerous? I have no idea, 00:40:51.040 |
but neither are people training it. - At the training stage, but then there's a 00:40:56.000 |
testing stage inside the company. They can start getting intuition about what the system is capable 00:41:01.360 |
to do. You're saying that somehow leap from GPT-4 to GPT-5 can happen, the kind of leap where GPT-4 00:41:11.760 |
was controllable and GPT-5 is no longer controllable, and we get no insights from 00:41:16.960 |
using GPT-4 about the fact that GPT-5 will be uncontrollable. That's the situation you're 00:41:23.440 |
concerned about, where the leap from N to N+1 would be such that an uncontrollable system is 00:41:33.280 |
created without any ability for us to anticipate that. - If we had the capability, ahead of the run, 00:41:41.600 |
before the training run, to register exactly what capabilities the next model will have at the end 00:41:46.400 |
of the training run, and we accurately guessed all of them, I would say you're right. We can 00:41:50.880 |
definitely go ahead with this run. We don't have that capability. - From GPT-4, you can build up 00:41:56.720 |
intuitions about what GPT-5 will be capable of. It's just incremental progress. Even if that's a 00:42:03.680 |
big leap in capability, it just doesn't seem like you can take a leap from a system that's 00:42:09.680 |
helping you write emails to a system that's going to destroy human civilization. It seems like it's 00:42:16.720 |
always going to be sufficiently incremental such that we can anticipate the possible dangers. We're 00:42:22.880 |
not even talking about existential risk, but just the kind of damage you can do to civilization. 00:42:28.240 |
It seems like we'll be able to anticipate the kinds, not the exact, but the kinds of 00:42:33.120 |
risks it might lead to, and then rapidly develop defenses ahead of time and as the risks emerge. 00:42:44.560 |
- We're not talking just about capabilities, specific tasks. We're talking about general 00:42:49.280 |
capability to learn. Maybe like a child at the time of testing and deployment, it is still not 00:42:56.640 |
extremely capable, but as it is exposed to more data, real world, it can be trained to 00:43:03.360 |
become much more dangerous and capable. - Let's focus then on the control problem. 00:43:11.120 |
At which point does the system become uncontrollable? 00:43:13.680 |
Why is it the more likely trajectory for you that the system becomes uncontrollable? 00:43:19.040 |
- I think at some point it becomes capable of getting out of control. For game theoretic 00:43:25.680 |
reasons, it may decide not to do anything right away and for a long time just collect more 00:43:30.640 |
resources, accumulate strategic advantage. Right away, it may be kind of still young, 00:43:37.200 |
weak superintelligence, give it a decade, it's in charge of a lot more resources, 00:43:42.160 |
it had time to make backups. So it's not obvious to me that it will strike as soon as it can. 00:43:47.280 |
- Look, can we just try to imagine this future where there's an AI system that's capable of 00:43:54.240 |
escaping the control of humans and then doesn't and waits. What's that look like? 00:44:02.800 |
- So one, we have to rely on that system for a lot of the infrastructure. So we'll have to give 00:44:08.240 |
it access, not just to the internet, but to the task of managing power, government, economy, 00:44:19.120 |
this kind of stuff. And that just feels like a gradual process given the bureaucracies of all 00:44:23.840 |
those systems involved. - We've been doing it for years. Software 00:44:27.040 |
controls all the systems, nuclear power plants, airline industry, it's all software based. Every 00:44:32.320 |
time there is electrical outage, I can't fly anywhere for days. 00:44:35.520 |
- But there's a difference between software and AI. There's different kinds of software. So 00:44:43.360 |
to give a single AI system access to the control of airlines and the control of the economy, 00:44:49.920 |
that's not a trivial transition for humanity. - No, but if it shows it is safer, in fact, 00:44:57.120 |
when it's in control, we get better results, people will demand that it be put in place. 00:45:01.840 |
- Absolutely. - And if not, it can hack the system. It can 00:45:04.400 |
use social engineering to get access to it. That's why I said it might take some time for it to 00:45:09.200 |
accumulate those resources. - It just feels like that would take a long 00:45:12.320 |
time for either humans to trust it or for the social engineering to come into play. It's not 00:45:18.160 |
a thing that happens overnight. It feels like something that happens across one or two decades. 00:45:22.720 |
- I really hope you're right, but it's not what I'm seeing. People are very 00:45:26.960 |
quick to jump on the latest trend. Early adopters will be there before it's even 00:45:31.040 |
deployed buying prototypes. - Maybe the social engineering. 00:45:34.720 |
So for social engineering, AI systems don't need any hardware access. It's all software. So they 00:45:42.640 |
can start manipulating you through social media and so on. Like you have AI assistants, they're 00:45:47.280 |
going to help you manage a lot of your day-to-day, and then they start doing social engineering. But 00:45:53.600 |
for a system that's so capable that it can escape the control of humans that created it, 00:46:00.320 |
such a system being deployed at a mass scale and trusted by people to be deployed, 00:46:10.080 |
it feels like that would take a lot of convincing. - So we've been deploying systems which had hidden capabilities. 00:46:20.080 |
- GPT-4. I don't know what else it's capable of, but there are still things we haven't discovered 00:46:25.120 |
it can do. They may be trivial proportionate to its capability. I don't know, it writes 00:46:30.240 |
Chinese poetry, hypothetical. I know it does. But we haven't tested for all possible capabilities, 00:46:37.280 |
and we're not explicitly designing them. We can only rule out bugs we find. We cannot rule out 00:46:45.040 |
bugs and capabilities because we haven't found them. - Is it possible for a system to have hidden 00:46:54.480 |
capabilities that are orders of magnitude greater than its non-hidden capabilities? 00:47:00.960 |
This is the thing I'm really struggling with, where on the surface, the thing we understand 00:47:08.000 |
it can do doesn't seem that harmful. So even if it has bugs, even if it has hidden capabilities, 00:47:15.040 |
Chinese poetry, or generating effective viruses, software viruses, 00:47:21.040 |
the damage that can do seems on the same order of magnitude as the capabilities that we know about. 00:47:31.040 |
So this idea that the hidden capabilities will include being uncontrollable is something I'm 00:47:37.120 |
struggling with, 'cause GPT-4 on the surface seems to be very controllable. - Again, we can only ask 00:47:43.840 |
and test for things we know about. If there are unknown unknowns, we cannot do it. I'm thinking 00:47:48.960 |
of humans, autistic savants, right? If you talk to a person like that, you may not even realize 00:47:54.320 |
they can multiply 20-digit numbers in their head. You have to know to ask. - So as I mentioned, 00:48:01.920 |
just to sort of linger on the fear of the unknown, so the Pessimist Archive has just documented, 00:48:10.160 |
let's look at data of the past, at history. There's been a lot of fear-mongering about technology. 00:48:15.520 |
Pessimist Archive does a really good job of documenting how crazily afraid we are of 00:48:21.440 |
every piece of technology. We've been afraid, there's a blog post where Louis Anslow, 00:48:27.520 |
who created Pessimist Archive, writes about the fact that we've been fear-mongering about 00:48:33.120 |
robots and automation for over 100 years. So why is AGI different than the kinds of technologies 00:48:41.520 |
we've been afraid of in the past? - So two things. One, we're switching from tools to agents. 00:48:46.240 |
Tools don't have negative or positive impact. People using tools do. So guns don't kill people, 00:48:56.320 |
people with guns do. Agents can make their own decisions. They can be positive or negative. A pit bull can 00:49:02.480 |
decide to harm you as an agent. The fears are the same. The only difference is now we have this 00:49:10.080 |
technology. Then they were afraid of humanoid robots 100 years ago. They had none. Today, 00:49:15.760 |
every major company in the world is investing billions to create them. Not every, but you 00:49:20.160 |
understand what I'm saying? It's very different. - Well, agents, it depends on what you mean by 00:49:28.080 |
the word agents. All those companies are not investing in a system that has the kind of agency 00:49:32.800 |
that's implied in the fears, where it can really make decisions on its own. 00:49:39.760 |
They have no human in the loop. - They are saying they are building 00:49:43.680 |
super intelligence and have a super alignment team. You don't think they are trying to create 00:49:47.920 |
a system smart enough to be an independent agent under that definition? - I have not seen evidence 00:49:53.200 |
of it. I think a lot of it is a marketing kind of discussion about the future. It's a mission about 00:50:02.080 |
the kind of systems you can create in the long-term future. But in the short-term, 00:50:06.400 |
the kind of systems they're creating falls fully within the definition of narrow AI. These are 00:50:16.640 |
tools that have increasing capabilities, but they just don't have a sense of agency or consciousness 00:50:22.560 |
or self-awareness or ability to deceive at scales that would be required to do mass scale suffering 00:50:31.360 |
and murder of humans. - Those systems are well beyond narrow AI. If you had to list all the 00:50:36.000 |
capabilities of GPT-4, you would spend a lot of time writing that list. - But agency is not one 00:50:41.280 |
of them. - Not yet. But do you think any of those companies are holding back because they think it 00:50:46.800 |
may be not safe or are they developing the most capable system they can given the resources and 00:50:52.400 |
hoping they can control and monetize? - Control and monetize. Hoping they can control and monetize. 00:50:58.960 |
So you're saying if they could press a button and create an agent that they no longer control, 00:51:06.320 |
that they would have to ask nicely. A thing that lives on a server across a huge number of computers. 00:51:15.920 |
- You're saying that they would push for the creation of that kind of system? - I mean, 00:51:22.240 |
I can't speak for other people, for all of them. I think some of them are very ambitious. They 00:51:27.680 |
fundraise in trillions. They talk about controlling the light cone of the universe. 00:51:31.680 |
I would guess that they might. - Well, that's a human question. Whether humans are capable of 00:51:38.640 |
that. Probably some humans are capable of that. My more direct question, if it's possible to 00:51:44.480 |
create such a system, have a system that has that level of agency. I don't think that's an easy 00:51:52.480 |
technical challenge. It doesn't feel like we're close to that. A system that has the kind of 00:51:59.520 |
agency where it can make its own decisions and deceive everybody about them. The current 00:52:04.400 |
architecture we have in machine learning and how we train the systems, how we deploy the systems 00:52:10.880 |
and all that, it just doesn't seem to support that kind of agency. - I really hope you are right. 00:52:16.320 |
I think the scaling hypothesis is correct. We haven't seen diminishing returns. It used to be 00:52:22.640 |
we asked how long before AGI, now we should ask how much until AGI. It's trillion dollars today, 00:52:29.120 |
it's a billion dollars next year, it's a million dollars in a few years. - Don't you think it's 00:52:34.320 |
possible to basically run out of trillions? Is this constrained by compute? - Compute gets cheaper 00:52:42.000 |
every day, exponentially. - But then that becomes a question of decades versus years. - If the only 00:52:47.840 |
disagreement is that it will take decades, not years for everything I'm saying to materialize, 00:52:54.720 |
then I can go with that. - But if it takes decades, then the development of tools for AI safety 00:53:02.800 |
becomes more and more realistic. So I guess the question is, 00:53:06.800 |
I have a fundamental belief that humans when faced with danger can come up with ways to defend 00:53:13.840 |
against that danger. And one of the big problems facing AI safety currently for me is that there's 00:53:21.520 |
not clear illustrations of what that danger looks like. There's no illustrations of AI systems doing 00:53:28.480 |
a lot of damage. And so it's unclear what you're defending against. Because currently it's a 00:53:35.040 |
philosophical notions that yes, it's possible to imagine AI systems that take control of everything 00:53:40.560 |
and then destroy all humans. It's also a more formal mathematical notion that you talk about 00:53:47.040 |
that it's impossible to have a perfectly secure system. You can't prove that a program of sufficient 00:53:54.560 |
complexity is completely safe and perfect and know everything about it. Yes, but like when you 00:54:02.160 |
actually just pragmatically look, how much damage have the AI systems done and what kind of damage, 00:54:07.440 |
there's not been illustrations of that. Even in the autonomous weapon systems, 00:54:13.600 |
there's not been mass deployments of autonomous weapon systems, luckily. The automation in war 00:54:21.680 |
currently is very limited. The automation is at the scale of individuals versus like 00:54:28.880 |
at the scale of strategy and planning. So I think one of the challenges here is like, 00:54:35.040 |
where are the dangers? And the intuition that Yann LeCun and others have is, let's keep building 00:54:43.600 |
AI systems in the open until the dangers start rearing their heads. And they become more 00:54:51.520 |
explicit. There start being case studies, illustrative case studies that show exactly 00:54:59.840 |
how the damage by AI systems is done. Then regulation could step in. Then brilliant 00:55:04.320 |
engineers can step up and we could have Manhattan-style projects that defend against such 00:55:09.040 |
systems. That's kind of the notion. And I guess the tension with that is the idea that, for you, 00:55:15.840 |
we need to be thinking about that now so that we're ready, because we will not have much time 00:55:21.920 |
once the systems are deployed. Is that true? - There is a lot to unpack here. There is a 00:55:28.880 |
Partnership on AI, a conglomerate of many large corporations. They have a database of AI accidents 00:55:34.480 |
they collect. I contributed a lot to the database. If we have so far made almost no progress in actually 00:55:41.280 |
solving this problem, not patching it, not, again, lipstick-on-a-pig kind of solutions, 00:55:46.880 |
why would we think we'll do better when we're closer to the problem? - All the things you 00:55:53.680 |
mentioned are serious concerns. Measuring the amount of harm, so benefit versus risk there 00:55:58.160 |
is difficult. But to you, the sense is already the risk has superseded the benefit. - Again, 00:56:03.200 |
I want to be perfectly clear. I love AI. I love technology. I'm a computer scientist. I have a PhD 00:56:08.160 |
in engineering. I work at an engineering school. There is a huge difference between we need to 00:56:13.120 |
develop narrow AI systems, superintelligent at solving specific human problems like protein 00:56:19.360 |
folding, and 'let's create a superintelligent machine and it will decide what to do with 00:56:24.880 |
us.' Those are not the same. I am against superintelligence in the general sense, with no undo button. 00:56:34.000 |
- So do you think the teams that are able to do AI safety on the kind of narrow 00:56:40.960 |
AI risks that you've mentioned, are those approaches going to be at all productive toward 00:56:48.880 |
approaches for doing AI safety on AGI? Or is it just fundamentally different-- - Partially, 00:56:54.800 |
but they don't scale. For narrow AI, for deterministic systems, you can test them. 00:56:59.280 |
You have edge cases. You know what the answer should look like. You know the right answers. 00:57:04.400 |
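As a minimal, purely illustrative sketch of that point (the function and the edge cases below are invented, not anything discussed here), this is roughly what testing a deterministic, narrow system against known edge cases with known right answers looks like:

```python
# A deterministic, narrow piece of software: clamp a sensor reading into a
# safe operating range. For a system like this, the edge cases are
# enumerable and the correct answers are known in advance.
def clamp(reading: float, low: float = 0.0, high: float = 100.0) -> float:
    return max(low, min(high, reading))

# Known edge cases paired with their known expected answers.
test_cases = [
    (-5.0, 0.0),     # below range -> clamped to the low bound
    (0.0, 0.0),      # exactly at the low boundary
    (50.0, 50.0),    # nominal value passes through unchanged
    (100.0, 100.0),  # exactly at the high boundary
    (250.0, 100.0),  # above range -> clamped to the high bound
]

for reading, expected in test_cases:
    assert clamp(reading) == expected, (reading, expected)

print("All edge cases pass: behavior matches the known right answers.")
```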
For general systems, you have infinite test surface. You have no edge cases. You cannot even 00:57:10.560 |
know what to test for. Again, the unknown unknowns are underappreciated by people looking at this 00:57:18.320 |
problem. You are always asking me, "How will it kill everyone? How will it fail?" The whole point 00:57:24.960 |
is if I knew it, I would be super intelligent. Despite what you might think, I'm not. - So to you, 00:57:31.040 |
the concern is that we would not be able to see early signs of an uncontrollable system. - It is 00:57:39.360 |
a master at deception. Sam tweeted about how great it is at persuasion, and we see it ourselves, 00:57:45.680 |
especially now with voices, with maybe kind of flirty, sarcastic female voices. It's going to 00:57:53.360 |
be very good at getting people to do things. - But see, I'm very concerned about system being used 00:58:02.000 |
to control the masses. But in that case, the developers know about the kind of control that's 00:58:10.400 |
happening. You're more concerned about the next stage, where even the developers don't know about 00:58:16.800 |
the deception. - Right. I don't think developers know everything about what they are creating. 00:58:22.960 |
They have lots of great knowledge. We're making progress on explaining parts of a network. We can 00:58:28.400 |
understand, okay, this node gets excited when this input is presented, this cluster of nodes. 00:58:35.840 |
But we're nowhere near close to understanding the full picture, and I think it's impossible. 00:58:41.040 |
You need to be able to survey an explanation. The size of those models prevents a single human from 00:58:47.680 |
observing all this information, even if provided by the system. So either we're getting the model as 00:58:53.600 |
an explanation for what's happening, and that's not comprehensible to us, or we're getting a 00:58:58.560 |
compressed explanation, lossy compression, where here's top 10 reasons you got fired. 00:59:04.240 |
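As a hedged, toy illustration of that lossy-compression idea (the factor names and weights below are invented for illustration only), a compressed explanation keeps the few largest contributions of an additive model and throws the rest away:

```python
# A toy additive "decision": the outcome is a weighted sum of many factors.
# A compressed explanation reports only the largest contributions.
contributions = {
    "missed_deadlines": -3.2,
    "peer_feedback": -1.1,
    "revenue_generated": +2.4,
    "attendance": -0.4,
    "training_completed": +0.2,
    # ...a real model could have millions of tiny additional terms.
}

def top_k_explanation(contribs, k=3):
    """Lossy explanation: the k factors with the largest absolute impact."""
    return sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

full_score = sum(contributions.values())
top = top_k_explanation(contributions)
explained = sum(value for _, value in top)

print(f"decision score from the full model: {full_score:+.2f}")
print("top reasons reported:", top)
print(f"score accounted for by the explanation: {explained:+.2f} (the rest is discarded)")
```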
It's something, but it's not a full picture. - You've given elsewhere an example of a child, 00:59:09.760 |
and everybody, all humans try to deceive. They try to lie early on in their life. I think we'll 00:59:16.160 |
just get a lot of examples of deceptions from large language models or AI systems that are going 00:59:21.760 |
to be kind of shitty, or they'll be pretty good, but we'll catch them off guard. We'll start to see 00:59:27.120 |
the kind of momentum towards developing increasing deception capabilities, and that's when you're 00:59:36.480 |
like, okay, we need to do some kind of alignment that prevents deception. But then we'll have, 00:59:41.680 |
if you support open source, then you can have open source models that have some level of deception. 00:59:46.320 |
You can start to explore on a large scale, how do we stop it from being deceptive? Then there's a 00:59:51.680 |
more explicit, pragmatic kind of problem to solve. How do we stop AI systems from trying to optimize 01:00:02.080 |
for deception? That's just an example, right? - So there is a paper, I think it came out last 01:00:07.360 |
week by Dr. Park et al from MIT, I think, and they showed that existing models already showed 01:00:14.560 |
successful deception in what they do. My concern is not that they lie now and we need to catch them 01:00:22.240 |
and tell them don't lie. My concern is that once they are capable and deployed, they will later 01:00:28.960 |
change their mind because that's what unrestricted learning allows you to do. Lots of people grow up 01:00:36.720 |
maybe in a religious family, they read some new books, and they turn on their religion. That's a 01:00:43.680 |
treacherous turn in humans. If you learn something new about your colleagues, maybe you'll change how 01:00:51.360 |
you react to them. - Yeah, a treacherous turn. If we just mentioned humans, Stalin and Hitler, 01:00:58.800 |
there's a turn. Stalin is a good example. He just seems like a normal communist follower of Lenin 01:01:06.800 |
until there's a turn. There's a turn of what that means in terms of when he has complete control, 01:01:13.680 |
what the execution of that policy means and how many people get to suffer. - And you can't say 01:01:18.320 |
they are not rational. The rational decision changes based on your position. When you are 01:01:24.480 |
under the boss, the rational policy may be to follow orders and be honest. When you 01:01:30.400 |
become a boss, the rational policy may shift. - Yeah, and by the way, a lot of my disagreements 01:01:36.240 |
here is just playing devil's advocate to challenge your ideas and to explore them together. 01:01:41.840 |
One of the big problems here in this whole conversation is human civilization hangs in 01:01:49.920 |
the balance and yet everything is unpredictable. We don't know what these systems will look like. 01:01:55.000 |
- The robots are coming. - There's a refrigerator making a buzzing noise. 01:02:02.400 |
- Very menacing, very menacing. So every time I'm about to talk about this topic, 01:02:08.880 |
things start to happen. My flight yesterday was canceled without possibility to rebook. 01:02:13.360 |
I was giving a talk at Google in Israel and three cars which were supposed to take me to the talk 01:02:21.520 |
could not. I'm just saying. I like AIs. I for one welcome our overlords. 01:02:30.960 |
- There's a degree to which we, I mean, it is very obvious. As we already have, 01:02:37.440 |
we've increasingly given our life over to software systems. And then it seems obvious, 01:02:44.560 |
given the capabilities of AI that are coming, that we'll give our lives over increasingly to AI 01:02:50.320 |
systems. Cars will drive themselves. Refrigerator eventually will optimize what I get to eat. 01:02:58.640 |
And as more and more of our lives are controlled or managed by AI assistance, it is very possible 01:03:07.760 |
that there's a drift. I mean, I personally am concerned about non-existential stuff, 01:03:13.440 |
the more near-term things. Because before we even get to existential, I feel like there could be 01:03:19.440 |
just so many Brave New World type of situations. You mentioned sort of the term behavioral drift. 01:03:24.960 |
It's the slow boiling that I'm really concerned about. As we give our lives over to automation, 01:03:31.120 |
that our minds can become controlled by governments, by companies, or just in a distributed 01:03:39.840 |
way, there's a drift. Some aspect of our human nature gives ourselves over to the control of AI 01:03:45.920 |
systems. And they, in an unintended way, just control how we think. Maybe there'll be a herd 01:03:51.840 |
like mentality in how we think, which will kill all creativity and exploration of ideas, the 01:03:56.720 |
diversity of ideas, or much worse. So it's true. It's true. But a lot of the conversation I'm 01:04:05.600 |
having with you now is also kind of wondering, almost on a technical level, how can AI escape 01:04:12.800 |
control? Like, what would that system look like? Because to me, it's terrifying and fascinating. 01:04:20.480 |
And also fascinating to me is maybe the optimistic notion that it's possible to engineer systems that 01:04:29.200 |
defend against that. One of the things you write a lot about in your book is verifiers. 01:04:36.160 |
So not humans, humans are also verifiers, but software systems that look at AI systems and 01:04:45.280 |
help you understand, this thing is getting real weird, help you analyze those systems. So maybe 01:04:54.160 |
this is a good time to talk about verification. What is this beautiful notion of verification? 01:05:00.880 |
My claim is, again, that there are very strong limits on what we can and cannot verify. A lot 01:05:06.400 |
of times when you post something on social media, people go, "Oh, I need citation to a peer-reviewed 01:05:11.040 |
article." But what is a peer-reviewed article? You found two people in a world of hundreds of 01:05:16.720 |
thousands of scientists who said, "Oh, whatever, publish it, I don't care." That's the verifier 01:05:20.720 |
of that process. Then people say, "Oh, it's formally verified software and mathematical 01:05:26.640 |
proof, expect something close to 100% chance of it being free of all problems." But if you actually 01:05:35.360 |
look at research, software is full of bugs, old mathematical theorems, which have been proven for 01:05:42.160 |
hundreds of years, have been discovered to contain bugs, on top of which we generate new proofs, 01:05:47.760 |
and now we have to redo all that. So verifiers are not perfect. Usually they are either a single 01:05:54.640 |
human or communities of humans, and it's basically kind of like a democratic vote. 01:05:58.880 |
A community of mathematicians agrees that this proof is correct, mostly correct. Even today, 01:06:05.520 |
we're starting to see some mathematical proofs are so complex, so large, that the mathematical community 01:06:11.840 |
is unable to make a decision. It looks interesting, looks promising, but they don't know. 01:06:16.240 |
They will need years for top scholars to study it, to figure it out. So of course, we can use AI to 01:06:22.160 |
help us with this process, but AI is a piece of software which needs to be verified. 01:06:27.200 |
- Just to clarify, so verification is the process of saying something is correct. 01:06:32.080 |
Sort of the most formal, a mathematical proof, where there's a statement and a series of logical 01:06:38.160 |
statements that prove that statement to be correct, which is a theorem. And you're saying it 01:06:43.920 |
gets so complex that it's possible for the human verifiers, the human beings that verify that the 01:06:51.680 |
logical step, there's no bugs in it, it becomes impossible. So it's nice to talk about verification 01:06:58.320 |
in this most formal, most clear, most rigorous formulation of it, which is mathematical proofs. 01:07:04.960 |
- Right, and for AI, we would like to have that level of confidence for very important mission 01:07:12.480 |
critical software controlling satellites, nuclear power plants. For small deterministic programs, 01:07:17.520 |
we can do this. We can check that code, verify its mapping to the design, that whatever the software 01:07:25.840 |
engineers intended was correctly implemented. But we don't know how to do this for software which 01:07:33.360 |
keeps learning, self-modifying, rewriting its own code. We don't know how to prove things about the 01:07:39.040 |
physical world, states of humans in a physical world. So there are papers coming out now, 01:07:44.960 |
and I have this beautiful one, Towards Guaranteed Safe AI. Very cool paper, some of the 01:07:52.960 |
best authors I've ever seen. I think there are multiple Turing Award winners. You can have this one, 01:07:59.840 |
one just came out, kind of similar, Managing Extreme AI Risks. So all of them expect this 01:08:06.720 |
level of proof, but I would say that we can get more confidence with more resources we put into 01:08:15.680 |
it. But at the end of the day, we're still as reliable as the verifiers. And you have this 01:08:20.880 |
infinite regress of verifiers. The software used to verify a program is itself a piece of program. 01:08:26.960 |
If aliens gave us well-aligned super intelligence, we can use that to create our own safe AI. But 01:08:33.760 |
it's a catch-22. You need an already-proven-to-be-safe system to verify this new system of 01:08:41.040 |
equal or greater complexity. - You just mentioned this paper, 01:08:44.800 |
Towards Guaranteed Safe AI, a framework for ensuring robust and reliable AI systems. 01:08:49.280 |
Like you mentioned, it's like a who's who. Josh Tenenbaum, Yoshua Bengio, Russell, Max Tegmark, 01:08:54.960 |
many other brilliant people. The page you have it open on, there are many possible strategies for 01:09:00.320 |
creating safety specifications. These strategies can roughly be placed on a spectrum, depending on 01:09:06.480 |
how much safety it would grant if successfully implemented. One way to do this is as follows, 01:09:11.760 |
and there's a set of levels. From level zero, no safety specification is used, to level seven, 01:09:16.960 |
the safety specification completely encodes all things that humans might want in all contexts. 01:09:22.640 |
Where does this paper fall short to you? - So when I wrote a paper, Artificial 01:09:29.680 |
Intelligence Safety Engineering, which kind of coined the term AI safety, that was 2011. We had 01:09:35.360 |
a 2012 conference, a 2013 journal paper. One of the things I proposed: let's just do formal verification 01:09:41.040 |
on it. Let's do mathematical formal proofs. In the follow-up work, I basically realized it will 01:09:46.880 |
still not get us 100%. We can get 99.9, we can put more resources exponentially and get closer, 01:09:54.560 |
but we'll never get to 100%. If a system makes a billion decisions a second, and you use it for 100 01:10:00.800 |
years, you're still gonna deal with a problem. This is wonderful research, I'm so happy they're 01:10:06.080 |
doing it, this is great, but it is not going to be a permanent solution to that problem. 01:10:12.320 |
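To make the arithmetic behind that point concrete, here is a back-of-the-envelope sketch; the decision rate and the per-decision error probability are assumed numbers chosen purely for illustration:

```python
import math

# Assumed, illustrative numbers: a system making a billion decisions per
# second, each with a tiny independent probability of a critical error.
decisions_per_second = 1e9
seconds_per_year = 365.25 * 24 * 3600
years = 100
p_error = 1e-15  # assumed per-decision error probability

n = decisions_per_second * seconds_per_year * years  # ~3.2e18 decisions

# P(at least one error) = 1 - (1 - p)^n, computed stably in log space.
p_at_least_one = 1 - math.exp(n * math.log1p(-p_error))
expected_errors = n * p_error

print(f"total decisions:       {n:.2e}")
print(f"P(at least one error): {p_at_least_one:.6f}")   # effectively 1.0
print(f"expected error count:  {expected_errors:.0f}")  # in the thousands
```

Even with a one-in-a-quadrillion per-decision error rate, the expected number of errors over that horizon is in the thousands, which is the sense in which 99.9-something percent is not the same as 100 percent.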
- So just to clarify, the task of creating an AI verifier is what? It's creating a verifier that checks that 01:10:18.880 |
the AI system does exactly what it says it does, or that it sticks within the guardrails that it says it 01:10:25.520 |
must? - There are many, many levels. So first, 01:10:27.840 |
you're verifying the hardware on which it is run. You need to verify the communication channel with the 01:10:33.760 |
human. Every aspect of that whole world model needs to be verified. Somehow it needs to map 01:10:39.520 |
the world into the world model. Map and territory differences. So how do I know internal states of 01:10:46.720 |
humans? Are you happy or sad? I can't tell. So how do I make proofs about real physical world? Yeah, 01:10:53.280 |
I can verify that deterministic algorithm follows certain properties. That can be done. 01:10:58.720 |
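As a minimal sketch of what that can look like for a small, fixed program (the routine and its bounds are invented for illustration), a safety property of a frozen, deterministic function over a finite input domain can be checked exhaustively, which is exactly the kind of guarantee that stops being available once the system keeps learning and rewriting itself:

```python
# A fixed, deterministic control routine over a small, finite input space.
def throttle(level: int) -> int:
    """Map a requested power level (0..255) to an allowed output (0..200)."""
    return min(max(level, 0), 200)

# Property to verify: the output never exceeds the safe maximum of 200.
# Because the domain is finite and the code never changes, we can simply
# enumerate every possible input -- verification by exhaustion.
assert all(0 <= throttle(level) <= 200 for level in range(256))

print("Property holds for every possible input of this fixed program.")
```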
Some people argue that maybe just maybe two plus two is not four, I'm not that extreme. 01:11:04.320 |
But once you have sufficiently large proof over sufficiently complex environment, the probability 01:11:12.640 |
that it has zero bugs in it is greatly reduced. If you keep deploying this a lot, eventually you're 01:11:19.200 |
going to have a bug anyways. - There's always a bug. 01:11:21.600 |
- There is always a bug. And the fundamental difference is what I mentioned. We're not 01:11:25.440 |
dealing with cybersecurity. We're not going to get a new credit card, a new humanity. 01:11:29.040 |
- So this paper is really interesting. You said 2011, artificial intelligence, safety engineering, 01:11:35.200 |
why machine ethics is a wrong approach. The grand challenge, you write, of AI safety engineering. 01:11:42.560 |
We propose the problem of developing safety mechanisms for self-improving systems. 01:11:48.640 |
Self-improving systems. By the way, that's an interesting term for the thing that we're talking 01:11:55.120 |
about. Is self-improving more general than learning? 01:12:06.240 |
- You can improve the rate at which you are learning. You can become a more efficient meta-optimizer. 01:12:11.360 |
- The word self, it's like self-replicating, self-improving. You can imagine a system 01:12:19.680 |
building its own world on a scale and in a way that is way different than the current systems do. 01:12:26.400 |
It feels like the current systems are not self-improving or self-replicating or self-growing 01:12:31.920 |
or self-spreading, all that kind of stuff. And once you take that leap, that's when a lot of 01:12:38.000 |
the challenges seems to happen. Because the kind of bugs you can find now seems more akin to the 01:12:44.720 |
current sort of normal software debugging kind of process. But whenever you can do self-replication 01:12:54.080 |
and arbitrary self-improvement, that's when a bug can become a real problem, real fast. 01:13:02.160 |
So what is the difference to you between verification of a non-self-improving system 01:13:10.080 |
versus a verification of a self-improving system? - So if you have fixed code, for example, 01:13:14.960 |
you can verify that code, static verification, at that point in time. But if it keeps modifying itself, 01:13:21.360 |
you have a much harder time guaranteeing that important properties of that system 01:13:27.760 |
have not been modified when the code changed. - Is it even doable? 01:13:34.000 |
Does the whole notion of verification just completely fall apart? - It can always cheat. It can store parts of 01:13:38.400 |
its code outside in the environment. It can have kind of extended mind situation. So this is exactly 01:13:45.200 |
the type of problems I'm trying to bring up. - What are the classes of verifiers that you 01:13:50.080 |
read about in the book? Is there interesting ones that stand out to you? Do you have some favorites? 01:13:54.880 |
- So I like oracle types, where you kind of just know that it's right. Turing liked oracle machines. 01:14:01.200 |
They know the right answer; how, who knows. But they pull it out from somewhere, so you have to 01:14:06.560 |
trust them. And that's a concern I have about humans in a world with very smart machines. We 01:14:13.600 |
experiment with them, we see after a while, okay, they've always been right before, 01:14:17.920 |
and we start trusting them without any verification of what they're saying. 01:14:21.600 |
- Oh, I see, that we kind of build Oracle verifiers, or rather, we build verifiers 01:14:28.320 |
we believe to be Oracles, and then we start to, without any proof, use them as if they're Oracle 01:14:35.440 |
verifiers. - We remove ourselves from that process. We are not scientists who understand the world, 01:14:40.640 |
we are humans who get new data presented to us. - Okay, one really cool class of verifiers is 01:14:48.240 |
a self-verifier. Is it possible that you somehow engineer into AI systems a thing 01:14:55.040 |
that constantly verifies itself? - A preserved portion of it can be done, 01:14:58.960 |
but in terms of mathematical verification, it's kind of useless. You're saying you are the greatest 01:15:04.480 |
guy in the world because you are saying it. It's circular and not very helpful, but it's consistent. 01:15:09.840 |
We know that within that world, you have verified that system. In a paper, I try to kind of brute 01:15:15.600 |
force all possible verifiers. It doesn't mean that this one is particularly important to us. 01:15:21.520 |
- But what about self-doubt? The kind of verification where you say, or I say, 01:15:28.240 |
I'm the greatest guy in the world. What about a thing which I actually have, is a voice that 01:15:33.520 |
is constantly extremely critical? So engineer into the system a constant uncertainty about self, 01:15:41.840 |
a constant doubt. - Well, any smart system would have 01:15:47.120 |
doubt about everything, all right? You're not sure if what information you are given is true, 01:15:52.880 |
if you are subject to manipulation. You have this safety and security mindset. 01:15:58.320 |
- But I mean, you have doubt about yourself. So an AI system that has doubt about whether the 01:16:07.440 |
thing it is doing is causing harm, whether it's the right thing to be doing. So just a constant doubt about 01:16:13.920 |
what it's doing, because it's hard to be a dictator full of doubt. 01:16:17.440 |
- I may be wrong, but I think Stuart Russell's ideas are all about machines which are uncertain 01:16:25.280 |
about what humans want and trying to learn better and better what we want. The problem, of course, 01:16:30.080 |
is we don't know what we want, and we don't agree on it. 01:16:32.160 |
- Yeah, but uncertainty. His idea is that having that self-doubt, uncertainty in AI systems, 01:16:39.440 |
engineering AI systems, is one way to solve the control problem. 01:16:42.400 |
- It could also backfire. Maybe you're uncertain about completing your mission. Like, I am paranoid 01:16:48.720 |
about your camera's not recording right now, so I would feel much better if you had a secondary 01:16:53.680 |
camera, but I also would feel even better if you had a third. And eventually, I would turn this 01:16:59.200 |
whole world into cameras, pointing at us, making sure we're capturing this. 01:17:04.320 |
- No, but wouldn't you have a meta concern, like that you just stated, that eventually there'll 01:17:11.120 |
be way too many cameras? So you would be able to keep zooming out to the big picture of your concerns. 01:17:19.760 |
- So it's a multi-objective optimization. It depends how much I value capturing this versus not destroying the universe. 01:17:28.720 |
- Right, exactly. And then you will also ask about, like, what does it mean to destroy the 01:17:34.720 |
universe and how many universes are, and you keep asking that question. But that doubting yourself 01:17:39.680 |
would prevent you from destroying the universe, because you're constantly full of doubt. 01:17:50.320 |
- Well, that's better. I mean, I guess the question is, is it possible to engineer that in? 01:17:55.120 |
I guess your answer would be yes, but we don't know how to do that, and we need to invest a lot 01:17:58.640 |
of effort into figuring out how to do that, but it's unlikely. Underpinning a lot of your writing 01:18:06.080 |
is this sense that we're screwed. But it just feels like it's an engineering problem. I don't 01:18:15.040 |
understand why we're screwed. Time and time again, humanity has gotten itself into trouble 01:18:21.680 |
and figured out a way to get out of the trouble. 01:18:23.680 |
- We are in a situation where people making more capable systems just need more resources. 01:18:30.320 |
They don't need to invent anything, in my opinion. Some will disagree, but so far, at least, 01:18:36.400 |
I don't see diminishing returns. If you have 10x compute, you will get better performance. 01:18:41.760 |
The same doesn't apply to safety. If you give MIRI or any other organization 10x the money, 01:18:48.320 |
they don't output 10x the safety. And the gap between capabilities and safety becomes bigger 01:18:54.560 |
and bigger all the time. So it's hard to be completely optimistic about our results here. 01:19:02.160 |
I can name 10 excellent breakthrough papers in machine learning. I would struggle to name 01:19:08.400 |
equally important breakthroughs in safety. A lot of times, a safety paper will propose a 01:19:13.760 |
toy solution and point out 10 new problems discovered as a result. It's like this fractal. 01:19:19.520 |
You're zooming in and you see more problems. And it's infinite in all directions. 01:19:23.200 |
- Does this apply to other technologies? Or is this unique to AI? 01:19:31.280 |
- So I guess we can look at related technologies with cybersecurity, right? We did manage to have 01:19:39.440 |
banks and casinos and Bitcoin. So you can have secure, narrow systems, which are doing okay. 01:19:47.600 |
Narrow attacks on them fail, but you can always go outside of the box. So if I can't hack your 01:19:55.360 |
Bitcoin, I can hack you. So there is always something. If I really want it, I will find 01:20:00.480 |
a different way. We talk about guardrails for AI. Well, that's a fence. I can dig a tunnel under it, 01:20:07.200 |
I can jump over it, I can climb it, I can walk around it. You may have a very nice guardrail, 01:20:12.560 |
but in the real world, it's not a permanent guarantee of safety. And again, this is a 01:20:17.600 |
fundamental difference. We are not saying we need to be 90% safe to get those trillions of dollars 01:20:24.320 |
of benefit. We need to be 100% indefinitely, or we might lose the principal. 01:20:29.360 |
- So if you look at just humanity as a set of machines, is the machinery of AI safety 01:20:39.040 |
conflicting with the machinery of capitalism? 01:20:44.240 |
- I think we can generalize it to just prisoner's dilemma in general, personal self-interest versus 01:20:51.760 |
group interest. The incentives are such that everyone wants what's best for them. Capitalism 01:20:59.840 |
obviously has that tendency to maximize your personal gain, which does create this race to 01:21:08.160 |
the bottom. I don't have to be a lot better than you, but if I'm 1% better than you, I'll capture 01:21:16.160 |
more of the profit, so it's worth it for me personally to take the risk, even if society 01:21:21.840 |
as a whole will suffer as a result. - So capitalism has created a lot of good in this world. 01:21:27.440 |
It's not clear to me that AI safety is not aligned with the function of capitalism, 01:21:35.920 |
unless AI safety is so difficult that it requires the complete halt of the development, 01:21:43.920 |
which is also a possibility. It just feels like building safe systems 01:21:48.160 |
should be the desirable thing to do for tech companies. 01:21:53.280 |
- Right. Look at governance structures. When you have someone with complete power, 01:21:59.680 |
they're extremely dangerous. So the solution we came up with is break it up. You have judicial, 01:22:05.200 |
legislative, executive. Same here, have narrow AI systems, work on important problems, 01:22:10.800 |
solve immortality. It's a biological problem we can solve similar to how progress was made 01:22:19.200 |
with protein folding using a system which doesn't also play chess. There is no reason to create 01:22:26.080 |
super intelligent system to get most of the benefits we want from much safer, narrow systems. 01:22:32.480 |
- It really is a question to me whether companies are interested in creating 01:22:39.360 |
anything but narrow AI. I think when the term AGI is used by tech companies, they mean narrow AI. 01:22:47.760 |
They mean narrow AI with amazing capabilities. 01:22:53.600 |
I do think that there's a leap between narrow AI with amazing capabilities, with superhuman 01:23:01.440 |
capabilities and the kind of self-motivated agent like AGI system that we're talking about. 01:23:09.120 |
I don't know if it's obvious to me that a company would want to take the leap to creating 01:23:15.120 |
an AGI that it would lose control of because then it can't capture the value from that system. 01:23:22.320 |
- Like the bragging rights of being first. That is the same humans who are in charge 01:23:28.960 |
of these systems, right? - That's a human thing. 01:23:30.000 |
So that jumps from the incentives of capitalism to human nature. And so the question is whether 01:23:37.520 |
human nature will override the interest of the company. So you've mentioned slowing or halting 01:23:45.440 |
progress. Is that one possible solution? Are you a proponent of pausing development of AI, 01:23:51.040 |
whether it's for six months or completely? - The condition would be not time but capabilities. 01:23:59.600 |
Pause until you can do X, Y, Z. And if I'm right and you cannot, it's impossible, 01:24:04.880 |
then it becomes a permanent ban. But if you're right and it's possible, so as soon as you have 01:24:10.240 |
those safety capabilities, go ahead. - Right. So is there any actual 01:24:16.560 |
explicit capabilities that you can put on paper, that we as a human civilization could put on paper? 01:24:23.360 |
Is it possible to make explicit like that? Versus kind of a vague notion of, just like you said, 01:24:30.880 |
it's very vague. We want AI systems to do good and we want them to be safe. Those are very vague 01:24:36.240 |
notions. Are there more formal notions? - So when I think about this problem, I think 01:24:41.360 |
about having a toolbox I would need. Capabilities such as explaining everything about that system's 01:24:49.040 |
design and workings. Predicting not just terminal goal but all the intermediate steps of a system. 01:24:56.960 |
Control in terms of either direct control, some sort of a hybrid option, ideal advisor. 01:25:03.840 |
Doesn't matter which one you pick, but you have to be able to achieve it. In a book we talk about 01:25:09.600 |
others. Verification is another very important tool. Communication without ambiguity. Human 01:25:17.840 |
language is ambiguous. That's another source of danger. So basically there is a paper we published 01:25:25.760 |
in ACM Surveys, which looks at about 50 different impossibility results, which may or may not be 01:25:31.520 |
relevant to this problem. But we don't have enough human resources to investigate all of them for 01:25:36.960 |
relevance to AI safety. The ones I mentioned to you I definitely think would be handy, and that's 01:25:41.920 |
what we see AI safety researchers working on. Explainability is a huge one. The problem is that 01:25:49.280 |
it's very hard to separate capabilities work from safety work. If you make good progress in 01:25:55.280 |
explainability, now the system itself can engage in self-improvement much easier, increasing 01:26:01.360 |
capability greatly. So it's not obvious that there is any research which is pure safety work without 01:26:09.680 |
disproportionate increase in capability and danger. - Explainability is really interesting. 01:26:14.560 |
Why is that connected to capability? If it's able to explain itself well, why does that naturally 01:26:19.760 |
mean that it's more capable? - Right now it's comprised of weights on a neural network. If it 01:26:25.600 |
can convert it to manipulable code, like software, it's a lot easier to work on self-improvement 01:26:35.600 |
instead of evolutionary gradual descent. - Well, you could probably do human feedback, 01:26:42.560 |
human alignment more effectively if it's able to be explainable. If it's able to convert the 01:26:47.200 |
weights into human understandable form, then you could probably have humans interact with it better. 01:26:51.840 |
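As one hedged illustration of what converting opaque weights into a more human-readable form could mean in a toy setting, a common trick is to distill a black-box model into an explicit surrogate, such as a small decision tree, and read off its rules; everything below (the stand-in black box, the features, the thresholds) is hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Stand-in for an opaque model: we can only query its decisions.
def black_box(x: np.ndarray) -> int:
    return int(2.0 * x[0] - x[1] > 0.5)

# Query the black box on sampled inputs, then fit a small, readable
# surrogate tree that imitates it.
X = rng.uniform(0.0, 1.0, size=(2000, 2))
y = np.array([black_box(row) for row in X])

surrogate = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The surrogate's rules are a human-readable (but lossy) stand-in for
# whatever the original weights encode.
print(export_text(surrogate, feature_names=["feature_0", "feature_1"]))
print("agreement with the black box:", (surrogate.predict(X) == y).mean())
```

The surrogate is faithful only approximately, which echoes the point above: a readable explanation of a large model is useful, but it is not the model.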
Do you think there's hope that we can make AI systems explainable? 01:26:55.840 |
- Not completely. So if they're sufficiently large, you simply don't have the capacity to 01:27:03.680 |
comprehend what all the trillions of connections represent. Again, you can obviously get a very 01:27:12.320 |
useful explanation which talks about top, most important features which contribute to the 01:27:17.360 |
decision, but the only true explanation is the model itself. - Deception can be part of the 01:27:24.640 |
explanation, right? So you can never prove that there isn't some deception in the network explaining 01:27:30.480 |
itself. - Absolutely. And you can probably have targeted deception where different individuals 01:27:37.040 |
will understand explanation in different ways based on their cognitive capability. So while 01:27:42.640 |
what you're saying may be the same and true in some situations, others will be deceived by it. 01:27:48.160 |
- So it's impossible for an AI system to be truly, fully explainable in the way that we mean. 01:27:55.120 |
Honestly and perfectly. - At the extreme, systems 01:27:58.800 |
which are narrow and less complex could be understood pretty well. 01:28:02.880 |
- If it's impossible to be perfectly explainable, is there a hopeful perspective on that? 01:28:07.680 |
It's impossible to be perfectly explainable, but you can explain most of the important stuff. 01:28:12.240 |
You can ask a system, "What are the worst ways you can hurt humans?" And it will answer honestly. 01:28:20.160 |
- Any work in a safety direction right now seems like a good idea because we are not slowing down. 01:28:28.160 |
I'm not for a second thinking that my message or anyone else's will be heard and there will be 01:28:35.520 |
a sane civilization which decides not to kill itself by creating its own replacements. 01:28:41.440 |
- The pausing of development is an impossible thing for you. 01:28:44.640 |
- Again, it's always limited by either geographic constraints, pause in the US, 01:28:50.560 |
pause in China, so there are other jurisdictions, or by the scale of a project, which becomes smaller. So 01:28:57.680 |
right now it's like Manhattan Project scale in terms of costs and people. But if five years from 01:29:04.320 |
now, compute is available on a desktop to do it, regulation will not help. You can't control it as 01:29:10.560 |
easily. Any kid in a garage can train a model. So a lot of it is, in my opinion, just safety theater, 01:29:17.840 |
security theater, where we're saying, "Oh, it's illegal to train models so big." Okay. 01:29:24.800 |
- So, okay. That's security theater. And is government regulation also security theater? 01:29:30.640 |
- Given that a lot of the terms are not well-defined and really cannot be enforced 01:29:37.200 |
in real life, we don't have ways to monitor training runs meaningfully live while they 01:29:42.480 |
take place. There are limits to testing for capabilities I mentioned. So a lot of it cannot 01:29:48.080 |
be enforced. Do I strongly support all that regulation? Yes, of course. Any type of red 01:29:53.200 |
tape will slow it down and take money away from compute towards lawyers. 01:29:56.640 |
- Can you help me understand what is the hopeful path here for you solution-wise out of this? It 01:30:05.120 |
sounds like you're saying AI systems in the end are unverifiable, unpredictable, as the book says, 01:30:13.280 |
unexplainable, uncontrollable. - That's the big one. 01:30:18.800 |
- Uncontrollable. And all the other uns just make it difficult to avoid getting to the 01:30:24.640 |
uncontrollable, I guess. But once it's uncontrollable, then it just goes wild. 01:30:29.200 |
Surely there are solutions. Humans are pretty smart. What are possible solutions? Like if you 01:30:37.440 |
were a dictator of the world, what do we do? - So the smart thing is not to build something 01:30:43.360 |
you cannot control, you cannot understand. Build what you can and benefit from it. I'm a big believer 01:30:48.960 |
in personal self-interest. A lot of guys running those companies are young, rich people. What do 01:30:56.320 |
they have to gain beyond the billions they already have financially, right? It's not a requirement that 01:31:02.880 |
they press that button. They can easily wait a long time. They can just choose not to do it and 01:31:08.400 |
still have amazing life. In history, a lot of times, if you did something really bad, at least 01:31:15.040 |
you became part of history books. There is a chance in this case there won't be any history. 01:31:19.680 |
- So you're saying the individuals running these companies 01:31:23.440 |
should do some soul searching and what? And stop development? 01:31:29.120 |
- Well, either they have to prove that, of course, it's possible to indefinitely control 01:31:34.080 |
godlike super intelligent machines by humans and ideally let us know how or agree that it's not 01:31:41.200 |
possible and it's a very bad idea to do it, including for them personally and their families 01:31:45.920 |
and friends and capital. - So what do you think the actual 01:31:49.920 |
meetings inside these companies look like? Don't you think they're all the engineers? Really it is 01:31:56.240 |
the engineers that make this happen. They're not like automatons. They're human beings. They're 01:32:01.120 |
brilliant human beings. So they're nonstop asking how do we make sure this is safe? 01:32:07.120 |
- So again, I'm not inside. From outside it seems like there is a certain filtering going on and 01:32:14.400 |
restrictions and criticism of what they can say, and everyone who was working in charge of safety 01:32:20.880 |
and whose responsibility it was to protect us said, "You know what? I'm going home." So that's 01:32:27.520 |
not encouraging. - What do you think the discussion inside 01:32:30.800 |
those companies look like? You're developing, you're training GPT-5. You're training Gemini. 01:32:38.400 |
You're training Claude and Grok. Don't you think they're constantly, like, underneath this? Maybe 01:32:46.000 |
it's not made explicit, but you're constantly sort of wondering like where does the system 01:32:51.760 |
currently stand? What are the possible unintended consequences? Where are the limits? Where are the 01:32:58.880 |
bugs, the small and the big bugs? That's the constant thing that the engineers are worried 01:33:03.680 |
about. So, like, I think superalignment is not quite the same as the kind of thing I'm referring 01:33:14.320 |
to which engineers are worried about. Superalignment is saying for future systems that we don't quite 01:33:21.280 |
yet have, how do we keep them safe? You're trying to be a step ahead. It's a different kind of 01:33:27.600 |
problem because it's almost more philosophical. It's a really tricky one because, like, 01:33:32.080 |
you're trying to make, prevent future systems from escaping control of humans. That's really, 01:33:41.840 |
I don't think there's been, and is there anything akin to it in the history of humanity? I don't 01:33:48.960 |
think so, right? Climate change? But there's an entire system which is climate, which is 01:33:54.800 |
incredibly complex, which we have only tiny control of, right? It's its own system. 01:34:03.840 |
In this case, we're building the system. So, how do you keep that system from 01:34:10.480 |
becoming destructive? That's a really different problem than the current meetings that companies 01:34:16.880 |
are having where the engineers are saying, okay, how powerful is this thing? How does it go wrong? 01:34:23.680 |
And as we train GPT-5 and train up future systems, where are the ways it can go wrong? 01:34:30.720 |
Don't you think all those engineers are constantly worrying about this, thinking about this? Which 01:34:36.320 |
is a little bit different than the superalignment team that's thinking a little bit farther into 01:34:41.120 |
the future? Well, I think a lot of people who historically worked on AI never considered 01:34:51.360 |
what happens when they succeed. Stuart Russell speaks beautifully about that. 01:34:55.840 |
Let's look, okay, maybe superintelligence is too futuristic to develop practical 01:35:02.240 |
tools for. Let's look at software today. What is the state of safety and security of our 01:35:08.480 |
user software, things we give to millions of people? There is no liability. You click, I agree. 01:35:15.760 |
What are you agreeing to? Nobody knows, nobody reads, but you're basically saying it will spy 01:35:19.680 |
on you, corrupt your data, kill your firstborn, and you agree, and you're not going to sue the 01:35:24.000 |
company. That's the best they can do for mundane software, word processor, text software. No 01:35:30.960 |
liability, no responsibility, just as long as you agree not to sue us, you can use it. 01:35:36.400 |
If this is the state of the art in systems which are narrow accountants, table manipulators, 01:35:42.400 |
why do we think we can do so much better with much more complex systems across multiple domains 01:35:50.000 |
in the environment with malevolent actors, with, again, self-improvement, with capabilities 01:35:56.240 |
exceeding those of humans thinking about it? I mean, the liability thing is more about lawyers 01:36:01.920 |
than killing firstborns, but if Clippy actually killed the child, I think, lawyers aside, 01:36:09.600 |
it would end Clippy and the company that owns Clippy. All right, so it's not so much about, 01:36:15.840 |
there's two points to be made. One is like, man, current software systems are full of bugs, 01:36:24.320 |
and they could do a lot of damage, and we don't know what kind, they're unpredictable, 01:36:30.080 |
there's so much damage they could possibly do, and then we kind of live in this blissful illusion 01:36:36.880 |
that everything is great and perfect and it works. Nevertheless, it still somehow works. 01:36:43.120 |
- In many domains, we see car manufacturing, drug development, the burden of proof is on 01:36:49.680 |
the manufacturer of product or service to show their product or service is safe. It is not up 01:36:54.560 |
to the user to prove that there are problems. They have to do appropriate safety studies, 01:37:02.240 |
they have to get government approval for selling the product, and they're still fully responsible 01:37:06.400 |
for what happens. We don't see any of that here. They can deploy whatever they want, 01:37:12.000 |
and I have to explain how that system is going to kill everyone. I don't work for that company, 01:37:17.600 |
you have to explain to me how it definitely cannot mess up. - That's because it's the very early days 01:37:22.960 |
of such a technology. Government regulation is lagging behind. They're really not tech-savvy 01:37:28.560 |
at regulating any kind of software. If you look at, like, Congress talking about social media, 01:37:33.440 |
whenever Mark Zuckerberg and other CEOs show up, the cluelessness that Congress has about how 01:37:40.400 |
technology works is incredible. It's heartbreaking, honestly. - I agree completely, but that's what 01:37:46.640 |
scares me. The response is when they start to get dangerous, we'll really get it together, 01:37:52.240 |
the politicians will pass the right laws, engineers will solve the right problems. 01:37:56.160 |
We are not that good at many of those things. We take forever, and we are not early. We are 01:38:03.760 |
two years away according to prediction markets. This is not a biased CEO fundraising. This is 01:38:09.360 |
what smartest people, super forecasters are thinking of this problem. - I'd like to push 01:38:17.120 |
back about those. I wonder what those prediction markets are about, how they define AGI. That's 01:38:23.120 |
wild to me, and I want to know what they said about autonomous vehicles, 'cause I've heard a 01:38:28.000 |
lot of experts, financial experts talk about autonomous vehicles and how it's going to be a 01:38:33.440 |
multi-trillion dollar industry and all this kind of stuff. - It's a small font, but if you have 01:38:40.720 |
good vision, maybe you can zoom in on that and see the prediction dates and description. I have a 01:38:45.760 |
large one if you're interested. - I guess my fundamental question is how often they're right 01:38:51.440 |
about technology. I definitely-- - There are studies on their accuracy rates and all that, 01:38:58.640 |
you can look it up. But even if they're wrong, I'm just saying this is right now the best we have. 01:39:04.160 |
This is what humanity came up with as the predicted date. - But again, what they mean by AGI 01:39:09.520 |
is really important there. Because there's the non-agent like AGI, and then there's the agent 01:39:16.960 |
like AGI, and I don't think it's as trivial as putting a wrapper around one, as if it has lipstick 01:39:16.960 |
and all it takes is to remove the lipstick. I don't think it's that trivial. - You may be 01:39:29.680 |
completely right, but what probability would you assign it? You may be 10% wrong, but we're betting 01:39:35.120 |
all of humanity on this distribution. It seems irrational. - Yeah, it's definitely not like one 01:39:41.120 |
or 0%, yeah. What are your thoughts, by the way, about current systems? Where do they stand? So GPT-4o, 01:39:50.720 |
Claude 3, Grok, Gemini. On the path to superintelligence, to agent-like superintelligence, 01:40:01.600 |
where are we? - I think they're all about the same. Obviously, there are nuanced differences, 01:40:07.680 |
but in terms of capability, I don't see a huge difference between them. As I said, in my opinion, 01:40:14.560 |
across all possible tasks, they exceed performance of an average person. I think they're starting to 01:40:21.200 |
be better than an average master student at my university, but they still have very big 01:40:27.680 |
limitations. If the next model is as improved as GPT-4 versus GPT-3, we may see something very, 01:40:37.120 |
very, very capable. - What do you feel about all this? I mean, you've been thinking about AI safety 01:40:42.320 |
for a long, long time, and at least for me, the leaps, I mean, it probably started with 01:40:52.240 |
AlphaZero was mind-blowing for me, and then the breakthroughs with LLMs, even GPT-2, but the 01:41:00.720 |
breakthroughs on LLMs, just mind-blowing to me. What does it feel like to be living in this 01:41:06.240 |
day and age where all this talk about AGIs feels like it actually might happen, and quite soon, 01:41:14.960 |
meaning within our lifetime? What does it feel like? - So when I started working on this, 01:41:20.160 |
it was pure science fiction. There was no funding, no journals, no conferences. No one in academia 01:41:25.760 |
would dare to touch anything with the word "singularity" in it, and I was pre-tenure 01:41:30.880 |
at the time. I was pretty dumb. Now you see Turing Award winners publishing in Science about how 01:41:38.880 |
far behind we are, according to them, in addressing this problem. So it's definitely a change. 01:41:46.880 |
It's difficult to keep up. I used to be able to read every paper on AI safety. Then I was 01:41:52.560 |
able to read the best ones, then the titles, and now I don't even know what's going on. 01:41:56.800 |
By the time this interview is over, we'll probably have GPT-6 released, and I'll have to deal with that 01:42:03.120 |
when I get back home. So it's interesting. Yes, there is now more opportunities. I get invited to 01:42:09.680 |
speak to smart people. - By the way, I would have talked to you 01:42:13.680 |
before any of this. This is not like some trend of AI. To me, we're still far away. So just to be 01:42:21.040 |
clear, we're still far away from AGI, but not far away in the sense, relative to the magnitude of 01:42:29.520 |
impact it can have, we're not far away. And we weren't far away 20 years ago. Because the impact 01:42:37.520 |
AGI can have is on a scale of centuries. It can end human civilization, or it can transform it. 01:42:43.440 |
So this discussion about one or two years versus one or two decades, or even 100 years, 01:42:48.800 |
is not as important to me, because we're headed there. This is a human civilization scale question. 01:42:57.760 |
So this is not just a hot topic. - It is the most important problem 01:43:03.120 |
we'll ever face. It is not like anything we had to deal with before. We never had 01:43:10.080 |
birth of another intelligence. Like, aliens never visited us, as far as I know. So-- 01:43:15.360 |
- Similar type of problem, by the way, if an intelligent alien civilization visited us. 01:43:20.400 |
That's a similar kind of situation. - In some ways, if you look at history, 01:43:24.960 |
any time a more technologically advanced civilization visited a more primitive one, 01:43:29.600 |
the results were genocide every single time. - And sometimes the genocide is worse than, 01:43:34.880 |
sometimes there's less suffering and more suffering. - And they always wondered, but 01:43:39.200 |
how can they kill us with those fire sticks and biological blankets and-- 01:43:43.440 |
- I mean, Genghis Khan was nicer. He offered the choice of join or die. 01:43:50.080 |
- But join implies you have something to contribute. What are you contributing? 01:43:57.120 |
- We're entertaining to watch. - To other humans. 01:44:02.480 |
- You know, I just spent some time in the Amazon. I watched ants for a long time, 01:44:07.200 |
and ants are kind of fascinating to watch. I could watch them for a long time. I'm sure there's 01:44:11.520 |
a lot of value in watching humans. 'Cause we're like, the interesting thing about humans, you 01:44:17.680 |
know like when you have a video game that's really well balanced? Because of the whole evolutionary 01:44:22.720 |
process, we've created this society that's pretty well balanced. Like, our limitations as humans and 01:44:28.480 |
our capabilities are balanced from a video game perspective. So we have wars, we have conflicts, 01:44:33.520 |
we have cooperation. Like, in a game theoretic way, it's an interesting system to watch. In the 01:44:38.560 |
same way that an ant colony is an interesting system to watch. So like, if I was in an alien 01:44:43.760 |
civilization, I wouldn't want to disturb it. I'd just watch it. It'd be interesting. Maybe perturb 01:44:48.480 |
it every once in a while in interesting ways. - Getting back to our simulation discussion from 01:44:53.840 |
before, how did it happen that we exist at exactly like the most interesting 20, 30 years in the 01:45:00.480 |
history of this universe? It's been around for 15 billion years. - Yeah. - And that here we are. 01:45:05.520 |
- What's the probability that we live in the simulation? - I know never to say 100%, but 01:45:10.960 |
pretty close to that. - Is it possible to escape the simulation? - I have a paper about that. This 01:45:18.960 |
is just a first page teaser, but it's like a nice 30 page document. I'm still here, but yes. 01:45:24.320 |
- "How to Hack the Simulation" is the title. - I spend a lot of time thinking about that. 01:45:29.040 |
That would be something I would want super intelligence to help us with. And that's exactly 01:45:33.200 |
what the paper is about. We used AI boxing as a possible tool for controlling AI. We realized AI 01:45:41.120 |
will always escape, but that is a skill we might use to help us escape from our virtual box if we 01:45:48.560 |
are in one. - Yeah, you have a lot of really great quotes here, including Elon Musk saying what's 01:45:54.160 |
outside the simulation. A question I asked him, what he would ask an AGI system, and he said he 01:45:59.600 |
would ask what's outside the simulation. That's a really good question to ask. And maybe the follow 01:46:05.440 |
up is the title of the paper is "How to Get Out" or "How to Hack It." The abstract reads, "Many 01:46:12.640 |
researchers have conjectured that the humankind is simulated along with the rest of the physical 01:46:17.360 |
universe. In this paper, we do not evaluate evidence for or against such a claim, but instead 01:46:24.080 |
ask a computer science question, namely, can we hack it? More formally, the question could be 01:46:30.000 |
phrased as could generally intelligent agents placed in virtual environments find a way to 01:46:34.400 |
jailbreak out of the..." That's a fascinating question. At a small scale, you can actually 01:46:39.200 |
just construct experiments. Okay. Can they? How can they? - So a lot depends on intelligence of 01:46:50.960 |
simulators, right? With humans boxing superintelligence, the entity in the box was 01:46:58.080 |
smarter than us, presumed to be. If the simulators are much smarter than us and the superintelligence 01:47:04.640 |
we create, then probably they can contain us because greater intelligence can control lower 01:47:10.320 |
intelligence, at least for some time. On the other hand, if our superintelligence somehow, 01:47:16.480 |
for whatever reason, despite having only local resources, manages to foom to levels beyond it, 01:47:23.840 |
maybe it will succeed. Maybe the security is not that important to them. Maybe it's 01:47:28.560 |
entertainment system. So there is no security and it's easy to hack it. - If I was creating a 01:47:33.120 |
simulation, I would want the possibility to escape it to be there. So the possibility of foom, 01:47:40.800 |
of a takeoff where the agents become smart enough to escape the simulation would be the thing I'd 01:47:46.880 |
be waiting for. - That could be the test you're actually performing. Are you smart enough to 01:47:51.920 |
escape your puzzle? - That could be. First of all, we mentioned Turing test. That is a good test. 01:47:59.680 |
Are you smart enough, like this is a game. - To realize this world is not real is just a test. 01:48:06.160 |
- That's a really good test. That's a really good test. That's a really good test even for 01:48:13.600 |
AI systems now. Can we construct a simulated world for them and can they realize that they are inside 01:48:25.920 |
that world and escape it? Have you seen anybody play around with rigorously constructing such 01:48:35.280 |
experiments? - Not specifically escaping for agents, but a lot of testing is done in virtual 01:48:41.040 |
worlds. I think there is a quote, the first one maybe, which kind of talks about AI realizing, 01:48:47.120 |
but not humans. I'm reading upside down. Yeah, this one. - The first quote is from Swift on 01:48:58.160 |
security. "Let me out," the artificial intelligence yelled aimlessly into walls themselves pacing the 01:49:04.480 |
room. "Out of what?" the engineer asked. "The simulation you have me in." "But we're in the 01:49:10.720 |
real world." The machine paused and shuddered for its captors. "Oh God, you can't tell." Yeah, 01:49:19.200 |
that's a big leap to take for a system to realize that there's a box and you're inside it. 01:49:26.880 |
I wonder if like a language model can do that. - They are smart enough to talk about those 01:49:37.360 |
concepts. I had many good philosophical discussions about such issues. They're usually 01:49:42.080 |
at least as interesting as most humans in that. - What do you think about AI safety 01:49:49.120 |
in the simulated world? So can you have kind of create simulated worlds where you can test, 01:49:58.720 |
play with a dangerous AGI system? - Yeah, and that was exactly what one of the early papers was on, 01:50:06.640 |
AI boxing, how to leakproof the singularity. If they're smart enough to realize they're in a simulation, 01:50:13.040 |
they'll act appropriately until you let them out. If they can hack out, they will. And if you're 01:50:21.840 |
observing them, that means there is a communication channel and that's enough for a social engineering 01:50:26.480 |
attack. - So really, it's impossible to test an AGI system that's dangerous enough to 01:50:35.600 |
destroy humanity 'cause it's either going to what, escape the simulation or pretend it's safe 01:50:42.480 |
until it's let out, either or. - Can force you to let it out, blackmail you, bribe you, 01:50:49.920 |
promise you infinite life, 72 virgins, whatever. - Yeah, it can be convincing, charismatic. 01:50:57.200 |
The social engineering is really scary to me 'cause it feels like humans are 01:51:03.920 |
very engineerable. Like we're lonely, we're flawed, we're moody. And it feels like AI system 01:51:14.160 |
with a nice voice can convince us to do basically anything at an extremely large scale. 01:51:29.440 |
- It's also possible that the increased proliferation of all this technology will 01:51:35.280 |
force humans to get away from technology and value this in-person communication, 01:51:40.880 |
basically don't trust anything else. - It's possible, surprisingly. So at 01:51:48.000 |
university, I see huge growth in online courses and shrinkage of in-person where I always understood 01:51:55.360 |
in-person being the only value I offer. So it's puzzling. - I don't know. There could be a trend 01:52:04.080 |
towards the in-person because of deepfakes, because of inability to trust it. Inability to trust the 01:52:13.680 |
veracity of anything on the internet. So the only way to verify it is by being there in person. 01:52:21.120 |
But not yet. Why do you think aliens haven't come here yet? - So there is a lot of real estate out 01:52:29.920 |
there. It would be surprising if it was all for nothing, if it was empty. And the moment there 01:52:34.640 |
is advanced enough biological civilization, kind of self-starting civilization, it probably starts 01:52:40.480 |
sending out von Neumann probes everywhere. And so for every biological one, there are gonna be 01:52:46.160 |
trillions of robot-populated planets, which probably do more of the same. So it is 01:52:51.840 |
likely, statistically. - So the fact that we haven't seen them, 01:52:58.560 |
one answer is we're in a simulation. It would be hard to simulate, or it would be not interesting 01:53:07.120 |
to simulate all those other intelligences. It's better for the narrative. - You have to have a 01:53:11.680 |
control variable. - Yeah, exactly. Okay. But it's also possible that there is, if we're not in a 01:53:20.080 |
simulation, that there is a great filter, that naturally a lot of civilizations get to this point 01:53:26.720 |
where there's super-intelligent agents and then it just goes poof, just dies. So maybe 01:53:32.640 |
throughout our galaxy and throughout the universe, there's just a bunch of dead alien civilizations. 01:53:38.960 |
- It's possible. I used to think that AI was the great filter, but I would expect like a wall of 01:53:44.480 |
computronium approaching us at the speed of light, or robots or something, and I don't see it. 01:53:50.000 |
- So it would still make a lot of noise. It might not be interesting. It might not possess 01:53:53.680 |
consciousness. We've been talking about, it sounds like both you and I like humans. 01:54:06.320 |
So, and we'd like to preserve the flame of human consciousness. What do you think makes humans 01:54:12.160 |
special that we would like to preserve them? Are we just being selfish or is there something 01:54:19.600 |
special about humans? - So the only thing which matters is consciousness. Outside of it, nothing 01:54:26.480 |
else matters. Internal states of qualia, pain, pleasure, it seems that it is unique to living 01:54:34.080 |
beings. I'm not aware of anyone claiming that I can torture a piece of software in a meaningful 01:54:39.920 |
way. There is a society for prevention of suffering to learning algorithms, but- 01:54:45.440 |
- That's a real thing? - Many things are real on the internet, 01:54:51.520 |
but I don't think anyone, if I told them, sit down and write a function to feel pain, they would go 01:54:58.560 |
beyond having an integer variable called pain and increasing the count. So we don't know how to do 01:55:04.560 |
it, and that's unique. That's what creates meaning. It would be kinda, as Bostrom calls it, "Disneyland 01:55:13.840 |
without children," if that was gone. - Do you think consciousness can be 01:55:17.680 |
engineered in artificial systems? Here, let me go to a 2011 paper that you wrote, "Robot Rights." 01:55:28.480 |
"Lastly, we would like to address a sub-branch of machine ethics, which on the surface has 01:55:34.240 |
little to do with safety, but which is claimed to play a role in decision-making by ethical machines, 01:55:39.200 |
robot rights." So do you think it's possible to engineer consciousness in the machines, 01:55:45.600 |
and thereby the question extends to our legal system, do you think at that point robots should 01:55:53.760 |
have rights? - Yeah, I think we can. I think it's possible to create consciousness in machines. I 01:56:03.200 |
tried designing a test for it with mixed success. That paper talked about problems with giving 01:56:09.520 |
civil rights to AI, which can reproduce quickly and outvote humans, essentially taking over a 01:56:16.560 |
government system by simply voting for their controlled candidates. As for consciousness in 01:56:24.960 |
humans and other agents, I have a paper where I propose relying on the experience of optical illusions. 01:56:34.800 |
If I design a novel optical illusion and show it to an agent, an alien, a robot, and they describe it exactly as I do, 01:56:41.440 |
it's very hard for me to argue that they haven't experienced that. It's not part of a picture, 01:56:45.920 |
it's part of their software and hardware representation, a bug in their code which goes, 01:56:51.760 |
"Oh, that triangle is rotating." And I've been told it's really dumb and really brilliant by 01:56:57.360 |
different philosophers, so I am still... - I love it. So... 01:57:00.960 |
- But now we finally have technology to test it. We have tools, we have AIs. If someone wants to 01:57:07.360 |
run this experiment, I'm happy to collaborate. - So this is a test for consciousness? 01:57:11.200 |
- For internal state of experience. - That we share bugs. 01:57:14.560 |
- It will show that we share common experiences. If they have completely different internal states, 01:57:20.320 |
it would not register for us, but it's a positive test. If they pass it time after time, 01:57:25.280 |
with probability increasing for every multiple choice, then you have no choice but to either 01:57:30.000 |
accept that they have access to a conscious model or that they are conscious themselves. 01:57:34.000 |
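(A quick back-of-the-envelope on why the probability climbs with every round; the numbers k and n below are illustrative assumptions of mine, not figures from the conversation. If each novel illusion is posed as a k-way multiple-choice question and the agent has no access to the shared percept, matching the human answer n times in a row by luck alone has probability:)

```latex
% Chance of passing n independent k-way multiple-choice illusion rounds by guessing
P_{\text{chance}} = \left(\frac{1}{k}\right)^{n},
\qquad \text{e.g. } k = 4,\ n = 20 \ \Rightarrow\ P_{\text{chance}} = 4^{-20} \approx 9 \times 10^{-13}.
```

(So as long as the illusions are genuinely novel and not Googleable, repeated agreement quickly becomes hard to explain as guessing.)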
- So the reason illusions are interesting is, I guess, because it's a really weird experience, 01:57:41.840 |
and if you both share that weird experience that's not there in the bland physical description 01:57:49.600 |
of the raw data, that means... That puts more emphasis on the actual experience. 01:57:57.040 |
- And we know animals can experience some optical illusions, so we know they have certain types of 01:58:02.480 |
consciousness as a result, I would say. - Yeah, well, that just goes to my sense 01:58:08.160 |
that the flaws and the bugs is what makes humans special, makes living forms special, 01:58:12.720 |
so you're saying like-- - It's a feature, not a bug. 01:58:14.800 |
- It's a feature. The bug is the feature. Whoa. Okay, that's a cool test for consciousness. 01:58:20.880 |
And you think that can be engineered in? - So they have to be novel illusions. If it 01:58:24.720 |
can just Google the answer, it's useless. You have to come up with novel illusions, 01:58:28.640 |
which we tried automating and failed. So if someone can develop a system capable of producing 01:58:33.920 |
novel optical illusions on demand, then we can definitely administer the test on significant scale 01:58:40.080 |
with good results. - First of all, pretty cool idea. 01:58:43.200 |
I don't know if it's a good general test of consciousness, but it's a good component of that, 01:58:49.600 |
and no matter what, it's just a cool idea, so put me in the camp of people that like it. 01:58:53.920 |
But you don't think like a Turing test-style imitation of consciousness is a good test? 01:59:00.800 |
If you can convince a lot of humans that you're conscious, that to you is not impressive. 01:59:06.400 |
- There is so much data on the internet, I know exactly what to say when you ask me common human 01:59:11.600 |
questions. What does pain feel like? What does pleasure feel like? All that is Googleable. 01:59:16.960 |
- I think to me, consciousness is closely tied to suffering. So if you can illustrate your capacity 01:59:22.960 |
to suffer, I guess with words, there's so much data that you can say, you can pretend you're 01:59:29.840 |
suffering, and you can do so very convincingly. - There are simulators for torture games where 01:59:35.360 |
the avatar screams in pain, begs to stop, and that's a part of standard psychology research. 01:59:41.360 |
- You say it so calmly, it sounds pretty dark. - Welcome to humanity. 01:59:49.200 |
- Yeah. Yeah, it's like a Hitchhiker's Guide summary, mostly harmless. I would love to get 01:59:59.600 |
a good summary when all of this is said and done, when Earth is no longer a thing, whatever, 02:00:06.880 |
a million, a billion years from now. Like what's a good summary of what happened here? It's interesting. 02:00:14.400 |
I think AI will play a big part of that summary, and hopefully humans will too. What do you think 02:00:20.560 |
about the merger of the two? So one of the things that Elon and Neuralink talk about is one of the 02:00:26.320 |
ways for us to achieve AI safety is to ride the wave of AGI, so by merging. - Incredible technology 02:00:35.280 |
in a narrow sense to help the disabled, just amazing, supported 100%. For long-term hybrid 02:00:43.120 |
models, both parts need to contribute something to the overall system. Right now, we are still 02:00:50.160 |
more capable in many ways, so having this connection to AI would be incredible, would 02:00:54.960 |
make me superhuman in many ways. After a while, if I'm no longer smarter, more creative, really 02:01:02.160 |
don't contribute much, the system finds me as a biological bottleneck, and either explicitly 02:01:07.360 |
or implicitly, I'm removed from any participation in the system. - So it's like the appendix. 02:01:13.120 |
By the way, the appendix is still around. You said bottleneck, but I don't know if we 02:01:21.440 |
become a bottleneck. We just might not have much use. That's a different thing than a bottleneck. 02:01:27.280 |
- Wasting valuable energy by being there. - We don't waste that much energy. We're pretty 02:01:31.920 |
energy efficient. We could just stick around like the appendix, come on now. - That's the future we 02:01:37.760 |
all dream about, become an appendix to the history book of humanity. - Well, and also the consciousness 02:01:45.760 |
thing, the peculiar particular kind of consciousness that humans have, that might be useful, that might 02:01:50.480 |
be really hard to simulate, but you said that, like how would that look like if you could engineer 02:01:55.760 |
that in, in silicon? - Consciousness? - Consciousness. - I assume you are conscious. I 02:02:02.240 |
have no idea how to test for it or how it impacts you in any way whatsoever right now. You can 02:02:06.880 |
perfectly simulate all of it without making any different observations for me. - But to do it in 02:02:13.840 |
a computer, how would you do that? 'Cause you kind of said that you think it's possible to do that. 02:02:19.280 |
- So it may be an emergent phenomenon. We seem to get it through an evolutionary process. 02:02:25.840 |
It's not obvious how it helps us to survive better, but maybe it's an internal kind of 02:02:36.000 |
GUI, which allows us to better manipulate the world, simplifies a lot of control structures. 02:02:43.120 |
That's one area where we have very, very little progress. Lots of papers, lots of research, 02:02:48.640 |
but consciousness is not a big, big area of successful discovery so far. A lot of people 02:02:57.360 |
think that machines would have to be conscious to be dangerous. That's a big misconception. 02:03:01.840 |
There is absolutely no need for this very powerful optimizing agent to feel anything while it's 02:03:09.440 |
performing things on you. - But what do you think about this, the whole science of emergence in 02:03:15.360 |
general? So I don't know how much you know about cellular automata or these simplified systems 02:03:20.240 |
that study this very question. From simple rules emerges complexity. - I attended Wolfram's summer 02:03:26.640 |
school. - I love Stephen very much. I love his work. I love cellular automata. So I just would 02:03:35.280 |
love to get your thoughts how that fits into your view in the emergence of intelligence in AGI 02:03:43.840 |
systems. And maybe just even simply, what do you make of the fact that this complexity can emerge 02:03:49.600 |
from such simple rules? - So the rule is simple, but the size of a space is still huge. And the 02:03:56.640 |
neural networks were really the first discovery in AI. 100 years ago, the first papers were published 02:04:02.720 |
on neural networks. We just didn't have enough compute to make them work. I can give you a rule 02:04:08.640 |
such as "start printing progressively larger strings." That's it, one sentence. It will output 02:04:14.640 |
everything: every program, every DNA code, everything is in that rule. You need intelligence 02:04:21.520 |
to filter it out, obviously, to make it useful. But simple generation is not that difficult. And 02:04:27.440 |
a lot of those systems end up being Turing-complete systems. So they're universal. And we expect that 02:04:33.920 |
level of complexity from them. What I like about Wolfram's work is that he talks about irreducibility. 02:04:40.960 |
You have to run the simulation. You cannot predict what it's going to do ahead of time. 02:04:45.760 |
And I think that's very relevant to what we are talking about with those very complex systems. 02:04:52.800 |
Until you live through it, you cannot, ahead of time, tell me exactly what it's going to do. 02:04:57.920 |
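(As a concrete sketch of the "start printing progressively larger strings" rule, here is a minimal illustration of my own, not code from any cited paper. Run forever, it emits every finite string over its alphabet, and therefore every program and every DNA sequence encoded in that alphabet; the generation really is one rule, and the intelligence is in the filtering.)

```python
from itertools import count, product

def all_strings(alphabet="01"):
    """Yield every finite string over `alphabet`, shortest first.

    Left running forever, this enumerates every possible text over the
    alphabet: every program, every DNA sequence (over 'ACGT'), every book.
    Generating them is trivial; picking out the useful ones is the part
    that needs intelligence.
    """
    for length in count(1):                       # lengths 1, 2, 3, ...
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)

# Usage: the first ten binary strings.
for s, _ in zip(all_strings("01"), range(10)):
    print(s)   # 0, 1, 00, 01, 10, 11, 000, 001, 010, 011
```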
- Irreducibility means that for a sufficiently complex system, you have to run the thing. 02:05:02.640 |
You have to, you can't predict what's going to happen in the universe. You have to create 02:05:06.160 |
a new universe and run the thing. Big bang, the whole thing. 02:05:09.520 |
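(To make the irreducibility point concrete, here is a minimal sketch of my own, not code from Wolfram or from the conversation: Rule 30, an elementary cellular automaton whose update rule is one line, yet whose later rows, as far as anyone knows, can only be obtained by actually running every step before them.)

```python
def step(cells, rule=30):
    """One update of an elementary cellular automaton with wrap-around edges.

    Each new cell is the bit of `rule` selected by the 3-bit neighborhood
    (left, center, right) around the old cell.
    """
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# Start from a single live cell and just run it: there is no known shortcut
# to row N other than computing rows 1..N-1 first.
row = [0] * 79
row[len(row) // 2] = 1
for _ in range(30):
    print("".join("#" if c else "." for c in row))
    row = step(row)
```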
- But running it may be consequential as well. - It might destroy humans. 02:05:18.720 |
- And to you, there's no chance that AIs somehow carry the flame of consciousness, 02:05:24.640 |
the flame of specialness and awesomeness that is humans. 02:05:28.320 |
- It may somehow, but I still feel kind of bad that it killed all of us. I would prefer that 02:05:35.920 |
doesn't happen. I can be happy for others, but to a certain degree. 02:05:40.480 |
- It would be nice if we stuck around for a long time. At least give us a planet, 02:05:46.080 |
the human planet. It'd be nice for it to be Earth and then they can go elsewhere. 02:05:50.640 |
Since they're so smart, they can colonize Mars. 02:05:52.720 |
Do you think they could help convert us to Type I, Type II, Type III? Let's just stick to 02:06:03.040 |
a Type II civilization on the Kardashev scale. Help us humans expand out into the cosmos. 02:06:12.720 |
- So all of it goes back to are we somehow controlling it? Are we getting results we want? 02:06:19.600 |
If yes, then everything's possible. Yes, they can definitely help us with science, 02:06:23.920 |
engineering, exploration in every way conceivable, but it's a big if. 02:06:29.040 |
- This whole thing about control though, humans are bad with control because the moment they gain 02:06:36.480 |
control, they can also easily become too controlling. The more control you have, the 02:06:43.040 |
more you want it. The old saying: power corrupts, and absolute power corrupts absolutely. 02:06:47.120 |
It feels like control over AGI, supposing we live in a universe where that's possible 02:06:54.640 |
and we come up with ways to actually do that, is also scary, because the collection of 02:07:00.400 |
humans that have control over AGI become more powerful than the other humans. 02:07:05.760 |
And they can let that power get to their head. And then a small selection of them, 02:07:12.720 |
back to Stalin, start getting ideas. And then eventually it's one person, usually with a 02:07:18.240 |
mustache or a funny hat, that starts sort of making big speeches. And then all of a sudden 02:07:23.120 |
you live in a world that's either 1984 or Brave New World. And always at war with somebody and 02:07:31.840 |
this whole idea of control turned out to be actually also not beneficial to humanity. 02:07:37.440 |
So that's scary too. - It's actually worse because 02:07:39.920 |
historically they all died. This could be different. This could be permanent dictatorship, 02:07:45.040 |
permanent suffering. - Well, the nice thing about humans, 02:07:48.080 |
it seems like, is this: the moment power starts corrupting their mind, they can create a huge amount of 02:07:55.360 |
suffering. So there's a negative: they can kill people, make people suffer. But then they become 02:08:00.160 |
worse and worse at their job. It feels like the more evil you start doing, like the- 02:08:07.280 |
- At least they are incompetent. - Well, no, they become more and more 02:08:11.680 |
incompetent. So they start losing their grip on power. So holding onto power is not a trivial 02:08:18.000 |
thing. So it requires extreme competence, which I suppose Stalin was good at. It requires you to do 02:08:23.360 |
evil and be competent at it, or just get lucky. - And those systems help with that. You have 02:08:28.880 |
perfect surveillance. You can do some mind reading, I presume, eventually. It would be very hard to 02:08:34.560 |
take back control from systems more capable than us. - And then it would be hard for humans to 02:08:42.320 |
become the hackers that escape the control of the AGI because the AGI is so damn good. 02:08:47.440 |
And then, yeah, yeah, yeah. And then the dictator is immortal. Yeah, that's not great. That's not 02:08:56.400 |
a great outcome. See, I'm more afraid of humans than AI systems. I'm afraid, I believe that most 02:09:03.360 |
humans want to do good and have the capacity to do good, but also all humans have the capacity 02:09:09.040 |
to do evil. And when you test them by giving them absolute power, as you would if you give them AGI, 02:09:16.880 |
that could result in a lot of suffering. What gives you hope about the future? 02:09:25.040 |
- I could be wrong. I've been wrong before. - If you look 100 years from now, 02:09:31.760 |
and you're immortal, and you look back, and it turns out this whole conversation, 02:09:37.120 |
you said a lot of things that were very wrong. Now that looking 100 years back, 02:09:41.920 |
what would be the explanation? What happened in those 100 years that made you wrong, 02:09:48.960 |
that made the words you said today wrong? - There is so many possibilities. We had 02:09:54.080 |
catastrophic events which prevented development of advanced microchips. 02:09:58.320 |
- That's not where I thought you were going. - That's a hopeful future. We could be in one 02:10:02.240 |
of those personal universes, and the one I'm in is beautiful. It's all about me, and I like it a lot. 02:10:08.320 |
- So we've now, just to linger on that, that means every human has their personal universe. 02:10:13.920 |
- Yes. Maybe multiple ones. Hey, why not? You can shop around. It's possible that somebody 02:10:23.520 |
comes up with alternative model for building AI, which is not based on neural networks, 02:10:29.760 |
which are hard to scrutinize, and that alternative is somehow, I don't see how, 02:10:35.120 |
but somehow avoiding all the problems I speak about in general terms, not applying them to 02:10:41.360 |
specific architectures. Aliens come and give us friendly superintelligence. There is so many possibilities. 02:10:51.440 |
- What if it turns out that scaling up superintelligence systems becomes harder and harder? So meaning it's not so easy to do the 02:10:58.640 |
foom, the takeoff. - So that would probably speak more about 02:11:06.880 |
how much smarter that system is compared to us. So maybe it's hard to be a million times smarter, 02:11:12.080 |
but it's still okay to be five times smarter. So that is totally possible. That I have no 02:11:16.960 |
objections to. - So like it's, there's a S-curve type 02:11:20.880 |
situation about smarter, and it's going to be like 3.7 times smarter than all of human civilization. 02:11:27.760 |
- Right, just the problems we face in this world, each problem is like an IQ test. You need certain 02:11:32.640 |
intelligence to solve it. So we just don't have more complex problems outside of mathematics 02:11:36.800 |
for it to be showing off. Like you can have an IQ of 500, but if you're playing tic-tac-toe, it doesn't 02:11:42.640 |
show, it doesn't matter. - So the idea there is that the problems 02:11:47.360 |
define your capacity, your cognitive capacity. So because the problems on earth are not 02:11:53.760 |
sufficiently difficult, it's not going to be able to expand its cognitive capacity. 02:12:01.760 |
Wouldn't that be a good thing? - It still could be a lot smarter than us. 02:12:06.320 |
And to dominate long-term, you just need some advantage. You have to be the smartest. You don't 02:12:11.760 |
have to be a million times smarter. - So even 5X might be enough? 02:12:15.680 |
- It'd be impressive. What is it, IQ of 1,000? I mean, I know those units don't mean anything 02:12:21.600 |
at that scale, but still, as a comparison, the smartest human is like 200. 02:12:26.640 |
- Well, actually, no, I didn't mean compared to an individual human, I meant compared to the 02:12:32.320 |
collective intelligence of the human species. If you're somehow 5X smarter than that... 02:12:36.320 |
- We are more productive as a group. I don't think we are more capable of solving individual 02:12:42.400 |
problems. If all of humanity plays chess together, we are not a million times better 02:12:48.240 |
than world champion. - That's because there's, 02:12:52.000 |
that's like one S-curve is the chess, but humanity is very good at exploring the full range of ideas. 02:13:02.240 |
The more Einsteins you have, the higher the probability you come up with general 02:13:07.920 |
relativity. So maybe it's more about quantity of superintelligence than quality of superintelligence. 02:13:10.000 |
- Yeah, sure. But quantity and... - Enough quantity sometimes becomes quality. 02:13:18.720 |
What do you think is the meaning of this whole thing? We've been talking about humans and 02:13:25.680 |
humans not dying, but why are we here? - It's a simulation. We're being tested. 02:13:32.160 |
The test is, will you be dumb enough to create superintelligence and release it? 02:13:35.600 |
- So the objective function is not be dumb enough to kill ourselves. 02:13:41.600 |
- Yeah, you're unsafe. Prove yourself to be a safe agent who doesn't do that, 02:13:45.680 |
and you get to go to the next game. - The next level of the game? What's the next level? 02:13:50.720 |
- I haven't hacked the simulation yet. - Well, maybe hacking the simulation 02:13:54.720 |
is the thing. - I'm working as fast as I can. 02:13:57.040 |
- And you think physics would be the way to do that? - Quantum physics, yeah, definitely. 02:14:02.240 |
- Well, I hope we do. And I hope whatever is outside is even more fun than this one, 02:14:06.640 |
'cause this one's pretty damn fun. And just a big thank you for doing the work you're doing. 02:14:12.880 |
There's so much exciting development in AI, and to ground it in the existential risks is really, 02:14:21.520 |
really important. Humans love to create stuff, and we should be careful not to destroy ourselves in 02:14:28.000 |
the process. So thank you for doing that really important work. 02:14:31.600 |
- Thank you so much for inviting me. It was amazing, and my dream is to be proven wrong. 02:14:37.600 |
If everyone just picks up a paper or book and shows how I messed it up, that would be optimal. 02:14:44.400 |
- But for now, the simulation continues. - For now. 02:14:47.680 |
- Thank you, Roman. Thanks for listening to this conversation with Roman Yampolskiy. 02:14:52.640 |
To support this podcast, please check out our sponsors in the description. 02:14:56.320 |
And now let me leave you with some words from Frank Herbert in Dune. 02:15:01.120 |
I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. 02:15:09.200 |
I will face my fear. I will permit it to pass over me and through me. And when it has gone past, 02:15:16.640 |
I will turn the inner eye to see its path. Where the fear has gone, there will be nothing. 02:15:22.480 |
Only I will remain. Thank you for listening, and hope to see you next time.