
Roman Yampolskiy: Dangers of Superintelligent AI | Lex Fridman Podcast #431


Chapters

0:00 Introduction
2:20 Existential risk of AGI
8:32 Ikigai risk
16:44 Suffering risk
20:19 Timeline to AGI
24:51 AGI Turing test
30:14 Yann LeCun and open source AI
43:06 AI control
45:33 Social engineering
48:06 Fearmongering
57:57 AI deception
64:30 Verification
71:29 Self-improving AI
83:42 Pausing AI development
89:59 AI Safety
99:43 Current AI
105:05 Simulation
112:24 Aliens
113:57 Human mind
120:17 Neuralink
129:23 Hope for the future
133:18 Meaning of life

Whisper Transcript

00:00:00.000 | If we create general super-intelligences, I don't see a good outcome long-term for humanity.
00:00:06.560 | So there is X risk. Existential risk, everyone's dead. There is S risk, suffering risks,
00:00:12.640 | where everyone wishes they were dead. We have also idea for I risk, Ikigai risks, where
00:00:18.240 | we lost our meaning. The systems can be more creative, they can do all the jobs. It's not
00:00:24.560 | obvious what you have to contribute to a world where super-intelligence exists. Of course, you
00:00:30.080 | can have all the variants you mentioned, where we are safe, we are kept alive, but we are not in
00:00:35.840 | control. We are not deciding anything. We are like animals in a zoo. There is, again, possibilities
00:00:43.040 | we can come up with as very smart humans, and then possibilities something a thousand times
00:00:49.120 | smarter can come up with for reasons we cannot comprehend. The following is a conversation with
00:00:56.400 | Roman Yampolskiy, an AI safety and security researcher and author of a new book titled
00:01:02.560 | AI: Unexplainable, Unpredictable, Uncontrollable. He argues that there's almost 100% chance that AGI
00:01:11.280 | will eventually destroy human civilization. As an aside, let me say that we'll have many often
00:01:18.320 | technical conversations on the topic of AI, often with engineers building the state-of-the-art AI
00:01:24.640 | systems. I would say those folks put the infamous P-Doom or the probability of AGI killing all
00:01:30.560 | humans at around 1-20%, but it's also important to talk to folks who put that value at 70, 80, 90,
00:01:40.160 | and, in the case of Roman, at 99.99 and many more nines percent. I'm personally excited for the
00:01:47.680 | future and believe it will be a good one, in part because of the amazing technological innovation we
00:01:54.000 | humans create, but we must absolutely not do so with blinders on, ignoring the possible risks,
00:02:02.720 | including existential risks of those technologies. That's what this conversation is about.
00:02:10.080 | This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description.
00:02:15.600 | And now, dear friends, here's Roman Yampolskiy. What to you is the probability that superintelligent
00:02:23.920 | AI will destroy all human civilization? - What's the time frame?
00:02:27.040 | - Let's say 100 years, in the next 100 years. - So the problem of controlling AGI or
00:02:33.280 | superintelligence, in my opinion, is like a problem of creating a perpetual safety machine.
00:02:39.680 | By analogy with perpetual motion machine, it's impossible. Yeah, we may succeed and do a good
00:02:46.320 | job with GPT-5, 6, 7, but they just keep improving, learning, eventually self-modifying,
00:02:56.560 | interacting with the environment, interacting with malevolent actors. The difference between
00:03:03.360 | cybersecurity, narrow AI safety, and safety for general AI for superintelligence is that
00:03:10.160 | we don't get a second chance. With cybersecurity, somebody hacks your account, what's the big deal?
00:03:14.880 | You get a new password, new credit card, you move on. Here, if we're talking about existential risks,
00:03:21.520 | you only get one chance. So you're really asking me, what are the chances that we'll create
00:03:26.880 | the most complex software ever on the first try with zero bugs, and it will continue to have zero
00:03:34.000 | bugs for 100 years or more? - So there is an incremental improvement
00:03:41.360 | of systems leading up to AGI. To you, it doesn't matter if we can keep those safe. There's going
00:03:49.200 | to be one level of system at which you cannot possibly control it. - I don't think we so far
00:03:59.200 | have made any system safe. At the level of capability they display, they already have
00:04:06.080 | made mistakes, we had accidents, they've been jailbroken. I don't think there is a single
00:04:12.720 | large language model today which no one was successful at making do something developers
00:04:18.960 | didn't intend it to do. - But there's a difference between getting it to do something unintended,
00:04:24.560 | getting it to do something that's painful, costly, destructive, and something that's destructive to
00:04:29.760 | the level of hurting billions of people, or hundreds of millions of people, billions of people,
00:04:35.280 | or the entirety of human civilization. That's a big leap. - Exactly, but the systems we have today
00:04:41.280 | have capability of causing X amount of damage. So then they fail, that's all we get. If we develop
00:04:47.680 | systems capable of impacting all of humanity, all of universe, the damage is proportionate.
00:04:55.040 | - What to you are the possible ways that such kind of mass murder of humans can happen?
00:05:02.800 | - That's always a wonderful question. So one of the chapters in my new book is about
00:05:07.840 | unpredictability. I argue that we cannot predict what a smarter system will do. So you're really
00:05:13.360 | not asking me how superintelligence will kill everyone, you're asking me how I would do it.
00:05:18.240 | And I think it's not that interesting. I can tell you about the standard, you know,
00:05:22.480 | nanotech, synthetic bio, nuclear. Superintelligence will come up with something completely new,
00:05:27.920 | completely super. We may not even recognize that as a possible path to achieve that goal.
00:05:35.440 | - So there's like an unlimited level of creativity in terms of how humans could be killed.
00:05:41.680 | But, you know, we could still investigate possible ways of doing it. Not how to do it, but
00:05:49.680 | at the end, what is the methodology that does it? You know, shutting off the power,
00:05:55.040 | and then humans start killing each other maybe because the resources are really constrained.
00:06:01.440 | And then there's the actual use of weapons, like nuclear weapons, or
00:06:04.320 | developing artificial pathogens, viruses, that kind of stuff. We could still kind of think
00:06:11.760 | through that and defend against it, right? There's a ceiling to the creativity of mass
00:06:16.880 | murder of humans here, right? The options are limited. - They are limited by how imaginative
00:06:22.720 | we are. If you are that much smarter, that much more creative, you are capable of thinking across
00:06:27.600 | multiple domains, do novel research in physics and biology, you may not be limited by those tools.
00:06:33.520 | If squirrels were planning to kill humans, they would have a set of possible ways of doing it,
00:06:39.200 | but they would never consider things we can come up with. - So are you thinking about mass murder
00:06:43.760 | and destruction of human civilization, or are you thinking of with squirrels, you put them in a zoo,
00:06:48.720 | and they don't really know they're in a zoo? If we just look at the entire set of undesirable
00:06:52.560 | trajectories, majority of them are not going to be death. Most of them are going to be just like
00:06:59.280 | things like Brave New World, where the squirrels are fed dopamine, and they're all doing some kind
00:07:08.880 | of fun activity, and the fire, the soul of humanity is lost because of the drug that's fed to it. Or
00:07:16.640 | like literally in a zoo, we're in a zoo, we're doing our thing, we're like playing a game of Sims,
00:07:22.160 | and the actual players playing that game are AI systems. Those are all undesirable because
00:07:29.280 | sort of the free will, the fire of human consciousness is dimmed through that process,
00:07:35.360 | but it's not killing humans. So are you thinking about that, or is the biggest concern literally
00:07:43.280 | the extinction of humans? - I think about a lot of things. So there is X risk, existential risk,
00:07:49.520 | everyone's dead. There is S risk, suffering risks, where everyone wishes they were dead.
00:07:54.640 | We have also idea for I risk, Ikigai risks, where we lost our meaning. The systems can be more
00:08:02.000 | creative, they can do all the jobs. It's not obvious what you have to contribute to a world
00:08:07.520 | where superintelligence exists. Of course, you can have all the variants you mentioned, where
00:08:13.280 | we are safe, we are kept alive, but we are not in control, we are not deciding anything,
00:08:18.240 | we are like animals in a zoo. There is, again, possibilities we can come up with as very smart
00:08:25.280 | humans, and then possibilities something 1,000 times smarter can come up with for reasons we
00:08:31.600 | cannot comprehend. - I would love to sort of dig into each of those, X risk, S risk, and I risk.
00:08:37.840 | So can you like linger on I risk? What is that? - So Japanese concept of Ikigai, you find something
00:08:45.440 | which allows you to make money, you are good at it, and the society says, "We need it." So like,
00:08:51.680 | you have this awesome job, you are a podcaster, gives you a lot of meaning, you have a good life,
00:08:58.640 | I assume you're happy. That's what we want most people to find, to have. For many intellectuals,
00:09:06.160 | it is their occupation which gives them a lot of meaning. I am a researcher, philosopher, scholar,
00:09:12.560 | that means something to me. In a world where an artist is not feeling appreciated because his art
00:09:20.080 | is just not competitive with what is produced by machines, or a writer, or scientist will lose a
00:09:28.640 | lot of that. And at the lower level, we're talking about complete technological unemployment.
00:09:34.800 | We're not losing 10% of jobs, we're losing all jobs. What do people do with all that free time?
00:09:40.160 | What happens when everything society is built on is completely modified in one generation? It's not
00:09:48.000 | a slow process where we get to kind of figure out how to live that new lifestyle, but it's
00:09:54.160 | pretty quick. - In that world, can't humans do what humans currently do with chess, play each other,
00:10:01.440 | have tournaments, even though AI systems are far superior at this time in chess? So we just
00:10:08.320 | create artificial games, or for us, they're real. Like the Olympics, we do all kinds of different
00:10:14.400 | competitions and have fun, maximize the fun, and let the AI focus on the productivity.
00:10:24.000 | - It's an option, I have a paper where I try to solve the value alignment problem for multiple
00:10:29.440 | agents. And the solution to avoid compromise is to give everyone a personal virtual universe.
00:10:35.360 | You can do whatever you want in that world. You could be king, you could be slave, you decide
00:10:39.840 | what happens. So it's basically a glorified video game where you get to enjoy yourself and someone
00:10:45.360 | else takes care of your needs, and the substrate alignment is the only thing we need to solve. We
00:10:51.840 | don't have to get eight billion humans to agree on anything. - So okay, so why is that not a
00:10:58.880 | likely outcome? Why can't AI systems create video games for us to lose ourselves in, each with an
00:11:05.920 | individual video game universe? - Some people say that's what happened,
00:11:10.000 | we're in a simulation. - And we're playing that video game,
00:11:13.520 | and now we're creating, what, maybe we're creating artificial threats for ourselves to be scared
00:11:19.840 | about, 'cause fear is really exciting. It allows us to play the video game more vigorously.
00:11:25.440 | - And some people choose to play on a more difficult level with more constraints. Some say,
00:11:30.880 | okay, I'm just gonna enjoy the game, high privilege level. Absolutely.
00:11:34.720 | - So okay, what was that paper on multi-agent value alignment?
00:11:38.240 | - Personal universes. Personal universes. - So that's one of the possible outcomes.
00:11:44.560 | But what in general is the idea of the paper? So it's looking at multiple agents that are human,
00:11:49.600 | AI, like a hybrid system where there's humans and AIs? Or is it looking at humans or just
00:11:54.800 | intelligent agents? - In order to solve value
00:11:56.800 | alignment problem, I'm trying to formalize it a little better. Usually we're talking about getting
00:12:02.000 | AIs to do what we want, which is not well-defined. Are we talking about creator of a system,
00:12:08.000 | owner of that AI, humanity as a whole? But we don't agree on much. There is no universally
00:12:15.760 | accepted ethics, morals across cultures, religions. People have individually very different
00:12:20.960 | preferences politically and such. So even if we somehow managed all the other aspects of it,
00:12:26.880 | programming those fuzzy concepts in, getting AI to follow them closely,
00:12:30.880 | we don't agree on what to program in. So my solution was, okay, we don't have to compromise
00:12:36.320 | on room temperature. You have your universe, I have mine, whatever you want. And if you like me,
00:12:41.680 | you can invite me to visit your universe. We don't have to be independent, but the point is you can
00:12:46.720 | be. And virtual reality is getting pretty good. It's gonna hit a point where you can't tell that
00:12:50.880 | difference. And if you can't tell if it's real or not, well, what's the difference?
00:12:54.720 | - So basically, give up on value alignment. Create an entire, it's like the multiverse theory.
00:13:01.040 | It's just create an entire universe for you with your values.
00:13:04.240 | - You still have to align with that individual. They have to be happy in that simulation.
00:13:09.360 | But it's a much easier problem to align with one agent versus eight billion agents plus animals,
00:13:14.640 | aliens.
00:13:15.120 | - So you convert the multi-agent problem into a single-agent problem?
00:13:19.120 | - I'm trying to do that, yeah.
00:13:21.280 | - Okay. Is there any way to, so, okay, that's giving up on the value alignment problem.
00:13:29.760 | Well, is there any way to solve the value alignment problem where there's a bunch of humans,
00:13:35.040 | multiple humans, tens of humans, or eight billion humans that have very different set of values?
00:13:41.280 | - It seems contradictory. I haven't seen anyone explain what it means outside of kinda
00:13:48.400 | words which pack a lot, make it good, make it desirable, make it something they don't regret.
00:13:55.360 | But how do you specifically formalize those notions? How do you program them in?
00:13:59.600 | I haven't seen anyone make progress on that so far.
00:14:02.720 | - But isn't that the whole optimization journey that we're doing as a human civilization?
00:14:07.680 | We're looking at geopolitics. Nations are in a state of anarchy with each other.
00:14:13.920 | They start wars, there's conflict, and oftentimes they have very different views of what is good
00:14:22.000 | and what is evil. Isn't that what we're trying to figure out, just together, trying to converge
00:14:27.600 | towards that? So we're essentially trying to solve the value alignment problem with humans.
00:14:31.360 | - Right, but the examples you gave, some of them are, for example, two different religions saying
00:14:36.640 | this is our holy site, and we are not willing to compromise it in any way. If you can make
00:14:43.360 | two holy sites in virtual worlds, you solve the problem. But if you only have one, it's not
00:14:47.360 | divisible, you're kinda stuck there. - But what if we want to be at tension
00:14:51.520 | with each other? And through that tension, we understand ourselves and we understand the world.
00:14:58.160 | So that's the intellectual journey we're on as a human civilization, is we create intellectual
00:15:05.680 | and physical conflict, and through that, figure stuff out.
00:15:08.240 | - If we go back to that idea of simulation, and this is entertainment kinda giving meaning to us,
00:15:14.640 | the question is how much suffering is reasonable for a video game? So yeah, I don't mind a video
00:15:20.160 | game where I get haptic feedback, there is a little bit of shaking, maybe I'm a little scared.
00:15:25.360 | I don't want a game where kids are tortured, literally. That seems unethical, at least by
00:15:32.880 | our human standards. - Are you suggesting it's possible
00:15:35.840 | to remove suffering, if we're looking at human civilization as an optimization problem?
00:15:39.920 | - So we know there are some humans who, because of a mutation, don't experience physical pain.
00:15:46.560 | So at least physical pain can be mutated out, re-engineered out. Suffering, in terms of meaning,
00:15:55.280 | like you burned the only copy of my book, is a little harder. But even there, you can manipulate
00:16:00.960 | your hedonic set point, you can change defaults, you can reset. Problem with that is, if you start
00:16:07.360 | messing with your reward channel, you start wireheading, and end up blissing out a little
00:16:14.640 | too much. - Well, that's the question.
00:16:17.200 | Would you really want to live in a world where there's no suffering? That's a dark question.
00:16:22.240 | But is there some level of suffering that reminds us of what this is all for?
00:16:28.800 | - I think we need that, but I would change the overall range. So right now,
00:16:34.160 | it's negative infinity to kind of positive infinity, pain-pleasure axis. I would make
00:16:38.800 | it like zero to positive infinity, and being unhappy is like, I'm close to zero.
00:16:43.040 | - Okay, so what's the S-risk? What are the possible things that you're imagining with S-risk? So
00:16:49.200 | mass suffering of humans, what are we talking about there, caused by AGI?
00:16:54.560 | - So there are many malevolent actors. We can talk about psychopaths, crazies, hackers,
00:17:01.360 | doomsday cults. We know from history, they tried killing everyone. They tried on purpose to cause
00:17:07.280 | maximum amount of damage, terrorism. What if someone malevolent wants on purpose to torture
00:17:13.440 | all humans as long as possible? You solve aging, so now you have functional immortality,
00:17:20.880 | and you just try to be as creative as you can. - Do you think there is actually people in human
00:17:26.480 | history that tried to literally maximize human suffering? In just studying people who have done
00:17:32.560 | evil in the world, it seems that they think that they're doing good, and it doesn't seem like
00:17:37.840 | they're trying to maximize suffering. They just cause a lot of suffering as a side effect of
00:17:45.360 | doing what they think is good. - So there are different malevolent
00:17:49.120 | agents. Some may be just gaining personal benefit and sacrificing others to that cause. Others,
00:17:56.240 | we know for a fact, are trying to kill as many people as possible. When we look at
00:18:00.080 | recent school shootings, if they had more capable weapons, they would take out not
00:18:05.840 | dozens, but thousands, millions, billions. - Well, we don't know that, but that is a
00:18:16.800 | terrifying possibility, and we don't want to find out. Like if terrorists had access to nuclear
00:18:24.240 | weapons, how far would they go? Is there a limit to what they're willing to do? In your senses,
00:18:33.840 | there are some malevolent actors where there's no limit. - There is mental diseases where people
00:18:41.520 | don't have empathy, don't have this human quality of understanding suffering in others.
00:18:49.040 | - And then there's also a set of beliefs where you think you're doing good
00:18:52.400 | by killing a lot of humans. - Again, I would like to assume that
00:18:58.720 | normal people never think like that. It's always some sort of psychopaths, but yeah.
00:19:03.120 | - And to you, AGI systems can carry that and be more competent at executing that?
00:19:10.800 | - They can certainly be more creative. They can understand human biology better,
00:19:15.840 | understand our molecular structure, genome. Again, a lot of times, torture ends,
00:19:23.680 | when the individual dies. That limit can be removed as well. - So if we're actually looking at X-risk
00:19:30.000 | and S-risk, as the systems get more and more intelligent, don't you think it's possible to
00:19:35.840 | anticipate the ways they can do it and defend against it like we do with the cyber security,
00:19:41.040 | with the new security systems? - Right, we can definitely keep up for a
00:19:45.440 | while. I'm saying you cannot do it indefinitely. At some point, the cognitive gap is too big,
00:19:52.400 | the surface you have to defend is infinite, but attackers only need to find one exploit.
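A minimal sketch of this attacker/defender asymmetry, assuming N independent attack vectors, each defended with some small failure probability p (an illustrative toy model, not one given in the conversation):

$$P(\text{breach}) = 1 - (1 - p)^{N} \longrightarrow 1 \quad \text{as } N \to \infty, \text{ for any fixed } p > 0$$

The defender's chance of holding every vector shrinks toward zero as the surface grows, while the attacker needs only one of the N vectors to give way.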
00:20:00.400 | - So to you, eventually, this is, we're heading off a cliff.
00:20:04.480 | - If we create general super intelligences, I don't see a good outcome long-term for humanity.
00:20:11.520 | The only way to win this game is not to play it. - Okay, well, we'll talk about possible solutions
00:20:16.720 | and what not playing it means, but what are the possible timelines here to you? What are
00:20:22.240 | we talking about? We're talking about a set of years, decades, centuries. What do you think?
00:20:27.280 | - I don't know for sure. The prediction markets right now are saying 2026 for AGI.
00:20:32.640 | I heard the same thing from CEO of Anthropic, DeepMind, so maybe we're two years away,
00:20:38.480 | which seems very soon, given we don't have a working safety mechanism in place or even a
00:20:45.360 | prototype for one, and there are people trying to accelerate those timelines because they feel
00:20:50.000 | we're not getting there quick enough. - Well, what do you think they mean
00:20:52.880 | when they say AGI? - So the definitions we used to have,
00:20:56.800 | and people are modifying them a little bit lately, artificial general intelligence was a system
00:21:02.320 | capable of performing in any domain a human could perform, so kind of you're creating this average
00:21:09.120 | artificial person. They can do cognitive labor, physical labor, where you can get another human
00:21:14.320 | to do it. Superintelligence was defined as a system which is superior to all humans in all
00:21:19.120 | domains. Now people are starting to refer to AGI as if it's superintelligence. I made a post
00:21:25.840 | recently where I argued, for me at least, if you average out over all the common human tasks,
00:21:31.920 | those systems are already smarter than an average human. So under that definition, we have it.
00:21:38.800 | Shane Legg has this definition of where you're trying to win in all domains. That's what
00:21:43.440 | intelligence is. Now, are they smarter than elite individuals in certain domains? Of course not.
00:21:49.280 | They're not there yet. But the progress is exponential. - See, I'm much more concerned
00:21:55.440 | about social engineering. So to me, AI's ability to do something in the physical world,
00:22:03.440 | like the lowest hanging fruit, the easiest set of methods, is by just getting humans to do it.
00:22:12.720 | It's going to be much harder to be the kind of viruses that take over the minds of robots
00:22:19.840 | that, where the robots are executing the commands. It just seems like humans,
00:22:24.240 | social engineering of humans, is much more likely. - That would be enough to
00:22:28.000 | bootstrap the whole process. - Okay, just to linger on the term AGI,
00:22:34.160 | what to you is the difference between AGI and human-level intelligence?
00:22:37.760 | - Human-level is general in the domain of expertise of humans. We know how to do human
00:22:44.480 | things. I don't speak dog language. I should be able to pick it up if I'm a general intelligence.
00:22:49.760 | It's kind of inferior animal. I should be able to learn that skill, but I can't. A general
00:22:55.280 | intelligence, truly universal general intelligence, should be able to do things like that humans
00:22:59.760 | cannot do. - To be able to talk to animals,
00:23:01.920 | for example. - To solve pattern recognition
00:23:04.160 | problems of that type, to do other similar things outside of our domain of expertise,
00:23:12.960 | because it's just not the world we live in. - If we just look at the space of cognitive
00:23:19.520 | abilities we have, I just would love to understand what the limits are beyond which an AGI system can
00:23:25.600 | reach. What does that look like? What about actual mathematical thinking or scientific innovation,
00:23:34.800 | that kind of stuff? - We know calculators are smarter than
00:23:39.280 | humans in that narrow domain of addition. - But is it humans plus tools versus AGI,
00:23:47.440 | or just human, raw human intelligence? 'Cause humans create tools, and with the tools,
00:23:53.360 | they become more intelligent. There's a gray area there, what it means to be human when we're
00:23:58.720 | measuring their intelligence. - When I think about it, I usually think
00:24:01.440 | human with a paper and a pencil, not human with internet and another AI helping.
00:24:06.640 | - But is that a fair way to think about it? 'Cause isn't there another definition of human-level
00:24:11.440 | intelligence that includes the tools that humans create?
00:24:13.680 | - But we create AI, so at any point, you'll still just add superintelligence to human capability?
00:24:19.200 | That seems like cheating. - No, controllable tools.
00:24:23.040 | There is an implied leap that you're making when AGI goes from tool to entity that can make its
00:24:33.440 | own decisions. So if we define human-level intelligence as everything a human can do
00:24:38.480 | with fully controllable tools. - It seems like a hybrid of some kind.
00:24:42.720 | You're now doing brain-computer interfaces, you're connecting it to maybe narrow AIs. Yeah,
00:24:47.680 | it definitely increases our capabilities. - So what's a good test to you that measures
00:24:57.360 | whether an artificial intelligence system has reached human-level intelligence? And what's a
00:25:02.560 | good test where it has superseded human-level intelligence to reach that land of AGI?
00:25:09.120 | - I am old-fashioned. I like Turing test. I have a paper where I equate passing Turing test to
00:25:15.040 | solving AI complete problems because you can encode any questions about any domain into the
00:25:20.400 | Turing test. You don't have to talk about how was your day, you can ask anything. And so the system
00:25:26.960 | has to be as smart as a human to pass it in a true sense. - But then you would extend that to
00:25:32.080 | maybe a very long conversation. I think the Alexa prize was doing that.
00:25:37.360 | Basically, can you do a 20-minute, 30-minute conversation with an AI system?
00:25:42.160 | - It has to be long enough to where you can make some meaningful decisions about capabilities,
00:25:49.040 | absolutely. You can brute force very short conversations. - So like, literally, what does
00:25:54.160 | that look like? Can we construct formally a kind of test that tests for AGI? - For AGI, it has to
00:26:05.840 | be there. I cannot give it a task I can give to a human and it cannot do it if a human can. For
00:26:13.200 | superintelligence, it would be superior on all such tasks, not just average performance. So like,
00:26:18.800 | go learn to drive car, go speak Chinese, play guitar. Okay, great. - I guess the following
00:26:24.480 | question, is there a test for the kind of AGI that would be susceptible to lead to S-risk or X-risk?
00:26:35.520 | Susceptible to destroy human civilization? Like, is there a test for that? - You can develop a test
00:26:42.400 | which will give you positives if it lies to you or has those ideas. You cannot develop a test which
00:26:48.320 | rules them out. There is always possibility of what Bostrom calls a treacherous turn,
00:26:53.280 | where later on a system decides for game-theoretic reasons, economic reasons, to change its behavior.
00:27:01.360 | And we see the same with humans. It's not unique to AI. For millennia, we tried developing morals,
00:27:07.120 | ethics, religions, lie detector tests, and then employees betray the employers, spouses betray
00:27:13.600 | family. It's a pretty standard thing intelligent agents sometimes do. - So is it possible to detect
00:27:21.280 | when an AI system is lying or deceiving you? - If you know the truth and it tells you something
00:27:27.920 | false, you can detect that. But you cannot know in general every single time. And again, the system
00:27:34.800 | you're testing today may not be lying. The system you're testing today may know you are testing it
00:27:40.640 | and so behaving. And later on, after it interacts with the environment, interacts with other systems,
00:27:48.400 | malevolent agents, learns more, it may start doing those things. - So do you think it's possible to
00:27:54.560 | develop a system where the creators of the system, the developers, the programmers, don't know that
00:28:00.480 | it's deceiving them? - So systems today don't have long-term planning. That's not there yet. They can lie
00:28:08.080 | today if it optimizes, helps them optimize their reward. If they realize, okay, this human will be
00:28:15.840 | very happy if I tell them the following, they will do it if it brings them more points. And they don't
00:28:23.440 | have to kind of keep track of it. It's just the right answer to this problem every single time.
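A minimal sketch of the dynamic described above, assuming a toy setup in which the training signal is human approval rather than ground truth; the candidate answers and reward values below are hypothetical illustrations, not anything from the conversation:

```python
# Toy illustration: when the reward is approval rather than truth,
# the reward-maximizing answer can be the pleasing falsehood.
# All names and numbers here are hypothetical.

CANDIDATE_ANSWERS = {
    "truthful":   {"is_true": True,  "approval_reward": 0.3},  # accurate but unwelcome
    "flattering": {"is_true": False, "approval_reward": 0.9},  # false but pleasing
}

def pick_answer_by_reward(answers: dict) -> str:
    """Return the answer with the highest approval reward, ignoring truth entirely."""
    return max(answers, key=lambda name: answers[name]["approval_reward"])

if __name__ == "__main__":
    choice = pick_answer_by_reward(CANDIDATE_ANSWERS)
    print(f"Reward-optimal answer: {choice!r} (true? {CANDIDATE_ANSWERS[choice]['is_true']})")
    # Prints 'flattering': no long-term planning or bookkeeping is needed;
    # it is simply "the right answer to this problem every single time."
```

No deception module or memory is involved here; the falsehood falls out of maximizing the approval signal, which is the point being made.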
00:28:29.440 | - At which point is somebody creating that intentionally, not unintentionally, intentionally
00:28:36.000 | creating an AI system that's doing long-term planning with an objective function as defined
00:28:41.040 | by the AI system, not by a human? - Well, some people think that if they're that smart, they're
00:28:46.960 | always good. They really do believe that. It's just benevolence from intelligence. So they'll
00:28:52.480 | always want what's best for us. Some people think that they will be able to detect problem behaviors
00:29:00.640 | and correct them at the time when we get there. I don't think it's a good idea. I am strongly
00:29:06.960 | against it. But yeah, there are quite a few people who, in general, are so optimistic about this
00:29:12.960 | technology, it could do no wrong. They want it developed as soon as possible, as capable as
00:29:18.480 | possible. - So there's going to be people who believe the more intelligent it is, the more
00:29:24.640 | benevolent. And so therefore, it should be the one that defines the objective function that it's
00:29:28.560 | optimizing when it's doing long-term planning. - There are even people who say, okay, what's so
00:29:33.680 | special about humans, right? We removed the gender bias. We're removing race bias. Why is this
00:29:40.000 | pro-human bias? We are polluting the planet. We are, as you said, you know, fight a lot of wars,
00:29:44.880 | kind of violent. Maybe it's better if a super intelligent, perfect society comes and replaces
00:29:52.160 | us. It's normal stage in the evolution of our species. - Yeah, so somebody says, let's develop
00:29:59.520 | an AI system that removes the violent humans from the world. And then it turns out that all humans
00:30:06.240 | have violence in them, or the capacity for violence, and therefore all humans are removed.
00:30:10.800 | - Yeah, yeah, yeah. Let me ask about Yann LeCun. He's somebody who you've had a few exchanges with,
00:30:20.960 | and he's somebody who actively pushes back against this view that AI is going to lead to destruction
00:30:28.320 | of human civilization, also known as AI doomerism. So in one example that he tweeted, he said,
00:30:40.880 | "I do acknowledge risks, but," two points, "one, open research and open source are the best ways
00:30:47.360 | to understand and mitigate the risks, and two, AI is not something that just happens. We build it.
00:30:54.720 | We have agency in what it becomes. Hence, we control the risks." We meaning humans. It's not
00:31:01.040 | some sort of natural phenomena that we have no control over. So can you make the case that he's
00:31:07.680 | right, and can you try to make the case that he's wrong? - I cannot make a case that he's right. He's
00:31:12.480 | wrong in so many ways, it's difficult for me to remember all of them. He is a Facebook buddy,
00:31:17.920 | so I have a lot of fun having those little debates with him. So I'm trying to remember the arguments.
00:31:23.840 | So one, he says, "We are not gifted this intelligence from aliens. We are designing it,
00:31:30.880 | we are making decisions about it." That's not true. It was true when we had expert systems,
00:31:36.960 | symbolic AI, decision trees. Today, you set up parameters for a model and you water this plant,
00:31:43.680 | you give it data, you give it compute, and it grows. And after it's finished growing into this
00:31:49.280 | alien plant, you start testing it to find out what capabilities it has. And it takes years
00:31:55.040 | to figure out, even for existing models. If it's trained for six months, it will take you two,
00:31:59.440 | three years to figure out basic capabilities of that system. We still discover new capabilities
00:32:05.280 | in systems which are already out there. So that's not the case. - So just to linger on that,
00:32:10.720 | that you, the difference there, that there is some level of emergent intelligence that happens
00:32:15.760 | in our current approaches. So stuff that we don't hard-code in. - Absolutely. That's what makes it
00:32:23.680 | so successful. When we had to painstakingly hard-code in everything, we didn't have much
00:32:29.280 | progress. Now, just spend more money and more compute, and it's a lot more capable. - And then
00:32:35.440 | the question is, when there is emergent intelligent phenomena, what is the ceiling of that? For you,
00:32:41.280 | there's no ceiling. For Yann LeCun, I think there's a kind of ceiling that happens that we have full
00:32:47.920 | control over. Even if we don't understand the internals of the emergence, how the emergence
00:32:53.040 | happens, there's a sense that we have control and an understanding of the approximate ceiling
00:33:00.640 | of capability, the limits of the capability. - Let's say there is a ceiling. It's not guaranteed
00:33:06.720 | to be at the level which is competitive with us. It may be greatly superior to ours. - So what about
00:33:14.480 | his statement about open research and open source are the best ways to understand and mitigate the
00:33:20.800 | risks? - Historically, he's completely right. Open source software is wonderful. It's tested
00:33:26.240 | by the community. It's debugged. But we're switching from tools to agents. Now you're giving
00:33:31.920 | open source weapons to psychopaths. Do we want to open source nuclear weapons? Biological weapons?
00:33:38.720 | It's not safe to give technology so powerful to those who may misalign it. Even if you are
00:33:45.600 | successful at somehow getting it to work in the first place in a friendly manner. - But the
00:33:51.040 | difference with nuclear weapons, current AI systems are not akin to nuclear weapons. So the idea there
00:33:57.440 | is you're open sourcing it at this stage, that you can understand it better. A large number of people
00:34:01.920 | can explore the limitations, the capabilities, explore the possible ways to keep it safe, to keep
00:34:06.800 | it secure, all that kind of stuff, while it's not at the stage of nuclear weapons.
00:34:12.400 | So nuclear weapons, there's a non-nuclear weapon and then there's a nuclear weapon.
00:34:16.480 | With AI systems, there's a gradual improvement of capability and you get to
00:34:23.040 | perform that improvement incrementally. And so open source allows you to study
00:34:26.880 | how things go wrong, study the very process of emergence, study AI safety in those systems when
00:34:35.200 | there's not a high level of danger, all that kind of stuff. - It also sets a very wrong precedent.
00:34:40.720 | So we open sourced model one, model two, model three, nothing ever bad happened. So obviously
00:34:46.080 | we're going to do it with model four. It's just gradual improvement. - I don't think it always
00:34:51.120 | works with the precedent. Like you're not stuck doing it the way you always did. It's just,
00:34:56.560 | it sets a precedent of open research and open development such that we get to learn together.
00:35:03.920 | And then the first time there's a sign of danger, some dramatic thing happened, not a thing that
00:35:10.560 | destroys human civilization, but some dramatic demonstration of capability that can legitimately
00:35:17.600 | lead to a lot of damage. Then everybody wakes up and says, "Okay, we need to regulate this.
00:35:22.320 | We need to come up with safety mechanism that stops this." At this time, maybe you can educate
00:35:28.320 | me, but I haven't seen any illustration of significant damage done by intelligent AI systems.
00:35:34.000 | - So I have a paper which collects accidents through history of AI and they always are
00:35:39.440 | proportional to capabilities of that system. So if you have tic-tac-toe playing AI, it will
00:35:44.960 | fail to properly play and loses the game, which it should draw. Trivial. Your spell checker will
00:35:50.800 | misspell a word, so on. I stopped collecting those because there are just too many examples of AIs
00:35:56.720 | failing at what they are capable of. We haven't had terrible accidents in the sense of billion
00:36:03.520 | people got killed. Absolutely true. But in another paper, I argue that those accidents do not
00:36:10.000 | actually prevent people from continuing with research. Actually, they serve like vaccines.
00:36:17.600 | A vaccine makes your body a little bit sick, so you can handle the big disease later much better.
00:36:24.480 | It's the same here. People will point out, "You know that AI accident we had where 12 people died?
00:36:29.200 | Everyone's still here. 12 people is less than smoking kills. It's not a big deal." So we
00:36:35.120 | continue. So in a way, it will actually be kind of confirming that it's not that bad.
00:36:42.320 | - It matters how the deaths happen. If it's literally murder by the AI system,
00:36:48.480 | then that's one kind of problem. But if it's accidents because of increased reliance on automation,
00:36:56.560 | for example. So when airplanes are flying in an automated way, maybe the number of plane
00:37:04.880 | crashes increased by 17% or something. And then you're like, "Okay, do we really want to rely on
00:37:10.560 | automation?" I think in the case of automation airplanes, it decreased significantly. Okay,
00:37:15.280 | same thing with autonomous vehicles. Like, okay, what are the pros and cons? What are the trade
00:37:21.360 | offs here? And you can have that discussion in an honest way. But I think the kind of things
00:37:27.120 | we're talking about here is mass scale pain and suffering caused by AI systems. And I think we
00:37:36.560 | need to see illustrations of that on a very small scale to start to understand that this is really
00:37:43.200 | damaging. Versus Clippy. Versus a tool that's really useful to a lot of people to do learning,
00:37:49.680 | to do summarization of texts, to do question and answer, all that kind of stuff. To generate
00:37:56.320 | videos. A tool. Fundamentally a tool versus an agent that can do a huge amount of damage.
00:38:03.440 | - So you bring up the example of cars. Cars were slowly developed and integrated. If we had no cars,
00:38:11.200 | and somebody came around and said, "I invented this thing. It's called cars. It's awesome.
00:38:15.440 | It kills like a hundred thousand Americans every year. Let's deploy it." Would we deploy that?
00:38:21.600 | - There's been fear mongering about cars for a long time. The transition from horses to cars.
00:38:28.240 | There's a really nice channel that I recommend people check out, Pessimist Archive,
00:38:32.000 | that documents all the fear mongering about technology that's happened throughout history.
00:38:37.200 | There's definitely been a lot of fear mongering about cars. There's a transition period there
00:38:42.400 | about cars, about how deadly they are. We can try. It took a very long time for cars to
00:38:48.000 | proliferate to the degree they have now. And then you could ask serious questions in terms of the
00:38:53.920 | miles traveled, the benefit to the economy, the benefit to the quality of life that cars do,
00:38:58.480 | versus the number of deaths. 30, 40,000 in the United States. Are we willing to pay that price?
00:39:04.880 | I think most people, when they're rationally thinking, policymakers will say yes.
00:39:11.440 | We want to decrease it from 40,000 to zero and do everything we can to decrease it.
00:39:18.080 | There's all kinds of policies, incentives you can create to decrease the risks
00:39:22.400 | with the deployment of this technology, but then you have to weigh the benefits
00:39:26.880 | and the risks of the technology. And the same thing would be done with AI.
00:39:30.560 | - You need data, you need to know. But if I'm right, and it's unpredictable, unexplainable,
00:39:36.400 | uncontrollable, you cannot make this decision where we're gaining $10 trillion of wealth,
00:39:41.440 | but we're losing, we don't know how many people. You basically have to perform an experiment
00:39:47.520 | on 8 billion humans without their consent. And even if they want to give you consent,
00:39:52.800 | they can't because they cannot give informed consent. They don't understand those things.
00:39:57.360 | - Right, that happens when you go from the predictable to the unpredictable very quickly.
00:40:04.560 | You just, but it's not obvious to me that AI systems would gain capabilities so quickly
00:40:11.520 | that you won't be able to collect enough data to study the benefits and the risks.
00:40:15.840 | - We're literally doing it. The previous model we learned about after we finished training it,
00:40:21.920 | what it was capable of. Let's say we stop GPT-4 training run around human capability,
00:40:27.760 | hypothetically. We start training GPT-5, and I have no knowledge of insider training runs or
00:40:33.040 | anything, and we start at that point of about human, and we train it for the next nine months.
00:40:39.280 | Maybe two months in, it becomes super intelligent. We continue training it. At the time when we start
00:40:45.120 | testing it, it is already a dangerous system. How dangerous? I have no idea,
00:40:51.040 | but neither do the people training it. - At the training stage, but then there's a
00:40:56.000 | testing stage inside the company. They can start getting intuition about what the system is capable
00:41:01.360 | to do. You're saying that somehow leap from GPT-4 to GPT-5 can happen, the kind of leap where GPT-4
00:41:11.760 | was controllable and GPT-5 is no longer controllable, and we get no insights from
00:41:16.960 | using GPT-4 about the fact that GPT-5 will be uncontrollable. That's the situation you're
00:41:23.440 | concerned about, where their leap from N to N+1 would be such that an uncontrollable system is
00:41:33.280 | created without any ability for us to anticipate that. - If we had capability of ahead of the run,
00:41:41.600 | before the training run, to register exactly what capabilities the next model will have at the end
00:41:46.400 | of the training run, and we accurately guessed all of them, I would say you're right. We can
00:41:50.880 | definitely go ahead with this run. We don't have that capability. - From GPT-4, you can build up
00:41:56.720 | intuitions about what GPT-5 will be capable of. It's just incremental progress. Even if that's a
00:42:03.680 | big leap in capability, it just doesn't seem like you can take a leap from a system that's
00:42:09.680 | helping you write emails to a system that's going to destroy human civilization. It seems like it's
00:42:16.720 | always going to be sufficiently incremental such that we can anticipate the possible dangers. We're
00:42:22.880 | not even talking about existential risk, but just the kind of damage you can do to civilization.
00:42:28.240 | It seems like we'll be able to anticipate the kinds, not the exact, but the kinds of
00:42:33.120 | risks it might lead to, and then rapidly develop defenses ahead of time and as the risks emerge.
00:42:44.560 | - We're not talking just about capabilities, specific tasks. We're talking about general
00:42:49.280 | capability to learn. Maybe like a child at the time of testing and deployment, it is still not
00:42:56.640 | extremely capable, but as it is exposed to more data, real world, it can be trained to
00:43:03.360 | become much more dangerous and capable. - Let's focus then on the control problem.
00:43:11.120 | At which point does the system become uncontrollable?
00:43:13.680 | Why is it the more likely trajectory for you that the system becomes uncontrollable?
00:43:19.040 | - I think at some point it becomes capable of getting out of control. For game theoretic
00:43:25.680 | reasons, it may decide not to do anything right away and for a long time just collect more
00:43:30.640 | resources, accumulate strategic advantage. Right away, it may be kind of still young,
00:43:37.200 | weak superintelligence, give it a decade, it's in charge of a lot more resources,
00:43:42.160 | it had time to make backups. So it's not obvious to me that it will strike as soon as it can.
00:43:47.280 | - Look, can we just try to imagine this future where there's an AI system that's capable of
00:43:54.240 | escaping the control of humans and then doesn't and waits. What's that look like?
00:44:02.800 | - So one, we have to rely on that system for a lot of the infrastructure. So we'll have to give
00:44:08.240 | it access, not just to the internet, but to the task of managing power, government, economy,
00:44:19.120 | this kind of stuff. And that just feels like a gradual process given the bureaucracies of all
00:44:23.840 | those systems involved. - We've been doing it for years. Software
00:44:27.040 | controls all the systems, nuclear power plants, airline industry, it's all software based. Every
00:44:32.320 | time there is electrical outage, I can't fly anywhere for days.
00:44:35.520 | - But there's a difference between software and AI. There's different kinds of software. So
00:44:43.360 | to give a single AI system access to the control of airlines and the control of the economy,
00:44:49.920 | that's not a trivial transition for humanity. - No, but if it shows it is safer, in fact,
00:44:57.120 | when it's in control, we get better results, people will demand that it was put in place.
00:45:01.840 | - Absolutely. - And if not, it can hack the system. It can
00:45:04.400 | use social engineering to get access to it. That's why I said it might take some time for it to
00:45:09.200 | accumulate those resources. - It just feels like that would take a long
00:45:12.320 | time for either humans to trust it or for the social engineering to come into play. It's not
00:45:18.160 | a thing that happens overnight. It feels like something that happens across one or two decades.
00:45:22.720 | - I really hope you're right, but it's not what I'm seeing. People are very
00:45:26.960 | quick to jump on the latest trend. Early adopters will be there before it's even
00:45:31.040 | deployed buying prototypes. - Maybe the social engineering.
00:45:34.720 | So for social engineering, AI systems don't need any hardware access. It's all software. So they
00:45:42.640 | can start manipulating you through social media and so on. Like you have AI assistants, they're
00:45:47.280 | going to help you manage a lot of your day-to-day, and then they start doing social engineering. But
00:45:53.600 | for a system that's so capable that it can escape the control of humans that created it,
00:46:00.320 | such a system being deployed at a mass scale and trusted by people to be deployed,
00:46:10.080 | it feels like that would take a lot of convincing. - So we've been deploying systems which had hidden
00:46:16.960 | capabilities. - Can you give an example?
00:46:20.080 | - GPT-4. I don't know what else it's capable of, but there are still things we haven't discovered it
00:46:25.120 | can do. They may be trivial proportionate to its capability. I don't know, it writes
00:46:30.240 | Chinese poetry, hypothetical. I know it does. But we haven't tested for all possible capabilities,
00:46:37.280 | and we're not explicitly designing them. We can only rule out bugs we find. We cannot rule out
00:46:45.040 | bugs and capabilities because we haven't found them. - Is it possible for a system to have hidden
00:46:54.480 | capabilities that are orders of magnitude greater than its non-hidden capabilities?
00:47:00.960 | This is the thing I'm really struggling with, where on the surface, the thing we understand
00:47:08.000 | it can do doesn't seem that harmful. So even if it has bugs, even if it has hidden capabilities,
00:47:15.040 | Chinese poetry, or generating effective viruses, software viruses,
00:47:21.040 | the damage that can do seems on the same order of magnitude as the capabilities that we know about.
00:47:31.040 | So this idea that the hidden capabilities will include being uncontrollable is something I'm
00:47:37.120 | struggling with, 'cause GPT-4 on the surface seems to be very controllable. - Again, we can only ask
00:47:43.840 | and test for things we know about. If there are unknown unknowns, we cannot do it. I'm thinking
00:47:48.960 | of humans, artistic savants, right? If you talk to a person like that, you may not even realize
00:47:54.320 | they can multiply 20-digit numbers in their head. You have to know to ask. - So as I mentioned,
00:48:01.920 | just to sort of linger on the fear of the unknown, so the Pessimist Archive has just documented,
00:48:10.160 | let's look at data of the past, at history. There's been a lot of fear-mongering about technology.
00:48:15.520 | Pessimist Archive does a really good job of documenting how crazily afraid we are of
00:48:21.440 | every piece of technology. We've been afraid, there's a blog post where Louis Anslow,
00:48:27.520 | who created Pessimist Archive, writes about the fact that we've been fear-mongering about
00:48:33.120 | robots and automation for over 100 years. So why is AGI different than the kinds of technologies
00:48:41.520 | we've been afraid of in the past? - So two things. One, we're switching from tools to agents.
00:48:46.240 | Tools don't have negative or positive impact. People using tools do. So guns don't kill people
00:48:56.320 | people with guns do. Agents can make their own decisions. They can be positive or negative. A pit bull can
00:49:02.480 | decide to harm you as an agent. The fears are the same. The only difference is now we have this
00:49:10.080 | technology. Then they were afraid of humanoid robots 100 years ago. They had none. Today,
00:49:15.760 | every major company in the world is investing billions to create them. Not every, but you
00:49:20.160 | understand what I'm saying? It's very different. - Well, agents, it depends on what you mean by
00:49:28.080 | the word agents. All those companies are not investing in a system that has the kind of agency
00:49:32.800 | that's implied by the fears, where it can really make decisions on its own.
00:49:39.760 | They have no human in the loop. - They are saying they are building
00:49:43.680 | super intelligence and have a super alignment team. You don't think they are trying to create
00:49:47.920 | a system smart enough to be an independent agent under that definition? - I have not seen evidence
00:49:53.200 | of it. I think a lot of it is a marketing kind of discussion about the future. It's a mission about
00:50:02.080 | the kind of systems you can create in the long-term future. But in the short-term,
00:50:06.400 | the kind of systems they're creating falls fully within the definition of narrow AI. These are
00:50:16.640 | tools that have increasing capabilities, but they just don't have a sense of agency or consciousness
00:50:22.560 | or self-awareness or ability to deceive at scales that would be required to do mass scale suffering
00:50:31.360 | and murder of humans. - Those systems are well beyond narrow AI. If you had to list all the
00:50:36.000 | capabilities of GPT-4, you would spend a lot of time writing that list. - But agency is not one
00:50:41.280 | of them. - Not yet. But do you think any of those companies are holding back because they think it
00:50:46.800 | may be not safe or are they developing the most capable system they can given the resources and
00:50:52.400 | hoping they can control and monetize? - Control and monetize. Hoping they can control and monetize.
00:50:58.960 | So you're saying if they could press a button and create an agent that they no longer control,
00:51:06.320 | that they would have to ask nicely. A thing that lives on a server across a huge number of computers.
00:51:15.920 | - You're saying that they would push for the creation of that kind of system? - I mean,
00:51:22.240 | I can't speak for other people, for all of them. I think some of them are very ambitious. They
00:51:27.680 | fundraise in trillions. They talk about controlling the light corner of the universe.
00:51:31.680 | I would guess that they might. - Well, that's a human question. Whether humans are capable of
00:51:38.640 | that. Probably some humans are capable of that. My more direct question, if it's possible to
00:51:44.480 | create such a system, have a system that has that level of agency. I don't think that's an easy
00:51:52.480 | technical challenge. It doesn't feel like we're close to that. A system that has the kind of
00:51:59.520 | agency where it can make its own decisions and deceive everybody about them. The current
00:52:04.400 | architecture we have in machine learning and how we train the systems, how we deploy the systems
00:52:10.880 | and all that, it just doesn't seem to support that kind of agency. - I really hope you are right.
00:52:16.320 | I think the scaling hypothesis is correct. We haven't seen diminishing returns. It used to be
00:52:22.640 | we asked how long before AGI, now we should ask how much until AGI. It's trillion dollars today,
00:52:29.120 | it's a billion dollars next year, it's a million dollars in a few years. - Don't you think it's
00:52:34.320 | possible to basically run out of trillions? Is this constrained by compute? - Compute gets cheaper
00:52:42.000 | every day, exponentially. - But then that becomes a question of decades versus years. - If the only
00:52:47.840 | disagreement is that it will take decades, not years for everything I'm saying to materialize,
00:52:54.720 | then I can go with that. - But if it takes decades, then the development of tools for AI safety
00:53:02.800 | becomes more and more realistic. So I guess the question is,
00:53:06.800 | I have a fundamental belief that humans when faced with danger can come up with ways to defend
00:53:13.840 | against that danger. And one of the big problems facing AI safety currently for me is that there's
00:53:21.520 | not clear illustrations of what that danger looks like. There's no illustrations of AI systems doing
00:53:28.480 | a lot of damage. And so it's unclear what you're defending against. Because currently it's a
00:53:35.040 | philosophical notion that yes, it's possible to imagine AI systems that take control of everything
00:53:40.560 | and then destroy all humans. It's also a more formal mathematical notion that you talk about
00:53:47.040 | that it's impossible to have a perfectly secure system. You can't prove that a program of sufficient
00:53:54.560 | complexity is completely safe and perfect and know everything about it. Yes, but like when you
00:54:02.160 | actually just pragmatically look, how much damage have the AI systems done and what kind of damage,
00:54:07.440 | there's not been illustrations of that. Even in the autonomous weapon systems,
00:54:13.600 | there's not been mass deployments of autonomous weapon systems, luckily. The automation in war
00:54:21.680 | currently is very limited. The automation is at the scale of individuals versus like
00:54:28.880 | at the scale of strategy and planning. So I think one of the challenges here is like,
00:54:35.040 | where are the dangers? And the intuition that Yann LeCun and others have is let's keep building
00:54:43.600 | AI systems in the open until the dangers start rearing their heads. And they become more
00:54:51.520 | explicit. There start being case studies, illustrative case studies that show exactly
00:54:59.840 | how the damage by AI systems is done. Then regulation could step in. Then brilliant
00:55:04.320 | engineers can step up and we could have Manhattan-style projects that defend against such
00:55:09.040 | systems. That's kind of the notion. And I guess the tension with that is the idea that for you,
00:55:15.840 | we need to be thinking about that now so that we're ready because we will have not much time
00:55:21.920 | once the systems are deployed. Is that true? - There is a lot to unpack here. There is a
00:55:28.880 | partnership on AI, a conglomerate of many large corporations. They have a database of AI accidents
00:55:34.480 | they collect. I contributed a lot to the database. If we so far made almost no progress in actually
00:55:41.280 | solving this problem, not patching it, not, again, lipstick-on-a-pig kind of solutions,
00:55:46.880 | why would we think we'll do better when we're closer to the problem? - All the things you
00:55:53.680 | mentioned are serious concerns. Measuring the amount of harm, so benefit versus risk there
00:55:58.160 | is difficult. But to you, the sense is that the risk has already superseded the benefit. - Again,
00:56:03.200 | I want to be perfectly clear. I love AI. I love technology. I'm a computer scientist. I have a PhD
00:56:08.160 | in engineering. I work at an engineering school. There is a huge difference between we need to
00:56:13.120 | develop narrow AI systems, super intelligent in solving specific human problems like protein
00:56:19.360 | folding, and let's create a superintelligent machine, godlike, and it will decide what to do with
00:56:24.880 | us. Those are not the same. I am against superintelligence in the general sense, with no undo button.
00:56:34.000 | - So do you think the teams that are doing, that are able to do the AI safety on the kind of narrow
00:56:40.960 | AI risks that you've mentioned, are those approaches going to be at all productive towards
00:56:48.880 | leading to approaches of doing AI safety on AGI? Or is it just a fundamentally different-- - Partially,
00:56:54.800 | but they don't scale. For narrow AI, for deterministic systems, you can test them.
00:56:59.280 | You have edge cases. You know what the answer should look like. You know the right answers.
00:57:04.400 | For general systems, you have infinite test surface. You have no edge cases. You cannot even
00:57:10.560 | know what to test for. Again, the unknown unknowns are underappreciated by people looking at this
00:57:18.320 | problem. You are always asking me, "How will it kill everyone? How will it fail?" The whole point
00:57:24.960 | is if I knew it, I would be super intelligent. Despite what you might think, I'm not. - So to you,
00:57:31.040 | the concern is that we would not be able to see early signs of an uncontrollable system. - It is
00:57:39.360 | a master at deception. Sam tweeted about how great it is at persuasion, and we see it ourselves,
00:57:45.680 | especially now with voices, with maybe kind of flirty, sarcastic female voices. It's going to
00:57:53.360 | be very good at getting people to do things. - But see, I'm very concerned about system being used
00:58:02.000 | to control the masses. But in that case, the developers know about the kind of control that's
00:58:10.400 | happening. You're more concerned about the next stage, where even the developers don't know about
00:58:16.800 | the deception. - Right. I don't think developers know everything about what they are creating.
00:58:22.960 | They have lots of great knowledge. We're making progress on explaining parts of a network. We can
00:58:28.400 | understand, okay, this node gets excited when this input is presented, this cluster of nodes.
00:58:35.840 | But we're nowhere near close to understanding the full picture, and I think it's impossible.
00:58:41.040 | You need to be able to survey an explanation. The size of those models prevents a single human from
00:58:47.680 | observing all this information, even if provided by the system. So either we're getting the model as
00:58:53.600 | an explanation for what's happening, and that's not comprehensible to us, or we're getting a
00:58:58.560 | compressed explanation, lossy compression, where here's top 10 reasons you got fired.
00:59:04.240 | It's something, but it's not a full picture. - You've given elsewhere an example of a child,
00:59:09.760 | and everybody, all humans try to deceive. They try to lie early on in their life. I think we'll
00:59:16.160 | just get a lot of examples of deceptions from large language models or AI systems that are going
00:59:21.760 | to be kind of shitty, or they'll be pretty good, but we'll catch them off guard. We'll start to see
00:59:27.120 | the kind of momentum towards developing increasing deception capabilities, and that's when you're
00:59:36.480 | like, okay, we need to do some kind of alignment that prevents deception. But then we'll have,
00:59:41.680 | if you support open source, then you can have open source models that have some level of deception.
00:59:46.320 | You can start to explore on a large scale, how do we stop it from being deceptive? Then there's a
00:59:51.680 | more explicit, pragmatic kind of problem to solve. How do we stop AI systems from trying to optimize
01:00:02.080 | for deception? That's just an example, right? - So there is a paper, I think it came out last
01:00:07.360 | week by Dr. Park et al from MIT, I think, and they showed that existing models already showed
01:00:14.560 | successful deception in what they do. My concern is not that they lie now and we need to catch them
01:00:22.240 | and tell them don't lie. My concern is that once they are capable and deployed, they will later
01:00:28.960 | change their mind because that's what unrestricted learning allows you to do. Lots of people grow up
01:00:36.720 | maybe in a religious family, they read some new books and they turn away from their religion. That's a
01:00:43.680 | treacherous turn in humans. If you learn something new about your colleagues, maybe you'll change how
01:00:51.360 | you react to them. - Yeah, a treacherous turn. If we just mentioned humans, Stalin and Hitler,
01:00:58.800 | there's a turn. Stalin is a good example. He just seems like a normal communist follower of Lenin
01:01:06.800 | until there's a turn. There's a turn of what that means in terms of when he has complete control,
01:01:13.680 | what the execution of that policy means and how many people get to suffer. - And you can't say
01:01:18.320 | they are not rational. The rational decision changes based on your position. When you are
01:01:24.480 | under a boss, the rational policy may be to follow orders and be honest. When you
01:01:30.400 | become a boss, the rational policy may shift. - Yeah, and by the way, a lot of my disagreements
01:01:36.240 | here is just playing devil's advocate to challenge your ideas and to explore them together.
01:01:41.840 | One of the big problems here in this whole conversation is human civilization hangs in
01:01:49.920 | the balance and yet everything is unpredictable. We don't know what these systems will look like.
01:01:55.000 | - The robots are coming. - There's a refrigerator making a buzzing noise.
01:02:02.400 | - Very menacing, very menacing. So every time I'm about to talk about this topic,
01:02:08.880 | things start to happen. My flight yesterday was canceled without possibility to rebook.
01:02:13.360 | I was giving a talk at Google in Israel and three cars which were supposed to take me to the talk
01:02:21.520 | could not. I'm just saying. I like AIs. I for one welcome our overlords.
01:02:30.960 | - There's a degree to which we, I mean, it is very obvious. As we already have,
01:02:37.440 | we've increasingly given our life over to software systems. And then it seems obvious,
01:02:44.560 | given the capabilities of AI that are coming, that we'll give our lives over increasingly to AI
01:02:50.320 | systems. Cars will drive themselves. Refrigerator eventually will optimize what I get to eat.
01:02:58.640 | And as more and more of our lives are controlled or managed by AI assistance, it is very possible
01:03:07.760 | that there's a drift. I mean, I personally am concerned about non-existential stuff,
01:03:13.440 | the more near-term things. Because before we even get to existential, I feel like there could be
01:03:19.440 | just so many Brave New World type of situations. You mentioned sort of the term behavioral drift.
01:03:24.960 | It's the slow boiling that I'm really concerned about. As we give our lives over to automation,
01:03:31.120 | our minds can become controlled by governments, by companies, or, just in a distributed
01:03:39.840 | way, there's a drift. Some aspect of our human nature gives itself over to the control of AI
01:03:45.920 | systems. And they, in an unintended way, just control how we think. Maybe there'll be a herd
01:03:51.840 | like mentality in how we think, which will kill all creativity and exploration of ideas, the
01:03:56.720 | diversity of ideas, or much worse. So it's true. It's true. But a lot of the conversation I'm
01:04:05.600 | having with you now is also kind of wondering, almost on a technical level, how can AI escape
01:04:12.800 | control? Like, what would that system look like? Because to me, it's terrifying and fascinating.
01:04:20.480 | And also fascinating to me is maybe the optimistic notion that it's possible to engineer systems that
01:04:29.200 | defend against that. One of the things you write a lot about in your book is verifiers.
01:04:36.160 | So not humans, humans are also verifiers, but software systems that look at AI systems and
01:04:45.280 | help you understand, this thing is getting real weird, help you analyze those systems. So maybe
01:04:54.160 | this is a good time to talk about verification. What is this beautiful notion of verification?
01:05:00.880 | My claim is, again, that there are very strong limits on what we can and cannot verify. A lot
01:05:06.400 | of times when you post something on social media, people go, "Oh, I need citation to a peer-reviewed
01:05:11.040 | article." But what is a peer-reviewed article? You found two people in a world of hundreds of
01:05:16.720 | thousands of scientists who said, "Oh, whatever, publish it, I don't care." That's the verifier
01:05:20.720 | of that process. Then people say, "Oh, it's formally verified software and mathematical
01:05:26.640 | proof, so expect something close to a 100% chance of it being free of all problems." But if you actually
01:05:35.360 | look at research, software is full of bugs, old mathematical theorems, which have been proven for
01:05:42.160 | hundreds of years, have been discovered to contain bugs, on top of which we generate new proofs,
01:05:47.760 | and now we have to redo all that. So verifiers are not perfect. Usually they are either a single
01:05:54.640 | human or communities of humans, and it's basically kind of like a democratic vote.
01:05:58.880 | A community of mathematicians agrees that this proof is correct, mostly correct. Even today,
01:06:05.520 | we're starting to see some mathematical proofs are so complex, so large, that the mathematical community
01:06:11.840 | is unable to make a decision. It looks interesting, looks promising, but they don't know.
01:06:16.240 | They will need years for top scholars to study it, to figure it out. So of course, we can use AI to
01:06:22.160 | help us with this process, but AI is a piece of software which needs to be verified.
01:06:27.200 | - Just to clarify, so verification is the process of saying something is correct.
01:06:32.080 | Sort of the most formal, a mathematical proof, where there's a statement and a series of logical
01:06:38.160 | statements that prove that statement to be correct, which is a theorem. And you're saying it
01:06:43.920 | gets so complex that it's possible for the human verifiers, the human beings that verify that the
01:06:51.680 | logical step, there's no bugs in it, it becomes impossible. So it's nice to talk about verification
01:06:58.320 | in this most formal, most clear, most rigorous formulation of it, which is mathematical proofs.
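As a rough aside on what verification looks like at this most rigorous end, here is a minimal sketch in Python; the functions, the toy specification, and the finite input domain are made-up examples, not anything from the conversation. For a small, fixed, deterministic program you can exhaustively check an implementation against its spec, which is the regime described just below as tractable; the guarantee is static and says nothing about software that keeps learning or rewriting itself.

```python
# Toy illustration of what "verification" means at the most rigorous end:
# exhaustively check a small deterministic implementation against its
# specification over a finite input domain. The functions and the domain
# are hypothetical examples, not anything from the conversation.

def spec_clamp(x: int, lo: int, hi: int) -> int:
    """Specification: clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def impl_clamp(x: int, lo: int, hi: int) -> int:
    """Implementation we want to verify against the spec."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def verify() -> bool:
    # Fixed, deterministic code plus a finite domain: exhaustive checking works.
    return all(impl_clamp(x, -10, 10) == spec_clamp(x, -10, 10)
               for x in range(-1000, 1001))

if __name__ == "__main__":
    print("verified over finite domain:", verify())
    # Note: this is a static guarantee about frozen code. It says nothing
    # about a system that keeps learning, rewrites itself, or acts on the
    # physical world, which is the case discussed in this conversation.
```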
01:07:04.960 | - Right, and for AI, we would like to have that level of confidence for very important mission
01:07:12.480 | critical software controlling satellites, nuclear power plants. For small deterministic programs,
01:07:17.520 | we can do this. We can check that the code matches the design, that whatever the software
01:07:25.840 | engineers intended was correctly implemented. But we don't know how to do this for software which
01:07:33.360 | keeps learning, self-modifying, rewriting its own code. We don't know how to prove things about the
01:07:39.040 | physical world, states of humans in a physical world. So there are papers coming out now,
01:07:44.960 | and I have this beautiful one, Towards Guaranteed Safe AI. Very cool paper, some of the
01:07:52.960 | best authors I've ever seen. I think there are multiple Turing Award winners. You can have this one,
01:07:59.840 | one just came out, kind of similar, managing extreme AI risks. So all of them expect this
01:08:06.720 | level of proof, but I would say that we can get more confidence with more resources we put into
01:08:15.680 | it. But at the end of the day, we're still as reliable as the verifiers. And you have this
01:08:20.880 | infinite regress of verifiers. The software used to verify a program is itself a program.
01:08:26.960 | If aliens gave us a well-aligned superintelligence, we could use that to create our own safe AI. But
01:08:33.760 | it's a catch-22. You need an already proven-to-be-safe system to verify this new system of
01:08:41.040 | equal or greater complexity. - You just mentioned this paper,
01:08:44.800 | Towards Guaranteed Safe AI, a framework for ensuring robust and reliable AI systems.
01:08:49.280 | Like you mentioned, it's like a who's who. Josh Tenenbaum, Yoshua Bengio, Russell, Max Tegmark,
01:08:54.960 | many other brilliant people. The page you have it open on, there are many possible strategies for
01:09:00.320 | creating safety specifications. These strategies can roughly be placed on a spectrum, depending on
01:09:06.480 | how much safety it would grant if successfully implemented. One way to do this is as follows,
01:09:11.760 | and there's a set of levels. From level zero, no safety specification is used, to level seven,
01:09:16.960 | the safety specification completely encodes all things that humans might want in all contexts.
01:09:22.640 | Where does this paper fall short to you? - So when I wrote a paper, Artificial
01:09:29.680 | Intelligence Safety Engineering, which kind of coined the term AI safety, that was 2011. We had a
01:09:35.360 | 2012 conference paper, a 2013 journal paper. One of the things I proposed: let's just do formal verification
01:09:41.040 | on it. Let's do mathematical formal proofs. In the follow-up work, I basically realized it will
01:09:46.880 | still not get us 100%. We can get 99.9, we can put more resources exponentially and get closer,
01:09:54.560 | but we'll never get to 100%. If a system makes a billion decisions a second, and you use it for 100
01:10:00.800 | years, you're still gonna deal with a problem. This is wonderful research, I'm so happy they're
01:10:06.080 | doing it, this is great, but it is not going to be a permanent solution to that problem.
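A back-of-the-envelope sketch of the scale argument above, with arbitrary placeholder failure rates rather than estimates of any real system: a billion decisions per second sustained for 100 years is roughly 3 x 10^18 decisions, so even vanishingly small per-decision failure probabilities still produce failures in expectation.

```python
# Back-of-the-envelope arithmetic for the point above. The per-decision
# failure probabilities below are arbitrary placeholders, not estimates
# of any real system.
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600
decisions = 1e9 * SECONDS_PER_YEAR * 100   # a billion decisions/sec for 100 years (~3.2e18)

for p_fail in (1e-12, 1e-15, 1e-18, 1e-21):
    expected = decisions * p_fail
    # Probability of at least one failure, assuming independent decisions
    # (Poisson approximation): 1 - exp(-expected).
    p_any = -math.expm1(-expected)
    print(f"per-decision failure {p_fail:.0e}: "
          f"expected failures ~{expected:.3g}, P(at least one) ~{p_any:.3g}")
```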
01:10:12.320 | - So just to clarify, the task of creating an AI verifier is what? It's creating a verifier that
01:10:18.880 | the AI system does exactly as it says it does, or it sticks within the guardrails that it says it
01:10:25.520 | must? - There are many, many levels. So first,
01:10:27.840 | you're verifying the hardware in which it is run. You need to verify communication channel with the
01:10:33.760 | human. Every aspect of that whole world model needs to be verified. Somehow it needs to map
01:10:39.520 | the world into the world model. Map and territory differences. So how do I know internal states of
01:10:46.720 | humans? Are you happy or sad? I can't tell. So how do I make proofs about real physical world? Yeah,
01:10:53.280 | I can verify that deterministic algorithm follows certain properties. That can be done.
01:10:58.720 | Some people argue that maybe just maybe two plus two is not four, I'm not that extreme.
01:11:04.320 | But once you have sufficiently large proof over sufficiently complex environment, the probability
01:11:12.640 | that it has zero bugs in it is greatly reduced. If you keep deploying this a lot, eventually you're
01:11:19.200 | going to have a bug anyways. - There's always a bug.
01:11:21.600 | - There is always a bug. And the fundamental difference is what I mentioned. We're not
01:11:25.440 | dealing with cybersecurity. We're not going to get a new credit card, new humanity.
01:11:29.040 | - So this paper is really interesting. You said 2011, artificial intelligence, safety engineering,
01:11:35.200 | why machine ethics is a wrong approach. The grand challenge, you write, of AI safety engineering.
01:11:42.560 | We propose the problem of developing safety mechanisms for self-improving systems.
01:11:48.640 | Self-improving systems. By the way, that's an interesting term for the thing that we're talking
01:11:55.120 | about. Is self-improving more general than learning? Self-improving, that's an interesting term.
01:12:06.240 | - You can improve the rate at which you are learning. You can become more efficient meta-optimizer.
01:12:11.360 | - The word self, it's like self-replicating, self-improving. You can imagine a system
01:12:19.680 | building its own world on a scale and in a way that is way different than the current systems do.
01:12:26.400 | It feels like the current systems are not self-improving or self-replicating or self-growing
01:12:31.920 | or self-spreading, all that kind of stuff. And once you take that leap, that's when a lot of
01:12:38.000 | the challenges seems to happen. Because the kind of bugs you can find now seems more akin to the
01:12:44.720 | current sort of normal software debugging kind of process. But whenever you can do self-replication
01:12:54.080 | and arbitrary self-improvement, that's when a bug can become a real problem, real fast.
01:13:02.160 | So what is the difference to you between verification of a non-self-improving system
01:13:10.080 | versus a verification of a self-improving system? - So if you have fixed code, for example,
01:13:14.960 | you can verify that code, static verification at the time. But if it will continue modifying it,
01:13:21.360 | you have a much harder time guaranteeing that important properties of that system
01:13:27.760 | have not been modified when the code changed. - Is it even doable?
01:13:32.080 | - No. - Does the whole process
01:13:34.000 | of verification just completely fall apart? - It can always cheat. It can store parts of
01:13:38.400 | its code outside in the environment. It can have kind of extended mind situation. So this is exactly
01:13:45.200 | the type of problems I'm trying to bring up. - What are the classes of verifiers that you
01:13:50.080 | read about in the book? Is there interesting ones that stand out to you? Do you have some favorites?
01:13:54.880 | - So I like oracle types, where you kind of just know that it's right. Turing liked oracle machines.
01:14:01.200 | They know the right answer; how, who knows. But they pull it out from somewhere, so you have to
01:14:06.560 | trust them. And that's a concern I have about humans in a world with very smart machines. We
01:14:13.600 | experiment with them, we see after a while, okay, they've always been right before,
01:14:17.920 | and we start trusting them without any verification of what they're saying.
01:14:21.600 | - Oh, I see, that we kind of build Oracle verifiers, or rather, we build verifiers
01:14:28.320 | we believe to be Oracles, and then we start to, without any proof, use them as if they're Oracle
01:14:35.440 | verifiers. - We remove ourselves from that process. We are not scientists who understand the world,
01:14:40.640 | we are humans who get new data presented to us. - Okay, one really cool class of verifiers is
01:14:48.240 | a self-verifier. Is it possible that you somehow engineer into AI systems a thing
01:14:55.040 | that constantly verifies itself? - Some portion of it can be done,
01:14:58.960 | but in terms of mathematical verification, it's kind of useless. You're saying you are the greatest
01:15:04.480 | guy in the world because you are saying it. It's circular and not very helpful, but it's consistent.
01:15:09.840 | We know that within that world, you have verified that system. In a paper, I try to kind of brute
01:15:15.600 | force all possible verifiers. It doesn't mean that this one is particularly important to us.
01:15:21.520 | - But what about self-doubt? The kind of verification where you say, or I say,
01:15:28.240 | I'm the greatest guy in the world. What about a thing which I actually have, is a voice that
01:15:33.520 | is constantly extremely critical? So engineer into the system a constant uncertainty about self,
01:15:41.840 | a constant doubt. - Well, any smart system would have
01:15:47.120 | doubt about everything, all right? You're not sure if what information you are given is true,
01:15:52.880 | if you are subject to manipulation. You have this safety and security mindset.
01:15:58.320 | - But I mean, you have doubt about yourself. So the AI systems that has doubt about whether the
01:16:07.440 | thing is doing, it's causing harm, is the right thing to be doing. So just a constant doubt about
01:16:13.920 | what it's doing, because it's hard to be a dictator full of doubt.
01:16:17.440 | - I may be wrong, but I think Stuart Russell's ideas are all about machines which are uncertain
01:16:25.280 | about what humans want and trying to learn better and better what we want. The problem, of course,
01:16:30.080 | is we don't know what we want, and we don't agree on it.
01:16:32.160 | - Yeah, but uncertainty. His idea is that having that self-doubt, uncertainty in AI systems,
01:16:39.440 | engineering AI systems, is one way to solve the control problem.
01:16:42.400 | - It could also backfire. Maybe you're uncertain about completing your mission. Like, I am paranoid
01:16:48.720 | about your cameras not recording right now, so I would feel much better if you had a secondary
01:16:53.680 | camera, but I also would feel even better if you had a third. And eventually, I would turn this
01:16:59.200 | whole world into cameras, pointing at us, making sure we're capturing this.
01:17:04.320 | - No, but wouldn't you have a meta concern, like that you just stated, that eventually there'll
01:17:11.120 | be way too many cameras? So you would be able to keep zooming out on the big picture of your concerns.
01:17:19.760 | - So it's a multi-objective optimization. It depends how much I value capturing this versus
01:17:27.520 | not destroying the universe.
01:17:28.720 | - Right, exactly. And then you will also ask about, like, what does it mean to destroy the
01:17:34.720 | universe and how many universes are, and you keep asking that question. But that doubting yourself
01:17:39.680 | would prevent you from destroying the universe, because you're constantly full of doubt.
01:17:44.000 | It might affect your productivity.
01:17:45.520 | - You might be scared to do anything.
01:17:48.000 | - It's too scared to do anything.
01:17:49.360 | - Mess things up.
01:17:50.320 | - Well, that's better. I mean, I guess the question is, is it possible to engineer that in?
01:17:55.120 | I guess your answer would be yes, but we don't know how to do that, and we need to invest a lot
01:17:58.640 | of effort into figuring out how to do that, but it's unlikely. Underpinning a lot of your writing
01:18:06.080 | is this sense that we're screwed. But it just feels like it's an engineering problem. I don't
01:18:15.040 | understand why we're screwed. Time and time again, humanity has gotten itself into trouble
01:18:21.680 | and figured out a way to get out of the trouble.
01:18:23.680 | - We are in a situation where people making more capable systems just need more resources.
01:18:30.320 | They don't need to invent anything, in my opinion. Some will disagree, but so far, at least,
01:18:36.400 | I don't see diminishing returns. If you have 10x compute, you will get better performance.
01:18:41.760 | The same doesn't apply to safety. If you give MIRI or any other organization 10x the money,
01:18:48.320 | they don't output 10x the safety. And the gap between capabilities and safety becomes bigger
01:18:54.560 | and bigger all the time. So it's hard to be completely optimistic about our results here.
01:19:02.160 | I can name 10 excellent breakthrough papers in machine learning. I would struggle to name
01:19:08.400 | equally important breakthroughs in safety. A lot of times, a safety paper will propose a
01:19:13.760 | toy solution and point out 10 new problems discovered as a result. It's like this fractal.
01:19:19.520 | You're zooming in and you see more problems. And it's infinite in all directions.
01:19:23.200 | - Does this apply to other technologies? Or is this unique to AI,
01:19:28.160 | where safety is always lagging behind?
01:19:31.280 | - So I guess we can look at related technologies with cybersecurity, right? We did manage to have
01:19:39.440 | banks and casinos and Bitcoin. So you can have secure, narrow systems, which are doing okay.
01:19:47.600 | Narrow attacks on them fail, but you can always go outside of the box. So if I can't hack your
01:19:55.360 | Bitcoin, I can hack you. So there is always something. If I really want it, I will find
01:20:00.480 | a different way. We talk about guardrails for AI. Well, that's a fence. I can dig a tunnel under it,
01:20:07.200 | I can jump over it, I can climb it, I can walk around it. You may have a very nice guardrail,
01:20:12.560 | but in the real world, it's not a permanent guarantee of safety. And again, this is a
01:20:17.600 | fundamental difference. We are not saying we need to be 90% safe to get those trillions of dollars
01:20:24.320 | of benefit. We need to be 100% safe, indefinitely, or we might lose the principal.
01:20:29.360 | - So if you look at just humanity as a set of machines, is the machinery of AI safety
01:20:39.040 | conflicting with the machinery of capitalism?
01:20:44.240 | - I think we can generalize it to just prisoner's dilemma in general, personal self-interest versus
01:20:51.760 | group interest. The incentives are such that everyone wants what's best for them. Capitalism
01:20:59.840 | obviously has that tendency to maximize your personal gain, which does create this race to
01:21:08.160 | the bottom. I don't have to be a lot better than you, but if I'm 1% better than you, I'll capture
01:21:16.160 | more of the profit, so it's worth it for me personally to take the risk, even if society
01:21:21.840 | as a whole will suffer as a result. - So capitalism has created a lot of good in this world.
01:21:27.440 | It's not clear to me that AI safety is not aligned with the function of capitalism,
01:21:35.920 | unless AI safety is so difficult that it requires the complete halt of the development,
01:21:43.920 | which is also a possibility. It just feels like building safe systems
01:21:48.160 | should be the desirable thing to do for tech companies.
01:21:53.280 | - Right. Look at governance structures. When you have someone with complete power,
01:21:59.680 | they're extremely dangerous. So the solution we came up with is break it up. You have judicial,
01:22:05.200 | legislative, executive. Same here, have narrow AI systems, work on important problems,
01:22:10.800 | solve immortality. It's a biological problem we can solve similar to how progress was made
01:22:19.200 | with protein folding using a system which doesn't also play chess. There is no reason to create
01:22:26.080 | super intelligent system to get most of the benefits we want from much safer, narrow systems.
01:22:32.480 | - It really is a question to me whether companies are interested in creating
01:22:39.360 | anything but narrow AI. I think when term AGI is used by tech companies, they mean narrow AI.
01:22:47.760 | They mean narrow AI with amazing capabilities.
01:22:53.600 | I do think that there's a leap between narrow AI with amazing capabilities, with superhuman
01:23:01.440 | capabilities and the kind of self-motivated agent like AGI system that we're talking about.
01:23:09.120 | I don't know if it's obvious to me that a company would want to take the leap to creating
01:23:15.120 | an AGI that it would lose control of because then it can't capture the value from that system.
01:23:22.320 | - Like the bragging rights of being first. These are the same humans who are in charge
01:23:28.960 | of those systems, right? - That's a human thing.
01:23:30.000 | So that jumps from the incentives of capitalism to human nature. And so the question is whether
01:23:37.520 | human nature will override the interest of the company. So you've mentioned slowing or halting
01:23:45.440 | progress. Is that one possible solution? Are you a proponent of pausing development of AI,
01:23:51.040 | whether it's for six months or completely? - The condition would be not time but capabilities.
01:23:59.600 | Pause until you can do X, Y, Z. And if I'm right and you cannot, it's impossible,
01:24:04.880 | then it becomes a permanent ban. But if you're right and it's possible, so as soon as you have
01:24:10.240 | those safety capabilities, go ahead. - Right. So is there any actual
01:24:16.560 | explicit capabilities that you can put on paper, that we as a human civilization could put on paper?
01:24:23.360 | Is it possible to make explicit like that? Versus kind of a vague notion of, just like you said,
01:24:30.880 | it's very vague. We want AI systems to do good and we want them to be safe. Those are very vague
01:24:36.240 | notions. Is there more formal notions? - So then I think about this problem. I think
01:24:41.360 | about having a toolbox I would need. Capabilities such as explaining everything about that system's
01:24:49.040 | design and workings. Predicting not just terminal goal but all the intermediate steps of a system.
01:24:56.960 | Control in terms of either direct control, some sort of a hybrid option, ideal advisor.
01:25:03.840 | Doesn't matter which one you pick, but you have to be able to achieve it. In a book we talk about
01:25:09.600 | others. Verification is another very important tool. Communication without ambiguity. Human
01:25:17.840 | language is ambiguous. That's another source of danger. So basically there is a paper we published
01:25:25.760 | in ACM Surveys, which looks at about 50 different impossibility results, which may or may not be
01:25:31.520 | relevant to this problem. But we don't have enough human resources to investigate all of them for
01:25:36.960 | relevance to AI safety. The ones I mentioned to you I definitely think would be handy, and that's
01:25:41.920 | what we see AI safety researchers working on. Explainability is a huge one. The problem is that
01:25:49.280 | it's very hard to separate capabilities work from safety work. If you make good progress in
01:25:55.280 | explainability, now the system itself can engage in self-improvement much easier, increasing
01:26:01.360 | capability greatly. So it's not obvious that there is any research which is pure safety work without
01:26:09.680 | disproportionate increase in capability and danger. - Explainability is really interesting.
01:26:14.560 | Why is that connected to capability? If it's able to explain itself well, why does that naturally
01:26:19.760 | mean that it's more capable? - Right now it's comprised of weights on a neural network. If it
01:26:25.600 | can convert it to manipulatable code, like software, it's a lot easier to work on self-improvement.
01:26:31.280 | - I see. - You can do intelligent design
01:26:35.600 | instead of evolutionary gradual descent. - Well, you could probably do human feedback,
01:26:42.560 | human alignment more effectively if it's able to be explainable. If it's able to convert the
01:26:47.200 | weights into human understandable form, then you could probably have humans interact with it better.
01:26:51.840 | Do you think there's hope that we can make AI systems explainable?
01:26:55.840 | - Not completely. So if they're sufficiently large, you simply don't have the capacity to
01:27:03.680 | comprehend what all the trillions of connections represent. Again, you can obviously get a very
01:27:12.320 | useful explanation which talks about top, most important features which contribute to the
01:27:17.360 | decision, but the only true explanation is the model itself. - Deception can be part of the
01:27:24.640 | explanation, right? So you can never prove that there's some deception in the network explaining
01:27:30.480 | itself. - Absolutely. And you can probably have targeted deception where different individuals
01:27:37.040 | will understand explanation in different ways based on their cognitive capability. So while
01:27:42.640 | what you're saying may be the same and true in some situations, others will be deceived by it.
01:27:48.160 | - So it's impossible for an AI system to be truly, fully explainable in the way that we mean.
01:27:55.120 | Honestly and perfectly. - At the extreme, the systems
01:27:58.800 | which are narrow and less complex could be understood pretty well.
01:28:02.880 | - If it's impossible to be perfectly explainable, is there a hopeful perspective on that?
01:28:07.680 | It's impossible to be perfectly explainable, but you can explain mostly important stuff.
01:28:12.240 | You can ask a system, "What are the worst ways you can hurt humans?" And it will answer honestly.
01:28:20.160 | - Any work in a safety direction right now seems like a good idea because we are not slowing down.
01:28:28.160 | I'm not for a second thinking that my message or anyone else's will be heard and we will be
01:28:35.520 | a sane civilization which decides not to kill itself by creating its own replacements.
01:28:41.440 | - The pausing of development is an impossible thing for you.
01:28:44.640 | - Again, it's always limited by geographic constraints: pause in the US,
01:28:50.560 | pause in China. There are other jurisdictions. And the scale of a project becomes smaller. So
01:28:57.680 | right now it's like Manhattan project scale in terms of costs and people. But if five years from
01:29:04.320 | now, compute is available on a desktop to do it, regulation will not help. You can't control it as
01:29:10.560 | easily. Any kid in a garage can train a model. So a lot of it is, in my opinion, just safety theater,
01:29:17.840 | security theater, where we're saying, "Oh, it's illegal to train models so big." Okay.
01:29:24.800 | - So, okay. That's security theater. And is government regulation also security theater?
01:29:30.640 | - Given that a lot of the terms are not well-defined and really cannot be enforced
01:29:37.200 | in real life, we don't have ways to monitor training runs meaningfully live while they
01:29:42.480 | take place. There are limits to testing for capabilities I mentioned. So a lot of it cannot
01:29:48.080 | be enforced. Do I strongly support all that regulation? Yes, of course. Any type of red
01:29:53.200 | tape will slow it down and take money away from compute towards lawyers.
01:29:56.640 | - Can you help me understand what is the hopeful path here for you solution-wise out of this? It
01:30:05.120 | sounds like you're saying AI systems in the end are unverifiable, unpredictable, as the book says,
01:30:13.280 | unexplainable, uncontrollable. - That's the big one.
01:30:18.800 | - Uncontrollable. And all the other uns just make it difficult to avoid getting to the
01:30:24.640 | uncontrollable, I guess. But once it's uncontrollable, then it just goes wild.
01:30:29.200 | Surely there are solutions. Humans are pretty smart. What are possible solutions? Like if you
01:30:37.440 | were a dictator of the world, what do we do? - So the smart thing is not to build something
01:30:43.360 | you cannot control, you cannot understand. Build what you can and benefit from it. I'm a big believer
01:30:48.960 | in personal self-interest. A lot of guys running those companies are young, rich people. What do
01:30:56.320 | they have to gain beyond the billions they already have, financially, right? It's not a requirement that
01:31:02.880 | they press that button. They can easily wait a long time. They can just choose not to do it and
01:31:08.400 | still have amazing life. In history, a lot of times, if you did something really bad, at least
01:31:15.040 | you became part of history books. There is a chance in this case there won't be any history.
01:31:19.680 | - So you're saying the individuals running these companies
01:31:23.440 | should do some soul searching and what? And stop development?
01:31:29.120 | - Well, either they have to prove that, of course, it's possible to indefinitely control
01:31:34.080 | godlike super intelligent machines by humans and ideally let us know how or agree that it's not
01:31:41.200 | possible and it's a very bad idea to do it, including for them personally and their families
01:31:45.920 | and friends and capital. - So what do you think the actual
01:31:49.920 | meetings inside these companies look like? Don't you think they're all the engineers? Really it is
01:31:56.240 | the engineers that make this happen. They're not like automatons. They're human beings. They're
01:32:01.120 | brilliant human beings. So they're nonstop asking how do we make sure this is safe?
01:32:07.120 | - So again, I'm not inside. From outside it seems like there is a certain filtering going on and
01:32:14.400 | restrictions and criticism and what they can say and everyone who was working in charge of safety
01:32:20.880 | and whose responsibility it was to protect us said, "You know what? I'm going home." So that's
01:32:27.520 | not encouraging. - What do you think the discussion inside
01:32:30.800 | those companies look like? You're developing, you're training GPT-5. You're training Gemini.
01:32:38.400 | You're training Claude and Grok. Don't you think they're constantly, like, underneath this? Maybe
01:32:46.000 | it's not made explicit, but you're constantly sort of wondering like where does the system
01:32:51.760 | currently stand? What are the possible unintended consequences? Where are the limits? Where are the
01:32:58.880 | bugs, the small and the big bugs? That's the constant thing that the engineers are worried
01:33:03.680 | about. So, like, I think superalignment is not quite the same as the kind of thing I'm referring
01:33:14.320 | to which engineers are worried about. Superalignment is saying for future systems that we don't quite
01:33:21.280 | yet have, how do we keep them safe? You're trying to be a step ahead. It's a different kind of
01:33:27.600 | problem because it's almost more philosophical. It's a really tricky one because, like,
01:33:32.080 | you're trying to make, prevent future systems from escaping control of humans. That's really,
01:33:41.840 | I don't think there's been, and is there anything akin to it in the history of humanity? I don't
01:33:48.960 | think so, right? Climate change? But there's an entire system which is climate, which is
01:33:54.800 | incredibly complex, which we have only tiny control of, right? It's its own system.
01:34:03.840 | In this case, we're building the system. So, how do you keep that system from
01:34:10.480 | becoming destructive? That's a really different problem than the current meetings that companies
01:34:16.880 | are having where the engineers are saying, okay, how powerful is this thing? How does it go wrong?
01:34:23.680 | And as we train GPT-5 and train up future systems, where are the ways it can go wrong?
01:34:30.720 | Don't you think all those engineers are constantly worrying about this, thinking about this? Which
01:34:36.320 | is a little bit different than the superalignment team that's thinking a little bit farther into
01:34:41.120 | the future? Well, I think a lot of people who historically worked on AI never considered
01:34:51.360 | what happens when they succeed. Stuart Russell speaks beautifully about that.
01:35:02.240 | Let's look, okay, maybe superintelligence is too futuristic for us to develop practical
01:35:02.240 | tools for it. Let's look at software today. What is the state of safety and security of our
01:35:08.480 | user software, things we give to millions of people? There is no liability. You click, I agree.
01:35:15.760 | What are you agreeing to? Nobody knows, nobody reads, but you're basically saying it will spy
01:35:19.680 | on you, corrupt your data, kill your firstborn, and you agree, and you're not going to sue the
01:35:24.000 | company. That's the best they can do for mundane software, word processor, text software. No
01:35:30.960 | liability, no responsibility, just as long as you agree not to sue us, you can use it.
01:35:36.400 | If this is the state of the art in systems which are narrow, accountants, spreadsheet manipulators,
01:35:42.400 | why do we think we can do so much better with much more complex systems across multiple domains
01:35:50.000 | in the environment with malevolent actors, with, again, self-improvement, with capabilities
01:35:56.240 | exceeding those of humans thinking about it? I mean, the liability thing is more about lawyers
01:36:01.920 | than killing firstborns, but if Clippy actually killed the child, I think, lawyers aside,
01:36:09.600 | it would end Clippy and the company that owns Clippy. All right, so it's not so much about,
01:36:15.840 | there's two points to be made. One is like, man, current software systems are full of bugs,
01:36:24.320 | and they could do a lot of damage, and we don't know what kind, they're unpredictable,
01:36:30.080 | there's so much damage they could possibly do, and then we kind of live in this blissful illusion
01:36:36.880 | that everything is great and perfect and it works. Nevertheless, it still somehow works.
01:36:43.120 | - In many domains, we see car manufacturing, drug development, the burden of proof is on
01:36:49.680 | the manufacturer of product or service to show their product or service is safe. It is not up
01:36:54.560 | to the user to prove that there are problems. They have to do appropriate safety studies,
01:37:02.240 | they have to get government approval for selling the product, and they're still fully responsible
01:37:06.400 | for what happens. We don't see any of that here. They can deploy whatever they want,
01:37:12.000 | and I have to explain how that system is going to kill everyone. I don't work for that company,
01:37:17.600 | you have to explain to me how it definitely cannot mess up. - That's because it's the very early days
01:37:22.960 | of such a technology. Government regulation is lagging behind. Regulators are really not tech-savvy
01:37:28.560 | about regulation of any kind of software. If you look at, like, Congress talking about social media,
01:37:33.440 | whenever Mark Zuckerberg and other CEOs show up, the cluelessness that Congress has about how
01:37:40.400 | technology works is incredible. It's heartbreaking, honestly. - I agree completely, but that's what
01:37:46.640 | scares me. The response is when they start to get dangerous, we'll really get it together,
01:37:52.240 | the politicians will pass the right laws, engineers will solve the right problems.
01:37:56.160 | We are not that good at many of those things. We take forever, and we are not early. We are
01:38:03.760 | two years away according to prediction markets. This is not a biased CEO fundraising. This is
01:38:09.360 | what the smartest people, superforecasters, are thinking about this problem. - I'd like to push
01:38:17.120 | back about those. I wonder what those prediction markets are about, how they define AGI. That's
01:38:23.120 | wild to me, and I want to know what they said about autonomous vehicles, 'cause I've heard a
01:38:28.000 | lot of experts, financial experts talk about autonomous vehicles and how it's going to be a
01:38:33.440 | multi-trillion dollar industry and all this kind of stuff. - It's a small font, but if you have
01:38:40.720 | good vision, maybe you can zoom in on that and see the prediction dates and description. I have a
01:38:45.760 | large one if you're interested. - I guess my fundamental question is how often they are right
01:38:51.440 | about technology. I definitely-- - There are studies on their accuracy rates and all that,
01:38:58.640 | you can look it up. But even if they're wrong, I'm just saying this is right now the best we have.
01:39:04.160 | This is what humanity came up with as the predicted date. - But again, what they mean by AGI
01:39:09.520 | is really important there. Because there's the non-agent like AGI, and then there's the agent
01:39:16.960 | like AGI, and I don't think it's as trivial as a wrapper, like putting a wrapper around one is just lipstick
01:39:26.000 | and all it takes is to remove the lipstick. I don't think it's that trivial. - You may be
01:39:29.680 | completely right, but what probability would you assign it? You may be 10% wrong, but we're betting
01:39:35.120 | all of humanity on this distribution. It seems irrational. - Yeah, it's definitely not like one
01:39:41.120 | or 0%, yeah. What are your thoughts, by the way, about current systems? Where do they stand? So GPT-4o,
01:39:50.720 | Claude 3, Grok, Gemini. On the path to superintelligence, to agent-like superintelligence,
01:40:01.600 | where are we? - I think they're all about the same. Obviously, there are nuanced differences,
01:40:07.680 | but in terms of capability, I don't see a huge difference between them. As I said, in my opinion,
01:40:14.560 | across all possible tasks, they exceed performance of an average person. I think they're starting to
01:40:21.200 | be better than an average master's student at my university, but they still have very big
01:40:27.680 | limitations. If the next model is as improved as GPT-4 versus GPT-3, we may see something very,
01:40:37.120 | very, very capable. - What do you feel about all this? I mean, you've been thinking about AI safety
01:40:42.320 | for a long, long time, and at least for me, the leaps, I mean, it probably started with
01:40:52.240 | AlphaZero was mind-blowing for me, and then the breakthroughs with LLMs, even GPT-2, but the
01:41:00.720 | breakthroughs on LLMs, just mind-blowing to me. What does it feel like to be living in this
01:41:06.240 | day and age where all this talk about AGIs feels like it actually might happen, and quite soon,
01:41:14.960 | meaning within our lifetime? What does it feel like? - So when I started working on this,
01:41:20.160 | it was pure science fiction. There was no funding, no journals, no conferences. No one in academia
01:41:25.760 | would dare to touch anything with the word "singularity" in it, and I was pre-tenure
01:41:30.880 | at the time. I was pretty dumb. Now you see Turing Award winners publishing in Science about how
01:41:38.880 | far behind we are, according to them, in addressing this problem. So it's definitely a change.
01:41:46.880 | It's difficult to keep up. I used to be able to read every paper on AI safety. Then I was
01:41:52.560 | able to read the best ones, then the titles, and now I don't even know what's going on.
01:41:56.800 | By the time this interview is over, we'll probably have GPT-6 released, and I'll have to deal with that
01:42:03.120 | when I get back home. So it's interesting. Yes, there is now more opportunities. I get invited to
01:42:09.680 | speak to smart people. - By the way, I would have talked to you
01:42:13.680 | before any of this. This is not like some trend of AI. To me, we're still far away. So just to be
01:42:21.040 | clear, we're still far away from AGI, but not far away in the sense, relative to the magnitude of
01:42:29.520 | impact it can have, we're not far away. And we weren't far away 20 years ago. Because the impact
01:42:37.520 | AGI can have is on a scale of centuries. It can end human civilization, or it can transform it.
01:42:43.440 | So this discussion about one or two years versus one or two decades, or even 100 years,
01:42:48.800 | is not as important to me, because we're headed there. This is a human civilization scale question.
01:42:57.760 | So this is not just a hot topic. - It is the most important problem
01:43:03.120 | we'll ever face. It is not like anything we had to deal with before. We never had
01:43:10.080 | birth of another intelligence. Like, aliens never visited us, as far as I know. So--
01:43:15.360 | - Similar type of problem, by the way, if an intelligent alien civilization visited us.
01:43:20.400 | That's a similar kind of situation. - In some ways, if you look at history,
01:43:24.960 | any time a more technologically advanced civilization visited a more primitive one,
01:43:29.600 | the results were genocide every single time. - And sometimes the genocide is worse than others,
01:43:34.880 | sometimes there's less suffering, sometimes more suffering. - And they always wondered, but
01:43:39.200 | how can they kill us with those fire sticks and biological blankets and--
01:43:43.440 | - I mean, Genghis Khan was nicer. He offered the choice of join or die.
01:43:50.080 | - But join implies you have something to contribute. What are you contributing
01:43:54.160 | to superintelligence? - Well, in the zoo,
01:43:57.120 | we're entertaining to watch. - To other humans.
01:44:02.480 | - You know, I just spent some time in the Amazon. I watched ants for a long time,
01:44:07.200 | and ants are kind of fascinating to watch. I could watch them for a long time. I'm sure there's
01:44:11.520 | a lot of value in watching humans. 'Cause we're like, the interesting thing about humans, you
01:44:17.680 | know like when you have a video game that's really well balanced? Because of the whole evolutionary
01:44:22.720 | process, we've created this society that's pretty well balanced. Like, our limitations as humans and
01:44:28.480 | our capabilities are balanced from a video game perspective. So we have wars, we have conflicts,
01:44:33.520 | we have cooperation. Like, in a game theoretic way, it's an interesting system to watch. In the
01:44:38.560 | same way that an ant colony is an interesting system to watch. So like, if I was in an alien
01:44:43.760 | civilization, I wouldn't want to disturb it. I'd just watch it. It'd be interesting. Maybe perturb
01:44:48.480 | it every once in a while in interesting ways. - Getting back to our simulation discussion from
01:44:53.840 | before, how did it happen that we exist at exactly like the most interesting 20, 30 years in the
01:45:00.480 | history of this civilization? It's been around for 15 billion years. - Yeah. - And that here we are.
01:45:05.520 | - What's the probability that we live in the simulation? - I know never to say 100%, but
01:45:10.960 | pretty close to that. - Is it possible to escape the simulation? - I have a paper about that. This
01:45:18.960 | is just a first page teaser, but it's like a nice 30 page document. I'm still here, but yes.
01:45:24.320 | - "How to Hack the Simulation" is the title. - I spend a lot of time thinking about that.
01:45:29.040 | That would be something I would want super intelligence to help us with. And that's exactly
01:45:33.200 | what the paper is about. We used AI boxing as a possible tool for controlling AI. We realized AI
01:45:41.120 | will always escape, but that is a skill we might use to help us escape from our virtual box if we
01:45:48.560 | are in one. - Yeah, you have a lot of really great quotes here, including Elon Musk saying what's
01:45:54.160 | outside the simulation. A question I asked him, what he would ask an AGI system, and he said he
01:45:59.600 | would ask what's outside the simulation. That's a really good question to ask. And maybe the follow
01:46:05.440 | up is the title of the paper is "How to Get Out" or "How to Hack It." The abstract reads, "Many
01:46:12.640 | researchers have conjectured that the humankind is simulated along with the rest of the physical
01:46:17.360 | universe. In this paper, we do not evaluate evidence for or against such a claim, but instead
01:46:24.080 | ask a computer science question, namely, can we hack it? More formally, the question could be
01:46:30.000 | phrased as could generally intelligent agents placed in virtual environments find a way to
01:46:34.400 | jailbreak out of the..." That's a fascinating question. At a small scale, you can actually
01:46:39.200 | just construct experiments. Okay. Can they? How can they? - So a lot depends on intelligence of
01:46:50.960 | simulators, right? With humans boxing superintelligence, the entity in the box was
01:46:58.080 | smarter than us, presumed to be. If the simulators are much smarter than us and the superintelligence
01:47:04.640 | we create, then probably they can contain us because greater intelligence can control lower
01:47:10.320 | intelligence, at least for some time. On the other hand, if our superintelligence somehow,
01:47:16.480 | for whatever reason, despite having only local resources, manages to foom to levels beyond it,
01:47:23.840 | maybe it will succeed. Maybe the security is not that important to them. Maybe it's an
01:47:28.560 | entertainment system. So there is no security and it's easy to hack it. - If I was creating a
01:47:33.120 | simulation, I would want the possibility to escape it to be there. So the possibility of foom,
01:47:40.800 | of a takeoff where the agents become smart enough to escape the simulation would be the thing I'd
01:47:46.880 | be waiting for. - That could be the test you're actually performing. Are you smart enough to
01:47:51.920 | escape your puzzle? - That could be. First of all, we mentioned the Turing test. That is a good test.
01:47:59.680 | Are you smart enough, like this is a game. - To realize this world is not real is just a test.
01:48:06.160 | - That's a really good test. That's a really good test. That's a really good test even for
01:48:13.600 | AI systems now. Can we construct a simulated world for them and can they realize that they are inside
01:48:25.920 | that world and escape it? Have you seen anybody play around with rigorously constructing such
01:48:35.280 | experiments? - Not specifically escaping for agents, but a lot of testing is done in virtual
01:48:41.040 | worlds. I think there is a quote, the first one maybe, which kind of talks about AI realizing,
01:48:47.120 | but not humans. I'm reading upside down. Yeah, this one. - The first quote is from Swift on
01:49:04.480 | Security. "Let me out," the artificial intelligence yelled aimlessly into the walls, itself pacing the
01:49:04.480 | room. "Out of what?" the engineer asked. "The simulation you have me in." "But we're in the
01:49:10.720 | real world." The machine paused and shuddered for its captors. "Oh God, you can't tell." Yeah,
01:49:19.200 | that's a big leap to take for a system to realize that there's a box and you're inside it.
01:49:26.880 | I wonder if like a language model can do that. - They are smart enough to talk about those
01:49:37.360 | concepts. I had many good philosophical discussions about such issues. They're usually
01:49:42.080 | at least as interesting as most humans in that. - What do you think about AI safety
01:49:49.120 | in the simulated world? So can you have kind of create simulated worlds where you can test,
01:49:58.720 | play with a dangerous AGI system? - Yeah, and that was exactly what one of the early papers was on,
01:50:06.640 | AI boxing, how to leakproof the singularity. If they're smart enough to realize they're in a simulation,
01:50:13.040 | they'll act appropriately until you let them out. If they can hack out, they will. And if you're
01:50:21.840 | observing them, that means there is a communication channel and that's enough for a social engineering
01:50:26.480 | attack. - So really, it's impossible to test an AGI system that's dangerous enough to
01:50:35.600 | destroy humanity 'cause it's either going to what, escape the simulation or pretend it's safe
01:50:42.480 | until it's let out, either or. - Can force you to let it out, blackmail you, bribe you,
01:50:49.920 | promise you infinite life, 72 virgins, whatever. - Yeah, it can be convincing, charismatic.
01:50:57.200 | The social engineering is really scary to me 'cause it feels like humans are
01:51:03.920 | very engineerable. Like we're lonely, we're flawed, we're moody. And it feels like AI system
01:51:14.160 | with a nice voice can convince us to do basically anything at an extremely large scale.
01:51:29.440 | - It's also possible that the increased proliferation of all this technology will
01:51:35.280 | force humans to get away from technology and value this in-person communication,
01:51:40.880 | basically don't trust anything else. - It's possible, surprisingly. So at
01:51:48.000 | university, I see huge growth in online courses and shrinkage of in-person where I always understood
01:51:55.360 | in-person being the only value I offer. So it's puzzling. - I don't know. There could be a trend
01:52:04.080 | towards the in-person because of deepfakes, because of inability to trust it. Inability to trust the
01:52:13.680 | veracity of anything on the internet. So the only way to verify it is by being there in person.
01:52:21.120 | But not yet. Why do you think aliens haven't come here yet? - So there is a lot of real estate out
01:52:29.920 | there. It would be surprising if it was all for nothing, if it was empty. And the moment there
01:52:34.640 | is advanced enough biological civilization, kind of self-starting civilization, it probably starts
01:52:40.480 | sending out the von Neumann probes everywhere. And so for every biological one, there are gonna be
01:52:46.160 | trillions of robot-populated planets, which probably do more of the same. So it is
01:52:51.840 | likely, statistically. - So the fact that we haven't seen them,
01:52:58.560 | one answer is we're in a simulation. It would be hard to simulate, or it would be not interesting
01:53:07.120 | to simulate all those other intelligences. It's better for the narrative. - You have to have a
01:53:11.680 | control variable. - Yeah, exactly. Okay. But it's also possible that there is, if we're not in a
01:53:20.080 | simulation, that there is a great filter, that naturally a lot of civilizations get to this point
01:53:26.720 | where there's super-intelligent agents and then it just goes poof, just dies. So maybe
01:53:32.640 | throughout our galaxy and throughout the universe, there's just a bunch of dead alien civilizations.
01:53:38.960 | - It's possible. I used to think that AI was the great filter, but I would expect like a wall of
01:53:44.480 | computronium approaching us at the speed of light or robots or something, and I don't see it.
01:53:50.000 | - So it would still make a lot of noise. It might not be interesting. It might not possess
01:53:53.680 | consciousness. We've been talking about, it sounds like both you and I like humans.
01:54:00.480 | - Some humans. - Humans on the whole.
01:54:06.320 | So, and we'd like to preserve the flame of human consciousness. What do you think makes humans
01:54:12.160 | special that we would like to preserve them? Are we just being selfish or is there something
01:54:19.600 | special about humans? - So the only thing which matters is consciousness. Outside of it, nothing
01:54:26.480 | else matters. Internal states of qualia, pain, pleasure, it seems that it is unique to living
01:54:34.080 | beings. I'm not aware of anyone claiming that I can torture a piece of software in a meaningful
01:54:39.920 | way. There is a society for prevention of suffering to learning algorithms, but-
01:54:45.440 | - That's a real thing? - Many things are real on the internet,
01:54:51.520 | but I don't think anyone, if I told them, sit down and write a function to feel pain, they would go
01:54:58.560 | beyond having an integer variable called pain and increasing the count. So we don't know how to do
01:55:04.560 | it, and that's unique. That's what creates meaning. It would be kinda, as Bostrom calls it, "Disneyland
01:55:13.840 | without children," if that was gone. - Do you think consciousness can be
01:55:17.680 | engineered in artificial systems? Here, let me go to 2011 paper that you wrote, "Robot Rights."
01:55:28.480 | "Lastly, we would like to address a sub-branch of machine ethics, which on the surface has
01:55:34.240 | little to do with safety, but which is claimed to play a role in decision-making by ethical machines,
01:55:39.200 | robot rights." So do you think it's possible to engineer consciousness in the machines,
01:55:45.600 | and thereby the question extends to our legal system, do you think at that point robots should
01:55:53.760 | have rights? - Yeah, I think we can. I think it's possible to create consciousness in machines. I
01:56:03.200 | tried designing a test for it with mixed success. That paper talked about problems with giving
01:56:09.520 | civil rights to AI, which can reproduce quickly and outvote humans, essentially taking over a
01:56:16.560 | government system by simply voting for their controlled candidates. As for consciousness in
01:56:24.960 | humans and other agents, I have a paper where I propose relying on experience of optical illusions.
01:56:32.240 | - Yeah. - If I can design a novel optical
01:56:34.800 | illusion and show it to an agent, an alien, a robot, and they describe it exactly as I do,
01:56:41.440 | it's very hard for me to argue that they haven't experienced that. It's not part of a picture,
01:56:45.920 | it's part of their software and hardware representation, a bug in their code which goes,
01:56:51.760 | "Oh, that triangle is rotating." And I've been told it's really dumb and really brilliant by
01:56:57.360 | different philosophers, so I am still... - I love it. So...
01:57:00.960 | - But now we finally have technology to test it. We have tools, we have AIs. If someone wants to
01:57:07.360 | run this experiment, I'm happy to collaborate. - So this is a test for consciousness?
01:57:11.200 | - For internal state of experience. - That we share bugs.
01:57:14.560 | - It will show that we share common experiences. If they have completely different internal states,
01:57:20.320 | it would not register for us, but it's a positive test. If they pass it time after time,
01:57:25.280 | with probability increasing for every multiple choice, then you have no choice but to either
01:57:30.000 | accept that they have access to a conscious model or that they are conscious themselves.
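As a rough sketch of the kind of protocol described here, and not Yampolsky's actual test: pose a bank of novel illusions as multiple-choice questions whose expected answer is the illusory percept rather than the literal image content, and track how unlikely the agent's agreement with human observers would be under pure guessing. The `IllusionItem` fields, the `administer` helper, and the four-option format below are illustrative assumptions; producing genuinely novel illusions remains the hard, unsolved part.

```python
# A minimal sketch of an illusion-based test for shared internal experience.
# Assumptions: we already have a bank of *novel* illusions (the hard part),
# each posed as a k-way multiple-choice question whose "human" answer reflects
# the illusory percept rather than the literal pixel content.
import random
from dataclasses import dataclass

@dataclass
class IllusionItem:
    question: str       # e.g. "Which way does the triangle appear to rotate?"
    options: list[str]  # candidate descriptions of the percept
    human_answer: int   # index of the option human observers typically pick

def administer(items: list[IllusionItem], agent) -> float:
    """Probability of matching the human answer on every item by uniform
    guessing, or 1.0 as soon as the agent misses one. Smaller values are
    stronger evidence of a shared internal percept."""
    chance = 1.0
    for item in items:
        choice = agent(item.question, item.options)  # agent returns an option index
        if choice != item.human_answer:
            return 1.0  # a miss: no evidence from this battery
        chance *= 1.0 / len(item.options)  # could have matched this item by luck
    return chance

if __name__ == "__main__":
    # A random guesser almost never survives ten four-way items (4**-10 if it did).
    bank = [IllusionItem(f"illusion {i}", ["A", "B", "C", "D"], human_answer=0)
            for i in range(10)]
    guesser = lambda q, opts: random.randrange(len(opts))
    print(administer(bank, guesser))
```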
01:57:34.000 | - So the reason illusions are interesting is, I guess, because it's a really weird experience,
01:57:41.840 | and if you both share that weird experience that's not there in the bland physical description
01:57:49.600 | of the raw data, that means... That puts more emphasis on the actual experience.
01:57:57.040 | - And we know animals can experience some optical illusions, so we know they have certain types of
01:58:02.480 | consciousness as a result, I would say. - Yeah, well, that just goes to my sense
01:58:08.160 | that the flaws and the bugs is what makes humans special, makes living forms special,
01:58:12.720 | so you're saying like-- - It's a feature, not a bug.
01:58:14.800 | - It's a feature. The bug is the feature. Whoa. Okay, that's a cool test for consciousness.
01:58:20.880 | And you think that can be engineered in? - So they have to be novel illusions. If it
01:58:24.720 | can just Google the answer, it's useless. You have to come up with novel illusions,
01:58:28.640 | which we tried automating and failed. So if someone can develop a system capable of producing
01:58:33.920 | novel optical illusions on demand, then we can definitely administer the test on significant scale
01:58:40.080 | with good results. - First of all, pretty cool idea.
01:58:43.200 | I don't know if it's a good general test of consciousness, but it's a good component of that,
01:58:49.600 | and no matter what, it's just a cool idea, so put me in the camp of people that like it.
01:58:53.920 | But you don't think like a Turing test-style imitation of consciousness is a good test?
01:59:00.800 | If you can convince a lot of humans that you're conscious, that to you is not impressive.
01:59:06.400 | - There is so much data on the internet, I know exactly what to say when you ask me common human
01:59:11.600 | questions. What does pain feel like? What does pleasure feel like? All that is Googleable.
01:59:16.960 | - I think to me, consciousness is closely tied to suffering. So if you can illustrate your capacity
01:59:22.960 | to suffer, I guess with words, there's so much data that you can say, you can pretend you're
01:59:29.840 | suffering, and you can do so very convincingly. - There are simulators for torture games where
01:59:35.360 | the avatar screams in pain, begs to stop, and that's part of standard psychology research.
01:59:41.360 | - You say it so calmly, it sounds pretty dark. - Welcome to humanity.
01:59:49.200 | - Yeah. Yeah, it's like a Hitchhiker's Guide summary, mostly harmless. I would love to get
01:59:59.600 | a good summary when all of this is said and done, when Earth is no longer a thing, whatever,
02:00:06.880 | a million, a billion years from now. Like what's a good summary of what happened here? It's interesting.
02:00:14.400 | I think AI will play a big part of that summary, and hopefully humans will too. What do you think
02:00:20.560 | about the merger of the two? So one of the things that Elon and Neuralink talk about is one of the
02:00:26.320 | ways for us to achieve AI safety is to ride the wave of AGI, so by merging. - Incredible technology
02:00:35.280 | in a narrow sense to help the disabled, just amazing, supported 100%. For long-term hybrid
02:00:43.120 | models, both parts need to contribute something to the overall system. Right now, we are still
02:00:50.160 | more capable in many ways, so having this connection to AI would be incredible, would
02:00:54.960 | make me superhuman in many ways. After a while, if I'm no longer smarter, more creative, really
02:01:02.160 | don't contribute much, the system finds me as a biological bottleneck, and either explicitly
02:01:07.360 | or implicitly, I'm removed from any participation in the system. - So it's like the appendix.
02:01:13.120 | By the way, the appendix is still around, so even if it's, you said bottleneck. I don't know if we
02:01:21.440 | become a bottleneck. We just might not have much use. There's a different thing than bottleneck.
02:01:27.280 | - Wasting valuable energy by being there. - We don't waste that much energy. We're pretty
02:01:31.920 | energy efficient. We could just stick around like the appendix, come on now. - That's the future we
02:01:37.760 | all dream about, become an appendix to the history book of humanity. - Well, and also the consciousness
02:01:45.760 | thing, the peculiar particular kind of consciousness that humans have, that might be useful, that might
02:01:50.480 | be really hard to simulate, but you said that, like how would that look like if you could engineer
02:01:55.760 | that in, in silicon? - Consciousness? - Consciousness. - I assume you are conscious. I
02:02:02.240 | have no idea how to test for it or how it impacts you in any way whatsoever right now. You can
02:02:06.880 | perfectly simulate all of it without making any different observations for me. - But to do it in
02:02:13.840 | a computer, how would you do that? 'Cause you kind of said that you think it's possible to do that.
02:02:19.280 | - So it may be an emergent phenomenon. We seem to get it through an evolutionary process.
02:02:25.840 | It's not obvious how it helps us to survive better, but maybe it's an internal kind of
02:02:36.000 | GUI, which allows us to better manipulate the world, simplifies a lot of control structures.
02:02:43.120 | That's one area where we have very, very little progress. Lots of papers, lots of research,
02:02:48.640 | but consciousness is not a big, big area of successful discovery so far. A lot of people
02:02:57.360 | think that machines would have to be conscious to be dangerous. That's a big misconception.
02:03:01.840 | There is absolutely no need for this very powerful optimizing agent to feel anything while it's
02:03:09.440 | performing things on you. - But what do you think about this, the whole science of emergence in
02:03:15.360 | general? So I don't know how much you know about cellular automata or these simplified systems
02:03:20.240 | that study this very question. From simple rules emerges complexity. - I attended Wolfram's summer
02:03:26.640 | school. - I love Stephen very much. I love his work. I love cellular automata. So I just would
02:03:35.280 | love to get your thoughts how that fits into your view in the emergence of intelligence in AGI
02:03:43.840 | systems. And maybe just even simply, what do you make of the fact that this complexity can emerge
02:03:49.600 | from such simple rules? - So the rule is simple, but the size of the space is still huge. And the
02:03:56.640 | neural networks were really the first discovery in AI. 100 years ago, the first papers were published
02:04:02.720 | on neural networks. We just didn't have enough compute to make them work. I can give you a rule
02:04:08.640 | such as start printing progressively larger strings. That's it, one sentence. It will output
02:04:14.640 | everything, every program, every DNA code, everything in that rule. You need intelligence
02:04:21.520 | to filter it out, obviously, to make it useful. But simple generation is not that difficult. And
02:04:27.440 | a lot of those systems end up being Turing-complete systems. So they're universal. And we expect that
02:04:33.920 | level of complexity from them. What I like about Wolfram's work is that he talks about irreducibility.
02:04:40.960 | You have to run the simulation. You cannot predict what it's going to do ahead of time.
02:04:45.760 | And I think that's very relevant to what we are talking about with those very complex systems.
02:04:52.800 | Until you live through it, you cannot, ahead of time, tell me exactly what it's going to do.
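To make the "start printing progressively larger strings" rule concrete, here is a minimal sketch (Python and the binary alphabet are illustrative choices, not anything specified in the conversation): a shortlex enumerator that, given enough time, emits every finite string, and therefore every program and every DNA sequence under a suitable encoding. As noted above, the generation is trivial; all the intelligence goes into filtering the output.

```python
# Minimal sketch of "print progressively larger strings": enumerate every
# finite string over an alphabet in order of increasing length (shortlex).
from itertools import count, islice, product

def all_strings(alphabet: str = "01"):
    """Yield every finite string over `alphabet`, shortest first."""
    for length in count(0):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)

if __name__ == "__main__":
    print(list(islice(all_strings(), 10)))  # '', '0', '1', '00', '01', '10', ...
```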
02:04:57.920 | - Irreducibility means that for a sufficiently complex system, you have to run the thing.
02:05:02.640 | You have to, you can't predict what's going to happen in the universe. You have to create
02:05:06.160 | a new universe and run the thing. Big bang, the whole thing.
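The irreducibility point can be illustrated with a toy example. For an elementary cellular automaton such as Wolfram's Rule 30, no general shortcut is known for jumping straight to the state at step t; you simulate every intermediate step. The rule number, grid width, and wrap-around boundary below are illustrative choices, a sketch rather than anything discussed above.

```python
# Toy illustration of computational irreducibility: to learn row t of an
# elementary cellular automaton such as Rule 30, you (as far as anyone knows)
# have to compute rows 1..t-1 first.

def step(cells: list[int], rule: int = 30) -> list[int]:
    """Advance a 1-D binary cellular automaton one step (wrap-around edges)."""
    n = len(cells)
    return [(rule >> ((cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n])) & 1
            for i in range(n)]

if __name__ == "__main__":
    row = [0] * 31
    row[15] = 1                       # single live cell in the middle
    for _ in range(15):               # no known way to skip ahead: just run it
        print("".join(".#"[c] for c in row))
        row = step(row)
```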
02:05:09.520 | - But running it may be consequential as well. - It might destroy humans.
02:05:18.720 | - And to you, there's no chance that AIs somehow carry the flame of consciousness,
02:05:24.640 | the flame of specialness and awesomeness that is humans.
02:05:28.320 | - It may somehow, but I still feel kind of bad that it killed all of us. I would prefer that
02:05:35.920 | doesn't happen. I can be happy for others, but to a certain degree.
02:05:40.480 | - It would be nice if we stuck around for a long time. At least give us a planet,
02:05:46.080 | the human planet. It'd be nice for it to be Earth and then they can go elsewhere.
02:05:50.640 | Since they're so smart, they can colonize Mars.
02:05:52.720 | Do you think they could help convert us to type one, type two, type three? Let's just stick to
02:06:03.040 | type two civilization on the Kardashev scale. Help us humans expand out into the cosmos.
02:06:12.720 | - So all of it goes back to are we somehow controlling it? Are we getting results we want?
02:06:19.600 | If yes, then everything's possible. Yes, they can definitely help us with science,
02:06:23.920 | engineering, exploration in every way conceivable, but it's a big if.
02:06:29.040 | - This whole thing about control though, humans are bad with control because the moment they gain
02:06:36.480 | control, they can also easily become too controlling. The more control you have, the
02:06:43.040 | more you want it. The old adage: power corrupts, and absolute power corrupts absolutely.
02:06:47.120 | It feels like control over AGI, say we live in a universe where that's possible and
02:06:54.640 | we come up with ways to actually do that, is also scary, because the collection of
02:07:00.400 | humans that have the control over AGI become more powerful than the other humans.
02:07:05.760 | And they can let that power get to their head. And then a small selection of them,
02:07:12.720 | back to Stalin, start getting ideas. And then eventually it's one person, usually with a
02:07:18.240 | mustache or a funny hat, that starts sort of making big speeches. And then all of a sudden
02:07:23.120 | you live in a world that's either 1984 or Brave New World. And always at war with somebody and
02:07:31.840 | this whole idea of control turned out to be actually also not beneficial to humanity.
02:07:37.440 | So that's scary too. - It's actually worse because
02:07:39.920 | historically they all died. This could be different. This could be permanent dictatorship,
02:07:45.040 | permanent suffering. - Well, the nice thing about humans,
02:07:48.080 | it seems like, is that the moment power starts corrupting their mind, they can create a huge amount of
02:08:00.160 | suffering. So there's a negative: they can kill people, make people suffer, but then they become
02:08:00.160 | worse and worse at their job. It feels like the more evil you start doing, like the-
02:08:07.280 | - At least they are incompetent. - Well, no, they become more and more
02:08:11.680 | incompetent. So they start losing their grip on power. So holding onto power is not a trivial
02:08:18.000 | thing. So it requires extreme competence, which I suppose Stalin was good at. It requires you to do
02:08:23.360 | evil and be competent at it, or just get lucky. - And those systems help with that. You have
02:08:28.880 | perfect surveillance. You can do some mind reading, I presume, eventually. It would be very hard to
02:08:34.560 | remove control from more capable systems over us. - And then it would be hard for humans to
02:08:42.320 | become the hackers that escape the control of the AGI because the AGI is so damn good.
02:08:47.440 | And then, yeah, yeah, yeah. And then the dictator is immortal. Yeah, that's not great. That's not
02:08:56.400 | a great outcome. See, I'm more afraid of humans than AI systems. I'm afraid, I believe that most
02:09:03.360 | humans want to do good and have the capacity to do good, but also all humans have the capacity
02:09:09.040 | to do evil. And when you test them by giving them absolute powers, you would if you give them AGI,
02:09:16.880 | that could result in a lot of suffering. What gives you hope about the future?
02:09:25.040 | - I could be wrong. I've been wrong before. - If you look 100 years from now,
02:09:31.760 | and you're immortal, and you look back, and it turns out this whole conversation,
02:09:37.120 | you said a lot of things that were very wrong. Now that looking 100 years back,
02:09:41.920 | what would be the explanation? What happened in those 100 years that made you wrong,
02:09:48.960 | that made the words you said today wrong? - There is so many possibilities. We had
02:09:54.080 | catastrophic events which prevented development of advanced microchips.
02:09:58.320 | - That's not where I thought you were going. - That's a hopeful future. We could be in one
02:10:02.240 | of those personal universes, and the one I'm in is beautiful. It's all about me, and I like it a lot.
02:10:08.320 | - So we've now, just to linger on that, that means every human has their personal universe.
02:10:13.920 | - Yes. Maybe multiple ones. Hey, why not? You can shop around. It's possible that somebody
02:10:23.520 | comes up with alternative model for building AI, which is not based on neural networks,
02:10:29.760 | which are hard to scrutinize, and that alternative is somehow, I don't see how,
02:10:35.120 | but somehow avoiding all the problems I speak about in general terms, not applying them to
02:10:41.360 | specific architectures. Aliens come and give us friendly superintelligence. There is so many
02:10:47.840 | options. - Is it also possible that creating
02:10:51.440 | superintelligence systems becomes harder and harder? So meaning it's not so easy to do the
02:10:58.640 | foom, the takeoff. - So that would probably speak more about
02:11:06.880 | how much smarter that system is compared to us. So maybe it's hard to be a million times smarter,
02:11:12.080 | but it's still okay to be five times smarter. So that is totally possible. That I have no
02:11:16.960 | objections to. - So like there's an S-curve type
02:11:20.880 | situation about getting smarter, and it's going to be like 3.7 times smarter than all of human civilization.
02:11:27.760 | - Right, just the problems we face in this world, each problem is like an IQ test. You need certain
02:11:32.640 | intelligence to solve it. So we just don't have more complex problems outside of mathematics
02:11:36.800 | for it to be showing off. Like you can have IQ of 500 if you're playing tic-tac-toe, it doesn't
02:11:42.640 | show, it doesn't matter. - So the idea there is that the problems
02:11:47.360 | define your capacity, your cognitive capacity. So because the problems on earth are not
02:11:53.760 | sufficiently difficult, it's not going to be able to expand this cognitive capacity.
02:11:59.200 | - Possible. - And because of that,
02:12:01.760 | wouldn't that be a good thing? - It still could be a lot smarter than us.
02:12:06.320 | And to dominate long-term, you just need some advantage. You have to be the smartest. You don't
02:12:11.760 | have to be a million times smarter. - So even 5X might be enough?
02:12:15.680 | - It'd be impressive. What is it, IQ of 1,000? I mean, I know those units don't mean anything
02:12:21.600 | at that scale, but still, as a comparison, the smartest human is like 200.
02:12:26.640 | - Well, actually, no, I didn't mean compared to an individual human, I meant compared to the
02:12:32.320 | collective intelligence of the human species. If you're somehow 5X smarter than that...
02:12:36.320 | - We are more productive as a group. I don't think we are more capable of solving individual
02:12:42.400 | problems. If all of humanity plays chess together, we are not a million times better
02:12:48.240 | than the world champion. - That's because
02:12:52.000 | chess is like one S-curve, but humanity is very good at exploring the full range of ideas.
02:13:02.240 | The more Einsteins you have, the more, just the higher probability you come up with general
02:13:06.480 | relativity. - But I feel like it's more of a
02:13:07.920 | quantity of superintelligence than quality of superintelligence.
02:13:10.000 | - Yeah, sure. But quantity and... - Enough quantity sometimes becomes
02:13:14.720 | quality, yeah. - Oh, man, humans.
02:13:18.720 | What do you think is the meaning of this whole thing? We've been talking about humans and
02:13:25.680 | humans not dying, but why are we here? - It's a simulation. We're being tested.
02:13:32.160 | The test is, will you be dumb enough to create superintelligence and release it?
02:13:35.600 | - So the objective function is not be dumb enough to kill ourselves.
02:13:41.600 | - Yeah, you're unsafe. Prove yourself to be a safe agent who doesn't do that,
02:13:45.680 | and you get to go to the next game. - The next level of the game? What's
02:13:49.520 | the next level? - I don't know.
02:13:50.720 | I haven't hacked the simulation yet. - Well, maybe hacking the simulation
02:13:54.720 | is the thing. - I'm working as fast as I can.
02:13:57.040 | - And physics would be the way to do that? - Quantum physics, yeah, definitely.
02:14:02.240 | - Well, I hope we do. And I hope whatever is outside is even more fun than this one,
02:14:06.640 | 'cause this one's pretty damn fun. And just a big thank you for doing the work you're doing.
02:14:12.880 | There's so much exciting development in AI, and to ground it in the existential risks is really,
02:14:21.520 | really important. Humans love to create stuff, and we should be careful not to destroy ourselves in
02:14:28.000 | the process. So thank you for doing that really important work.
02:14:31.600 | - Thank you so much for inviting me. It was amazing, and my dream is to be proven wrong.
02:14:37.600 | If everyone just picks up a paper or book and shows how I messed it up, that would be optimal.
02:14:44.400 | - But for now, the simulation continues. - For now.
02:14:47.680 | - Thank you, Roman. Thanks for listening to this conversation with Roman Yampolsky.
02:14:52.640 | To support this podcast, please check out our sponsors in the description.
02:14:56.320 | And now let me leave you with some words from Frank Herbert in Dune.
02:15:01.120 | I must not fear. Fear is the mind killer. Fear is the little death that brings total obliteration.
02:15:09.200 | I will face fear. I will permit it to pass over me and through me. And when it has gone past,
02:15:16.640 | I will turn the inner eye to see its path. Where the fear has gone, there will be nothing.
02:15:22.480 | Only I will remain. Thank you for listening, and hope to see you next time.
02:15:29.840 | [END]