
Juergen Schmidhuber: Godel Machines, Meta-Learning, and LSTMs | Lex Fridman Podcast #11


Chapters

0:00 Introduction
9:15 Traveling Salesman Problems
19:29 Does God Play Dice
27:53 The General Theory of Relativity
38:07 The Role of Creativity and Intelligence
47:29 What Is the Importance of Depth
73:06 What Superintelligent AIs Will Find Interesting: Understanding Life, Its History, and Others of Their Own Kind

Whisper Transcript

00:00:00.000 | The following is a conversation with Jürgen Schmidhuber.
00:00:03.520 | He's the co-director of the Swiss AI Lab IDSIA
00:00:06.360 | and a co-creator of long short-term memory networks.
00:00:10.400 | LSTMs are used in billions of devices today
00:00:13.720 | for speech recognition, translation, and much more.
00:00:17.400 | Over 30 years, he has proposed a lot of interesting
00:00:20.800 | out-of-the-box ideas on meta-learning,
00:00:23.440 | adversarial networks, computer vision,
00:00:26.000 | and even a formal theory of quote,
00:00:28.760 | creativity, curiosity, and fun.
00:00:32.360 | This conversation is part of the MIT course
00:00:34.920 | on artificial general intelligence
00:00:36.520 | and the Artificial Intelligence Podcast.
00:00:38.840 | If you enjoy it, subscribe on YouTube, iTunes,
00:00:41.960 | or simply connect with me on Twitter
00:00:43.960 | at Lex Fridman spelled F-R-I-D.
00:00:47.320 | And now, here's my conversation with Jürgen Schmidhuber.
00:00:51.520 | Early on, you dreamed of AI systems
00:00:55.640 | that self-improve recursively.
00:00:58.720 | When was that dream born?
00:01:00.240 | - When I was a baby?
00:01:02.920 | No, that's not true.
00:01:04.000 | When I was a teenager.
00:01:05.240 | - And what was the catalyst for that birth?
00:01:09.440 | What was the thing that first inspired you?
00:01:11.600 | - When I was a boy,
00:01:13.960 | I was thinking about what to do in my life
00:01:19.960 | and then I thought the most exciting thing
00:01:23.640 | is to solve the riddles of the universe.
00:01:27.960 | And that means you have to become a physicist.
00:01:30.720 | However, then I realized that there's something even grander
00:01:35.640 | you can try to build a machine
00:01:39.680 | that isn't really a machine any longer,
00:01:41.920 | that learns to become a much better physicist
00:01:44.320 | than I could ever hope to be.
00:01:46.880 | And that's how I thought maybe I can multiply
00:01:50.120 | my tiny little bit of creativity into infinity.
00:01:54.320 | - But ultimately that creativity will be multiplied
00:01:57.040 | to understand the universe around us.
00:01:59.120 | That's the curiosity for that mystery that drove you?
00:02:04.120 | - Yes, so if you can build a machine
00:02:08.320 | that learns to solve more and more complex problems
00:02:16.760 | and to become a more and more general problem solver,
00:02:16.760 | then you basically have solved all the problems,
00:02:21.760 | at least all the solvable problems.
00:02:25.960 | - So how do you think, what is the mechanism
00:02:28.120 | for that kind of general solver look like?
00:02:31.640 | Obviously we don't quite yet have one
00:02:34.840 | or know how to build one, but we have ideas
00:02:37.040 | and you have had throughout your career
00:02:39.120 | several ideas about it.
00:02:40.800 | So how do you think about that mechanism?
00:02:43.640 | - So in the 80s, I thought about how to build this machine
00:02:48.640 | that learns to solve all these problems
00:02:51.040 | that I cannot solve myself.
00:02:54.120 | And I thought it is clear it has to be a machine
00:02:57.160 | that not only learns to solve this problem here
00:03:00.880 | and this problem here, but it also has to learn
00:03:04.160 | to improve the learning algorithm itself.
00:03:08.080 | So it has to have the learning algorithm
00:03:12.480 | in a representation that allows it to inspect it
00:03:15.720 | and modify it such that it can come up
00:03:19.240 | with a better learning algorithm.
00:03:22.120 | So I call that meta-learning, learning to learn
00:03:25.720 | and recursive self-improvement.
00:03:28.080 | That is really the pinnacle of that,
00:03:29.880 | where you then not only learn how to improve
00:03:34.880 | on that problem and on that,
00:03:37.520 | but you also improve the way the machine improves
00:03:41.120 | and you also improve the way it improves
00:03:43.200 | the way it improves itself.
00:03:44.600 | And that was my 1987 diploma thesis,
00:03:48.600 | which was all about that hierarchy of meta-learners
00:03:53.240 | that have no computational limits
00:03:57.280 | except for the well-known limits
00:03:59.960 | that Gödel identified in 1931
00:04:03.240 | and for the limits of physics.
00:04:05.720 | - In the recent years, meta-learning has gained popularity
00:04:10.120 | in a specific kind of form.
00:04:12.840 | You've talked about how that's not really meta-learning
00:04:16.040 | with neural networks, that's more basic transfer learning.
00:04:21.040 | Can you talk about the difference
00:04:22.720 | between the big general meta-learning
00:04:25.480 | and a more narrow sense of meta-learning
00:04:27.960 | the way it's used today, the way it's talked about today?
00:04:30.880 | - Let's take the example of a deep neural network
00:04:33.440 | that has learned to classify images.
00:04:37.240 | And maybe you have trained that network
00:04:40.080 | on 100 different databases of images.
00:04:45.840 | And now a new database comes along
00:04:48.120 | and you want to quickly learn the new thing as well.
00:04:52.000 | So one simple way of doing that is you take the network,
00:04:57.720 | which already knows 100 types of databases,
00:05:02.440 | and then you would just take the top layer of that
00:05:06.320 | and you retrain that using the new label data
00:05:11.320 | that you have in the new image database.
00:05:14.720 | And then it turns out that it really, really quickly
00:05:17.360 | can learn that too, one shot basically,
00:05:20.600 | because from the first 100 datasets,
00:05:24.320 | it already has learned so much about computer vision
00:05:27.560 | that it can reuse that, and that is then almost good enough
00:05:31.880 | to solve the new task, except you need a little bit
00:05:34.240 | of adjustment on the top.
00:05:37.080 | So that is transfer learning, and it has been done
00:05:42.320 | in principle for many decades.
00:05:44.520 | People have done similar things for decades.
00:05:46.720 | Meta-learning, true meta-learning is about
00:05:51.080 | having the learning algorithm itself open to introspection
00:05:56.080 | by the system that is using it,
00:06:00.440 | and also open to modification,
00:06:04.800 | such that the learning system has an opportunity
00:06:07.880 | to modify any part of the learning algorithm,
00:06:12.120 | and then evaluate the consequences of that modification,
00:06:16.760 | and then learn from that to create
00:06:21.040 | a better learning algorithm, and so on recursively.
00:06:24.880 | So that's a very different animal,
00:06:28.560 | where you are opening the space
00:06:31.160 | of possible learning algorithms
00:06:33.560 | to the learning system itself.
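
To make the distinction above concrete, here is a minimal Python sketch, not from the conversation: the frozen network, the "candidate learning rules," and all numbers are illustrative stand-ins. Transfer learning reuses frozen features and retrains only the top layer with a fixed rule; the toy "meta-learner" is allowed to modify a piece of its own learning algorithm (here just the step size), evaluate the consequence, and keep whichever variant learns best. Real meta-learning in Schmidhuber's sense opens the whole algorithm to modification, not a single hyperparameter.

```python
# Hedged sketch contrasting transfer learning with (toy) meta-learning.
import numpy as np

rng = np.random.default_rng(0)

# Frozen features: stand-in for a network already trained on 100 image databases.
W_frozen = rng.standard_normal((8, 16))

def features(x):
    return np.tanh(x @ W_frozen)               # reused, never modified

def train_top_layer(x, y, lr, steps=200):
    """Transfer learning: only the top layer adapts; the learning rule is fixed."""
    w = np.zeros(16)
    f = features(x)
    for _ in range(steps):
        w -= lr * f.T @ (f @ w - y) / len(y)   # hand-designed gradient step
    return w

def meta_learn(x, y, candidate_rules=(0.01, 0.1, 0.5)):
    """Toy meta-learning: modify part of the learning algorithm (the step size),
    evaluate the consequences, and keep the variant that learns best."""
    best = None
    for lr in candidate_rules:
        w = train_top_layer(x, y, lr)
        loss = np.mean((features(x) @ w - y) ** 2)
        if best is None or loss < best[0]:
            best = (loss, lr)
    return best

x = rng.standard_normal((64, 8))
y = x[:, 0]                                    # toy target for the "new database"
print("best (loss, learning rule):", meta_learn(x, y))
```
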
00:06:35.520 | - Right, so you've, like in the 2004 paper,
00:06:39.040 | you described Gödel machines,
00:06:41.920 | programs that rewrite themselves, right?
00:06:44.480 | Philosophically, and even in your paper,
00:06:46.560 | mathematically, these are really compelling ideas,
00:06:49.960 | but practically, do you see these self-referential programs
00:06:54.960 | being successful in the near term,
00:06:58.360 | to having an impact where sort of it demonstrates
00:07:01.960 | to the world that this direction
00:07:04.760 | is a good one to pursue in the near term?
00:07:08.640 | - Yes, we had these two different types
00:07:11.320 | of fundamental research,
00:07:13.440 | how to build a universal problem solver,
00:07:15.800 | one basically exploiting proof search,
00:07:20.320 | and things like that, that you need to come up with
00:07:25.520 | asymptotically optimal, theoretically optimal,
00:07:30.280 | self-improvers and problem solvers.
00:07:33.200 | However, one has to admit that through this proof search,
00:07:39.160 | comes in an additive constant,
00:07:43.600 | an overhead, an additive overhead,
00:07:46.760 | that vanishes in comparison to what you have to do
00:07:51.760 | to solve large problems.
00:07:55.160 | However, for many of the small problems
00:07:58.000 | that we want to solve in our everyday life,
00:08:00.880 | we cannot ignore this constant overhead.
00:08:03.280 | And that's why we also have been doing other things,
00:08:08.120 | non-universal things, such as recurrent neural networks,
00:08:12.160 | which are trained by gradient descent,
00:08:15.400 | and local search techniques, which aren't universal at all,
00:08:18.680 | which aren't provably optimal at all,
00:08:21.280 | like the other stuff that we did,
00:08:22.880 | but which are much more practical,
00:08:25.400 | as long as we only want to solve the small problems
00:08:28.760 | that we are typically trying to solve
00:08:33.320 | in this environment here.
00:08:35.560 | So the universal problem solvers, like the Gödel machine,
00:08:38.920 | but also Marcus Hutter's fastest way
00:08:42.080 | of solving all possible problems,
00:08:44.360 | which he developed around 2002 in my lab,
00:08:48.080 | they are associated with these constant overheads
00:08:52.520 | for proof search, which guarantees that the thing
00:08:55.160 | that you're doing is optimal.
00:08:56.560 | For example, there is this fastest way
00:09:01.160 | of solving all problems with a computable solution,
00:09:05.280 | which is due to Marcus Hutter.
00:09:07.280 | And to explain what's going on there,
00:09:12.240 | let's take traveling salesman problems.
00:09:14.320 | With traveling salesman problems,
00:09:17.360 | you have a number of cities, N cities,
00:09:21.320 | and you try to find the shortest path
00:09:23.680 | through all these cities without visiting any city twice.
00:09:27.840 | And nobody knows the fastest way
00:09:32.320 | of solving traveling salesman problems, TSPs,
00:09:36.560 | but let's assume there is a method of solving them
00:09:41.760 | within N to the five operations,
00:09:45.920 | where N is the number of cities.
00:09:50.240 | Then the universal method of Marcus
00:09:54.600 | is going to solve the same traveling salesman problem
00:09:58.600 | also within N to the five steps,
00:10:02.160 | plus O of one, plus a constant number of steps
00:10:06.440 | that you need for the proof searcher,
00:10:09.280 | which you need to show that this particular class
00:10:14.120 | of problems, the traveling salesman problems,
00:10:17.240 | can be solved within a certain time bound
00:10:20.320 | within order N to the five steps, basically.
00:10:24.320 | And this additive constant doesn't care for N,
00:10:28.440 | which means as N is getting larger and larger,
00:10:32.320 | as you have more and more cities,
00:10:34.800 | the constant overhead pales in comparison.
00:10:38.640 | And that means that almost all large problems are solved
00:10:43.640 | in the best possible way.
00:10:45.600 | Already today, we already have a universal problem solver
00:10:49.120 | like that.
00:10:50.600 | However, it's not practical because the overhead,
00:10:54.640 | the constant overhead is so large
00:10:57.600 | that for the small kinds of problems
00:11:00.320 | that we want to solve in this little biosphere.
00:11:04.680 | - By the way, when you say small,
00:11:06.520 | you're talking about things that fall within the constraints
00:11:09.520 | of our computational systems.
00:11:11.000 | So they can seem quite large to us mere humans, right?
00:11:14.560 | - That's right, yeah.
00:11:15.480 | So they seem large and even unsolvable
00:11:19.120 | in a practical sense today,
00:11:21.120 | but they are still small compared to almost all problems
00:11:24.880 | because almost all problems are large problems,
00:11:28.600 | which are much larger than any constant.
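
A rough numerical illustration of the point about the additive overhead (all numbers here are made up): if the best known method costs n**5 steps and the universal solver adds a fixed proof-search cost C on top, the overhead dominates for small n and vanishes as n grows.

```python
# Hedged illustration: constant proof-search overhead vs. problem size.
C = 10**12                        # illustrative fixed proof-search overhead

for n in (10, 100, 1_000, 100_000):
    direct = n**5                 # hypothetical cost of the known TSP method
    universal = n**5 + C          # same method found via the universal solver
    print(f"n={n:>7}: n**5 = {direct:.2e}, "
          f"with overhead = {universal:.2e}, "
          f"overhead share = {C / universal:.2%}")
```
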
00:11:31.000 | - Do you find it useful as a person who has dreamed
00:11:36.120 | of creating a general learning system,
00:11:38.720 | has worked on creating one,
00:11:39.960 | has done a lot of interesting ideas there,
00:11:42.240 | to think about P versus NP,
00:11:46.400 | this formalization of how hard problems are,
00:11:50.840 | how they scale,
00:11:52.440 | this kind of worst case analysis type of thinking,
00:11:55.280 | do you find that useful?
00:11:56.880 | Or is it only just a mathematical,
00:11:59.740 | it's a set of mathematical techniques
00:12:02.680 | to give you intuition about what's good and bad.
00:12:05.800 | - So P versus NP, that's super interesting
00:12:09.520 | from a theoretical point of view.
00:12:11.840 | And in fact, as you are thinking about that problem,
00:12:14.600 | you can also get inspiration
00:12:17.320 | for better practical problem solvers.
00:12:21.320 | On the other hand, we have to admit that at the moment,
00:12:24.600 | the best practical problem solvers
00:12:28.400 | for all kinds of problems that we are now solving
00:12:31.120 | through what is called AI at the moment,
00:12:33.880 | they are not of the kind
00:12:36.280 | that is inspired by these questions.
00:12:38.840 | There we are using general purpose computers,
00:12:42.720 | such as recurrent neural networks,
00:12:44.840 | but we have a search technique,
00:12:46.720 | which is just local search, gradient descent,
00:12:50.320 | to try to find a program
00:12:51.960 | that is running on these recurrent networks,
00:12:54.420 | such that it can solve some interesting problems,
00:12:58.560 | such as speech recognition,
00:13:00.560 | or machine translation, and something like that.
00:13:03.120 | And there is very little theory
00:13:06.480 | behind the best solutions that we have at the moment
00:13:09.720 | that can do that.
00:13:10.800 | - Do you think that needs to change?
00:13:12.640 | Do you think that will change?
00:13:14.080 | Or can we go,
00:13:15.120 | can we create a general intelligence system
00:13:17.120 | without ever really proving that that system is intelligent
00:13:20.600 | in some kind of mathematical way,
00:13:22.560 | solving machine translation perfectly,
00:13:24.960 | or something like that,
00:13:26.300 | within some kind of syntactic definition of a language?
00:13:29.160 | Or can we just be super impressed
00:13:31.120 | by the thing working extremely well, and that's sufficient?
00:13:35.080 | - There's an old saying,
00:13:36.720 | and I don't know who brought it up first,
00:13:39.340 | which says, "There's nothing more practical
00:13:42.440 | than a good theory."
00:13:43.720 | (laughing)
00:13:45.960 | And a good theory of problem solving
00:13:50.400 | under limited resources,
00:13:54.320 | like here in this universe, or on this little planet,
00:13:57.040 | has to take into account these limited resources.
00:14:01.800 | And so probably there is lacking
00:14:04.960 | a theory which is related to what we already have,
00:14:09.960 | these asymptotically optimal problem solvers,
00:14:14.440 | which tells us what we need in addition to that
00:14:18.560 | to come up with a practically optimal problem solver.
00:14:21.760 | So I believe we will have something like that.
00:14:27.120 | And maybe just a few little tiny twists
00:14:29.720 | are necessary to change what we already have
00:14:34.280 | to come up with that as well.
00:14:36.320 | As long as we don't have that,
00:14:37.720 | we admit that we are taking suboptimal ways
00:14:42.560 | and are using recurrent neural networks and long short-term memory,
00:14:45.960 | for example equipped with local search techniques,
00:14:50.400 | and we are happy that it works better
00:14:53.520 | than any competing method,
00:14:55.520 | but that doesn't mean that we think we are done.
00:15:00.520 | - You've said that an AGI system
00:15:02.760 | will ultimately be a simple one.
00:15:05.080 | A general intelligence system
00:15:06.240 | will ultimately be a simple one.
00:15:08.040 | Maybe a pseudocode of a few lines
00:15:10.280 | would be able to describe it.
00:15:11.880 | Can you talk through your intuition behind this idea?
00:15:16.800 | Why you feel that at its core,
00:15:21.520 | intelligence is a simple algorithm?
00:15:25.560 | - Experience tells us that the stuff that works best
00:15:31.680 | is really simple.
00:15:33.120 | So the asymptotically optimal ways of solving problems,
00:15:37.640 | if you look at them,
00:15:38.800 | they're just a few lines of code, it's really true.
00:15:41.800 | Although they have these amazing properties,
00:15:44.000 | just a few lines of code.
00:15:45.800 | Then the most promising and most useful practical things
00:15:51.600 | maybe don't have this proof of optimality
00:15:56.360 | associated with them.
00:15:57.800 | However, they are also just a few lines of code.
00:16:00.880 | The most successful recurrent neural networks,
00:16:05.080 | you can write them down in five lines of pseudocode.
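
As a rough illustration of that claim (this is a plain vanilla RNN, not the LSTM itself; the shapes and random weights are arbitrary), the core update of a recurrent network really does fit in a handful of lines:

```python
# Hedged sketch: the core of a vanilla recurrent network in a few lines.
import numpy as np

def rnn_forward(xs, W_in, W_rec, W_out):
    h = np.zeros(W_rec.shape[0])            # hidden state carried through time
    ys = []
    for x in xs:                            # one step per element of the sequence
        h = np.tanh(W_in @ x + W_rec @ h)   # recurrent state update
        ys.append(W_out @ h)                # readout at every time step
    return ys

rng = np.random.default_rng(0)
xs = rng.standard_normal((5, 3))            # a sequence of 5 inputs of size 3
outs = rnn_forward(xs,
                   rng.standard_normal((4, 3)),   # W_in
                   rng.standard_normal((4, 4)),   # W_rec
                   rng.standard_normal((2, 4)))   # W_out
print(outs[-1])
```
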
00:16:08.520 | - That's a beautiful, almost poetic idea,
00:16:10.920 | but what you're describing there
00:16:15.640 | is the lines of pseudocode are sitting on top of layers
00:16:19.240 | and layers of abstractions in a sense.
00:16:21.440 | So you're saying at the very top,
00:16:25.040 | it'll be a beautifully written sort of algorithm,
00:16:29.040 | but do you think that there's many layers of abstractions
00:16:34.000 | we have to first learn to construct?
00:16:36.880 | - Yeah, of course, we are building on all these
00:16:40.400 | great abstractions that people have invented
00:16:44.000 | over the millennia, such as matrix multiplications.
00:16:49.760 | And real numbers and basic arithmetics and calculus
00:16:54.760 | and derivations of error functions
00:17:01.240 | and derivatives of error functions and stuff like that.
00:17:04.280 | So without that language that greatly simplifies
00:17:10.440 | our way of thinking about these problems,
00:17:13.840 | we couldn't do anything.
00:17:14.800 | So in that sense, as always,
00:17:16.560 | we are standing on the shoulders of the giants
00:17:19.600 | who in the past simplified the problem
00:17:24.240 | of problem solving so much
00:17:26.360 | that now we have a chance to do the final step.
00:17:29.960 | - So the final step will be a simple one.
00:17:32.120 | If we take a step back through all of human civilization
00:17:36.720 | and just the universe in general,
00:17:38.360 | how do you think about evolution
00:17:41.440 | and what if creating a universe is required
00:17:45.360 | to achieve this final step?
00:17:47.280 | What if going through the very painful
00:17:50.880 | and inefficient process of evolution is needed
00:17:53.800 | to come up with this set of abstractions
00:17:55.840 | that ultimately lead to intelligence?
00:17:57.760 | Do you think there's a shortcut
00:18:00.760 | or do you think we have to create
00:18:02.400 | something like our universe
00:18:04.600 | in order to create something like human level intelligence?
00:18:07.720 | - So far, the only example we have is this one,
00:18:13.120 | this universe in which we are living.
00:18:15.400 | - Do you think you can do better?
00:18:16.560 | - Maybe not, but we are part of this whole process.
00:18:25.040 | So apparently, so it might be the case
00:18:30.000 | that the code that runs the universe is really, really simple.
00:18:33.680 | Everything points to that possibility
00:18:36.680 | because gravity and other basic forces
00:18:39.960 | are really simple laws that can be easily described
00:18:44.120 | also in just a few lines of code basically.
00:18:46.480 | And then there are these other events
00:18:52.200 | that the apparently random events
00:18:55.080 | in the history of the universe,
00:18:56.560 | which as far as we know at the moment
00:18:58.800 | don't have a compact code, but who knows,
00:19:01.360 | maybe somebody in the near future
00:19:03.240 | is going to figure out the pseudo random generator
00:19:06.840 | which is computing whether the measurement
00:19:11.840 | of that spin up or down thing here
00:19:16.000 | is going to be positive or negative.
00:19:18.520 | - Underlying quantum mechanics.
00:19:20.000 | - Yes.
00:19:20.840 | - So you ultimately think quantum mechanics
00:19:23.240 | is a pseudo random number generator,
00:19:25.280 | so it's all deterministic.
00:19:27.000 | There's no randomness in our universe.
00:19:28.880 | Does God play dice?
00:19:31.880 | - So a couple of years ago, a famous physicist,
00:19:37.080 | quantum physicist, Anton Zeilinger,
00:19:39.520 | he wrote an essay in "Nature"
00:19:42.160 | and it started more or less like that.
00:19:44.840 | One of the fundamental insights of the 20th century
00:19:52.280 | was that the universe is fundamentally random
00:19:58.880 | on the quantum level.
00:20:02.920 | And that whenever you measure spin up or down
00:20:06.800 | or something like that,
00:20:08.280 | a new bit of information enters the history of the universe.
00:20:12.320 | And while I was reading that,
00:20:16.000 | I was already typing the response
00:20:19.320 | and they had to publish it because I was right.
00:20:21.720 | That there is no evidence, no physical evidence for that.
00:20:27.880 | So there's an alternative explanation
00:20:30.440 | where everything that we consider random
00:20:33.480 | is actually pseudo random,
00:20:35.760 | such as the decimal expansion of pi, 3.141 and so on,
00:20:40.760 | which looks random, but isn't.
00:20:44.360 | So pi is interesting because every three digit sequence,
00:20:49.360 | every sequence of three digits
00:20:52.240 | appears roughly one in a thousand times.
00:20:56.000 | And every five digit sequence
00:20:59.960 | appears roughly one in 10,000 times.
00:21:03.480 | What you would expect if it was random.
00:21:07.040 | But there's a very short algorithm,
00:21:09.320 | a short program that computes all of that.
00:21:11.680 | So it's extremely compressible.
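
A small illustration of that compressibility claim, using Machin's formula rather than anything discussed in the conversation: a few lines of integer arithmetic generate as many digits of pi as you like, and the three-digit statistics come out roughly uniform, as described.

```python
# Hedged sketch: a very short program that produces the digits of pi,
# plus a check that 3-digit patterns appear roughly one in a thousand times.
from collections import Counter

def arctan_inv(x, digits):
    """arctan(1/x), scaled to an integer with `digits` digits plus guard digits."""
    scale = 10 ** (digits + 10)
    term = scale // x
    total, n, sign = term, 1, 1
    while term:
        term //= x * x
        n += 2
        sign = -sign
        total += sign * term // n
    return total

def pi_digits(digits):
    """First `digits` decimal digits of pi via Machin's formula."""
    pi_scaled = 4 * (4 * arctan_inv(5, digits) - arctan_inv(239, digits))
    return str(pi_scaled // 10 ** 10)          # "314159..."

d = pi_digits(10_000)[1:]                      # digits after the leading 3
counts = Counter(d[i:i + 3] for i in range(len(d) - 2))
print("distinct 3-digit patterns:", len(counts))
print("average occurrences per pattern:", sum(counts.values()) / 1000)
```
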
00:21:13.880 | And who knows, maybe tomorrow somebody,
00:21:15.840 | some grad student at CERN
00:21:17.560 | goes back over all these data points,
00:21:20.960 | beta decay and whatever,
00:21:22.600 | and figures out, oh, it's the second billion digits of pi
00:21:27.280 | or something like that.
00:21:28.520 | We don't have any fundamental reason at the moment
00:21:31.520 | to believe that this is truly random
00:21:36.120 | and not just a deterministic video game.
00:21:39.080 | If it was a deterministic video game,
00:21:41.120 | it would be much more beautiful
00:21:43.120 | because beauty is simplicity.
00:21:46.680 | And many of the basic laws of the universe,
00:21:49.840 | like gravity and the other basic forces are very simple.
00:21:54.120 | So very short programs can explain what these are doing.
00:21:58.080 | And it would be awful and ugly.
00:22:03.360 | The universe would be ugly.
00:22:04.560 | The history of the universe would be ugly
00:22:06.760 | if for the extra things,
00:22:08.160 | the seemingly random data points that we get all the time,
00:22:12.880 | that we really need a huge number of extra bits
00:22:17.920 | to describe all these extra bits of information.
00:22:22.920 | So as long as we don't have evidence
00:22:27.400 | that there is no short program
00:22:29.760 | that computes the entire history of the entire universe,
00:22:34.000 | we are, as scientists, compelled to look further
00:22:40.360 | for that shortest program.
00:22:42.760 | - Your intuition says there exists a program
00:22:47.800 | that can backtrack to the creation of the universe.
00:22:52.000 | So compute the shortest path to the creation of the universe.
00:22:54.280 | - Yes, including all the entanglement things
00:22:58.440 | and all the spin up and down measurements
00:23:02.480 | that have been taken place
00:23:05.760 | since 13.8 billion years ago.
00:23:10.640 | And so, yeah.
00:23:11.840 | So we don't have a proof that it is random.
00:23:15.720 | We don't have a proof that it is compressible
00:23:19.600 | to a short program.
00:23:20.800 | But as long as we don't have that proof,
00:23:22.400 | we are obliged as scientists
00:23:25.000 | to keep looking for that simple explanation.
00:23:27.600 | - Absolutely.
00:23:28.440 | So you said simplicity is beautiful or beauty is simple.
00:23:31.600 | Either one works.
00:23:33.280 | But you also work on curiosity, discovery,
00:23:37.080 | the romantic notion of randomness, of serendipity,
00:23:42.840 | of being surprised by things that are about you,
00:23:47.840 | kind of in our poetic notion of reality,
00:23:53.440 | we think as humans require randomness.
00:23:56.440 | So you don't find randomness beautiful.
00:23:59.040 | You find simple determinism beautiful.
00:24:04.040 | - Yeah.
00:24:05.720 | - Okay.
00:24:07.720 | - So why?
00:24:08.600 | - Why?
00:24:09.440 | Because the explanation becomes shorter.
00:24:13.080 | A universe that is compressible to a short program
00:24:18.080 | is much more elegant and much more beautiful
00:24:22.920 | than another one, which needs an almost infinite number
00:24:26.280 | of bits to be described.
00:24:27.880 | As far as we know,
00:24:30.160 | many things that are happening in this universe
00:24:33.960 | are really simple in terms of short programs
00:24:37.520 | that compute gravity and the interaction
00:24:41.160 | between elementary particles and so on.
00:24:43.600 | So all of that seems to be very, very simple.
00:24:45.800 | Every electron seems to reuse the same sub-program
00:24:50.240 | all the time as it is interacting
00:24:52.280 | with other elementary particles.
00:24:54.600 | If we now require an extra oracle
00:25:04.440 | injecting new bits of information all the time
00:25:07.800 | for these extra things which are currently not understood,
00:25:11.960 | such as beta decay,
00:25:16.040 | then the whole description length
00:25:23.640 | of the data that we can observe of the history
00:25:26.560 | of the universe would become much longer
00:25:31.560 | and therefore uglier.
00:25:33.720 | - And uglier.
00:25:34.920 | Again, simplicity is elegant and beautiful.
00:25:38.000 | - The history of science is a history
00:25:40.840 | of compression progress.
00:25:42.720 | - Yeah, so you've described sort of
00:25:46.960 | as we build up abstractions
00:25:48.680 | and you've talked about the idea of compression.
00:25:52.120 | How do you see this, the history of science,
00:25:55.560 | the history of humanity, our civilization,
00:25:58.600 | and life on Earth as some kind of path
00:26:02.000 | towards greater and greater compression?
00:26:04.120 | What do you mean by that?
00:26:05.040 | How do you think about that?
00:26:06.960 | - Indeed, the history of science
00:26:09.240 | is a history of compression progress.
00:26:13.080 | What does that mean?
00:26:14.640 | Hundreds of years ago, there was an astronomer
00:26:18.000 | whose name was Kepler,
00:26:20.080 | and he looked at the data points that he got
00:26:22.560 | by watching planets move,
00:26:25.800 | and then he had all these data points,
00:26:27.560 | and suddenly it turned out
00:26:28.760 | that he can greatly compress the data
00:26:31.920 | by predicting it through an ellipse law.
00:26:36.920 | So it turns out that all these data points
00:26:39.840 | are more or less on ellipses around the sun.
00:26:43.680 | And another guy came along whose name was Newton,
00:26:49.600 | and before him Hooke,
00:26:51.440 | and they said the same thing
00:26:53.680 | that is making these planets move like that
00:26:58.480 | is what makes the apples fall down.
00:27:01.880 | And it also holds for stones
00:27:05.520 | and for all kinds of other objects.
00:27:09.800 | And suddenly, many, many of these observations
00:27:15.080 | became much more compressible,
00:27:16.960 | because as long as you can predict the next thing,
00:27:19.960 | given what you have seen so far,
00:27:21.720 | you can compress it,
00:27:22.640 | but you don't have to store that data extra.
00:27:25.320 | This is called predictive coding.
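
A toy version of predictive coding (the "falling apples" numbers below are simulated, not real data): if a simple predictor explains each new observation from the previous ones, only the small residuals need to be stored.

```python
# Hedged toy example of predictive coding: store only what the predictor
# cannot explain.
import numpy as np

t = np.arange(20, dtype=float)
positions = 0.5 * 9.81 * t ** 2          # "video of falling apples", one value per frame

# Predict each frame by linear extrapolation from the previous two frames
# and keep only the residual, the part the predictor could not explain.
predicted = 2 * positions[1:-1] - positions[:-2]
residuals = positions[2:] - predicted

print("largest raw value:", positions.max())
print("largest residual: ", np.abs(residuals).max())
# The residuals are all the same constant (9.81), so the whole sequence
# compresses to a simple law plus the first two frames.
```
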
00:27:29.240 | And then there was still something wrong
00:27:31.400 | with that theory of the universe,
00:27:33.680 | and you had deviations from these predictions of the theory.
00:27:37.600 | And 300 years later, another guy came along
00:27:40.240 | whose name was Einstein,
00:27:41.880 | and he was able to explain away all these deviations
00:27:46.720 | from the predictions of the old theory
00:27:51.160 | through a new theory,
00:27:52.920 | which was called the general theory of relativity,
00:27:56.960 | which at first glance looks a little bit more complicated,
00:28:00.760 | and you have to warp space and time,
00:28:02.760 | but you can phrase it within one single sentence,
00:28:05.600 | which is no matter how fast you accelerate
00:28:09.000 | and how fast or hard you decelerate,
00:28:12.920 | and no matter what is the gravity in your local framework,
00:28:17.920 | light speed always looks the same.
00:28:21.360 | And from that, you can calculate all the consequences.
00:28:24.280 | So it's a very simple thing,
00:28:25.760 | and it allows you to further compress all the observations
00:28:30.360 | because certainly there are hardly any deviations any longer
00:28:35.360 | that you can measure from the predictions
00:28:37.800 | of this new theory.
00:28:40.080 | So all of science is a history of compression progress.
00:28:44.560 | You never arrive immediately
00:28:47.000 | at the shortest explanation of the data,
00:28:50.800 | but you're making progress.
00:28:52.560 | Whenever you are making progress,
00:28:54.360 | you have an insight.
00:28:56.320 | You see, oh, first I needed so many bits of information
00:28:59.520 | to describe the data, to describe my falling apples,
00:29:02.840 | my video of falling apples.
00:29:04.200 | I need so many data.
00:29:05.800 | So many pixels have to be stored.
00:29:08.200 | But then suddenly I realized, no,
00:29:10.120 | there is a very simple way of predicting the third frame
00:29:13.920 | in the video from the first two.
00:29:16.080 | And maybe not every little detail can be predicted,
00:29:20.120 | but more or less, most of these orange blobs
00:29:22.680 | that are coming down, they accelerate in the same way,
00:29:25.880 | which means that I can greatly compress the video.
00:29:28.640 | And the amount of compression progress,
00:29:33.320 | that is the depth of the insight
00:29:35.680 | that you have at that moment.
00:29:37.360 | That's the fun that you have, the scientific fun,
00:29:40.280 | the fun in that discovery.
00:29:43.040 | And we can build artificial systems that do the same thing.
00:29:46.080 | They measure the depth of their insights
00:29:49.080 | as they are looking at the data,
00:29:50.800 | which is coming in through their own experiments.
00:29:53.520 | And we give them a reward, an intrinsic reward,
00:29:57.120 | in proportion to this depth of insight.
00:30:01.200 | And since they are trying to maximize the rewards they get,
00:30:08.160 | they are suddenly motivated to come up with new action
00:30:11.160 | sequences, with new experiments that have the property
00:30:15.760 | that the data that is coming in as a consequence
00:30:18.800 | of these experiments has the property
00:30:21.800 | that they can learn something about,
00:30:24.040 | see a pattern in there which they hadn't seen yet before.
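
A minimal sketch of that intrinsic-reward idea, with an illustrative linear world model standing in for the predictor: the curiosity reward is the drop in prediction error after learning on newly gathered data, and it shrinks as the regularity becomes familiar.

```python
# Hedged sketch of curiosity as compression/prediction progress.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                                   # tiny linear world model

def prediction_error(w, x, y):
    return float(np.mean((x @ w - y) ** 2))

def learn(w, x, y, lr=0.05, steps=50):
    for _ in range(steps):
        w = w - lr * x.T @ (x @ w - y) / len(y)
    return w

for episode in range(5):
    x = rng.standard_normal((32, 3))              # data from a new "experiment"
    y = x @ np.array([1.0, -2.0, 0.5])            # hidden regularity to discover
    before = prediction_error(w, x, y)
    w = learn(w, x, y)
    after = prediction_error(w, x, y)
    intrinsic_reward = before - after             # "depth of the insight"
    print(f"episode {episode}: intrinsic reward = {intrinsic_reward:.3f}")
# The reward shrinks over episodes: once the regularity is understood,
# it is no longer interesting.
```
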
00:30:28.760 | - So there's an idea of power play that you've described,
00:30:32.080 | a training in general problem solver in this kind of way
00:30:35.320 | of looking for the unsolved problems.
00:30:37.360 | - Yeah.
00:30:38.200 | - Can you describe that idea a little further?
00:30:40.440 | - It's another very simple idea.
00:30:42.440 | So normally what you do in computer science,
00:30:44.840 | you have some guy who gives you a problem,
00:30:50.280 | and then there is a huge search space of potential solution
00:30:55.920 | candidates.
00:30:56.880 | And you somehow try them out, and you
00:31:00.680 | have more or less sophisticated ways
00:31:02.680 | of moving around in that search space
00:31:06.960 | until you finally found a solution which
00:31:10.320 | you consider satisfactory.
00:31:13.040 | That's what most of computer science is about.
00:31:15.840 | Power play just goes one little step further and says,
00:31:20.000 | let's not only search for solutions to a given problem,
00:31:24.640 | but let's search to pairs of problems and their solutions
00:31:30.600 | where the system itself has the opportunity
00:31:33.160 | to phrase its own problem.
00:31:37.320 | So we are looking suddenly at pairs of problems
00:31:41.000 | and their solutions or modifications of the problem
00:31:45.480 | solver that is supposed to generate
00:31:47.640 | a solution to that new problem.
00:31:51.000 | And this additional degree of freedom
00:31:56.800 | allows us to build curious systems that
00:32:00.440 | are like scientists in the sense that they not only try
00:32:04.360 | to solve and try to find answers to existing questions,
00:32:08.200 | no, they are also free to pose their own questions.
00:32:13.320 | So if you want to build an artificial scientist,
00:32:15.360 | you have to give it that freedom,
00:32:16.840 | and power play is exactly doing that.
00:32:19.520 | So that's a dimension of freedom that's important to have.
00:32:23.000 | But how hard do you think that--
00:32:27.360 | how multidimensional and difficult
00:32:30.800 | the space of then coming up with your own questions is?
00:32:34.240 | So it's one of the things that as human beings
00:32:36.920 | we consider to be the thing that makes us special,
00:32:39.880 | the intelligence that makes us special,
00:32:42.200 | is that brilliant insight that can create something totally new.
00:32:48.800 | So now let's look at the extreme case.
00:32:51.280 | Let's look at the set of all possible problems
00:32:55.560 | that you can formally describe, which is infinite, which
00:33:01.160 | should be the next problem that a scientist or a power play
00:33:06.520 | is going to solve.
00:33:08.120 | Well, it should be the easiest problem that goes
00:33:14.480 | beyond what you already know.
00:33:17.560 | So it should be the simplest problem
00:33:20.720 | that the current problem solver that you have,
00:33:23.040 | which can already solve 100 problems,
00:33:26.600 | that he cannot solve yet by just generalizing.
00:33:31.080 | So it has to be new.
00:33:32.400 | So it has to require a modification of the problem
00:33:35.120 | solver such that the new problem solver can
00:33:37.400 | solve this new thing, but the old problem solver cannot do it.
00:33:41.560 | And in addition to that, we have to make sure
00:33:45.760 | that the problem solver doesn't forget
00:33:48.720 | any of the previous solutions.
00:33:50.200 | Right.
00:33:51.200 | And so by definition, power play is now
00:33:53.720 | trying always to search in this pair of--
00:33:57.480 | in the set of pairs of problems and problem solver
00:34:01.600 | modifications for a combination that minimize the time
00:34:06.600 | to achieve these criteria.
00:34:08.160 | So it's always trying to find the problem which is easiest
00:34:12.160 | to add to the repertoire.
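
A deliberately simplified schematic of that loop (the problems, the solver, and the costs are all toy stand-ins, not the actual PowerPlay formulation): search over pairs of a new problem and a solver modification, and accept the cheapest pair that solves something new without breaking anything already in the repertoire.

```python
# Hedged, highly simplified schematic of a PowerPlay-style step.
# "Problems" are target integers; the "solver" can handle any target up to
# its skill level; a "modification" raises the skill at a cost.
def solves(skill, problem):
    return problem <= skill

def power_play_step(skill, repertoire, candidate_problems):
    best = None
    for problem in candidate_problems:
        if solves(skill, problem):
            continue                              # not new: already solvable
        new_skill = problem                       # minimal modification that works
        cost = new_skill - skill                  # search/training effort
        keeps_old = all(solves(new_skill, p) for p in repertoire)
        if keeps_old and (best is None or cost < best[0]):
            best = (cost, problem, new_skill)     # easiest problem to add
    return best

skill, repertoire = 3, [1, 2, 3]
for _ in range(4):
    cost, problem, skill = power_play_step(skill, repertoire, range(1, 20))
    repertoire.append(problem)
    print(f"invented and solved problem {problem} (cost {cost})")
```
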
00:34:14.720 | So just like grad students and academics and researchers
00:34:18.880 | can spend their whole career in a local minima,
00:34:22.200 | stuck trying to come up with interesting questions,
00:34:25.860 | but ultimately doing very little,
00:34:27.600 | do you think it's easy in this approach of looking
00:34:31.880 | for the simplest unsolvable problem
00:34:33.760 | to get stuck in a local minima?
00:34:35.520 | Is never really discovering new--
00:34:40.560 | really jumping outside of the 100 problems
00:34:42.600 | that you've already solved in a genuine creative way?
00:34:47.600 | No, because that's the nature of power play,
00:34:49.920 | that it's always trying to break its current generalization
00:34:53.960 | abilities by coming up with a new problem which
00:34:58.120 | is beyond the current horizon.
00:35:00.920 | Just shifting the horizon of knowledge a little bit
00:35:04.440 | out there, breaking the existing rules,
00:35:08.040 | such that the new thing becomes solvable,
00:35:10.960 | but wasn't solvable by the old thing.
00:35:13.280 | So like adding a new axiom, like what
00:35:16.400 | Gödel did when he came up with these new sentences,
00:35:20.640 | new theorems that didn't have a proof in the formal system,
00:35:23.840 | which means you can add them to the repertoire,
00:35:27.680 | hoping that they are not going to damage the consistency
00:35:33.440 | of the whole thing.
00:35:35.880 | So in the paper with the amazing title,
00:35:39.640 | Formal Theory of Creativity, Fun and Intrinsic Motivation,
00:35:44.320 | you talk about discovery as intrinsic reward.
00:35:47.720 | So if you view humans as intelligent agents,
00:35:51.640 | what do you think is the purpose and meaning of life
00:35:54.880 | for us humans?
00:35:56.880 | You've talked about this discovery.
00:35:58.800 | Do you see humans as an instance of power play, agents?
00:36:04.200 | Yeah, so humans are curious, and that
00:36:09.160 | means they behave like scientists,
00:36:11.520 | not only the official scientists,
00:36:13.160 | but even the babies behave like scientists,
00:36:15.800 | and they play around with their toys
00:36:18.280 | to figure out how the world works
00:36:20.040 | and how it is responding to their actions.
00:36:23.520 | And that's how they learn about gravity and everything.
00:36:27.320 | And yeah, in 1990, we had the first systems like that,
00:36:30.960 | which just tried to play around with the environment
00:36:34.200 | and come up with situations that go beyond what
00:36:38.280 | they knew at that time, and then get
00:36:40.720 | a reward for creating these situations,
00:36:42.720 | and then becoming more general problem solvers
00:36:45.800 | and being able to understand more of the world.
00:36:48.960 | So yeah, I think in principle, that curiosity strategy
00:36:59.920 | or more sophisticated versions of what I just described,
00:37:03.240 | they are what we have built in as well,
00:37:06.480 | because evolution discovered that's a good way of exploring
00:37:10.200 | the unknown world.
00:37:11.440 | And a guy who explores the unknown world
00:37:13.440 | has a higher chance of solving problems
00:37:17.000 | that he needs to survive in this world.
00:37:19.480 | On the other hand, those guys who were too curious,
00:37:23.760 | they were weeded out as well.
00:37:25.360 | So you have to find this trade-off.
00:37:27.200 | Evolution found a certain trade-off.
00:37:29.240 | Apparently, in our society, there
00:37:31.960 | is a certain percentage of extremely explorative guys.
00:37:36.120 | And it doesn't matter if they die,
00:37:38.040 | because many of the others are more conservative.
00:37:42.280 | And so yeah, it would be surprising to me
00:37:46.480 | if that principle of artificial curiosity
00:37:54.440 | wouldn't be present in almost exactly the same form here.
00:37:58.400 | In our brains.
00:37:59.840 | So you're a bit of a musician and an artist.
00:38:03.080 | So continuing on this topic of creativity,
00:38:07.600 | what do you think is the role of creativity in intelligence?
00:38:10.440 | So you've kind of implied that it's
00:38:12.360 | essential for intelligence, if you think of intelligence
00:38:17.520 | as a problem-solving system, as ability to solve problems.
00:38:23.280 | But do you think it's essential, this idea of creativity?
00:38:28.920 | We never have a program, a sub-program,
00:38:31.760 | that is called creativity or something.
00:38:34.400 | It's just a side effect of what our problem solvers do.
00:38:37.960 | They are searching a space of problems,
00:38:40.200 | or a space of candidates, of solution candidates,
00:38:44.600 | until they hopefully find a solution to a given problem.
00:38:48.160 | But then there are these two types of creativity.
00:38:50.520 | And both of them are now present in our machines.
00:38:54.280 | The first one has been around for a long time,
00:38:56.480 | which is human gives problem to machine,
00:38:59.640 | machine tries to find a solution to that.
00:39:03.360 | And this has been happening for many decades.
00:39:05.920 | And for many decades, machines have
00:39:07.400 | found creative solutions to interesting problems,
00:39:11.400 | where humans were not aware of these particularly
00:39:16.560 | creative solutions, but then appreciated
00:39:18.920 | that the machine found that.
00:39:21.760 | The second is the pure creativity.
00:39:23.760 | That I would call, what I just mentioned,
00:39:25.640 | I would call the applied creativity,
00:39:29.040 | like applied art, where somebody tells you,
00:39:31.640 | now make a nice picture of this pope,
00:39:35.160 | and you will get money for that.
00:39:37.240 | So here is the artist, and he makes a convincing picture
00:39:41.240 | of the pope, and the pope likes it and gives him the money.
00:39:45.880 | And then there is the pure creativity,
00:39:48.720 | which is more like the power play
00:39:50.440 | and the artificial curiosity thing, where
00:39:52.760 | you have the freedom to select your own problem,
00:39:57.040 | like a scientist who defines his own question to study.
00:40:03.400 | And so that is the pure creativity, if you will,
00:40:07.960 | as opposed to the applied creativity,
00:40:11.400 | which serves another.
00:40:14.360 | - And in that distinction, there's
00:40:15.720 | almost echoes of narrow AI versus general AI.
00:40:19.160 | So this kind of constrained painting of a pope
00:40:22.720 | seems like the approaches of what people are calling
00:40:28.440 | narrow AI.
00:40:29.760 | And pure creativity seems to be--
00:40:33.360 | maybe I'm just biased as a human,
00:40:35.000 | but it seems to be an essential element
00:40:38.440 | of human-level intelligence.
00:40:41.120 | Is that what you're implying, to a degree?
00:40:46.040 | - If you zoom back a little bit, and you just
00:40:48.520 | look at a general problem-solving machine, which
00:40:51.480 | is trying to solve arbitrary problems,
00:40:53.600 | then this machine will figure out
00:40:56.320 | in the course of solving problems
00:40:58.240 | that it's good to be curious.
00:41:00.160 | So all of what I said just now about this pre-wired curiosity
00:41:05.320 | and this will to invent new problems that the system
00:41:09.040 | doesn't know how to solve yet should be just a byproduct
00:41:13.280 | of the general search.
00:41:15.040 | However, apparently, evolution has built it into us,
00:41:21.840 | because it turned out to be so successful,
00:41:25.080 | pre-wiring, a bias, a very successful exploratory bias
00:41:30.440 | that we are born with.
00:41:33.880 | - And you've also said that consciousness
00:41:35.680 | in the same kind of way may be a byproduct of problem solving.
00:41:41.200 | Do you think-- do you find this an interesting byproduct?
00:41:44.720 | Do you think it's a useful byproduct?
00:41:47.040 | What are your thoughts on consciousness in general?
00:41:49.320 | Or is it simply a byproduct of greater and greater
00:41:53.120 | capabilities of problem solving that's similar to creativity
00:41:59.160 | in that sense?
00:42:00.920 | - Yeah, we never have a procedure called consciousness
00:42:04.200 | in our machines.
00:42:05.320 | However, we get as side effects of what
00:42:08.720 | these machines are doing things that
00:42:11.880 | seem to be closely related to what people call consciousness.
00:42:16.720 | So for example, already in 1990, we
00:42:19.880 | had simple systems, which were basically recurrent networks,
00:42:24.160 | and therefore universal computers,
00:42:26.200 | trying to map incoming data into actions that lead to success.
00:42:33.960 | Maximizing reward in a given environment,
00:42:36.720 | always finding the charging station in time
00:42:40.400 | whenever the battery is low and negative signals are coming
00:42:42.720 | from the battery, always find the charging station in time
00:42:47.240 | without bumping against painful obstacles on the way.
00:42:50.520 | So complicated things, but very easily motivated.
00:42:54.720 | And then we give these little guys
00:42:59.200 | a separate recurrent neural network, which
00:43:01.920 | is just predicting what's happening
00:43:03.680 | if I do that and that.
00:43:04.800 | What will happen as a consequence of these actions
00:43:08.240 | that I'm executing?
00:43:09.360 | And it's just trained on the long and long history
00:43:11.600 | of interactions with the world.
00:43:14.040 | So it becomes a predictive model of the world, basically.
00:43:18.120 | And therefore, also a compressor of the observations
00:43:22.720 | of the world, because whatever you can predict,
00:43:25.280 | you don't have to store extra.
00:43:26.560 | So compression is a side effect of prediction.
00:43:30.600 | And how does this recurrent network compress?
00:43:33.240 | Well, it's inventing little subprograms,
00:43:35.720 | little subnetworks that stand for everything that frequently
00:43:39.960 | appears in the environment, like bottles and microphones
00:43:43.840 | and faces, maybe lots of faces in my environment.
00:43:48.120 | So I'm learning to create something like a prototype
00:43:51.000 | face, and a new face comes along,
00:43:52.920 | and all I have to encode are the deviations from the prototype.
00:43:56.360 | So it's compressing all the time the stuff that frequently
00:43:59.320 | appears.
00:44:00.880 | There's one thing that appears all the time, that
00:44:05.200 | is present all the time when the agent is interacting
00:44:08.000 | with its environment, which is the agent itself.
00:44:11.840 | So just for data compression reasons,
00:44:14.520 | it is extremely natural for this recurrent network
00:44:18.640 | to come up with little subnetworks that
00:44:21.080 | stand for the properties of the agents, the hand,
00:44:26.160 | the other actuators, and all the stuff
00:44:29.080 | that you need to better encode the data which is influenced
00:44:32.560 | by the actions of the agent.
00:44:34.360 | So there, just as a side effect of data compression
00:44:39.040 | during problem solving, you have internal self-models.
00:44:45.800 | Now, you can use this model of the world to plan your future,
00:44:51.360 | and that's what you also have done since 1990.
00:44:54.040 | So the recurrent network, which is the controller, which
00:44:57.880 | is trying to maximize reward, can
00:45:00.040 | use this model of the network--
00:45:01.840 | of the world, this model network of the world,
00:45:04.160 | this predictive model of the world,
00:45:05.620 | to plan ahead and say, let's not do this action sequence.
00:45:09.060 | Let's do this action sequence instead,
00:45:11.340 | because it leads to more predicted reward.
00:45:14.580 | And whenever it's waking up these little subnetworks that
00:45:18.900 | stand for itself, then it's thinking about itself.
00:45:22.220 | Then it's thinking about itself, and it's exploring mentally
00:45:27.700 | the consequences of its own actions.
00:45:30.940 | And now you tell me what is still missing.
00:45:37.220 | Missing the next-- the gap to consciousness.
00:45:39.660 | Yeah.
00:45:40.380 | There isn't.
00:45:41.060 | That's a really beautiful idea that if life
00:45:45.220 | is a collection of data, and life
00:45:47.300 | is a process of compressing that data to act efficiently,
00:45:54.140 | in that data, you yourself appear very often.
00:45:57.640 | So it's useful to form compressions of yourself.
00:46:00.900 | It's a really beautiful formulation
00:46:02.740 | of what consciousness is as a necessary side effect.
00:46:05.700 | It's actually quite compelling to me.
00:46:11.420 | You've described RNNs, developed LSTMs,
00:46:16.500 | long short-term memory networks, that are a type
00:46:20.340 | of recurrent neural networks.
00:46:22.140 | They've gotten a lot of success recently.
00:46:23.940 | So these are networks that model the temporal aspects
00:46:28.540 | in the data, temporal patterns in the data.
00:46:31.140 | And you've called them the deepest
00:46:33.860 | of the neural networks, right?
00:46:36.260 | So what do you think is the value of depth in the models
00:46:39.780 | that we use to learn?
00:46:41.220 | Since you mentioned the long short-term memory and the LSTM,
00:46:48.340 | I have to mention the names of the brilliant students who
00:46:51.500 | made that possible.
00:46:52.340 | Yes, of course.
00:46:53.380 | First of all, my first student ever,
00:46:55.220 | Sepp Hochreiter, who had fundamental insights already
00:46:58.420 | in his diploma thesis.
00:47:00.260 | Then Felix Gers, who had additional important
00:47:03.660 | contributions.
00:47:04.620 | Alex Graves is a guy from Scotland who
00:47:08.100 | is mostly responsible for this CTC algorithm, which
00:47:11.420 | is now often used to train the LSTM to do the speech
00:47:15.620 | recognition on all the Google, Android phones, and whatever,
00:47:19.300 | and Siri, and so on.
00:47:21.540 | So these guys, without these guys, I would be nothing.
00:47:26.820 | That's a lot of incredible work.
00:47:29.260 | What is now the depth?
00:47:30.540 | What is the importance of depth?
00:47:32.500 | Well, most problems in the real world
00:47:36.220 | are deep in the sense that the current input doesn't tell you
00:47:41.060 | all you need to know about the environment.
00:47:45.460 | So instead, you have to have a memory
00:47:48.460 | of what happened in the past.
00:47:49.780 | And often, important parts of that memory are dated.
00:47:54.820 | They are pretty old.
00:47:56.460 | So when you're doing speech recognition, for example,
00:48:00.340 | and somebody says 11, then that's
00:48:04.820 | about half a second or something like that, which
00:48:09.380 | means it's already 50 time steps.
00:48:12.020 | And another guy, or the same guy, says 7.
00:48:16.260 | So the ending is the same, "ven."
00:48:18.660 | But now the system has to see the distinction between 7
00:48:22.220 | and 11.
00:48:23.300 | And the only way it can see the difference
00:48:25.020 | is it has to store that 50 steps ago, there
00:48:29.980 | was an s or an l, a 7 or an 11.
00:48:34.940 | So there you have already a problem of depth 50,
00:48:38.100 | because for each time step, you have something
00:48:41.320 | like a virtual layer in the expanded unrolled version
00:48:44.900 | of this recurrent network, which is doing the speech recognition.
00:48:48.380 | So these long time lags, they translate into problem depth.
00:48:53.820 | And most problems in this world are such
00:48:57.780 | that you really have to look far back in time
00:49:01.620 | to understand what is the problem and to solve it.
00:49:06.180 | But just like with LSTMs, you don't necessarily
00:49:09.100 | need to, when you look back in time, remember every aspect.
00:49:12.340 | You just need to remember the important aspects.
00:49:14.820 | That's right.
00:49:15.500 | The network has to learn to put the important stuff
00:49:18.580 | into memory and to ignore the unimportant noise.
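
A toy version of that seven/eleven example, assuming PyTorch is available (the task, sizes, and training schedule are all illustrative): a small LSTM is trained to report, after 50 noisy steps, which token appeared at the very first step, i.e. to keep one important bit in memory and ignore the noise.

```python
# Hedged sketch: an LSTM bridging a 50-step time lag on a synthetic task.
import torch
import torch.nn as nn

torch.manual_seed(0)
T, batch = 50, 64

def make_batch():
    x = torch.randn(batch, T, 1) * 0.1           # unimportant noise
    label = torch.randint(0, 2, (batch,))        # the one important bit
    x[:, 0, 0] = label.float() * 2 - 1           # encoded in the first step only
    return x, label

lstm = nn.LSTM(1, 16, batch_first=True)
head = nn.Linear(16, 2)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-2)

for step in range(500):
    x, label = make_batch()
    out, _ = lstm(x)
    logits = head(out[:, -1])                    # decision only at the last step
    loss = nn.functional.cross_entropy(logits, label)
    opt.zero_grad()
    loss.backward()
    opt.step()

x, label = make_batch()
pred = head(lstm(x)[0][:, -1]).argmax(dim=1)
print("accuracy:", (pred == label).float().mean().item())
```
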
00:49:24.180 | But in that sense, deeper and deeper is better?
00:49:28.540 | Or is there a limitation?
00:49:30.980 | I mean, LSTM is one of the great examples of architectures
00:49:36.540 | that do something beyond just deeper and deeper networks.
00:49:42.380 | There's clever mechanisms for filtering data,
00:49:45.500 | for remembering and forgetting.
00:49:47.860 | So do you think that kind of thinking is necessary?
00:49:51.340 | If you think about LSTMs as a leap, a big leap forward
00:49:54.500 | over traditional vanilla RNNs, what
00:49:57.820 | do you think is the next leap within this context?
00:50:02.900 | So LSTM was a very clever improvement,
00:50:06.060 | but LSTMs still don't have the same kind of ability
00:50:10.420 | to see far back in the past as us humans do.
00:50:14.740 | The credit assignment problem across way back,
00:50:19.060 | not just 50 time steps or 100 or 1,000,
00:50:21.900 | but millions and billions.
00:50:24.540 | It's not clear what are the practical limits of the LSTM
00:50:29.060 | when it comes to looking back.
00:50:31.140 | Already in 2006, I think, we had examples
00:50:35.100 | where it not only looked back tens of thousands of steps,
00:50:37.900 | but really millions of steps.
00:50:40.740 | And Juan Pérez-Ortiz in my lab, I
00:50:44.820 | think was the first author of a paper where we really--
00:50:48.660 | was it 2006 or something--
00:50:50.460 | had examples where it learned to look back
00:50:53.620 | for more than 10 million steps.
00:50:57.500 | So for most problems of speech recognition,
00:51:02.060 | it's not necessary to look that far back.
00:51:04.620 | But there are examples where it does.
00:51:06.900 | Now, the looking back thing, that's
00:51:10.340 | rather easy because there is only one past.
00:51:14.460 | But there are many possible futures.
00:51:17.780 | And so a reinforcement learning system,
00:51:20.180 | which is trying to maximize its future expected reward
00:51:24.260 | and doesn't know yet which of these many possible futures
00:51:27.580 | should I select, given there's one single past,
00:51:31.620 | is facing problems that the LSTM by itself cannot solve.
00:51:36.540 | So the LSTM is good for coming up
00:51:38.900 | with a compact representation of the history so far,
00:51:42.380 | of the history of observations and actions so far.
00:51:46.380 | But now, how do you plan in an efficient and good way
00:51:51.420 | among all these--
00:51:54.340 | how do you select one of these many possible action sequences
00:51:58.140 | that a reinforcement learning system
00:51:59.860 | has to consider to maximize reward in this unknown future?
00:52:05.700 | So again, we have this basic setup
00:52:08.700 | where you have one recurrent network, which
00:52:12.820 | gets in the video and the speech and whatever,
00:52:15.940 | and is executing actions, and is trying to maximize reward.
00:52:19.660 | So there is no teacher who tells it
00:52:22.060 | what to do at which point in time.
00:52:24.460 | And then there's the other network,
00:52:26.100 | which is just predicting what's going
00:52:30.500 | to happen if I do that and that.
00:52:32.980 | And that could be an LSTM network.
00:52:35.260 | And it learns to look back all the way
00:52:38.460 | to make better predictions of the next time step.
00:52:41.620 | So essentially, although it's predicting only the next time
00:52:45.020 | step, it is motivated to learn to put into memory something
00:52:50.540 | that happened maybe a million steps ago,
00:52:52.420 | because it's important to memorize that if you
00:52:54.980 | want to predict that at the next time step, the next event.
00:52:59.620 | Now, how can a model of the world
00:53:03.300 | like that, a predictive model of the world,
00:53:05.500 | be used by the first guy?
00:53:07.940 | Let's call it the controller and the model,
00:53:10.180 | the controller and the model.
00:53:11.500 | How can the model be used by the controller
00:53:14.340 | to efficiently select among these many possible futures?
00:53:19.580 | So the naive way we had about 30 years ago
00:53:23.100 | was let's just use the model of the world as a stand-in,
00:53:27.620 | as a simulation of the world.
00:53:29.340 | And millisecond by millisecond, we plan the future.
00:53:32.420 | And that means we have to roll it out really in detail.
00:53:36.260 | And it will work only if the model is really good.
00:53:38.500 | And it will still be inefficient,
00:53:40.380 | because we have to look at all these possible futures.
00:53:42.940 | And there are so many of them.
00:53:45.740 | So instead, what we do now, since 2015,
00:53:49.500 | in our CM systems, controller model systems,
00:53:52.180 | we give the controller the opportunity
00:53:55.500 | to learn by itself how to use the potentially relevant parts
00:54:00.620 | of the model network to solve new problems more quickly.
00:54:06.300 | And if it wants to, it can learn to ignore the M.
00:54:10.100 | And sometimes it's a good idea to ignore the M,
00:54:12.780 | because it's really bad.
00:54:14.500 | It's a bad predictor in this particular situation of life
00:54:19.220 | where the controller is currently
00:54:20.700 | trying to maximize reward.
00:54:23.100 | However, it can also learn to address and exploit
00:54:27.100 | some of the subprograms that came about in the model
00:54:31.980 | network through compressing the data by predicting it.
00:54:36.300 | So it now has an opportunity to reuse
00:54:40.220 | that code, the algorithmic information in the model
00:54:43.540 | network, to reduce its own search space,
00:54:48.180 | such that it can solve a new problem more quickly than
00:54:51.900 | without the model.
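A rough sketch of that controller-model wiring (again invented for illustration, and simplified: here "using the model" just means the controller reads the model network's hidden activations through a gate it could learn to open or close):

import numpy as np

rng = np.random.default_rng(1)
OBS, ACT, H_M, H_C = 8, 3, 16, 16

# M: recurrent world model, trained elsewhere to predict the next observation.
Wm_in   = rng.normal(0, 0.1, (H_M, OBS + ACT))
Wm_rec  = rng.normal(0, 0.1, (H_M, H_M))
Wm_pred = rng.normal(0, 0.1, (OBS, H_M))

# C: the controller. It sees the observation plus a gated view of M's hidden state,
# so it can learn to exploit M's "subprograms" or to ignore M where M predicts badly.
Wc_in  = rng.normal(0, 0.1, (H_C, OBS + H_M))
Wc_rec = rng.normal(0, 0.1, (H_C, H_C))
Wc_act = rng.normal(0, 0.1, (ACT, H_C))
gate   = np.full(H_M, 0.5)                    # fixed here; learned in a real system

def m_step(h_m, obs, act_onehot):
    h_m = np.tanh(Wm_in @ np.concatenate([obs, act_onehot]) + Wm_rec @ h_m)
    return h_m, Wm_pred @ h_m                 # prediction of the next observation

def c_step(h_c, obs, h_m):
    x = np.concatenate([obs, gate * h_m])     # gated access to the model's internal state
    h_c = np.tanh(Wc_in @ x + Wc_rec @ h_c)
    return h_c, int(np.argmax(Wc_act @ h_c))

h_m, h_c, prev_a = np.zeros(H_M), np.zeros(H_C), np.zeros(ACT)
for t in range(5):
    obs = rng.normal(size=OBS)                # stand-in sensory input
    h_m, _pred = m_step(h_m, obs, prev_a)
    h_c, a = c_step(h_c, obs, h_m)
    prev_a = np.eye(ACT)[a]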
00:54:53.820 | Compression-- so you're ultimately
00:54:57.780 | optimistic and excited about the power of RL,
00:55:01.180 | of reinforcement learning, in the context of real systems?
00:55:05.500 | Absolutely, yeah.
00:55:07.180 | So you see RL as potentially having a huge impact,
00:55:11.660 | beyond just the M part, which is often developed
00:55:15.980 | with supervised learning methods.
00:55:19.940 | You see RL as--
00:55:23.900 | for problems of self-driving cars
00:55:25.740 | or any kind of applied side of robotics--
00:55:28.980 | the correct, interesting direction
00:55:32.540 | for research, in your view?
00:55:34.660 | I do think so.
00:55:35.580 | We have a company called NNAISENSE--
00:55:37.340 | NNAISENSE.
00:55:37.940 | --which has applied reinforcement learning
00:55:41.020 | to little Audis--
00:55:44.060 | Little Audis.
00:55:45.100 | --which learn to park without a teacher.
00:55:48.180 | The same principles were used, of course.
00:55:51.500 | So these little Audis, they are small, maybe like that,
00:55:55.020 | so much smaller than the real Audis.
00:55:57.740 | But they have all the sensors that you
00:55:59.860 | find in the real Audis.
00:56:01.140 | You find the cameras, the LIDAR sensors.
00:56:03.820 | They go up to 120 kilometers an hour if they want to.
00:56:09.020 | And they have pain sensors, basically.
00:56:12.460 | And they don't want to bump against obstacles
00:56:15.220 | and other Audis.
00:56:17.140 | And so they must learn, like little babies, to park.
00:56:22.660 | Take the raw vision input and translate that
00:56:25.340 | into actions that lead to successful parking behavior,
00:56:29.500 | which is a rewarding thing.
00:56:30.740 | And yes, they learn that.
00:56:32.180 | They learn.
00:56:33.100 | So we have examples like that.
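A hypothetical version of the reward structure implied here (the numbers and the collision/parked signals are invented, not NNAISENSE's actual setup) would be sparse: pain on collisions, a bonus for a successful park, and little in between, so the cars must discover the behavior themselves.

def parking_reward(collision: bool, parked: bool, step_cost: float = 0.01) -> float:
    """Hypothetical sparse reward for the parking task."""
    if collision:
        return -1.0          # the "pain sensor" signal
    if parked:
        return +1.0          # successful parking is the rewarding event
    return -step_cost        # mild per-step pressure to finish quickly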
00:56:35.100 | And it's only in the beginning.
00:56:37.580 | This is just the tip of the iceberg.
00:56:39.140 | And I believe the next wave of AI
00:56:42.860 | is going to be all about that.
00:56:45.420 | So at the moment, the current wave of AI
00:56:47.580 | is about passive pattern observation and prediction.
00:56:52.340 | And that's what you have on your smartphone
00:56:55.780 | and what the major companies on the Pacific Rim
00:56:58.140 | are using to sell you ads, to do marketing.
00:57:02.340 | That's the current source of profit in AI.
00:57:05.620 | And that's only 1% or 2% of the world economy,
00:57:10.620 | which is big enough to make these companies pretty much
00:57:13.020 | the most valuable companies in the world.
00:57:15.500 | But there's a much, much bigger fraction
00:57:19.300 | of the economy going to be affected
00:57:20.940 | by the next wave, which is really
00:57:22.420 | about machines that shape the data through their own actions.
00:57:28.500 | Do you think simulation is ultimately the biggest way
00:57:33.180 | that those methods will be successful in the next 10,
00:57:36.180 | 20 years?
00:57:36.820 | We're not talking about 100 years from now.
00:57:38.820 | We're talking about the near-term impact of RL.
00:57:42.620 | Do you think really good simulation is required?
00:57:45.260 | Or is there other techniques, like imitation learning,
00:57:49.220 | observing other humans operating in the real world?
00:57:53.660 | Where do you think the success will come from?
00:57:57.740 | So at the moment, we have a tendency
00:57:59.420 | of using physics simulations to learn behavior
00:58:05.980 | for machines that learn to solve problems that humans also do
00:58:11.500 | not know how to solve.
00:58:13.980 | However, this is not the future, because the future
00:58:16.700 | is in what little babies do.
00:58:19.580 | They don't use a physics engine to simulate the world.
00:58:22.300 | No, they learn a predictive model
00:58:24.660 | of the world, which maybe sometimes is wrong in many ways,
00:58:30.100 | but captures all kinds of important abstract high-level
00:58:34.020 | predictions, which are really important to be successful.
00:58:38.460 | And that's what was the future 30 years ago, when we started
00:58:43.380 | that type of research.
00:58:44.340 | But it's still the future.
00:58:45.460 | And now we know much better how to go there, to move forward,
00:58:51.300 | and to really make working systems based on that, where
00:58:54.980 | you have a learning model of the world, a model of the world
00:58:58.260 | that learns to predict what's going to happen
00:59:00.420 | if I do that and that.
00:59:01.820 | And then the controller uses that model
00:59:06.660 | to more quickly learn successful action sequences.
00:59:11.820 | And then, of course, always this curiosity thing.
00:59:13.900 | In the beginning, the model is stupid.
00:59:15.480 | So the controller should be motivated
00:59:17.780 | to come up with experiments with action sequences
00:59:20.900 | that lead to data that improve the model.
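One way to make that curiosity drive concrete (a sketch of the general idea, not necessarily the exact formulation in his papers) is to pay the controller an intrinsic reward equal to how much a piece of data improves the model, so that both already-predictable and hopelessly random experiences quickly become boring.

import numpy as np

rng = np.random.default_rng(2)
W, LR = rng.normal(0, 0.1, (4, 4)), 0.05      # toy linear world model and learning rate

def model_error(obs, nxt):
    return float(np.mean((W @ obs - nxt) ** 2))

def train_model(obs, nxt):
    """One gradient step on the squared prediction error; returns the new error."""
    global W
    W = W - LR * 2 * np.outer(W @ obs - nxt, obs) / obs.size
    return model_error(obs, nxt)

def intrinsic_reward(obs, nxt):
    """Curiosity reward = learning progress of the model on this transition."""
    before = model_error(obs, nxt)
    after = train_model(obs, nxt)
    return before - after                     # added to the external reward by the controller

obs, nxt = rng.normal(size=4), rng.normal(size=4)
print(intrinsic_reward(obs, nxt))             # positive while the model is still improving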
00:59:24.260 | Do you think improving the model,
00:59:27.020 | constructing an understanding of the world
00:59:30.340 | in this connectionist way, grounded in ideas of neural networks,
00:59:34.260 | is the approach that is now popular and has been successful?
00:59:38.660 | But in the '80s, with expert systems,
00:59:40.660 | there were symbolic AI approaches, which to us humans
00:59:44.980 | are more intuitive, in the sense that it makes sense
00:59:49.220 | that you build up knowledge in this kind of knowledge representation.
00:59:52.540 | What kind of lessons can we draw into our current approaches
00:59:57.460 | from expert systems, from symbolic AI?
01:00:00.580 | So I became aware of all of that.
01:00:03.660 | In the '80s and back then, logic programming
01:00:07.940 | was a huge thing.
01:00:09.020 | Was it inspiring to you yourself?
01:00:10.860 | Did you find it compelling?
01:00:12.180 | Because a lot of your work was not so much in that realm,
01:00:16.620 | right, it was more in the learning systems.
01:00:18.580 | Yes and no, but we did all of that.
01:00:20.860 | So my first publication ever, actually,
01:00:25.860 | in 1987, was the implementation of a genetic algorithm,
01:00:31.620 | of a genetic programming system, in Prolog.
01:00:34.620 | So Prolog, that's what you learned back then,
01:00:37.820 | which is a logic programming language.
01:00:40.180 | And the Japanese, they have this huge fifth generation
01:00:44.420 | AI project, which was mostly about logic programming
01:00:48.340 | back then, although neural networks existed
01:00:51.300 | and were well known back then.
01:00:54.060 | And deep learning has existed since 1965,
01:00:58.140 | since this guy in the Ukraine, Ivakhnenko, started it.
01:01:02.260 | But the Japanese and many other people,
01:01:05.700 | they focused really on this logic programming.
01:01:08.060 | And I was influenced to the extent that I said,
01:01:10.340 | okay, let's take these biologically inspired algorithms
01:01:13.780 | like the evolution of programs,
01:01:16.820 | and implement that in the language which I know,
01:01:22.340 | which was Prolog, for example, back then.
01:01:25.100 | And then, in many ways, this came back later
01:01:29.060 | because the Godel machine, for example,
01:01:31.940 | has a proof search on board.
01:01:33.540 | And without that, it would not be optimal.
01:01:35.900 | Likewise, Markus Hutter's universal algorithm
01:01:38.300 | for solving all well-defined problems
01:01:40.620 | has a proof search on board.
01:01:42.460 | So that's very much logic programming.
01:01:45.420 | Without that, it would not be asymptotically optimal.
01:01:50.300 | But then, on the other hand,
01:01:51.220 | because we have very pragmatic guys also,
01:01:54.340 | we focused on recurrent neural networks
01:01:58.900 | and suboptimal stuff,
01:02:02.740 | such as gradient-based search and program space,
01:02:05.860 | rather than provably optimal things.
01:02:09.100 | - So logic programming certainly
01:02:11.140 | has a usefulness
01:02:13.380 | when you're trying to construct something provably optimal
01:02:16.740 | or provably good or something like that.
01:02:18.980 | But is it useful for practical problems?
01:02:21.980 | - It's really useful for theorem proving.
01:02:24.140 | The best theorem provers today are not neural networks.
01:02:28.020 | No, they are logic programming systems
01:02:30.940 | and they are much better theorem provers
01:02:33.140 | than most math students in the first or second semester.
01:02:37.620 | - But for reasoning, for playing games of Go or chess,
01:02:43.260 | or for robots, autonomous vehicles
01:02:45.700 | that operate in the real world,
01:02:46.940 | or object manipulation, you think learning?
01:02:51.260 | - Yeah, as long as the problems have little to do
01:02:54.340 | with theorem proving themselves--
01:02:58.700 | as long as that is not the case--
01:03:01.700 | you would just want to have better pattern recognition.
01:03:05.300 | So to build a self-driving car,
01:03:06.820 | you want to have better pattern recognition
01:03:09.100 | and pedestrian recognition and all these things.
01:03:13.540 | And you want to minimize the number of false positives,
01:03:19.060 | which is currently slowing down self-driving cars
01:03:21.340 | in many ways.
01:03:22.220 | And all of that has very little to do
01:03:24.980 | with logic programming, yeah.
01:03:27.540 | - What are you most excited about
01:03:31.580 | in terms of directions of artificial intelligence
01:03:34.060 | at this moment in the next few years
01:03:37.100 | in your own research and in the broader community?
01:03:40.020 | - So I think in the not so distant future,
01:03:44.260 | we will have for the first time
01:03:47.420 | little robots that learn like kids.
01:03:49.860 | And I will be able to say to the robot,
01:03:54.220 | "Look here, robot, we are going to assemble a smartphone.
01:03:59.380 | "Let's take this slab of plastic and the screwdriver
01:04:04.020 | "and let's screw in the screw like that.
01:04:07.340 | "No, not like that, like that.
01:04:10.140 | "Not like that, like that."
01:04:12.220 | Like that.
01:04:13.540 | And I don't have a data glove or something.
01:04:16.980 | He will see me and he will hear me
01:04:20.420 | and he will try to do something with his own actuators,
01:04:24.220 | which will be really different from mine,
01:04:26.260 | but he will understand the difference
01:04:28.020 | and will learn to imitate me,
01:04:31.540 | but not in the supervised way
01:04:33.820 | where a teacher is giving target signals
01:04:37.820 | for all his muscles all the time.
01:04:40.140 | No, by doing this high level imitation
01:04:43.060 | where he first has to learn to imitate me
01:04:46.060 | and then to interpret these additional noises
01:04:48.500 | coming from my mouth
01:04:51.500 | as helpful signals to do that better.
01:04:54.660 | And then it will, by itself,
01:04:58.540 | come up with faster ways and more efficient ways
01:05:01.940 | of doing the same thing.
01:05:03.660 | And finally, I stop his learning algorithm
01:05:07.900 | and make a million copies and sell it.
01:05:10.260 | And so at the moment, this is not possible,
01:05:13.740 | but we already see how we are going to get there.
01:05:17.260 | And you can imagine to the extent
01:05:19.220 | that this works economically and cheaply,
01:05:22.060 | it's going to change everything.
01:05:25.140 | Almost all of production is going to be affected by that.
01:05:29.820 | And a much bigger wave,
01:05:32.820 | a much bigger AI wave is coming
01:05:36.340 | than the one that we are currently,
01:05:37.740 | witnessing, which is mostly about passive pattern recognition
01:05:40.740 | on your smartphone.
01:05:42.020 | This is about active machines that shape data
01:05:44.900 | through the actions they are executing.
01:05:48.180 | And they learn to do that in a good way.
01:05:50.260 | So many of the traditional industries
01:05:54.980 | are going to be affected by that.
01:05:56.620 | All the companies that are building machines
01:06:00.140 | will equip these machines with cameras and other sensors,
01:06:05.820 | and they are going to learn to solve all kinds of problems
01:06:10.500 | through interaction with humans,
01:06:12.740 | but also a lot on their own
01:06:15.100 | to improve what they already can do.
01:06:16.940 | And lots of old economy is going to be affected by that.
01:06:23.940 | And in recent years, I have seen that old economy
01:06:27.260 | is actually waking up and realizing that this is the case.
01:06:32.140 | - Are you optimistic about the future?
01:06:33.980 | Are you concerned?
01:06:35.780 | There's a lot of people concerned in the near term
01:06:38.340 | about the transformation of the nature of work.
01:06:42.940 | The kind of ideas that you just suggested
01:06:45.540 | would have a significant impact
01:06:47.300 | of what kind of things could be automated.
01:06:49.260 | Are you optimistic about that future?
01:06:51.940 | Are you nervous about that future?
01:06:54.660 | And looking a little bit farther into the future,
01:06:58.300 | there's people like Elon Musk,
01:07:01.900 | Stuart Russell,
01:07:02.740 | concerned about the existential threats of that future.
01:07:06.660 | So in the near term,
01:07:07.780 | job loss in the long-term existential threat,
01:07:10.780 | are these concerns to you, or are you ultimately optimistic?
01:07:13.780 | - So let's first address the near future.
01:07:19.540 | We have had predictions of job losses for many decades.
01:07:28.100 | For example, when industrial robots came along,
01:07:31.620 | many people predicted that lots of jobs
01:07:35.900 | are going to get lost.
01:07:38.700 | And in a sense, they were right.
01:07:42.540 | Because back then there were car factories
01:07:45.980 | and hundreds of people,
01:07:47.700 | and these factories assembled cars.
01:07:50.780 | And today the same car factories have hundreds of robots
01:07:53.900 | and maybe three guys watching the robots.
01:07:57.060 | It's a very big number.
01:07:58.380 | On the other hand,
01:08:01.900 | those countries that have lots of robots per capita,
01:08:06.140 | Japan, Korea, Germany, Switzerland,
01:08:08.540 | a couple of other countries,
01:08:09.900 | they have really low unemployment rates.
01:08:14.660 | Somehow all kinds of new jobs were created.
01:08:18.220 | Back then, nobody anticipated those jobs.
01:08:24.860 | Decades ago, I always said,
01:08:26.740 | it's really easy to say which jobs are going to get lost,
01:08:31.740 | but it's really hard to predict the new ones.
01:08:34.220 | 30 years ago, who would have predicted all these people
01:08:39.300 | making money as YouTube bloggers, for example?
01:08:43.860 | 200 years ago, 60% of all people
01:08:50.220 | used to work in agriculture.
01:08:53.820 | Today, maybe 1%.
01:08:56.740 | But still, only, I don't know, 5% unemployment.
01:09:01.740 | Lots of new jobs were created.
01:09:03.740 | And Homo Ludens, the playing man,
01:09:06.820 | is inventing new jobs all the time.
01:09:10.540 | Most of these jobs are not existentially necessary
01:09:15.540 | for the survival of our species.
01:09:17.740 | There are only very few existentially necessary jobs,
01:09:22.860 | such as farming and building houses
01:09:25.820 | and warming up the houses,
01:09:28.140 | but less than 10% of the population is doing that.
01:09:31.300 | And most of these newly invented jobs
01:09:33.620 | are about interacting with other people in new ways,
01:09:38.620 | through new media and so on,
01:09:40.900 | getting new types of kudos in the form of likes and whatever,
01:09:46.220 | and even making money through that.
01:09:48.540 | So Homo Ludens, the playing man,
01:09:51.780 | doesn't want to be unemployed,
01:09:53.380 | and that's why he's inventing new jobs all the time.
01:09:57.020 | And he keeps considering these jobs as really important
01:10:01.740 | and is investing a lot of energy and hours of work
01:10:05.540 | into those new jobs.
01:10:08.340 | - That's quite beautifully put.
01:10:10.180 | We're really nervous about the future
01:10:11.980 | because we can't predict
01:10:13.260 | what kind of new jobs will be created.
01:10:15.020 | But you're ultimately optimistic
01:10:18.340 | that we humans are so restless that we create
01:10:22.380 | and give meaning to newer and newer jobs,
01:10:24.980 | totally new things that get likes on Facebook
01:10:29.980 | or whatever the social platform is.
01:10:32.300 | So what about long-term existential threat of AI,
01:10:36.700 | where our whole civilization may be swallowed up
01:10:40.980 | by this ultra super intelligent systems?
01:10:44.460 | - Maybe it's not going to be swallowed up,
01:10:47.460 | but I'd be surprised if we humans were the last step
01:10:52.460 | in the evolution of the universe.
01:10:57.940 | - You've actually had this beautiful comment somewhere
01:11:03.820 | that I've seen, quite insightful,
01:11:07.340 | saying that,
01:11:09.860 | "Artificial general intelligence systems,
01:11:12.020 | "just like us humans,
01:11:13.460 | "will likely not want to interact with humans.
01:11:16.080 | "They'll just interact amongst themselves,
01:11:17.940 | "just like ants interact amongst themselves
01:11:21.460 | "and only tangentially interact with humans."
01:11:25.420 | And it's quite an interesting idea
01:11:27.540 | that once we create AGI,
01:11:28.900 | they will lose interest in humans
01:11:31.420 | and compete for their own Facebook likes
01:11:34.500 | on their own social platforms.
01:11:36.780 | So within that quite elegant idea,
01:11:40.300 | how do we know in a hypothetical sense
01:11:45.140 | that there's not already intelligence systems out there?
01:11:48.840 | How do you think broadly
01:11:50.200 | of general intelligence greater than us?
01:11:54.360 | How would we know it's out there?
01:11:56.600 | How would we know it's around us?
01:11:59.160 | And could it already be?
01:12:00.400 | - I'd be surprised if within the next few decades
01:12:05.320 | or something like that,
01:12:06.480 | we won't have AIs that are truly smart in every single way
01:12:13.100 | and better problem solvers
01:12:14.200 | in almost every single important way.
01:12:16.600 | And I'd be surprised if they wouldn't realize
01:12:23.080 | what we have realized a long time ago,
01:12:24.880 | which is that almost all physical resources
01:12:28.560 | are not here in this biosphere,
01:12:31.000 | but further out,
01:12:32.060 | the rest of the solar system
01:12:36.760 | gets 2 billion times more solar energy
01:12:40.680 | than our little planet.
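That factor of roughly two billion follows from simple geometry: the Sun radiates into a full sphere of radius one astronomical unit, while the Earth intercepts only its own cross-section, so

\[
\frac{P_{\text{Sun}}}{P_{\text{Earth}}}
  = \frac{4\pi d^{2}}{\pi R_{\oplus}^{2}}
  = 4\left(\frac{1.496\times 10^{11}\,\text{m}}{6.371\times 10^{6}\,\text{m}}\right)^{2}
  \approx 2.2\times 10^{9}.
\]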
01:12:43.680 | There's lots of material out there
01:12:45.560 | that you can use to build robots
01:12:47.400 | and self-replicating robot factories and all this stuff.
01:12:50.900 | And they are going to do that.
01:12:53.080 | And they will be scientists and curious,
01:12:56.480 | and they will explore what they can do.
01:12:58.600 | And in the beginning,
01:13:01.100 | they will be fascinated by life
01:13:04.640 | and by their own origins in our civilization.
01:13:07.320 | They will want to understand that completely,
01:13:09.760 | just like people today would like to understand
01:13:12.700 | how life works and also the history
01:13:17.700 | of our own existence and civilization,
01:13:22.640 | but then also the physical laws
01:13:24.400 | that created all of that.
01:13:25.640 | So in the beginning, they will be fascinated by life.
01:13:30.160 | Once they understand it, they lose interest,
01:13:32.820 | like anybody who loses interest in things he understands.
01:13:38.340 | And then, as you said,
01:13:42.180 | the most interesting sources of information for them
01:13:47.180 | will be others of their own kind.
01:13:53.060 | So at least in the long run,
01:14:01.020 | there seems to be some sort of protection
01:14:04.900 | through lack of interest on the other side.
01:14:11.220 | And now it seems also clear,
01:14:14.700 | as far as we understand physics,
01:14:16.700 | you need matter and energy to compute
01:14:20.460 | and to build more robots and infrastructure
01:14:22.620 | and more AI civilization and AI ecologies
01:14:27.620 | consisting of trillions of different types of AIs.
01:14:31.780 | And so it seems inconceivable to me
01:14:34.780 | that this thing is not going to expand.
01:14:37.620 | Some AI ecology not controlled by one AI,
01:14:41.020 | but by trillions of different types of AIs competing
01:14:44.580 | in all kinds of quickly evolving
01:14:47.860 | and disappearing ecological niches
01:14:49.900 | in ways that we cannot fathom at the moment.
01:14:52.500 | But it's going to expand,
01:14:54.700 | limited by light speed and physics,
01:14:57.020 | but it's going to expand.
01:14:58.260 | And now we realize that the universe is still young.
01:15:02.980 | It's only 13.8 billion years old,
01:15:06.180 | and it's going to be 1,000 times older than that.
01:15:10.580 | So there's plenty of time to conquer the entire universe
01:15:15.580 | and to fill it with intelligence
01:15:19.820 | and senders and receivers such that AIs can travel
01:15:23.460 | the way they are traveling in our labs today,
01:15:27.300 | which is by radio from sender to receiver.
01:15:30.040 | And let's call the current age of the universe one eon.
01:15:35.940 | One eon.
01:15:39.580 | Now it will take just a few eons from now
01:15:41.940 | until the entire visible universe
01:15:43.620 | is going to be full of that stuff.
01:15:45.340 | And let's look ahead to a time
01:15:48.980 | when the universe is going to be 1,000 times older
01:15:51.300 | than it is now.
01:15:52.140 | They will look back and they will say,
01:15:54.580 | "Look, almost immediately after the Big Bang,
01:15:57.060 | "only a few eons later,
01:15:59.820 | "the entire universe started to become intelligent."
01:16:02.580 | Now to your question,
01:16:04.580 | how do we see whether anything like that
01:16:08.380 | has already happened or is already in a more advanced stage
01:16:12.580 | in some other part of the universe,
01:16:14.740 | of the visible universe?
01:16:16.660 | We are trying to look out there
01:16:17.740 | and nothing like that has happened so far.
01:16:20.660 | Or is that true?
01:16:22.460 | - Do you think we would recognize it?
01:16:24.340 | How do we know it's not among us?
01:16:25.740 | How do we know planets aren't in themselves
01:16:28.820 | intelligent beings?
01:16:30.580 | How do we know ants, seen as a collective,
01:16:36.540 | are not much greater intelligence than our own?
01:16:40.260 | These kinds of ideas.
01:16:41.380 | - Yeah.
01:16:42.300 | When I was a boy, I was thinking about these things.
01:16:45.140 | And I thought, "Hmm, maybe it has already happened."
01:16:48.380 | Because back then I knew, I learned from popular physics
01:16:53.060 | books that the structure, the large-scale structure
01:16:57.140 | of the universe is not homogeneous.
01:17:00.100 | And you have these clusters of galaxies,
01:17:03.060 | and then in between there are these huge empty spaces.
01:17:07.500 | And I thought, "Hmm, maybe they aren't really empty."
01:17:12.380 | It's just that in the middle of that,
01:17:13.980 | some AI civilization already has expanded
01:17:16.900 | and then has covered a bubble
01:17:19.740 | of a billion light-years diameter,
01:17:22.220 | and is using all the energy of all the stars
01:17:25.740 | within that bubble for its own unfathomable purposes.
01:17:29.580 | And so it already has happened,
01:17:31.540 | and we just fail to interpret the signs.
01:17:34.860 | But then I learned that gravity by itself
01:17:39.420 | explains the large-scale structure of the universe,
01:17:42.300 | and so that is not a convincing explanation.
01:17:45.460 | And then I thought, "Maybe it's the dark matter."
01:17:50.460 | Because as far as we know today,
01:17:54.820 | 80% of the measurable matter is invisible.
01:17:59.820 | And we know that because otherwise our galaxy
01:18:03.620 | or other galaxies would fall apart.
01:18:06.580 | They are rotating too quickly.
01:18:08.060 | And then the idea was maybe all of these AI civilizations
01:18:15.060 | that are already out there,
01:18:17.300 | they are just invisible because they're really efficient
01:18:23.460 | in using the energies of their own local systems,
01:18:26.580 | and that's why they appear dark to us.
01:18:29.780 | But this is also not a convincing explanation
01:18:31.700 | because then the question becomes,
01:18:34.660 | "Why are there still any visible stars left
01:18:39.660 | in our own galaxy,
01:18:42.060 | which also must have a lot of dark matter?"
01:18:44.620 | So that is also not a convincing thing.
01:18:46.900 | And today, I like to think it's quite plausible
01:18:53.380 | that maybe we are the first,
01:18:54.540 | at least in our local light cone,
01:18:57.300 | within the few hundreds of millions of light years
01:19:02.300 | that we can reliably observe.
01:19:09.220 | - Is that exciting to you, that we might be the first?
01:19:12.020 | - And it would make us much more important
01:19:16.500 | because if we mess it up through a nuclear war,
01:19:20.740 | then maybe this will have an effect
01:19:25.500 | on the development of the entire universe.
01:19:30.500 | - So let's not mess it up.
01:19:32.540 | - Let's not mess it up.
01:19:33.740 | - Jürgen, thank you so much for talking today.
01:19:35.740 | I really appreciate it.
01:19:37.220 | - It's my pleasure.
01:19:38.260 | (upbeat music)