
David Silver: AlphaGo, AlphaZero, and Deep Reinforcement Learning | Lex Fridman Podcast #86


Chapters

0:00 Introduction
4:09 First program
11:11 AlphaGo
21:42 Rules of the game of Go
25:37 Reinforcement learning: personal journey
30:15 What is reinforcement learning?
43:51 AlphaGo (continued)
53:40 Supervised learning and self-play in AlphaGo
66:12 Lee Sedol's retirement from Go
68:57 Garry Kasparov
74:10 AlphaZero and self-play
91:29 Creativity in AlphaZero
95:21 AlphaZero applications
97:59 Reward functions
100:51 Meaning of life

Whisper Transcript | Transcript Only Page

00:00:00.000 | The following is a conversation with David Silver,
00:00:02.600 | who leads the reinforcement learning research group
00:00:05.040 | at DeepMind and was the lead researcher
00:00:07.880 | on AlphaGo, AlphaZero, and co-led the AlphaStar
00:00:12.120 | and MuZero efforts and a lot of important work
00:00:14.800 | in reinforcement learning in general.
00:00:17.200 | I believe AlphaZero is one of the most important
00:00:20.880 | accomplishments in the history of artificial intelligence.
00:00:24.200 | And David is one of the key humans
00:00:26.840 | who brought AlphaZero to life,
00:00:28.600 | together with a lot of other great researchers at DeepMind.
00:00:31.920 | He's humble, kind, and brilliant.
00:00:35.160 | We were both jet lagged, but didn't care and made it happen.
00:00:39.080 | It was a pleasure and truly an honor to talk with David.
00:00:43.320 | This conversation was recorded
00:00:44.720 | before the outbreak of the pandemic.
00:00:47.000 | For everyone feeling the medical, psychological,
00:00:49.560 | and financial burden of this crisis,
00:00:51.640 | I'm sending love your way.
00:00:53.400 | Stay strong.
00:00:54.600 | We're in this together.
00:00:55.920 | We'll beat this thing.
00:00:57.720 | This is the Artificial Intelligence Podcast.
00:01:00.400 | If you enjoy it, subscribe on YouTube,
00:01:02.520 | review it with five stars on Apple Podcasts,
00:01:04.800 | support on Patreon, or simply connect with me on Twitter
00:01:08.000 | at Lex Fridman, spelled F-R-I-D-M-A-N.
00:01:12.080 | As usual, I'll do a few minutes of ads now
00:01:14.520 | and never any ads in the middle
00:01:16.120 | that can break the flow of the conversation.
00:01:18.400 | I hope that works for you
00:01:19.720 | and doesn't hurt the listening experience.
00:01:22.600 | Quick summary of the ads.
00:01:23.920 | Two sponsors, MasterClass and Cash App.
00:01:27.400 | Please consider supporting the podcast
00:01:29.080 | by signing up to MasterClass at masterclass.com/lex
00:01:34.040 | and downloading Cash App and using code LEXPODCAST.
00:01:38.800 | This show is presented by Cash App,
00:01:41.160 | the number one finance app in the App Store.
00:01:43.520 | When you get it, use code LEXPODCAST.
00:01:47.000 | Cash App lets you send money to friends,
00:01:49.160 | buy Bitcoin, and invest in the stock market
00:01:51.400 | with as little as $1.
00:01:52.640 | Since Cash App allows you to buy Bitcoin,
00:01:56.040 | let me mention that cryptocurrency
00:01:57.800 | in the context of the history of money is fascinating.
00:02:01.400 | I recommend "The Ascent of Money"
00:02:03.360 | as a great book on this history.
00:02:05.320 | Debits and credits on ledgers started around 30,000 years ago.
00:02:10.040 | The US dollar created over 200 years ago,
00:02:12.840 | and Bitcoin, the first decentralized cryptocurrency,
00:02:15.840 | released just over 10 years ago.
00:02:18.600 | So given that history,
00:02:19.920 | cryptocurrency is still very much
00:02:21.880 | in its early days of development,
00:02:23.880 | but it's still aiming to, and just might,
00:02:26.480 | redefine the nature of money.
00:02:29.040 | So again, if you get Cash App
00:02:30.920 | from the App Store or Google Play
00:02:32.360 | and use the code LEXPODCAST, you get $10,
00:02:35.880 | and Cash App will also donate $10 to FIRST,
00:02:38.640 | an organization that is helping to advance robotics
00:02:41.080 | and STEM education for young people around the world.
00:02:44.860 | This show is sponsored by MasterClass.
00:02:46.960 | Sign up at masterclass.com/lex
00:02:49.480 | to get a discount and to support this podcast.
00:02:52.000 | In fact, for a limited time now,
00:02:53.580 | if you sign up for an all-access pass for a year,
00:02:56.600 | you get to get another all-access pass
00:02:59.480 | to share with a friend.
00:03:01.200 | Buy one, get one free.
00:03:02.600 | When I first heard about MasterClass,
00:03:04.280 | I thought it was too good to be true.
00:03:06.240 | For $180 a year, you get an all-access pass
00:03:09.680 | to watch courses from, to list some of my favorites.
00:03:12.920 | Chris Hadfield on space exploration,
00:03:15.120 | Neil deGrasse Tyson on scientific thinking and communication,
00:03:18.080 | Will Wright, the creator of SimCity and Sims,
00:03:21.480 | on game design, Jane Goodall on conservation,
00:03:24.640 | Carlos Santana on guitar, his song "Europa"
00:03:28.040 | could be the most beautiful guitar song ever written,
00:03:30.940 | Garry Kasparov on chess, Daniel Negreanu on poker,
00:03:34.240 | and many, many more.
00:03:35.620 | Chris Hadfield explaining how rockets work
00:03:37.840 | and the experience of being launched into space alone
00:03:40.400 | is worth the money.
00:03:41.620 | For me, the key is to not be overwhelmed
00:03:44.680 | by the abundance of choice.
00:03:46.200 | Pick three courses you want to complete,
00:03:48.060 | watch each of them all the way through.
00:03:50.080 | It's not that long, but it's an experience
00:03:51.880 | that will stick with you for a long time, I promise.
00:03:55.200 | It's easily worth the money.
00:03:56.720 | You can watch it on basically any device.
00:03:59.120 | Once again, sign up on masterclass.com/lex
00:04:02.240 | to get a discount and to support this podcast.
00:04:04.700 | And now, here's my conversation with David Silver.
00:04:08.740 | What was the first program you've ever written
00:04:12.140 | and what programming language?
00:04:13.880 | Do you remember?
00:04:14.840 | - I remember very clearly, yeah.
00:04:16.680 | My parents brought home this BBC Model B microcomputer.
00:04:21.680 | It was just this fascinating thing to me.
00:04:24.200 | I was about seven years old
00:04:26.160 | and couldn't resist just playing around with it.
00:04:30.000 | So I think first program ever
00:04:31.480 | was writing my name out in different colors
00:04:36.840 | and getting it to loop and repeat that.
00:04:39.560 | And there was something magical about that
00:04:41.640 | which just led to more and more.
00:04:44.520 | - How did you think about computers back then?
00:04:46.320 | Like the magical aspect of it,
00:04:47.960 | that you can write a program
00:04:49.680 | and there's this thing that you just gave birth to
00:04:52.900 | that's able to create sort of visual elements
00:04:56.280 | and live on its own.
00:04:57.680 | Or did you not think of it in those romantic notions?
00:05:00.000 | Was it more like, oh, that's cool.
00:05:02.480 | I can solve some puzzles.
00:05:05.320 | - It was always more than solving puzzles.
00:05:06.920 | It was something where there was this
00:05:09.880 | limitless possibilities
00:05:13.440 | once you have a computer in front of you,
00:05:14.760 | you can do anything with it.
00:05:16.440 | I used to play with Lego with the same feeling.
00:05:18.040 | You can make anything you want out of Lego,
00:05:20.000 | but even more so with a computer.
00:05:21.880 | You're not constrained by the amount of kit you've got.
00:05:24.560 | And so I was fascinated by it
00:05:25.720 | and started pulling out the user guide
00:05:28.480 | and the advanced user guide and then learning.
00:05:30.720 | So I started in BASIC and then later 6502.
00:05:34.640 | My father also became interested in this machine
00:05:38.400 | and gave up his career to go back to school
00:05:40.280 | and study for a master's degree in artificial intelligence,
00:05:44.480 | funnily enough, at Essex University when I was seven.
00:05:48.600 | So I was exposed to those things at an early age.
00:05:52.040 | He showed me how to program in Prolog
00:05:54.840 | and do things like querying your family tree.
00:05:57.640 | And those are some of my earliest memories
00:05:59.800 | of trying to figure things out on a computer.
00:06:04.100 | - Those are the early steps in computer science programming.
00:06:07.160 | But when did you first fall in love
00:06:09.320 | with artificial intelligence or with the ideas,
00:06:12.060 | the dreams of AI?
00:06:13.300 | - I think it was really when I went to study at university.
00:06:19.040 | So I was an undergrad at Cambridge
00:06:20.920 | and studying computer science.
00:06:23.820 | And I really started to question,
00:06:27.600 | you know, what really are the goals?
00:06:29.280 | What's the goal?
00:06:30.120 | Where do we want to go with computer science?
00:06:32.800 | And it seemed to me that the only step
00:06:37.400 | of major significance to take was to try
00:06:40.920 | and recreate something akin to human intelligence.
00:06:44.240 | If we could do that, that would be a major leap forward.
00:06:47.540 | And that idea, I certainly wasn't the first to have it,
00:06:51.000 | but it nestled within me somewhere and became like a bug.
00:06:55.480 | You know, I really wanted to crack that problem.
00:06:58.920 | - So you thought it was, like you had a notion
00:07:00.800 | that this is something that human beings can do,
00:07:03.000 | that it is possible to create an intelligent machine?
00:07:07.320 | - Well, I mean, unless you believe
00:07:09.160 | in something metaphysical, then what are our brains doing?
00:07:13.440 | Well, at some level, they're information processing systems,
00:07:17.260 | which are able to take whatever information is in there,
00:07:22.260 | transform it through some form of program
00:07:24.840 | and produce some kind of output,
00:07:26.160 | which enables that human being to do
00:07:28.640 | all the amazing things that they can do
00:07:29.880 | in this incredible world.
00:07:31.840 | - So then do you remember the first time
00:07:35.520 | you've written a program that,
00:07:37.960 | 'cause you also had an interest in games.
00:07:40.080 | Do you remember the first time you were in a program
00:07:41.960 | that beat you in a game?
00:07:43.760 | Or beat you at anything?
00:07:47.360 | Sort of achieved super David Silver level performance?
00:07:52.360 | - So I used to work in the games industry.
00:07:56.440 | So for five years, I programmed games for my first job.
00:08:01.300 | So it was an amazing opportunity to get involved
00:08:03.640 | in a startup company.
00:08:05.820 | And so I was involved in building AI at that time.
00:08:10.820 | And so for sure, there was a sense of building handcrafted,
00:08:17.100 | what people used to call AI in the games industry,
00:08:20.280 | which I think is not really what we might think of
00:08:22.420 | as AI in its fullest sense,
00:08:24.020 | but something which is able to take actions
00:08:29.020 | in a way which makes things interesting and challenging
00:08:31.980 | for the human player.
00:08:34.680 | And at that time I was able to build
00:08:38.400 | these handcrafted agents,
00:08:39.440 | which in certain limited cases could do things
00:08:41.400 | which were able to do better than me,
00:08:45.380 | but mostly in these kind of twitch-like scenarios
00:08:47.920 | where they were able to do things faster
00:08:50.020 | or because they had some pattern
00:08:51.720 | which was able to exploit repeatedly.
00:08:55.420 | I think if we're talking about real AI,
00:08:57.740 | the first experience for me came after that
00:09:00.840 | when I realized that this path I was on
00:09:05.640 | wasn't taking me towards,
00:09:06.840 | it wasn't dealing with that bug,
00:09:09.120 | which I still had inside me to really understand intelligence
00:09:12.320 | and try and solve it.
00:09:14.280 | Everything people were doing in games was,
00:09:17.100 | short term fixes rather than long term vision.
00:09:19.920 | And so I went back to study for my PhD,
00:09:23.600 | which was, funnily enough, trying to apply reinforcement learning
00:09:26.920 | to the game of Go.
00:09:28.440 | And I built my first Go program
00:09:30.680 | using reinforcement learning,
00:09:31.880 | a system which would by trial and error play against itself
00:09:35.600 | and was able to learn which patterns were actually helpful
00:09:40.560 | to predict whether it was gonna win or lose the game
00:09:42.800 | and then choose the moves
00:09:44.360 | that led to the combination of patterns
00:09:46.200 | that would mean that you're more likely to win.
00:09:48.320 | And that system, that system beat me.
00:09:50.400 | - And how did that make you feel?
00:09:53.440 | - Made me feel good.
00:09:54.440 | - I mean, was there sort of the, yeah.
00:09:57.000 | I mean, it's a mix of a sort of excitement
00:09:59.600 | and was there a tinge of sort of like,
00:10:02.520 | almost like a fearful awe?
00:10:04.480 | You know, it's like in "2001: A Space Odyssey,"
00:10:08.280 | kind of realizing that you've created something that,
00:10:12.720 | that is, you know, that's achieved human level intelligence
00:10:19.160 | in this one particular little task.
00:10:21.160 | And in that case,
00:10:22.000 | I suppose neural networks weren't involved.
00:10:24.320 | - There were no neural networks in those days.
00:10:26.840 | This was pre deep learning revolution,
00:10:29.280 | but it was a principled self-learning system
00:10:33.000 | based on a lot of the principles
00:10:34.200 | which people are still using in deep reinforcement learning.
00:10:38.920 | How did I feel?
00:10:41.200 | I think I found it immensely satisfying
00:10:46.200 | that a system which was able to learn
00:10:49.600 | from first principles for itself
00:10:51.320 | was able to reach the point
00:10:52.400 | that it was understanding this domain
00:10:56.240 | better than I could and able to outwit me.
00:10:59.160 | I don't think it was a sense of awe.
00:11:01.560 | It was a sense of satisfaction,
00:11:04.560 | that something I felt should work had worked.
00:11:08.640 | - So to me, AlphaGo, and I don't know how else to put it,
00:11:11.840 | but to me, AlphaGo and AlphaGo Zero,
00:11:14.560 | mastering the game of Go is, again, to me,
00:11:18.520 | the most profound and inspiring moment
00:11:20.400 | in the history of artificial intelligence.
00:11:23.440 | So you're one of the key people behind this achievement,
00:11:26.560 | and I'm Russian, so I really felt
00:11:29.240 | the first sort of seminal achievement
00:11:31.840 | when Deep Blue beat Garry Kasparov in 1997.
00:11:34.800 | So as far as I know, the AI community at that point
00:11:40.680 | largely saw the game of Go as unbeatable in AI,
00:11:43.960 | using the sort of the state-of-the-art
00:11:46.160 | brute force methods, search methods.
00:11:48.760 | Even if you consider, at least the way I saw it,
00:11:51.480 | even if you consider arbitrary,
00:11:54.280 | exponential scaling of compute,
00:11:57.280 | Go would still not be solvable,
00:11:59.160 | hence why it was thought to be impossible.
00:12:01.360 | So given that the game of Go was considered impossible to master,
00:12:07.440 | what was the dream for you?
00:12:09.440 | You just mentioned your PhD thesis
00:12:11.440 | of building the system that plays Go.
00:12:14.000 | What was the dream for you
00:12:15.120 | that you could actually build a computer program
00:12:17.360 | that achieves world-class,
00:12:20.080 | not necessarily beats the world champion,
00:12:21.880 | but achieves that kind of level of playing Go?
00:12:24.880 | - First of all, thank you, that was very kind words.
00:12:27.480 | And funnily enough, I just came from a panel
00:12:31.360 | where I was actually in a conversation
00:12:34.520 | with Garry Kasparov and Murray Campbell,
00:12:36.080 | who was the author of Deep Blue,
00:12:38.080 | and it was their first meeting together since the match.
00:12:43.280 | So that just occurred yesterday,
00:12:44.520 | so I'm literally fresh from that experience.
00:12:47.320 | So these are amazing moments when they happen,
00:12:50.760 | but where did it all start?
00:12:52.280 | Well, for me, it started
00:12:53.880 | when I became fascinated in the game of Go.
00:12:56.120 | So Go for me, I've grown up playing games,
00:12:59.160 | I've always had a fascination in board games.
00:13:01.840 | I played chess as a kid, I played Scrabble as a kid.
00:13:04.840 | When I was at university, I discovered the game of Go,
00:13:08.960 | and to me, it just blew all of those other games
00:13:11.200 | out of the water.
00:13:12.040 | It was just so deep and profound in its complexity
00:13:15.560 | with endless levels to it.
00:13:17.720 | What I discovered was that I could devote endless hours
00:13:22.720 | to this game, and I knew in my heart of hearts
00:13:28.200 | that no matter how many hours I would devote to it,
00:13:30.320 | I would never become a grandmaster.
00:13:34.320 | Or there was another path, and the other path
00:13:36.640 | was to try and understand how you could get
00:13:39.480 | some other intelligence to play this game
00:13:41.760 | better than I would be able to.
00:13:43.480 | And so even in those days, I had this idea that,
00:13:46.760 | what if, what if it was possible to build a program
00:13:49.320 | that could crack this?
00:13:51.080 | And as I started to explore the domain,
00:13:53.200 | I discovered that this was really the domain
00:13:57.440 | where people felt deeply that if progress
00:14:01.240 | could be made in Go, it would really mean
00:14:04.560 | a giant leap forward for AI.
00:14:06.280 | It was the challenge where all other approaches had failed.
00:14:10.920 | This is coming out of the era you mentioned,
00:14:13.400 | which was in some sense the golden era
00:14:15.920 | for the classical methods of AI, like heuristic search.
00:14:19.880 | In the '90s, they all fell one after another,
00:14:23.280 | not just chess with deep blue, but checkers,
00:14:26.520 | backgammon, Othello.
00:14:28.840 | There were numerous cases where systems built
00:14:33.640 | on top of heuristic search methods
00:14:35.880 | with these high-performance systems
00:14:37.920 | had been able to defeat the human world champion
00:14:40.320 | in each of those domains.
00:14:41.920 | And yet in that same time period,
00:14:43.840 | there was a million dollar prize available
00:14:47.360 | for the game of Go, for the first system
00:14:50.640 | to beat a human professional player.
00:14:52.640 | And at the end of that time period,
00:14:54.520 | at year 2000, when the prize expired,
00:14:57.080 | the strongest Go program in the world
00:15:00.000 | was defeated by a nine-year-old child
00:15:02.640 | when that nine-year-old child was giving nine free moves
00:15:05.800 | to the computer at the start of the game
00:15:07.480 | to try and even things up.
00:15:09.800 | And a computer Go expert beat that same strongest program
00:15:13.880 | with 29 handicap stones, 29 free moves.
00:15:18.120 | So that's what the state of affairs was
00:15:20.400 | when I became interested in this problem
00:15:22.480 | in around 2003, when I started working on computer Go.
00:15:28.320 | There was nothing.
00:15:30.320 | There was very, very little in the way of progress
00:15:34.000 | towards meaningful performance,
00:15:36.640 | again, at anything approaching human level.
00:15:39.120 | And so people, it wasn't through lack of effort.
00:15:42.840 | People had tried many, many things.
00:15:44.880 | And so there was a strong sense
00:15:46.600 | that something different would be required for Go
00:15:49.800 | than had been needed for all of these other domains
00:15:52.120 | where AI had been successful.
00:15:54.160 | And maybe the single clearest example
00:15:56.280 | is that Go, unlike those other domains,
00:15:58.640 | had this kind of intuitive property
00:16:02.360 | that a Go player would look at a position and say,
00:16:05.400 | "Hey, here's this mess of black and white stones.
00:16:09.440 | "But from this mess, oh, I can predict
00:16:12.600 | "that this part of the board has become my territory.
00:16:15.720 | "This part of the board has become your territory.
00:16:17.760 | "And I've got this overall sense that I'm gonna win
00:16:20.120 | "and that this is about the right move to play."
00:16:22.320 | And that intuitive sense of judgment,
00:16:24.680 | of being able to evaluate what's going on in a position,
00:16:28.160 | it was pivotal to humans being able to play this game
00:16:31.720 | and something that people had no idea
00:16:33.240 | how to put into computers.
00:16:34.960 | So this question of how to evaluate a position,
00:16:37.680 | how to come up with these intuitive judgments
00:16:40.040 | was the key reason why Go was so hard,
00:16:43.680 | in addition to its enormous search space,
00:16:47.400 | and the reason why methods which had succeeded so well
00:16:51.400 | elsewhere failed in Go.
00:16:53.160 | And so people really felt deep down that,
00:16:55.760 | you know, in order to crack Go,
00:16:57.880 | we would need to get something akin to human intuition.
00:17:00.360 | And if we got something akin to human intuition,
00:17:02.600 | we'd be able to solve, you know,
00:17:04.560 | many, many more problems in AI.
00:17:06.760 | So to me, that was the moment where it's like,
00:17:09.120 | "Okay, this is not just about playing the game of Go.
00:17:11.800 | "This is about something profound."
00:17:13.520 | And it was back to that bug,
00:17:14.880 | which had been itching me all those years.
00:17:17.480 | Now this is the opportunity to do something meaningful
00:17:19.520 | and transformative, and I guess a dream was born.
00:17:23.640 | - That's a really interesting way to put it.
00:17:25.200 | So almost this realization that you need to find,
00:17:29.000 | formulate Go as a kind of a prediction problem
00:17:31.400 | versus a search problem was the intuition.
00:17:34.760 | I mean, maybe that's the wrong crude term,
00:17:37.320 | but to give it the ability to kind of intuit things
00:17:42.320 | about positional structure of the board.
00:17:47.000 | Now, okay, but what about the learning part of it?
00:17:51.280 | Did you have a sense that you have to,
00:17:55.120 | that learning has to be part of the system?
00:17:57.520 | Again, something that hasn't really,
00:17:59.880 | as far as I think, except with TD-Gammon
00:18:02.720 | in the '90s with RL a little bit,
00:18:05.160 | hasn't been part of those state-of-the-art
00:18:06.920 | game-playing systems.
00:18:08.520 | - So I strongly felt that learning would be necessary,
00:18:12.760 | and that's why my PhD topic back then
00:18:15.400 | was trying to apply reinforcement learning
00:18:18.160 | to the game of Go, and not just learning of any type,
00:18:21.760 | but I felt that the only way to really have a system
00:18:26.120 | to progress beyond human levels of performance
00:18:29.160 | wouldn't just be to mimic how humans do it,
00:18:31.000 | but to understand for themselves.
00:18:33.080 | And how else can a machine hope to understand
00:18:36.520 | what's going on except through learning?
00:18:38.960 | If you're not learning, what else are you doing?
00:18:40.360 | Well, you're putting all the knowledge into the system,
00:18:42.520 | and that just feels like something which decades of AI
00:18:47.520 | have told us is maybe not a dead end,
00:18:50.520 | but certainly has a ceiling to the capabilities.
00:18:53.320 | It's known as the knowledge acquisition bottleneck,
00:18:55.320 | that the more you try to put into something,
00:18:58.400 | the more brittle the system becomes.
00:19:00.320 | And so you just have to have learning.
00:19:02.720 | You have to have learning.
00:19:03.560 | That's the only way you're going to be able to get
00:19:06.360 | a system which has sufficient knowledge in it,
00:19:10.320 | millions and millions of pieces of knowledge,
00:19:11.840 | billions, trillions, of a form
00:19:14.160 | that it can actually apply for itself,
00:19:15.520 | and understand how those billions and trillions
00:19:17.960 | of pieces of knowledge can be leveraged in a way
00:19:20.880 | which will actually lead it towards its goal
00:19:22.760 | without conflict or other issues.
00:19:27.440 | - Yeah, I mean, if I put myself back in that time,
00:19:30.560 | I just wouldn't think like that.
00:19:32.240 | Without a good demonstration of RL,
00:19:35.480 | I would think more in the symbolic AI,
00:19:37.720 | like not learning, but sort of a simulation
00:19:42.720 | of knowledge base, like a growing knowledge base,
00:19:46.880 | but it would still be sort of pattern-based,
00:19:49.520 | basically have little rules that you kind of
00:19:53.560 | assemble together into a large knowledge base.
00:19:56.600 | - Well, in a sense, that was the state of the art back then.
00:19:59.760 | So if you look at the Go programs,
00:20:01.080 | which had been competing for this prize I mentioned,
00:20:04.360 | they were an assembly of different specialized systems,
00:20:09.800 | some of which used huge amounts of human knowledge
00:20:11.840 | to describe how you should play the opening,
00:20:14.800 | how you should, all the different patterns
00:20:16.680 | that were required to play well in the game of Go,
00:20:19.840 | end game theory, combinatorial game theory,
00:20:24.560 | and combined with more principled search-based methods,
00:20:28.600 | which were trying to solve for particular sub parts
00:20:31.240 | of the game, like life and death,
00:20:34.040 | connecting groups together,
00:20:36.800 | all these amazing sub problems
00:20:38.080 | that just emerged in the game of Go,
00:20:40.400 | there were different pieces all put together
00:20:43.240 | into this like collage, which together
00:20:45.760 | would try and play against a human.
00:20:48.120 | And although not all of the pieces were handcrafted,
00:20:54.600 | the overall effect was nevertheless still brittle,
00:20:56.760 | and it was hard to make all these pieces work well together.
00:21:00.240 | And so really, what I was pressing for,
00:21:02.680 | and the main innovation of the approach I took,
00:21:05.600 | was to go back to first principles and say,
00:21:08.440 | well, let's back off that and try and find
00:21:11.360 | a principled approach where the system can learn for itself,
00:21:14.880 | just from the outcome, like, learn for itself.
00:21:19.280 | If you try something, did that help or did it not help?
00:21:22.680 | And only through that procedure can you arrive
00:21:25.600 | at knowledge which is verified.
00:21:27.960 | The system has to verify it for itself,
00:21:29.760 | not relying on any other third party
00:21:31.640 | to say this is right or this is wrong.
00:21:33.560 | And so that principle was already very important
00:21:38.160 | in those days, but unfortunately,
00:21:39.840 | we were missing some important pieces back then.
00:21:43.280 | - So before we dive into maybe discussing
00:21:47.440 | the beauty of reinforcement learning,
00:21:49.120 | let's take a step back, we kind of skipped it a bit,
00:21:52.640 | but the rules of the game of Go.
00:21:55.720 | The elements of it perhaps contrasting to chess
00:22:02.120 | that sort of you really enjoy as a human being,
00:22:07.080 | and also that make it really difficult
00:22:09.600 | as a AI machine learning problem.
00:22:13.080 | - So the game of Go has remarkably simple rules.
00:22:16.720 | In fact, so simple that people have speculated
00:22:19.160 | that if we were to meet alien life at some point,
00:22:22.200 | that we wouldn't be able to communicate with them,
00:22:23.800 | but we would be able to play a game of Go with them.
00:22:25.960 | So they probably have discovered the same rule set.
00:22:28.960 | So the game is played on a 19 by 19 grid,
00:22:32.240 | and you play on the intersections of the grid
00:22:34.120 | and the players take turns.
00:22:35.560 | And the aim of the game is very simple,
00:22:37.560 | it's to surround as much territory as you can,
00:22:40.800 | as many of these intersections with your stones,
00:22:43.600 | and to surround more than your opponent does.
00:22:46.200 | And the only nuance to the game is that
00:22:48.800 | if you fully surround your opponent's piece,
00:22:50.480 | then you get to capture it and remove it from the board
00:22:52.440 | and it counts as your own territory.
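
To make that capture rule concrete, here is a minimal Python sketch of the underlying idea: a connected group of stones is captured when it has no empty adjacent intersections (no liberties). The board representation and function names below are illustrative assumptions, not any standard Go library's API.

```python
# Rough sketch of the capture rule described above, under the assumption that
# a board is a dict mapping (row, col) intersections to "black" or "white".
# A connected group with no empty adjacent points ("liberties") is captured.

BOARD_SIZE = 19

def neighbors(point):
    """The up-to-four adjacent intersections on a 19x19 board."""
    r, c = point
    return [(r + dr, c + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < BOARD_SIZE and 0 <= c + dc < BOARD_SIZE]

def group_and_liberties(board, point):
    """Flood-fill the connected group containing `point`; return (group, liberties)."""
    color = board[point]
    group, liberties, frontier = set(), set(), [point]
    while frontier:
        p = frontier.pop()
        if p in group:
            continue
        group.add(p)
        for n in neighbors(p):
            if n not in board:
                liberties.add(n)            # empty adjacent intersection
            elif board[n] == color:
                frontier.append(n)          # same-colored stone joins the group
    return group, liberties

# A lone black stone in the corner, fully surrounded by white: no liberties left,
# so under the rule above it would be captured and removed from the board.
board = {(0, 0): "black", (0, 1): "white", (1, 0): "white"}
group, liberties = group_and_liberties(board, (0, 0))
print(len(liberties) == 0)   # True
```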
00:22:54.440 | Now from those very simple rules,
00:22:56.480 | immense complexity arises.
00:22:58.320 | There's kind of profound strategies
00:22:59.800 | in how to surround territory,
00:23:01.960 | how to kind of trade off between
00:23:04.640 | making solid territory yourself now,
00:23:07.120 | compared to building up influence
00:23:09.240 | that will help you acquire territory later in the game,
00:23:11.280 | how to connect groups together,
00:23:12.560 | how to keep your own groups alive,
00:23:14.400 | which patterns of stones are most useful
00:23:19.920 | compared to others.
00:23:21.480 | There's just immense knowledge
00:23:23.880 | and human Go players have played this game for,
00:23:27.160 | it was discovered thousands of years ago,
00:23:29.240 | and human Go players have built up
00:23:30.840 | this immense knowledge base over the years.
00:23:33.720 | It's studied very deeply and played by
00:23:36.280 | something like 50 million players across the world,
00:23:38.760 | mostly in China, Japan, and Korea,
00:23:41.200 | where it's an important part of the culture,
00:23:43.640 | so much so that it's considered one of the
00:23:45.880 | four ancient arts that were required by Chinese scholars.
00:23:49.840 | So there's a deep history there.
00:23:51.680 | - But there's interesting qualities,
00:23:53.080 | so if I were to compare it to chess,
00:23:55.640 | in the same way as Go is
00:23:58.000 | in Chinese culture,
00:23:59.360 | chess in Russia is also considered
00:24:01.880 | one of the sacred arts.
00:24:03.960 | So if we contrast Go with chess,
00:24:06.440 | there's interesting qualities about Go.
00:24:08.440 | Maybe you can correct me if I'm wrong,
00:24:10.840 | but the evaluation of a particular static board
00:24:15.720 | is not as reliable.
00:24:18.920 | You can't, in chess you can kind of assign points
00:24:21.800 | to the different units,
00:24:23.840 | and it's kind of a pretty good measure
00:24:26.600 | of who's winning, who's losing.
00:24:28.000 | It's not so clear.
00:24:29.800 | - Yeah, so in the game of Go,
00:24:31.080 | you know, you find yourself in a situation
00:24:32.800 | where both players have played the same number of stones.
00:24:36.000 | Actually, captures at a strong level of play
00:24:38.360 | happen very rarely, which means that
00:24:40.280 | at any moment in the game you've got
00:24:41.400 | the same number of white stones and black stones,
00:24:43.720 | and the only thing which differentiates
00:24:45.160 | how well you're doing is this intuitive sense
00:24:48.200 | of where are the territories ultimately
00:24:50.760 | gonna form on this board.
00:24:52.520 | And if you look at the complexity of a real Go position,
00:24:55.680 | you know, it's mind boggling that kind of question
00:25:00.560 | of what will happen in 300 moves from now
00:25:02.680 | when you see just a scattering of 20 white
00:25:05.400 | and black stones intermingled.
00:25:06.920 | And so that challenge is the reason why
00:25:12.840 | position evaluation is so hard in Go compared to other games.
00:25:17.440 | In addition to that, it has an enormous search space.
00:25:19.280 | So there's around 10 to the 170 positions in the game of Go.
00:25:24.280 | That's an astronomical number.
00:25:26.200 | And that search space is so great
00:25:28.560 | that traditional heuristic search methods
00:25:30.480 | that were so successful in things like Deep Blue
00:25:32.480 | and chess programs just kind of fall over in Go.
00:25:36.080 | - So at which point did reinforcement learning
00:25:39.440 | enter your life, your research life, your way of thinking?
00:25:43.960 | We just talked about learning,
00:25:45.440 | but reinforcement learning is a very particular
00:25:47.760 | kind of learning.
00:25:49.680 | One that's both philosophically sort of profound,
00:25:53.080 | but also one that's pretty difficult to get to work
00:25:55.880 | as if we look back in the early days.
00:25:58.480 | So when did that enter your life
00:26:00.320 | and how did that work progress?
00:26:02.320 | - So I had just finished working in the games industry
00:26:06.280 | at this startup company.
00:26:07.680 | And I took a year out to discover for myself
00:26:12.680 | exactly which path I wanted to take.
00:26:14.800 | I knew I wanted to study intelligence,
00:26:17.120 | but I wasn't sure what that meant at that stage.
00:26:19.240 | I really didn't feel I had the tools to decide
00:26:22.000 | on exactly which path I wanted to follow.
00:26:24.840 | So during that year, I read a lot.
00:26:27.200 | And one of the things I read was Sutton and Barto,
00:26:31.480 | the sort of seminal textbook
00:26:33.360 | on an introduction to reinforcement learning.
00:26:35.920 | And when I read that textbook,
00:26:39.080 | I just had this resonating feeling
00:26:43.480 | that this is what I understood intelligence to be.
00:26:46.880 | And this was the path that I felt would be necessary
00:26:51.440 | to go down to make progress in AI.
00:26:54.880 | So I got in touch with Rich Sutton
00:27:00.280 | and asked him if he would be interested
00:27:02.720 | in supervising me on a PhD thesis in computer go.
00:27:07.720 | And he basically said
00:27:13.080 | that if he's still alive, he'd be happy to.
00:27:15.760 | But unfortunately, he'd been struggling
00:27:19.480 | with very serious cancer for some years,
00:27:21.800 | and he really wasn't confident at that stage
00:27:24.000 | that he'd even be around to see the end of it.
00:27:26.360 | But fortunately, that part of the story
00:27:28.680 | worked out very happily.
00:27:29.880 | And I found myself out there in Alberta.
00:27:32.800 | They've got a great games group out there
00:27:34.840 | with a history of fantastic work in board games as well,
00:27:38.680 | as Rich Sutton, the father of RL.
00:27:40.880 | So it was the natural place for me to go.
00:27:42.960 | In some sense to study this question.
00:27:45.880 | And the more I looked into it,
00:27:48.400 | the more strongly I felt that this wasn't just the path
00:27:53.400 | to progress in computer go,
00:27:56.240 | but really, this was the thing I'd been looking for.
00:27:59.320 | This was really an opportunity
00:28:04.320 | to frame what intelligence means.
00:28:08.400 | Like what are the goals of AI
00:28:10.640 | in a single clear problem definition,
00:28:14.240 | such that if we're able to solve
00:28:15.640 | that clear single problem definition,
00:28:17.520 | in some sense, we've cracked the problem of AI.
00:28:21.160 | - So to you, reinforcement learning ideas,
00:28:24.840 | at least sort of echoes of it,
00:28:26.200 | would be at the core of intelligence.
00:28:29.400 | It is at the core of intelligence.
00:28:31.320 | And if we ever create a human level intelligence system,
00:28:34.920 | it would be at the core of that kind of system.
00:28:37.480 | - Let me say it this way,
00:28:38.320 | that I think it's helpful to separate out
00:28:40.400 | the problem from the solution.
00:28:42.360 | So I see the problem of intelligence,
00:28:46.000 | I would say it can be formalized
00:28:48.480 | as the reinforcement learning problem.
00:28:50.720 | And that that formalization is enough
00:28:52.840 | to capture most, if not all of the things
00:28:56.200 | that we mean by intelligence.
00:28:57.520 | That they can all be brought within this framework
00:29:01.080 | and gives us a way to access them in a meaningful way
00:29:03.520 | that allows us as scientists to understand intelligence
00:29:08.640 | and us as computer scientists to build them.
00:29:11.760 | And so in that sense, I feel that it gives us a path,
00:29:16.280 | maybe not the only path, but a path towards AI.
00:29:20.360 | And so do I think that any system in the future
00:29:24.960 | that's solved AI would have to have RL within it?
00:29:29.720 | Well, I think if you ask that,
00:29:30.760 | you're asking about the solution methods.
00:29:33.480 | I would say that if we have such a thing,
00:29:35.560 | it would be a solution to the RL problem.
00:29:37.920 | Now, what particular methods have been used to get there?
00:29:41.240 | Well, we should keep an open mind about the best approaches
00:29:43.440 | to actually solve any problem.
00:29:45.720 | And the things we have right now
00:29:48.040 | for reinforcement learning,
00:29:49.480 | maybe I believe they've got a lot of legs,
00:29:53.560 | but maybe we're missing some things.
00:29:54.880 | Maybe there's gonna be better ideas.
00:29:56.520 | I think we should keep, let's remain modest
00:29:59.120 | and we're at the early days of this field
00:30:02.440 | and there are many amazing discoveries ahead of us.
00:30:05.040 | - For sure, the specifics,
00:30:06.320 | especially of the different kinds of RL approaches currently
00:30:09.640 | there could be other things that fall
00:30:11.320 | under the very large umbrella of RL.
00:30:13.480 | But if it's okay, can we take a step back
00:30:16.760 | and kind of ask the basic question
00:30:19.000 | of what is to you reinforcement learning?
00:30:22.620 | - So reinforcement learning is the study
00:30:25.560 | and the science and the problem of intelligence
00:30:30.560 | in the form of an agent that interacts with an environment.
00:30:35.520 | So the problem you're trying to solve
00:30:36.720 | is represented by some environment,
00:30:38.160 | like the world in which that agent is situated.
00:30:40.760 | And the goal of RL is clear
00:30:42.560 | that the agent gets to take actions.
00:30:44.760 | Those actions have some effect on the environment
00:30:47.600 | and the environment gives back an observation
00:30:49.240 | to the agent saying, this is what you see or sense.
00:30:51.920 | And one special thing which it gives back
00:30:54.840 | is called the reward signal,
00:30:56.360 | how well it's doing in the environment.
00:30:58.120 | And the reinforcement learning problem
00:30:59.920 | is to simply take actions over time
00:31:04.420 | so as to maximize that reward signal.
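
To make the agent-environment loop he describes concrete, here is a minimal Python sketch of that interaction cycle; the Environment and Agent classes and their method names are illustrative assumptions rather than any particular library's API.

```python
# Minimal sketch of the reinforcement learning interaction loop described above.
# Environment and Agent are hypothetical stand-ins, not a real library's API.

class Environment:
    def reset(self):
        """Start a new episode and return the first observation."""
        return 0

    def step(self, action):
        """Apply the agent's action; return (observation, reward, done)."""
        observation = 0                         # what the agent senses next
        reward = 1.0 if action == 1 else 0.0    # the special reward signal
        done = True                             # whether the episode has ended
        return observation, reward, done

class Agent:
    def act(self, observation):
        """Pick an action given the current observation."""
        return 1

    def learn(self, observation, action, reward, next_observation, done):
        """Update internal parameters from this piece of experience."""
        pass

env, agent = Environment(), Agent()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = agent.act(obs)                     # agent acts on the environment
    next_obs, reward, done = env.step(action)   # environment responds with observation and reward
    agent.learn(obs, action, reward, next_obs, done)
    total_reward += reward                      # the quantity RL tries to maximize over time
    obs = next_obs
print(total_reward)
```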
00:31:06.280 | - So a couple of basic questions,
00:31:10.160 | what types of RL approaches are there?
00:31:13.920 | So I don't know if there's a nice, brief, in-words way
00:31:17.880 | to paint a picture of sort of value-based,
00:31:21.560 | model-based, policy-based reinforcement learning.
00:31:25.880 | - Yeah, so now if we think about,
00:31:27.920 | okay, so there's this ambitious problem definition of RL.
00:31:32.000 | It's really, you know, it's truly ambitious.
00:31:33.440 | It's trying to capture and encircle all of the things
00:31:35.640 | in which an agent interacts with an environment and say,
00:31:38.040 | well, how can we formalize and understand
00:31:39.880 | what it means to crack that?
00:31:42.040 | Now let's think about the solution method.
00:31:43.880 | Well, how do you solve a really hard problem like that?
00:31:46.560 | Well, one approach you can take is to decompose
00:31:49.560 | that very hard problem into pieces that work together
00:31:53.800 | to solve that hard problem.
00:31:55.440 | And so you can kind of look at the decomposition
00:31:58.120 | that's inside the agent's head, if you like,
00:32:00.740 | and ask, well, what form does that decomposition take?
00:32:03.860 | And some of the most common pieces that people use
00:32:06.260 | when they're kind of putting the solution method together
00:32:11.760 | are whether or not that solution has a value function.
00:32:14.940 | That means, is it trying to predict,
00:32:16.860 | explicitly trying to predict how much reward
00:32:18.660 | it will get in the future?
00:32:20.180 | Does it have a representation of a policy?
00:32:22.860 | That means something which is deciding how to pick actions.
00:32:25.820 | Is that decision-making process explicitly represented?
00:32:29.100 | And is there a model in the system?
00:32:32.060 | Is there something which is explicitly trying to predict
00:32:34.480 | what will happen in the environment?
00:32:36.620 | And so those three pieces are to me,
00:32:40.600 | some of the most common building blocks.
00:32:42.420 | And I understand the different choices in RL
00:32:47.100 | as choices of whether or not to use those building blocks
00:32:49.940 | when you're trying to decompose the solution.
00:32:52.660 | Should I have a value function represented?
00:32:54.340 | Should I have a policy represented?
00:32:56.780 | Should I have a model represented?
00:32:58.500 | And there are combinations of those pieces
00:33:00.260 | and of course other things
00:33:01.140 | that you could add into the picture as well.
00:33:03.220 | But those three fundamental choices give rise
00:33:05.460 | to some of the branches of RL
00:33:06.980 | with which we are very familiar.
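
As a rough illustration of those three building blocks, the sketch below gives each one its usual signature for a small discrete problem; the arrangement is an assumption made for exposition, not a specific algorithm from the conversation.

```python
from collections import defaultdict

# Rough sketch of the three building blocks mentioned above for a small
# discrete problem: a value function, a policy, and a model. The structure
# is illustrative only; real agents may represent any subset of these.

class AgentComponents:
    def __init__(self, actions):
        self.actions = actions
        self.value = defaultdict(float)         # value function: state -> predicted future reward
        self.preferences = defaultdict(float)   # policy: (state, action) -> preference for picking it
        self.model = {}                         # model: (state, action) -> predicted (next_state, reward)

    def predict_value(self, state):
        """Value function: how much reward do we expect from this state onwards?"""
        return self.value[state]

    def select_action(self, state):
        """Policy: the explicit decision-making process for choosing an action."""
        return max(self.actions, key=lambda a: self.preferences[(state, a)])

    def imagine(self, state, action):
        """Model: what do we predict will happen in the environment?"""
        return self.model.get((state, action), (state, 0.0))

agent = AgentComponents(actions=["left", "right"])
print(agent.select_action("start"))   # with no learned preferences yet, ties break to "left"
```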
00:33:08.660 | - And so those, as you mentioned,
00:33:10.940 | there is a choice of what's specified
00:33:14.340 | or modeled explicitly.
00:33:17.340 | And the idea is that all of these
00:33:20.500 | are somehow implicitly learned within the system.
00:33:23.460 | So it's almost a choice of how you approach a problem
00:33:28.540 | do you see those as fundamental differences
00:33:30.340 | or are these almost like small specifics,
00:33:35.340 | like the details of how you solve the problem
00:33:37.540 | but they're not fundamentally different from each other?
00:33:40.940 | - I think the fundamental idea is maybe at the higher level.
00:33:45.940 | The fundamental idea is the first step of the decomposition
00:33:49.580 | is really to say, well, how are we really gonna solve
00:33:53.980 | any kind of problem where you're trying to figure out
00:33:56.540 | how to take actions and just from this stream
00:33:58.780 | of observations, you've got some agent situated
00:34:01.100 | in its sensory motor stream
00:34:02.180 | and getting all these observations in,
00:34:04.340 | getting to take these actions and what should it do?
00:34:06.180 | How can you even broach that problem?
00:34:07.820 | Maybe the complexity of the world is so great
00:34:10.820 | that you can't even imagine how to build a system
00:34:13.260 | that would understand how to deal with that.
00:34:15.740 | And so the first step of this decomposition is to say,
00:34:18.580 | well, you have to learn, the system has to learn for itself.
00:34:22.060 | And so note that the reinforcement learning problem
00:34:24.460 | doesn't actually stipulate that you have to learn.
00:34:27.100 | Like you could maximize your rewards without learning.
00:34:29.380 | It would just, wouldn't do a very good job of it.
00:34:32.420 | So learning is required because it's the only way
00:34:35.380 | to achieve good performance in any sufficiently large
00:34:38.220 | and complex environment.
00:34:40.500 | So that's the first step.
00:34:42.260 | And so that step gives commonality
00:34:43.740 | to all of the other pieces.
00:34:45.340 | 'Cause now you might ask, well, what should you be learning?
00:34:48.780 | What does learning even mean?
00:34:50.260 | In this sense, learning might mean, well,
00:34:52.980 | you're trying to update the parameters of some system,
00:34:57.340 | which is then the thing that actually picks the actions.
00:35:00.860 | And those parameters could be representing anything.
00:35:03.460 | They could be parameterizing a value function
00:35:05.700 | or a model or a policy.
00:35:08.540 | And so in that sense, there's a lot of commonality
00:35:10.860 | in that whatever is being represented there
00:35:12.380 | is the thing which is being learned
00:35:13.580 | and it's being learned with the ultimate goal
00:35:15.740 | of maximizing rewards.
00:35:17.500 | But the way in which you decompose the problem
00:35:20.300 | is really what gives the semantics to the whole system.
00:35:23.140 | Like, are you trying to learn something to predict well,
00:35:27.300 | like a value function or a model?
00:35:28.580 | Are you learning something to perform well, like a policy?
00:35:31.700 | And the form of that objective
00:35:34.020 | is kind of giving the semantics to the system.
00:35:36.300 | And so it really is at the next level down,
00:35:39.300 | a fundamental choice.
00:35:40.340 | And we have to make those fundamental choices
00:35:42.900 | as system designers or enable our algorithms
00:35:46.180 | to be able to learn how to make those choices
00:35:48.220 | for themselves.
00:35:49.340 | - So then the next step you mentioned,
00:35:51.460 | the very first thing you have to deal with is,
00:35:56.020 | can you even take in this huge stream of observations
00:36:00.060 | and do anything with it?
00:36:01.540 | So the natural next basic question is,
00:36:05.060 | what is deep reinforcement learning?
00:36:08.140 | And what is this idea of using neural networks
00:36:11.540 | to deal with this huge incoming stream?
00:36:14.580 | - So amongst all the approaches for reinforcement learning,
00:36:18.220 | deep reinforcement learning is one family
00:36:21.420 | of solution methods that tries to utilize
00:36:25.420 | powerful representations that are offered by neural networks
00:36:31.620 | to represent any of these different components
00:36:35.740 | of the solution, of the agent.
00:36:37.980 | Like whether it's the value function,
00:36:39.620 | or the model, or the policy,
00:36:41.820 | the idea of deep learning is to say,
00:36:43.460 | well, here's a powerful toolkit that's so powerful
00:36:46.700 | that it's universal in the sense
00:36:48.180 | that it can represent any function
00:36:50.140 | and it can learn any function.
00:36:52.020 | And so if we can leverage that universality,
00:36:55.020 | that means that whatever we need to represent
00:36:57.940 | for our policy or for our value function, for our model,
00:37:00.260 | deep learning can do it.
00:37:01.940 | So that deep learning is one approach
00:37:04.860 | that offers us a toolkit that has no ceiling
00:37:08.620 | to its performance.
00:37:09.460 | That as we start to put more resources into the system,
00:37:12.460 | more memory and more computation,
00:37:14.860 | and more data, more experience,
00:37:18.260 | more interactions with the environment,
00:37:20.140 | that these are systems that can just get better
00:37:22.220 | and better and better at doing whatever the job is
00:37:24.420 | we've asked them to do.
00:37:25.540 | Whatever we've asked that function to represent,
00:37:27.940 | it can learn a function that does a better and better job
00:37:32.180 | of representing that knowledge.
00:37:34.460 | Whether that knowledge be estimating
00:37:36.620 | how well you're gonna do in the world, the value function,
00:37:38.820 | whether it's gonna be choosing what to do in the world,
00:37:41.780 | the policy, or whether it's understanding the world itself,
00:37:44.900 | what's gonna happen next, the model.
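
As a small illustration of the idea, the sketch below uses a neural network to stand in for one of those components, a value function mapping an observation vector to a predicted return; the layer sizes are arbitrary choices and PyTorch is assumed only for convenience.

```python
import torch
import torch.nn as nn

# Sketch of a neural network representing one agent component discussed above:
# a value function that maps an observation to a predicted future reward.
# The architecture and sizes are arbitrary illustrative choices.

class ValueNetwork(nn.Module):
    def __init__(self, observation_size, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),   # single output: estimated value of the state
        )

    def forward(self, observation):
        return self.net(observation)

value_fn = ValueNetwork(observation_size=8)
obs = torch.randn(1, 8)              # a made-up observation vector
print(value_fn(obs))                 # the network's current estimate of how good this state is
```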
00:37:46.820 | - Nevertheless, the fact that neural networks
00:37:50.220 | are able to learn incredibly complex representations
00:37:54.980 | that allow you to do the policy, the model,
00:37:56.940 | or the value function,
00:37:58.240 | is, at least to my mind,
00:38:01.780 | exceptionally beautiful and surprising.
00:38:03.940 | Like, was it surprising to you?
00:38:08.780 | Can you still believe it works as well as it does?
00:38:11.500 | Do you have good intuition about why it works at all
00:38:14.780 | and works as well as it does?
00:38:16.700 | - I think, let me take two parts to that question.
00:38:22.940 | I think it's not surprising to me
00:38:27.420 | that the idea of reinforcement learning works,
00:38:30.780 | because in some sense, I think it's,
00:38:33.980 | I feel it's the only thing which can, ultimately.
00:38:37.460 | And so I feel we have to address it,
00:38:40.100 | and there must be successes possible,
00:38:42.540 | because we have examples of intelligence.
00:38:44.740 | And it must, at some level, be able to,
00:38:47.620 | possible to acquire experience
00:38:50.100 | and use that experience to do better
00:38:52.300 | in a way which is meaningful to environments
00:38:55.780 | of the complexity that humans can deal with.
00:38:57.740 | It must be.
00:38:58.580 | Am I surprised that our current systems
00:39:01.060 | can do as well as they can do?
00:39:02.560 | I think one of the big surprises for me
00:39:05.940 | and a lot of the community
00:39:09.580 | is really the fact that deep learning
00:39:14.180 | can continue to perform so well,
00:39:19.180 | despite the fact that these neural networks
00:39:22.680 | that they're representing
00:39:23.940 | have these incredibly nonlinear kind of bumpy surfaces,
00:39:28.060 | which to our kind of low dimensional intuitions
00:39:31.220 | make it feel like, surely you're just gonna get stuck,
00:39:33.900 | and learning will get stuck,
00:39:35.220 | because you won't be able to make any further progress.
00:39:38.620 | And yet, the big surprise is that learning continues,
00:39:43.220 | and these, what appear to be local optima,
00:39:46.500 | turn out not to be, because in high dimensions,
00:39:48.700 | when we make really big neural nets,
00:39:50.440 | there's always a way out,
00:39:52.200 | and there's a way to go even lower,
00:39:53.620 | and then you're still not in a local optima,
00:39:56.540 | because there's some other pathway
00:39:57.800 | that will take you out and take you lower still.
00:40:00.020 | And so no matter where you are,
00:40:01.260 | learning can proceed and do better and better and better,
00:40:05.180 | without bound.
00:40:07.020 | And so that is a surprising
00:40:10.580 | and beautiful property of neural nets,
00:40:13.860 | which I find elegant and beautiful,
00:40:17.500 | and somewhat shocking that it turns out to be the case.
00:40:21.020 | - As you said, which I really like,
00:40:23.120 | to our low dimensional intuitions, that's surprising.
00:40:28.120 | - Yeah, yeah, we're very tuned to working
00:40:32.540 | within a three dimensional environment,
00:40:34.500 | and so to start to visualize
00:40:36.860 | what a billion dimensional neural network surface
00:40:41.780 | that you're trying to optimize over,
00:40:43.260 | what that even looks like, is very hard for us.
00:40:46.100 | And so I think that really,
00:40:47.940 | if you try to account for the,
00:40:50.380 | essentially the AI winter,
00:40:54.260 | where people gave up on neural networks,
00:40:56.780 | I think it's really down to that lack of ability
00:41:00.300 | to generalize from low dimensions to high dimensions,
00:41:03.260 | because back then we were in the low dimensional case,
00:41:05.780 | people could only build neural nets with,
00:41:08.420 | 50 nodes in them or something.
00:41:11.460 | And to imagine that it might be possible
00:41:14.180 | to build a billion dimensional neural net,
00:41:15.980 | and it might have a completely different,
00:41:17.520 | qualitatively different property,
00:41:19.260 | was very hard to anticipate.
00:41:21.340 | And I think even now,
00:41:22.540 | we're starting to build the theory to support that.
00:41:26.420 | And it's incomplete at the moment,
00:41:28.260 | but all of the theory seems to be pointing in the direction
00:41:30.900 | that indeed this is an approach,
00:41:32.460 | which truly is universal,
00:41:34.820 | both in its representational capacity, which was known,
00:41:37.220 | but also in its learning ability, which is surprising.
00:41:40.860 | - And it makes one wonder what else we're missing,
00:41:44.780 | due to our low dimension intuitions,
00:41:47.740 | that will seem obvious once it's discovered.
00:41:51.720 | - I often wonder,
00:41:52.800 | when we one day do have,
00:41:55.900 | AIs which are superhuman in their abilities
00:42:00.980 | to understand the world,
00:42:02.940 | what will they think of the algorithms
00:42:07.540 | that we developed back now?
00:42:08.940 | Will it be, looking back at these days,
00:42:11.260 | and thinking that,
00:42:15.760 | will we look back and feel that these algorithms
00:42:17.940 | were naive first steps,
00:42:19.580 | or will they still be the fundamental ideas
00:42:21.500 | which are used even in 100,000, 10,000 years?
00:42:24.940 | - Yeah, they'll watch back to this conversation
00:42:30.300 | and with a smile, maybe a little bit of a laugh.
00:42:34.820 | I mean, my sense is,
00:42:36.380 | I think just like when we used to think that
00:42:40.900 | the sun revolved around the earth,
00:42:45.860 | they'll see our systems of today,
00:42:48.580 | reinforcement learning as too complicated.
00:42:51.540 | That the answer was simple all along.
00:42:54.460 | There's something, just like you said in the game of Go,
00:42:58.180 | I mean, I love the systems of like cellular automata,
00:43:01.700 | that there's simple rules
00:43:03.220 | from which incredible complexity emerges.
00:43:06.060 | So it feels like there might be
00:43:08.160 | some really simple approaches,
00:43:10.540 | just like where Sutton says, right?
00:43:12.640 | These simple methods with compute over time
00:43:17.700 | seem to prove to be the most effective.
00:43:20.740 | - I 100% agree.
00:43:21.900 | I think that,
00:43:22.780 | if we try to anticipate
00:43:27.780 | what will generalize well into the future,
00:43:30.660 | I think it's likely to be the case
00:43:32.940 | that it's the simple, clear ideas
00:43:35.580 | which will have the longest legs
00:43:36.820 | and which will carry us furthest into the future.
00:43:39.380 | Nevertheless, we're in a situation
00:43:40.900 | where we need to make things work today.
00:43:43.300 | And sometimes that requires
00:43:44.460 | putting together more complex systems
00:43:46.580 | where we don't have the full answers yet
00:43:49.100 | as to what those minimal ingredients might be.
00:43:51.620 | - So speaking of which,
00:43:53.060 | if we could take a step back to Go,
00:43:56.060 | what was MoGo
00:43:57.260 | and what was the key idea behind the system?
00:44:00.820 | - So back during my PhD on Computer Go,
00:44:04.460 | around about that time,
00:44:06.340 | there was a major new development
00:44:08.940 | which actually happened in the context of Computer Go.
00:44:12.860 | And it was really a revolution
00:44:16.420 | in the way that heuristic search was done.
00:44:18.740 | And the idea was essentially that
00:44:21.860 | a position could be evaluated
00:44:25.380 | or a state in general could be evaluated
00:44:27.380 | not by humans saying whether that position is good or not
00:44:33.540 | or even humans providing rules
00:44:35.100 | as to how you might evaluate it,
00:44:37.220 | but instead by allowing the system
00:44:40.860 | to randomly play out the game until the end,
00:44:44.480 | multiple times and taking the average of those outcomes
00:44:48.100 | as the prediction of what will happen.
00:44:50.620 | So for example, if you're in the game of Go,
00:44:53.000 | the intuition is that you take a position
00:44:55.340 | and you get the system to kind of play random moves
00:44:58.080 | against itself all the way to the end of the game
00:45:00.060 | and you see who wins.
00:45:01.740 | And if black ends up winning more
00:45:03.400 | of those random games than white,
00:45:05.120 | well, you say, hey, this is a position that favors black.
00:45:07.380 | And if white ends up winning more
00:45:08.780 | of those random games than black,
00:45:10.620 | then it favors white.
00:45:12.180 | So that idea was known as Monte Carlo search
00:45:18.020 | and a particular form of Monte Carlo search
00:45:23.100 | that became very effective
00:45:24.380 | and was developed in computer Go
00:45:26.060 | first by Rémi Coulom in 2006
00:45:28.540 | and then taken further by others
00:45:31.140 | was something called Monte Carlo tree search,
00:45:33.740 | which basically takes that same idea
00:45:35.900 | and uses that insight to evaluate every node
00:45:40.140 | of a search tree is evaluated by the average
00:45:42.940 | of the random playouts from that node onwards.
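
To make the rollout idea concrete, here is a minimal Python sketch of evaluating a position by averaging random playouts; the `position` interface (`legal_moves`, `play`, `winner`) is a hypothetical stand-in for a real Go implementation, not anything from the conversation.

```python
import random

def rollout_value(position, player, num_rollouts=100):
    """Estimate how good `position` is for `player` by playing random games
    to the end and averaging the outcomes (Monte Carlo evaluation).

    `position` is assumed to expose three hypothetical helpers:
      legal_moves() -> list of moves, play(move) -> new position,
      winner() -> 'black' / 'white' / None (None while the game is ongoing).
    """
    wins = 0
    for _ in range(num_rollouts):
        current = position
        # Play uniformly random legal moves until the game ends.
        while current.winner() is None:
            move = random.choice(current.legal_moves())
            current = current.play(move)
        if current.winner() == player:
            wins += 1
    # The fraction of random playouts won is the value estimate for `player`.
    return wins / num_rollouts
```

Monte Carlo tree search then reuses this same estimate at every node of a growing search tree, directing more playouts toward the moves whose averages currently look most promising.
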
00:45:45.800 | And this idea was very powerful
00:45:49.180 | and suddenly led to huge leaps forward
00:45:51.620 | in the strength of computer Go playing programs.
00:45:54.160 | And among those, the strongest
00:45:57.200 | of the Go playing programs in those days
00:45:58.980 | was a program called MoGo,
00:46:00.660 | which was the first program
00:46:02.740 | to actually reach human master level
00:46:05.220 | on small boards, nine by nine boards.
00:46:07.620 | And so this was a program by someone called Sylvain Gelly,
00:46:11.800 | who's a good colleague of mine,
00:46:13.100 | but I worked with him a little bit in those days,
00:46:16.740 | as part of my PhD thesis.
00:46:18.360 | And MoGo was a first step
00:46:21.720 | towards the latest successes we saw in computer Go,
00:46:25.440 | but it was still missing a key ingredient.
00:46:27.980 | MoGo was evaluating purely by random rollouts
00:46:32.600 | against itself.
00:46:33.840 | And in a way it's truly remarkable
00:46:36.360 | that random play should give you anything at all.
00:46:39.040 | - Yes.
00:46:39.880 | - Like why in this perfectly deterministic game
00:46:42.560 | that's very precise and involves these very exact sequences,
00:46:46.840 | why is it that randomization is helpful?
00:46:51.840 | And so the intuition is that randomization
00:46:54.080 | captures something about the nature of the search tree,
00:46:59.080 | from a position that you're understanding
00:47:01.800 | the nature of the search tree from that node onwards
00:47:04.560 | by using randomization.
00:47:06.960 | And this was a very powerful idea.
00:47:09.240 | - And I've seen this in other spaces,
00:47:11.680 | talked to Richard Karp and so on,
00:47:14.640 | randomized algorithms somehow magically
00:47:17.320 | are able to do exceptionally well
00:47:19.740 | by simplifying the problem somehow.
00:47:23.520 | It makes you wonder about the fundamental nature
00:47:25.640 | of randomness in our universe.
00:47:27.640 | It seems to be a useful thing.
00:47:29.480 | But so from that moment,
00:47:32.080 | can you maybe tell the origin story
00:47:34.000 | and the journey of AlphaGo?
00:47:36.080 | - Yeah.
00:47:36.920 | So programs based on Monte Carlo tree search
00:47:39.480 | were a first revolution in the sense that they led to,
00:47:43.800 | suddenly programs that could play the game
00:47:46.200 | to any reasonable level, but they plateaued.
00:47:49.280 | It seemed that no matter how much effort
00:47:51.880 | people put into these techniques,
00:47:53.200 | they couldn't exceed the level of amateur
00:47:56.400 | Dan level Go players.
00:47:58.060 | So strong players, but not anywhere near the level
00:48:01.360 | of professionals, nevermind the world champion.
00:48:04.480 | And so that brings us to the birth of AlphaGo,
00:48:08.400 | which happened in the context of a startup company
00:48:12.280 | known as DeepMind.
00:48:14.560 | - I heard of them.
00:48:15.480 | - Where a project was born
00:48:19.000 | and the project was really a scientific investigation
00:48:23.680 | where myself and Ajah Huang and an intern, Chris Madison,
00:48:28.680 | were exploring a scientific question.
00:48:33.240 | And that scientific question was really,
00:48:35.540 | is there another fundamentally different approach
00:48:39.600 | to this key question of Go, the key challenge
00:48:42.860 | of how can you build that intuition
00:48:45.740 | and how can you just have a system
00:48:47.580 | that could look at a position and understand
00:48:50.460 | what move to play or how well you're doing in that position,
00:48:53.340 | who's gonna win?
00:48:54.840 | And so the deep learning revolution had just begun.
00:49:03.460 | That benchmarks like ImageNet had suddenly been won
00:49:03.460 | by deep learning techniques back in 2012.
00:49:06.520 | And following that, it was natural to ask,
00:49:08.640 | well, if deep learning is able to scale up so effectively
00:49:12.480 | with images to understand them enough to classify them,
00:49:16.680 | well, why not go?
00:49:17.520 | Why not take the black and white stones of the Go board
00:49:22.520 | and build a system which can understand for itself
00:49:25.360 | what that means in terms of what move to pick
00:49:27.520 | or who's gonna win the game, black or white?
00:49:30.100 | And so that was our scientific question
00:49:32.540 | which we were probing and trying to understand.
00:49:35.680 | And as we started to look at it,
00:49:37.840 | we discovered that we could build a system.
00:49:40.840 | So in fact, our very first paper on AlphaGo
00:49:43.600 | was actually a pure deep learning system
00:49:46.120 | which was trying to answer this question.
00:49:49.440 | And we showed that actually a pure deep learning system
00:49:52.400 | with no search at all was actually able to reach
00:49:55.640 | human Dan level, master level at the full game of Go,
00:49:59.440 | 19 by 19 boards.
00:50:01.720 | And so without any search at all,
00:50:04.020 | suddenly we had systems which were playing
00:50:06.040 | at the level of the best Monte Carlo tree search systems,
00:50:10.100 | the ones with randomized rollouts.
00:50:11.760 | - So first of all, sorry to interrupt,
00:50:13.080 | but that's kind of a groundbreaking notion.
00:50:16.600 | That's like basically a definitive step away
00:50:20.680 | from a couple of decades
00:50:22.700 | of essentially search dominating AI.
00:50:26.280 | So how does that make you feel?
00:50:28.440 | Was it surprising from a scientific perspective in general,
00:50:33.940 | how did it make you feel?
00:50:33.940 | - I found this to be profoundly surprising.
00:50:37.300 | In fact, it was so surprising that we had a bet back then
00:50:41.760 | and like many good projects, bets are quite motivating
00:50:44.960 | and the bet was whether it was possible
00:50:47.840 | for a system based purely on deep learning,
00:50:52.100 | no search at all to beat a Dan level human player.
00:50:55.540 | And so we had someone who joined our team
00:51:00.020 | who was a Dan level player.
00:51:01.000 | He came in and we had this first match against him.
00:51:05.560 | - Which side of the bet were you on by the way?
00:51:09.080 | The losing or the winning side?
00:51:11.660 | - I tend to be an optimist with the power
00:51:14.580 | of deep learning and reinforcement learning.
00:51:18.360 | So the system won and we were able to beat
00:51:21.960 | this human Dan level player.
00:51:24.200 | And for me, that was the moment where it was like,
00:51:26.380 | okay, something special is afoot here.
00:51:29.400 | We have a system which without search is able
00:51:32.960 | to already just look at this position and understand things
00:51:36.980 | as well as a strong human player.
00:51:39.520 | And from that point onwards,
00:51:41.440 | I really felt that reaching the top levels of human play,
00:51:46.440 | professional level, world champion level,
00:51:50.760 | I felt it was actually an inevitability.
00:51:54.600 | And if it was an inevitable outcome,
00:51:59.600 | I was rather keen that it would be us that achieved it.
00:52:02.960 | So we scaled up.
00:52:05.340 | This was something where,
00:52:06.760 | so I had lots of conversations back then
00:52:09.320 | with Demis Hassabis, the head of DeepMind,
00:52:14.320 | who was extremely excited.
00:52:16.040 | And we made the decision to scale up the project,
00:52:21.200 | brought more people on board.
00:52:23.460 | And so AlphaGo became something where we had a clear goal,
00:52:28.460 | which was to try and crack this outstanding challenge of AI
00:52:33.760 | to see if we could beat the world's best players.
00:52:37.380 | And this led within the space of not so many months
00:52:42.380 | to playing against the European champion, Fan Hui,
00:52:45.840 | in a match which became memorable in history
00:52:49.000 | as the first time a Go program
00:52:50.720 | had ever beaten a professional player.
00:52:54.000 | And at that time, we had to make a judgment
00:52:56.320 | as to when and whether we should go
00:52:59.780 | and challenge the world champion.
00:53:01.860 | And this was a difficult decision to make.
00:53:04.220 | Again, we were basing our predictions on our own progress
00:53:08.560 | and had to estimate based on the rapidity
00:53:11.380 | of our own progress when we thought we would exceed
00:53:15.440 | the level of the human world champion.
00:53:17.700 | And we tried to make an estimate and set up a match
00:53:20.500 | and that became the AlphaGo versus Lee Sedol match in 2016.
00:53:25.500 | - And we should say, spoiler alert,
00:53:29.960 | that AlphaGo was able to defeat Lee Sedol.
00:53:33.840 | - That's right, yeah.
00:53:35.040 | - So maybe we could take even a broader view.
00:53:40.040 | AlphaGo involves both learning from expert games
00:53:45.900 | and, as far as I remember, a self-play component
00:53:50.900 | to where it learns by playing against itself.
00:53:54.240 | But in your sense, what was the role of learning
00:53:57.540 | from expert games there?
00:53:59.060 | And in terms of your self-evaluation,
00:54:01.340 | whether you can take on the world champion,
00:54:04.100 | what was the thing that you're trying to do more of,
00:54:06.940 | sort of train more on expert games?
00:54:09.380 | Or was there now another,
00:54:12.580 | I'm asking so many poorly phrased questions,
00:54:15.620 | but did you have a hope, a dream that self-play
00:54:19.540 | would be the key component at that moment yet?
00:54:23.400 | - So in the early days of AlphaGo,
00:54:26.420 | we used human data to explore the science
00:54:29.780 | of what deep learning can achieve.
00:54:31.380 | And so when we had our first paper that showed
00:54:34.620 | that it was possible to predict the winner of the game,
00:54:37.820 | that it was possible to suggest moves,
00:54:39.740 | that was done using human data.
00:54:41.260 | - Oh, solely human data.
00:54:42.420 | - Yeah, and so the reason that we did it that way
00:54:45.120 | was at that time we were exploring separately
00:54:47.660 | the deep learning aspect
00:54:48.980 | from the reinforcement learning aspect.
00:54:51.140 | That was the part which was new and unknown
00:54:53.460 | to me at that time, was how far could that be stretched?
00:54:57.340 | Once we had that, it then became natural
00:55:00.580 | to try and use that same representation
00:55:03.100 | and see if we could learn for ourselves
00:55:04.960 | using that same representation.
00:55:06.580 | And so right from the beginning,
00:55:08.380 | actually our goal had been to build a system
00:55:11.960 | using self-play.
00:55:14.220 | And to us, the human data right from the beginning
00:55:16.900 | was an expedient step to help us for pragmatic reasons
00:55:20.880 | to go faster towards the goals of the project
00:55:24.560 | than we might be able to starting solely from self-play.
00:55:27.580 | And so in those days, we were very aware
00:55:29.840 | that we were choosing to use human data
00:55:32.800 | and that might not be the long-term holy grail of AI,
00:55:37.360 | but that it was something which was extremely useful to us.
00:55:40.840 | It helped us to understand the system.
00:55:42.240 | It helped us to build deep learning representations
00:55:44.380 | which were clear and simple and easy to use.
00:55:48.420 | And so really I would say it's served a purpose,
00:55:51.980 | not just as part of the algorithm,
00:55:53.280 | but something which I continue to use in our research today,
00:55:56.180 | which is trying to break down a very hard challenge
00:56:00.080 | into pieces which are easier to understand for us
00:56:02.460 | as researchers and develop.
00:56:04.180 | So if you use a component based on human data,
00:56:07.740 | it can help you to understand the system
00:56:10.340 | such that then you can build
00:56:11.360 | the more principled version later that does it for itself.
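
As a rough illustration of the supervised-learning step described here, a minimal sketch of training a network to imitate expert moves; `policy_net` and the tensors of expert positions and moves are hypothetical placeholders, and this is not DeepMind's actual training code.

```python
import torch
import torch.nn as nn

def train_policy_on_expert_moves(policy_net, expert_positions, expert_moves,
                                 epochs=10, lr=1e-3):
    """Hypothetical sketch of the supervised step: learn to predict which move
    a strong human played in each position, treated as classification over
    board points.  `policy_net` maps a batch of encoded positions to move
    logits; `expert_positions` and `expert_moves` are assumed tensors."""
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = policy_net(expert_positions)   # shape: (num_examples, num_points)
        loss = loss_fn(logits, expert_moves)    # expert_moves: class index per example
        loss.backward()
        optimizer.step()
    return policy_net
```
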
00:56:14.260 | - So as I said, the AlphaGo victory,
00:56:19.700 | and I don't think I'm being sort of romanticizing
00:56:23.180 | this notion, I think it's one of the greatest moments
00:56:25.120 | in the history of AI.
00:56:26.980 | So were you cognizant of this magnitude
00:56:29.920 | of the accomplishment at the time?
00:56:32.280 | I mean, are you cognizant of it even now?
00:56:35.900 | 'Cause to me, I feel like it's something that would,
00:56:38.560 | we mentioned what the AGI systems of the future
00:56:41.280 | will look back.
00:56:42.500 | I think they'll look back at the AlphaGo victory
00:56:46.120 | as like, holy crap, they figured it out.
00:56:49.120 | This is where it started.
00:56:51.720 | - Well, thank you again.
00:56:52.760 | I mean, it's funny 'cause I guess I've been working on,
00:56:56.200 | I've been working on computer Go for a long time.
00:56:58.080 | So I'd been working at the time of the AlphaGo match
00:57:00.280 | on computer Go for more than a decade.
00:57:03.040 | And throughout that decade, I'd had this dream
00:57:06.040 | of what would it be like to,
00:57:07.680 | what would it be like really to actually be able
00:57:10.680 | to build a system that could play against
00:57:13.040 | the world champion?
00:57:14.280 | And I imagined that that would be an interesting moment
00:57:17.480 | that maybe some people might care about that
00:57:20.280 | and that this might be a nice achievement.
00:57:23.200 | But I think when I arrived in Seoul
00:57:27.480 | and discovered the legions of journalists
00:57:31.520 | that were following us around and the 100 million people
00:57:34.200 | that were watching the match online live,
00:57:37.620 | I realized that I'd been off in my estimation
00:57:40.120 | of how significant this moment was
00:57:41.900 | by several orders of magnitude.
00:57:43.960 | And so there was definitely an adjustment process
00:57:48.960 | to realize that this was something
00:57:53.120 | which the world really cared about
00:57:55.600 | and which was a watershed moment.
00:57:57.960 | And I think there was that moment of realization.
00:58:00.360 | But it's also a little bit scary because,
00:58:03.600 | if you go into something thinking it's gonna be
00:58:06.560 | maybe of interest and then discover
00:58:09.120 | that 100 million people are watching,
00:58:10.860 | it suddenly makes you worry about whether some
00:58:12.660 | of the decisions you'd made were really the best ones
00:58:14.960 | or the wisest or were going to lead to the best outcome.
00:58:18.260 | And we knew for sure that there were still imperfections
00:58:20.580 | in AlphaGo, which were gonna be exposed
00:58:22.700 | to the whole world watching.
00:58:24.400 | And so, yeah, it was, I think, a great experience
00:58:28.160 | and I feel privileged to have been part of it,
00:58:32.180 | privileged to have led that amazing team.
00:58:35.940 | I feel privileged to have been in a moment of history,
00:58:38.820 | like you say, but also lucky that,
00:58:43.260 | in a sense, I was insulated from the knowledge of,
00:58:46.380 | I think it would have been harder to focus on the research
00:58:48.820 | if the full kind of reality of what was gonna come to pass
00:58:52.460 | had been known to me and the team.
00:58:55.300 | I think it was, we were in our bubble
00:58:57.600 | and we were working on research
00:58:58.740 | and we were trying to answer the scientific questions
00:59:01.580 | and then bam, the public sees it.
00:59:04.540 | And I think it was better that way in retrospect.
00:59:07.500 | - Were you confident that, I guess,
00:59:10.220 | what were the chances that you could get the win?
00:59:12.720 | So, just like you said, I'm a little bit more familiar
00:59:18.620 | with another accomplishment
00:59:20.300 | that we may not even get a chance to talk about.
00:59:22.380 | I talked to Oriol Vinyals about AlphaStar,
00:59:24.500 | which is another incredible accomplishment.
00:59:26.260 | But here, with AlphaStar and beating StarCraft,
00:59:31.140 | there was already a track record with AlphaGo.
00:59:34.220 | This is like the really first time
00:59:36.220 | you get to see reinforcement learning
00:59:38.420 | face the best human in the world.
00:59:41.660 | So, what was your confidence like?
00:59:43.300 | What were the odds?
00:59:44.960 | - Well, we actually-- - Was there a bet?
00:59:46.860 | (laughing)
00:59:47.820 | - Funnily enough, there was.
00:59:49.900 | So, just before the match,
00:59:52.100 | we weren't betting on anything concrete,
00:59:54.300 | but we all held out a hand.
00:59:56.500 | Everyone in the team held out a hand
00:59:57.980 | at the beginning of the match.
00:59:59.620 | And the number of fingers that they had out on that hand
01:00:01.500 | was supposed to represent how many games
01:00:03.420 | they thought we would win against Lee Sedol.
01:00:06.300 | And there was an amazing spread in the team's predictions.
01:00:09.660 | But I have to say, I predicted 4-1.
01:00:12.780 | (laughing)
01:00:15.060 | And the reason was based purely on data.
01:00:18.580 | So, I'm a scientist first and foremost.
01:00:20.620 | And one of the things which we had established
01:00:23.140 | was that AlphaGo, in around one in five games,
01:00:27.260 | would develop something which we called a delusion,
01:00:29.540 | which was a kind of hole in its knowledge
01:00:31.980 | where it wasn't able to fully understand
01:00:34.860 | everything about the position.
01:00:36.100 | And that hole in its knowledge would persist
01:00:38.060 | for tens of moves throughout the game.
01:00:40.720 | And we knew two things.
01:00:42.720 | We knew that if there were no delusions,
01:00:44.460 | that AlphaGo seemed to be playing at a level
01:00:46.620 | that was far beyond any human capabilities.
01:00:49.420 | But we also knew that if there were delusions,
01:00:52.020 | the opposite was true.
01:00:53.180 | (laughing)
01:00:54.020 | And in fact, that's what came to pass.
01:00:58.300 | We saw all of those outcomes.
01:01:00.180 | And Lee Sedol, in one of the games,
01:01:02.900 | played a really beautiful sequence
01:01:04.580 | that AlphaGo just hadn't predicted.
01:01:08.180 | And after that, it led it into this situation
01:01:11.800 | where it was unable to really understand
01:01:14.300 | the position fully and found itself
01:01:16.380 | in one of these delusions.
01:01:17.900 | So indeed, 4-1 was the outcome.
01:01:20.780 | - So yeah, and can you maybe speak to it a little bit more?
01:01:23.220 | What were the five games?
01:01:25.460 | Like, what happened?
01:01:26.380 | Is there interesting things that come to memory
01:01:29.900 | in terms of the play of the human or the machine?
01:01:33.620 | - So I remember all of these games vividly, of course.
01:01:36.980 | You know, moments like these don't come too often
01:01:39.340 | in the lifetime of a scientist.
01:01:42.460 | And the first game was magical
01:01:47.340 | because it was the first time that a computer program
01:01:52.340 | had defeated a world champion
01:01:54.140 | in this grand challenge of Go.
01:01:57.020 | And there was a moment where AlphaGo
01:01:59.540 | invaded Lee Sedol's territory towards the end of the game.
01:02:06.720 | And that's quite an audacious thing to do.
01:02:09.920 | It's like saying,
01:02:10.760 | hey, you thought this was gonna be your territory
01:02:12.260 | in the game, but I'm gonna stick a stone
01:02:13.580 | right in the middle of it
01:02:14.420 | and prove to you that I can break it up.
01:02:17.940 | And Lee Sedol's face just dropped.
01:02:20.300 | He wasn't expecting a computer
01:02:21.660 | to do something that audacious.
01:02:26.140 | The second game became famous for a move known as Move 37.
01:02:30.820 | This was a move that was played by AlphaGo
01:02:34.740 | that broke all of the conventions of Go,
01:02:38.460 | that the Go players were so shocked by this.
01:02:40.260 | They thought that maybe the operator had made a mistake.
01:02:45.260 | They thought that there was something crazy going on.
01:02:48.260 | And it just broke every rule that Go players are taught
01:02:51.060 | from a very young age.
01:02:52.620 | They're just taught, you know,
01:02:53.820 | this kind of move called a shoulder hit.
01:02:55.300 | You can only play it on the third line or the fourth line
01:02:58.820 | and AlphaGo played it on the fifth line.
01:03:00.740 | And it turned out to be a brilliant move
01:03:03.580 | and made this beautiful pattern in the middle of the board
01:03:05.760 | that ended up winning the game.
01:03:07.380 | And so this really was a clear instance
01:03:12.300 | where we could say computers exhibited creativity,
01:03:16.020 | that this was really a move
01:03:17.020 | that was something humans hadn't known about,
01:03:20.980 | hadn't anticipated, and computers discovered this idea.
01:03:24.840 | They were the ones to say, actually,
01:03:27.020 | you know, here's a new idea, something new,
01:03:29.140 | not in the domains of human knowledge of the game.
01:03:32.460 | And now the humans think this is a reasonable thing to do
01:03:38.220 | and it's part of Go knowledge now.
01:03:40.460 | The third game, something special happens
01:03:44.600 | when you play against a human world champion,
01:03:46.620 | which again, I hadn't anticipated before going there,
01:03:48.820 | which is, you know, these players are amazing.
01:03:53.300 | Lee Sedol was a true champion, an 18-time world champion,
01:03:56.400 | and had this amazing ability to probe AlphaGo
01:04:01.020 | for weaknesses of any kind.
01:04:03.420 | And in the third game, he was losing
01:04:06.140 | and we felt we were sailing comfortably to victory,
01:04:09.740 | but he managed to, from nothing, stir up this fight
01:04:14.300 | and build what's called a double ko,
01:04:17.020 | these kinds of repetitive positions.
01:04:20.460 | And he knew that historically,
01:04:22.140 | no computer Go program had ever been able
01:04:24.520 | to deal correctly with double ko positions.
01:04:26.740 | And he managed to summon one out of nothing.
01:04:29.740 | And so for us, you know, this was a real challenge.
01:04:33.060 | Like would AlphaGo be able to deal with this
01:04:35.340 | or would it just kind of crumble
01:04:36.380 | in the face of this situation?
01:04:38.620 | And fortunately it dealt with it perfectly.
01:04:41.420 | The fourth game was amazing in that Lee Sedol
01:04:46.140 | appeared to be losing this game.
01:04:48.180 | You know, AlphaGo thought it was winning.
01:04:49.860 | And then Lee Sedol did something
01:04:51.980 | which I think only a true world champion can do,
01:04:55.000 | which is he found a brilliant sequence
01:04:58.200 | in the middle of the game,
01:04:59.040 | a brilliant sequence that led him
01:05:01.900 | to really just transform their position.
01:05:05.300 | It kind of, he found just a piece of genius really.
01:05:10.300 | And after that, AlphaGo, its evaluation just tumbled.
01:05:15.740 | It thought it was winning this game.
01:05:17.220 | And all of a sudden it tumbled and said,
01:05:19.740 | oh, now I've got no chance.
01:05:21.420 | And it starts to behave rather oddly at that point.
01:05:24.380 | In the final game, for some reason,
01:05:26.860 | we as a team were convinced having seen AlphaGo
01:05:29.380 | in the previous game suffer from delusions.
01:05:31.940 | We as a team were convinced
01:05:33.980 | that it was suffering from another delusion.
01:05:35.900 | We were convinced that it was misevaluating the position
01:05:38.180 | and that something was going terribly wrong.
01:05:41.180 | And it was only in the last few moves of the game
01:05:43.720 | that we realized that actually,
01:05:46.420 | although it had been predicting
01:05:47.440 | it was gonna win all the way through,
01:05:49.380 | it really was.
01:05:50.540 | And so somehow, it just taught us yet again
01:05:54.220 | that you have to have faith in your systems.
01:05:56.140 | When they exceed your own level of ability
01:05:58.700 | and your own judgment,
01:06:00.400 | you have to trust in them to know better
01:06:02.500 | than you, the designer.
01:06:03.820 | Once you've bestowed in them the ability
01:06:07.500 | to judge better than you can,
01:06:10.540 | then trust the system to do so.
01:06:12.100 | - So just like in the case of Deep Blue beating Garry Kasparov,
01:06:18.860 | for Garry, I think that was the first time he'd ever lost
01:06:22.780 | actually to anybody.
01:06:24.380 | And I mean, there's a similar situation with Lee Sedol.
01:06:27.700 | It's a tragic loss for humans,
01:06:32.700 | but a beautiful one.
01:06:36.500 | I think that's kind of,
01:06:37.860 | from the tragedy sort of emerges over time,
01:06:44.420 | emerges a kind of inspiring story.
01:06:47.260 | But Lee Sedol recently announced his retirement.
01:06:52.140 | I don't know if we can look too deeply into it,
01:06:56.040 | but he did say that even if I become number one,
01:06:59.540 | there's an entity that cannot be defeated.
01:07:02.620 | So what do you think about these words?
01:07:05.500 | What do you think about his retirement from the game and go?
01:07:08.020 | - Well, let me take you back first of all
01:07:09.460 | to the first part of your comment about Garry Kasparov,
01:07:12.460 | because actually at the panel yesterday,
01:07:15.660 | he specifically said that when he first lost to Deep Blue,
01:07:19.780 | he viewed it as a failure.
01:07:22.340 | He viewed that this had been a failure of his,
01:07:24.940 | but later on in his career,
01:07:26.900 | he said he'd come to realize that actually it was a success.
01:07:30.380 | It was a success for everyone
01:07:31.940 | because this marked a transformational moment for AI.
01:07:36.200 | And so even for Garry Kasparov,
01:07:38.900 | he came to realize that that moment was pivotal
01:07:42.420 | and actually meant something much more
01:07:45.380 | than his personal loss in that moment.
01:07:48.740 | Lee Sedol, I think, was much more cognizant of that
01:07:53.780 | even at the time.
01:07:54.820 | So in his closing remarks to the match,
01:07:56.940 | he really felt very strongly
01:08:00.180 | that what had happened in the AlphaGo match
01:08:02.900 | was not only meaningful for AI, but for humans as well.
01:08:06.460 | And he felt as a Go player that it had opened his horizons
01:08:09.900 | and meant that he could start exploring new things.
01:08:12.700 | It brought his joy back for the game of Go
01:08:14.420 | because it had broken all of the conventions and barriers
01:08:18.620 | and meant that suddenly anything was possible again.
01:08:22.460 | And so, yeah, I was sad to hear that he'd retired,
01:08:26.100 | but he's been a great world champion over many, many years.
01:08:31.100 | And I think he'll be remembered for that evermore.
01:08:36.020 | He'll be remembered as the last person to beat AlphaGo.
01:08:39.340 | I mean, after that, we increased the power of the system
01:08:43.060 | and the next version of AlphaGo
01:08:45.820 | beat the other strong human players 60 games to nil.
01:08:50.820 | So, you know, what a great moment for him
01:08:55.260 | and something to be remembered for.
01:08:56.980 | - It's interesting that you spent time at AAAI
01:09:01.140 | on a panel with Garry Kasparov.
01:09:04.020 | What, I mean, it's almost, I'm just curious to learn
01:09:10.060 | the conversations you've had with Garry
01:09:13.940 | 'cause he's also now,
01:09:15.500 | he's written a book about artificial intelligence.
01:09:17.420 | He's thinking about AI.
01:09:18.740 | He has kind of a view of it
01:09:21.140 | and he talks about AlphaGo a lot.
01:09:23.820 | What's your sense?
01:09:25.940 | Arguably, I'm not just being Russian,
01:09:28.660 | but I think Garry is the greatest chess player of all time,
01:09:32.300 | and probably one of the greatest game players of all time.
01:09:36.540 | And you sort of at the center of creating a system
01:09:41.540 | that beats one of the greatest players of all time.
01:09:45.340 | So what's that conversation like?
01:09:46.780 | Is there anything, any interesting digs, any bets,
01:09:50.460 | any funny things, any profound things?
01:09:53.700 | - So Garry Kasparov has an incredible respect
01:09:58.300 | for what we did with AlphaGo.
01:10:01.180 | And, you know, it's an amazing tribute
01:10:04.900 | coming from him of all people
01:10:07.580 | that he really appreciates and respects what we've done.
01:10:10.860 | And I think he feels that the progress
01:10:14.020 | which has happened in computer chess,
01:10:16.180 | where later, after AlphaGo, we built the AlphaZero system,
01:10:21.180 | which defeated the world's strongest chess programs.
01:10:25.860 | And to Garry Kasparov, that moment in computer chess
01:10:29.900 | was more profound than Deep Blue.
01:10:33.060 | And the reason he believes it mattered more
01:10:35.700 | was because it was done with learning
01:10:37.620 | and a system which was able to discover for itself
01:10:39.980 | new principles, new ideas,
01:10:42.340 | which were able to play the game in a way
01:10:44.820 | which he hadn't always known about or anyone.
01:10:49.820 | And in fact, one of the things I discovered at this panel
01:10:53.260 | was that the current world champion, Magnus Carlsen,
01:10:56.620 | apparently recently commented on his improvement
01:11:00.540 | in performance and he attributes it to AlphaZero.
01:11:04.260 | He's been studying the games of AlphaZero
01:11:05.940 | and he's changed his style to play more like AlphaZero.
01:11:08.820 | And it's led to him actually increasing his rating
01:11:13.820 | to a new peak.
01:11:15.220 | - Yeah, I guess to me, just like to Garry,
01:11:18.460 | the inspiring thing is that,
01:11:20.380 | and just like you said with reinforcement learning,
01:11:23.340 | reinforcement learning and deep learning,
01:11:26.300 | machine learning feels like what intelligence is.
01:11:29.660 | And you could attribute it to sort of a bitter viewpoint
01:11:34.660 | from Garry's perspective, from us humans' perspective,
01:11:39.500 | saying that pure search that IBM Deep Blue was doing
01:11:43.740 | is not really intelligence,
01:11:45.740 | but somehow it didn't feel like it.
01:11:47.740 | And so that's the magical,
01:11:49.100 | I'm not sure what it is about learning
01:11:50.780 | that feels like intelligence, but it does.
01:11:53.700 | - So I think we should not demean the achievements
01:11:58.020 | of what was done in previous eras of AI.
01:12:00.060 | I think that Deep Blue was an amazing achievement in itself.
01:12:04.140 | And that heuristic search of the kind
01:12:07.180 | that was used by Deep Blue
01:12:08.900 | had some powerful ideas that were in there,
01:12:11.380 | but it also missed some things.
01:12:12.620 | So the fact that the evaluation function,
01:12:16.580 | the way that the chess position was understood
01:12:18.660 | was created by humans and not by the machine
01:12:22.500 | is a limitation, which means that there's a ceiling
01:12:27.380 | on how well it can do, but maybe more importantly,
01:12:30.500 | it means that the same idea cannot be applied
01:12:33.020 | in other domains where we don't have access
01:12:34.940 | to the kind of human grandmasters
01:12:38.500 | and that ability to kind of encode exactly their knowledge
01:12:41.140 | into an evaluation function.
01:12:43.060 | And the reality is that the story of AI
01:12:45.060 | is that most domains turn out to be of the second type,
01:12:48.580 | where when knowledge is messy,
01:12:50.540 | it's hard to extract from experts or it isn't even available.
01:12:54.020 | And so we need to solve problems in a different way.
01:12:59.020 | And I think AlphaGo is a step towards solving things
01:13:02.700 | in a way which puts learning as a first-class citizen
01:13:07.700 | and says systems need to understand for themselves
01:13:11.460 | how to understand the world,
01:13:13.940 | how to judge the value of any action
01:13:18.940 | that they might take within that world
01:13:20.780 | and any state they might find themselves in.
01:13:22.860 | And in order to do that, we make progress towards AI.
01:13:27.860 | - Yeah, so one of the nice things about this,
01:13:31.860 | about taking a learning approach to the game of Go
01:13:35.420 | or game playing is that the things you learn,
01:13:38.500 | the things you figure out are actually going to be applicable
01:13:40.700 | to other problems that are real-world problems.
01:13:44.100 | That's sort of, that's ultimately, I mean,
01:13:46.700 | there's two really interesting things about AlphaGo.
01:13:49.100 | One is the science of it, just the science of learning,
01:13:52.420 | the science of intelligence.
01:13:54.580 | And then the other is, well, you're actually learning
01:13:58.020 | to figure out how to build systems
01:13:59.740 | that would be potentially applicable in other applications,
01:14:04.160 | medical, autonomous vehicles, robotics,
01:14:06.540 | all, I mean, it's just open the door
01:14:08.260 | to all kinds of applications.
01:14:10.740 | So the next incredible step, right,
01:14:14.940 | really the profound step is probably AlphaGo Zero.
01:14:18.360 | I mean, it's arguable, I kind of see them all
01:14:21.520 | as the same place, but really,
01:14:23.220 | and perhaps you were already thinking
01:14:24.920 | that AlphaGo Zero is the natural,
01:14:26.880 | it was always going to be the next step,
01:14:29.340 | but it's removing the reliance on human expert games
01:14:33.540 | for pre-training, as you mentioned.
01:14:35.480 | So how big of an intellectual leap was this,
01:14:39.760 | that self-play could achieve superhuman level performance
01:14:44.160 | on its own?
01:14:45.720 | And maybe could you also say what is self-play?
01:14:48.720 | We kind of mentioned it a few times, but.
01:14:50.760 | - So let me start with self-play.
01:14:55.320 | So the idea of self-play is something
01:14:58.440 | which is really about systems learning for themselves,
01:15:02.140 | but in the situation where there's more than one agent.
01:15:05.760 | And so if you're in a game,
01:15:07.920 | and the game is played between two players,
01:15:10.520 | then self-play is really about understanding that game
01:15:14.920 | just by playing games against yourself
01:15:17.600 | rather than against any actual real opponent.
01:15:20.040 | And so it's a way to kind of discover strategies
01:15:23.920 | without having to actually need to go out
01:15:27.480 | and play against any particular human player, for example.
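
A minimal sketch of the self-play loop being described, with hypothetical `agent` and `game` interfaces standing in for any concrete implementation:

```python
def self_play_training(agent, game, num_games=1000):
    """Hypothetical sketch of a self-play loop: one agent plays both sides of
    the game, then learns from the final outcome, with no human opponent or
    human data involved."""
    for _ in range(num_games):
        state = game.initial_state()
        trajectory = []                      # (state, move) pairs from this game
        while not game.is_over(state):
            move = agent.select_move(state)  # the same agent moves for both sides
            trajectory.append((state, move))
            state = game.apply(state, move)
        outcome = game.result(state)         # e.g. +1 first-player win, -1 loss
        # Make moves that led to wins more likely and pull the agent's value
        # estimates toward the observed result.
        agent.learn(trajectory, outcome)
    return agent
```
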
01:15:32.480 | The main idea of AlphaZero was really to,
01:15:40.160 | you know, try and step back from any of the knowledge
01:15:45.160 | that we'd put into the system and ask the question,
01:15:47.840 | is it possible to come up with a single elegant principle
01:15:52.840 | by which a system can learn for itself
01:15:55.640 | all of the knowledge which it requires
01:15:58.120 | to play a game such as Go?
01:16:00.920 | Importantly, by taking knowledge out,
01:16:03.300 | you not only make the system less brittle
01:16:08.360 | in the sense that perhaps the knowledge you were putting in
01:16:10.640 | was just getting in the way
01:16:11.960 | and maybe stopping the system learning for itself,
01:16:15.440 | but also you make it more general.
01:16:17.880 | The more knowledge you put in,
01:16:19.780 | the harder it is for a system to actually be
01:16:23.480 | taken out of the system in which it's kind of been designed
01:16:26.800 | and placed in some other system
01:16:28.760 | that maybe would need a completely different knowledge base
01:16:30.600 | to understand and perform well.
01:16:32.840 | And so the real goal here is to strip out
01:16:36.280 | all of the knowledge that we put in
01:16:37.880 | to the point that we can just plug it
01:16:39.600 | into something totally different.
01:16:41.800 | And that to me is really, you know,
01:16:43.820 | the promise of AI is that we can have systems such as that,
01:16:47.280 | which, you know, no matter what the goal is,
01:16:49.480 | no matter what goal we set to the system,
01:16:52.880 | we can come up with, we have an algorithm
01:16:56.000 | which can be placed into that world, into that environment,
01:16:58.680 | and can succeed in achieving that goal.
01:17:01.080 | And then that, that's to me is almost
01:17:05.280 | the essence of intelligence, if we can achieve that.
01:17:08.040 | And so AlphaZero is a step towards that.
01:17:10.040 | And it's a step that was taken in the context
01:17:13.680 | of two-player perfect information games,
01:17:16.600 | like Go and Chess.
01:17:18.840 | We also applied it to Japanese chess.
01:17:21.520 | - So just to clarify, the first step was AlphaGo Zero.
01:17:25.560 | - The first step was to try and take all of the knowledge
01:17:29.040 | out of AlphaGo in such a way that it could play
01:17:32.840 | in a fully self-discovered way, purely from self-play.
01:17:37.840 | And to me, the motivation for that was always
01:17:41.840 | that we could then plug it into other domains,
01:17:44.080 | but we saved that until later.
01:17:46.840 | - Well, in fact, I mean, just for fun,
01:17:52.860 | I could tell you exactly the moment
01:17:54.320 | where the idea for AlphaZero occurred to me,
01:17:57.880 | because I think there's maybe a lesson there
01:17:59.120 | for researchers who are kind of too deeply embedded
01:18:02.120 | in their research and working 24/7
01:18:05.440 | to try and come up with the next idea,
01:18:08.160 | which is, it actually occurred to me on honeymoon.
01:18:12.080 | (laughing)
01:18:13.840 | And I was like at my most fully relaxed state,
01:18:17.160 | really enjoying myself, and just bing,
01:18:21.600 | this, like, the algorithm for AlphaZero just appeared.
01:18:25.240 | Like, and in its full form.
01:18:29.880 | And this was actually before we played
01:18:31.320 | against Lee Sedol, but we just didn't,
01:18:35.800 | I think we were so busy trying to make sure
01:18:39.160 | we could beat the world champion,
01:18:42.560 | that it was only later that we had the opportunity
01:18:45.940 | to step back and start examining
01:18:47.920 | that sort of deeper scientific question
01:18:50.400 | of whether this could really work.
01:18:52.360 | - So nevertheless, so self-play is probably
01:18:56.280 | one of the most sort of profound ideas
01:18:59.920 | that represents, to me at least, artificial intelligence.
01:19:04.920 | But the fact that you could use that kind of mechanism
01:19:09.800 | to, again, beat world-class players,
01:19:13.000 | that's very surprising.
01:19:14.820 | So we kind of, to me, it feels like you have to train
01:19:19.160 | in a large number of expert games.
01:19:21.280 | So was it surprising to you?
01:19:22.720 | What was the intuition?
01:19:23.640 | Can you sort of think, not necessarily at that time,
01:19:26.500 | even now, what's your intuition?
01:19:28.000 | Why this thing works so well?
01:19:30.040 | Why it's able to learn from scratch?
01:19:31.880 | - Well, let me first say why we tried it.
01:19:34.560 | So we tried it both because I feel
01:19:36.480 | that it was the deeper scientific question
01:19:38.520 | to be asking to make progress towards AI.
01:19:42.080 | And also because in general in my research,
01:19:44.920 | I don't like to do research on questions
01:19:47.500 | for which we already know the likely outcome.
01:19:51.000 | I don't see much value in running an experiment
01:19:53.200 | where you're 95% confident that you will succeed.
01:19:57.640 | And so we could have tried maybe to take AlphaGo
01:20:02.040 | and do something which we knew for sure it would succeed on.
01:20:05.200 | But much more interesting to me was to try it
01:20:07.200 | on the things which we weren't sure about.
01:20:09.400 | And one of the big questions on our minds back then was,
01:20:14.240 | could you really do this with self-play alone?
01:20:16.200 | How far could that go?
01:20:17.640 | Would it be as strong?
01:20:19.540 | And honestly, we weren't sure.
01:20:22.200 | Yeah, it was 50/50, I think.
01:20:23.640 | If you'd asked me, I wasn't confident
01:20:27.400 | that it could reach the same level as these systems,
01:20:30.720 | but it felt like the right question to ask.
01:20:33.880 | And even if it had not achieved the same level,
01:20:36.760 | I felt that that was an important direction to be studying.
01:20:41.760 | And so then lo and behold,
01:20:47.680 | it actually ended up outperforming
01:20:50.320 | the previous version of AlphaGo
01:20:52.440 | and indeed was able to beat it by 100 games to zero.
01:20:55.960 | So what's the intuition as to why?
01:20:59.800 | I think the intuition to me is clear,
01:21:02.420 | that whenever you have errors in a system,
01:21:07.420 | as we did in AlphaGo, AlphaGo suffered from these delusions.
01:21:12.060 | Occasionally it would misunderstand what was going on
01:21:13.880 | in a position and mis-evaluate it.
01:21:15.920 | How can you remove all of these errors?
01:21:19.780 | Errors arise from many sources.
01:21:21.880 | For us, they were arising both from,
01:21:24.160 | starting from the human data,
01:21:25.320 | but also from the nature of the search
01:21:27.760 | and the nature of the algorithm itself.
01:21:29.840 | But the only way to address them in any complex system
01:21:33.280 | is to give the system the ability to correct its own errors.
01:21:38.020 | It must be able to correct them.
01:21:39.520 | It must be able to learn for itself
01:21:41.480 | when it's doing something wrong and correct for it.
01:21:44.640 | And so it seemed to me that the way to correct delusions
01:21:47.840 | was indeed to have more iterations
01:21:50.680 | of reinforcement learning.
01:21:52.520 | No matter where you start,
01:21:53.560 | you should be able to correct for those errors
01:21:55.800 | until it gets to play that out and understand,
01:21:58.400 | oh, well, I thought that I was gonna win in this situation,
01:22:01.440 | but then I ended up losing.
01:22:03.280 | That suggests that I was mis-evaluating something,
01:22:05.440 | there's a hole in my knowledge,
01:22:06.440 | and now the system can correct for itself
01:22:08.640 | and understand how to do better.
01:22:10.540 | Now, if you take that same idea
01:22:13.440 | and trace it back all the way to the beginning,
01:22:16.240 | it should be able to take you from no knowledge,
01:22:19.240 | from completely random starting point,
01:22:21.840 | all the way to the highest levels of knowledge
01:22:24.800 | that you can achieve in a domain.
01:22:27.200 | And the principle is the same,
01:22:28.480 | that if you bestow a system with the ability
01:22:31.820 | to correct its own errors,
01:22:33.580 | then it can take you from random
01:22:35.520 | to something slightly better than random,
01:22:37.800 | because it sees the stupid things that the random is doing
01:22:40.440 | and it can correct them.
01:22:41.640 | And then it can take you from that slightly better system
01:22:44.000 | and understand, well, what's that doing wrong?
01:22:45.960 | And it takes you on to the next level and the next level.
01:22:49.440 | And this progress can go on indefinitely.
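
One way to picture this error-correcting loop is a sketch in which each position's predicted value is nudged toward the game's actual outcome, so confident mis-evaluations (the "delusions" described above) shrink with every round of self-play; the `value_net` interface here is a hypothetical placeholder, not the actual system.

```python
def correct_value_errors(value_net, completed_games, learning_rate=0.01):
    """Hypothetical sketch: after each batch of self-play games, move every
    position's predicted value toward the actual outcome, so confident
    mis-evaluations shrink on each iteration of training."""
    for positions, outcome in completed_games:
        # `outcome` is the final result (e.g. +1 win, -1 loss) from the
        # perspective used by the value network for these positions.
        for position in positions:
            prediction = value_net.evaluate(position)   # e.g. thought +0.9
            error = outcome - prediction                # but actually lost: large error
            value_net.update(position, learning_rate * error)
    return value_net
```
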
01:22:53.080 | And indeed, what would have happened
01:22:55.320 | if we'd carried on training AlphaGo Zero for longer?
01:22:58.420 | We saw no sign of it slowing down its improvements,
01:23:03.360 | or at least it was certainly carrying on to improve.
01:23:06.680 | And presumably, if you had the computational resources,
01:23:11.080 | this could lead to better and better systems
01:23:14.520 | that discover more and more.
01:23:15.720 | - So your intuition is fundamentally
01:23:18.940 | there's not a ceiling to this process.
01:23:21.640 | One of the surprising things, just like you said,
01:23:24.680 | is the process of patching errors.
01:23:27.400 | It intuitively makes sense.
01:23:29.320 | And reinforcement learning should be part of that process.
01:23:33.600 | But what is surprising is in the process of patching
01:23:36.840 | your own lack of knowledge, you don't open up other patches.
01:23:41.840 | You keep sort of,
01:23:43.800 | like there's a monotonic decrease of your weaknesses.
01:23:48.520 | - Well, let me back this up.
01:23:50.160 | I think science always should make falsifiable hypotheses.
01:23:53.840 | So let me back up this claim
01:23:55.320 | with a falsifiable hypothesis,
01:23:57.040 | which is that if someone was to, in the future,
01:23:59.760 | take AlphaZero as an algorithm and run it
01:24:03.400 | with greater computational resources
01:24:07.440 | than we had available today,
01:24:09.520 | then I would predict that they would be able
01:24:12.880 | to beat the previous system 100 games to zero.
01:24:15.400 | And that if they were then to do the same thing
01:24:17.240 | a couple of years later,
01:24:19.240 | that that would beat that previous system 100 games to zero.
01:24:22.080 | And that that process would continue indefinitely
01:24:25.160 | throughout at least my human lifetime.
01:24:27.560 | - Presumably the game of Go would set the ceiling.
01:24:31.000 | I mean--
01:24:31.840 | - The game of Go would set the ceiling,
01:24:33.200 | but the game of Go has 10 to the 170 states in it.
01:24:36.320 | So the ceiling is unreachable by any computational device
01:24:40.400 | that can be built out of the, you know,
01:24:43.600 | 10 to the 80 atoms in the universe.
01:24:46.600 | - You asked a really good question, which is,
01:24:49.000 | do you not open up other errors
01:24:51.160 | when you correct your previous ones?
01:24:53.640 | And the answer is yes, you do.
01:24:56.160 | And so it's a remarkable fact
01:24:58.640 | about this class of two-player game,
01:25:02.240 | and also true of single-agent games,
01:25:05.200 | that essentially progress will always lead you to,
01:25:10.200 | if you have sufficient representational resource,
01:25:15.080 | like imagine you had,
01:25:16.600 | could represent every state in a big table of the game,
01:25:20.160 | then we know for sure that a progress of self-improvement
01:25:24.040 | will lead all the way in the single-agent case
01:25:27.120 | to the optimal possible behavior,
01:25:29.080 | and in the two-player case to the minimax optimal behavior.
01:25:31.800 | That is the best way that I can play,
01:25:35.280 | knowing that you're playing perfectly against me.
01:25:38.040 | And so for those cases,
01:25:39.760 | we know that even if you do open up some new error,
01:25:44.680 | that in some sense you've made progress.
01:25:46.360 | You're progressing towards the best that can be done.
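
For reference, the minimax value being referred to is the standard textbook quantity; in negamax form, for a two-player zero-sum game:

```latex
% Minimax (negamax) value of a state s: the outcome under perfect play by both sides.
V^*(s) =
\begin{cases}
  z(s) & \text{if } s \text{ is terminal, with outcome } z(s) \in \{-1, 0, +1\}\\[4pt]
  \max_{a \in \mathcal{A}(s)} \; -\,V^*\!\left(T(s,a)\right) & \text{otherwise}
\end{cases}
```

Here T(s, a) is the position reached after move a, viewed from the opponent's side, and A(s) is the set of legal moves; with a tabular representation, repeated self-improvement converges toward play that attains this value.
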
01:25:50.440 | - So AlphaGo was initially trained on expert games
01:25:55.200 | with some self-play.
01:25:56.440 | AlphaGo Zero removed the need to be trained on expert games.
01:26:00.200 | And then another incredible step for me,
01:26:03.960 | 'cause I just love chess,
01:26:05.720 | is to generalize that further to be in AlphaZero
01:26:09.520 | to be able to play the game of Go,
01:26:12.200 | beating AlphaGo Zero and AlphaGo,
01:26:14.760 | and then also being able to play the game of chess
01:26:18.160 | and others.
01:26:19.120 | So what was that step like?
01:26:20.960 | What's the interesting aspects there
01:26:23.560 | that required to make that happen?
01:26:26.640 | - I think the remarkable observation,
01:26:29.840 | which we saw with AlphaZero,
01:26:31.960 | was that actually without modifying the algorithm at all,
01:26:35.760 | it was able to play and crack
01:26:37.480 | some of AI's greatest previous challenges.
01:26:41.280 | In particular, we dropped it into the game of chess.
01:26:44.760 | And unlike the previous systems like Deep Blue,
01:26:47.160 | which had been worked on for years and years,
01:26:50.400 | we were able to beat
01:26:52.640 | the world's strongest computer chess program convincingly
01:26:57.280 | using a system that had discovered everything
01:27:00.920 | on its own, from scratch, with its own principles.
01:27:04.920 | And in fact, one of the nice things that we found
01:27:08.160 | was that, in fact, we also achieved the same result
01:27:11.520 | in Japanese chess, a variant of chess
01:27:13.480 | where you get to capture pieces
01:27:15.160 | and then place them back down on your own side
01:27:17.640 | as an extra piece.
01:27:19.000 | So a much more complicated variant of chess.
01:27:21.840 | And we also beat the world's strongest programs
01:27:24.760 | and reached superhuman performance in that game too.
01:28:28.040 | And the very first time
01:27:30.080 | that we'd ever run the system on that particular game,
01:28:34.480 | was with the version that we published in the paper on AlphaZero.
01:27:37.680 | It just worked out of the box, literally, no touching it.
01:27:41.720 | We didn't have to do anything.
01:27:42.840 | And there it was, superhuman performance,
01:27:45.280 | no tweaking, no twiddling.
01:27:47.000 | And so I think there's something beautiful
01:27:49.560 | about that principle that you can take an algorithm
01:27:52.960 | and without twiddling anything, it just works.
01:27:57.720 | Now, to go beyond AlphaZero, what's required?
01:28:02.760 | AlphaZero is just a step.
01:28:05.480 | And there's a long way to go beyond that
01:28:06.920 | to really crack the deep problems of AI.
01:28:09.920 | But one of the important steps is to acknowledge
01:28:13.520 | that the world is a really messy place.
01:28:16.280 | It's this rich, complex, beautiful,
01:28:18.480 | but messy environment that we live in.
01:28:22.000 | And no one gives us the rules.
01:28:23.440 | Like no one knows the rules of the world.
01:28:26.120 | At least maybe we understand that it operates
01:28:28.480 | according to Newtonian or quantum mechanics
01:28:31.160 | at the micro level or according to relativity
01:28:34.040 | at the macro level, but that's not a model
01:28:36.800 | that's useful for us as people to operate in it.
01:28:40.240 | Somehow the agent needs to understand the world for itself
01:28:43.800 | in a way where no one tells it the rules of the game,
01:28:46.320 | and yet it can still figure out what to do in that world,
01:28:49.960 | deal with this stream of observations coming in,
01:28:53.600 | rich sensory input coming in, actions going out in a way
01:28:57.000 | that allows it to reason in the way that AlphaGo
01:28:59.560 | or AlphaZero can reason, in the way that these
01:29:02.560 | Go and chess playing programs can reason,
01:29:04.880 | but in a way that allows it to take actions
01:29:07.840 | in that messy world to achieve its goals.
01:29:10.440 | And so this led us to the most recent step
01:29:15.320 | in the story of AlphaGo, which was a system called MuZero.
01:29:19.560 | And MuZero is a system which learns for itself
01:29:23.440 | even when the rules are not given to it.
01:29:25.480 | It actually can be dropped into a system
01:29:28.240 | with messy perceptual inputs.
01:29:29.760 | We actually tried it in some Atari games,
01:29:33.920 | the canonical domains of Atari that have been used
01:29:37.160 | for reinforcement learning.
01:29:38.560 | And this system learned to build a model
01:29:42.920 | of these Atari games that was sufficiently rich
01:29:46.960 | and useful enough for it to be able to plan successfully.
01:29:51.400 | And in fact, that system not only went on
01:29:53.520 | to beat the state of the art in Atari,
01:29:56.700 | but the same system without modification
01:29:59.340 | was able to reach the same level of superhuman performance
01:30:03.020 | in Go, chess, and shogi that we'd seen in AlphaZero,
01:30:06.940 | showing that even without the rules,
01:30:08.740 | the system can learn for itself just by trial and error,
01:30:11.160 | just by playing this game of Go.
01:30:13.140 | And no one tells you what the rules are,
01:30:15.040 | but you just get to the end and someone says,
01:30:17.260 | you know, win or loss.
01:30:18.760 | Or you play this game of chess and someone says win or loss,
01:30:22.060 | or you play a game of breakout in Atari
01:30:25.580 | and someone just tells you, you know,
01:30:26.780 | your score at the end.
01:30:28.060 | And the system for itself figures out
01:30:30.620 | essentially the rules of the system,
01:30:31.940 | the dynamics of the world, how the world works.
01:30:35.220 | And not in any explicit way,
01:30:38.120 | but just implicitly enough understanding
01:30:40.780 | for it to be able to plan in that system
01:30:43.140 | in order to achieve its goals.
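
The published MuZero work factors this into three learned functions: a representation function that encodes raw observations into a hidden state, a dynamics function that predicts the next hidden state and reward for a candidate action (standing in for the unknown rules), and a prediction function that outputs a policy and value. Below is a rough, hypothetical sketch of how planning could use them, with a simple depth-limited search standing in for the Monte Carlo tree search the real system uses; all three functions are assumed to be already-trained networks.

```python
def plan_with_learned_model(representation, dynamics, prediction,
                            observation, candidate_actions, depth=3):
    """Hypothetical sketch of planning with a learned model, MuZero-style:
    no game rules are used, only three assumed, already-trained functions.
      representation(observation)     -> hidden state
      dynamics(hidden_state, action)  -> (next_hidden_state, predicted_reward)
      prediction(hidden_state)        -> (policy, value)
    """
    def lookahead(hidden_state, remaining_depth):
        _, value = prediction(hidden_state)       # learned value estimate
        if remaining_depth == 0:
            return value
        # Imagine a few steps ahead using the learned dynamics model.
        best = float("-inf")
        for action in candidate_actions:
            next_state, reward = dynamics(hidden_state, action)
            best = max(best, reward + lookahead(next_state, remaining_depth - 1))
        return best

    root = representation(observation)            # encode the raw observation
    # Pick the action whose imagined future looks best according to the model.
    scores = {}
    for action in candidate_actions:
        next_state, reward = dynamics(root, action)
        scores[action] = reward + lookahead(next_state, depth - 1)
    return max(scores, key=scores.get)
```
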
01:30:45.500 | - And that's the fundamental process
01:30:48.140 | that you have to go through when you're facing
01:30:49.700 | in any uncertain kind of environment
01:30:51.540 | that you would in the real world,
01:30:53.220 | is figuring out the sort of the rules,
01:30:55.100 | the basic rules of the game.
01:30:56.580 | - That's right.
01:30:57.420 | - So this, I mean, yeah,
01:30:59.100 | that allows it to be applicable to basically
01:31:01.780 | any domain that could be digitized
01:31:05.900 | in the way that it needs to in order to be consumable,
01:31:10.060 | sort of in order for the reinforcement learning framework
01:31:12.180 | to be able to sense the environment,
01:31:13.740 | to be able to act in the environment and so on.
01:31:15.620 | - The full reinforcement learning problem
01:31:17.020 | needs to deal with worlds that are unknown and complex
01:31:21.340 | and the agent needs to learn for itself
01:31:23.720 | how to deal with that.
01:31:24.560 | And so MuZero is a step, a further step in that direction.
01:31:29.460 | - One of the things that inspired the general public
01:31:32.180 | and just in conversations I have like with my parents
01:31:34.540 | or something with my mom that just loves what was done
01:31:38.300 | is kind of at least the notion
01:31:40.340 | that there was some display of creativity,
01:31:42.120 | some new strategies, new behaviors that were created.
01:31:45.860 | That again has echoes of intelligence.
01:31:48.900 | So is there something that stands out?
01:31:50.780 | Do you see it the same way that there's creativity
01:31:52.920 | and there's some behaviors, patterns that you saw
01:31:57.220 | that AlphaZero was able to display that are truly creative?
01:32:00.740 | - So let me start by I think saying
01:32:05.860 | that I think we should ask what creativity really means.
01:32:08.260 | So to me, creativity means discovering something
01:32:13.260 | which wasn't known before, something unexpected,
01:32:16.860 | something outside of our norms.
01:32:19.700 | And so in that sense, the process of reinforcement learning
01:32:24.700 | or the self-play approach that was used by AlphaZero
01:32:28.960 | is it's the essence of creativity.
01:32:32.120 | It's really saying at every stage,
01:32:34.560 | you're playing according to your current norms
01:32:36.880 | and you try something and if it works out,
01:32:40.360 | you say, "Hey, here's something great.
01:32:43.320 | "I'm gonna start using that."
01:32:44.940 | And then that process, it's like a micro discovery
01:32:47.560 | that happens millions and millions of times
01:32:49.980 | over the course of the algorithm's life
01:32:51.980 | where it just discovers some new idea.
01:32:54.500 | "Oh, this pattern, this pattern's working really well for me.
01:32:56.780 | "I'm gonna start using that."
01:32:58.620 | And now, "Oh, here's this other thing I can do.
01:33:00.700 | "I can start to connect these stones together in this way
01:33:04.040 | "or I can start to sacrifice stones or give up on pieces
01:33:08.920 | "or play shoulder hits on the fifth line,"
01:33:11.300 | or whatever it is.
01:33:12.440 | The system's discovering things like this for itself
01:33:14.300 | continually, repeatedly, all the time.
01:33:17.100 | And so it should come as no surprise to us then
01:33:19.960 | when if you leave these systems going,
01:33:22.140 | that they discover things that are not known to humans,
01:33:25.740 | that to the human norms are considered creative.
01:33:30.580 | And we've seen this several times.
01:33:32.900 | In fact, in AlphaGo Zero,
01:33:35.700 | we saw this beautiful timeline of discovery
01:33:39.220 | where what we saw was that there are these opening patterns
01:33:44.020 | that humans play called joseki.
01:33:45.500 | These are like the patterns that humans learn
01:33:47.820 | to play in the corners,
01:33:48.780 | and they've been developed and refined
01:33:50.260 | over literally thousands of years in the game of Go.
01:33:53.220 | And what we saw was in the course of the training,
01:33:56.340 | AlphaGo Zero, over the course of the 40 days
01:34:00.100 | that we trained this system,
01:34:01.900 | it starts to discover exactly these patterns
01:34:05.620 | that human players play.
01:34:06.980 | And over time, we found that all of the joseki
01:34:10.180 | that humans played were discovered by the system
01:34:13.180 | through this process of self-play
01:34:15.660 | and this sort of essential notion of creativity.
01:34:19.700 | But what was really interesting was that over time,
01:34:22.500 | it then starts to discard some of these
01:34:24.940 | in favor of its own joseki that humans didn't know about.
01:34:28.220 | And it starts to say, oh, well,
01:34:29.580 | you thought that the knight's move pincer joseki
01:34:33.020 | was a great idea,
01:34:34.100 | but here's something different you can do there,
01:34:37.060 | which makes some new variation
01:34:38.740 | that humans didn't know about.
01:34:40.420 | And actually now, the human Go players
01:34:42.420 | study the joseki that AlphaGo played,
01:34:44.700 | and they become the new norms
01:34:46.620 | that are used in today's top level Go competitions.
01:34:51.300 | - That never gets old.
01:34:52.580 | Even just the first, to me,
01:34:54.780 | maybe just makes me feel good as a human being
01:34:58.340 | that a self-play mechanism that knows nothing about us humans
01:35:01.940 | discovers patterns that we humans do.
01:35:04.580 | It's like an affirmation that we're doing okay as humans.
01:35:08.460 | - Yeah. (laughs)
01:35:10.580 | In this domain, in other domains,
01:35:12.580 | we figured it out. It's like the Churchill quote
01:35:14.860 | about democracy.
01:35:15.980 | It sucks, but it's the best one we've tried.
01:35:20.300 | So in general, taking a step outside of Go,
01:35:24.500 | and you have a million accomplishments
01:35:27.220 | that I have no time to talk about
01:35:29.580 | with AlphaStar and so on and the current work.
01:35:32.900 | But in general, this self-play mechanism
01:35:36.700 | that you've inspired the world with
01:35:38.260 | by beating the world champion Go player.
01:35:40.660 | Do you see that as,
01:35:43.820 | do you see it being applied in other domains?
01:35:47.180 | Do you have sort of dreams and hopes
01:35:50.620 | that it's applied in both the simulated environments
01:35:53.820 | and the constrained environments of games?
01:35:56.380 | Constrained, I mean, AlphaStar really demonstrates
01:35:58.980 | that you can remove a lot of the constraints,
01:36:00.500 | but nevertheless, it's in a digital simulated environment.
01:36:04.100 | Do you have a hope, a dream that it starts being applied
01:36:07.220 | in the robotics environment?
01:36:09.140 | And maybe even in domains that are safety critical
01:36:12.980 | and so on, and have a real impact in the real world,
01:36:16.620 | like autonomous vehicles, for example,
01:36:18.300 | which seems like a very far out dream at this point.
01:36:21.180 | - So I absolutely do hope and imagine
01:36:25.580 | that we will get to the point where ideas just like these
01:36:28.660 | are used in all kinds of different domains.
01:36:31.180 | In fact, one of the most satisfying things
01:36:32.740 | as a researcher is when you start to see
01:36:34.900 | other people use your algorithms in unexpected ways.
01:36:38.260 | So in the last couple of years,
01:36:40.260 | there have been a couple of nature papers
01:36:43.260 | where different teams unbeknownst to us took AlphaZero
01:36:48.260 | and applied exactly those same algorithms and ideas
01:36:52.780 | to real world problems of huge meaning to society.
01:36:57.620 | So one of them was the problem of chemical synthesis,
01:37:01.020 | and they were able to beat the state of the art
01:37:02.980 | in finding pathways of how to actually synthesize chemicals,
01:37:07.980 | retrosynthesis.
01:37:11.120 | And the second paper actually just came out
01:37:14.140 | a couple of weeks ago in Nature,
01:37:16.640 | showed that in quantum computation,
01:37:19.500 | you know, one of the big questions is how to understand
01:37:22.780 | the nature of the function in quantum computation,
01:37:26.780 | and a system based on AlphaZero beat the state of the art
01:37:30.380 | by quite some distance there again.
01:37:32.380 | So these are just examples, and I think, you know,
01:37:34.940 | that the lesson which we've seen elsewhere
01:37:37.300 | in machine learning time and time again
01:37:39.420 | is that if you make something general,
01:37:41.320 | it will be used in all kinds of ways.
01:37:44.180 | You know, you provide a really powerful tool to society,
01:37:47.380 | and those tools can be used in amazing ways.
01:37:50.800 | And so I think we're just at the beginning,
01:37:53.620 | and for sure I hope that we see all kinds of outcomes.
01:37:58.900 | - So the other side of the question
01:38:01.820 | of reinforcement learning framework is, you know,
01:38:05.580 | you usually want to specify a reward function,
01:38:07.660 | an objective function.
01:38:09.220 | What do you think about sort of ideas of intrinsic rewards
01:38:13.860 | if, and when we're not really sure about, you know,
01:38:17.460 | if we take, you know, human beings as an existence proof
01:38:23.700 | that we don't seem to be operating
01:38:25.860 | according to a single reward,
01:38:27.840 | do you think that there's interesting ideas
01:38:32.100 | for when you don't know how to truly specify the reward,
01:38:35.460 | you know, that there's some flexibility
01:38:38.140 | for discovering it intrinsically or so on
01:38:40.620 | in the context of reinforcement learning?
01:38:42.700 | - So I think, you know, when we think about intelligence,
01:38:45.020 | it's really important to be clear
01:38:46.760 | about the problem of intelligence.
01:38:48.380 | And I think it's clearest to understand that problem
01:38:51.180 | in terms of some ultimate goal
01:38:52.700 | that we want the system to try and solve for.
01:38:55.340 | And after all, if we don't understand
01:38:57.220 | the ultimate purpose of the system,
01:38:58.920 | do we really even have a clearly defined problem
01:39:02.340 | that we're solving at all?
01:39:04.320 | Now, within that, as with your example for humans,
01:39:08.500 | the system may choose to create its own motivations
01:39:13.980 | and sub goals that help the system
01:39:16.380 | to achieve its ultimate goal.
01:39:17.820 | And that may indeed be a hugely important mechanism
01:39:22.400 | to achieve those ultimate goals.
01:39:23.820 | But there is still some ultimate goal
01:39:25.500 | that I think the system needs to be measured
01:39:27.060 | and evaluated against.
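
As one common way to make that distinction concrete (an assumed pattern from the wider reinforcement learning literature, not a description of DeepMind's systems), the sketch below gives the learner a self-generated novelty bonus to shape its exploration, while the score it is ultimately judged by remains the externally defined reward. The NoveltyBonus class, the beta weighting, and the decay schedule are all hypothetical.

```python
from collections import defaultdict

class NoveltyBonus:
    """Hypothetical intrinsic motivation: pay the agent for visiting states
    it has rarely seen, with the bonus shrinking as they become familiar."""
    def __init__(self, scale=1.0):
        self.counts = defaultdict(int)
        self.scale = scale

    def __call__(self, state):
        self.counts[state] += 1
        return self.scale / self.counts[state] ** 0.5

def training_reward(extrinsic_reward, state, bonus, beta=0.1):
    """What the learner optimizes: the ultimate goal plus a self-set motivation."""
    return extrinsic_reward + beta * bonus(state)

def evaluation_score(episode_extrinsic_rewards):
    """What the system is ultimately measured against: extrinsic reward only."""
    return sum(episode_extrinsic_rewards)

if __name__ == "__main__":
    bonus = NoveltyBonus()
    print(training_reward(0.0, "s0", bonus))  # first visit: exploration incentive
    print(training_reward(0.0, "s0", bonus))  # bonus decays as s0 becomes familiar
```

The design point is the same one made in the answer above: sub-goals and intrinsic motivations can help learning, but the well-defined, ultimate objective is still what the system is evaluated against.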
01:39:29.660 | And even for humans, I mean, humans,
01:39:31.380 | we're incredibly flexible.
01:39:32.420 | We feel that we can, you know,
01:39:33.980 | any goal that we're given,
01:39:35.180 | we feel we can master to some degree.
01:39:39.040 | But if we think of those goals, really, you know,
01:39:41.860 | like the goal of being able to pick up an object
01:39:44.860 | or the goal of being able to communicate
01:39:47.180 | or influence people to do things in a particular way
01:39:50.980 | or whatever those goals are,
01:39:54.220 | really, they're sub-goals that we set ourselves.
01:39:58.580 | You know, we choose to pick up the object.
01:40:00.940 | We choose to communicate.
01:40:02.140 | We choose to influence someone else.
01:40:05.340 | And we choose those because we think it will lead us
01:40:07.680 | to something later on.
01:40:10.500 | We think that's helpful to us to achieve some ultimate goal.
01:40:15.080 | Now, I don't want to speculate whether or not humans
01:40:18.260 | as a system necessarily have a singular overall goal
01:40:20.900 | of survival or whatever it is.
01:40:23.500 | But I think the principle for understanding
01:40:25.580 | and implementing intelligence has to be
01:40:28.100 | that if we're trying to understand intelligence
01:40:30.060 | or implement our own,
01:40:31.380 | there has to be a well-defined problem.
01:40:33.140 | Otherwise, if it's not, I think,
01:40:35.940 | it's like an admission of defeat.
01:40:38.220 | For there to be hope for understanding
01:40:41.460 | or implementing intelligence,
01:40:42.740 | we have to know what we're doing.
01:40:44.040 | We have to know what we're asking the system to do.
01:40:46.380 | Otherwise, if you don't have a clearly defined purpose,
01:40:48.860 | you're not gonna get a clearly defined answer.
01:40:51.620 | - The ridiculous big question that has to naturally follow,
01:40:56.420 | 'cause I have to pin you down on this thing,
01:41:00.820 | that nevertheless, one of the big silly
01:41:03.340 | or big real questions before humans is the meaning of life,
01:41:08.060 | is us trying to figure out our own reward function.
01:41:11.180 | And you just kind of mentioned
01:41:12.520 | that if you want to build intelligent systems
01:41:15.020 | and you know what you're doing,
01:41:16.260 | you should be at least cognizant to some degree
01:41:18.380 | of what the reward function is.
01:41:20.280 | So the natural question is,
01:41:22.000 | what do you think is the reward function of human life,
01:41:26.260 | the meaning of life for us humans,
01:41:29.240 | the meaning of our existence?
01:41:30.740 | - I think I'd be speculating beyond my own expertise,
01:41:36.620 | but just for fun, let me do that.
01:41:38.460 | - Yes, please.
01:41:39.420 | - And say, I think that there are many levels
01:41:41.180 | at which you can understand a system
01:41:42.980 | and you can understand something as optimizing
01:41:46.420 | for a goal at many levels.
01:41:48.920 | And so you can understand the,
01:41:51.620 | let's start with the universe,
01:41:53.940 | like does the universe have a purpose?
01:41:55.780 | Well, it feels like it's just at one level,
01:41:58.100 | just following certain mechanical laws of physics
01:42:02.340 | and that that's led to the development of the universe.
01:42:04.620 | But at another level, you can view it as actually,
01:42:08.500 | there's the second law of thermodynamics
01:42:09.940 | that says that this is increasing
01:42:11.700 | in entropy over time forever.
01:42:13.340 | And now there's a view that's been developed
01:42:15.380 | by certain people at MIT,
01:42:17.300 | that you can think of this as almost like a goal
01:42:20.080 | of the universe, that the purpose of the universe
01:42:22.240 | is to maximize entropy.
01:42:24.920 | So there are multiple levels
01:42:26.060 | at which you can understand a system.
01:42:27.920 | The next level down, you might say,
01:42:30.680 | well, if the goal is to maximize entropy,
01:42:34.080 | well, how can that be done by a particular system?
01:42:39.080 | And maybe evolution is something
01:42:41.740 | that the universe discovered in order to kind of dissipate
01:42:45.540 | energy as efficiently as possible.
01:42:48.080 | And by the way, I'm borrowing from Max Tegmark
01:42:49.960 | for some of these metaphors, the physicist.
01:42:53.920 | But if you can think of evolution as a mechanism
01:42:56.140 | for dispersing energy, then evolution,
01:43:01.140 | you might say, then becomes a goal,
01:43:04.180 | which is if evolution disperses energy
01:43:06.620 | by reproducing as efficiently as possible,
01:43:09.340 | what's evolution then?
01:43:10.580 | Well, it's now got its own goal within that,
01:43:13.700 | which is to actually reproduce as effectively as possible.
01:43:18.700 | And now how does reproduction,
01:43:21.360 | how is that made as effective as possible?
01:43:25.020 | Well, you need entities within that
01:43:27.580 | that can survive and reproduce as effectively as possible.
01:43:29.900 | And so it's natural that in order to achieve
01:43:31.620 | that high level goal, those individual organisms
01:43:33.860 | discover brains, intelligences, which enable them
01:43:38.860 | to support the goals of evolution.
01:43:43.220 | And those brains, what do they do?
01:43:45.340 | Well, perhaps the early brains,
01:43:47.800 | maybe they were controlling things at some direct level.
01:43:51.820 | You know, maybe they were the equivalent
01:43:53.100 | of pre-programmed systems, which were directly controlling
01:43:55.620 | what was going on and setting certain things
01:43:59.620 | in order to achieve these particular goals.
01:44:03.060 | But that led to another level of discovery,
01:44:05.940 | which was learning systems, you know,
01:44:07.260 | parts of the brain which were able to learn for themselves
01:44:10.140 | and learn how to program themselves to achieve any goal.
01:44:13.460 | And presumably there are parts of the brain
01:44:16.580 | where goals are set to parts of that system
01:44:20.340 | and provides this very flexible notion of intelligence
01:44:23.020 | that we as humans presumably have,
01:44:25.020 | which is
01:44:26.540 | why we feel that we can achieve any goal.
01:44:30.020 | So it's a very long-winded answer to say that,
01:44:32.820 | you know, I think there are many perspectives
01:44:34.700 | and many levels at which intelligence can be understood.
01:44:38.620 | And at each of those levels,
01:44:40.460 | you can take multiple perspectives.
01:44:41.900 | Like, you know, you can view the system
01:44:43.180 | as something which is optimizing for a goal,
01:44:45.420 | which is understanding it at a level
01:44:47.820 | by which we can maybe implement it
01:44:49.500 | and understand it as AI researchers or computer scientists.
01:44:53.340 | Or you can understand it at the level
01:44:54.780 | of the mechanistic thing which is going on,
01:44:56.420 | that there are these, you know,
01:44:57.380 | atoms bouncing around in the brain
01:44:58.780 | and they lead to the outcome of that system.
01:45:01.380 | It's not in contradiction with the fact
01:45:02.940 | that it's also a decision-making system
01:45:07.100 | that's optimizing for some goal
01:45:08.420 | and purpose.
01:45:10.140 | - I've never heard the description
01:45:13.100 | of the meaning of life structured so beautifully in layers.
01:45:16.860 | But you did miss one layer,
01:45:18.340 | which is the next step which you're responsible for,
01:45:21.740 | which is creating the artificial intelligence layer
01:45:26.740 | on top of that.
01:45:28.220 | And I can't wait to see, well, I may not be around,
01:45:31.740 | but I can't wait to see what the next layer beyond that will be.
01:45:36.260 | - Well, let's just take that argument, you know,
01:45:38.900 | and pursue it to its natural conclusion.
01:45:41.260 | So the next level indeed is for how can our learning brain
01:45:45.940 | achieve its goals most effectively?
01:45:49.180 | Well, maybe it does so by us as learning beings,
01:45:53.780 | building a system which is able to solve for those goals
01:46:00.180 | more effectively than we can.
01:46:01.820 | And so when we build a system to play the game of Go,
01:46:04.940 | you know, when I said that I wanted to build a system
01:46:06.940 | that can play Go better than I can,
01:46:08.740 | I've enabled myself to achieve that goal of playing Go
01:46:12.180 | better than I could by directly playing it
01:46:14.500 | and learning it myself.
01:46:15.820 | And so now a new layer has been created,
01:46:18.740 | which is systems which are able to achieve goals
01:46:21.260 | for themselves.
01:46:22.620 | And ultimately there may be layers beyond that
01:46:25.060 | where they set sub-goals to parts of their own system
01:46:28.500 | in order to achieve those and so forth.
01:46:32.980 | So the story of intelligence, I think,
01:46:36.100 | is a multi-layered one and a multi-perspective one.
01:46:39.060 | - We live in an incredible universe.
01:46:41.940 | David, thank you so much, first of all,
01:46:43.940 | for dreaming of using learning to solve Go
01:46:47.860 | and building intelligent systems
01:46:50.100 | and for actually making it happen
01:46:52.260 | and for inspiring millions of people in the process.
01:46:56.100 | It's truly an honor.
01:46:57.020 | Thank you so much for talking today.
01:46:58.260 | - Okay, thank you.
01:46:59.940 | - Thanks for listening to this conversation
01:47:01.300 | with David Silver and thank you to our sponsors,
01:47:04.060 | Masterclass and Cash App.
01:47:05.980 | Please consider supporting the podcast
01:47:07.740 | by signing up to Masterclass at masterclass.com/lex
01:47:12.100 | and downloading Cash App and using code LEXPODCAST.
01:47:15.740 | If you enjoy this podcast, subscribe on YouTube,
01:47:18.020 | review it with five stars on Apple Podcasts,
01:47:20.260 | support on Patreon, or simply connect with me on Twitter
01:47:23.380 | at Lex Fridman.
01:47:25.260 | And now let me leave you with some words from David Silver.
01:47:28.620 | "My personal belief is that we've seen something
01:47:31.220 | "of a turning point where we're starting to understand
01:47:34.380 | "that many abilities, like intuition and creativity,
01:47:38.100 | "that we've previously thought were in the domain only
01:47:40.740 | "of the human mind are actually accessible
01:47:43.240 | "to machine intelligence as well.
01:47:45.420 | "And I think that's a really exciting moment in history."
01:47:48.300 | Thank you for listening and hope to see you next time.
01:47:52.060 | (upbeat music)
01:47:54.640 | (upbeat music)