
Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94


Chapters

0:00 Introduction
2:23 AlexNet paper and the ImageNet moment
8:33 Cost functions
13:39 Recurrent neural networks
16:19 Key ideas that led to success of deep learning
19:57 What's harder to solve: language or vision?
29:35 We're massively underestimating deep learning
36:04 Deep double descent
41:20 Backpropagation
42:42 Can neural networks be made to reason?
50:35 Long-term memory
56:37 Language models
60:35 GPT-2
67:14 Active learning
68:52 Staged release of AI systems
73:41 How to build AGI?
85:00 Question to AGI
92:07 Meaning of life

Whisper Transcript

00:00:00.000 | The following is a conversation with Ilya Sutskever,
00:00:03.140 | co-founder and chief scientist of OpenAI,
00:00:06.120 | one of the most cited computer scientists in history
00:00:09.360 | with over 165,000 citations.
00:00:13.480 | And to me, one of the most brilliant and insightful minds
00:00:17.060 | ever in the field of deep learning.
00:00:20.000 | There are very few people in this world
00:00:21.680 | who I would rather talk to and brainstorm with
00:00:24.040 | about deep learning, intelligence, and life in general
00:00:27.760 | than Ilya, on and off the mic.
00:00:30.640 | This was an honor and a pleasure.
00:00:32.820 | This conversation was recorded
00:00:35.240 | before the outbreak of the pandemic.
00:00:37.200 | For everyone feeling the medical, psychological,
00:00:39.480 | and financial burden of this crisis,
00:00:41.440 | I'm sending love your way.
00:00:43.160 | Stay strong, we're in this together, we'll beat this thing.
00:00:47.160 | This is the Artificial Intelligence Podcast.
00:00:49.640 | If you enjoy it, subscribe on YouTube,
00:00:51.760 | review it with five stars on Apple Podcasts,
00:00:54.040 | support it on Patreon, or simply connect with me on Twitter
00:00:57.000 | at Lex Fridman, spelled F-R-I-D-M-A-N.
00:01:00.560 | As usual, I'll do a few minutes of ads now
00:01:03.000 | and never any ads in the middle
00:01:04.320 | that can break the flow of the conversation.
00:01:06.600 | I hope that works for you
00:01:07.980 | and doesn't hurt the listening experience.
00:01:10.120 | This show is presented by Cash App,
00:01:13.440 | the number one finance app in the App Store.
00:01:15.740 | When you get it, use code LEXPODCAST.
00:01:18.860 | Cash App lets you send money to friends, buy Bitcoin,
00:01:22.060 | invest in the stock market with as little as $1.
00:01:25.440 | Since Cash App allows you to buy Bitcoin,
00:01:27.480 | let me mention that cryptocurrency
00:01:29.280 | in the context of the history of money is fascinating.
00:01:33.060 | I recommend "The Ascent of Money"
00:01:34.640 | as a great book on this history.
00:01:36.800 | Both the book and audiobook are great.
00:01:39.560 | Debits and credits on ledgers
00:01:41.000 | started around 30,000 years ago.
00:01:43.900 | The US dollar created over 200 years ago,
00:01:47.160 | and Bitcoin, the first decentralized cryptocurrency,
00:01:50.000 | released just over 10 years ago.
00:01:52.040 | So given that history,
00:01:53.480 | cryptocurrency is still very much in its early days
00:01:55.920 | of development, but it's still aiming to,
00:01:58.200 | and just might, redefine the nature of money.
00:02:01.800 | So again, if you get Cash App
00:02:03.520 | from the App Store or Google Play and use the code LEXPODCAST,
00:02:08.000 | you get $10, and Cash App will also donate $10 to FIRST,
00:02:12.440 | an organization that is helping advance robotics
00:02:14.820 | and STEM education for young people around the world.
00:02:17.620 | And now, here's my conversation with Ilya Sutskever.
00:02:23.360 | You were one of the three authors,
00:02:25.240 | with Alex Krizhevsky, Geoff Hinton,
00:02:27.720 | of the famed AlexNet paper
00:02:30.120 | that is arguably the paper
00:02:33.000 | that marked the big catalytic moment
00:02:35.120 | that launched the deep learning revolution.
00:02:37.840 | At that time, take us back to that time,
00:02:39.560 | what was your intuition about neural networks,
00:02:42.240 | about the representational power of neural networks?
00:02:45.960 | And maybe you could mention,
00:02:47.580 | how did that evolve over the next few years,
00:02:50.840 | up to today, over the 10 years?
00:02:53.480 | - Yeah, I can answer that question.
00:02:55.240 | At some point in about 2010 or 2011,
00:02:58.600 | I connected two facts in my mind.
00:03:02.600 | Basically, the realization was this.
00:03:06.700 | At some point, we realized that we can train very large,
00:03:11.280 | I shouldn't say very, you know,
00:03:12.120 | they were tiny by today's standards,
00:03:13.400 | but large and deep neural networks,
00:03:16.560 | end to end with back propagation.
00:03:18.540 | At some point, different people obtained this result.
00:03:22.380 | I obtained this result.
00:03:23.820 | The first moment in which I realized
00:03:26.420 | that deep neural networks are powerful
00:03:29.000 | was when James Martens invented
00:03:30.780 | the Hessian Free Optimizer in 2010.
00:03:33.620 | And he trained a 10-layer neural network,
00:03:36.300 | end to end, without pre-training, from scratch.
00:03:40.620 | And when that happened, I thought, this is it.
00:03:43.940 | Because if you can train a big neural network,
00:03:45.620 | a big neural network can represent a very complicated function.
00:03:49.500 | Because if you have a neural network with 10 layers,
00:03:52.700 | it's as though you allow the human brain
00:03:55.260 | to run for some number of milliseconds.
00:03:58.340 | Neuron firings are slow,
00:04:00.380 | and so in maybe 100 milliseconds,
00:04:03.300 | your neurons only fire 10 times.
00:04:04.700 | So it's also kind of like 10 layers.
00:04:06.780 | And in 100 milliseconds,
00:04:08.140 | you can perfectly recognize any object.
00:04:10.460 | So I thought, so I already had the idea then
00:04:13.100 | that we need to train a very big neural network
00:04:16.100 | on lots of supervised data, and then it must succeed,
00:04:19.420 | because we can find the best neural network.
00:04:21.380 | And then there's also theory
00:04:22.740 | that if you have more data than parameters,
00:04:24.500 | you won't overfit.
00:04:25.760 | Today, we know that actually this theory is very incomplete
00:04:28.100 | and you won't overfit
00:04:28.940 | even if you have less data than parameters.
00:04:30.380 | But definitely, if you have more data than parameters,
00:04:32.500 | you won't overfit.
00:04:33.340 | - So the fact that neural networks
00:04:34.700 | were heavily over-parameterized wasn't discouraging to you?
00:04:39.100 | So you were thinking about the theory
00:04:41.220 | that the number of parameters,
00:04:43.080 | the fact there's a huge number of parameters is okay?
00:04:45.260 | It's gonna be okay?
00:04:46.100 | - I mean, there was some evidence before
00:04:47.300 | that it was okay-ish,
00:04:48.260 | but the theory was that if you had a big dataset
00:04:51.500 | and a big neural net, it was going to work.
00:04:53.060 | The over-parameterization just didn't really figure much
00:04:56.300 | as a problem.
00:04:57.140 | I thought, well, with images,
00:04:57.960 | you're just gonna add some data augmentation
00:04:59.260 | and it's gonna be okay.
00:05:00.420 | - So where was any doubt coming from?
00:05:02.460 | - The main doubt was, can we train a big,
00:05:04.460 | will we have enough compute to train a big enough neural net?
00:05:06.420 | - With backpropagation.
00:05:07.580 | - Backpropagation, I thought, would work.
00:05:09.460 | The thing which wasn't clear
00:05:10.700 | was whether there would be enough compute
00:05:12.500 | to get a very convincing result.
00:05:14.140 | And then at some point, Alex Krizhevsky
00:05:15.580 | wrote these insanely fast CUDA kernels
00:05:17.540 | for training convolutional neural nets.
00:05:19.220 | And that was, bam, let's do this.
00:05:20.940 | Let's get ImageNet,
00:05:21.780 | and it's gonna be the greatest thing.
00:05:23.460 | - Was most of your intuition from empirical results
00:05:27.340 | by you and by others?
00:05:29.580 | So like, just actually demonstrating
00:05:31.140 | that a piece of program can train
00:05:33.180 | a 10-layer neural network?
00:05:34.700 | Or was there some pen and paper or marker and whiteboard
00:05:39.260 | thinking intuition?
00:05:40.760 | 'Cause you just connected a 10-layer large neural network
00:05:44.740 | to the brain, so you just mentioned the brain.
00:05:46.620 | So in your intuition about neural networks,
00:05:49.220 | does the human brain come into play as a intuition builder?
00:05:53.860 | - Definitely.
00:05:55.020 | I mean, you gotta be precise with these analogies
00:05:57.520 | between artificial neural networks and the brain.
00:06:00.300 | But there's no question that the brain is a huge source
00:06:04.100 | of intuition and inspiration for deep learning researchers
00:06:07.460 | since all the way from Rosenblatt in the '60s.
00:06:10.820 | Like, if you look at, the whole idea of a neural network
00:06:13.860 | is directly inspired by the brain.
00:06:15.740 | You had people like McCulloch and Pitts who were saying,
00:06:17.980 | "Hey, you got these neurons in the brain.
00:06:21.980 | "And hey, we recently learned about the computer
00:06:23.760 | "and automata, can we use some ideas
00:06:25.340 | "from the computer and automata to design
00:06:27.180 | "some kind of computational object
00:06:28.700 | "that's going to be simple, computational,
00:06:31.600 | "and kind of like the brain?"
00:06:32.820 | And they invented the neuron.
00:06:34.420 | So they were inspired by it back then.
00:06:36.020 | Then you had the convolutional neural network
00:06:37.500 | from Fukushima, and then later, Yann LeCun,
00:06:39.940 | who said, "Hey, if you limit the receptive fields
00:06:42.020 | "of a neural network, it's gonna be especially suitable
00:06:44.320 | "for images," as it turned out to be true.
00:06:47.020 | So there was a very small number of examples
00:06:49.980 | where analogies to the brain were successful.
00:06:52.380 | And I thought, well, probably an artificial neuron
00:06:55.140 | is not that different from the brain
00:06:56.780 | if you squint hard enough.
00:06:57.700 | So let's just assume it is and roll with it.
00:07:00.980 | - So we're now at a time where deep learning
00:07:02.820 | is very successful, so let us squint less
00:07:06.500 | and say, let's open our eyes and say,
00:07:09.220 | what to you is an interesting difference
00:07:12.100 | between the human brain, now I know you're probably
00:07:14.700 | not an expert, neither a neuroscientist
00:07:17.460 | or a neurobiologist, but loosely speaking,
00:07:19.780 | what's the difference between the human brain
00:07:21.260 | and artificial neural networks that's interesting to you
00:07:24.060 | for the next decade or two?
00:07:26.340 | - That's a good question to ask.
00:07:27.380 | What is an interesting difference between the brain
00:07:30.940 | and our artificial neural networks?
00:07:32.940 | So I feel like today, artificial neural networks,
00:07:37.140 | so we all agree that there are certain dimensions
00:07:39.420 | in which the human brain vastly outperforms our models.
00:07:43.060 | But I also think that there are some ways
00:07:44.420 | in which our artificial neural networks
00:07:46.220 | have a number of very important advantages over the brain.
00:07:50.200 | Looking at the advantages versus disadvantages
00:07:52.580 | is a good way to figure out what is the important difference.
00:07:55.640 | So the brain uses spikes, which may or may not be important.
00:08:00.140 | - That's a really interesting question.
00:08:01.380 | Do you think it's important or not?
00:08:03.860 | That's one big architectural difference
00:08:06.380 | between artificial neural networks.
00:08:08.380 | - It's hard to tell, but my prior is not very high
00:08:11.700 | and I can say why.
00:08:13.500 | There are people who are interested
00:08:14.340 | in spiking neural networks and basically,
00:08:16.500 | what they figured out is that they need to simulate
00:08:19.260 | the non-spiking neural networks in spikes.
00:08:21.620 | And that's how they're gonna make them work.
00:08:24.300 | If you don't simulate the non-spiking neural networks
00:08:26.300 | in spikes, it's not going to work
00:08:27.760 | because the question is, why should it work?
00:08:29.540 | And that connects to questions around back propagation
00:08:31.820 | and questions around deep learning.
00:08:34.860 | You've got this giant neural network.
00:08:36.900 | Why should it work at all?
00:08:38.420 | Why should the learning rule work at all?
00:08:40.460 | It's not a self-evident question, especially if you,
00:08:45.860 | let's say if you were just starting in the field
00:08:47.540 | and you read the very early papers,
00:08:49.340 | you can say, "Hey," people are saying,
00:08:51.480 | "Let's build neural networks."
00:08:53.740 | That's a great idea because the brain is a neural network,
00:08:55.900 | so it would be useful to build neural networks.
00:08:58.020 | Now let's figure out how to train them.
00:09:00.420 | It should be possible to train them probably, but how?
00:09:03.460 | And so the big idea is the cost function.
00:09:06.340 | That's the big idea.
00:09:08.780 | The cost function is a way of measuring the performance
00:09:11.900 | of the system according to some measure.
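
For concreteness, the cost function Ilya is referring to in the supervised case is usually just the average loss over the training set; in standard notation (mine, not from the conversation):

```latex
% Empirical risk: average loss of the model's prediction f_theta(x_i) against the label y_i
J(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big)
```
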
00:09:14.920 | By the way, that is a big, actually, let me think.
00:09:17.180 | Is that, one, a difficult idea to arrive at
00:09:21.180 | and how big of an idea is that,
00:09:22.740 | that there's a single cost function?
00:09:27.620 | - Sorry, let me take a pause.
00:09:28.900 | Is supervised learning a difficult concept to come to?
00:09:33.340 | - I don't know.
00:09:34.660 | All concepts are very easy in retrospect.
00:09:36.460 | - Yeah, that's what, it seems trivial now, but I,
00:09:38.940 | 'cause the reason I ask that, and we'll talk about it,
00:09:41.460 | 'cause is there other things?
00:09:43.460 | Is there things that don't necessarily have a cost function,
00:09:47.180 | maybe have many cost functions,
00:09:48.640 | or maybe have dynamic cost functions,
00:09:50.900 | or maybe a totally different kind of architectures?
00:09:54.180 | 'Cause we have to think like that
00:09:55.500 | in order to arrive at something new, right?
00:09:57.980 | - So the only, so the good examples of things
00:09:59.940 | which don't have clear cost functions are GANs.
00:10:02.440 | - Right. - In a GAN, you have a game.
00:10:05.740 | So instead of thinking of a cost function
00:10:08.240 | where you wanna optimize,
00:10:09.260 | where you know that you have an algorithm gradient descent,
00:10:12.100 | which will optimize the cost function,
00:10:13.940 | and then you can reason about the behavior of your system
00:10:16.340 | in terms of what it optimizes.
00:10:18.140 | With a GAN, you say, "I have a game,
00:10:20.020 | "and I'll reason about the behavior of the system
00:10:22.160 | "in terms of the equilibrium of the game."
00:10:24.540 | But it's all about coming up with these mathematical objects
00:10:26.540 | that help us reason about the behavior of our system.
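
To make the "game, not a single cost function" point concrete, here is a minimal sketch of the two alternating GAN updates. It is illustrative only: the generator G, discriminator D (assumed to output logits of shape (n, 1)), their optimizers, and the real batch are all assumed to exist, and this is not a claim about any particular implementation discussed here.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real_batch, latent_dim=128):
    """One round of the two-player game: D and G optimize opposing objectives,
    so there is no single scalar cost that both players are descending on."""
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    z = torch.randn(n, latent_dim)
    fake = G(z).detach()  # detach so this step only updates D
    d_loss = F.binary_cross_entropy_with_logits(D(real_batch), ones) + \
             F.binary_cross_entropy_with_logits(D(fake), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: push D(G(z)) toward 1, i.e. fool the current discriminator.
    z = torch.randn(n, latent_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```

The point of the sketch is that we reason about where this alternation settles (an equilibrium), not about the minimum of one objective.
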
00:10:30.140 | - Right, that's really interesting.
00:10:31.180 | Yeah, so GAN is the only one, it's kind of a,
00:10:33.420 | the cost function is emergent from the comparison.
00:10:36.900 | - It's, I don't know if it has a cost function.
00:10:39.020 | I don't know if it's meaningful
00:10:39.860 | to talk about the cost function of a GAN.
00:10:41.360 | It's kind of like the cost function of biological evolution
00:10:44.020 | or the cost function of the economy.
00:10:45.700 | It's, you can talk about regions
00:10:49.460 | to which it will go towards, but I don't think,
00:10:53.780 | I don't think the cost function analogy is the most useful.
00:10:57.500 | - So if evolution doesn't, that's really interesting.
00:11:00.140 | So if evolution doesn't really have a cost function,
00:11:02.700 | like a cost function based on its,
00:11:04.940 | something akin to our mathematical conception
00:11:09.900 | of a cost function, then do you think cost functions
00:11:12.780 | in deep learning are holding us back?
00:11:15.180 | Yeah, so you just kind of mentioned that cost function
00:11:18.320 | is a nice first profound idea.
00:11:21.420 | Do you think that's a good idea?
00:11:23.380 | Do you think it's an idea we'll go past?
00:11:26.780 | So self-play starts to touch on that a little bit
00:11:29.620 | in reinforcement learning systems.
00:11:31.760 | - That's right.
00:11:32.600 | Self-play and also ideas around exploration
00:11:34.740 | where you're trying to take action
00:11:37.020 | that surprise a predictor.
00:11:39.140 | I'm a big fan of cost functions.
00:11:40.540 | I think cost functions are great
00:11:41.700 | and they serve us really well.
00:11:42.780 | And I think that whenever we can do things
00:11:44.580 | with cost functions, we should.
00:11:46.140 | And you know, maybe there is a chance
00:11:49.020 | that we will come up with some,
00:11:50.380 | yet another profound way of looking at things
00:11:52.700 | that will involve cost functions in a less central way.
00:11:55.620 | But I don't know, I think cost functions are, I mean,
00:11:58.180 | I would not bet against cost functions.
00:12:03.100 | - Is there other things about the brain
00:12:05.500 | that pop into your mind that might be different
00:12:08.240 | and interesting for us to consider
00:12:11.060 | in designing artificial neural networks?
00:12:13.540 | So we talked about spiking a little bit.
00:12:16.220 | - I mean, one thing which may potentially be useful,
00:12:18.620 | I think people, neuroscientists have figured out
00:12:20.580 | something about the learning rule of the brain,
00:12:22.220 | I'm talking about spike-timing-dependent plasticity,
00:12:24.860 | and it would be nice if some people
00:12:26.420 | would just study that in simulation.
00:12:28.420 | - Wait, sorry, spike-timing-dependent plasticity?
00:12:30.940 | - Yeah, that's right. - What's that?
00:12:31.860 | - STDP, it's a particular learning rule
00:12:34.020 | that uses spike timing to figure out how to,
00:12:36.740 | to determine how to update the synapses.
00:12:39.660 | So it's kind of like, if a synapse fires into the neuron
00:12:42.580 | before the neuron fires, then it strengthens the synapse.
00:12:46.060 | And if the synapse fires into the neuron
00:12:48.020 | shortly after the neuron fired, then it weakens the synapse.
00:12:50.740 | Something along this line.
00:12:52.220 | I'm 90% sure it's right, so if I said something wrong here,
00:12:56.180 | don't, don't get too angry.
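
A rough pair-based form of the rule Ilya is describing, with the same caveat that this is a simplified textbook version rather than a claim about the brain (and the amplitudes and time constant below are arbitrary illustrative values):

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP: if the presynaptic spike precedes the postsynaptic spike,
    potentiate the synapse; if it follows it, depress it. Times in milliseconds."""
    dt = t_post - t_pre
    if dt > 0:    # pre fired before post -> strengthen, more so for small gaps
        dw = a_plus * np.exp(-dt / tau)
    else:         # pre fired after post -> weaken
        dw = -a_minus * np.exp(dt / tau)
    return np.clip(w + dw, 0.0, 1.0)  # keep the weight in a bounded range
```
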
00:12:59.460 | - But you sound brilliant while saying it.
00:13:01.060 | But the timing, that's one thing that's missing.
00:13:04.200 | The temporal dynamics is not captured.
00:13:07.460 | I think that's like a fundamental property of the brain,
00:13:10.120 | is the timing of the signals.
00:13:13.340 | - Well, you have recurrent neural networks.
00:13:15.460 | But you think of that as this,
00:13:18.060 | I mean, that's a very crude, simplified,
00:13:20.320 | what's that called?
00:13:22.300 | There's a clock, I guess, to recurrent neural networks.
00:13:27.620 | It seems like the brain is the general,
00:13:30.100 | the continuous version of that, the generalization,
00:13:33.340 | where all possible timings are possible,
00:13:36.060 | and then within those timings is contained some information.
00:13:39.900 | You think recurrent neural networks,
00:13:42.020 | the recurrence in recurrent neural networks
00:13:45.460 | can capture the same kind of phenomena
00:13:48.860 | as the timing that seems to be important for the brain,
00:13:53.860 | in the firing of neurons in the brain?
00:13:56.300 | - I mean, I think recurrent neural networks are amazing,
00:14:00.700 | and they can do, I think they can do anything
00:14:03.860 | we'd want them to, we'd want a system to do.
00:14:07.660 | Right now, recurrent neural networks
00:14:09.020 | have been superseded by transformers,
00:14:10.460 | but maybe one day they'll make a comeback,
00:14:12.740 | maybe they'll be back, we'll see.
00:14:14.380 | - Let me, on a small tangent, say,
00:14:17.700 | do you think they'll be back?
00:14:19.080 | So, so much of the breakthroughs recently
00:14:21.320 | that we'll talk about on natural language processing
00:14:24.420 | and language modeling has been with transformers
00:14:28.060 | that don't emphasize recurrence.
00:14:29.980 | Do you think recurrence will make a comeback?
00:14:33.260 | - Well, some kind of recurrence, I think, very likely.
00:14:36.980 | Recurrent neural networks for,
00:14:38.700 | as they're typically thought of for processing sequences,
00:14:42.660 | I think it's also possible.
00:14:44.420 | - What is, to you, a recurrent neural network?
00:14:47.940 | And generally speaking, I guess,
00:14:49.300 | what is a recurrent neural network?
00:14:50.940 | - You have a neural network which maintains
00:14:52.360 | a high-dimensional hidden state.
00:14:54.940 | And then when an observation arrives,
00:14:56.820 | it updates its high-dimensional hidden state
00:14:59.300 | through its connections, in some way.
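
That definition maps directly onto a vanilla RNN cell. Here is a minimal NumPy sketch of a hidden state being updated as each observation arrives; it is illustrative only, with no training loop and arbitrary initialization:

```python
import numpy as np

class VanillaRNNCell:
    """Maintains a hidden state h and updates it whenever an observation x arrives."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(0, 0.1, (hidden_dim, input_dim))   # input-to-hidden weights
        self.W_hh = rng.normal(0, 0.1, (hidden_dim, hidden_dim))  # hidden-to-hidden weights
        self.b = np.zeros(hidden_dim)
        self.h = np.zeros(hidden_dim)                             # the high-dimensional hidden state

    def step(self, x):
        # The new state is a function of the previous state and the new observation,
        # computed through the cell's connections (the weight matrices).
        self.h = np.tanh(self.W_xh @ x + self.W_hh @ self.h + self.b)
        return self.h
```
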
00:15:03.500 | - So do you think, you know,
00:15:05.660 | that's what, like, expert systems did, right?
00:15:08.140 | Symbolic AI, the knowledge-based,
00:15:12.380 | growing a knowledge base is maintaining a hidden state,
00:15:17.240 | which is its knowledge base,
00:15:18.460 | and is growing it by sequentially processing.
00:15:20.300 | Do you think of it more generally in that way?
00:15:22.700 | Or is it simply, is it the more constrained form
00:15:27.700 | of a hidden state with certain kind of gating units
00:15:31.340 | that we think of as today with LSTMs and that?
00:15:34.500 | - I mean, the hidden state is technically
00:15:36.220 | what you described there, the hidden state
00:15:37.860 | that goes inside the LSTM or the RNN or something like this.
00:15:41.380 | But then what should be contained, you know,
00:15:43.260 | if you want to make the expert system analogy, I'm not,
00:15:46.860 | I mean, you could say that the knowledge
00:15:49.660 | is stored in the connections,
00:15:51.100 | and then the short-term processing
00:15:53.220 | is done in the hidden state.
00:15:55.460 | - Yes, could you say that?
00:15:58.340 | - Yes.
00:15:59.180 | - So, sort of, do you think there's a future
00:16:01.100 | of building large-scale knowledge bases
00:16:04.420 | within the neural networks?
00:16:05.620 | - Definitely.
00:16:06.460 | (Lex laughing)
00:16:09.020 | - So, we're gonna pause in that confidence,
00:16:11.160 | 'cause I wanna explore that.
00:16:12.700 | But let me zoom back out and ask,
00:16:14.960 | back to the history of ImageNet.
00:16:19.340 | Neural networks have been around for many decades,
00:16:21.360 | as you mentioned.
00:16:22.740 | What do you think were the key ideas
00:16:24.220 | that led to their success, that ImageNet moment,
00:16:27.380 | and beyond the success in the past 10 years?
00:16:32.380 | - Okay, so the question is,
00:16:33.540 | to make sure I didn't miss anything,
00:16:35.540 | the key ideas that led to the success of deep learning
00:16:38.060 | over the past 10 years.
00:16:39.380 | - Exactly, even though the fundamental thing
00:16:42.900 | behind deep learning has been around for much longer.
00:16:45.380 | - So, the key idea about deep learning,
00:16:50.380 | or rather, the key fact about deep learning
00:16:53.940 | before deep learning started to be successful,
00:16:58.260 | is that it was underestimated.
00:16:59.780 | People who worked in machine learning
00:17:02.900 | simply didn't think that neural networks could do much.
00:17:06.300 | People didn't believe that large neural networks
00:17:08.820 | could be trained.
00:17:10.580 | People thought that, well, there was a lot of debate
00:17:14.460 | going on in machine learning
00:17:15.660 | about what are the right methods and so on.
00:17:17.300 | And people were arguing,
00:17:19.340 | because there was no way to get hard facts.
00:17:23.420 | And by that I mean, there were no benchmarks
00:17:25.460 | which were truly hard,
00:17:26.940 | that if you do really well on them,
00:17:28.460 | then you can say, "Look, here's my system."
00:17:32.580 | That's when you switch from...
00:17:34.060 | That's when this field becomes a little bit more
00:17:37.660 | of an engineering field.
00:17:38.620 | So, in terms of deep learning,
00:17:39.660 | to answer the question directly,
00:17:41.460 | the ideas were all there.
00:17:43.540 | The thing that was missing was a lot of supervised data
00:17:46.820 | and a lot of compute.
00:17:47.940 | Once you have a lot of supervised data and a lot of compute,
00:17:52.620 | then there is a third thing which is needed as well,
00:17:54.740 | and that is conviction.
00:17:56.380 | Conviction that if you take the right stuff,
00:17:59.180 | which already exists,
00:18:00.540 | and apply and mix it with a lot of data
00:18:02.460 | and a lot of compute,
00:18:03.580 | that it will in fact work.
00:18:05.020 | And so that was the missing piece.
00:18:07.780 | It was, you had the...
00:18:08.780 | You needed the data,
00:18:10.660 | you needed the compute,
00:18:11.580 | which showed up in terms of GPUs,
00:18:14.140 | and you needed the conviction to realize
00:18:15.820 | that you need to mix them together.
00:18:17.580 | - So that's really interesting.
00:18:19.420 | So, I guess the presence of compute
00:18:23.140 | and the presence of supervised data
00:18:25.180 | allowed the empirical evidence to do the convincing
00:18:29.660 | of the majority of the computer science community.
00:18:32.060 | So I guess there's a key moment
00:18:33.820 | with Jitendra Malik and Alex,
00:18:38.820 | Alyosha Efros,
00:18:40.340 | who were very skeptical, right?
00:18:44.020 | And then there's a Geoffrey Hinton
00:18:44.020 | that was the opposite of skeptical.
00:18:46.700 | And there was a convincing moment,
00:18:50.260 | and I think ImageNet had served as that moment.
00:18:50.260 | - That's right.
00:18:51.100 | - And that represented this kind of,
00:18:52.940 | where the big pillars of computer vision community
00:18:55.900 | kinda, the wizards got together,
00:18:59.740 | and then all of a sudden there was a shift.
00:19:01.500 | And it's not enough for the ideas to all be there
00:19:05.300 | and the compute to be there,
00:19:06.300 | it's for it to convince the cynicism that existed.
00:19:10.460 | It's interesting that people just didn't believe
00:19:14.060 | for a couple of decades.
00:19:15.940 | - Yeah, well, but it's more than that.
00:19:18.580 | It's kind of, when put this way,
00:19:20.860 | it sounds like, well, you know,
00:19:21.780 | those silly people who didn't believe
00:19:24.300 | what were they missing.
00:19:25.580 | But in reality, things were confusing
00:19:27.540 | because neural networks really did not work on anything.
00:19:30.260 | And they were not the best method
00:19:31.460 | on pretty much anything as well.
00:19:33.580 | And it was pretty rational to say,
00:19:35.820 | yeah, this stuff doesn't have any traction.
00:19:37.980 | And that's why you need to have these very hard tasks
00:19:42.300 | which produce undeniable evidence.
00:19:44.900 | And that's how we make progress.
00:19:46.940 | And that's why the field is making progress today
00:19:48.620 | because we have these hard benchmarks
00:19:50.700 | which represent true progress.
00:19:52.780 | And this is why we are able to avoid endless debate.
00:19:58.340 | - So incredibly, you've contributed
00:20:00.540 | some of the biggest recent ideas in AI
00:20:03.060 | in computer vision, language, natural language processing,
00:20:07.060 | reinforcement learning, sort of everything in between.
00:20:11.320 | Maybe not GANs.
00:20:12.540 | There may not be a topic you haven't touched.
00:20:16.260 | And of course, the fundamental science of deep learning.
00:20:19.660 | What is the difference to you between vision, language,
00:20:24.220 | and as in reinforcement learning, action,
00:20:26.980 | as learning problems?
00:20:28.340 | And what are the commonalities?
00:20:29.580 | Do you see them as all interconnected?
00:20:31.540 | Are they fundamentally different domains
00:20:33.820 | that require different approaches?
00:20:36.780 | - Okay, that's a good question.
00:20:39.660 | Machine learning is a field with a lot of unity,
00:20:41.900 | a huge amount of unity.
00:20:43.240 | In fact-- - What do you mean by unity?
00:20:45.340 | Like overlap of ideas?
00:20:48.380 | - Overlap of ideas, overlap of principles.
00:20:50.180 | In fact, there's only one or two or three principles
00:20:52.700 | which are very, very simple.
00:20:54.380 | And then they apply in almost the same way,
00:20:57.380 | in almost the same way to the different modalities
00:20:59.940 | to the different problems.
00:21:01.380 | And that's why today, when someone writes a paper
00:21:04.140 | on improving optimization of deep learning and vision,
00:21:07.160 | it improves the different NLP applications
00:21:09.300 | and it improves the different
00:21:10.140 | reinforcement learning applications.
00:21:12.340 | Reinforcement learning, so I would say that computer vision
00:21:15.820 | and NLP are very similar to each other.
00:21:18.620 | Today, they differ in that they have
00:21:21.000 | slightly different architectures.
00:21:22.180 | We use transformers in NLP
00:21:23.900 | and we use convolutional neural networks in vision.
00:21:26.500 | But it's also possible that one day this will change
00:21:28.900 | and everything will be unified with a single architecture.
00:21:31.820 | Because if you go back a few years ago
00:21:33.660 | in natural language processing,
00:21:35.440 | there were a huge number of architectures;
00:21:39.340 | every different tiny problem had its own architecture.
00:21:42.240 | Today, there's just one transformer
00:21:45.860 | for all those different tasks.
00:21:47.420 | And if you go back in time even more,
00:21:49.660 | you had even more and more fragmentation
00:21:51.340 | and every little problem in AI
00:21:53.780 | had its own little subspecialization
00:21:55.900 | and sub, you know, little set of collection of skills,
00:21:58.620 | people who would know how to engineer the features.
00:22:00.960 | Now it's all been subsumed by deep learning.
00:22:02.860 | We have this unification.
00:22:04.100 | And so I expect vision to become unified
00:22:06.820 | with natural language as well.
00:22:08.500 | Or rather, I shouldn't say expect, I think it's possible.
00:22:12.780 | I don't wanna be too sure because I think
00:22:13.620 | the convolutional neural net
00:22:15.460 | is very computationally efficient.
00:22:15.460 | RL is different.
00:22:16.820 | RL does require slightly different techniques
00:22:18.840 | because you really do need to take action.
00:22:20.780 | You really do need to do something about exploration.
00:22:23.840 | Your variance is much higher.
00:22:26.020 | But I think there is a lot of unity even there.
00:22:28.180 | And I would expect, for example,
00:22:29.300 | that at some point there will be some
00:22:31.140 | broader unification between RL and supervised learning,
00:22:35.220 | where somehow the RL will be making decisions
00:22:37.140 | to make the supervised learning go better.
00:22:38.540 | And it will be, I imagine one big black box
00:22:41.740 | and you just throw everything, you know,
00:22:43.260 | you shovel things into it and it just figures out
00:22:45.980 | what to do with whatever you shovel at it.
00:22:48.020 | - I mean, reinforcement learning has some aspects
00:22:50.740 | of language and vision combined almost.
00:22:55.140 | There's elements of a long-term memory
00:22:57.740 | that you should be utilizing and there's elements
00:22:59.660 | of a really rich sensory space.
00:23:03.060 | So it seems like the, it's like the union of the two
00:23:06.860 | or something like that.
00:23:08.380 | - I'd say something slightly differently.
00:23:09.980 | I'd say that reinforcement learning is neither,
00:23:12.680 | but it naturally interfaces and integrates
00:23:15.420 | with the two of them.
00:23:17.360 | - You think action is fundamentally different?
00:23:19.280 | So yeah, what is interesting about,
00:23:21.300 | what is unique about policy of learning to act?
00:23:26.020 | - Well, so one example, for instance,
00:23:27.500 | is that when you learn to act,
00:23:29.800 | you are fundamentally in a non-stationary world
00:23:33.220 | because as your actions change,
00:23:35.760 | the things you see start changing.
00:23:38.060 | You experience the world in a different way.
00:23:41.300 | And this is not the case for the more traditional
00:23:44.140 | static problem where you have some distribution
00:23:46.300 | and you just apply a model to that distribution.
00:23:48.600 | - You think it's a fundamentally different problem
00:23:51.180 | or is it just a more difficult,
00:23:53.900 | it's a generalization of the problem of understanding?
00:23:56.980 | - I mean, it's a question of definitions almost.
00:23:59.780 | There is a huge, I mean, no,
00:24:00.600 | there's a huge amount of commonality for sure.
00:24:01.940 | You take gradients, you try, you take gradients,
00:24:04.100 | we try to approximate gradients in both cases.
00:24:06.100 | In some, in the case of reinforcement learning,
00:24:07.900 | you have some tools to reduce the variance
00:24:10.100 | of the gradients, you do that.
00:24:11.900 | There's lots of commonality.
00:24:13.780 | You use the same neural net in both cases.
00:24:16.260 | You compute the gradient, you apply Adam in both cases.
00:24:19.020 | So, I mean, there's lots in common for sure,
00:24:24.220 | but there are some small differences
00:24:26.860 | which are not completely insignificant.
00:24:28.900 | It's really just a matter of your point of view,
00:24:30.940 | what frame of reference you,
00:24:32.660 | how much do you want to zoom in or out
00:24:34.980 | as you look at these problems.
00:24:37.220 | - Which problem do you think is harder?
00:24:39.780 | So people like Noam Chomsky believe
00:24:41.620 | that language is fundamental to everything.
00:24:43.940 | So it underlies everything.
00:24:45.660 | Do you think language understanding
00:24:48.060 | is harder than visual scene understanding or vice versa?
00:24:51.620 | - I think that asking if a problem is hard
00:24:54.620 | is slightly wrong.
00:24:56.220 | I think the question is a little bit wrong
00:24:57.500 | and I want to explain why.
00:24:59.460 | - So what does it mean for a problem to be hard?
00:25:02.620 | Okay, the non-interesting, dumb answer to that
00:25:07.220 | is there's a benchmark
00:25:10.700 | and there's a human level performance on that benchmark.
00:25:13.660 | And how is the effort required
00:25:16.660 | to reach the human level benchmark?
00:25:19.060 | - So from the perspective of how much
00:25:20.620 | until we get to human level on a very good benchmark.
00:25:25.260 | - Yeah, I understand what you mean by that.
00:25:28.900 | So what I was going to say that a lot of it depends on,
00:25:32.060 | you know, once you solve a problem, it stops being hard.
00:25:34.060 | And that's always true.
00:25:36.020 | And so whether something is hard or not
00:25:37.780 | depends on what our tools can do today.
00:25:39.740 | So, you know, you say today, true human level,
00:25:43.700 | language understanding and visual perception
00:25:45.780 | are hard in the sense that there is no way
00:25:48.900 | of solving the problem completely in the next three months.
00:25:51.980 | So I agree with that statement.
00:25:53.900 | Beyond that, I'm just,
00:25:55.420 | my guess would be as good as yours, I don't know.
00:25:57.700 | - Okay, so you don't have a fundamental intuition
00:26:00.340 | about how hard language understanding is.
00:26:02.780 | - Well, I think, I know I changed my mind.
00:26:04.300 | I'd say language is probably going to be harder.
00:26:06.780 | I mean, it depends on how you define it.
00:26:09.180 | Like if you mean absolute top-notch,
00:26:11.220 | 100% language understanding, I'll go with language.
00:26:13.980 | - And so-
00:26:16.140 | - But then if I show you a piece of paper
00:26:17.980 | with letters on it, is that, you see what I mean?
00:26:21.340 | It's like you have a vision system,
00:26:22.620 | you say it's the best human level vision system.
00:26:25.100 | I show you, I open a book and I show you letters.
00:26:28.780 | Will it understand how these letters form
00:26:30.420 | into word and sentences and meaning?
00:26:32.260 | Is this part of the vision problem?
00:26:33.700 | Where does vision end and language begin?
00:26:36.100 | - Yeah, so Chomsky would say it starts at language.
00:26:38.220 | So vision is just a little example
00:26:39.860 | of the kind of structure and fundamental hierarchy
00:26:44.860 | of ideas that's already represented in our brain somehow
00:26:49.060 | that's represented through language.
00:26:51.380 | But where does vision stop and language begin?
00:26:56.380 | That's a really interesting question.
00:27:07.740 | - So one possibility is that it's impossible
00:27:09.900 | to achieve really deep understanding
00:27:12.300 | in either images or language
00:27:15.580 | without basically using the same kind of system.
00:27:18.420 | So you're going to get the other for free.
00:27:20.620 | - I think it's pretty likely that yes,
00:27:23.100 | if we can get one, our machine learning
00:27:25.380 | is probably that good that we can get the other.
00:27:27.340 | But I'm not 100% sure.
00:27:30.180 | And also, I think a lot of it really does depend
00:27:34.540 | on your definitions.
00:27:36.700 | - Definitions of?
00:27:38.020 | - Like perfect vision.
00:27:40.020 | Because reading is vision, but should it count?
00:27:43.300 | - Yeah, to me, so my definition is
00:27:46.580 | if a system looked at an image
00:27:48.820 | and then a system looked at a piece of text
00:27:52.220 | and then told me something about that
00:27:56.020 | and I was really impressed.
00:27:57.460 | - That's relative.
00:27:59.460 | You'll be impressed for half an hour
00:28:01.260 | and then you're gonna say, well,
00:28:02.260 | I mean, all the systems do that,
00:28:03.420 | but here's the thing they don't do.
00:28:05.180 | - Yeah, but I don't have that with humans.
00:28:07.100 | Humans continue to impress me.
00:28:08.900 | - Is that true?
00:28:09.740 | - Well, the ones, okay, so I'm a fan of monogamy,
00:28:14.020 | so I like the idea of marrying somebody,
00:28:16.020 | being with them for several decades.
00:28:18.100 | So I believe in the fact that yes,
00:28:20.020 | it's possible to have somebody continuously giving you
00:28:22.980 | pleasurable, interesting, witty, new ideas, friends.
00:28:28.620 | Yeah, I think so.
00:28:29.980 | They continue to surprise you.
00:28:32.100 | - The surprise, it's that injection of randomness
00:28:37.100 | seems to be a nice source of, yeah,
00:28:42.940 | continued inspiration, like the wit, the humor.
00:28:48.780 | I think, yeah, that would be,
00:28:53.700 | it's a very subjective test,
00:28:55.020 | but I think if you have enough humans in the room.
00:28:58.620 | - Yeah, I understand what you mean.
00:29:00.580 | Yeah, I feel like I misunderstood
00:29:02.140 | what you meant by impressing you.
00:29:03.100 | I thought you meant to impress you with its intelligence,
00:29:06.580 | with how well it understands an image.
00:29:10.220 | I thought you meant something like,
00:29:11.740 | I'm gonna show it a really complicated image
00:29:13.300 | and it's gonna get it right, and you're gonna say, wow,
00:29:15.140 | that's really cool.
00:29:15.980 | Our systems of January 2020 have not been doing that.
00:29:19.980 | - Yeah, no, I think it all boils down to
00:29:22.300 | like the reason people click like on stuff on the internet,
00:29:26.140 | which is like, it makes them laugh.
00:29:28.380 | So it's like humor or wit or insight.
00:29:32.780 | - I'm sure we'll get that as well.
00:29:35.460 | - So forgive the romanticized question,
00:29:38.220 | but looking back to you,
00:29:40.500 | what is the most beautiful or surprising idea
00:29:43.180 | in deep learning or AI in general you've come across?
00:29:46.860 | - So I think the most beautiful thing about deep learning
00:29:49.260 | is that it actually works.
00:29:51.740 | And I mean it, because you got these ideas,
00:29:53.220 | you got the little neural network,
00:29:54.740 | you got the back propagation algorithm.
00:29:57.700 | And then you got some theories as to, you know,
00:30:00.660 | this is kind of like the brain.
00:30:02.060 | So maybe if you make it large,
00:30:03.620 | if you make the neural network large
00:30:04.860 | and you train it on a lot of data,
00:30:05.940 | then it will do the same function that the brain does.
00:30:09.700 | And it turns out to be true, that's crazy.
00:30:12.500 | And now we just train these neural networks
00:30:14.180 | and you make them larger and they keep getting better.
00:30:16.700 | And I find it unbelievable.
00:30:17.900 | I find it unbelievable that this whole AI stuff
00:30:20.620 | with neural networks works.
00:30:22.500 | - Have you built up an intuition of why
00:30:24.980 | are there little bits and pieces of intuitions,
00:30:27.980 | of insights of why this whole thing works?
00:30:31.380 | - I mean, some definitely.
00:30:33.260 | While we know that optimization,
00:30:35.060 | we now have good, you know,
00:30:36.660 | we've had lots of empirical,
00:30:40.620 | you know, huge amounts of empirical reasons
00:30:42.380 | to believe that optimization should work
00:30:44.300 | on most problems we care about.
00:30:46.260 | - Do you have insights of why,
00:30:48.700 | so you just said empirical evidence.
00:30:50.780 | Is most of your,
00:30:54.820 | sort of empirical evidence kind of convinces you,
00:30:58.420 | it's like evolution is empirical.
00:31:00.380 | It shows you that, look, this evolutionary process
00:31:02.940 | seems to be a good way to design organisms
00:31:06.420 | that survive in their environment.
00:31:08.280 | But it doesn't really get you to the insights
00:31:11.420 | of how the whole thing works.
00:31:13.980 | - I think a good analogy is physics.
00:31:16.500 | You know how you say, hey,
00:31:17.580 | let's do some physics calculation
00:31:19.060 | and come up with some new physics theory
00:31:20.540 | and make some prediction.
00:31:21.780 | But then you got around the experiment.
00:31:23.980 | You know, you got around the experiment, it's important.
00:31:26.100 | So it's a bit the same here,
00:31:27.460 | except that maybe sometimes the experiment
00:31:29.780 | came before the theory.
00:31:31.060 | But it still is the case.
00:31:32.100 | You know, you have some data
00:31:33.860 | and you come up with some prediction.
00:31:35.020 | You say, yeah, let's make a big neural network.
00:31:36.580 | Let's train it.
00:31:37.420 | And it's going to work much better than anything before it.
00:31:39.860 | And it will in fact continue to get better
00:31:41.460 | as you make it larger.
00:31:42.740 | And it turns out to be true.
00:31:43.620 | That's amazing when a theory is validated like this.
00:31:46.980 | You know, it's not a mathematical theory.
00:31:48.780 | It's more of a biological theory almost.
00:31:51.740 | So I think there are not terrible analogies
00:31:53.980 | between deep learning and biology.
00:31:55.580 | I would say it's like the geometric mean
00:31:57.540 | of biology and physics.
00:31:58.780 | That's deep learning.
00:32:00.260 | - The geometric mean of biology and physics.
00:32:03.860 | I think I'm going to need a few hours
00:32:05.140 | to wrap my head around that.
00:32:06.540 | 'Cause just to find the geometric,
00:32:10.460 | just to find the set of what biology represents.
00:32:15.460 | - Well, biology, in biology,
00:32:18.020 | things are really complicated.
00:32:19.460 | The theories are really, really,
00:32:21.020 | it's really hard to have good predictive theory.
00:32:22.820 | And in physics, the theories are too good.
00:32:25.380 | In physics, people make these super precise theories
00:32:27.900 | which make these amazing predictions.
00:32:29.340 | And in machine learning, we're kind of in between.
00:32:31.460 | - Kind of in between, but it'd be nice
00:32:33.820 | if machine learning somehow helped us discover
00:32:36.460 | the unification of the two
00:32:37.740 | as opposed to sort of the in between.
00:32:39.540 | But you're right.
00:32:42.100 | You're kind of trying to juggle both.
00:32:44.940 | So do you think there are still beautiful
00:32:46.780 | and mysterious properties in neural networks
00:32:48.820 | that are yet to be discovered?
00:32:50.180 | - Definitely.
00:32:51.380 | I think that we are still massively
00:32:52.900 | underestimating deep learning.
00:32:54.380 | - What do you think it'll look like?
00:32:56.660 | Like what?
00:32:58.260 | - If I knew, I would have done it.
00:32:59.860 | But if you look at all the progress
00:33:04.060 | from the past 10 years, I would say most of it,
00:33:07.060 | I would say there've been a few cases
00:33:08.900 | where things that felt like really new ideas showed up.
00:33:12.900 | But by and large, it was every year we thought,
00:33:15.380 | okay, deep learning goes this far.
00:33:17.220 | Nope, it actually goes further.
00:33:19.020 | And then the next year, okay, now this is big deep learning.
00:33:22.500 | We are really done.
00:33:23.340 | Nope, it goes further.
00:33:24.460 | It just keeps going further each year.
00:33:26.060 | So that means that we keep underestimating,
00:33:27.620 | we keep not understanding it.
00:33:29.180 | It has surprising properties all the time.
00:33:31.420 | - Do you think it's getting harder and harder
00:33:33.620 | to make progress?
00:33:34.460 | - Need to make progress?
00:33:36.020 | - It depends on what you mean.
00:33:36.860 | I think the field will continue to make
00:33:37.980 | very robust progress for quite a while.
00:33:41.180 | I think for individual researchers,
00:33:42.820 | especially people who are doing research,
00:33:46.140 | it can be harder because there is a very large number
00:33:48.260 | of researchers right now.
00:33:50.100 | I think that if you have a lot of compute,
00:33:51.820 | then you can make a lot of very interesting discoveries,
00:33:54.740 | but then you have to deal with the challenge
00:33:57.460 | of managing a huge compute cluster to run your experiments.
00:34:02.460 | It's a little bit harder.
00:34:03.300 | - So I'm asking all these questions
00:34:04.940 | that nobody knows the answer to,
00:34:06.460 | but you're one of the smartest people I know,
00:34:08.300 | so I'm gonna keep asking.
00:34:09.500 | So let's imagine all the breakthroughs that happen
00:34:12.900 | in the next 30 years in deep learning.
00:34:15.260 | Do you think most of those breakthroughs can be done
00:34:17.780 | by one person with one computer?
00:34:20.860 | Sort of in the space of breakthroughs,
00:34:23.780 | do you think compute and large efforts will be necessary?
00:34:28.780 | - I mean, I can't be sure.
00:34:33.900 | When you say one computer, you mean how large?
00:34:36.580 | (Lex laughing)
00:34:38.820 | - You're clever.
00:34:40.780 | I mean one GPU.
00:34:42.660 | - I see.
00:34:43.940 | I think it's pretty unlikely.
00:34:47.540 | I think it's pretty unlikely.
00:34:48.700 | I think that there are many...
00:34:51.020 | The stack of deep learning is starting to be quite deep.
00:34:53.780 | If you look at it, you've got all the way from the ideas,
00:34:59.660 | the systems to build the data sets,
00:35:02.180 | the distributed programming,
00:35:04.180 | the building the actual cluster, the GPU programming,
00:35:08.140 | putting it all together.
00:35:08.980 | So now the stack is getting really deep,
00:35:10.580 | and I think it can be quite hard for a single person
00:35:14.100 | to become, to be world-class
00:35:15.660 | in every single layer of the stack.
00:35:17.900 | - What about what like Vladimir Vapnik really insists on
00:35:22.100 | is taking MNIST and trying to learn from very few examples.
00:35:25.980 | So being able to learn more efficiently.
00:35:29.060 | Do you think there'll be breakthroughs in that space
00:35:32.060 | that may not need the huge compute?
00:35:34.860 | - I think there will be a large number of breakthroughs
00:35:37.900 | in general that will not need a huge amount of compute.
00:35:40.620 | So maybe I should clarify that.
00:35:42.100 | I think that some breakthroughs will require a lot of compute
00:35:45.380 | and I think building systems which actually do things
00:35:48.700 | will require a huge amount of compute.
00:35:50.180 | That one is pretty obvious.
00:35:51.340 | If you want to do X and X requires a huge neural net,
00:35:54.700 | you gotta get a huge neural net.
00:35:56.540 | But I think there will be lots of,
00:35:59.340 | I think there is lots of room for very important work
00:36:02.500 | being done by small groups and individuals.
00:36:05.140 | - Can you maybe sort of on the topic of the science
00:36:08.420 | of deep learning, talk about one of the recent papers
00:36:11.980 | that you released, the deep double descent.
00:36:15.700 | Where bigger models and more data hurt.
00:36:18.180 | I think it's a really interesting paper.
00:36:19.660 | Can you describe the main idea?
00:36:22.340 | - Yeah, definitely.
00:36:23.580 | So what happened is that some, over the years,
00:36:27.020 | some small number of researchers noticed that
00:36:29.580 | it is kind of weird that when you make
00:36:30.780 | the neural network larger, it works better
00:36:32.180 | and it seems to go in contradiction with statistical ideas.
00:36:34.660 | And then some people made an analysis showing
00:36:36.940 | that actually you got this double descent bump.
00:36:38.940 | And what we've done was to show that double descent occurs
00:36:42.780 | for pretty much all practical deep learning systems.
00:36:46.420 | And that it'll be also-- - So can you step back?
00:36:49.940 | What's the X axis and the Y axis of a double descent plot?
00:36:55.980 | Okay, great.
00:36:57.020 | So you can look, you can do things like,
00:37:02.020 | you can take a neural network
00:37:04.100 | and you can start increasing its size slowly
00:37:07.620 | while keeping your dataset fixed.
00:37:10.020 | So if you increase the size of the neural network slowly,
00:37:14.780 | and if you don't do early stopping,
00:37:16.900 | that's a pretty important detail.
00:37:19.020 | Then when the neural network is really small,
00:37:22.500 | you make it larger,
00:37:23.580 | you get a very rapid increase in performance.
00:37:26.060 | Then you continue to make it larger.
00:37:27.300 | And at some point performance will get worse.
00:37:30.180 | And it gets the worst exactly at the point
00:37:34.020 | at which it achieves zero training error,
00:37:36.260 | precisely zero training loss.
00:37:38.660 | And then as you make it larger, it starts to get better again.
00:37:41.500 | And it's kind of counterintuitive
00:37:42.820 | because you'd expect deep learning phenomena
00:37:44.580 | to be monotonic.
00:37:46.820 | And it's hard to be sure what it means,
00:37:50.020 | but it also occurs in the case of linear classifiers.
00:37:53.140 | And the intuition basically boils down to the following.
00:37:55.940 | When you have a large dataset and a small model,
00:38:02.020 | then small, tiny random, so basically what is overfitting?
00:38:07.100 | Overfitting is when your model is somehow very sensitive
00:38:11.980 | to the small, random, unimportant stuff in your dataset.
00:38:16.060 | - In the training dataset.
00:38:16.980 | - In the training dataset, precisely.
00:38:18.980 | So if you have a small model and you have a big dataset,
00:38:23.380 | and there may be some random thing,
00:38:24.780 | some training cases are randomly in the dataset
00:38:27.460 | and others may not be there,
00:38:29.100 | but the small model is kind of insensitive
00:38:31.660 | to this randomness because there is pretty much
00:38:35.260 | no uncertainty about the model.
00:38:37.060 | When the dataset is large.
00:38:38.340 | - So, okay, so at the very basic level to me,
00:38:41.180 | it is the most surprising thing
00:38:43.340 | that neural networks don't overfit every time,
00:38:48.340 | very quickly, before ever being able to learn anything.
00:38:53.500 | There are a huge number of parameters.
00:38:56.300 | - So here is, so there is one way, okay,
00:38:57.660 | so maybe, so let me try to give the explanation
00:39:00.220 | and maybe that will be, that will work.
00:39:02.020 | So you've got a huge neural network.
00:39:03.620 | Let's suppose you've got a,
00:39:04.980 | you have a huge neural network,
00:39:07.660 | you have a huge number of parameters.
00:39:09.780 | Now let's pretend everything is linear,
00:39:11.380 | which is not, let's just pretend.
00:39:13.100 | Then there is this big subspace
00:39:15.540 | where your neural network achieves zero error.
00:39:18.060 | And SGD is going to find approximately the point--
00:39:21.220 | - Stochastic gradient, that's right.
00:39:22.620 | - Approximately the point with the smallest norm
00:39:24.500 | in that subspace.
00:39:25.500 | - Okay.
00:39:27.540 | - And that can also be proven to be insensitive
00:39:30.260 | to the small randomness in the data
00:39:33.500 | when the dimensionality is high.
00:39:35.380 | But when the dimensionality of the data
00:39:37.220 | is equal to the dimensionality of the model,
00:39:39.380 | then there is a one-to-one correspondence
00:39:41.060 | between all the datasets and the models.
00:39:44.420 | So small changes in the dataset
00:39:45.700 | actually lead to large changes in the model
00:39:47.380 | and that's why performance gets worse.
00:39:48.860 | So this is the best explanation, more or less.
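
The linear version of this picture is easy to reproduce. The sketch below is illustrative, with arbitrary data-generating choices: it fits minimum-norm least-squares solutions (what was just described as the solution SGD finds in the linear, overparameterized case) on random features of increasing width. The test error typically spikes near the point where the number of features matches the number of training examples and then falls again as the model grows past it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)   # noisy linear target
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# A fixed random projection defines the features; widening it grows the model.
V = rng.normal(size=(d, 2000)) / np.sqrt(d)

for width in [10, 50, 90, 100, 110, 200, 500, 2000]:
    Phi_tr = np.tanh(X_tr @ V[:, :width])       # random nonlinear features
    Phi_te = np.tanh(X_te @ V[:, :width])
    # Minimum-norm least-squares fit; no early stopping, no regularization.
    beta = np.linalg.pinv(Phi_tr) @ y_tr
    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"width={width:5d}  test MSE={test_mse:.3f}")
```

The peak sits where the model has just enough capacity to interpolate the training set, which is the one-to-one regime described above.
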
00:39:51.100 | - So then it would be good for the model
00:39:54.020 | to have more parameters, so to be bigger than the data.
00:39:58.660 | - That's right, but only if you don't early stop.
00:40:00.860 | If you introduce early stop in your regularization,
00:40:02.860 | you can make the double descent bump
00:40:04.660 | almost completely disappear.
00:40:06.140 | - What is early stop?
00:40:07.140 | - Early stop is when you train your model
00:40:09.980 | and you monitor your validation performance.
00:40:12.820 | And then if at some point,
00:40:14.540 | validation performance starts to get worse,
00:40:15.980 | you say, "Okay, let's stop training.
00:40:17.620 | "We are good, we are good enough."
00:40:20.060 | - So the magic happens after that moment,
00:40:23.220 | so you don't wanna do the early stopping.
00:40:25.100 | - Well, if you don't do the early stopping,
00:40:26.700 | you get the very pronounced double descent.
00:40:30.780 | - Do you have any intuition why this happens?
00:40:33.500 | - Double descent?
00:40:34.340 | Oh, sorry, early stopping?
00:40:35.540 | - No, the double descent.
00:40:37.180 | - Well, yeah, so I try, let's see.
00:40:38.860 | The intuition is basically, is this,
00:40:41.260 | that when the dataset has as many degrees of freedom
00:40:45.660 | as the model, then there is a one-to-one correspondence
00:40:49.100 | between them and so small changes to the dataset
00:40:52.180 | lead to noticeable changes in the model.
00:40:55.100 | So your model is very sensitive to all the randomness.
00:40:57.340 | It is unable to discard it,
00:40:59.620 | whereas it turns out that when you have
00:41:02.940 | a lot more data than parameters
00:41:04.700 | or a lot more parameters than data,
00:41:06.660 | the resulting solution will be insensitive
00:41:08.900 | to small changes in the dataset.
00:41:10.540 | - Oh, so it's able to, that's nicely put,
00:41:13.540 | discard the small changes, the randomness.
00:41:16.500 | - Exactly, the spurious correlation which you don't want.
00:41:20.580 | - Geoff Hinton suggested we need to throw away back propagation.
00:41:23.540 | We already kind of talked about this a little bit,
00:41:25.260 | but he suggested that we need to throw away
00:41:27.220 | back propagation and start over.
00:41:29.820 | I mean, of course, some of that is a little bit
00:41:32.220 | wit and humor, but what do you think,
00:41:36.580 | what could be an alternative method
00:41:38.020 | of training neural networks?
00:41:39.640 | - Well, the thing that he said precisely is that
00:41:42.180 | to the extent that you can't find back propagation
00:41:44.100 | in the brain, it's worth seeing if we can learn something
00:41:47.680 | from how the brain learns, but back propagation
00:41:49.940 | is very useful and we should keep using it.
00:41:52.420 | - Oh, you're saying that once we discover
00:41:54.580 | the mechanism of learning in the brain
00:41:56.360 | or any aspects of that mechanism,
00:41:58.140 | we should also try to implement that in neural networks?
00:42:00.660 | - If it turns out that we can't find back propagation
00:42:02.940 | in the brain.
00:42:03.780 | - If we can't find back propagation in the brain.
00:42:06.180 | Well, so I guess your answer to that is
00:42:11.860 | back propagation is pretty damn useful,
00:42:13.900 | so why are we complaining?
00:42:16.020 | - I mean, I personally am a big fan of back propagation.
00:42:18.460 | I think it's a great algorithm because it solves
00:42:20.380 | an extremely fundamental problem, which is
00:42:23.100 | finding a neural circuit subject to some constraints.
00:42:27.880 | I don't see that problem going away,
00:42:30.440 | so that's why I really, I think it's pretty unlikely
00:42:35.000 | that we'll have anything which is going to be
00:42:37.360 | dramatically different.
00:42:38.680 | It could happen, but I wouldn't bet on it right now.
00:42:41.420 | - So let me ask a sort of big picture question.
00:42:46.840 | Do you think neural networks can be made to reason?
00:42:51.720 | - Why not?
00:42:53.380 | - Well, if you look, for example, at AlphaGo or AlphaZero,
00:42:56.880 | the neural network of AlphaZero plays Go,
00:43:01.740 | which we all agree is a game that requires reasoning,
00:43:05.020 | better than 99.9% of all humans.
00:43:08.540 | Just the neural network, without the search,
00:43:10.300 | just the neural network itself.
00:43:12.300 | Doesn't that give us an existence proof
00:43:15.140 | that neural networks can reason?
00:43:16.740 | - To push back and disagree a little bit,
00:43:19.560 | we all agree that Go is reasoning.
00:43:22.200 | I think I agree.
00:43:24.820 | I don't think it's a trivial, so obviously,
00:43:26.820 | reasoning, like intelligence, is a loose,
00:43:30.640 | gray area term, a little bit.
00:43:32.600 | Maybe you disagree with that.
00:43:34.040 | But yes, I think it has some of the same elements
00:43:38.020 | of reasoning.
00:43:39.380 | Reasoning is almost akin to search, right?
00:43:43.180 | There's a sequential element of step-wise consideration
00:43:49.180 | of possibilities, and sort of building on top
00:43:53.320 | of those possibilities in a sequential manner
00:43:55.240 | until you arrive at some insight.
00:43:57.640 | So yeah, I guess playing Go is kind of like that.
00:44:00.520 | And when you have a single neural network doing that
00:44:02.840 | without search, it's kind of like that.
00:44:04.920 | So there's an existence proof in a particular
00:44:06.760 | constrained environment that a process akin to
00:44:11.000 | what many people call reasoning exists.
00:44:13.920 | But more general kind of reasoning.
00:44:17.180 | - So off the board.
00:44:18.880 | - There is one other existence proof.
00:44:20.440 | - Oh boy, which one?
00:44:22.160 | Us humans?
00:44:23.000 | - Yes.
00:44:23.820 | - Okay.
00:44:24.660 | All right, so do you think the architecture
00:44:28.840 | that will allow neural networks to reason
00:44:33.400 | will look similar to the neural network architectures
00:44:37.400 | we have today?
00:44:38.880 | - I think it will.
00:44:39.700 | I think, well, I don't wanna make
00:44:41.760 | two overly definitive statements.
00:44:44.080 | I think it's definitely possible that
00:44:46.680 | the neural networks that will produce
00:44:48.520 | the reasoning breakthroughs of the future
00:44:50.240 | will be very similar to the architectures that exist today.
00:44:53.640 | Maybe a little bit more recurrent,
00:44:55.360 | maybe a little bit deeper.
00:44:57.100 | But these neural nets are so insanely powerful.
00:45:02.100 | Why wouldn't they be able to learn to reason?
00:45:05.560 | Humans can reason, so why can't neural networks?
00:45:09.320 | - So do you think the kind of stuff we've seen
00:45:11.640 | neural networks do is a kind of just weak reasoning?
00:45:14.660 | So it's not a fundamentally different process?
00:45:16.600 | Again, this is stuff nobody knows the answer to.
00:45:19.680 | - So when it comes to our neural networks,
00:45:23.000 | the thing which I would say is that
00:45:24.720 | neural networks are capable of reasoning.
00:45:27.240 | But if you train a neural network on a task
00:45:30.560 | which doesn't require reasoning, it's not going to reason.
00:45:34.020 | This is a well-known effect where the neural network
00:45:36.360 | will solve exactly the, it will solve the problem
00:45:39.320 | that you pose in front of it in the easiest way possible.
00:45:44.440 | - Right, that takes us to the,
00:45:47.140 | to one of the brilliant sort of ways
00:45:51.560 | you've described neural networks, which is,
00:45:54.220 | you've referred to neural networks
00:45:55.480 | as the search for small circuits.
00:45:57.920 | And maybe general intelligence
00:46:01.180 | as the search for small programs,
00:46:03.360 | which I found as a metaphor very compelling.
00:46:06.960 | Can you elaborate on that difference?
00:46:09.200 | - Yeah, so the thing which I said precisely was that
00:46:13.720 | if you can find the shortest program
00:46:17.280 | that outputs the data at your disposal,
00:46:20.920 | then you will be able to use it
00:46:22.260 | to make the best prediction possible.
00:46:24.260 | And that's a theoretical statement
00:46:27.000 | which can be proved mathematically.
00:46:29.240 | Now, you can also prove mathematically that it is,
00:46:32.440 | that finding the shortest program
00:46:33.920 | which generates some data is not a computable operation.
00:46:38.920 | No finite amount of compute can do this.
00:46:42.760 | So then with neural networks,
00:46:46.080 | neural networks are the next best thing
00:46:47.940 | that actually works in practice.
00:46:50.160 | We are not able to find the best,
00:46:52.880 | the shortest program which generates our data,
00:46:55.760 | but we are able to find, you know, a small,
00:46:58.880 | but now that statement should be amended,
00:47:01.620 | even a large circuit which fits our data in some way.
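
The "shortest program" claim a few lines above is the standard statement about Kolmogorov complexity. A compact way to write it, in textbook notation rather than anything from the conversation:

```latex
% The "shortest program" in question is the Kolmogorov complexity of the data x,
% defined with respect to a fixed universal machine U:
K_U(x) \;=\; \min \{\, |p| \;:\; U(p) = x \,\}
% K_U is well defined but not computable: if it were, a short program could
% search for the first string x_n with K_U(x_n) \ge n and print it, producing a
% string of complexity at least n from a program of length only O(\log n) --
% a contradiction (a Berry-paradox style argument).
```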
00:47:05.320 | - Well, I think what you meant by the small circuit
00:47:07.200 | is the smallest needed circuit.
00:47:10.000 | - Well, the thing which I would change now,
00:47:12.360 | back then I really haven't fully internalized
00:47:14.800 | the over-parameterized results.
00:47:17.080 | The things we know about over-parameterized neural nets,
00:47:20.480 | now I would phrase it as a large circuit
00:47:23.200 | whose weights contain a small amount of information,
00:47:27.800 | which I think is what's going on.
00:47:29.200 | If you imagine the training process of a neural network
00:47:31.520 | as you slowly transmit entropy
00:47:33.800 | from the dataset to the parameters,
00:47:37.080 | then somehow the amount of information in the weights
00:47:41.080 | ends up being not very large,
00:47:42.960 | which would explain why they generalize so well.
00:47:45.240 | - So that's, the large circuit might be one that's helpful
00:47:49.400 | for the generalization.
00:47:51.960 | - Yeah, something like this.
00:47:53.360 | - But do you see it important to be able to try
00:47:59.680 | to learn something like programs?
00:48:02.480 | - I mean, if we can, definitely.
00:48:04.880 | I think it's kind of, the answer is kind of yes,
00:48:08.200 | if we can do it.
00:48:09.160 | We should do it.
00:48:11.200 | The reason we are pushing on deep learning,
00:48:14.140 | the fundamental reason, the root cause
00:48:18.840 | is that we are able to train them.
00:48:20.540 | So in other words, training comes first.
00:48:23.920 | We've got our pillar, which is the training pillar.
00:48:27.560 | And now we are trying to contort our neural networks
00:48:30.040 | around the training pillar.
00:48:30.920 | We gotta stay trainable.
00:48:31.960 | This is an invariant we cannot violate.
00:48:36.440 | And so being trainable means starting from scratch,
00:48:40.600 | knowing nothing, you can actually pretty quickly
00:48:42.880 | converge towards knowing a lot or even slowly.
00:48:45.920 | But it means that given the resources at your disposal,
00:48:49.540 | you can train the neural net
00:48:52.440 | and get it to achieve useful performance.
00:48:55.440 | - Yeah, that's a pillar we can't move away from.
00:48:57.480 | - That's right, because if you can,
00:48:58.520 | and whereas if you say, hey, let's find the shortest program,
00:49:01.480 | well, we can't do that.
00:49:02.840 | So it doesn't matter how useful that would be.
00:49:06.080 | We can't do it, so we won't.
00:49:08.480 | - So do you think, you kind of mentioned
00:49:09.920 | that neural networks are good at finding small circuits
00:49:12.240 | or large circuits.
00:49:13.420 | Do you think then the matter of finding small programs
00:49:17.560 | is just the data?
00:49:19.320 | - No.
00:49:20.160 | - So the, sorry, not the size or character,
00:49:23.880 | the type of data.
00:49:28.980 | Sort of, like, giving it programs.
00:49:28.980 | - Well, I think the thing is that right now,
00:49:32.000 | finding, there are no good precedents
00:49:34.600 | of people successfully finding programs really well.
00:49:38.960 | And so the way you'd find programs
00:49:40.680 | is you'd train a deep neural network to do it basically.
00:49:44.360 | - Right.
00:49:45.200 | - Which is the right way to go about it.
00:49:48.160 | - But there's not good illustrations of that.
00:49:50.720 | - It hasn't been done yet,
00:49:51.920 | but in principle, it should be possible.
00:49:54.320 | - Can you elaborate a little bit?
00:49:58.240 | What's your insight in principle?
00:49:59.880 | And put another way, you don't see why it's not possible.
00:50:04.200 | - Well, it's kind of like more, it's more a statement of,
00:50:07.920 | I think that it's unwise to bet against deep learning.
00:50:13.440 | And if it's a cognitive function
00:50:16.960 | that humans seem to be able to do,
00:50:18.720 | then it doesn't take too long for some deep neural net
00:50:23.240 | to pop up that can do it too.
00:50:24.680 | - Yeah, I'm there with you.
00:50:27.840 | I've stopped betting against neural networks at this point
00:50:33.160 | because they continue to surprise us.
00:50:35.720 | What about long-term memory?
00:50:37.280 | Can neural networks have long-term memory
00:50:39.000 | or something like knowledge bases?
00:50:42.200 | So being able to aggregate important information
00:50:45.520 | over long periods of time that would then serve
00:50:49.400 | as useful sort of representations of state
00:50:54.400 | that you can make decisions by.
00:50:57.760 | So have a long-term context
00:50:59.560 | based on what you make in the decision.
00:51:01.600 | - So in some sense, the parameters already do that.
00:51:04.840 | The parameters are an aggregation
00:51:07.920 | of the entirety of the neural net's experience.
00:51:10.920 | And so they count as the long-term knowledge.
00:51:14.280 | And people have trained various neural nets
00:51:17.800 | to act as knowledge bases and, you know,
00:51:20.200 | people have investigated language models as knowledge bases.
00:51:23.720 | So there is work, there is work there.
00:51:27.320 | - Yeah, but in some sense, do you think in every sense,
00:51:29.880 | do you think there's a, it's all just a matter
00:51:34.880 | of coming up with a better mechanism
00:51:36.720 | of forgetting the useless stuff
00:51:38.440 | and remembering the useful stuff?
00:51:40.240 | 'Cause right now, I mean, there's not been mechanisms
00:51:43.080 | that do remember really long-term information.
00:51:46.880 | - What do you mean by that precisely?
00:51:48.880 | - Precisely, I like the word precisely.
00:51:51.760 | So I'm thinking of the kind of compression of information
00:51:58.160 | the knowledge bases represent, sort of creating a,
00:52:02.960 | now, I apologize for my sort of human-centric thinking
00:52:06.920 | about what knowledge is, 'cause neural networks
00:52:10.360 | aren't interpretable necessarily
00:52:12.920 | with the kind of knowledge they have discovered.
00:52:15.800 | But a good example for me is knowledge bases,
00:52:18.740 | being able to build up over time something like
00:52:21.320 | the knowledge that Wikipedia represents.
00:52:24.120 | It's a really compressed, structured,
00:52:27.560 | (scoffs)
00:52:29.760 | knowledge base.
00:52:30.840 | Obviously not the actual Wikipedia or the language,
00:52:34.360 | but like a semantic web,
00:52:35.720 | the dream that semantic web represented.
00:52:37.920 | So it's a really nice compressed knowledge base,
00:52:40.360 | or something akin to that in a non-interpretable sense
00:52:44.560 | as neural networks would have.
00:52:46.980 | - Well, the neural networks would be non-interpretable
00:52:48.560 | if you look at their weights,
00:52:49.440 | but their outputs should be very interpretable.
00:52:52.200 | - Okay, so yeah, how do you make very smart neural networks
00:52:55.840 | like language models interpretable?
00:52:58.080 | - Well, you ask them to generate some text,
00:53:00.280 | and the text will generally be interpretable.
00:53:02.120 | - Do you find that the epitome of interpretability,
00:53:04.720 | like can you do better?
00:53:06.160 | 'Cause you can't, okay, I would like to know
00:53:09.480 | what does it know and what doesn't it know?
00:53:12.240 | I would like the neural network to come up with examples
00:53:15.720 | where it's completely dumb,
00:53:17.960 | and examples where it's completely brilliant.
00:53:20.320 | And the only way I know how to do that now
00:53:22.280 | is to generate a lot of examples and use my human judgment.
00:53:26.480 | But it would be nice if a neural network
00:53:28.200 | had some self-awareness about it.
00:53:31.760 | - Yeah, 100%.
00:53:33.400 | I'm a big believer in self-awareness,
00:53:34.840 | and I think neural net self-awareness
00:53:39.840 | will allow for things like the capabilities,
00:53:42.600 | like the ones you described,
00:53:43.680 | like for them to know what they know
00:53:45.560 | and what they don't know,
00:53:47.040 | and for them to know where to invest
00:53:48.760 | to increase their skills most optimally.
00:53:50.840 | And to your question of interpretability,
00:53:52.280 | there are actually two answers to that question.
00:53:54.360 | One answer is, you know, we have the neural net,
00:53:56.480 | so we can analyze the neurons,
00:53:58.520 | and we can try to understand what the different neurons
00:54:00.640 | and different layers mean.
00:54:01.880 | And you can actually do that,
00:54:03.440 | and OpenAI has done some work on that.
00:54:05.920 | But there is a different answer,
00:54:06.960 | which is that, I would say,
00:54:10.320 | that's the human-centric answer,
00:54:11.400 | where you say, you know, you look at a human being,
00:54:15.040 | you can't read, you know,
00:54:16.520 | how do you know what a human being is thinking?
00:54:18.800 | You ask them, you say, hey, what do you think about this?
00:54:20.640 | What do you think about that?
00:54:22.360 | And you get some answers.
00:54:23.960 | The answers you get are sticky,
00:54:25.640 | in the sense you already have a mental model.
00:54:28.040 | You already have an, yeah,
00:54:30.600 | mental model of that human being.
00:54:32.700 | You already have an understanding of,
00:54:35.160 | like a big conception of what it,
00:54:37.760 | of that human being, how they think,
00:54:39.400 | how what they know, how they see the world,
00:54:41.560 | and then everything you ask, you're adding onto that.
00:54:45.560 | And that stickiness seems to be,
00:54:49.800 | that's one of the really interesting qualities
00:54:51.720 | of the human being, is that information is sticky.
00:54:55.040 | You don't, you seem to remember the useful stuff,
00:54:57.560 | aggregate it well, and forget most of the information
00:55:00.440 | that's not useful.
00:55:01.800 | That process, but that's also pretty similar
00:55:04.920 | to the process that neural networks do.
00:55:06.800 | It's just that neural networks are much crappier
00:55:09.080 | at this time.
00:55:10.680 | It doesn't seem to be fundamentally that different.
00:55:13.280 | But just to stick on reasoning for a little longer,
00:55:16.060 | you said, why not?
00:55:18.760 | Why can't I reason?
00:55:19.720 | What's a good, impressive feat,
00:55:22.840 | benchmark to you of reasoning
00:55:24.800 | that you'll be impressed by
00:55:28.760 | if neural networks were able to do?
00:55:30.640 | Is that something you already have in mind?
00:55:32.880 | - Well, I think writing really good code.
00:55:35.300 | I think proving really hard theorems.
00:55:39.320 | Solving open-ended problems with out-of-the-box solutions.
00:55:43.160 | - And sort of theorem type mathematical problems.
00:55:49.520 | - Yeah, I think those ones are a very natural example
00:55:52.120 | as well.
00:55:52.960 | If you can prove an unproven theorem,
00:55:54.520 | then it's hard to argue, you don't reason.
00:55:56.620 | And so by the way, and this comes back to the point
00:55:59.440 | about the hard results.
00:56:01.000 | If you've got a hard, if you have,
00:56:03.240 | machine learning, deep learning as a field is very fortunate
00:56:06.120 | because we have the ability to sometimes produce
00:56:08.760 | these unambiguous results.
00:56:10.880 | And when they happen, the debate changes,
00:56:13.160 | the conversation changes.
00:56:14.320 | We have the ability to produce conversation changing results.
00:56:19.540 | - Conversation, and then of course, just like you said,
00:56:21.660 | people kind of take that for granted
00:56:23.060 | and say that wasn't actually a hard problem.
00:56:25.100 | Well, I mean, at some point,
00:56:26.420 | you'll probably run out of hard problems.
00:56:28.420 | Yeah, that whole mortality thing is kind of a sticky problem
00:56:33.700 | that we haven't quite figured out.
00:56:35.140 | Maybe we'll solve that one.
00:56:37.240 | I think one of the fascinating things
00:56:39.140 | in your entire body of work,
00:56:40.900 | but also the work at OpenAI recently,
00:56:43.060 | one of the conversation changers has been
00:56:44.860 | in the world of language models.
00:56:47.180 | Can you briefly kind of try to describe the recent history
00:56:51.140 | of using neural networks in the domain of language and text?
00:56:54.660 | - Well, there's been lots of history.
00:56:56.660 | I think the Elman network was a small,
00:57:00.260 | tiny recurrent neural network applied to language
00:57:02.140 | back in the '80s.
00:57:03.900 | So the history is really, you know, fairly long at least.
00:57:08.740 | And the thing that started,
00:57:10.700 | the thing that changed the trajectory
00:57:13.480 | of neural networks and language
00:57:14.980 | is the thing that changed the trajectory
00:57:17.220 | of all deep learning, and that's data and compute.
00:57:19.700 | So suddenly you move from small language models,
00:57:22.740 | which learn a little bit.
00:57:24.420 | And with language models in particular,
00:57:26.660 | there's a very clear explanation
00:57:28.500 | for why they need to be large to be good,
00:57:31.660 | because they're trying to predict the next word.
00:57:34.620 | So when you don't know anything,
00:57:36.900 | you'll notice very, very broad strokes,
00:57:40.260 | surface level patterns,
00:57:41.500 | like sometimes there are characters
00:57:44.860 | and there is a space between those characters.
00:57:46.500 | You'll notice this pattern.
00:57:47.980 | And you'll notice that sometimes there is a comma
00:57:50.020 | and then the next character is a capital letter.
00:57:51.900 | You'll notice that pattern.
00:57:53.620 | Eventually you may start to notice
00:57:54.980 | that there are certain words occur often.
00:57:57.140 | You may notice that spellings are a thing.
00:57:59.380 | You may notice syntax.
00:58:01.060 | And when you get really good at all these,
00:58:03.660 | you start to notice the semantics.
00:58:05.860 | You start to notice the facts.
00:58:07.820 | But for that to happen,
00:58:08.860 | the language model needs to be larger.
00:58:11.460 | - So let's linger on that,
00:58:14.060 | 'cause that's where you and Noam Chomsky disagree.
00:58:16.620 | So you think we're actually taking incremental steps,
00:58:23.700 | sort of larger network, larger compute,
00:58:25.740 | we'll be able to get to the semantics,
00:58:29.540 | be able to understand language
00:58:32.020 | without what Noam likes to sort of think of
00:58:35.540 | as a fundamental understandings
00:58:38.660 | of the structure of language,
00:58:40.460 | like imposing your theory of language
00:58:43.380 | onto the learning mechanism.
00:58:45.900 | So you're saying the learning,
00:58:48.060 | you can learn from raw data,
00:58:50.620 | the mechanism that underlies language.
00:58:53.460 | - Well, I think it's pretty likely,
00:58:56.780 | but I also wanna say that I don't really
00:58:58.820 | know precisely what Chomsky means when he talks about this.
00:59:05.220 | You said something about imposing
00:59:07.380 | your structure on language.
00:59:08.820 | I'm not 100% sure what he means,
00:59:10.540 | but empirically it seems that
00:59:12.740 | when you inspect those larger language models,
00:59:14.700 | they exhibit signs of understanding the semantics,
00:59:16.700 | whereas the smaller language models do not.
00:59:18.540 | We've seen that a few years ago
00:59:19.820 | when we did work on the sentiment neuron,
00:59:21.980 | we trained a small, you know,
00:59:24.060 | smallish LSTM to predict the next character
00:59:27.380 | in Amazon reviews.
00:59:28.620 | And we noticed that when you increase the size of the LSTM
00:59:31.700 | from 500 LSTM cells to 4,000 LSTM cells,
00:59:35.420 | then one of the neurons starts to represent the sentiment
00:59:38.620 | of the article, of, sorry, of the review.
00:59:41.020 | Now, why is that?
00:59:42.980 | Sentiment is a pretty semantic attribute.
00:59:45.260 | It's not a syntactic attribute.
00:59:46.900 | - And for people who might not know,
00:59:48.380 | I don't know if that's a standard term,
00:59:49.460 | but sentiment is whether it's a positive or negative review.
00:59:52.020 | - That's right.
00:59:52.860 | Like, is the person happy with something
00:59:54.300 | or is the person unhappy with something?
00:59:55.940 | And so here we had very clear evidence
00:59:58.780 | that a small neural net does not capture sentiment
01:00:01.940 | while a large neural net does.
01:00:03.620 | And why is that?
01:00:04.740 | Well, our theory is that at some point
01:00:07.460 | you run out of syntax to model,
01:00:08.860 | you gotta focus on something else.
01:00:11.060 | - And with size, you quickly run out of syntax to model,
01:00:15.820 | and then you really start to focus on the semantics,
01:00:18.380 | would be the idea.
01:00:19.420 | - That's right.
01:00:20.260 | And so I don't wanna imply that our models
01:00:22.180 | have complete semantic understanding
01:00:23.860 | because that's not true,
01:00:25.340 | but they definitely are showing signs
01:00:28.260 | of semantic understanding, partial semantic understanding,
01:00:30.780 | but the smaller models do not show those signs.
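
For readers who want the flavor of the setup being described, here is a minimal sketch of a character-level (byte-level) LSTM trained to predict the next character, in the spirit of the sentiment-neuron experiment. The sizes, the random stand-in data, and all names are illustrative assumptions, not the original code.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    # Next-character (next-byte) predictor. The effect discussed above appeared
    # when scaling the hidden state from roughly 500 to roughly 4,000 units.
    def __init__(self, vocab_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)           # bytes -> vectors
        self.lstm = nn.LSTM(64, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)      # logits for the next byte

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), h, state                       # h: per-step hidden units

model = CharLSTM()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Random bytes as a stand-in for batches of review text.
batch = torch.randint(0, 256, (8, 129))
inp, target = batch[:, :-1], batch[:, 1:]

logits, hidden, _ = model(inp)
loss = loss_fn(logits.reshape(-1, 256), target.reshape(-1))
loss.backward()
opt.step()

# After real training, one could probe individual units of `hidden` to see
# whether any single unit tracks the sentiment of the review.
print(loss.item(), hidden.shape)   # hidden: (batch, time, hidden_size)
```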
01:00:34.540 | - Can you take a step back and say, what is GPT-2,
01:00:38.180 | which is one of the big language models
01:00:40.580 | that was the conversation changer
01:00:42.540 | in the past couple of years?
01:00:43.820 | - Yeah, so GPT-2 is a transformer
01:00:48.180 | with one and a half billion parameters
01:00:50.380 | that was trained on about 40 billion tokens of text,
01:00:55.380 | which were obtained from web pages
01:00:58.900 | that were linked to from Reddit articles
01:01:01.140 | with more than three upvotes.
01:01:02.380 | - And what's a transformer?
01:01:03.940 | - The transformer, it's the most important advance
01:01:06.740 | in neural network architectures in recent history.
01:01:09.820 | - What is attention maybe too?
01:01:11.540 | 'Cause I think that's an interesting idea,
01:01:13.300 | not necessarily sort of technically speaking,
01:01:15.060 | but the idea of attention
01:01:17.500 | versus maybe what recurring neural networks represent.
01:01:21.140 | - Yeah, so the thing is the transformer
01:01:23.380 | is a combination of multiple ideas simultaneously
01:01:25.900 | of which attention is one.
01:01:28.180 | - Do you think attention is the key?
01:01:29.420 | - No, it's a key, but it's not the key.
01:01:32.500 | The transformer is successful
01:01:34.540 | because it is the simultaneous combination
01:01:36.820 | of multiple ideas.
01:01:37.740 | And if you were to remove either idea,
01:01:39.100 | it would be much less successful.
01:01:41.500 | So the transformer uses a lot of attention,
01:01:43.900 | but attention existed for a few years.
01:01:45.900 | So that can't be the main innovation.
01:01:48.460 | The transformer is designed in such a way
01:01:53.220 | that it runs really fast on the GPU.
01:01:55.220 | And that makes a huge amount of difference.
01:01:58.220 | This is one thing.
01:01:59.400 | The second thing is that transformer is not recurrent.
01:02:02.880 | And that is really important too,
01:02:04.720 | because it is more shallow
01:02:06.400 | and therefore much easier to optimize.
01:02:08.480 | So in other words, it uses attention.
01:02:10.440 | It is a really great fit to the GPU
01:02:14.320 | and it is not recurrent,
01:02:15.360 | so therefore less deep and easier to optimize.
01:02:17.840 | And the combination of those factors make it successful.
01:02:20.760 | So now it makes great use of your GPU.
01:02:24.240 | It allows you to achieve better results
01:02:26.400 | for the same amount of compute.
01:02:28.720 | And that's why it's successful.
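
To make the attention part of that description concrete, here is a minimal sketch of causal scaled dot-product attention, the core parallel operation; it is only one of the ingredients Ilya lists, and the shapes here are illustrative.

```python
import torch

def causal_attention(q, k, v):
    # Scaled dot-product attention: every position attends to every earlier
    # position in parallel (no recurrence), which is why it maps so well to GPUs.
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5              # (batch, T, T) similarities
    T = scores.shape[-1]
    mask = torch.triu(torch.ones(T, T), diagonal=1).bool()   # block attention to the future
    scores = scores.masked_fill(mask, float("-inf"))
    weights = scores.softmax(dim=-1)                         # where to attend
    return weights @ v                                       # weighted sum of values

# Illustrative shapes: batch of 2 sequences, 16 tokens, 64 dimensions per token.
q = k = v = torch.randn(2, 16, 64)
print(causal_attention(q, k, v).shape)   # torch.Size([2, 16, 64])
```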
01:02:31.080 | - Were you surprised how well transformers worked
01:02:34.200 | and GPT-2 worked?
01:02:36.120 | So you worked on language.
01:02:37.840 | You've had a lot of great ideas
01:02:39.760 | before transformers came about in language.
01:02:42.880 | So you got to see the whole set of revolutions
01:02:44.960 | before and after.
01:02:46.160 | Were you surprised?
01:02:47.560 | - Yeah, a little.
01:02:48.640 | - A little?
01:02:49.480 | - Yeah.
01:02:50.320 | I mean, it's hard to remember
01:02:51.920 | because you adapt really quickly,
01:02:54.520 | but it definitely was surprising.
01:02:55.960 | It definitely was.
01:02:56.880 | In fact, you know what?
01:02:59.040 | I'll retract my statement.
01:03:00.480 | It was pretty amazing.
01:03:02.480 | It was just amazing to see it generate this text.
01:03:06.080 | And you know, you got to keep in mind
01:03:07.360 | that at that time, you've seen all this progress in GANs,
01:03:10.480 | in improving the samples produced by GANs were just amazing.
01:03:14.720 | You have these realistic faces,
01:03:15.960 | but text hasn't really moved that much.
01:03:17.880 | And suddenly we moved from, you know,
01:03:20.520 | whatever GANs were in 2015
01:03:23.120 | to the best, most amazing GANs in one step.
01:03:26.200 | And that was really stunning.
01:03:27.520 | Even though theory predicted,
01:03:29.040 | yeah, you train a big language model,
01:03:30.440 | of course you should get this.
01:03:31.840 | But then to see it with your own eyes, it's something else.
01:03:34.880 | - And yet we adapt really quickly.
01:03:37.240 | And now there's sort of some cognitive scientists
01:03:42.240 | write articles saying that GPT-2 models
01:03:47.040 | don't truly understand language.
01:03:49.320 | So we adapt quickly to how amazing
01:03:51.880 | the fact that they're able to model the language so well is.
01:03:55.680 | So what do you think is the bar?
01:03:57.920 | - For what?
01:03:59.680 | - For impressing us that it-
01:04:02.400 | - I don't know.
01:04:03.720 | - Do you think that bar will continuously be moved?
01:04:06.080 | - Definitely.
01:04:07.320 | I think when you start to see
01:04:08.840 | really dramatic economic impact, that's when,
01:04:11.960 | I think that's in some sense the next barrier.
01:04:13.800 | Because right now, if you think about the work in AI,
01:04:16.880 | it's really confusing.
01:04:18.880 | It's really hard to know what to make of all these advances.
01:04:22.520 | It's kind of like, okay, you got an advance.
01:04:25.560 | Now you can do more things.
01:04:26.840 | And you got another improvement.
01:04:29.080 | And you got another cool demo.
01:04:30.400 | At some point, I think people who are outside of AI,
01:04:35.400 | they can no longer distinguish this progress anymore.
01:04:38.680 | - So we were talking offline
01:04:40.040 | about translating Russian to English
01:04:41.760 | and how there's a lot of brilliant work in Russian
01:04:44.120 | that the rest of the world doesn't know about.
01:04:46.440 | That's true for Chinese.
01:04:47.560 | It's true for a lot of scientists
01:04:50.080 | and just artistic work in general.
01:04:52.200 | Do you think translation is the place
01:04:53.880 | where we're going to see sort of economic big impact?
01:04:57.080 | - I don't know.
01:04:58.080 | I think there is a huge number of,
01:05:00.040 | I mean, first of all, I would want to,
01:05:01.600 | I want to point out that translation already today is huge.
01:05:05.520 | I think billions of people interact
01:05:07.520 | with big chunks of the internet
01:05:09.960 | primarily through translation.
01:05:11.080 | So translation is already huge
01:05:13.040 | and it's hugely, hugely positive too.
01:05:16.400 | I think self-driving is going to be hugely impactful.
01:05:20.320 | And that's, you know, it's unknown exactly when it happens,
01:05:24.480 | but again, I would not bet against deep learning.
01:05:27.040 | So I--
01:05:28.000 | - So that's deep learning in general, but you think--
01:05:30.400 | - Deep learning for self-driving.
01:05:31.960 | - Yes, deep learning for self-driving.
01:05:33.160 | But I was talking about sort of language models.
01:05:35.360 | - I see.
01:05:36.200 | - Just to check. - I veered off a little bit.
01:05:38.120 | - Just to check.
01:05:38.960 | You're not seeing a connection
01:05:40.000 | between driving and language.
01:05:41.160 | - No, no. - Okay.
01:05:42.400 | - Or rather, both use neural nets.
01:05:44.080 | - That'd be a poetic connection.
01:05:45.600 | I think there might be some, like you said,
01:05:47.800 | there might be some kind of unification
01:05:49.200 | towards a kind of multitask transformers
01:05:54.200 | that can take on both language and vision tasks.
01:05:58.240 | That'd be an interesting unification.
01:06:01.440 | Now let's see, what can I ask about GPT-2 more?
01:06:04.000 | - It's simple, so not much to ask.
01:06:06.980 | It's, you take a transform, you make it bigger,
01:06:09.960 | give it more data,
01:06:10.800 | and suddenly it does all those amazing things.
01:06:12.700 | - Yeah, one of the beautiful things is that GPT,
01:06:14.920 | the transformers are fundamentally simple
01:06:17.200 | to explain, to train.
01:06:18.660 | Do you think bigger will continue
01:06:23.960 | to show better results in language?
01:06:27.080 | - Probably.
01:06:28.240 | - Sort of like what are the next steps with GPT-2,
01:06:30.520 | do you think?
01:06:31.480 | - I mean, I think for sure seeing what larger versions
01:06:35.680 | can do is one direction.
01:06:37.640 | Also, I mean, there are many questions.
01:06:41.240 | There's one question which I'm curious about,
01:06:42.800 | and that's the following.
01:06:44.000 | So right now, GPT-2,
01:06:45.400 | so we feed it all this data from the internet,
01:06:47.000 | which means that it needs to memorize
01:06:48.160 | all those random facts about everything in the internet.
01:06:51.880 | And it would be nice if the model
01:06:56.120 | could somehow use its own intelligence
01:06:59.200 | to decide what data it wants to accept
01:07:01.840 | and what data it wants to reject.
01:07:03.580 | Just like people,
01:07:04.420 | people don't learn all data indiscriminately.
01:07:07.200 | We are super selective about what we learn.
01:07:09.760 | And I think this kind of active learning,
01:07:11.600 | I think, would be very nice to have.
01:07:13.400 | - Yeah, listen, I love active learning.
01:07:16.760 | So let me ask, does the selection of data,
01:07:21.200 | can you just elaborate that a little bit more?
01:07:23.080 | Do you think the selection of data is,
01:07:26.440 | like, I have this kind of sense
01:07:29.920 | that the optimization of how you select data,
01:07:33.840 | so the active learning process,
01:07:35.920 | is going to be a place for a lot of breakthroughs,
01:07:40.020 | even in the near future,
01:07:42.200 | because there hasn't been many breakthroughs there
01:07:44.120 | that are public.
01:07:45.160 | I feel like there might be private breakthroughs
01:07:47.640 | that companies keep to themselves,
01:07:49.400 | 'cause it's a fundamental problem that has to be solved
01:07:51.560 | if you wanna solve self-driving,
01:07:53.000 | if you wanna solve a particular task.
01:07:55.360 | What do you think about the space in general?
01:07:57.880 | - Yeah, so I think that for something like active learning,
01:08:00.280 | or in fact, for any kind of capability, like active learning,
01:08:03.860 | the thing that it really needs is a problem.
01:08:05.880 | It needs a problem that requires it.
01:08:08.020 | It's very hard to do research about the capability
01:08:12.160 | if you don't have a task,
01:08:13.080 | because then what's going to happen
01:08:14.280 | is that you will come up with an artificial task,
01:08:16.760 | get good results, but not really convince anyone.
01:08:19.780 | - Right, like, we're now past the stage
01:08:23.000 | where getting a result on MNIST,
01:08:27.520 | some clever formulation of MNIST will convince people.
01:08:30.880 | - That's right.
01:08:31.720 | In fact, you could quite easily come up
01:08:33.680 | with a simple active learning scheme on MNIST
01:08:35.400 | and get a 10x speedup, but then, so what?
01:08:39.640 | And I think that with active learning,
01:08:41.880 | the active learning will naturally arise
01:08:45.520 | as problems that require it pop up.
01:08:49.280 | That's my take on it.
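
As a toy illustration of the kind of "simple active learning scheme" mentioned above, here is a minimal uncertainty-sampling loop on synthetic data; the data, the choice of model, and the query budget are all illustrative assumptions. The model repeatedly asks for labels only on the points it is least sure about, rather than taking data indiscriminately.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-class problem standing in for a real task like MNIST.
X = rng.normal(size=(2000, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# Start with a tiny labeled set containing both classes.
labeled = list(np.where(y == 0)[0][:10]) + list(np.where(y == 1)[0][:10])
pool = [i for i in range(len(X)) if i not in set(labeled)]

for round_ in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the examples the model is least sure about,
    # i.e. the model decides what data it wants next.
    query = [pool[i] for i in np.argsort(np.abs(probs - 0.5))[:20]]
    labeled += query
    pool = [i for i in pool if i not in set(query)]
    print(f"round {round_}: {len(labeled)} labels, accuracy {clf.score(X, y):.3f}")
```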
01:08:51.880 | - There's another interesting thing
01:08:54.080 | that OpenAI has brought up with GPT-2,
01:08:56.080 | which is when you create
01:08:58.640 | a powerful artificial intelligence system,
01:09:01.400 | and it was unclear what kind of detrimental,
01:09:04.640 | once you release GPT-2,
01:09:07.440 | what kind of detrimental effect it'll have,
01:09:09.560 | because if you have a model
01:09:11.520 | that can generate pretty realistic text,
01:09:14.040 | you can start to imagine that it would be used by bots
01:09:18.280 | in some way that we can't even imagine.
01:09:21.680 | So there's this nervousness about what it's possible to do.
01:09:24.400 | So you did a really kind of brave
01:09:27.080 | and I think profound thing,
01:09:28.120 | which is start a conversation about this.
01:09:29.920 | Like, how do we release
01:09:32.240 | powerful artificial intelligence models to the public?
01:09:36.080 | If we do it all, how do we privately discuss
01:09:39.760 | with other, even competitors,
01:09:42.160 | about how we manage the use of the systems and so on?
01:09:46.040 | So from this whole experience,
01:09:47.960 | you released a report on it,
01:09:49.520 | but in general, are there any insights
01:09:51.760 | that you've gathered from just thinking about this,
01:09:55.320 | about how you release models like this?
01:09:57.680 | - I mean, I think that my take on this
01:10:00.680 | is that the field of AI has been in a state of childhood,
01:10:05.020 | and now it's exiting that state
01:10:06.820 | and it's entering a state of maturity.
01:10:08.720 | What that means is that AI is very successful
01:10:12.300 | and also very impactful,
01:10:14.100 | and its impact is not only large, but it's also growing.
01:10:16.940 | And so for that reason, it seems wise to start thinking
01:10:22.820 | about the impact of our systems before releasing them,
01:10:25.900 | maybe a little bit too soon,
01:10:27.500 | rather than a little bit too late.
01:10:29.660 | And with the case of GPT-2, like I mentioned earlier,
01:10:32.860 | the results really were stunning,
01:10:35.140 | and it seemed plausible.
01:10:37.220 | It didn't seem certain.
01:10:38.700 | It seemed plausible that something like GPT-2
01:10:41.540 | could easily be used to reduce the cost of disinformation.
01:10:45.460 | And so there was a question
01:10:48.500 | of what's the best way to release it,
01:10:50.040 | and a staged release seemed logical.
01:10:51.740 | A small model was released,
01:10:53.700 | and there was time to see the...
01:10:56.460 | Many people use these models in lots of cool ways.
01:10:59.720 | There've been lots of really cool applications.
01:11:02.020 | There haven't been any negative applications we know of,
01:11:06.180 | and so eventually it was released.
01:11:07.620 | But also other people replicated similar models.
01:11:09.980 | - That's an interesting question, though, that we know of.
01:11:12.700 | So in your view, staged release
01:11:16.060 | is at least part of the answer to the question of how do we...
01:11:20.780 | What do we do once we create a system like this?
01:11:25.940 | - It's part of the answer, yes.
01:11:27.500 | - Is there any other insights?
01:11:29.980 | Like, say you don't wanna release the model at all
01:11:32.400 | because it's useful to you for whatever the business is.
01:11:35.860 | - Well, plenty of people don't release models already.
01:11:39.100 | - Right, of course,
01:11:39.940 | but is there some moral, ethical responsibility
01:11:44.560 | when you have a very powerful model to sort of communicate?
01:11:47.660 | Just as you said, when you had GPT-2,
01:11:51.380 | it was unclear how much it could be used for misinformation.
01:11:54.180 | It's an open question,
01:11:55.440 | and getting an answer to that might require
01:11:58.660 | that you talk to other really smart people
01:12:00.580 | that are outside of your particular group.
01:12:03.860 | Please tell me there's some optimistic pathway
01:12:08.500 | for people across the world
01:12:10.520 | to collaborate on these kinds of cases.
01:12:12.660 | Or is it still really difficult from one company
01:12:17.820 | to talk to another company?
01:12:19.560 | - So it's definitely possible.
01:12:21.300 | It's definitely possible to discuss these kind of models
01:12:26.140 | with colleagues elsewhere
01:12:28.300 | and to get their take on what to do.
01:12:32.220 | - How hard is it though?
01:12:33.660 | - I mean...
01:12:34.660 | - Do you see that happening?
01:12:38.060 | - I think that's a place where it's important
01:12:40.540 | to gradually build trust between companies
01:12:43.300 | because ultimately, all the AI developers
01:12:47.100 | are building technology
01:12:47.960 | which is going to be increasingly more powerful.
01:12:50.780 | And so it's...
01:12:54.700 | The way to think about it
01:12:55.620 | is that ultimately we're all in it together.
01:12:57.740 | - Yeah, it's...
01:12:59.460 | I tend to believe in the better angels of our nature,
01:13:04.420 | but I do hope that...
01:13:06.860 | That when you build a really powerful AI system
01:13:11.380 | in a particular domain,
01:13:12.860 | that you also think about
01:13:14.180 | the potential negative consequences of...
01:13:17.900 | Yeah.
01:13:18.740 | It's an interesting and scary possibility
01:13:24.060 | that there'll be a race for AI development
01:13:27.340 | that would push people to close that development
01:13:30.420 | and not share ideas with others.
01:13:32.220 | - I don't love this.
01:13:34.620 | I've been a pure academic for 10 years.
01:13:36.660 | I really like sharing ideas and it's fun.
01:13:38.900 | It's exciting.
01:13:39.740 | - What do you think it takes to...
01:13:42.780 | Let's talk about AGI a little bit.
01:13:44.460 | What do you think it takes to build a system
01:13:46.380 | of human level intelligence?
01:13:47.980 | We talked about reasoning.
01:13:49.580 | We talked about long-term memory,
01:13:51.300 | but in general, what does it take, do you think?
01:13:53.700 | - Well, I can't be sure,
01:13:56.220 | but I think the deep learning plus maybe another small idea.
01:14:02.460 | - Do you think self-play will be involved?
01:14:05.620 | So like you've spoken about the powerful mechanism
01:14:08.460 | of self-play where systems learn by
01:14:11.540 | sort of exploring the world in a competitive setting
01:14:16.660 | against other entities that are similarly skilled as them
01:14:20.580 | and so incrementally improve in this way.
01:14:23.060 | Do you think self-play will be a component
01:14:24.580 | of building an AGI system?
01:14:26.700 | - Yeah, so what I would say to build AGI,
01:14:30.340 | I think is going to be deep learning plus some ideas.
01:14:35.060 | And I think self-play will be one of those ideas.
01:14:37.500 | I think that that is a very...
01:14:40.660 | Self-play has this amazing property that it can surprise us
01:14:46.380 | in truly novel ways.
01:14:49.620 | For example, pretty much every self-play system,
01:14:54.620 | both our Dota bot, and, I don't know if you saw, OpenAI had a release
01:15:00.460 | about multi-agent where you had two little agents
01:15:04.380 | who were playing hide and seek.
01:15:06.100 | And of course, also AlphaZero.
01:15:08.260 | They will all produce surprising behaviors.
01:15:11.060 | They all produce behaviors that we didn't expect.
01:15:13.220 | They are creative solutions to problems.
01:15:15.860 | And that seems like an important part of AGI
01:15:18.740 | that our systems don't exhibit routinely right now.
01:15:21.380 | And so that's why I like this area, I like this direction
01:15:25.740 | because of its ability to surprise us.
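
As a tiny, self-contained illustration of the self-play idea, here is a toy fictitious-play loop on rock-paper-scissors; it is a made-up example, not any of the systems mentioned. The agent repeatedly best-responds to the empirical play of its own past copies, so its opponent gets stronger exactly as fast as it does.

```python
import numpy as np

# Row player's payoff for rock/paper/scissors against each opponent move.
payoff = np.array([[ 0, -1,  1],   # rock
                   [ 1,  0, -1],   # paper
                   [-1,  1,  0]])  # scissors

counts = np.ones(3)                       # history of the agent's own past moves
for _ in range(100_000):
    past_self = counts / counts.sum()     # opponent = empirical mixture of past play
    best_response = int(np.argmax(payoff @ past_self))
    counts[best_response] += 1            # play it; it becomes part of the history

print(counts / counts.sum())   # approaches the uniform (unexploitable) strategy
```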
01:15:27.620 | - To surprise us.
01:15:28.460 | And an AGI system would surprise us fundamentally.
01:15:31.260 | - Yes, and to be precise, not just a random surprise,
01:15:34.580 | but to find a surprising solution to a problem
01:15:37.980 | that's also useful.
01:15:39.220 | - Right.
01:15:40.060 | Now, a lot of the self-play mechanisms have been used
01:15:43.580 | in the game context or at least in a simulation context.
01:15:48.580 | How far along the path to AGI
01:15:53.580 | do you think will be done in simulation?
01:15:56.700 | How much faith, promise do you have in simulation
01:16:01.340 | versus having to have a system that operates
01:16:04.500 | in the real world, whether it's the real world
01:16:07.460 | of digital real-world data or real world,
01:16:10.660 | like actual physical world of robotics?
01:16:13.240 | - I don't think it's an either or.
01:16:15.020 | I think simulation is a tool and it helps.
01:16:17.540 | It has certain strengths and certain weaknesses
01:16:19.700 | and we should use it.
01:16:21.500 | - Yeah, but, okay, I understand that.
01:16:24.460 | That's true, but one of the criticisms of self-play,
01:16:32.740 | one of the criticisms of reinforcement learning
01:16:34.820 | is one of the,
01:16:35.660 | its current power, its current results,
01:16:41.080 | while amazing, have been demonstrated
01:16:42.940 | in simulated environments
01:16:44.820 | or very constrained physical environments.
01:16:46.420 | Do you think it's possible to escape them,
01:16:49.180 | escape the simulated environments
01:16:50.780 | and be able to learn in non-simulated environments?
01:16:53.420 | Or do you think it's possible to also just simulate
01:16:57.020 | in a photorealistic and physics realistic way,
01:17:01.100 | the real world in a way that we can solve real problems
01:17:03.780 | with self-play in simulation?
01:17:06.740 | - So I think that transfer from simulation
01:17:09.700 | to the real world is definitely possible
01:17:11.700 | and has been exhibited many times
01:17:13.900 | in by many different groups.
01:17:16.060 | It's been especially successful in vision.
01:17:18.660 | Also, OpenAI in the summer has demonstrated a robot hand
01:17:22.660 | which was trained entirely in simulation
01:17:25.260 | in a certain way that allowed
01:17:26.660 | for sim-to-real transfer to occur.
01:17:28.500 | - Is this for the Rubik's Cube?
01:17:31.420 | - Yes, that's right.
01:17:32.660 | - I wasn't aware that was trained in simulation.
01:17:34.660 | - It was trained in simulation entirely.
01:17:37.020 | - Really, so it wasn't in the physics,
01:17:39.420 | the hand wasn't trained?
01:17:40.980 | - No, 100% of the training was done in simulation
01:17:44.820 | and the policy that was learned in simulation
01:17:46.900 | was trained to be very adaptive.
01:17:48.980 | So adaptive that when you transfer it,
01:17:50.940 | it could very quickly adapt to the physical world.
01:17:53.940 | - So the kind of perturbations with the giraffe
01:17:57.380 | or whatever the heck it was,
01:17:58.860 | were those part of the simulation?
01:18:01.860 | - Well, the simulation was generally,
01:18:04.140 | so the simulation was trained to be robust
01:18:07.060 | to many different things,
01:18:08.140 | but not the kind of perturbations we've had in the video.
01:18:10.580 | So it's never been trained with a glove,
01:18:12.660 | it's never been trained with a stuffed giraffe.
01:18:17.060 | - So in theory, these are novel perturbations.
01:18:19.340 | - Correct, it's not in theory, in practice.
01:18:21.740 | - That those are novel perturbations?
01:18:23.780 | Well, that's okay.
01:18:25.100 | That's a clean, small scale, but clean example
01:18:29.460 | of a transfer from the simulated world
01:18:30.820 | to the physical world.
01:18:32.140 | - Yeah, and I will also say that I expect
01:18:34.300 | the transfer capabilities of deep learning
01:18:36.260 | to increase in general.
01:18:38.180 | And the better the transfer capabilities are,
01:18:40.540 | the more useful simulation will become.
01:18:43.500 | Because then you could take,
01:18:45.140 | you could experience something in simulation
01:18:48.420 | and then learn a moral of the story,
01:18:50.220 | which you could then carry with you to the real world.
01:18:53.420 | As humans do all the time when they play computer games.
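
One way to sketch the idea behind that kind of sim-to-real training is domain randomization: resample the simulator's physics every episode, so a controller has to work across many possible worlds rather than being tuned to one. Everything below, the 1-D "push a block to a target" dynamics and the simple feedback controller, is a made-up illustration under that assumption, not the actual robot-hand setup; picking the gain that does best across randomized episodes stands in for training across randomized worlds.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_episode(gain, steps=300, dt=0.02):
    # Each episode samples new physics, so no single simulator can be overfit to.
    mass = rng.uniform(0.5, 2.0)
    friction = rng.uniform(0.05, 0.4)
    pos, vel, target = 0.0, 0.0, 1.0
    for _ in range(steps):
        force = gain * (target - pos) - 2.0 * vel      # simple feedback "policy"
        acc = (force - friction * vel) / mass
        vel += dt * acc
        pos += dt * vel
    return abs(target - pos)                           # final distance to the target

# A controller that does well across many randomized worlds is more likely to
# also do well in the one real world whose parameters were never simulated.
for gain in [0.5, 2.0, 8.0]:
    errors = [run_episode(gain) for _ in range(200)]
    print(f"gain={gain}: mean final error over randomized physics = {np.mean(errors):.3f}")
```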
01:18:55.980 | - So let me ask sort of an embodied question,
01:19:01.620 | staying on AGI for a sec.
01:19:03.460 | Do you think, for AGI, that we need to have a body?
01:19:07.620 | We need to have some of those human elements
01:19:09.460 | of self-awareness, consciousness,
01:19:12.900 | sort of fear of mortality, sort of self-preservation
01:19:16.580 | in the physical space, which comes with having a body.
01:19:20.260 | - I think having a body will be useful.
01:19:22.340 | I don't think it's necessary.
01:19:24.260 | But I think it's very useful to have a body for sure,
01:19:26.180 | because you can learn a whole new,
01:19:28.820 | you can learn things which cannot be learned without a body.
01:19:32.420 | But at the same time, I think that you can,
01:19:34.420 | if you don't have a body, you could compensate for it
01:19:36.860 | and still succeed.
01:19:38.540 | - You think so?
01:19:39.380 | - Yes.
01:19:40.220 | Well, there is evidence for this.
01:19:41.040 | For example, there are many people
01:19:42.260 | who were born deaf and blind,
01:19:44.260 | and they were able to compensate for the lack of modalities.
01:19:48.180 | I'm thinking about Helen Keller specifically.
01:19:50.380 | - So even if you're not able to physically interact
01:19:53.780 | with the world, and if you're not able to,
01:19:56.860 | I mean, I actually was getting at,
01:19:58.700 | maybe let me ask on the more particular,
01:20:02.620 | I'm not sure if it's connected to having a body or not,
01:20:05.300 | but the idea of consciousness.
01:20:07.820 | And a more constrained version of that is self-awareness.
01:20:11.220 | Do you think an AGI system should have consciousness?
01:20:14.500 | We can't define consciousness,
01:20:17.300 | whatever the heck you think consciousness is.
01:20:19.380 | - Yeah, hard question to answer,
01:20:21.540 | given how hard it is to define it.
01:20:23.240 | - Do you think it's useful to think about?
01:20:26.420 | - I mean, it's definitely interesting.
01:20:28.340 | It's fascinating.
01:20:29.820 | I think it's definitely possible
01:20:31.780 | that our systems will be conscious.
01:20:33.860 | - Do you think that's an emergent thing
01:20:35.020 | that just comes from,
01:20:36.380 | do you think consciousness could emerge
01:20:37.740 | from the representation that's stored within your networks?
01:20:40.820 | So like that it naturally just emerges
01:20:42.960 | when you become more and more,
01:20:45.080 | you're able to represent more and more of the world?
01:20:47.000 | - Well, I'd say, I'd make the following argument,
01:20:48.740 | which is humans are conscious,
01:20:53.740 | and if you believe that artificial neural nets
01:20:56.040 | are sufficiently similar to the brain,
01:20:59.500 | then there should at least exist artificial neural nets
01:21:02.660 | which should be conscious too.
01:21:04.220 | - You're leaning on that existence proof pretty heavily.
01:21:06.580 | Okay.
01:21:07.420 | - But that's the best answer I can give.
01:21:12.060 | - No, I know, I know, I know.
01:21:15.940 | There's still an open question
01:21:17.060 | if there's not some magic in the brain that we're not,
01:21:20.760 | I mean, I don't mean a non-materialistic magic,
01:21:23.580 | but that the brain might be a lot more complicated
01:21:27.780 | and interesting than we give it credit for.
01:21:29.860 | - If that's the case, then it should show up,
01:21:32.460 | and at some point we will find out
01:21:35.180 | that we can't continue to make progress,
01:21:36.620 | but I think it's unlikely.
01:21:38.780 | - So we talk about consciousness,
01:21:40.200 | but let me talk about another poorly defined concept
01:21:42.420 | of intelligence.
01:21:43.480 | Again, we've talked about reasoning,
01:21:46.900 | we've talked about memory,
01:21:48.140 | what do you think is a good test of intelligence for you?
01:21:51.700 | Are you impressed by the test that Alan Turing formulated
01:21:55.720 | with the imitation game with natural language?
01:21:58.600 | Is there something in your mind
01:22:01.140 | that you will be deeply impressed by
01:22:04.260 | if a system was able to do?
01:22:06.460 | - I mean, lots of things.
01:22:08.020 | There's a certain frontier of capabilities today,
01:22:12.140 | and there exist things outside of that frontier,
01:22:16.940 | and I would be impressed by any such thing.
01:22:18.980 | For example, I would be impressed by a deep learning system
01:22:23.980 | which solves a very pedestrian task,
01:22:27.300 | like machine translation or computer vision task
01:22:29.740 | or something which never makes a mistake
01:22:33.460 | a human wouldn't make under any circumstances.
01:22:37.340 | I think that is something
01:22:38.620 | which has not yet been demonstrated,
01:22:40.100 | and I would find it very impressive.
01:22:41.940 | - Yeah, so right now they make mistakes,
01:22:44.940 | they might be more accurate than human beings,
01:22:46.660 | but they still, they make a different set of mistakes.
01:22:49.180 | - So I would guess that a lot of the skepticism
01:22:53.500 | that some people have about deep learning
01:22:55.820 | is when they look at their mistakes and they say,
01:22:57.340 | "Well, those mistakes, they make no sense."
01:23:00.260 | Like if you understood the concept,
01:23:01.660 | you wouldn't make that mistake.
01:23:03.180 | And I think that changing that would inspire me,
01:23:09.100 | that would be, yes, this is progress.
01:23:12.580 | - Yeah, that's a really nice way to put it.
01:23:15.460 | But I also just don't like that human instinct
01:23:18.580 | to criticize a model as not intelligent.
01:23:21.540 | That's the same instinct as we do
01:23:23.180 | when we criticize any group of creatures as the other.
01:23:27.780 | Because it's very possible that GPT-2
01:23:33.460 | is much smarter than human beings at many things.
01:23:36.380 | - That's definitely true.
01:23:37.580 | It has a lot more breadth of knowledge.
01:23:39.340 | - Yes, breadth of knowledge,
01:23:40.980 | and even perhaps depth on certain topics.
01:23:44.960 | - It's kind of hard to judge what depth means,
01:23:48.340 | but there's definitely a sense
01:23:49.940 | in which humans don't make mistakes that these models do.
01:23:54.500 | - Yes, the same is applied to autonomous vehicles.
01:23:57.780 | The same is probably gonna continue being applied
01:23:59.700 | to a lot of artificial intelligence systems.
01:24:01.740 | We find, this is the annoying thing,
01:24:04.140 | this is the process of, in the 21st century,
01:24:06.800 | the process of analyzing the progress of AI
01:24:09.460 | is the search for one case where the system fails
01:24:13.380 | in a big way where humans would not,
01:24:17.020 | and then many people writing articles about it,
01:24:20.660 | and then broadly as the public generally gets convinced
01:24:24.820 | that the system is not intelligent.
01:24:26.580 | And we like pacify ourselves by thinking it's not intelligent
01:24:29.860 | because of this one anecdotal case.
01:24:31.980 | And this seems to continue happening.
01:24:34.560 | - Yeah, I mean, there is truth to that.
01:24:36.900 | Although I'm sure that plenty of people
01:24:38.140 | are also extremely impressed by the system that exists today.
01:24:40.860 | But I think this connects to the earlier point we discussed
01:24:43.140 | that it's just confusing to judge progress in AI.
01:24:47.080 | - Yeah.
01:24:47.920 | - And you have a new robot demonstrating something.
01:24:50.760 | How impressed should you be?
01:24:52.760 | And I think that people will start to be impressed
01:24:56.020 | once AI starts to really move the needle on the GDP.
01:24:59.380 | - So you're one of the people that might be able
01:25:02.080 | to create an AGI system here, not you, but you and OpenAI.
01:25:05.740 | If you do create an AGI system
01:25:09.080 | and you get to spend sort of the evening with it, him, her,
01:25:14.840 | what would you talk about, do you think?
01:25:16.840 | - The very first time?
01:25:19.200 | - First time.
01:25:20.040 | - Well, the first time I would just ask all kinds
01:25:23.240 | of questions and try to get it to make a mistake.
01:25:25.800 | And I would be amazed that it doesn't make mistakes.
01:25:28.200 | And I just keep asking broad questions.
01:25:33.200 | - What kind of questions do you think,
01:25:35.000 | would they be factual or would they be personal,
01:25:39.160 | emotional, psychological?
01:25:41.000 | What do you think?
01:25:42.560 | - All of the above.
01:25:44.000 | (Lex laughing)
01:25:46.160 | - Would you ask for advice?
01:25:47.320 | - Definitely.
01:25:48.160 | (Lex laughing)
01:25:49.360 | I mean, why would I limit myself
01:25:51.640 | talking to a system like this?
01:25:53.200 | - Now, again, let me emphasize the fact
01:25:56.160 | that you truly are one of the people
01:25:57.860 | that might be in the room where this happens.
01:26:00.480 | So let me ask sort of a profound question about,
01:26:06.320 | I just talked to a Stalin historian.
01:26:08.440 | I've been talking to a lot of people who are studying power.
01:26:13.220 | Abraham Lincoln said, "Nearly all men can stand adversity,
01:26:17.760 | "but if you want to test a man's character, give him power."
01:26:21.440 | I would say the power of the 21st century,
01:26:24.720 | maybe the 22nd, but hopefully the 21st,
01:26:28.480 | would be the creation of an AGI system
01:26:30.280 | and the people who have control,
01:26:33.440 | direct possession and control of the AGI system.
01:26:36.300 | So what do you think, after spending that evening
01:26:41.320 | having a discussion with the AGI system,
01:26:44.220 | what do you think you would do?
01:26:45.780 | - Well, the ideal world I'd like to imagine
01:26:49.180 | is one where humanity are like the board members
01:26:56.500 | of a company where the AGI is the CEO.
01:27:00.780 | So the picture
01:27:06.980 | which I would imagine
01:27:08.660 | is you have some kind of different entities,
01:27:12.420 | different countries or cities,
01:27:14.620 | and the people that live there vote
01:27:16.100 | for what the AGI that represents them should do,
01:27:19.100 | and then the AGI that represents them goes and does it.
01:27:21.420 | I think a picture like that, I find very appealing.
01:27:26.380 | And you could have multiple,
01:27:27.220 | you would have an AGI for a city, for a country,
01:27:29.400 | and it would be trying to, in effect,
01:27:34.020 | take the democratic process to the next level.
01:27:36.100 | - And the board can almost fire the CEO.
01:27:38.700 | - Essentially, press the reset button, say.
01:27:40.700 | - Press the reset button.
01:27:41.540 | - Rerandomize the parameters.
01:27:42.980 | - Well, let me sort of, that's actually,
01:27:46.020 | okay, that's a beautiful vision, I think,
01:27:49.100 | as long as it's possible to press the reset button.
01:27:52.440 | Do you think it will always be possible
01:27:55.020 | to press the reset button?
01:27:56.400 | - So I think that it definitely will be possible to build.
01:28:00.420 | So the question,
01:28:03.900 | as I really understand it from you, is,
01:28:06.300 | do humans, do people, have control over the AI systems that they build?
01:28:14.300 | - Yes.
01:28:15.140 | - And my answer is, it's definitely possible
01:28:17.340 | to build AI systems which will want
01:28:19.580 | to be controlled by their humans.
01:28:21.860 | - Wow, so it's not just
01:28:24.060 | that they can't help but be controlled,
01:28:26.220 | but that
01:28:27.060 | one of the objectives of their existence
01:28:33.540 | is to be controlled.
01:28:34.540 | In the same way that human parents
01:28:37.780 | generally want to help their children,
01:28:42.460 | they want their children to succeed,
01:28:44.420 | it's not a burden for them,
01:28:46.040 | they are excited to help the children
01:28:48.740 | and to feed them and to dress them and to take care of them.
01:28:52.700 | And I believe with high conviction
01:28:56.320 | that the same will be possible for an AGI.
01:28:58.940 | It will be possible to program an AGI,
01:29:00.540 | to design it in such a way that it will have
01:29:02.300 | a similar deep drive, that it will be delighted to fulfill
01:29:07.060 | and the drive will be to help humans flourish.
01:29:09.940 | - But let me take a step back to that moment
01:29:13.980 | where you create the AGI system.
01:29:15.500 | I think this is a really crucial moment.
01:29:17.500 | And between that moment and the Democratic board members
01:29:24.000 | with the AGI at the head,
01:29:27.740 | there has to be a relinquishing of power.
01:29:31.860 | So as George Washington, despite all the bad things he did,
01:29:36.500 | one of the big things he did is he relinquished power.
01:29:39.380 | He, first of all, didn't want to be president.
01:29:42.180 | And even when he became president,
01:29:43.740 | he didn't just keep serving,
01:29:45.940 | as most dictators do, indefinitely.
01:29:48.080 | Do you see yourself being able to relinquish control
01:29:54.140 | over an AGI system, given how much power
01:29:57.580 | you can have over the world?
01:29:59.300 | At first financial, just make a lot of money, right?
01:30:02.780 | And then control by having possession of this AGI system.
01:30:07.060 | - I'd find it trivial to do that.
01:30:09.060 | I'd find it trivial to relinquish this kind of power.
01:30:11.500 | I mean, the kind of scenario you are describing
01:30:15.100 | sounds terrifying to me.
01:30:17.400 | That's all.
01:30:19.000 | I would absolutely not want to be in that position.
01:30:22.420 | - Do you think you represent the majority
01:30:25.680 | or the minority of people in the AI community?
01:30:29.420 | - Well, I mean--
01:30:30.740 | - It's an open question and an important one.
01:30:33.740 | Are most people good is another way to ask it.
01:30:36.500 | - So I don't know if most people are good,
01:30:39.340 | but I think that when it really counts,
01:30:44.340 | people can be better than we think.
01:30:46.120 | - That's beautifully put, yeah.
01:30:49.300 | Are there specific mechanisms you can think of
01:30:51.540 | for aligning AGI values to human values?
01:30:54.620 | Do you think about these problems
01:30:56.720 | of continued alignment as we develop the AI systems?
01:31:00.380 | - Yeah, definitely.
01:31:01.420 | In some sense, the kind of question which you are asking is,
01:31:07.380 | so if I were to translate the question to today's terms,
01:31:10.700 | it would be a question about how to get an RL agent
01:31:15.700 | that's optimizing a value function which itself is learned.
01:31:21.220 | And if you look at humans, humans are like that
01:31:23.220 | because the reward function, the value function of humans
01:31:26.300 | is not external, it is internal.
01:31:28.860 | - That's right.
01:31:30.180 | - And there are definite ideas
01:31:33.900 | of how to train a value function.
01:31:36.800 | Basically an objective,
01:31:39.140 | as objective as possible, perception system
01:31:41.580 | that will be trained separately to recognize,
01:31:47.580 | to internalize human judgments on different situations.
01:31:52.020 | And then that component would then be integrated
01:31:54.700 | as the base value function
01:31:56.540 | for some more capable RL system.
01:31:59.060 | You could imagine a process like this.
01:32:00.620 | I'm not saying this is the process,
01:32:02.460 | I'm saying this is an example
01:32:03.820 | of the kind of thing you could do.
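A minimal sketch of the kind of two-stage process described above, assuming PyTorch, a toy state space, and a stand-in for human judgment (this is an illustration, not OpenAI's actual method): a reward model is first trained separately on pairwise human preference comparisons, and its output is then used as the value signal that a more capable RL learner would optimize.

```python
# Sketch only: the state space, network sizes, and the fake "human" preference
# oracle below are illustrative assumptions, not a real alignment pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
STATE_DIM = 4

# Step 1: a reward model trained to internalize "human" judgments,
# expressed as pairwise comparisons between situations (states).
reward_model = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

def fake_human_prefers(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Stand-in for a human annotator: this "human" secretly prefers states
    # whose coordinates sum to a larger value.
    return a.sum().item() > b.sum().item()

for step in range(500):
    s_a, s_b = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
    preferred, other = (s_a, s_b) if fake_human_prefers(s_a, s_b) else (s_b, s_a)
    # Bradley-Terry style loss: the preferred state should score higher.
    margin = reward_model(preferred) - reward_model(other)
    loss = -F.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Step 2: the learned model becomes the value signal for a "more capable"
# learner. Here that learner is just a crude search over random candidate
# states, standing in for a full RL algorithm optimizing against it.
candidates = torch.randn(100, STATE_DIM)
with torch.no_grad():
    scores = reward_model(candidates).squeeze(-1)
print("state the learned judgment model likes best:", candidates[scores.argmax()])
```

In a real system the comparisons would come from actual human annotators and step 2 would be a genuine RL algorithm; the sketch only illustrates the separation described here between learning the value function and optimizing against it.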
01:32:05.740 | - So on that topic of the objective functions
01:32:11.180 | of human existence,
01:32:12.140 | what do you think is the objective function
01:32:15.060 | that's implicit in human existence?
01:32:17.460 | What's the meaning of life?
01:32:18.940 | - Oh.
01:32:20.780 | (sighs)
01:32:22.780 | I think the question is wrong in some way.
01:32:31.500 | I think that the question implies
01:32:33.820 | that there is an objective answer
01:32:35.660 | which is an external answer.
01:32:36.620 | You know, your meaning of life is X.
01:32:38.620 | I think what's going on is that we exist
01:32:40.780 | and that's amazing.
01:32:44.260 | And we should try to make the most of it
01:32:45.700 | and try to maximize our own value and enjoyment
01:32:48.940 | of our very short time while we do exist.
01:32:53.260 | - It's funny 'cause action does require
01:32:55.340 | an objective function.
01:32:56.220 | It's definitely there in some form,
01:32:58.620 | but it's difficult to make it explicit
01:33:01.100 | and maybe impossible to make it explicit,
01:33:02.860 | I guess is what you're getting at.
01:33:03.980 | And that's an interesting fact of an RL environment.
01:33:08.100 | - Well, I was making a slightly different point,
01:33:10.540 | which is that humans want things
01:33:13.340 | and their wants create the drives that cause them to act.
01:33:17.580 | Our wants are our objective functions,
01:33:19.900 | our individual objective functions.
01:33:21.980 | We can later decide that we want to change,
01:33:24.340 | that what we wanted before is no longer good
01:33:26.060 | and we want something else.
01:33:27.300 | - Yeah, but they're so dynamic.
01:33:29.020 | There's gotta be some underlying, sort of Freudian thing,
01:33:32.180 | there's like sexual stuff,
01:33:34.000 | there's people who think it's the fear of death
01:33:37.220 | and there's also the desire for knowledge
01:33:40.340 | and all these kinds of things, procreation,
01:33:43.220 | sort of all the evolutionary arguments.
01:33:46.220 | It seems to be,
01:33:47.140 | there might be some kind of fundamental objective function
01:33:50.420 | from which everything else emerges,
01:33:54.140 | but it seems like it's very difficult to make it explicit.
01:33:56.860 | - I think that probably is an evolutionary objective
01:33:58.620 | function, which is to survive and procreate
01:34:00.260 | and make your children succeed.
01:34:02.580 | That would be my guess,
01:34:04.300 | but it doesn't give an answer to the question
01:34:06.900 | of what's the meaning of life.
01:34:08.220 | I think you can see how humans are part of this big process,
01:34:13.300 | this ancient process, we exist on a small planet
01:34:18.300 | and that's it.
01:34:19.940 | So given that we exist, try to make the most of it
01:34:24.260 | and try to enjoy more and suffer less as much as we can.
01:34:28.120 | - Let me ask two silly questions about life.
01:34:31.300 | One, do you have regrets,
01:34:34.820 | moments that if you went back, you would do differently?
01:34:39.020 | And two, are there moments that you're especially proud of
01:34:42.340 | that made you truly happy?
01:34:43.640 | - So I can answer both questions.
01:34:47.540 | Of course, there's a huge number of choices
01:34:51.300 | and decisions that I've made
01:34:52.460 | that, with the benefit of hindsight, I wouldn't have made,
01:34:55.500 | and I do experience some regret,
01:34:56.980 | but I try to take solace in the knowledge
01:35:00.140 | that at the time I did the best I could.
01:35:02.140 | And in terms of things that I'm proud of,
01:35:04.700 | I'm very fortunate to have done things I'm proud of
01:35:07.660 | and they made me happy for some time,
01:35:10.940 | but I don't think that that is the source of happiness.
01:35:13.700 | - So your academic accomplishments, all the papers,
01:35:17.420 | you're one of the most cited people in the world,
01:35:20.020 | all of the breakthroughs I mentioned
01:35:21.780 | in computer vision and language and so on,
01:35:23.880 | what is the source of happiness and pride for you?
01:35:29.620 | - I mean, all those things are a source of pride for sure.
01:35:31.460 | I'm very grateful for having done all those things
01:35:35.260 | and it was very fun to do them,
01:35:37.540 | but happiness,
01:35:39.300 | well, my current view is that happiness
01:35:42.340 | comes, to a very large degree,
01:35:45.300 | from the way we look at things.
01:35:47.780 | You know, you can have a simple meal
01:35:49.220 | and be quite happy as a result,
01:35:51.380 | or you can talk to someone and be happy as a result as well.
01:35:54.900 | Or conversely, you can have a meal and be disappointed
01:35:58.220 | that the meal wasn't a better meal.
01:36:00.460 | So I think a lot of happiness comes from that,
01:36:02.380 | but I'm not sure, I don't wanna be too confident.
01:36:04.380 | (laughs)
01:36:05.580 | - Being humble in the face of the uncertainty
01:36:07.860 | seems to be also a part of this whole happiness thing.
01:36:12.180 | Well, I don't think there's a better way to end it
01:36:14.100 | than meaning of life and discussions of happiness.
01:36:17.940 | So Ilya, thank you so much.
01:36:19.740 | You've given me a few incredible ideas.
01:36:22.620 | You've given the world many incredible ideas.
01:36:24.900 | I really appreciate it and thanks for talking today.
01:36:27.500 | - Yeah, thanks for stopping by, I really enjoyed it.
01:36:30.540 | - Thanks for listening to this conversation
01:36:32.060 | with Ilya Sutskever,
01:36:33.340 | and thank you to our presenting sponsor, Cash App.
01:36:36.380 | Please consider supporting the podcast
01:36:38.140 | by downloading Cash App and using the code LEXPODCAST.
01:36:42.620 | If you enjoy this podcast, subscribe on YouTube,
01:36:45.420 | review it with five stars on Apple Podcast,
01:36:47.980 | support on Patreon, or simply connect with me on Twitter
01:36:51.460 | at Lex Friedman.
01:36:52.980 | And now let me leave you with some words
01:36:56.340 | from Alan Turing on machine learning.
01:36:58.900 | Instead of trying to produce a program
01:37:01.940 | to simulate the adult mind,
01:37:03.780 | why not rather try to produce one
01:37:06.300 | which simulates the child's?
01:37:08.780 | If this were then subjected
01:37:10.260 | to an appropriate course of education,
01:37:12.540 | one would obtain the adult brain.
01:37:15.220 | Thank you for listening and hope to see you next time.
01:37:19.300 | (upbeat music)